Learn Python Series (#30) - Data Science Part 1 - Pandas

What will I learn?

  • You will learn what kind of toolset the pandas Python package provides, how to install it (if it isn't already part of your current Python distribution), and how to import it into your projects;
  • how to convert data (either passed-in directly or read from another source) to a pandas DataFrame;
  • how to save data from a pandas DataFrame to an external file, such as CSV;
  • how to do some basic pandas data wrangling operations.

Requirements

  • A working modern computer running macOS, Windows or Ubuntu;
  • An installed Python 3(.7) distribution, such as the Anaconda Distribution;
  • The ambition to learn Python programming.

Difficulty

  • Beginner

Additional sample code files

The full - and working! - iPython tutorial sample code file is included for you to download and run for yourself right here:
https://github.com/realScipio/learn-python-series/blob/master/lps-030/learn-python-series-030-data-science-pt1-pandas.ipynb

GitHub Account

https://github.com/realScipio

Welcome to episode #30 of the Learn Python Series already! It's been a while since I published my last (#29) tutorial episode on Python, after which I was busy with a number of projects, including co-developing and running UA and @steem-ua together with @holger80.

Not everybody realises that (although I can code) I wasn't originally academically educated in Computer Science, and that I'm therefore writing the Learn Python Series partially as a documentation project on my own Python research, study and development aspirations. By carefully writing these tutorials in a very structured format, almost (or even exactly) "book-like", I'm "cementing" my own Python knowledge and skills. Over the past months I've gained an interest in learning more about Data Science using Python, and I recently came to the conclusion that my own "research notes" were beginning to pile up, and I felt the need to better document my progress. How better to do that than by resuming the Learn Python Series? So there you go...! ;-)

About Data Science

Data Science is about gaining insights from (huge) amounts of (structured) data by analysing that data, and about solving complex problems analytically and algorithmically; those insights and algorithmic solutions have the potential to generate much value. When you dig into (large / big) data sets, you might be able to discover new insights that were previously hidden. The process of first exploring data, investigating that data to discover its characteristics and patterns, and enriching it with other data often requires a combination of analytical skills and mathematical / business / tech creativity. I suppose data science is positioned at the intersection of those fields, which aligns with my own interests as well; that is why I personally find Data Science fascinating to learn more about.

About the Python package pandas

pandas is a well-known and actively developed Python package which can be summarised as a "data analysis, wrangling and management toolkit"; I suppose you could call it "Excel for Python" in a way. pandas provides powerful and flexible methods and data formats to aid data science tasks using Python, and it's built on top of numpy ("Numerical Python", which we've already briefly talked about in episode #11 of the Learn Python Series).

pandas is positioned (as opposed to NumPy itself) as a more "high level" data analysis / wrangling toolkit, and - like Excel or OpenOffice "Calc" - it works really well with "tabular data". Unlike Excel / Calc, pandas is able to handle really large data sets, with file sizes ranging from hundreds of megabytes to gigabytes, as long as the data fits in your computer's memory; try working with (or even opening!) those in a regular Excel / Calc application running on a regular personal computer!

pandas can therefore be used to -1- clean / munge / wrangle data sets, -2- analyse and (re-) model the data set, and -3- organise the data analysis (to plot, display in tabular form, and/or further process).

In short pandas is really powerful and cool, so let's dive right in!

Installing and importing pandas

If you're working with the Anaconda Python distribution, the pandas package is already installed by default, so you only need to import it in your project. If you haven't already installed pandas, that's as simple as:

pip install pandas

Then, create a new Python file, give it a relevant name (for example pandas_tut_1.py) and then simply begin with:

import pandas as pd
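
To confirm that the installation and import worked, one quick check (a minimal sketch, not part of the original notebook) is to print the installed version. The variable name installed_version is just chosen for illustration:

```python
import pandas as pd

# __version__ is a plain string; its exact value depends on your installation
installed_version = pd.__version__
print(f"pandas version: {installed_version}")
```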

pandas Data Frame Basics

A DataFrame is a pandas data structure to represent tabular data (like a CSV file or an Excel spreadsheet with named columns and rows). Shortly hereafter, we'll be covering how to read-in an existing CSV file and convert it to a DataFrame object, but let's begin with creating a simple example DataFrame from scratch.

the .DataFrame() constructor

If we begin with a regular Python data object such as a dictionary, or a list of lists or tuples, pandas provides the .DataFrame() constructor to convert such data objects into a pandas DataFrame, for example like so:

import pandas as pd

weather_dict = {
    'day': ['1/1/2019', '1/2/2019', '1/3/2019', '1/4/2019', '1/5/2019'],
    'temp_celsius': [3, 2, -1, 0, 4]
}

df1 = pd.DataFrame(data=weather_dict)
df1
        day  temp_celsius
0  1/1/2019             3
1  1/2/2019             2
2  1/3/2019            -1
3  1/4/2019             0
4  1/5/2019             4

Explanation: after importing pandas as pd, and declaring a dictionary object with two keys (day and temp_celsius), each mapped to a list of 5 values, we converted the weather_dict dictionary object into a DataFrame object (called df1). The dictionary keys become the column names, and pandas adds the default 0 to 4 row index by itself.

Nota bene: as always, I'm writing this tutorial itself using Jupyter Notebook, which contains both a Python interpreter, the markdown content, and a number of built-in Jupyter Notebook-specific methods and mechanisms. Running the above code inside a Jupyter Notebook displays the df1 DataFrame simply by ending the cell with the variable name df1. In case you want to print the df1 DataFrame contents from the command line after having coded the above in an external code editor (e.g. Microsoft Visual Studio Code), then you need to do:

print(df1)
        day  temp_celsius
0  1/1/2019             3
1  1/2/2019             2
2  1/3/2019            -1
3  1/4/2019             0
4  1/5/2019             4

(From here on I'm assuming you're following along on a Jupyter Notebook as well, hence I won't be explicitly printing the DataFrame objects every time in the remainder of this and following tutorial(s).)

Nota bene: in this particular (dictionary) example, I've been using a "top down" (column-by-column) approach, in which each dictionary key is converted into one DataFrame column. However, a more "logical" approach would be to insert the data "row-by-row", as the temperature value of "3 degrees Celsius" belongs to the associated date value "1/1/2019".

Another way to construct the same DataFrame is via a "list of lists", where each inner list is one row and the column names are passed as an additional constructor argument, like so:

import pandas as pd

weather_list = [
    ['1/1/2019', 3],
    ['1/2/2019', 2],
    ['1/3/2019', -1],
    ['1/4/2019', 0],
    ['1/5/2019', 4]
]

df2 = pd.DataFrame(data=weather_list, columns=['day', 'temp_celsius'])
df2
        day  temp_celsius
0  1/1/2019             3
1  1/2/2019             2
2  1/3/2019            -1
3  1/4/2019             0
4  1/5/2019             4
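
A third row-oriented option, sketched here as an illustration rather than taken from the original notebook, is a list of dictionaries. Each dictionary is one row, and its keys carry the column names, so no separate columns= argument is needed. The names weather_records and df_records are made up for this example:

```python
import pandas as pd

# Each dictionary is one row; the keys become the column names
weather_records = [
    {'day': '1/1/2019', 'temp_celsius': 3},
    {'day': '1/2/2019', 'temp_celsius': 2},
    {'day': '1/3/2019', 'temp_celsius': -1},
    {'day': '1/4/2019', 'temp_celsius': 0},
    {'day': '1/5/2019', 'temp_celsius': 4},
]

df_records = pd.DataFrame(data=weather_records)
print(df_records)
```

This keeps every date right next to its temperature in the source code, which can be easier to read and extend than two parallel lists.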

the read_csv() method

As we've just learned, the DataFrame() constructor needs to be passed a data= argument, which is the Python object holding the (example) data. But of course, when dealing with large data sets you're not going to declare all those values manually. Instead, you might have saved them on disk already, and you'd like to read the data from disk and convert it to a DataFrame object.

For exactly that purpose, pandas has the built-in function read_csv() (as well as a number of similar functions for other file types). Suppose the CSV file weather.csv exists in your current working directory, then you can construct the exact same DataFrame object like so:

import pandas as pd
df3 = pd.read_csv('weather.csv')
df3
        day  temp_celsius
0  1/1/2019             3
1  1/2/2019             2
2  1/3/2019            -1
3  1/4/2019             0
4  1/5/2019             4
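
If you don't have a weather.csv file at hand yet, the following sketch reads the same data from an in-memory string instead. The csv_text contents are an assumption about what the file looks like, and the variable names are made up for this example. It also shows the optional parse_dates parameter, which asks pandas to convert a column to real datetime values instead of plain text:

```python
import io
import pandas as pd

# Stand-in for the weather.csv file on disk (assumed contents)
csv_text = """day,temp_celsius
1/1/2019,3
1/2/2019,2
1/3/2019,-1
1/4/2019,0
1/5/2019,4
"""

# read_csv() accepts a file path or any file-like object
df_plain = pd.read_csv(io.StringIO(csv_text))

# parse_dates converts the listed columns to datetime values (month first by default)
df_dates = pd.read_csv(io.StringIO(csv_text), parse_dates=['day'])

print(df_plain.dtypes)
print(df_dates.dtypes)
```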

the to_csv() method

pandas also allows you to go the opposite route: to export DataFrame objects and save them to disk as .csv files. The to_csv() method is used for that.

Nota bene: in order to save the weather.csv example file (that we just read via read_csv()) from the df2 DataFrame object we constructed before, it's convenient to not save the 0,1,2,3,4 index values to the CSV file (those index values are added by pandas, and don't belong to the original data set). By default (at least in pandas version 0.24.0, the current version at the time of writing) those index values are exported to CSV, and so are the column names / headers (as the first row of the CSV file). Since we do want the column names included in the CSV file, but not the pandas default index values, we set the index= parameter to False (and leave the header= parameter at its default: True). As the first argument we pass the file name (optionally including a file path, in case you want to save it in a directory other than your current working directory):

df2.to_csv('weather.csv', index=False)

After running the above to_csv() code line, your file 'weather.csv' should be saved as a valid CSV file, located in your current working directory.
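
To see what the index= parameter changes without writing any file, here is a small sketch (an illustration, not from the original notebook). It relies on the fact that to_csv() returns the CSV text as a string when no file name is passed; the variable names are made up for this example:

```python
import pandas as pd

df_weather = pd.DataFrame(
    data=[['1/1/2019', 3], ['1/2/2019', 2]],
    columns=['day', 'temp_celsius']
)

# Without a file name, to_csv() returns the CSV text instead of writing a file
csv_with_index = df_weather.to_csv()
csv_without_index = df_weather.to_csv(index=False)

print(csv_with_index)
print(csv_without_index)
```

In the first output every line starts with an extra index value (and the header line starts with an empty field); in the second output only the original data is left.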

the .head() and .tail() methods

When working with large data sets, it's often convenient to quickly inspect the data you're working with, without having to "eyeball" large amounts of data. To only display the top 5 lines of your DataFrame (together with the column names and index numbers) you can use the .head() method, and to only display the bottom 5 lines of your DataFrame you can use .tail().

Nota bene: Please note that our (very simple) example weather.csv data set only contains 5 rows in total for simplicity / explanation purposes, ergo, in this specific example case you wouldn't notice a difference when running either ...

  • df3, or
  • df3.head(), or
  • df3.tail()

However, you can also pass an integer N to both .head() and .tail(), to only show N lines either at the top or bottom of your DataFrame, for example:

df3.head(2)
        day  temp_celsius
0  1/1/2019             3
1  1/2/2019             2

df3.tail(2)
        day  temp_celsius
3  1/4/2019             0
4  1/5/2019             4

In these specific examples, by passing the integer value of 2 to both .head(2) and .tail(2) we only show the top and bottom 2 lines of the DataFrame, respectively.
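
Because the weather example is too small to show the default behaviour, here is a sketch with a made-up 100-row DataFrame (df_big and its column n are hypothetical, purely for illustration):

```python
import pandas as pd

# A hypothetical 100-row DataFrame, just to have more than 5 rows
df_big = pd.DataFrame({'n': range(100)})

first_rows = df_big.head()   # no argument: the first 5 rows
last_rows = df_big.tail(3)   # the last 3 rows

print(first_rows)
print(last_rows)
```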

Index slicing

If you're interested in using only a specific set of DataFrame rows, you can use index slices, just like we've already learned to do on regular Python lists.

For example, if we only want to work with rows 1 and 2:

df3[1:3]
        day  temp_celsius
1  1/2/2019             2
2  1/3/2019            -1

Nota bene: the stop parameter is non-inclusive, ergo df3[1:3] means "begin with row 1 and stop at row number 3", hence, it shows rows 1 and 2.

In case you want to work with the entire DataFrame beginning with row number 2, then use:

df3[2:]
        day  temp_celsius
2  1/3/2019            -1
3  1/4/2019             0
4  1/5/2019             4

And in case you want to work with the DataFrame from the beginning up to (but not including) row number 3, then use:

df3[:3]
        day  temp_celsius
0  1/1/2019             3
1  1/2/2019             2
2  1/3/2019            -1
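
As a side note, pandas also offers the .iloc indexer, which makes it explicit that rows are selected by position. The sketch below (an illustration, not from the original notebook, with made-up variable names) shows that it gives the same rows as the plain slice:

```python
import pandas as pd

df_weather = pd.DataFrame({
    'day': ['1/1/2019', '1/2/2019', '1/3/2019', '1/4/2019', '1/5/2019'],
    'temp_celsius': [3, 2, -1, 0, 4]
})

# Plain slice, as used above
rows_by_brackets = df_weather[1:3]

# Same selection via .iloc: positions 1 and 2, stop position 3 excluded
rows_by_iloc = df_weather.iloc[1:3]

print(rows_by_iloc)
```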

the columns attribute / property

If you want to assign, return, or print all column names your DataFrame holds, use the columns attribute / property, like so:

df3.columns
Index(['day', 'temp_celsius'], dtype='object')
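
The columns attribute can be read as well as assigned to. A minimal sketch, in which the new column names date and temperature_c are hypothetical and only serve as an illustration:

```python
import pandas as pd

df_weather = pd.DataFrame({
    'day': ['1/1/2019', '1/2/2019'],
    'temp_celsius': [3, 2]
})

# Convert the Index object to a plain Python list
column_names = list(df_weather.columns)

# Assigning a list of the same length renames all columns at once
df_weather.columns = ['date', 'temperature_c']

print(column_names)
print(list(df_weather.columns))
```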

the shape attribute / property

Accessing the shape property returns a tuple with the DataFrame's "size" or "shape" in the form of (num_rows, num_columns), like so:

df3.shape
(5, 2)

Nota bene: the shape property gives you both the number of rows and the number of columns at once. If you only need the total number of rows, len(df3) works as well (any speed difference between the two is negligible in practice):

num_rows_shape, num_cols_shape = df3.shape
print(f"Number of rows via .shape: {num_rows_shape}")

num_rows_len = len(df3)
print(f"Number of rows via len(): {num_rows_len}")
Number of rows via .shape: 5
Number of rows via len(): 5

Two syntaxes on selecting columns

pandas allows for two syntaxes to select individual columns: the .column_name dot-property notation, and the ['column_name'] square brackets notation. E.g.:

df3.temp_celsius
0    3
1    2
2   -1
3    0
4    4
Name: temp_celsius, dtype: int64

and also:

df3['temp_celsius']
0    3
1    2
2   -1
3    0
4    4
Name: temp_celsius, dtype: int64

Nota bene: I strongly recommend only using the square bracket notation, as your DataFrame column names could be identical to pandas built-in attribute / property names. Suppose for example you'd have a column named columns (for some reason or another), then df3.columns returns the columns property values, not the content of the df3['columns'] DataFrame column!

Also, in case your column names contain one or more spaces or other characters that aren't valid in a Python identifier, the dot-property syntax doesn't work.
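
Both pitfalls can be reproduced with a tiny sketch. The DataFrame df_clash and its column names are hypothetical, chosen on purpose to cause trouble for the dot notation:

```python
import pandas as pd

# Hypothetical column names: one clashes with a DataFrame attribute, one contains a space
df_clash = pd.DataFrame({
    'shape': ['round', 'square'],
    'temp celsius': [3, 2]
})

print(df_clash.shape)            # the DataFrame attribute: a (rows, columns) tuple
print(df_clash['shape'])         # the column named 'shape'
print(df_clash['temp celsius'])  # a name with a space only works with brackets
```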

Selecting multiple DataFrame columns

In case you want to select multiple DataFrame columns, but not all of them, then pass a list of column names instead (note the double square brackets). Again, our very simple example file weather.csv only has 2 columns containing data, so the example I'll give right now returns the same result (in this specific case) as calling the entire DataFrame object. If the DataFrame had more columns, this technique would of course only select the columns whose names are passed in the list, like so:

df3[['day', 'temp_celsius']]
        day  temp_celsius
0  1/1/2019             3
1  1/2/2019             2
2  1/3/2019            -1
3  1/4/2019             0
4  1/5/2019             4
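
To make the effect visible, here is a sketch with a third column. The windspeed_kmh column and its values are made up for illustration and are not part of weather.csv:

```python
import pandas as pd

df_wide = pd.DataFrame({
    'day': ['1/1/2019', '1/2/2019', '1/3/2019'],
    'temp_celsius': [3, 2, -1],
    'windspeed_kmh': [10, 25, 8]  # hypothetical values
})

# A list of column names inside the brackets selects just those columns
subset = df_wide[['day', 'windspeed_kmh']]
print(subset)
```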

Selecting specific rows and specific columns

In case you want to select both (one or more) columns and a slice of DataFrame rows, then combine the techniques just explained, beginning with the columns, followed by the row slice, for example:

df3['temp_celsius'][1:3]
1    2
2   -1
Name: temp_celsius, dtype: int64
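
The same selection can also be written in a single step with the .loc (label-based) or .iloc (position-based) indexers. This is a sketch for comparison, not the approach used in the original notebook, and the variable names are made up:

```python
import pandas as pd

df_weather = pd.DataFrame({
    'day': ['1/1/2019', '1/2/2019', '1/3/2019', '1/4/2019', '1/5/2019'],
    'temp_celsius': [3, 2, -1, 0, 4]
})

# Column first, then row slice (as shown above)
chained = df_weather['temp_celsius'][1:3]

# .loc selects by label and includes the end label, so 1:2 covers rows 1 and 2
by_label = df_weather.loc[1:2, 'temp_celsius']

# .iloc selects by position and excludes the stop position, like list slicing
by_position = df_weather.iloc[1:3, 1]

print(by_label)
```

Note the difference in the slice bounds: with the default 0,1,2,... index, .loc[1:2] and .iloc[1:3] select the same two rows.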

Appending a new column using vectorised column operations

pandas is built on top of numpy, which uses vectorised array operations very efficiently, and pandas benefits from that as well. Vectorisation means executing an operation on an entire column / array of data at once, instead of looping over the individual values yourself.

Let's say, just to easily explain such a column-wise vectorised operation, we'd want to add another column to the DataFrame - called "temp_plus_one" - in which we want to store all the temperature values incremented by 1 degree Celsius. To do that, we simply assign to the new column name, and on the right-hand side reference (in this case) the 'temp_celsius' column and add the value of 1 to it, like so:

df3['temp_plus_one'] = df3['temp_celsius'] + 1
df3
        day  temp_celsius  temp_plus_one
0  1/1/2019             3              4
1  1/2/2019             2              3
2  1/3/2019            -1              0
3  1/4/2019             0              1
4  1/5/2019             4              5

The DataFrame column "temp_plus_one" is now added to the original df3 DataFrame.
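
Vectorised operations are not limited to adding a constant. As a slightly more realistic sketch, the following converts the whole column to Fahrenheit in one expression; the temp_fahrenheit column name is made up for this illustration, and the explicit loop is only included for comparison:

```python
import pandas as pd

df_weather = pd.DataFrame({
    'day': ['1/1/2019', '1/2/2019', '1/3/2019', '1/4/2019', '1/5/2019'],
    'temp_celsius': [3, 2, -1, 0, 4]
})

# Vectorised: one expression applied to the whole column at once
df_weather['temp_fahrenheit'] = df_weather['temp_celsius'] * 9 / 5 + 32

# Equivalent explicit loop over the individual values, shown only for comparison
looped = [c * 9 / 5 + 32 for c in df_weather['temp_celsius']]

print(df_weather)
```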

Removing an existing column from a DataFrame, using .drop()

If we'd like to remove the "temp_plus_one" column from the DataFrame again, we can use the .drop() method. By default the .drop() method has its axis argument set to axis=0, which implies removing one or more rows from the data set. If we want to drop a column, we pass either axis=1 or axis='columns' as an argument. Because .drop() returns a new DataFrame (rather than changing the existing one), we assign the result back to df3, like so:

df3 = df3.drop('temp_plus_one', axis='columns')
df3
        day  temp_celsius
0  1/1/2019             3
1  1/2/2019             2
2  1/3/2019            -1
3  1/4/2019             0
4  1/5/2019             4
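
The sketch below (an illustration with made-up variable names) shows that .drop() leaves the original DataFrame untouched unless you assign the result. It uses the columns= keyword, which is an alternative spelling of passing the name together with axis='columns':

```python
import pandas as pd

df_weather = pd.DataFrame({
    'day': ['1/1/2019', '1/2/2019'],
    'temp_celsius': [3, 2]
})
df_weather['temp_plus_one'] = df_weather['temp_celsius'] + 1

# drop() returns a new DataFrame; df_weather itself is left unchanged
df_dropped = df_weather.drop(columns=['temp_plus_one'])

print(list(df_weather.columns))
print(list(df_dropped.columns))
```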

What did we learn, hopefully?

Data Science is (to me at least!) an extremely interesting topic, and the pandas Python package provides many powerful, relatively straightforward, and efficient tools for data analysis and manipulation. At pandas' core is the DataFrame object / data format, which you can create from, and convert back to, regular Python data types (e.g. dictionaries, lists, tuples) and multiple file types (such as CSV, JSON and Excel).

In this episode we covered some pandas basics (of course, with every newly covered topic we need to start at the beginning!), and in the following episodes we'll gradually expand on the possibilities pandas has to offer and move on to intermediate and advanced techniques.

Thank you for your time!
