pandas DataFrames¶

Watch a walk-through of this lesson on YouTube
Download all the Jupyter notebooks and other files you need to complete the lessons in this chapter (Chapter 3)
Turn off the GitHub Copilot AI assistant so you can focus on learning Python using your HI (human intelligence). Click the Deactivate Copilot button in the bottom right of VS Code, if it is currently activated.

Questions:¶

How do I read in data from a file?
How can I work with data in tabular format (tables)?
How can I do basic descriptive statistics on tabular data?

Learning Objectives:¶

Select individual values from a Pandas dataframe
Select entire rows or entire columns from a dataframe
Select a subset of both rows and columns from a dataframe in a single operation
Select a subset of a dataframe by a single Boolean criterion
Obtain descriptive statistics for subsets of data within a table
Use the split-apply-combine paradigm to work with data

What is pandas?¶

Bad news first: there are no cute, black-and-white bears here. pandas (whose official name starts with a lower-case “p”) is a Python library for working with data in a tabular format, such as is found in file formats like CSV, Microsoft Excel, and Google Sheets. Unlike Excel or Sheets, pandas is not a point-and click graphical interface for working with these files — everything is done through Python code. But compared to other formats for working with data that we have seen in this workshop, such as lists and dictionaries, pandas may seem more familiar, and it definitely lends itself more naturally to large data sets. Indeed, pandas’ mission statement is, “...to be the fundamental high-level building block for doing practical, real world data analysis in Python”.

The primary units of pandas data storage you will work with are DataFrames (essentially, tables of data organized as rows and columns). DataFrames are actually collections of pandas Series objects, which can be thought of as individual rows or columns (or vectors, or 1D arrays).

Among the things that make pandas so attractive are the powerful interface to access individual records of the table, proper handling of missing values, and relational-databases operations between DataFrames. As well, pandas functions and methods are written to work intuitively and efficiently with data organized in tables. Most operations are vectorized, which means that they will automatically apply to all values in a DataFrame or Series without the need to write for loops to execute the same operation on a set of cells.

pandas is built on top of the NumPy library. Although we haven’t discussed NumPy in this workshop, it is a powerful and widely used Python library for working with numerical data. However, it’s worth noting for your future reference that most of the methods defined for NumPy Arrays also apply to Pandas Series/DataFrames.

About Python Libraries¶

pandas is an example of a Python library. A library is a collection of files (called modules) that contains functions for use by other programs. Libraries provide ways of extending Python’s functionality in different ways. They may also contain data values (e.g., numerical constants), entire sample data sets, and other things. A library’s contents are supposed to be related, although there’s no actual way to enforce that.

The Python standard library is an extensive suite of modules that comes with Python itself. Everything we’ve done so far in this workshop has been part of the standard library. Many additional libraries are available; CoCalc has a large number of extra libraries already installed.

To use a library in a particular Jupyter notebook or other Python program, we must import it using the import statement, like this:

import pandas

Once a library is imported, we can use functions and methods from it. But, for functions we have to tell Python that the function can be found in a particular library we imported. For example, pandas has a function to import data from CSV (comma-separated value) files, called read_csv. To run this command, we would need to type:

pandas.read_csv()

Since some package names are long, and adding the name to every function can result in a lot of typing, Python also allows us to assign an alias — a shorter name — to a library when we import it. For example, the convention for pandas is to give it the alias pd like this:

import pandas as pd

Then to read a CSV file we could use:

pd.read_csv()

In the cell below, import pandas with the alias pd:

import pandas as pd

Importing data with pandas¶

As noted, we can read a CSV file and use it to create a pandas DataFrame, with the funciton pd.read_csv(). CSV is a text format used for storing tabular data, in which each line of the file corresponds to a row in the table, and columns are separated with commas (“CSV” stands for “comma-separated values”). Often the first row of a CSV file will be the header, containing labels for each column.

The Gapminder data is in CSV format, so let’s load in one of the Gapminder datasets with the command below. Note that when we read in a DataFrame, we need to assign it to a variable name so that we can reference it later. A convention when working with pandas is to call the DataFrame df. This works fine if you only have one DataFrame to work with, although if you are working with multiple DataFrames it is a good idea to give them more meaningful names.

The Gapminder data are stored in a subfolder called data, so as the argument to pd.read_csv() below we give the folder name folled by a slash, then the file name:

df = pd.read_csv('data/gapminder_gdp_europe.csv')

We can view the contents of the DataFrame df by simply typing its name and running the cell. Note that, unlike most of the examples we’ve used in previous lessons, we don’t use the print() function. Although it works, the result is not nicely formatted the way the output is if we just use the name of the data frame.

That is, run this command: df — not print(df) — in the cell below.

df

You’ll see that the rows are numbered in boldface, starting with 0 as is the norm in Python. This boldfaced, leftmost column is called the index of the DataFrame, and provides one way of accessing data by rows. Across the top, you’ll see that the column labels are also in boldface. pandas is pretty smart about automatically detecting when the first row of a CSV file contains header information (column names).

Heads or Tails?¶

We might want to “peek” at the DataFrame without printing out the entire thing, especially if it’s big. We can see the first 5 rows of a DataFrame with the .head() method:

df.head()

...or the last 5 rows with .tail():

df.tail()

We can also see a random sample of rows from the DataFrame with .sample(), giving it a numerical argument to indicate the number of rows we want to see:

df.sample(10)

Note that the .head() and .tail() methods also optionally take a numerical argument, if you want to view a different number of rows from the default of 5.

Accessing values in a DataFrame¶

One thing we often want to do is access a single cell in a DataFrame, or a range of cells. Each cell is uniquely defined by a combination of its row and column locations.

Select a column using `[]`¶

If we want to select an entire column of a pandas DataFrame, we just give the name of the DataFrame followed by the column name in square brackets:

df['gdpPercap_1992']

0      2497.437901
1     27042.018680
2     25575.570690
3      2546.781445
4      6302.623438
5      8447.794873
6     14297.021220
7     26406.739850
8     20647.164990
9     24703.796150
10    26505.303170
11    17541.496340
12    10535.628550
13    25144.392010
14    17558.815550
15    22013.644860
16     7003.339037
17    26790.949610
18    33965.661150
19     7738.881247
20    16207.266630
21     6598.409903
22     9325.068238
23     9498.467723
24    14214.716810
25    18603.064520
26    23880.016830
27    31871.530300
28     5678.348271
29    22705.092540
Name: gdpPercap_1992, dtype: float64

Note that if we ask for a single column the result is a pandas Series, but if we ask for two or more columns, the result is a DataFrame. Pay close attention to the syntax below — if we’re asking for more than one column, we need to provide a list of columns inside the square brackets (so there are two sets of nested sqaure brackets in the code below):

df[['gdpPercap_1982', 'gdpPercap_1992']]

Numerical indexing using `.iloc[]`¶

Often we don’t want to access an entire column, however, but just specific rows within a column (or range of columns). pandas provides two ways of accessing cell locations. One is using the numerical positions in the DataFrame, using the convention of [row, column] — with [0, 0] being the top left cell in the DataFrame. So for a pandas DataFrame with 3 rows and 3 columns, the indices of each cell are as shown:

	col 0	col 1	col 2	col 3
0	[0, 0]	[0, 1]	[0, 2]	[0, 3]
1	[1, 0]	[1, 1]	[1, 2]	[1, 3]
2	[2, 0]	[2, 1]	[2, 2]	[2, 3]
3	[3, 0]	[3, 1]	[3, 2]	[3, 3]

Numerical indexing of DataFrames is done with the .iloc[] method. For example, to access the GDP value for Austria in 1952 — which is located in the second row, third column of our current DataFrame, we would use:

df.iloc[1, 2]

8842.59803

Label-based indexing using `.loc[]`¶

The other way to access a location in a DataFrame is by its index and column labels, using the .loc[] method. As noted earlier, in the DataFrame we imported, the indexes are currently numbers, which were created automatically when we imported the data. The .loc[] method doesn’t work with numerical indexes (that’s what iloc is for — and you can’t mix, say, a numerical row index with a column label), but in the data set we imported, the first column of this CSV file is actually meant to be its index: while all other columns are data values (GDP, in type float), the first column identifies the country with which each row of data is associated.

pandas has a method for setting an index column, .set_index(), where the argument (in the parentheses) would be the name of the column to use as the index. So here we want to run:

df = df.set_index('country')

Note that we need to assign the result of this operation back to df (using df = ), otherwise the change will not actually modify df.

In the cell below, use the .set_index() method to set the index of df to country, and then view the DataFrame again to see how it has changed.

df = df.set_index('country')

Alternatively, if we knew which column we wanted to use as the index before loading in the data file, we could have included the argument index_col= in the pd.read_csv() command:

df = pd.read_csv('data/gapminder_gdp_europe.csv', index_col='country')

Now that we have defined the index, we can access the 1952 GDP value for Austria by its index and column names:

df.loc['Austria', 'gdpPercap_1952']

6137.076492

Use `:` on its own to mean all columns or all rows.¶

Using Python’s familiar slicing notatio (which we’ve previously used for strings and lists), we can use : with .iloc[] or .loc[], to specify a range in a DataFrame.

For example, to see the GDP of Albania for every year (column) in the DataFrame, we would use:

df.loc["Albania", :]

gdpPercap_1952    1601.056136
gdpPercap_1957    1942.284244
gdpPercap_1962    2312.888958
gdpPercap_1967    2760.196931
gdpPercap_1972    3313.422188
gdpPercap_1977    3533.003910
gdpPercap_1982    3630.880722
gdpPercap_1987    3738.932735
gdpPercap_1992    2497.437901
gdpPercap_1997    3193.054604
gdpPercap_2002    4604.211737
gdpPercap_2007    5937.029526
Name: Albania, dtype: float64

Likewise, we could see the GDP for every country (row) in the year 1957 with:

df.loc[:, 'gdpPercap_1957']

country
Albania                    1942.284244
Austria                    8842.598030
Belgium                    9714.960623
Bosnia and Herzegovina     1353.989176
Bulgaria                   3008.670727
Croatia                    4338.231617
Czech Republic             8256.343918
Denmark                   11099.659350
Finland                    7545.415386
France                     8662.834898
Germany                   10187.826650
Greece                     4916.299889
Hungary                    6040.180011
Iceland                    9244.001412
Ireland                    5599.077872
Italy                      6248.656232
Montenegro                 3682.259903
Netherlands               11276.193440
Norway                    11653.973040
Poland                     4734.253019
Portugal                   3774.571743
Romania                    3943.370225
Serbia                     4981.090891
Slovak Republic            6093.262980
Slovenia                   5862.276629
Spain                      4564.802410
Sweden                     9911.878226
Switzerland               17909.489730
Turkey                     2218.754257
United Kingdom            11283.177950
Name: gdpPercap_1957, dtype: float64

You can also just specify the row index; if you don’t specify anything for the columns, pandas assumes you want all columns:

df.loc["Albania"]

gdpPercap_1952    1601.056136
gdpPercap_1957    1942.284244
gdpPercap_1962    2312.888958
gdpPercap_1967    2760.196931
gdpPercap_1972    3313.422188
gdpPercap_1977    3533.003910
gdpPercap_1982    3630.880722
gdpPercap_1987    3738.932735
gdpPercap_1992    2497.437901
gdpPercap_1997    3193.054604
gdpPercap_2002    4604.211737
gdpPercap_2007    5937.029526
Name: Albania, dtype: float64

However, since the syntax for .iloc[] and .loc[] is [rows, columns], you cannot omit a row index; you need to use : if you want all rows.

Slicing works on DataFrames¶

Slicing using numerical indices works similarly for DataFrames as we previously saw for strings and lists, for example, the following code will print the third through fifth rows of the DataFrame, and the fifth through eighth columns (remember, Python indexing starts at 0, and slicing does not include the “end” index):

df.iloc[2:5, 4:8]

The code below will print from the sixth to second-last row of the DataFrame, and from the ninth to the last column:

df.iloc[5:-1, 8:]

Note however, that when using label-based indexing with .loc[], pandas’ slicing behavior is a bit different. Specifically, the output includes the last item in the range, whereas numerical indexing with .iloc[] does not.

So, considering that the first three rows of the DataFrame correspond to the countries Albania, Austria, and Belgium, and that columns 6 and 7 are for the years 1982 and 1987 respectively, compare the output of:

df.iloc[0:2, 6:7]

with:

df.loc['Albania':'Belgium', 'gdpPercap_1982':'gdpPercap_1988']

The “inclusive” label-based indexing with .loc[] is fairly intuitive, but it’s important to remember that it works differently from numerical indexing.

Use lists to select non-contiguous sections of a DataFrame¶

While slicing can be very useful, sometimes we might want to extract values that aren’t next to each other in a DataFrame. For example, what if we only want values from two specific years (1992 and 2002), for Scandinavian countries (Denmark, Finland, Norway, and Sweden)? Nether these years nor countries are in adjacent columns/rows in the DataFrame. With .loc[], we can use lists, rather than ranges separated by :, as selectors:

df.loc[['Denmark', 'Finland', 'Norway', 'Sweden'], ['gdpPercap_1992', 'gdpPercap_2002']]

We can equivalently write the command over several lines to make it a bit easier to read:

df.loc[['Denmark', 'Finland', 'Norway', 'Sweden'], 
       ['gdpPercap_1992', 'gdpPercap_2002']
      ]

We could also define those lists as variables, and pass the variables to .loc[]. This might be useful if you were going to use the lists more than once, as well as for clarity:

scand_countries = ['Denmark', 'Finland', 'Iceland', 'Norway', 'Sweden']
years = ['gdpPercap_1992', 'gdpPercap_2002']
df.loc[scand_countries, years]

We can take this a step further, and assign the output of a .loc[] selection like this to a new variable name. This makes a copy of the selected data, stored in a new DataFrame (or Series, if we only select one row or column) with its own name. This allows us to later reference and use that selection.

scand_data = df.loc[scand_countries, years]

It’s easy to do simple math and statistics in DataFrames¶

We prevoiusly learned about methods to get simple statistical values out of a Python list, like .max(), and .min(). pandas includes these and many more methods as well. For example, we can view the mean GDP for Italy across all years (columns) with:

df.loc['Italy'].mean()

Or the largest GDP in 1977 with:

df.loc[:, 'gdpPercap_1977'].max()

Another useful method is .describe(), which prints out a range of descriptive statistics for the range of data you specify. Without any slicing it provides information for each column:

df.describe()

Mini-Exercise¶

In the cell below, use the scand_countries and years variables to view descriptive statistics for all Scandinavian countries in each year.

Evaluate cells based on conditions¶

pandas allows an easy way to identify values in a DataFrame that meet a certain condition, using operators like <, >, and ==. For example, let’s see which countries in a list had a GDP over 10,000 in 1962 and 1992. The result is reported in Booleans (True/False) for each cell.

countries = ['France', 'Germany', 'Italy', 'Spain', 'United Kingdom']
df.loc[countries, ['gdpPercap_1962', 'gdpPercap_1992']] > 10000

Select values or NaN using a Boolean mask.¶

A DataFrame full of Booleans is sometimes called a mask because of how it can be used. A mask removes values that are not True, and replaces them with NaN — a special Python value representing “not a number”. This can be useful because pandas ignores NaN values when doing computations.

We create a mask by assigning the output of a conditional statement to a variable name:

mask = scand_data > 30000

Then we can apply the mask to the DataFrame to get only the values that meet the criterion:

scand_data[mask]

As an example of how this might be used, the steps above would now allow us to find the lowest GDP value in each year, that was above 30,000:

scand_data[mask].min()

Split-Apply-Combine¶

A common task in data science is to split data into meaningful subgroups, apply an operation to each subgroup (e.g., compute the mean), and then combine the results into a single output, such as a table or a new DataFrame. This paradigm was famously described by Hadley Wickham in a 2011 paper.

pandas provides methods and grouping operations that are very efficient (vectorized) for split-apply-combine operations.

As an example, let’s say that we wanted to compare the average GDP for different regions of Europe, divided as northern, southern, eastern, and western. To do this, we first have to create lists defining the countries belonging to each of these regions:

northern = ['Denmark', 'Finland', 'Iceland', 'Norway', 'Sweden']
southern = ['Greece', 'Italy', 'Portugal', 'Spain']
eastern = ['Albania', 'Bosnia and Herzegovina', 'Bulgaria', 'Croatia', 
            'Czech Republic', 'Hungary', 'Montenegro', 'Poland', 'Romania', 
            'Serbia', 'Slovak Republic', 'Slovenia']
western = ['Austria',  'Belgium', 'France', 'Germany', 'Ireland', 
            'Netherlands', 'Switzerland', 'United Kingdom']

Next we can make a new column simply by using .loc[] with the rows specified by one of the lists we just defined, a column name that doesn’t already exist (in this case, we’ll call it “region”), then assigning a region label to that combination of rows and column. We need to do this separately for each region. Note that when we first create the new column (“region”), pandas fills it with NaN values in any rows that were not defined by the assignment. For example, in the code below, the first line will create the column “region”, and fill it with “northern” for any row in the northern list, and NaN to every other row.

df.loc[northern, 'region'] = 'northern'
df.loc[southern, 'region'] = 'southern'
df.loc[eastern, 'region'] = 'eastern'
df.loc[western, 'region'] = 'western'

df.loc[northern, 'region'] = 'northern'
df.loc[southern, 'region'] = 'southern'
df.loc[eastern, 'region'] = 'eastern'
df.loc[western, 'region'] = 'western'

Let’s look at the last two columns of df to see the region column we created:

df.iloc[:, -2:]

Split¶

Now we can use this “region” column to split the data into groups, using a pandas method called .groupby()

grouped_countries = df.groupby('region')

Note that this step doesn’t create a new DataFrame, it creates a special kind of pandas object that points to a grouping in the original DataFrame:

grouped_countries = df.groupby('region')

Apply¶

Now that we have split the data, we can apply a function separately to each group. Here we’ll compute the mean GDP for each region, for each year:

mean_gdp_by_region = grouped_countries.mean()

Combine¶

The combine step actually occurred with the apply step above — the result is automatically combined into a table of mean values organized by region. But since our apply step (.mean()) saved the result to a variable, we can view the resulting table as the output of the combine step:

mean_gdp_by_region

Chaining¶

In Python, chaining refers to combining a number of operations in one command, using a sequence of methods. We can perform the above split-apply-combine procedure in a single step as follows. Note that because we don’t assign the output to a variable name, it is displayed as output but not saved.

df.groupby('region').mean()

Exercises¶

Selecting Individual Values¶

Write an expression to find the Per Capita GDP of Serbia in 2007.

Click the button to reveal!

print(df.loc[‘Serbia’, ‘gdpPercap_2007’])

Extent of Slicing¶

Do the two statements below produce the same output? (Hint: you might want to use the .head() method to remind yourself of the structure of the DataFrame)
Based on this, what rule governs what is included (or not) in numerical slices and named slices in Pandas?

print(df.iloc[0:2, 0:2])
print(df.loc['Albania':'Belgium', 'gdpPercap_1952':'gdpPercap_1962'])

Click the button to reveal the solution

Solution¶

No, they do not produce the same output! The output of the first statement is:

        gdpPercap_1952  gdpPercap_1957
country                                
Albania     1601.056136     1942.284244
Austria     6137.076492     8842.598030

The second statement gives:

        gdpPercap_1952  gdpPercap_1957  gdpPercap_1962
country                                                
Albania     1601.056136     1942.284244     2312.888958
Austria     6137.076492     8842.598030    10750.721110
Belgium     8343.105127     9714.960623    10991.206760

Clearly, the second statement produces an additional column and an additional row compared to the first statement. What conclusion can we draw? We see that a numerical slice, 0:2, omits the final index (i.e. index 2) in the range provided, while a named slice, 'gdpPercap_1952':'gdpPercap_1962', includes the final element.

Reconstructing Data¶

Explain what each line in the following short program does: what is in df1, df2, etc.?

df1 = pd.read_csv('data/gapminder_all.csv', index_col='country')
df2 = df1[df1['continent'] == 'Americas']
df3 = df2.drop('Puerto Rico')
df4 = df3.drop('continent', axis = 1)
df4.to_csv('result.csv')

Click the button to reveal the solution

Let’s go through this piece of code line by line.

df1 = pd.read_csv('data/gapminder_all.csv', index_col='country')

This line loads the dataset containing the GDP data from all countries into a dataframe called df1. The index_col='country' parameter selects which column to use as the row labels in the dataframe.

df2 = df1[df1['continent'] == 'Americas']

This line makes a selection: only those rows of df1 for which the ‘continent’ column matches ‘Americas’ are extracted. Notice how the Boolean expression inside the brackets, df1['continent'] == 'Americas', is used to select only those rows where the expression is true. Try printing this expression! Can you print also its individual True/False elements? (hint: first assign the expression to a variable)

df3 = df2.drop('Puerto Rico')

As the syntax suggests, this line drops the row from df2 where the label is ‘Puerto Rico’. The resulting dataframe df3 has one row less than the original dataframe df2.

df4 = df3.drop('continent', axis = 1)

Again we apply the drop function, but in this case we are dropping not a row but a whole column. To accomplish this, we need to specify also the axis parameter (we want to drop the second column which has index 1).

df4.to_csv('result.csv')

The final step is to write the data that we have been working on to a csv file. Pandas makes this easy with the to_csv() function. The only required argument to the function is the filename. Note that the file will be written in the directory from which you started the Jupyter or Python session.

Selecting Indices¶

Explain in simple terms what idxmin and idxmax do. When would you use these methods?

data = pd.read_csv('data/gapminder_gdp_europe.csv', index_col='country')
data.idxmin()

data.idxmax()

Click the button to reveal the solution

For each column in data, idxmin will return the index value corresponding to each column’s minimum; idxmax will do accordingly the same for each column’s maximum value.

You can use these functions whenever you want to get the row index of the minimum/maximum value and not the actual minimum/maximum value.

Practice with Selection¶

From the previous exercise, the Gapminder GDP data for Europe should be loaded in as data. Using this DataFrame, write an expression to select each of the following:

GDP per capita for all countries in 1982

Click the button to reveal the solution

data['gdpPercap_1982']

GDP per capita for Denmark for all years

Click the button to reveal the solution

data.loc['Denmark',:]

data.loc['Denmark',:'gdpPercap_2007']

GDP per capita for all countries for years after 1985

Click the button to reveal the solution

data.loc[:,'gdpPercap_1985':]

Note that pandas is smart enough to recognize the number at the end of the column label and does not give you an error, although no column named gdpPercap_1985 actually exists. This is useful if new columns are added to the CSV file later.

GDP per capita for each country in 2007 as a multiple of GDP per capita for that country in 1952

Click the button to reveal the solution

data[‘gdpPercap_2007’]/data[‘gdpPercap_1952’]

Summary of Key Points:¶

pandas DataFrames are a powerful way of storing and working with tabular (row/column) data
pandas columns and rows can have names
pandas row names are called indexes which are numeric by default, but can be given other labels
Use the .iloc[] method with a DataFrame to select values by integer location, using [row, column] format
Use the .loc[] method with a DataFrame to select rows and/or columns, using named slices
Use : on its own to mean all columns or all rows
Result of slicing can be used in further operations
Use comparisons to select data based on value
Select values or NaN using a Boolean mask
use split-apply-combine to derive analytics from groupings within a DataFrame

This lesson is adapted from the Software Carpentry Plotting and Programming in Python workshop.

References¶

Wickham, H. (2011). The Split-Apply-Combine Strategy for Data Analysis. Journal of Statistical Software, 40(1). 10.18637/jss.v040.i01

pandas DataFrames

pandas DataFrames¶

Questions:¶

Learning Objectives:¶

What is pandas?¶

About Python Libraries¶

Importing data with pandas¶

Heads or Tails?¶

Accessing values in a DataFrame¶

Select a column using []¶

Numerical indexing using .iloc[]¶

Label-based indexing using .loc[]¶

Use : on its own to mean all columns or all rows.¶

Slicing works on DataFrames¶

Use lists to select non-contiguous sections of a DataFrame¶

It’s easy to do simple math and statistics in DataFrames¶

Mini-Exercise¶

Evaluate cells based on conditions¶

Select values or NaN using a Boolean mask.¶

Split-Apply-Combine¶

Split¶

Apply¶

Combine¶

Chaining¶

Exercises¶

Selecting Individual Values¶

Extent of Slicing¶

Solution¶

Reconstructing Data¶

Selecting Indices¶

Practice with Selection¶

Summary of Key Points:¶

Select a column using `[]`¶

Numerical indexing using `.iloc[]`¶

Label-based indexing using `.loc[]`¶

Use `:` on its own to mean all columns or all rows.¶