Skip to article frontmatterSkip to article content
Site not loading correctly?

This may be due to an incorrect BASE_URL configuration. See the MyST Documentation for reference.

pandas DataFrames

pandas DataFrames


Questions:

  • How do I read in data from a file?

  • How can I work with data in tabular format (tables)?

  • How can I do basic descriptive statistics on tabular data?

Learning Objectives:

  • Select individual values from a Pandas dataframe

  • Select entire rows or entire columns from a dataframe

  • Select a subset of both rows and columns from a dataframe in a single operation

  • Select a subset of a dataframe by a single Boolean criterion

  • Obtain descriptive statistics for subsets of data within a table

  • Use the split-apply-combine paradigm to work with data


What is pandas?

Bad news first: there are no cute, black-and-white bears here. pandas (whose official name starts with a lower-case “p”) is a Python library for working with data in a tabular format, such as is found in file formats like CSV, Microsoft Excel, and Google Sheets. Unlike Excel or Sheets, pandas is not a point-and click graphical interface for working with these files — everything is done through Python code. But compared to other formats for working with data that we have seen in this workshop, such as lists and dictionaries, pandas may seem more familiar, and it definitely lends itself more naturally to large data sets. Indeed, pandas’ mission statement is, “...to be the fundamental high-level building block for doing practical, real world data analysis in Python”.

The primary units of pandas data storage you will work with are DataFrames (essentially, tables of data organized as rows and columns). DataFrames are actually collections of pandas Series objects, which can be thought of as individual rows or columns (or vectors, or 1D arrays).

Among the things that make pandas so attractive are the powerful interface to access individual records of the table, proper handling of missing values, and relational-databases operations between DataFrames. As well, pandas functions and methods are written to work intuitively and efficiently with data organized in tables. Most operations are vectorized, which means that they will automatically apply to all values in a DataFrame or Series without the need to write for loops to execute the same operation on a set of cells.

pandas is built on top of the NumPy library. Although we haven’t discussed NumPy in this workshop, it is a powerful and widely used Python library for working with numerical data. However, it’s worth noting for your future reference that most of the methods defined for NumPy Arrays also apply to Pandas Series/DataFrames.

About Python Libraries

pandas is an example of a Python library. A library is a collection of files (called modules) that contains functions for use by other programs. Libraries provide ways of extending Python’s functionality in different ways. They may also contain data values (e.g., numerical constants), entire sample data sets, and other things. A library’s contents are supposed to be related, although there’s no actual way to enforce that.

The Python standard library is an extensive suite of modules that comes with Python itself. Everything we’ve done so far in this workshop has been part of the standard library. Many additional libraries are available; CoCalc has a large number of extra libraries already installed.

To use a library in a particular Jupyter notebook or other Python program, we must import it using the import statement, like this:

import pandas

Once a library is imported, we can use functions and methods from it. But, for functions we have to tell Python that the function can be found in a particular library we imported. For example, pandas has a function to import data from CSV (comma-separated value) files, called read_csv. To run this command, we would need to type:

pandas.read_csv()

Since some package names are long, and adding the name to every function can result in a lot of typing, Python also allows us to assign an alias — a shorter name — to a library when we import it. For example, the convention for pandas is to give it the alias pd like this:

import pandas as pd

Then to read a CSV file we could use:

pd.read_csv()

In the cell below, import pandas with the alias pd:

import pandas as pd

Importing data with pandas

As noted, we can read a CSV file and use it to create a pandas DataFrame, with the funciton pd.read_csv(). CSV is a text format used for storing tabular data, in which each line of the file corresponds to a row in the table, and columns are separated with commas (“CSV” stands for “comma-separated values”). Often the first row of a CSV file will be the header, containing labels for each column.

The Gapminder data is in CSV format, so let’s load in one of the Gapminder datasets with the command below. Note that when we read in a DataFrame, we need to assign it to a variable name so that we can reference it later. A convention when working with pandas is to call the DataFrame df. This works fine if you only have one DataFrame to work with, although if you are working with multiple DataFrames it is a good idea to give them more meaningful names.

The Gapminder data are stored in a subfolder called data, so as the argument to pd.read_csv() below we give the folder name folled by a slash, then the file name:

df = pd.read_csv('data/gapminder_gdp_europe.csv')

We can view the contents of the DataFrame df by simply typing its name and running the cell. Note that, unlike most of the examples we’ve used in previous lessons, we don’t use the print() function. Although it works, the result is not nicely formatted the way the output is if we just use the name of the data frame.

That is, run this command: df — not print(df) — in the cell below.

df
Loading...

You’ll see that the rows are numbered in boldface, starting with 0 as is the norm in Python. This boldfaced, leftmost column is called the index of the DataFrame, and provides one way of accessing data by rows. Across the top, you’ll see that the column labels are also in boldface. pandas is pretty smart about automatically detecting when the first row of a CSV file contains header information (column names).

Heads or Tails?

We might want to “peek” at the DataFrame without printing out the entire thing, especially if it’s big. We can see the first 5 rows of a DataFrame with the .head() method:

df.head()
Loading...

...or the last 5 rows with .tail():

df.tail()
Loading...

We can also see a random sample of rows from the DataFrame with .sample(), giving it a numerical argument to indicate the number of rows we want to see:

df.sample(10)
Loading...

Note that the .head() and .tail() methods also optionally take a numerical argument, if you want to view a different number of rows from the default of 5.

Accessing values in a DataFrame

One thing we often want to do is access a single cell in a DataFrame, or a range of cells. Each cell is uniquely defined by a combination of its row and column locations.

Select a column using []

If we want to select an entire column of a pandas DataFrame, we just give the name of the DataFrame followed by the column name in square brackets:

df['gdpPercap_1992']
0 2497.437901 1 27042.018680 2 25575.570690 3 2546.781445 4 6302.623438 5 8447.794873 6 14297.021220 7 26406.739850 8 20647.164990 9 24703.796150 10 26505.303170 11 17541.496340 12 10535.628550 13 25144.392010 14 17558.815550 15 22013.644860 16 7003.339037 17 26790.949610 18 33965.661150 19 7738.881247 20 16207.266630 21 6598.409903 22 9325.068238 23 9498.467723 24 14214.716810 25 18603.064520 26 23880.016830 27 31871.530300 28 5678.348271 29 22705.092540 Name: gdpPercap_1992, dtype: float64

Note that if we ask for a single column the result is a pandas Series, but if we ask for two or more columns, the result is a DataFrame. Pay close attention to the syntax below — if we’re asking for more than one column, we need to provide a list of columns inside the square brackets (so there are two sets of nested sqaure brackets in the code below):

df[['gdpPercap_1982', 'gdpPercap_1992']]
Loading...

Numerical indexing using .iloc[]

Often we don’t want to access an entire column, however, but just specific rows within a column (or range of columns). pandas provides two ways of accessing cell locations. One is using the numerical positions in the DataFrame, using the convention of [row, column] — with [0, 0] being the top left cell in the DataFrame. So for a pandas DataFrame with 3 rows and 3 columns, the indices of each cell are as shown:

col 0col 1col 2col 3
0[0, 0][0, 1][0, 2][0, 3]
1[1, 0][1, 1][1, 2][1, 3]
2[2, 0][2, 1][2, 2][2, 3]
3[3, 0][3, 1][3, 2][3, 3]

Numerical indexing of DataFrames is done with the .iloc[] method. For example, to access the GDP value for Austria in 1952 — which is located in the second row, third column of our current DataFrame, we would use:

df.iloc[1, 2]
8842.59803

Label-based indexing using .loc[]

The other way to access a location in a DataFrame is by its index and column labels, using the .loc[] method. As noted earlier, in the DataFrame we imported, the indexes are currently numbers, which were created automatically when we imported the data. The .loc[] method doesn’t work with numerical indexes (that’s what iloc is for — and you can’t mix, say, a numerical row index with a column label), but in the data set we imported, the first column of this CSV file is actually meant to be its index: while all other columns are data values (GDP, in type float), the first column identifies the country with which each row of data is associated.

pandas has a method for setting an index column, .set_index(), where the argument (in the parentheses) would be the name of the column to use as the index. So here we want to run:

df = df.set_index('country')

Note that we need to assign the result of this operation back to df (using df = ), otherwise the change will not actually modify df.

In the cell below, use the .set_index() method to set the index of df to country, and then view the DataFrame again to see how it has changed.

df = df.set_index('country')

Alternatively, if we knew which column we wanted to use as the index before loading in the data file, we could have included the argument index_col= in the pd.read_csv() command:

df = pd.read_csv('data/gapminder_gdp_europe.csv', index_col='country')

Now that we have defined the index, we can access the 1952 GDP value for Austria by its index and column names:

df.loc['Austria', 'gdpPercap_1952']
6137.076492

Use : on its own to mean all columns or all rows.

Using Python’s familiar slicing notatio (which we’ve previously used for strings and lists), we can use : with .iloc[] or .loc[], to specify a range in a DataFrame.

For example, to see the GDP of Albania for every year (column) in the DataFrame, we would use:

df.loc["Albania", :]
gdpPercap_1952 1601.056136 gdpPercap_1957 1942.284244 gdpPercap_1962 2312.888958 gdpPercap_1967 2760.196931 gdpPercap_1972 3313.422188 gdpPercap_1977 3533.003910 gdpPercap_1982 3630.880722 gdpPercap_1987 3738.932735 gdpPercap_1992 2497.437901 gdpPercap_1997 3193.054604 gdpPercap_2002 4604.211737 gdpPercap_2007 5937.029526 Name: Albania, dtype: float64

Likewise, we could see the GDP for every country (row) in the year 1957 with:

df.loc[:, 'gdpPercap_1957']
country Albania 1942.284244 Austria 8842.598030 Belgium 9714.960623 Bosnia and Herzegovina 1353.989176 Bulgaria 3008.670727 Croatia 4338.231617 Czech Republic 8256.343918 Denmark 11099.659350 Finland 7545.415386 France 8662.834898 Germany 10187.826650 Greece 4916.299889 Hungary 6040.180011 Iceland 9244.001412 Ireland 5599.077872 Italy 6248.656232 Montenegro 3682.259903 Netherlands 11276.193440 Norway 11653.973040 Poland 4734.253019 Portugal 3774.571743 Romania 3943.370225 Serbia 4981.090891 Slovak Republic 6093.262980 Slovenia 5862.276629 Spain 4564.802410 Sweden 9911.878226 Switzerland 17909.489730 Turkey 2218.754257 United Kingdom 11283.177950 Name: gdpPercap_1957, dtype: float64

You can also just specify the row index; if you don’t specify anything for the columns, pandas assumes you want all columns:

df.loc["Albania"]
gdpPercap_1952 1601.056136 gdpPercap_1957 1942.284244 gdpPercap_1962 2312.888958 gdpPercap_1967 2760.196931 gdpPercap_1972 3313.422188 gdpPercap_1977 3533.003910 gdpPercap_1982 3630.880722 gdpPercap_1987 3738.932735 gdpPercap_1992 2497.437901 gdpPercap_1997 3193.054604 gdpPercap_2002 4604.211737 gdpPercap_2007 5937.029526 Name: Albania, dtype: float64

However, since the syntax for .iloc[] and .loc[] is [rows, columns], you cannot omit a row index; you need to use : if you want all rows.

Slicing works on DataFrames

Slicing using numerical indices works similarly for DataFrames as we previously saw for strings and lists, for example, the following code will print the third through fifth rows of the DataFrame, and the fifth through eighth columns (remember, Python indexing starts at 0, and slicing does not include the “end” index):

df.iloc[2:5, 4:8]

The code below will print from the sixth to second-last row of the DataFrame, and from the ninth to the last column:

df.iloc[5:-1, 8:]
Loading...

Note however, that when using label-based indexing with .loc[], pandas’ slicing behavior is a bit different. Specifically, the output includes the last item in the range, whereas numerical indexing with .iloc[] does not.

So, considering that the first three rows of the DataFrame correspond to the countries Albania, Austria, and Belgium, and that columns 6 and 7 are for the years 1982 and 1987 respectively, compare the output of:

df.iloc[0:2, 6:7]
Loading...

with:

df.loc['Albania':'Belgium', 'gdpPercap_1982':'gdpPercap_1988']
Loading...

The “inclusive” label-based indexing with .loc[] is fairly intuitive, but it’s important to remember that it works differently from numerical indexing.

Use lists to select non-contiguous sections of a DataFrame

While slicing can be very useful, sometimes we might want to extract values that aren’t next to each other in a DataFrame. For example, what if we only want values from two specific years (1992 and 2002), for Scandinavian countries (Denmark, Finland, Norway, and Sweden)? Nether these years nor countries are in adjacent columns/rows in the DataFrame. With .loc[], we can use lists, rather than ranges separated by :, as selectors:

df.loc[['Denmark', 'Finland', 'Norway', 'Sweden'], ['gdpPercap_1992', 'gdpPercap_2002']]

We can equivalently write the command over several lines to make it a bit easier to read:

df.loc[['Denmark', 'Finland', 'Norway', 'Sweden'], 
       ['gdpPercap_1992', 'gdpPercap_2002']
      ]

We could also define those lists as variables, and pass the variables to .loc[]. This might be useful if you were going to use the lists more than once, as well as for clarity:

scand_countries = ['Denmark', 'Finland', 'Iceland', 'Norway', 'Sweden']
years = ['gdpPercap_1992', 'gdpPercap_2002']
df.loc[scand_countries, years]

We can take this a step further, and assign the output of a .loc[] selection like this to a new variable name. This makes a copy of the selected data, stored in a new DataFrame (or Series, if we only select one row or column) with its own name. This allows us to later reference and use that selection.

scand_data = df.loc[scand_countries, years]

It’s easy to do simple math and statistics in DataFrames

We prevoiusly learned about methods to get simple statistical values out of a Python list, like .max(), and .min(). pandas includes these and many more methods as well. For example, we can view the mean GDP for Italy across all years (columns) with:

df.loc['Italy'].mean()

Or the largest GDP in 1977 with:

df.loc[:, 'gdpPercap_1977'].max()

Another useful method is .describe(), which prints out a range of descriptive statistics for the range of data you specify. Without any slicing it provides information for each column:

df.describe()

Mini-Exercise

In the cell below, use the scand_countries and years variables to view descriptive statistics for all Scandinavian countries in each year.

Evaluate cells based on conditions

pandas allows an easy way to identify values in a DataFrame that meet a certain condition, using operators like <, >, and ==. For example, let’s see which countries in a list had a GDP over 10,000 in 1962 and 1992. The result is reported in Booleans (True/False) for each cell.

countries = ['France', 'Germany', 'Italy', 'Spain', 'United Kingdom']
df.loc[countries, ['gdpPercap_1962', 'gdpPercap_1992']] > 10000

Select values or NaN using a Boolean mask.

A DataFrame full of Booleans is sometimes called a mask because of how it can be used. A mask removes values that are not True, and replaces them with NaN — a special Python value representing “not a number”. This can be useful because pandas ignores NaN values when doing computations.

We create a mask by assigning the output of a conditional statement to a variable name:

mask = scand_data > 30000

Then we can apply the mask to the DataFrame to get only the values that meet the criterion:

scand_data[mask]

As an example of how this might be used, the steps above would now allow us to find the lowest GDP value in each year, that was above 30,000:

scand_data[mask].min()

Split-Apply-Combine

A common task in data science is to split data into meaningful subgroups, apply an operation to each subgroup (e.g., compute the mean), and then combine the results into a single output, such as a table or a new DataFrame. This paradigm was famously described by Hadley Wickham in a 2011 paper.

pandas provides methods and grouping operations that are very efficient (vectorized) for split-apply-combine operations.

As an example, let’s say that we wanted to compare the average GDP for different regions of Europe, divided as northern, southern, eastern, and western. To do this, we first have to create lists defining the countries belonging to each of these regions:

northern = ['Denmark', 'Finland', 'Iceland', 'Norway', 'Sweden']
southern = ['Greece', 'Italy', 'Portugal', 'Spain']
eastern = ['Albania', 'Bosnia and Herzegovina', 'Bulgaria', 'Croatia', 
            'Czech Republic', 'Hungary', 'Montenegro', 'Poland', 'Romania', 
            'Serbia', 'Slovak Republic', 'Slovenia']
western = ['Austria',  'Belgium', 'France', 'Germany', 'Ireland', 
            'Netherlands', 'Switzerland', 'United Kingdom']

Next we can make a new column simply by using .loc[] with the rows specified by one of the lists we just defined, a column name that doesn’t already exist (in this case, we’ll call it “region”), then assigning a region label to that combination of rows and column. We need to do this separately for each region. Note that when we first create the new column (“region”), pandas fills it with NaN values in any rows that were not defined by the assignment. For example, in the code below, the first line will create the column “region”, and fill it with “northern” for any row in the northern list, and NaN to every other row.

df.loc[northern, 'region'] = 'northern'
df.loc[southern, 'region'] = 'southern'
df.loc[eastern, 'region'] = 'eastern'
df.loc[western, 'region'] = 'western'
df.loc[northern, 'region'] = 'northern'
df.loc[southern, 'region'] = 'southern'
df.loc[eastern, 'region'] = 'eastern'
df.loc[western, 'region'] = 'western'

Let’s look at the last two columns of df to see the region column we created:

df.iloc[:, -2:]

Split

Now we can use this “region” column to split the data into groups, using a pandas method called .groupby()

grouped_countries = df.groupby('region')

Note that this step doesn’t create a new DataFrame, it creates a special kind of pandas object that points to a grouping in the original DataFrame:

grouped_countries = df.groupby('region')

Apply

Now that we have split the data, we can apply a function separately to each group. Here we’ll compute the mean GDP for each region, for each year:

mean_gdp_by_region = grouped_countries.mean()

Combine

The combine step actually occurred with the apply step above — the result is automatically combined into a table of mean values organized by region. But since our apply step (.mean()) saved the result to a variable, we can view the resulting table as the output of the combine step:

mean_gdp_by_region
Loading...

Chaining

In Python, chaining refers to combining a number of operations in one command, using a sequence of methods. We can perform the above split-apply-combine procedure in a single step as follows. Note that because we don’t assign the output to a variable name, it is displayed as output but not saved.

df.groupby('region').mean()
Loading...

Exercises

Selecting Individual Values

Write an expression to find the Per Capita GDP of Serbia in 2007.

Extent of Slicing

  1. Do the two statements below produce the same output? (Hint: you might want to use the .head() method to remind yourself of the structure of the DataFrame)

  2. Based on this, what rule governs what is included (or not) in numerical slices and named slices in Pandas?

print(df.iloc[0:2, 0:2])
print(df.loc['Albania':'Belgium', 'gdpPercap_1952':'gdpPercap_1962'])

Reconstructing Data

Explain what each line in the following short program does: what is in df1, df2, etc.?

df1 = pd.read_csv('data/gapminder_all.csv', index_col='country')
df2 = df1[df1['continent'] == 'Americas']
df3 = df2.drop('Puerto Rico')
df4 = df3.drop('continent', axis = 1)
df4.to_csv('result.csv')

Selecting Indices

Explain in simple terms what idxmin and idxmax do. When would you use these methods?

data = pd.read_csv('data/gapminder_gdp_europe.csv', index_col='country')
data.idxmin()
data.idxmax()

Practice with Selection

From the previous exercise, the Gapminder GDP data for Europe should be loaded in as data. Using this DataFrame, write an expression to select each of the following:

  • GDP per capita for all countries in 1982

  • GDP per capita for Denmark for all years

  • GDP per capita for all countries for years after 1985

Note that pandas is smart enough to recognize the number at the end of the column label and does not give you an error, although no column named gdpPercap_1985 actually exists. This is useful if new columns are added to the CSV file later.

  • GDP per capita for each country in 2007 as a multiple of GDP per capita for that country in 1952


Summary of Key Points:

  • pandas DataFrames are a powerful way of storing and working with tabular (row/column) data

  • pandas columns and rows can have names

  • pandas row names are called indexes which are numeric by default, but can be given other labels

  • Use the .iloc[] method with a DataFrame to select values by integer location, using [row, column] format

  • Use the .loc[] method with a DataFrame to select rows and/or columns, using named slices

  • Use : on its own to mean all columns or all rows

  • Result of slicing can be used in further operations

  • Use comparisons to select data based on value

  • Select values or NaN using a Boolean mask

  • use split-apply-combine to derive analytics from groupings within a DataFrame


This lesson is adapted from the Software Carpentry Plotting and Programming in Python workshop.

References
  1. Wickham, H. (2011). The Split-Apply-Combine Strategy for Data Analysis. Journal of Statistical Software, 40(1). 10.18637/jss.v040.i01