pandas DataFrames#


Questions:#

  • How do I read in data from a file?

  • How can I work with data in tabular format (tables)?

  • How can I do basic descriptive statistics on tabular data?

Learning Objectives:#

  • Select individual values from a Pandas dataframe

  • Select entire rows or entire columns from a dataframe

  • Select a subset of both rows and columns from a dataframe in a single operation

  • Select a subset of a dataframe by a single Boolean criterion

  • Obtain descriptive statistics for subsets of data within a table

  • Use the split-apply-combine paradigm to work with data


What is pandas?#

Bad news first: there are no cute, black-and-white bears here. pandas (whose official name starts with a lower-case “p”) is a Python library for working with data in a tabular format, such as is found in file formats like CSV, Microsoft Excel, and Google Sheets. Unlike Excel or Sheets, pandas is not a point-and click graphical interface for working with these files — everything is done through Python code. But compared to other formats for working with data that we have seen in this workshop, such as lists and dictionaries, pandas may seem more familiar, and it definitely lends itself more naturally to large data sets. Indeed, pandas’ mission statement is, “…to be the fundamental high-level building block for doing practical, real world data analysis in Python”.

The primary units of pandas data storage you will work with are DataFrames (essentially, tables of data organized as rows and columns). DataFrames are actually collections of pandas Series objects, which can be thought of as individual rows or columns (or vectors, or 1D arrays).

Among the things that make pandas so attractive are the powerful interface to access individual records of the table, proper handling of missing values, and relational-databases operations between DataFrames. As well, pandas functions and methods are written to work intuitively and efficiently with data organized in tables. Most operations are vectorized, which means that they will automatically apply to all values in a DataFrame or Series without the need to write for loops to execute the same operation on a set of cells.

pandas is built on top of the NumPy library. Although we haven’t discussed NumPy in this workshop, it is a powerful and widely used Python library for working with numerical data. However, it’s worth noting for your future reference that most of the methods defined for NumPy Arrays also apply to Pandas Series/DataFrames.

About Python Libraries#

pandas is an example of a Python library. A library is a collection of files (called modules) that contains functions for use by other programs. Libraries provide ways of extending Python’s functionality in different ways. They may also contain data values (e.g., numerical constants), entire sample data sets, and other things. A library’s contents are supposed to be related, although there’s no actual way to enforce that.

The Python standard library is an extensive suite of modules that comes with Python itself. Everything we’ve done so far in this workshop has been part of the standard library. Many additional libraries are available; CoCalc has a large number of extra libraries already installed.

To use a library in a particular Jupyter notebook or other Python program, we must import it using the import statement, like this:

import pandas

Once a library is imported, we can use functions and methods from it. But, for functions we have to tell Python that the function can be found in a particular library we imported. For example, pandas has a function to import data from CSV (comma-separated value) files, called read_csv. To run this command, we would need to type:

pandas.read_csv()

Since some package names are long, and adding the name to every function can result in a lot of typing, Python also allows us to assign an alias — a shorter name — to a library when we import it. For example, the convention for pandas is to give it the alias pd like this:

import pandas as pd

Then to read a CSV file we could use:

pd.read_csv()

In the cell below, import pandas with the alias pd:

import pandas as pd

Importing data with pandas#

As noted, we can read a CSV file and use it to create a pandas DataFrame, with the funciton pd.read_csv(). CSV is a text format used for storing tabular data, in which each line of the file corresponds to a row in the table, and columns are separated with commas (“CSV” stands for “comma-separated values”). Often the first row of a CSV file will be the header, containing labels for each column.

The Gapminder data is in CSV format, so let’s load in one of the Gapminder datasets with the command below. Note that when we read in a DataFrame, we need to assign it to a variable name so that we can reference it later. A convention when working with pandas is to call the DataFrame df. This works fine if you only have one DataFrame to work with, although if you are working with multiple DataFrames it is a good idea to give them more meaningful names.

The Gapminder data are stored in a subfolder called data, so as the argument to pd.read_csv() below we give the folder name folled by a slash, then the file name:

df = pd.read_csv('data/gapminder_gdp_europe.csv')

We can view the contents of the DataFrame df by simply typing its name and running the cell. Note that, unlike most of the examples we’ve used in previous lessons, we don’t use the print() function. Although it works, the result is not nicely formatted the way the output is if we just use the name of the data frame.

That is, run this command: df — not print(df) — in the cell below.

df
country gdpPercap_1952 gdpPercap_1957 gdpPercap_1962 gdpPercap_1967 gdpPercap_1972 gdpPercap_1977 gdpPercap_1982 gdpPercap_1987 gdpPercap_1992 gdpPercap_1997 gdpPercap_2002 gdpPercap_2007
0 Albania 1601.056136 1942.284244 2312.888958 2760.196931 3313.422188 3533.003910 3630.880722 3738.932735 2497.437901 3193.054604 4604.211737 5937.029526
1 Austria 6137.076492 8842.598030 10750.721110 12834.602400 16661.625600 19749.422300 21597.083620 23687.826070 27042.018680 29095.920660 32417.607690 36126.492700
2 Belgium 8343.105127 9714.960623 10991.206760 13149.041190 16672.143560 19117.974480 20979.845890 22525.563080 25575.570690 27561.196630 30485.883750 33692.605080
3 Bosnia and Herzegovina 973.533195 1353.989176 1709.683679 2172.352423 2860.169750 3528.481305 4126.613157 4314.114757 2546.781445 4766.355904 6018.975239 7446.298803
4 Bulgaria 2444.286648 3008.670727 4254.337839 5577.002800 6597.494398 7612.240438 8224.191647 8239.854824 6302.623438 5970.388760 7696.777725 10680.792820
5 Croatia 3119.236520 4338.231617 5477.890018 6960.297861 9164.090127 11305.385170 13221.821840 13822.583940 8447.794873 9875.604515 11628.388950 14619.222720
6 Czech Republic 6876.140250 8256.343918 10136.867130 11399.444890 13108.453600 14800.160620 15377.228550 16310.443400 14297.021220 16048.514240 17596.210220 22833.308510
7 Denmark 9692.385245 11099.659350 13583.313510 15937.211230 18866.207210 20422.901500 21688.040480 25116.175810 26406.739850 29804.345670 32166.500060 35278.418740
8 Finland 6424.519071 7545.415386 9371.842561 10921.636260 14358.875900 15605.422830 18533.157610 21141.012230 20647.164990 23723.950200 28204.590570 33207.084400
9 France 7029.809327 8662.834898 10560.485530 12999.917660 16107.191710 18292.635140 20293.897460 22066.442140 24703.796150 25889.784870 28926.032340 30470.016700
10 Germany 7144.114393 10187.826650 12902.462910 14745.625610 18016.180270 20512.921230 22031.532740 24639.185660 26505.303170 27788.884160 30035.801980 32170.374420
11 Greece 3530.690067 4916.299889 6017.190733 8513.097016 12724.829570 14195.524280 15268.420890 16120.528390 17541.496340 18747.698140 22514.254800 27538.411880
12 Hungary 5263.673816 6040.180011 7550.359877 9326.644670 10168.656110 11674.837370 12545.990660 12986.479980 10535.628550 11712.776800 14843.935560 18008.944440
13 Iceland 7267.688428 9244.001412 10350.159060 13319.895680 15798.063620 19654.962470 23269.607500 26923.206280 25144.392010 28061.099660 31163.201960 36180.789190
14 Ireland 5210.280328 5599.077872 6631.597314 7655.568963 9530.772896 11150.981130 12618.321410 13872.866520 17558.815550 24521.947130 34077.049390 40675.996350
15 Italy 4931.404155 6248.656232 8243.582340 10022.401310 12269.273780 14255.984750 16537.483500 19207.234820 22013.644860 24675.024460 27968.098170 28569.719700
16 Montenegro 2647.585601 3682.259903 4649.593785 5907.850937 7778.414017 9595.929905 11222.587620 11732.510170 7003.339037 6465.613349 6557.194282 9253.896111
17 Netherlands 8941.571858 11276.193440 12790.849560 15363.251360 18794.745670 21209.059200 21399.460460 23651.323610 26790.949610 30246.130630 33724.757780 36797.933320
18 Norway 10095.421720 11653.973040 13450.401510 16361.876470 18965.055510 23311.349390 26298.635310 31540.974800 33965.661150 41283.164330 44683.975250 49357.190170
19 Poland 4029.329699 4734.253019 5338.752143 6557.152776 8006.506993 9508.141454 8451.531004 9082.351172 7738.881247 10159.583680 12002.239080 15389.924680
20 Portugal 3068.319867 3774.571743 4727.954889 6361.517993 9022.247417 10172.485720 11753.842910 13039.308760 16207.266630 17641.031560 19970.907870 20509.647770
21 Romania 3144.613186 3943.370225 4734.997586 6470.866545 8011.414402 9356.397240 9605.314053 9696.273295 6598.409903 7346.547557 7885.360081 10808.475610
22 Serbia 3581.459448 4981.090891 6289.629157 7991.707066 10522.067490 12980.669560 15181.092700 15870.878510 9325.068238 7914.320304 7236.075251 9786.534714
23 Slovak Republic 5074.659104 6093.262980 7481.107598 8412.902397 9674.167626 10922.664040 11348.545850 12037.267580 9498.467723 12126.230650 13638.778370 18678.314350
24 Slovenia 4215.041741 5862.276629 7402.303395 9405.489397 12383.486200 15277.030170 17866.721750 18678.534920 14214.716810 17161.107350 20660.019360 25768.257590
25 Spain 3834.034742 4564.802410 5693.843879 7993.512294 10638.751310 13236.921170 13926.169970 15764.983130 18603.064520 20445.298960 24835.471660 28821.063700
26 Sweden 8527.844662 9911.878226 12329.441920 15258.296970 17832.024640 18855.725210 20667.381250 23586.929270 23880.016830 25266.594990 29341.630930 33859.748350
27 Switzerland 14734.232750 17909.489730 20431.092700 22966.144320 27195.113040 26982.290520 28397.715120 30281.704590 31871.530300 32135.323010 34480.957710 37506.419070
28 Turkey 1969.100980 2218.754257 2322.869908 2826.356387 3450.696380 4269.122326 4241.356344 5089.043686 5678.348271 6601.429915 6508.085718 8458.276384
29 United Kingdom 9979.508487 11283.177950 12477.177070 14142.850890 15895.116410 17428.748460 18232.424520 21664.787670 22705.092540 26074.531360 29478.999190 33203.261280

You’ll see that the rows are numbered in boldface, starting with 0 as is the norm in Python. This boldfaced, leftmost column is called the index of the DataFrame, and provides one way of accessing data by rows. Across the top, you’ll see that the column labels are also in boldface. pandas is pretty smart about automatically detecting when the first row of a CSV file contains header information (column names).

Heads or Tails?#

We might want to “peek” at the DataFrame without printing out the entire thing, especially if it’s big. We can see the first 5 rows of a DataFrame with the .head() method:

df.head()
country gdpPercap_1952 gdpPercap_1957 gdpPercap_1962 gdpPercap_1967 gdpPercap_1972 gdpPercap_1977 gdpPercap_1982 gdpPercap_1987 gdpPercap_1992 gdpPercap_1997 gdpPercap_2002 gdpPercap_2007
0 Albania 1601.056136 1942.284244 2312.888958 2760.196931 3313.422188 3533.003910 3630.880722 3738.932735 2497.437901 3193.054604 4604.211737 5937.029526
1 Austria 6137.076492 8842.598030 10750.721110 12834.602400 16661.625600 19749.422300 21597.083620 23687.826070 27042.018680 29095.920660 32417.607690 36126.492700
2 Belgium 8343.105127 9714.960623 10991.206760 13149.041190 16672.143560 19117.974480 20979.845890 22525.563080 25575.570690 27561.196630 30485.883750 33692.605080
3 Bosnia and Herzegovina 973.533195 1353.989176 1709.683679 2172.352423 2860.169750 3528.481305 4126.613157 4314.114757 2546.781445 4766.355904 6018.975239 7446.298803
4 Bulgaria 2444.286648 3008.670727 4254.337839 5577.002800 6597.494398 7612.240438 8224.191647 8239.854824 6302.623438 5970.388760 7696.777725 10680.792820

…or the last 5 rows with .tail():

df.tail()
country gdpPercap_1952 gdpPercap_1957 gdpPercap_1962 gdpPercap_1967 gdpPercap_1972 gdpPercap_1977 gdpPercap_1982 gdpPercap_1987 gdpPercap_1992 gdpPercap_1997 gdpPercap_2002 gdpPercap_2007
25 Spain 3834.034742 4564.802410 5693.843879 7993.512294 10638.75131 13236.921170 13926.169970 15764.983130 18603.064520 20445.298960 24835.471660 28821.063700
26 Sweden 8527.844662 9911.878226 12329.441920 15258.296970 17832.02464 18855.725210 20667.381250 23586.929270 23880.016830 25266.594990 29341.630930 33859.748350
27 Switzerland 14734.232750 17909.489730 20431.092700 22966.144320 27195.11304 26982.290520 28397.715120 30281.704590 31871.530300 32135.323010 34480.957710 37506.419070
28 Turkey 1969.100980 2218.754257 2322.869908 2826.356387 3450.69638 4269.122326 4241.356344 5089.043686 5678.348271 6601.429915 6508.085718 8458.276384
29 United Kingdom 9979.508487 11283.177950 12477.177070 14142.850890 15895.11641 17428.748460 18232.424520 21664.787670 22705.092540 26074.531360 29478.999190 33203.261280

We can also see a random sample of rows from the DataFrame with .sample(), giving it a numerical argument to indicate the number of rows we want to see:

df.sample(10)
country gdpPercap_1952 gdpPercap_1957 gdpPercap_1962 gdpPercap_1967 gdpPercap_1972 gdpPercap_1977 gdpPercap_1982 gdpPercap_1987 gdpPercap_1992 gdpPercap_1997 gdpPercap_2002 gdpPercap_2007
27 Switzerland 14734.232750 17909.489730 20431.092700 22966.144320 27195.113040 26982.29052 28397.715120 30281.704590 31871.530300 32135.323010 34480.957710 37506.41907
26 Sweden 8527.844662 9911.878226 12329.441920 15258.296970 17832.024640 18855.72521 20667.381250 23586.929270 23880.016830 25266.594990 29341.630930 33859.74835
17 Netherlands 8941.571858 11276.193440 12790.849560 15363.251360 18794.745670 21209.05920 21399.460460 23651.323610 26790.949610 30246.130630 33724.757780 36797.93332
25 Spain 3834.034742 4564.802410 5693.843879 7993.512294 10638.751310 13236.92117 13926.169970 15764.983130 18603.064520 20445.298960 24835.471660 28821.06370
24 Slovenia 4215.041741 5862.276629 7402.303395 9405.489397 12383.486200 15277.03017 17866.721750 18678.534920 14214.716810 17161.107350 20660.019360 25768.25759
5 Croatia 3119.236520 4338.231617 5477.890018 6960.297861 9164.090127 11305.38517 13221.821840 13822.583940 8447.794873 9875.604515 11628.388950 14619.22272
11 Greece 3530.690067 4916.299889 6017.190733 8513.097016 12724.829570 14195.52428 15268.420890 16120.528390 17541.496340 18747.698140 22514.254800 27538.41188
21 Romania 3144.613186 3943.370225 4734.997586 6470.866545 8011.414402 9356.39724 9605.314053 9696.273295 6598.409903 7346.547557 7885.360081 10808.47561
13 Iceland 7267.688428 9244.001412 10350.159060 13319.895680 15798.063620 19654.96247 23269.607500 26923.206280 25144.392010 28061.099660 31163.201960 36180.78919
12 Hungary 5263.673816 6040.180011 7550.359877 9326.644670 10168.656110 11674.83737 12545.990660 12986.479980 10535.628550 11712.776800 14843.935560 18008.94444

Note that the .head() and .tail() methods also optionally take a numerical argument, if you want to view a different number of rows from the default of 5.

Accessing values in a DataFrame#

One thing we often want to do is access a single cell in a DataFrame, or a range of cells. Each cell is uniquely defined by a combination of its row and column locations.

Select a column using []#

If we want to select an entire column of a pandas DataFrame, we just give the name of the DataFrame followed by the column name in square brackets:

df['gdpPercap_1992']
0      2497.437901
1     27042.018680
2     25575.570690
3      2546.781445
4      6302.623438
5      8447.794873
6     14297.021220
7     26406.739850
8     20647.164990
9     24703.796150
10    26505.303170
11    17541.496340
12    10535.628550
13    25144.392010
14    17558.815550
15    22013.644860
16     7003.339037
17    26790.949610
18    33965.661150
19     7738.881247
20    16207.266630
21     6598.409903
22     9325.068238
23     9498.467723
24    14214.716810
25    18603.064520
26    23880.016830
27    31871.530300
28     5678.348271
29    22705.092540
Name: gdpPercap_1992, dtype: float64

Note that if we ask for a single column the result is a pandas Series, but if we ask for two or more columns, the result is a DataFrame. Pay close attention to the syntax below — if we’re asking for more than one column, we need to provide a list of columns inside the square brackets (so there are two sets of nested sqaure brackets in the code below):

df[['gdpPercap_1982', 'gdpPercap_1992']]
gdpPercap_1982 gdpPercap_1992
0 3630.880722 2497.437901
1 21597.083620 27042.018680
2 20979.845890 25575.570690
3 4126.613157 2546.781445
4 8224.191647 6302.623438
5 13221.821840 8447.794873
6 15377.228550 14297.021220
7 21688.040480 26406.739850
8 18533.157610 20647.164990
9 20293.897460 24703.796150
10 22031.532740 26505.303170
11 15268.420890 17541.496340
12 12545.990660 10535.628550
13 23269.607500 25144.392010
14 12618.321410 17558.815550
15 16537.483500 22013.644860
16 11222.587620 7003.339037
17 21399.460460 26790.949610
18 26298.635310 33965.661150
19 8451.531004 7738.881247
20 11753.842910 16207.266630
21 9605.314053 6598.409903
22 15181.092700 9325.068238
23 11348.545850 9498.467723
24 17866.721750 14214.716810
25 13926.169970 18603.064520
26 20667.381250 23880.016830
27 28397.715120 31871.530300
28 4241.356344 5678.348271
29 18232.424520 22705.092540

Numerical indexing using .iloc[]#

Often we don’t want to access an entire column, however, but just specific rows within a column (or range of columns). pandas provides two ways of accessing cell locations. One is using the numerical positions in the DataFrame, using the convention of [row, column] — with [0, 0] being the top left cell in the DataFrame. So for a pandas DataFrame with 3 rows and 3 columns, the indices of each cell are as shown:

col 0

col 1

col 2

col 3

0

[0, 0]

[0, 1]

[0, 2]

[0, 3]

1

[1, 0]

[1, 1]

[1, 2]

[1, 3]

2

[2, 0]

[2, 1]

[2, 2]

[2, 3]

3

[3, 0]

[3, 1]

[3, 2]

[3, 3]

Numerical indexing of DataFrames is done with the .iloc[] method. For example, to access the GDP value for Austria in 1952 — which is located in the second row, third column of our current DataFrame, we would use:

df.iloc[1, 2]
8842.59803

Label-based indexing using .loc[]#

The other way to access a location in a DataFrame is by its index and column labels, using the .loc[] method. As noted earlier, in the DataFrame we imported, the indexes are currently numbers, which were created automatically when we imported the data. The .loc[] method doesn’t work with numerical indexes (that’s what iloc is for — and you can’t mix, say, a numerical row index with a column label), but in the data set we imported, the first column of this CSV file is actually meant to be its index: while all other columns are data values (GDP, in type float), the first column identifies the country with which each row of data is associated.

pandas has a method for setting an index column, .set_index(), where the argument (in the parentheses) would be the name of the column to use as the index. So here we want to run:

df = df.set_index('country')

Note that we need to assign the result of this operation back to df (using df = ), otherwise the change will not actually modify df.

In the cell below, use the .set_index() method to set the index of df to country, and then view the DataFrame again to see how it has changed.

df = df.set_index('country')

Alternatively, if we knew which column we wanted to use as the index before loading in the data file, we could have included the argument index_col= in the pd.read_csv() command:

df = pd.read_csv('data/gapminder_gdp_europe.csv', index_col='country')

Now that we have defined the index, we can access the 1952 GDP value for Austria by its index and column names:

df.loc['Austria', 'gdpPercap_1952']
6137.076492

Use : on its own to mean all columns or all rows.#

Using Python’s familiar slicing notatio (which we’ve previously used for strings and lists), we can use : with .iloc[] or .loc[], to specify a range in a DataFrame.

For example, to see the GDP of Albania for every year (column) in the DataFrame, we would use:

df.loc["Albania", :]
gdpPercap_1952    1601.056136
gdpPercap_1957    1942.284244
gdpPercap_1962    2312.888958
gdpPercap_1967    2760.196931
gdpPercap_1972    3313.422188
gdpPercap_1977    3533.003910
gdpPercap_1982    3630.880722
gdpPercap_1987    3738.932735
gdpPercap_1992    2497.437901
gdpPercap_1997    3193.054604
gdpPercap_2002    4604.211737
gdpPercap_2007    5937.029526
Name: Albania, dtype: float64

Likewise, we could see the GDP for every country (row) in the year 1957 with:

df.loc[:, 'gdpPercap_1957']
country
Albania                    1942.284244
Austria                    8842.598030
Belgium                    9714.960623
Bosnia and Herzegovina     1353.989176
Bulgaria                   3008.670727
Croatia                    4338.231617
Czech Republic             8256.343918
Denmark                   11099.659350
Finland                    7545.415386
France                     8662.834898
Germany                   10187.826650
Greece                     4916.299889
Hungary                    6040.180011
Iceland                    9244.001412
Ireland                    5599.077872
Italy                      6248.656232
Montenegro                 3682.259903
Netherlands               11276.193440
Norway                    11653.973040
Poland                     4734.253019
Portugal                   3774.571743
Romania                    3943.370225
Serbia                     4981.090891
Slovak Republic            6093.262980
Slovenia                   5862.276629
Spain                      4564.802410
Sweden                     9911.878226
Switzerland               17909.489730
Turkey                     2218.754257
United Kingdom            11283.177950
Name: gdpPercap_1957, dtype: float64

You can also just specify the row index; if you don’t specify anything for the columns, pandas assumes you want all columns:

df.loc["Albania"]
gdpPercap_1952    1601.056136
gdpPercap_1957    1942.284244
gdpPercap_1962    2312.888958
gdpPercap_1967    2760.196931
gdpPercap_1972    3313.422188
gdpPercap_1977    3533.003910
gdpPercap_1982    3630.880722
gdpPercap_1987    3738.932735
gdpPercap_1992    2497.437901
gdpPercap_1997    3193.054604
gdpPercap_2002    4604.211737
gdpPercap_2007    5937.029526
Name: Albania, dtype: float64

However, since the syntax for .iloc[] and .loc[] is [rows, columns], you cannot omit a row index; you need to use : if you want all rows.

Slicing works on DataFrames#

Slicing using numerical indices works similarly for DataFrames as we previously saw for strings and lists, for example, the following code will print the third through fifth rows of the DataFrame, and the fifth through eighth columns (remember, Python indexing starts at 0, and slicing does not include the “end” index):

df.iloc[2:5, 4:8]
gdpPercap_1972 gdpPercap_1977 gdpPercap_1982 gdpPercap_1987
country
Belgium 16672.143560 19117.974480 20979.845890 22525.563080
Bosnia and Herzegovina 2860.169750 3528.481305 4126.613157 4314.114757
Bulgaria 6597.494398 7612.240438 8224.191647 8239.854824

The code below will print from the sixth to second-last row of the DataFrame, and from the ninth to the last column:

df.iloc[5:-1, 8:]
gdpPercap_1992 gdpPercap_1997 gdpPercap_2002 gdpPercap_2007
country
Croatia 8447.794873 9875.604515 11628.388950 14619.222720
Czech Republic 14297.021220 16048.514240 17596.210220 22833.308510
Denmark 26406.739850 29804.345670 32166.500060 35278.418740
Finland 20647.164990 23723.950200 28204.590570 33207.084400
France 24703.796150 25889.784870 28926.032340 30470.016700
Germany 26505.303170 27788.884160 30035.801980 32170.374420
Greece 17541.496340 18747.698140 22514.254800 27538.411880
Hungary 10535.628550 11712.776800 14843.935560 18008.944440
Iceland 25144.392010 28061.099660 31163.201960 36180.789190
Ireland 17558.815550 24521.947130 34077.049390 40675.996350
Italy 22013.644860 24675.024460 27968.098170 28569.719700
Montenegro 7003.339037 6465.613349 6557.194282 9253.896111
Netherlands 26790.949610 30246.130630 33724.757780 36797.933320
Norway 33965.661150 41283.164330 44683.975250 49357.190170
Poland 7738.881247 10159.583680 12002.239080 15389.924680
Portugal 16207.266630 17641.031560 19970.907870 20509.647770
Romania 6598.409903 7346.547557 7885.360081 10808.475610
Serbia 9325.068238 7914.320304 7236.075251 9786.534714
Slovak Republic 9498.467723 12126.230650 13638.778370 18678.314350
Slovenia 14214.716810 17161.107350 20660.019360 25768.257590
Spain 18603.064520 20445.298960 24835.471660 28821.063700
Sweden 23880.016830 25266.594990 29341.630930 33859.748350
Switzerland 31871.530300 32135.323010 34480.957710 37506.419070
Turkey 5678.348271 6601.429915 6508.085718 8458.276384

Note however, that when using label-based indexing with .loc[], pandas’ slicing behavior is a bit different. Specifically, the output includes the last item in the range, whereas numerical indexing with .iloc[] does not.

So, considering that the first three rows of the DataFrame correspond to the countries Albania, Austria, and Belgium, and that columns 6 and 7 are for the years 1982 and 1987 respectively, compare the output of:

df.iloc[0:2, 6:7]
gdpPercap_1982
country
Albania 3630.880722
Austria 21597.083620

with:

df.loc['Albania':'Belgium', 'gdpPercap_1982':'gdpPercap_1988']
gdpPercap_1982 gdpPercap_1987
country
Albania 3630.880722 3738.932735
Austria 21597.083620 23687.826070
Belgium 20979.845890 22525.563080

The “inclusive” label-based indexing with .loc[] is fairly intuitive, but it’s important to remember that it works differently from numerical indexing.

Use lists to select non-contiguous sections of a DataFrame#

While slicing can be very useful, sometimes we might want to extract values that aren’t next to each other in a DataFrame. For example, what if we only want values from two specific years (1992 and 2002), for Scandinavian countries (Denmark, Finland, Norway, and Sweden)? Nether these years nor countries are in adjacent columns/rows in the DataFrame. With .loc[], we can use lists, rather than ranges separated by :, as selectors:

df.loc[['Denmark', 'Finland', 'Norway', 'Sweden'], ['gdpPercap_1992', 'gdpPercap_2002']]
gdpPercap_1992 gdpPercap_2002
country
Denmark 26406.73985 32166.50006
Finland 20647.16499 28204.59057
Norway 33965.66115 44683.97525
Sweden 23880.01683 29341.63093

We can equivalently write the command over several lines to make it a bit easier to read:

df.loc[['Denmark', 'Finland', 'Norway', 'Sweden'], 
       ['gdpPercap_1992', 'gdpPercap_2002']
      ]
gdpPercap_1992 gdpPercap_2002
country
Denmark 26406.73985 32166.50006
Finland 20647.16499 28204.59057
Norway 33965.66115 44683.97525
Sweden 23880.01683 29341.63093

We could also define those lists as variables, and pass the variables to .loc[]. This might be useful if you were going to use the lists more than once, as well as for clarity:

scand_countries = ['Denmark', 'Finland', 'Iceland', 'Norway', 'Sweden']
years = ['gdpPercap_1992', 'gdpPercap_2002']
df.loc[scand_countries, years]
gdpPercap_1992 gdpPercap_2002
country
Denmark 26406.73985 32166.50006
Finland 20647.16499 28204.59057
Iceland 25144.39201 31163.20196
Norway 33965.66115 44683.97525
Sweden 23880.01683 29341.63093

We can take this a step further, and assign the output of a .loc[] selection like this to a new variable name. This makes a copy of the selected data, stored in a new DataFrame (or Series, if we only select one row or column) with its own name. This allows us to later reference and use that selection.

scand_data = df.loc[scand_countries, years]

It’s easy to do simple math and statistics in DataFrames#

We prevoiusly learned about methods to get simple statistical values out of a Python list, like .max(), and .min(). pandas includes these and many more methods as well. For example, we can view the mean GDP for Italy across all years (columns) with:

df.loc['Italy'].mean()
16245.20900641667

Or the largest GDP in 1977 with:

df.loc[:, 'gdpPercap_1977'].max()
26982.29052

Another useful method is .describe(), which prints out a range of descriptive statistics for the range of data you specify. Without any slicing it provides information for each column:

df.describe()
gdpPercap_1952 gdpPercap_1957 gdpPercap_1962 gdpPercap_1967 gdpPercap_1972 gdpPercap_1977 gdpPercap_1982 gdpPercap_1987 gdpPercap_1992 gdpPercap_1997 gdpPercap_2002 gdpPercap_2007
count 30.000000 30.000000 30.000000 30.000000 30.000000 30.000000 30.000000 30.000000 30.000000 30.000000 30.000000 30.000000
mean 5661.057435 6963.012816 8365.486814 10143.823757 12479.575246 14283.979110 15617.896551 17214.310727 17061.568084 19076.781802 21711.732422 25054.481636
std 3114.060493 3677.950146 4199.193906 4724.983889 5509.691411 5874.464896 6453.234827 7482.957960 9109.804361 10065.457716 11197.355517 11800.339811
min 973.533195 1353.989176 1709.683679 2172.352423 2860.169750 3528.481305 3630.880722 3738.932735 2497.437901 3193.054604 4604.211737 5937.029526
25% 3241.132406 4394.874315 5373.536612 6657.939047 9057.708094 10360.030300 11449.870115 12274.570680 8667.113214 9946.599306 11721.851483 14811.898210
50% 5142.469716 6066.721495 7515.733737 9366.067033 12326.379990 14225.754515 15322.824720 16215.485895 17550.155945 19596.498550 23674.863230 28054.065790
75% 7236.794919 9597.220820 10931.085348 13277.182058 16523.017127 19052.412163 20901.729730 23321.587723 25034.243045 27189.530312 30373.363307 33817.962533
max 14734.232750 17909.489730 20431.092700 22966.144320 27195.113040 26982.290520 28397.715120 31540.974800 33965.661150 41283.164330 44683.975250 49357.190170

Mini-Exercise#

In the cell below, use the scand_countries and years variables to view descriptive statistics for all Scandinavian countries in each year.

Evaluate cells based on conditions#

pandas allows an easy way to identify values in a DataFrame that meet a certain condition, using operators like <, >, and ==. For example, let’s see which countries in a list had a GDP over 10,000 in 1962 and 1992. The result is reported in Booleans (True/False) for each cell.

countries = ['France', 'Germany', 'Italy', 'Spain', 'United Kingdom']
df.loc[countries, ['gdpPercap_1962', 'gdpPercap_1992']] > 10000
gdpPercap_1962 gdpPercap_1992
country
France True True
Germany True True
Italy False True
Spain False True
United Kingdom True True

Select values or NaN using a Boolean mask.#

A DataFrame full of Booleans is sometimes called a mask because of how it can be used. A mask removes values that are not True, and replaces them with NaN — a special Python value representing “not a number”. This can be useful because pandas ignores NaN values when doing computations.

We create a mask by assigning the output of a conditional statement to a variable name:

mask = scand_data > 30000

Then we can apply the mask to the DataFrame to get only the values that meet the criterion:

scand_data[mask]
gdpPercap_1992 gdpPercap_2002
country
Denmark NaN 32166.50006
Finland NaN NaN
Iceland NaN 31163.20196
Norway 33965.66115 44683.97525
Sweden NaN NaN

As an example of how this might be used, the steps above would now allow us to find the lowest GDP value in each year, that was above 30,000:

scand_data[mask].min()
gdpPercap_1992    33965.66115
gdpPercap_2002    31163.20196
dtype: float64

Split-Apply-Combine#

A common task in data science is to split data into meaningful subgroups, apply an operation to each subgroup (e.g., compute the mean), and then combine the results into a single output, such as a table or a new DataFrame. This paradigm was famously described by Hadley Wickham in a 2011 paper.

pandas provides methods and grouping operations that are very efficient (vectorized) for split-apply-combine operations.

As an example, let’s say that we wanted to compare the average GDP for different regions of Europe, divided as northern, southern, eastern, and western. To do this, we first have to create lists defining the countries belonging to each of these regions:

northern = ['Denmark', 'Finland', 'Iceland', 'Norway', 'Sweden']
southern = ['Greece', 'Italy', 'Portugal', 'Spain']
eastern = ['Albania', 'Bosnia and Herzegovina', 'Bulgaria', 'Croatia', 
            'Czech Republic', 'Hungary', 'Montenegro', 'Poland', 'Romania', 
            'Serbia', 'Slovak Republic', 'Slovenia']
western = ['Austria',  'Belgium', 'France', 'Germany', 'Ireland', 
            'Netherlands', 'Switzerland', 'United Kingdom']

Next we can make a new column simply by using .loc[] with the rows specified by one of the lists we just defined, a column name that doesn’t already exist (in this case, we’ll call it “region”), then assigning a region label to that combination of rows and column. We need to do this separately for each region. Note that when we first create the new column (“region”), pandas fills it with NaN values in any rows that were not defined by the assignment. For example, in the code below, the first line will create the column “region”, and fill it with “northern” for any row in the northern list, and NaN to every other row.

df.loc[northern, 'region'] = 'northern'
df.loc[southern, 'region'] = 'southern'
df.loc[eastern, 'region'] = 'eastern'
df.loc[western, 'region'] = 'western'
df.loc[northern, 'region'] = 'northern'
df.loc[southern, 'region'] = 'southern'
df.loc[eastern, 'region'] = 'eastern'
df.loc[western, 'region'] = 'western'

Let’s look at the last two columns of df to see the region column we created:

df.iloc[:, -2:]
gdpPercap_2007 region
country
Albania 5937.029526 eastern
Austria 36126.492700 western
Belgium 33692.605080 western
Bosnia and Herzegovina 7446.298803 eastern
Bulgaria 10680.792820 eastern
Croatia 14619.222720 eastern
Czech Republic 22833.308510 eastern
Denmark 35278.418740 northern
Finland 33207.084400 northern
France 30470.016700 western
Germany 32170.374420 western
Greece 27538.411880 southern
Hungary 18008.944440 eastern
Iceland 36180.789190 northern
Ireland 40675.996350 western
Italy 28569.719700 southern
Montenegro 9253.896111 eastern
Netherlands 36797.933320 western
Norway 49357.190170 northern
Poland 15389.924680 eastern
Portugal 20509.647770 southern
Romania 10808.475610 eastern
Serbia 9786.534714 eastern
Slovak Republic 18678.314350 eastern
Slovenia 25768.257590 eastern
Spain 28821.063700 southern
Sweden 33859.748350 northern
Switzerland 37506.419070 western
Turkey 8458.276384 NaN
United Kingdom 33203.261280 western

Split#

Now we can use this “region” column to split the data into groups, using a pandas method called .groupby()

grouped_countries = df.groupby('region')

Note that this step doesn’t create a new DataFrame, it creates a special kind of pandas object that points to a grouping in the original DataFrame:

grouped_countries = df.groupby('region')

Apply#

Now that we have split the data, we can apply a function separately to each group. Here we’ll compute the mean GDP for each region, for each year:

mean_gdp_by_region = grouped_countries.mean()

Combine#

The combine step actually occurred with the apply step above — the result is automatically combined into a table of mean values organized by region. But since our apply step (.mean()) saved the result to a variable, we can view the resulting table as the output of the combine step:

mean_gdp_by_region
gdpPercap_1952 gdpPercap_1957 gdpPercap_1962 gdpPercap_1967 gdpPercap_1972 gdpPercap_1977 gdpPercap_1982 gdpPercap_1987 gdpPercap_1992 gdpPercap_1997 gdpPercap_2002 gdpPercap_2007
region
eastern 3580.884612 4519.684445 5611.534264 6911.825724 8465.695242 10007.911765 10900.209963 11375.852107 8250.514199 9395.008143 10864.013821 14100.916656
northern 8401.571825 9890.985483 11817.031712 14359.783322 17164.045376 19570.072280 22091.364430 25661.659678 26008.794966 29627.830970 33111.979754 37576.646170
southern 3841.112208 4876.082568 6170.642960 8222.632153 11163.775519 12965.228980 14371.479317 16033.013775 18591.368087 20377.263280 23822.183125 26359.710762
western 8439.962345 10434.519899 12191.949119 14232.125299 17359.111144 19305.504057 20693.785153 22798.712417 25344.134586 27914.214806 31703.386229 35080.387365

Chaining#

In Python, chaining refers to combining a number of operations in one command, using a sequence of methods. We can perform the above split-apply-combine procedure in a single step as follows. Note that because we don’t assign the output to a variable name, it is displayed as output but not saved.

df.groupby('region').mean()
gdpPercap_1952 gdpPercap_1957 gdpPercap_1962 gdpPercap_1967 gdpPercap_1972 gdpPercap_1977 gdpPercap_1982 gdpPercap_1987 gdpPercap_1992 gdpPercap_1997 gdpPercap_2002 gdpPercap_2007
region
eastern 3580.884612 4519.684445 5611.534264 6911.825724 8465.695242 10007.911765 10900.209963 11375.852107 8250.514199 9395.008143 10864.013821 14100.916656
northern 8401.571825 9890.985483 11817.031712 14359.783322 17164.045376 19570.072280 22091.364430 25661.659678 26008.794966 29627.830970 33111.979754 37576.646170
southern 3841.112208 4876.082568 6170.642960 8222.632153 11163.775519 12965.228980 14371.479317 16033.013775 18591.368087 20377.263280 23822.183125 26359.710762
western 8439.962345 10434.519899 12191.949119 14232.125299 17359.111144 19305.504057 20693.785153 22798.712417 25344.134586 27914.214806 31703.386229 35080.387365

Exercises#

Selecting Individual Values#

Write an expression to find the Per Capita GDP of Serbia in 2007.

Extent of Slicing#

  1. Do the two statements below produce the same output? (Hint: you might want to use the .head() method to remind yourself of the structure of the DataFrame)

  2. Based on this, what rule governs what is included (or not) in numerical slices and named slices in Pandas?

print(df.iloc[0:2, 0:2])
print(df.loc['Albania':'Belgium', 'gdpPercap_1952':'gdpPercap_1962'])

Reconstructing Data#

Explain what each line in the following short program does: what is in df1, df2, etc.?

df1 = pd.read_csv('data/gapminder_all.csv', index_col='country')
df2 = df1[df1['continent'] == 'Americas']
df3 = df2.drop('Puerto Rico')
df4 = df3.drop('continent', axis = 1)
df4.to_csv('result.csv')

Selecting Indices#

Explain in simple terms what idxmin and idxmax do. When would you use these methods?

data = pd.read_csv('data/gapminder_gdp_europe.csv', index_col='country')
data.idxmin()
data.idxmax()

Practice with Selection#

From the previous exercise, the Gapminder GDP data for Europe should be loaded in as data. Using this DataFrame, write an expression to select each of the following:

  • GDP per capita for all countries in 1982

  • GDP per capita for Denmark for all years

  • GDP per capita for all countries for years after 1985

Note that pandas is smart enough to recognize the number at the end of the column label and does not give you an error, although no column named gdpPercap_1985 actually exists. This is useful if new columns are added to the CSV file later.

  • GDP per capita for each country in 2007 as a multiple of GDP per capita for that country in 1952


Summary of Key Points:#

  • pandas DataFrames are a powerful way of storing and working with tabular (row/column) data

  • pandas columns and rows can have names

  • pandas row names are called indexes which are numeric by default, but can be given other labels

  • Use the .iloc[] method with a DataFrame to select values by integer location, using [row, column] format

  • Use the .loc[] method with a DataFrame to select rows and/or columns, using named slices

  • Use : on its own to mean all columns or all rows

  • Result of slicing can be used in further operations

  • Use comparisons to select data based on value

  • Select values or NaN using a Boolean mask

  • use split-apply-combine to derive analytics from groupings within a DataFrame


This lesson is adapted from the Software Carpentry Plotting and Programming in Python workshop.