pandas DataFrames

pandas DataFrames#

Watch a walk-through of this lesson on YouTube
Download all the Jupyter notebooks and other files you need to complete the lessons in this chapter (Chapter 3)
Turn off the GitHub Copilot AI assistant so you can focus on learning Python using your HI (human intelligence). Click the Deactivate Copilot button in the bottom right of VS Code, if it is currently activated.

Questions:#

How do I read in data from a file?
How can I work with data in tabular format (tables)?
How can I do basic descriptive statistics on tabular data?

Learning Objectives:#

Select individual values from a Pandas dataframe
Select entire rows or entire columns from a dataframe
Select a subset of both rows and columns from a dataframe in a single operation
Select a subset of a dataframe by a single Boolean criterion
Obtain descriptive statistics for subsets of data within a table
Use the split-apply-combine paradigm to work with data

What is pandas?#

Bad news first: there are no cute, black-and-white bears here. pandas (whose official name starts with a lower-case “p”) is a Python library for working with data in a tabular format, such as is found in file formats like CSV, Microsoft Excel, and Google Sheets. Unlike Excel or Sheets, pandas is not a point-and click graphical interface for working with these files — everything is done through Python code. But compared to other formats for working with data that we have seen in this workshop, such as lists and dictionaries, pandas may seem more familiar, and it definitely lends itself more naturally to large data sets. Indeed, pandas’ mission statement is, “…to be the fundamental high-level building block for doing practical, real world data analysis in Python”.

The primary units of pandas data storage you will work with are DataFrames (essentially, tables of data organized as rows and columns). DataFrames are actually collections of pandas Series objects, which can be thought of as individual rows or columns (or vectors, or 1D arrays).

Among the things that make pandas so attractive are the powerful interface to access individual records of the table, proper handling of missing values, and relational-databases operations between DataFrames. As well, pandas functions and methods are written to work intuitively and efficiently with data organized in tables. Most operations are vectorized, which means that they will automatically apply to all values in a DataFrame or Series without the need to write for loops to execute the same operation on a set of cells.

pandas is built on top of the NumPy library. Although we haven’t discussed NumPy in this workshop, it is a powerful and widely used Python library for working with numerical data. However, it’s worth noting for your future reference that most of the methods defined for NumPy Arrays also apply to Pandas Series/DataFrames.

About Python Libraries#

pandas is an example of a Python library. A library is a collection of files (called modules) that contains functions for use by other programs. Libraries provide ways of extending Python’s functionality in different ways. They may also contain data values (e.g., numerical constants), entire sample data sets, and other things. A library’s contents are supposed to be related, although there’s no actual way to enforce that.

The Python standard library is an extensive suite of modules that comes with Python itself. Everything we’ve done so far in this workshop has been part of the standard library. Many additional libraries are available; CoCalc has a large number of extra libraries already installed.

To use a library in a particular Jupyter notebook or other Python program, we must import it using the import statement, like this:

import pandas

Once a library is imported, we can use functions and methods from it. But, for functions we have to tell Python that the function can be found in a particular library we imported. For example, pandas has a function to import data from CSV (comma-separated value) files, called read_csv. To run this command, we would need to type:

pandas.read_csv()

Since some package names are long, and adding the name to every function can result in a lot of typing, Python also allows us to assign an alias — a shorter name — to a library when we import it. For example, the convention for pandas is to give it the alias pd like this:

import pandas as pd

Then to read a CSV file we could use:

pd.read_csv()

In the cell below, import pandas with the alias pd:

import pandas as pd

Importing data with pandas#

As noted, we can read a CSV file and use it to create a pandas DataFrame, with the funciton pd.read_csv(). CSV is a text format used for storing tabular data, in which each line of the file corresponds to a row in the table, and columns are separated with commas (“CSV” stands for “comma-separated values”). Often the first row of a CSV file will be the header, containing labels for each column.

The Gapminder data is in CSV format, so let’s load in one of the Gapminder datasets with the command below. Note that when we read in a DataFrame, we need to assign it to a variable name so that we can reference it later. A convention when working with pandas is to call the DataFrame df. This works fine if you only have one DataFrame to work with, although if you are working with multiple DataFrames it is a good idea to give them more meaningful names.

The Gapminder data are stored in a subfolder called data, so as the argument to pd.read_csv() below we give the folder name folled by a slash, then the file name:

df = pd.read_csv('data/gapminder_gdp_europe.csv')

We can view the contents of the DataFrame df by simply typing its name and running the cell. Note that, unlike most of the examples we’ve used in previous lessons, we don’t use the print() function. Although it works, the result is not nicely formatted the way the output is if we just use the name of the data frame.

That is, run this command: df — not print(df) — in the cell below.

df

	country	gdpPercap_1952	gdpPercap_1957	gdpPercap_1962	gdpPercap_1967	gdpPercap_1972	gdpPercap_1977	gdpPercap_1982	gdpPercap_1987	gdpPercap_1992	gdpPercap_1997	gdpPercap_2002	gdpPercap_2007
0	Albania	1601.056136	1942.284244	2312.888958	2760.196931	3313.422188	3533.003910	3630.880722	3738.932735	2497.437901	3193.054604	4604.211737	5937.029526
1	Austria	6137.076492	8842.598030	10750.721110	12834.602400	16661.625600	19749.422300	21597.083620	23687.826070	27042.018680	29095.920660	32417.607690	36126.492700
2	Belgium	8343.105127	9714.960623	10991.206760	13149.041190	16672.143560	19117.974480	20979.845890	22525.563080	25575.570690	27561.196630	30485.883750	33692.605080
3	Bosnia and Herzegovina	973.533195	1353.989176	1709.683679	2172.352423	2860.169750	3528.481305	4126.613157	4314.114757	2546.781445	4766.355904	6018.975239	7446.298803
4	Bulgaria	2444.286648	3008.670727	4254.337839	5577.002800	6597.494398	7612.240438	8224.191647	8239.854824	6302.623438	5970.388760	7696.777725	10680.792820
5	Croatia	3119.236520	4338.231617	5477.890018	6960.297861	9164.090127	11305.385170	13221.821840	13822.583940	8447.794873	9875.604515	11628.388950	14619.222720
6	Czech Republic	6876.140250	8256.343918	10136.867130	11399.444890	13108.453600	14800.160620	15377.228550	16310.443400	14297.021220	16048.514240	17596.210220	22833.308510
7	Denmark	9692.385245	11099.659350	13583.313510	15937.211230	18866.207210	20422.901500	21688.040480	25116.175810	26406.739850	29804.345670	32166.500060	35278.418740
8	Finland	6424.519071	7545.415386	9371.842561	10921.636260	14358.875900	15605.422830	18533.157610	21141.012230	20647.164990	23723.950200	28204.590570	33207.084400
9	France	7029.809327	8662.834898	10560.485530	12999.917660	16107.191710	18292.635140	20293.897460	22066.442140	24703.796150	25889.784870	28926.032340	30470.016700
10	Germany	7144.114393	10187.826650	12902.462910	14745.625610	18016.180270	20512.921230	22031.532740	24639.185660	26505.303170	27788.884160	30035.801980	32170.374420
11	Greece	3530.690067	4916.299889	6017.190733	8513.097016	12724.829570	14195.524280	15268.420890	16120.528390	17541.496340	18747.698140	22514.254800	27538.411880
12	Hungary	5263.673816	6040.180011	7550.359877	9326.644670	10168.656110	11674.837370	12545.990660	12986.479980	10535.628550	11712.776800	14843.935560	18008.944440
13	Iceland	7267.688428	9244.001412	10350.159060	13319.895680	15798.063620	19654.962470	23269.607500	26923.206280	25144.392010	28061.099660	31163.201960	36180.789190
14	Ireland	5210.280328	5599.077872	6631.597314	7655.568963	9530.772896	11150.981130	12618.321410	13872.866520	17558.815550	24521.947130	34077.049390	40675.996350
15	Italy	4931.404155	6248.656232	8243.582340	10022.401310	12269.273780	14255.984750	16537.483500	19207.234820	22013.644860	24675.024460	27968.098170	28569.719700
16	Montenegro	2647.585601	3682.259903	4649.593785	5907.850937	7778.414017	9595.929905	11222.587620	11732.510170	7003.339037	6465.613349	6557.194282	9253.896111
17	Netherlands	8941.571858	11276.193440	12790.849560	15363.251360	18794.745670	21209.059200	21399.460460	23651.323610	26790.949610	30246.130630	33724.757780	36797.933320
18	Norway	10095.421720	11653.973040	13450.401510	16361.876470	18965.055510	23311.349390	26298.635310	31540.974800	33965.661150	41283.164330	44683.975250	49357.190170
19	Poland	4029.329699	4734.253019	5338.752143	6557.152776	8006.506993	9508.141454	8451.531004	9082.351172	7738.881247	10159.583680	12002.239080	15389.924680
20	Portugal	3068.319867	3774.571743	4727.954889	6361.517993	9022.247417	10172.485720	11753.842910	13039.308760	16207.266630	17641.031560	19970.907870	20509.647770
21	Romania	3144.613186	3943.370225	4734.997586	6470.866545	8011.414402	9356.397240	9605.314053	9696.273295	6598.409903	7346.547557	7885.360081	10808.475610
22	Serbia	3581.459448	4981.090891	6289.629157	7991.707066	10522.067490	12980.669560	15181.092700	15870.878510	9325.068238	7914.320304	7236.075251	9786.534714
23	Slovak Republic	5074.659104	6093.262980	7481.107598	8412.902397	9674.167626	10922.664040	11348.545850	12037.267580	9498.467723	12126.230650	13638.778370	18678.314350
24	Slovenia	4215.041741	5862.276629	7402.303395	9405.489397	12383.486200	15277.030170	17866.721750	18678.534920	14214.716810	17161.107350	20660.019360	25768.257590
25	Spain	3834.034742	4564.802410	5693.843879	7993.512294	10638.751310	13236.921170	13926.169970	15764.983130	18603.064520	20445.298960	24835.471660	28821.063700
26	Sweden	8527.844662	9911.878226	12329.441920	15258.296970	17832.024640	18855.725210	20667.381250	23586.929270	23880.016830	25266.594990	29341.630930	33859.748350
27	Switzerland	14734.232750	17909.489730	20431.092700	22966.144320	27195.113040	26982.290520	28397.715120	30281.704590	31871.530300	32135.323010	34480.957710	37506.419070
28	Turkey	1969.100980	2218.754257	2322.869908	2826.356387	3450.696380	4269.122326	4241.356344	5089.043686	5678.348271	6601.429915	6508.085718	8458.276384
29	United Kingdom	9979.508487	11283.177950	12477.177070	14142.850890	15895.116410	17428.748460	18232.424520	21664.787670	22705.092540	26074.531360	29478.999190	33203.261280

You’ll see that the rows are numbered in boldface, starting with 0 as is the norm in Python. This boldfaced, leftmost column is called the index of the DataFrame, and provides one way of accessing data by rows. Across the top, you’ll see that the column labels are also in boldface. pandas is pretty smart about automatically detecting when the first row of a CSV file contains header information (column names).

Heads or Tails?#

We might want to “peek” at the DataFrame without printing out the entire thing, especially if it’s big. We can see the first 5 rows of a DataFrame with the .head() method:

df.head()

	country	gdpPercap_1952	gdpPercap_1957	gdpPercap_1962	gdpPercap_1967	gdpPercap_1972	gdpPercap_1977	gdpPercap_1982	gdpPercap_1987	gdpPercap_1992	gdpPercap_1997	gdpPercap_2002	gdpPercap_2007
0	Albania	1601.056136	1942.284244	2312.888958	2760.196931	3313.422188	3533.003910	3630.880722	3738.932735	2497.437901	3193.054604	4604.211737	5937.029526
1	Austria	6137.076492	8842.598030	10750.721110	12834.602400	16661.625600	19749.422300	21597.083620	23687.826070	27042.018680	29095.920660	32417.607690	36126.492700
2	Belgium	8343.105127	9714.960623	10991.206760	13149.041190	16672.143560	19117.974480	20979.845890	22525.563080	25575.570690	27561.196630	30485.883750	33692.605080
3	Bosnia and Herzegovina	973.533195	1353.989176	1709.683679	2172.352423	2860.169750	3528.481305	4126.613157	4314.114757	2546.781445	4766.355904	6018.975239	7446.298803
4	Bulgaria	2444.286648	3008.670727	4254.337839	5577.002800	6597.494398	7612.240438	8224.191647	8239.854824	6302.623438	5970.388760	7696.777725	10680.792820

…or the last 5 rows with .tail():

df.tail()

	country	gdpPercap_1952	gdpPercap_1957	gdpPercap_1962	gdpPercap_1967	gdpPercap_1972	gdpPercap_1977	gdpPercap_1982	gdpPercap_1987	gdpPercap_1992	gdpPercap_1997	gdpPercap_2002	gdpPercap_2007
25	Spain	3834.034742	4564.802410	5693.843879	7993.512294	10638.75131	13236.921170	13926.169970	15764.983130	18603.064520	20445.298960	24835.471660	28821.063700
26	Sweden	8527.844662	9911.878226	12329.441920	15258.296970	17832.02464	18855.725210	20667.381250	23586.929270	23880.016830	25266.594990	29341.630930	33859.748350
27	Switzerland	14734.232750	17909.489730	20431.092700	22966.144320	27195.11304	26982.290520	28397.715120	30281.704590	31871.530300	32135.323010	34480.957710	37506.419070
28	Turkey	1969.100980	2218.754257	2322.869908	2826.356387	3450.69638	4269.122326	4241.356344	5089.043686	5678.348271	6601.429915	6508.085718	8458.276384
29	United Kingdom	9979.508487	11283.177950	12477.177070	14142.850890	15895.11641	17428.748460	18232.424520	21664.787670	22705.092540	26074.531360	29478.999190	33203.261280

We can also see a random sample of rows from the DataFrame with .sample(), giving it a numerical argument to indicate the number of rows we want to see:

df.sample(10)

	country	gdpPercap_1952	gdpPercap_1957	gdpPercap_1962	gdpPercap_1967	gdpPercap_1972	gdpPercap_1977	gdpPercap_1982	gdpPercap_1987	gdpPercap_1992	gdpPercap_1997	gdpPercap_2002	gdpPercap_2007
27	Switzerland	14734.232750	17909.489730	20431.092700	22966.144320	27195.113040	26982.29052	28397.715120	30281.704590	31871.530300	32135.323010	34480.957710	37506.41907
26	Sweden	8527.844662	9911.878226	12329.441920	15258.296970	17832.024640	18855.72521	20667.381250	23586.929270	23880.016830	25266.594990	29341.630930	33859.74835
17	Netherlands	8941.571858	11276.193440	12790.849560	15363.251360	18794.745670	21209.05920	21399.460460	23651.323610	26790.949610	30246.130630	33724.757780	36797.93332
25	Spain	3834.034742	4564.802410	5693.843879	7993.512294	10638.751310	13236.92117	13926.169970	15764.983130	18603.064520	20445.298960	24835.471660	28821.06370
24	Slovenia	4215.041741	5862.276629	7402.303395	9405.489397	12383.486200	15277.03017	17866.721750	18678.534920	14214.716810	17161.107350	20660.019360	25768.25759
5	Croatia	3119.236520	4338.231617	5477.890018	6960.297861	9164.090127	11305.38517	13221.821840	13822.583940	8447.794873	9875.604515	11628.388950	14619.22272
11	Greece	3530.690067	4916.299889	6017.190733	8513.097016	12724.829570	14195.52428	15268.420890	16120.528390	17541.496340	18747.698140	22514.254800	27538.41188
21	Romania	3144.613186	3943.370225	4734.997586	6470.866545	8011.414402	9356.39724	9605.314053	9696.273295	6598.409903	7346.547557	7885.360081	10808.47561
13	Iceland	7267.688428	9244.001412	10350.159060	13319.895680	15798.063620	19654.96247	23269.607500	26923.206280	25144.392010	28061.099660	31163.201960	36180.78919
12	Hungary	5263.673816	6040.180011	7550.359877	9326.644670	10168.656110	11674.83737	12545.990660	12986.479980	10535.628550	11712.776800	14843.935560	18008.94444

Note that the .head() and .tail() methods also optionally take a numerical argument, if you want to view a different number of rows from the default of 5.

Accessing values in a DataFrame#

One thing we often want to do is access a single cell in a DataFrame, or a range of cells. Each cell is uniquely defined by a combination of its row and column locations.

Select a column using `[]`#

If we want to select an entire column of a pandas DataFrame, we just give the name of the DataFrame followed by the column name in square brackets:

df['gdpPercap_1992']

    2497.437901
   27042.018680
   25575.570690
    2546.781445
    6302.623438
    8447.794873
   14297.021220
   26406.739850
   20647.164990
   24703.796150
  26505.303170
  17541.496340
  10535.628550
  25144.392010
  17558.815550
  22013.644860
   7003.339037
  26790.949610
  33965.661150
   7738.881247
  16207.266630
   6598.409903
   9325.068238
   9498.467723
  14214.716810
  18603.064520
  23880.016830
  31871.530300
   5678.348271
  22705.092540
Name: gdpPercap_1992, dtype: float64

Note that if we ask for a single column the result is a pandas Series, but if we ask for two or more columns, the result is a DataFrame. Pay close attention to the syntax below — if we’re asking for more than one column, we need to provide a list of columns inside the square brackets (so there are two sets of nested sqaure brackets in the code below):

df[['gdpPercap_1982', 'gdpPercap_1992']]

	gdpPercap_1982	gdpPercap_1992
0	3630.880722	2497.437901
1	21597.083620	27042.018680
2	20979.845890	25575.570690
3	4126.613157	2546.781445
4	8224.191647	6302.623438
5	13221.821840	8447.794873
6	15377.228550	14297.021220
7	21688.040480	26406.739850
8	18533.157610	20647.164990
9	20293.897460	24703.796150
10	22031.532740	26505.303170
11	15268.420890	17541.496340
12	12545.990660	10535.628550
13	23269.607500	25144.392010
14	12618.321410	17558.815550
15	16537.483500	22013.644860
16	11222.587620	7003.339037
17	21399.460460	26790.949610
18	26298.635310	33965.661150
19	8451.531004	7738.881247
20	11753.842910	16207.266630
21	9605.314053	6598.409903
22	15181.092700	9325.068238
23	11348.545850	9498.467723
24	17866.721750	14214.716810
25	13926.169970	18603.064520
26	20667.381250	23880.016830
27	28397.715120	31871.530300
28	4241.356344	5678.348271
29	18232.424520	22705.092540

Numerical indexing using `.iloc[]`#

Often we don’t want to access an entire column, however, but just specific rows within a column (or range of columns). pandas provides two ways of accessing cell locations. One is using the numerical positions in the DataFrame, using the convention of [row, column] — with [0, 0] being the top left cell in the DataFrame. So for a pandas DataFrame with 3 rows and 3 columns, the indices of each cell are as shown:

	col 0	col 1	col 2	col 3
0	[0, 0]	[0, 1]	[0, 2]	[0, 3]
1	[1, 0]	[1, 1]	[1, 2]	[1, 3]
2	[2, 0]	[2, 1]	[2, 2]	[2, 3]
3	[3, 0]	[3, 1]	[3, 2]	[3, 3]

Numerical indexing of DataFrames is done with the .iloc[] method. For example, to access the GDP value for Austria in 1952 — which is located in the second row, third column of our current DataFrame, we would use:

df.iloc[1, 2]

8842.59803

Label-based indexing using `.loc[]`#

The other way to access a location in a DataFrame is by its index and column labels, using the .loc[] method. As noted earlier, in the DataFrame we imported, the indexes are currently numbers, which were created automatically when we imported the data. The .loc[] method doesn’t work with numerical indexes (that’s what iloc is for — and you can’t mix, say, a numerical row index with a column label), but in the data set we imported, the first column of this CSV file is actually meant to be its index: while all other columns are data values (GDP, in type float), the first column identifies the country with which each row of data is associated.

pandas has a method for setting an index column, .set_index(), where the argument (in the parentheses) would be the name of the column to use as the index. So here we want to run:

df = df.set_index('country')

Note that we need to assign the result of this operation back to df (using df = ), otherwise the change will not actually modify df.

In the cell below, use the .set_index() method to set the index of df to country, and then view the DataFrame again to see how it has changed.

df = df.set_index('country')

Alternatively, if we knew which column we wanted to use as the index before loading in the data file, we could have included the argument index_col= in the pd.read_csv() command:

df = pd.read_csv('data/gapminder_gdp_europe.csv', index_col='country')

Now that we have defined the index, we can access the 1952 GDP value for Austria by its index and column names:

df.loc['Austria', 'gdpPercap_1952']

6137.076492

Use `:` on its own to mean all columns or all rows.#

Using Python’s familiar slicing notatio (which we’ve previously used for strings and lists), we can use : with .iloc[] or .loc[], to specify a range in a DataFrame.

For example, to see the GDP of Albania for every year (column) in the DataFrame, we would use:

df.loc["Albania", :]

gdpPercap_1952    1601.056136
gdpPercap_1957    1942.284244
gdpPercap_1962    2312.888958
gdpPercap_1967    2760.196931
gdpPercap_1972    3313.422188
gdpPercap_1977    3533.003910
gdpPercap_1982    3630.880722
gdpPercap_1987    3738.932735
gdpPercap_1992    2497.437901
gdpPercap_1997    3193.054604
gdpPercap_2002    4604.211737
gdpPercap_2007    5937.029526
Name: Albania, dtype: float64

Likewise, we could see the GDP for every country (row) in the year 1957 with:

df.loc[:, 'gdpPercap_1957']

country
Albania                    1942.284244
Austria                    8842.598030
Belgium                    9714.960623
Bosnia and Herzegovina     1353.989176
Bulgaria                   3008.670727
Croatia                    4338.231617
Czech Republic             8256.343918
Denmark                   11099.659350
Finland                    7545.415386
France                     8662.834898
Germany                   10187.826650
Greece                     4916.299889
Hungary                    6040.180011
Iceland                    9244.001412
Ireland                    5599.077872
Italy                      6248.656232
Montenegro                 3682.259903
Netherlands               11276.193440
Norway                    11653.973040
Poland                     4734.253019
Portugal                   3774.571743
Romania                    3943.370225
Serbia                     4981.090891
Slovak Republic            6093.262980
Slovenia                   5862.276629
Spain                      4564.802410
Sweden                     9911.878226
Switzerland               17909.489730
Turkey                     2218.754257
United Kingdom            11283.177950
Name: gdpPercap_1957, dtype: float64

You can also just specify the row index; if you don’t specify anything for the columns, pandas assumes you want all columns:

df.loc["Albania"]

gdpPercap_1952    1601.056136
gdpPercap_1957    1942.284244
gdpPercap_1962    2312.888958
gdpPercap_1967    2760.196931
gdpPercap_1972    3313.422188
gdpPercap_1977    3533.003910
gdpPercap_1982    3630.880722
gdpPercap_1987    3738.932735
gdpPercap_1992    2497.437901
gdpPercap_1997    3193.054604
gdpPercap_2002    4604.211737
gdpPercap_2007    5937.029526
Name: Albania, dtype: float64

However, since the syntax for .iloc[] and .loc[] is [rows, columns], you cannot omit a row index; you need to use : if you want all rows.

Slicing works on DataFrames#

Slicing using numerical indices works similarly for DataFrames as we previously saw for strings and lists, for example, the following code will print the third through fifth rows of the DataFrame, and the fifth through eighth columns (remember, Python indexing starts at 0, and slicing does not include the “end” index):

df.iloc[2:5, 4:8]

	gdpPercap_1972	gdpPercap_1977	gdpPercap_1982	gdpPercap_1987
country
Belgium	16672.143560	19117.974480	20979.845890	22525.563080
Bosnia and Herzegovina	2860.169750	3528.481305	4126.613157	4314.114757
Bulgaria	6597.494398	7612.240438	8224.191647	8239.854824

The code below will print from the sixth to second-last row of the DataFrame, and from the ninth to the last column:

df.iloc[5:-1, 8:]

	gdpPercap_1992	gdpPercap_1997	gdpPercap_2002	gdpPercap_2007
country
Croatia	8447.794873	9875.604515	11628.388950	14619.222720
Czech Republic	14297.021220	16048.514240	17596.210220	22833.308510
Denmark	26406.739850	29804.345670	32166.500060	35278.418740
Finland	20647.164990	23723.950200	28204.590570	33207.084400
France	24703.796150	25889.784870	28926.032340	30470.016700
Germany	26505.303170	27788.884160	30035.801980	32170.374420
Greece	17541.496340	18747.698140	22514.254800	27538.411880
Hungary	10535.628550	11712.776800	14843.935560	18008.944440
Iceland	25144.392010	28061.099660	31163.201960	36180.789190
Ireland	17558.815550	24521.947130	34077.049390	40675.996350
Italy	22013.644860	24675.024460	27968.098170	28569.719700
Montenegro	7003.339037	6465.613349	6557.194282	9253.896111
Netherlands	26790.949610	30246.130630	33724.757780	36797.933320
Norway	33965.661150	41283.164330	44683.975250	49357.190170
Poland	7738.881247	10159.583680	12002.239080	15389.924680
Portugal	16207.266630	17641.031560	19970.907870	20509.647770
Romania	6598.409903	7346.547557	7885.360081	10808.475610
Serbia	9325.068238	7914.320304	7236.075251	9786.534714
Slovak Republic	9498.467723	12126.230650	13638.778370	18678.314350
Slovenia	14214.716810	17161.107350	20660.019360	25768.257590
Spain	18603.064520	20445.298960	24835.471660	28821.063700
Sweden	23880.016830	25266.594990	29341.630930	33859.748350
Switzerland	31871.530300	32135.323010	34480.957710	37506.419070
Turkey	5678.348271	6601.429915	6508.085718	8458.276384

Note however, that when using label-based indexing with .loc[], pandas’ slicing behavior is a bit different. Specifically, the output includes the last item in the range, whereas numerical indexing with .iloc[] does not.

So, considering that the first three rows of the DataFrame correspond to the countries Albania, Austria, and Belgium, and that columns 6 and 7 are for the years 1982 and 1987 respectively, compare the output of:

df.iloc[0:2, 6:7]

	gdpPercap_1982
country
Albania	3630.880722
Austria	21597.083620

with:

df.loc['Albania':'Belgium', 'gdpPercap_1982':'gdpPercap_1988']

	gdpPercap_1982	gdpPercap_1987
country
Albania	3630.880722	3738.932735
Austria	21597.083620	23687.826070
Belgium	20979.845890	22525.563080

The “inclusive” label-based indexing with .loc[] is fairly intuitive, but it’s important to remember that it works differently from numerical indexing.

Use lists to select non-contiguous sections of a DataFrame#

While slicing can be very useful, sometimes we might want to extract values that aren’t next to each other in a DataFrame. For example, what if we only want values from two specific years (1992 and 2002), for Scandinavian countries (Denmark, Finland, Norway, and Sweden)? Nether these years nor countries are in adjacent columns/rows in the DataFrame. With .loc[], we can use lists, rather than ranges separated by :, as selectors:

df.loc[['Denmark', 'Finland', 'Norway', 'Sweden'], ['gdpPercap_1992', 'gdpPercap_2002']]

	gdpPercap_1992	gdpPercap_2002
country
Denmark	26406.73985	32166.50006
Finland	20647.16499	28204.59057
Norway	33965.66115	44683.97525
Sweden	23880.01683	29341.63093

We can equivalently write the command over several lines to make it a bit easier to read:

df.loc[['Denmark', 'Finland', 'Norway', 'Sweden'], 
       ['gdpPercap_1992', 'gdpPercap_2002']
      ]

	gdpPercap_1992	gdpPercap_2002
country
Denmark	26406.73985	32166.50006
Finland	20647.16499	28204.59057
Norway	33965.66115	44683.97525
Sweden	23880.01683	29341.63093

We could also define those lists as variables, and pass the variables to .loc[]. This might be useful if you were going to use the lists more than once, as well as for clarity:

scand_countries = ['Denmark', 'Finland', 'Iceland', 'Norway', 'Sweden']
years = ['gdpPercap_1992', 'gdpPercap_2002']
df.loc[scand_countries, years]

	gdpPercap_1992	gdpPercap_2002
country
Denmark	26406.73985	32166.50006
Finland	20647.16499	28204.59057
Iceland	25144.39201	31163.20196
Norway	33965.66115	44683.97525
Sweden	23880.01683	29341.63093

We can take this a step further, and assign the output of a .loc[] selection like this to a new variable name. This makes a copy of the selected data, stored in a new DataFrame (or Series, if we only select one row or column) with its own name. This allows us to later reference and use that selection.

scand_data = df.loc[scand_countries, years]

It’s easy to do simple math and statistics in DataFrames#

We prevoiusly learned about methods to get simple statistical values out of a Python list, like .max(), and .min(). pandas includes these and many more methods as well. For example, we can view the mean GDP for Italy across all years (columns) with:

df.loc['Italy'].mean()

16245.20900641667

Or the largest GDP in 1977 with:

df.loc[:, 'gdpPercap_1977'].max()

26982.29052

Another useful method is .describe(), which prints out a range of descriptive statistics for the range of data you specify. Without any slicing it provides information for each column:

df.describe()

	gdpPercap_1952	gdpPercap_1957	gdpPercap_1962	gdpPercap_1967	gdpPercap_1972	gdpPercap_1977	gdpPercap_1982	gdpPercap_1987	gdpPercap_1992	gdpPercap_1997	gdpPercap_2002	gdpPercap_2007
count	30.000000	30.000000	30.000000	30.000000	30.000000	30.000000	30.000000	30.000000	30.000000	30.000000	30.000000	30.000000
mean	5661.057435	6963.012816	8365.486814	10143.823757	12479.575246	14283.979110	15617.896551	17214.310727	17061.568084	19076.781802	21711.732422	25054.481636
std	3114.060493	3677.950146	4199.193906	4724.983889	5509.691411	5874.464896	6453.234827	7482.957960	9109.804361	10065.457716	11197.355517	11800.339811
min	973.533195	1353.989176	1709.683679	2172.352423	2860.169750	3528.481305	3630.880722	3738.932735	2497.437901	3193.054604	4604.211737	5937.029526
25%	3241.132406	4394.874315	5373.536612	6657.939047	9057.708094	10360.030300	11449.870115	12274.570680	8667.113214	9946.599306	11721.851483	14811.898210
50%	5142.469716	6066.721495	7515.733737	9366.067033	12326.379990	14225.754515	15322.824720	16215.485895	17550.155945	19596.498550	23674.863230	28054.065790
75%	7236.794919	9597.220820	10931.085348	13277.182058	16523.017127	19052.412163	20901.729730	23321.587723	25034.243045	27189.530312	30373.363307	33817.962533
max	14734.232750	17909.489730	20431.092700	22966.144320	27195.113040	26982.290520	28397.715120	31540.974800	33965.661150	41283.164330	44683.975250	49357.190170

Mini-Exercise#

In the cell below, use the scand_countries and years variables to view descriptive statistics for all Scandinavian countries in each year.

Evaluate cells based on conditions#

pandas allows an easy way to identify values in a DataFrame that meet a certain condition, using operators like <, >, and ==. For example, let’s see which countries in a list had a GDP over 10,000 in 1962 and 1992. The result is reported in Booleans (True/False) for each cell.

countries = ['France', 'Germany', 'Italy', 'Spain', 'United Kingdom']
df.loc[countries, ['gdpPercap_1962', 'gdpPercap_1992']] > 10000

	gdpPercap_1962	gdpPercap_1992
country
France	True	True
Germany	True	True
Italy	False	True
Spain	False	True
United Kingdom	True	True

Select values or NaN using a Boolean mask.#

A DataFrame full of Booleans is sometimes called a mask because of how it can be used. A mask removes values that are not True, and replaces them with NaN — a special Python value representing “not a number”. This can be useful because pandas ignores NaN values when doing computations.

We create a mask by assigning the output of a conditional statement to a variable name:

mask = scand_data > 30000

Then we can apply the mask to the DataFrame to get only the values that meet the criterion:

scand_data[mask]

	gdpPercap_1992	gdpPercap_2002
country
Denmark	NaN	32166.50006
Finland	NaN	NaN
Iceland	NaN	31163.20196
Norway	33965.66115	44683.97525
Sweden	NaN	NaN

As an example of how this might be used, the steps above would now allow us to find the lowest GDP value in each year, that was above 30,000:

scand_data[mask].min()

gdpPercap_1992    33965.66115
gdpPercap_2002    31163.20196
dtype: float64

Split-Apply-Combine#

A common task in data science is to split data into meaningful subgroups, apply an operation to each subgroup (e.g., compute the mean), and then combine the results into a single output, such as a table or a new DataFrame. This paradigm was famously described by Hadley Wickham in a 2011 paper.

pandas provides methods and grouping operations that are very efficient (vectorized) for split-apply-combine operations.

As an example, let’s say that we wanted to compare the average GDP for different regions of Europe, divided as northern, southern, eastern, and western. To do this, we first have to create lists defining the countries belonging to each of these regions:

northern = ['Denmark', 'Finland', 'Iceland', 'Norway', 'Sweden']
southern = ['Greece', 'Italy', 'Portugal', 'Spain']
eastern = ['Albania', 'Bosnia and Herzegovina', 'Bulgaria', 'Croatia', 
            'Czech Republic', 'Hungary', 'Montenegro', 'Poland', 'Romania', 
            'Serbia', 'Slovak Republic', 'Slovenia']
western = ['Austria',  'Belgium', 'France', 'Germany', 'Ireland', 
            'Netherlands', 'Switzerland', 'United Kingdom']

Next we can make a new column simply by using .loc[] with the rows specified by one of the lists we just defined, a column name that doesn’t already exist (in this case, we’ll call it “region”), then assigning a region label to that combination of rows and column. We need to do this separately for each region. Note that when we first create the new column (“region”), pandas fills it with NaN values in any rows that were not defined by the assignment. For example, in the code below, the first line will create the column “region”, and fill it with “northern” for any row in the northern list, and NaN to every other row.

df.loc[northern, 'region'] = 'northern'
df.loc[southern, 'region'] = 'southern'
df.loc[eastern, 'region'] = 'eastern'
df.loc[western, 'region'] = 'western'

df.loc[northern, 'region'] = 'northern'
df.loc[southern, 'region'] = 'southern'
df.loc[eastern, 'region'] = 'eastern'
df.loc[western, 'region'] = 'western'

Let’s look at the last two columns of df to see the region column we created:

df.iloc[:, -2:]

	gdpPercap_2007	region
country
Albania	5937.029526	eastern
Austria	36126.492700	western
Belgium	33692.605080	western
Bosnia and Herzegovina	7446.298803	eastern
Bulgaria	10680.792820	eastern
Croatia	14619.222720	eastern
Czech Republic	22833.308510	eastern
Denmark	35278.418740	northern
Finland	33207.084400	northern
France	30470.016700	western
Germany	32170.374420	western
Greece	27538.411880	southern
Hungary	18008.944440	eastern
Iceland	36180.789190	northern
Ireland	40675.996350	western
Italy	28569.719700	southern
Montenegro	9253.896111	eastern
Netherlands	36797.933320	western
Norway	49357.190170	northern
Poland	15389.924680	eastern
Portugal	20509.647770	southern
Romania	10808.475610	eastern
Serbia	9786.534714	eastern
Slovak Republic	18678.314350	eastern
Slovenia	25768.257590	eastern
Spain	28821.063700	southern
Sweden	33859.748350	northern
Switzerland	37506.419070	western
Turkey	8458.276384	NaN
United Kingdom	33203.261280	western

Split#

Now we can use this “region” column to split the data into groups, using a pandas method called .groupby()

grouped_countries = df.groupby('region')

Note that this step doesn’t create a new DataFrame, it creates a special kind of pandas object that points to a grouping in the original DataFrame:

grouped_countries = df.groupby('region')

Apply#

Now that we have split the data, we can apply a function separately to each group. Here we’ll compute the mean GDP for each region, for each year:

mean_gdp_by_region = grouped_countries.mean()

Combine#

The combine step actually occurred with the apply step above — the result is automatically combined into a table of mean values organized by region. But since our apply step (.mean()) saved the result to a variable, we can view the resulting table as the output of the combine step:

mean_gdp_by_region

	gdpPercap_1952	gdpPercap_1957	gdpPercap_1962	gdpPercap_1967	gdpPercap_1972	gdpPercap_1977	gdpPercap_1982	gdpPercap_1987	gdpPercap_1992	gdpPercap_1997	gdpPercap_2002	gdpPercap_2007
region
eastern	3580.884612	4519.684445	5611.534264	6911.825724	8465.695242	10007.911765	10900.209963	11375.852107	8250.514199	9395.008143	10864.013821	14100.916656
northern	8401.571825	9890.985483	11817.031712	14359.783322	17164.045376	19570.072280	22091.364430	25661.659678	26008.794966	29627.830970	33111.979754	37576.646170
southern	3841.112208	4876.082568	6170.642960	8222.632153	11163.775519	12965.228980	14371.479317	16033.013775	18591.368087	20377.263280	23822.183125	26359.710762
western	8439.962345	10434.519899	12191.949119	14232.125299	17359.111144	19305.504057	20693.785153	22798.712417	25344.134586	27914.214806	31703.386229	35080.387365

Chaining#

In Python, chaining refers to combining a number of operations in one command, using a sequence of methods. We can perform the above split-apply-combine procedure in a single step as follows. Note that because we don’t assign the output to a variable name, it is displayed as output but not saved.

df.groupby('region').mean()

	gdpPercap_1952	gdpPercap_1957	gdpPercap_1962	gdpPercap_1967	gdpPercap_1972	gdpPercap_1977	gdpPercap_1982	gdpPercap_1987	gdpPercap_1992	gdpPercap_1997	gdpPercap_2002	gdpPercap_2007
region
eastern	3580.884612	4519.684445	5611.534264	6911.825724	8465.695242	10007.911765	10900.209963	11375.852107	8250.514199	9395.008143	10864.013821	14100.916656
northern	8401.571825	9890.985483	11817.031712	14359.783322	17164.045376	19570.072280	22091.364430	25661.659678	26008.794966	29627.830970	33111.979754	37576.646170
southern	3841.112208	4876.082568	6170.642960	8222.632153	11163.775519	12965.228980	14371.479317	16033.013775	18591.368087	20377.263280	23822.183125	26359.710762
western	8439.962345	10434.519899	12191.949119	14232.125299	17359.111144	19305.504057	20693.785153	22798.712417	25344.134586	27914.214806	31703.386229	35080.387365

Exercises#

Selecting Individual Values#

Write an expression to find the Per Capita GDP of Serbia in 2007.

Click the button to reveal!

print(df.loc[‘Serbia’, ‘gdpPercap_2007’])

Extent of Slicing#

Do the two statements below produce the same output? (Hint: you might want to use the .head() method to remind yourself of the structure of the DataFrame)
Based on this, what rule governs what is included (or not) in numerical slices and named slices in Pandas?

print(df.iloc[0:2, 0:2])
print(df.loc['Albania':'Belgium', 'gdpPercap_1952':'gdpPercap_1962'])

Click the button to reveal the solution

Solution

No, they do not produce the same output! The output of the first statement is:

        gdpPercap_1952  gdpPercap_1957
country                                
Albania     1601.056136     1942.284244
Austria     6137.076492     8842.598030

The second statement gives:

        gdpPercap_1952  gdpPercap_1957  gdpPercap_1962
country                                                
Albania     1601.056136     1942.284244     2312.888958
Austria     6137.076492     8842.598030    10750.721110
Belgium     8343.105127     9714.960623    10991.206760

Clearly, the second statement produces an additional column and an additional row compared to the first statement. What conclusion can we draw? We see that a numerical slice, 0:2, omits the final index (i.e. index 2) in the range provided, while a named slice, 'gdpPercap_1952':'gdpPercap_1962', includes the final element.

Reconstructing Data#

Explain what each line in the following short program does: what is in df1, df2, etc.?

df1 = pd.read_csv('data/gapminder_all.csv', index_col='country')
df2 = df1[df1['continent'] == 'Americas']
df3 = df2.drop('Puerto Rico')
df4 = df3.drop('continent', axis = 1)
df4.to_csv('result.csv')

Click the button to reveal the solution

Let’s go through this piece of code line by line.

df1 = pd.read_csv('data/gapminder_all.csv', index_col='country')

This line loads the dataset containing the GDP data from all countries into a dataframe called df1. The index_col='country' parameter selects which column to use as the row labels in the dataframe.

df2 = df1[df1['continent'] == 'Americas']

This line makes a selection: only those rows of df1 for which the ‘continent’ column matches ‘Americas’ are extracted. Notice how the Boolean expression inside the brackets, df1['continent'] == 'Americas', is used to select only those rows where the expression is true. Try printing this expression! Can you print also its individual True/False elements? (hint: first assign the expression to a variable)

df3 = df2.drop('Puerto Rico')

As the syntax suggests, this line drops the row from df2 where the label is ‘Puerto Rico’. The resulting dataframe df3 has one row less than the original dataframe df2.

df4 = df3.drop('continent', axis = 1)

Again we apply the drop function, but in this case we are dropping not a row but a whole column. To accomplish this, we need to specify also the axis parameter (we want to drop the second column which has index 1).

df4.to_csv('result.csv')

The final step is to write the data that we have been working on to a csv file. Pandas makes this easy with the to_csv() function. The only required argument to the function is the filename. Note that the file will be written in the directory from which you started the Jupyter or Python session.

Selecting Indices#

Explain in simple terms what idxmin and idxmax do. When would you use these methods?

data = pd.read_csv('data/gapminder_gdp_europe.csv', index_col='country')
data.idxmin()

data.idxmax()

Click the button to reveal the solution

For each column in data, idxmin will return the index value corresponding to each column’s minimum; idxmax will do accordingly the same for each column’s maximum value.

You can use these functions whenever you want to get the row index of the minimum/maximum value and not the actual minimum/maximum value.

Practice with Selection#

From the previous exercise, the Gapminder GDP data for Europe should be loaded in as data. Using this DataFrame, write an expression to select each of the following:

GDP per capita for all countries in 1982

Click the button to reveal the solution

data['gdpPercap_1982']

GDP per capita for Denmark for all years

Click the button to reveal the solution

data.loc['Denmark',:]

or

data.loc['Denmark',:'gdpPercap_2007']

GDP per capita for all countries for years after 1985

Click the button to reveal the solution

data.loc[:,'gdpPercap_1985':]

Note that pandas is smart enough to recognize the number at the end of the column label and does not give you an error, although no column named gdpPercap_1985 actually exists. This is useful if new columns are added to the CSV file later.

GDP per capita for each country in 2007 as a multiple of GDP per capita for that country in 1952

Click the button to reveal the solution

data[‘gdpPercap_2007’]/data[‘gdpPercap_1952’]

Summary of Key Points:#

pandas DataFrames are a powerful way of storing and working with tabular (row/column) data
pandas columns and rows can have names
pandas row names are called indexes which are numeric by default, but can be given other labels
Use the .iloc[] method with a DataFrame to select values by integer location, using [row, column] format
Use the .loc[] method with a DataFrame to select rows and/or columns, using named slices
Use : on its own to mean all columns or all rows
Result of slicing can be used in further operations
Use comparisons to select data based on value
Select values or NaN using a Boolean mask
use split-apply-combine to derive analytics from groupings within a DataFrame

This lesson is adapted from the Software Carpentry Plotting and Programming in Python workshop.

pandas DataFrames

Contents

pandas DataFrames#

Questions:#

Learning Objectives:#

What is pandas?#

About Python Libraries#

Importing data with pandas#

Heads or Tails?#

Accessing values in a DataFrame#

Select a column using []#

Numerical indexing using .iloc[]#

Label-based indexing using .loc[]#

Use : on its own to mean all columns or all rows.#

Slicing works on DataFrames#

Use lists to select non-contiguous sections of a DataFrame#

It’s easy to do simple math and statistics in DataFrames#

Mini-Exercise#

Evaluate cells based on conditions#

Select values or NaN using a Boolean mask.#

Split-Apply-Combine#

Split#

Apply#

Combine#

Chaining#

Exercises#

Selecting Individual Values#

Extent of Slicing#

Solution

Reconstructing Data#

Selecting Indices#

Practice with Selection#

Summary of Key Points:#

Select a column using `[]`#

Numerical indexing using `.iloc[]`#

Label-based indexing using `.loc[]`#

Use `:` on its own to mean all columns or all rows.#