Looping Over Data Files

Looping Over Data Files#

Watch a walk-through of this lesson on YouTube
Download all the Jupyter notebooks and other files you need to complete the lessons in this chapter (Chapter 3)
Turn off the GitHub Copilot AI assistant so you can focus on learning Python using your HI (human intelligence). Click the Deactivate Copilot button in the bottom right of VS Code, if it is currently activated.

Questions:#

How can I efficiently read in many data sets from different files?
How can I combine data from different files into one pandas DataFrame?

Learning Objectives:#

Be able to write “globbing” expressions that match sets of files
Use glob to create lists of files
Write for loops to perform operations on many files
Write list comprehensions to perform operations on many files
combine pandas DataFrames

Use a `for` loop to process files given a list of their names#

We can use a for loop to read in a set of data files, and do a thing for each one. In this case, we’ll print the minimum value in each file:

import pandas as pd

data_files = ['data/gapminder_gdp_africa.csv', 'data/gapminder_gdp_asia.csv']

for filename in data_files:
    data = pd.read_csv(filename, index_col='country')
    print(filename, data.min())

data/gapminder_gdp_africa.csv gdpPercap_1952    298.846212
gdpPercap_1957    335.997115
gdpPercap_1962    355.203227
gdpPercap_1967    412.977514
gdpPercap_1972    464.099504
gdpPercap_1977    502.319733
gdpPercap_1982    462.211415
gdpPercap_1987    389.876185
gdpPercap_1992    410.896824
gdpPercap_1997    312.188423
gdpPercap_2002    241.165876
gdpPercap_2007    277.551859
dtype: float64
data/gapminder_gdp_asia.csv gdpPercap_1952    331.0
gdpPercap_1957    350.0
gdpPercap_1962    388.0
gdpPercap_1967    349.0
gdpPercap_1972    357.0
gdpPercap_1977    371.0
gdpPercap_1982    424.0
gdpPercap_1987    385.0
gdpPercap_1992    347.0
gdpPercap_1997    415.0
gdpPercap_2002    611.0
gdpPercap_2007    944.0
dtype: float64

Use `glob.glob` to find sets of files whose names match a pattern.#

In Unix, the term globbing means matching a set of files with a pattern.
The most common patterns are:
- * meaning match zero or more characters
- ? meaning match exactly one character
Python’s standard library contains the glob module to provide pattern matching functionality
The glob module contains a function also called glob to match file patterns
E.g., glob.glob('*.txt') matches all files in the current directory whose names end with .txt.
Result is a list of strings.

import glob
print('all csv files in data directory:', glob.glob('data/*.csv'))

all csv files in data directory: ['data/gapminder_gdp_americas.csv', 'data/gapminder_gdp_europe.csv', 'data/gapminder_all.csv', 'data/gapminder_gdp_oceania.csv', 'data/gapminder_gdp_africa.csv', 'data/s2.csv', 'data/s3.csv', 'data/s1.csv', 'data/gapminder_life_expectancy_years.csv', 'data/gapminder_gdp_asia.csv']

Use `glob` and `for` to process batches of files.#

It’s good practice to name your files systematically. As you’ve learned, Python is very precise about things like capitalization, so if your file names are inconsistent (e.g., Gapminder_Europe.csv, gapminder_americas.csv, gapminder_Oceania.csv), then it is harder to write code with glob that works correctly.

For the Gapminder data, fortunately the file names are quite systematic and consistent (as are the names of the columns inside each file), so we can use the following to read in each one and print the minimum GDP from 1952:

for filename in glob.glob('data/gapminder_gdp*.csv'):
    data = pd.read_csv(filename)
    print(filename, data['gdpPercap_1952'].min())

data/gapminder_gdp_americas.csv 1397.717137
data/gapminder_gdp_europe.csv 973.5331948
data/gapminder_gdp_oceania.csv 10039.59564
data/gapminder_gdp_africa.csv 298.8462121
data/gapminder_gdp_asia.csv 331.0

Appending Files to a Single DataFrame#

Often we don’t just want to open a file and extract a small bit of data (such as the minimum value in examples above). Rather, we might want to open a set of related data files and combine them into one big DataFrame. For example, in psychology and neuroscience most experiments involve multiple participants. For each participant, when we run the experiment we get a data file. To analyze the data across participants, we would want to read in all participants’ data files and combined them into one DataFrame.

pandas has a few methods that allow us to combine DataFrames, including:

We will focus here on the first one. concat stands for “concatenate” which essentially means combine files by “stacking” them. That is, start with one DataFrame, and add a new data frame to the bottom of it, creating additional rows. In what we’ll do here, we assume that all of the data files we’re reading have the same columns. For example, in the Gapminder GDP data sets, each file has a column for country plus a series of columns for GDP in different years — and the same years are in the columns of all the data sets.

Reading in data from multiple experimental participants#

Let’s say we have data from an experiment in which we ran three human participants (sometimes called “people”) on different days. For each participant, we have a data file. The columns in all the files are the same, because the files were generated by a computer program that ran the experiment.

We give the participants anonymized ID codes to protect their privacy, and allow for a simple, systematic naming convention for the files. the first participant’s data is saved in a file called s1.csv, the second’s in s2.csv, etc..

We can glob the data folder in which the files are stored, to find all the CSV files whose names start with an s followed by a single character, followed by .csv. We’ll save the result to a list that we can loop through later:

filenames = glob.glob('data/s?.csv')

Next, we create an empty list that we will store the DataFrames from each participant in. It will end up being a list of DataFrames (remember, lists can contain just about any other Python data type), and once we have read in all the data, we will combine them into one DataFrame. This is a trick that’s important to use in pandas. The reason has to do with how pandas combines DataFrames and stores them in memory. In simple terms, each time we concatenate DataFrames, pandas does a lot of internal checking to make sure there are no errors. Doing this checking once, when combining many DataFrames, is far more efficient (and thus faster) than doing it many times. Likewise, when a DataFrame is created, an appropriate amount of memory space is allocated for it on the computer. Each time we append additional data, we have to create a new, bigger block of memory. Allocating new blocks of memory, many times, takes more time than just doing it once.

df_list = []

Finally, use a for loop to read the files in. This will cycle through the items in the filenames list; each time through the loop, filename has the value of the current file name, and we use the list append() method to add the data from that file to df_list:

for f in filenames:
    df_list.append(pd.read_csv(f))

When we view the contents of the list, we see each data set, with its two columns (with headers saying what they are), and commas separating the list entries, as is typical of a list.

df_list

[  participantID  trial        RT
          s2      1  0.433094
          s2      2  0.392526
          s2      3  0.396831
          s2      4  0.417988
          s2      5  0.371810
          s2      6  0.659228
          s2      7  0.411051
          s2      8  0.409580
          s2      9  0.486828
          s2     10  0.468912,
   participantID  trial        RT
          s3      1  0.322099
          s3      2  0.396106
          s3      3  0.384297
          s3      4  0.364524
          s3      5  0.454075
          s3      6  0.494156
          s3      7  0.492787
          s3      8  0.506836
          s3      9  0.340722
          s3     10  0.704491,
   participantID  trial        RT
          s1      1  0.508971
          s1      2  0.389858
          s1      3  0.404175
          s1      4  0.269520
          s1      5  0.437765
          s1      6  0.368142
          s1      7  0.400544
          s1      8  0.335198
          s1      9  0.341722
          s1     10  0.439583]

Reading multiple files using list comprehension#

While the for loop above works fine, there is an alternative way to do this, using list comprehension. Recall that list comprehensions are basically just a compact version of a for loop, but they have some advantages:

they are more pythonic: they only require one line of code, whereas the for loop above required two
they are more efficient: list comprehensions actually run faster. This may not be an issue in the small examples here, but can make a big difference when working with real, large data sets

df_list = [pd.read_csv(f) for f in filenames]
df_list

[  participantID  trial        RT
          s2      1  0.433094
          s2      2  0.392526
          s2      3  0.396831
          s2      4  0.417988
          s2      5  0.371810
          s2      6  0.659228
          s2      7  0.411051
          s2      8  0.409580
          s2      9  0.486828
          s2     10  0.468912,
   participantID  trial        RT
          s3      1  0.322099
          s3      2  0.396106
          s3      3  0.384297
          s3      4  0.364524
          s3      5  0.454075
          s3      6  0.494156
          s3      7  0.492787
          s3      8  0.506836
          s3      9  0.340722
          s3     10  0.704491,
   participantID  trial        RT
          s1      1  0.508971
          s1      2  0.389858
          s1      3  0.404175
          s1      4  0.269520
          s1      5  0.437765
          s1      6  0.368142
          s1      7  0.400544
          s1      8  0.335198
          s1      9  0.341722
          s1     10  0.439583]

Combining DataFrames#

At this point, we’ve read each input file in and stored it as a DataFrame, but we have a list of three distinct DataFrames. In most cases, we’ll want to combine these in some way. Having built our list of DataFrames through reading a set of files, we can combine them into a single DataFrame using the pandas .concat() method:

df = pd.concat(df_list)

# Confirm this worked by viewing a random sample of rows
df.sample(8)

	participantID	trial	RT
7	s1	8	0.335198
6	s2	7	0.411051
9	s2	10	0.468912
5	s3	6	0.494156
4	s2	5	0.371810
4	s1	5	0.437765
0	s2	1	0.433094
9	s3	10	0.704491

Setting the index column#

Recall that row labels in pands are called indexes. We can convert any column to an index using the .set_index() method. For this data, an appropriate index is the participant ID, which is in the column Participant. Note that we need to assign the result of the .set_index() operation back to df for the change to be stored:

df = df.set_index('participantID')
df.sample(8)

	trial	RT
participantID
s3	2	0.396106
s3	8	0.506836
s1	1	0.508971
s1	9	0.341722
s2	2	0.392526
s3	7	0.492787
s1	10	0.439583
s1	4	0.269520

Exercises#

Determining Matches#

Which of these files is not matched by the expression glob.glob('data/*as*.csv')?

data/gapminder_gdp_africa.csv
data/gapminder_gdp_americas.csv
data/gapminder_gdp_asia.csv

Click the button to reveal the answer

1 is not matched. The string as occurs in both americas and asia

Globbing files#

Fill in the blanks so that the code below does the following:

Find all of the CSV files in the data folder that contain GDP data
Read these files in using a for loop
Concatenate the data files into a single pandas DataFrame
Print out the first 10 lines of the final combined DataFrame

Note that not all the Gapminder data files contain GDP data, but the file names will indicate which ones do.

import glob
import pandas as pd

data_files = ___(___)

df_list = []

for ____ in ____:
    df_list.___(pd.read_csv(f))
    
df = ___

df.___(10)

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[11], line 4
      1 import glob
      2 import pandas as pd
----> 4 data_files = ___(___)
      6 df_list = []
      8 for ____ in ____:

TypeError: 'list' object is not callable

Click the button to reveal the solution

import glob
import pandas as pd

data_files = glob.glob('data/*gdp*.csv')

df_list = []

for f in data_files:
    df_list.append(pd.read_csv(f))
    
df = pd.concat(df_list)

df.head(10)

List comprehension#

Now rewrite the code above to use list comprehension rather than a for loop, and only two lines of code total (excluding the import commands and viewing the first 10 lines of the result).

Click the button to reveal the solution

Done in two lines of code:

df_list = [pd.read_csv(f) for f in glob.glob('data/*gdp*.csv')]
    
df = pd.concat(df_list)

df.head(10)

For an even bigger challenge, see if you can reduce the code to a single line!

Click the button to reveal the solution

Done in one line of code:

df = pd.concat([pd.read_csv(f) for f in glob.glob('data/*gdp*.csv')])
    
df.head(10)

Summary of Key Points:#

Use a for loop to process files given a list of their names
Use glob.glob to find sets of files whose names match a pattern
List comprehension can replace a for loop, resulting in more compact and efficient code
Naming your files in a consistent manner is just as important in data science, as writing the code to read them
When you want to combine multiple files into one pandas DataFrame, read each one in to a list of DataFrames, then run pd.concat() only once

This lesson is adapted from the Software Carpentry Plotting and Programming in Python workshop.

Looping Over Data Files

Contents

Looping Over Data Files#

Questions:#

Learning Objectives:#

Use a for loop to process files given a list of their names#

Use glob.glob to find sets of files whose names match a pattern.#

Use glob and for to process batches of files.#

Appending Files to a Single DataFrame#

Reading in data from multiple experimental participants#

Reading multiple files using list comprehension#

Combining DataFrames#

Setting the index column#

Exercises#

Determining Matches#

Globbing files#

List comprehension#

Summary of Key Points:#

Use a `for` loop to process files given a list of their names#

Use `glob.glob` to find sets of files whose names match a pattern.#

Use `glob` and `for` to process batches of files.#