Data Files and pandas DataFrames#

In the previous section we tried some very basic things with Copilot. Now let’s try a more complex example. In the previous chapter, we worked with the Gapminder dataset, which is a CSV file. We used the pandas library to read the CSV file into a DataFrame, and then used the DataFrame to do some analysis. Let’s do that again, but this time we’ll use Copilot to help us.

First, let’s prompt Copilot to read the CSV file containing the data for Europe. I’ve deliberately started with a very minimal prompt, to help illustrate how the wording of the prompt can affect the results.

# read the gapminder data file for Europe
gapminder_europe = pd.read_csv('gapminder_gdp_europe.csv', index_col='country')
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[1], line 2
      1 # read the gapminder data file for Europe
----> 2 gapminder_europe = pd.read_csv('gapminder_gdp_europe.csv', index_col='country')

NameError: name 'pd' is not defined

That generated a scary looking error message! This is human error, not AI error. But before we sort that out, let’s marvel at what Copilot did there. We gave it a pretty poor prompt (“the gapminder data for europe”), and it guessed that the file was named gapminder_europe.csv. It also guessed that we wanted to use the pandas library (pd) to read the CSV file into a DataFrame. That’s pretty impressive!

Fortunately, the error is pretty self-explanatory. If we want to work with pandas, we always need to first import the pandas library. By convention we give it the alias pd. So let’s add that to our prompt, and see what happens. Note that Copilot may generate code one line at a time, so you may have to accept the suggestion for the first line of code before it generates the next line of code.

# import the pandas library as pd and then read the gapminder data file for Europe
import pandas as pd
gapminder_europe = pd.read_csv('gapminder_gdp_europe.csv', index_col='country')
---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
Cell In[2], line 3
      1 # import the pandas library as pd and then read the gapminder data file for Europe
      2 import pandas as pd
----> 3 gapminder_europe = pd.read_csv('gapminder_gdp_europe.csv', index_col='country')

File ~/miniforge3/envs/neural_data_science/lib/python3.12/site-packages/pandas/io/parsers/readers.py:1026, in read_csv(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, date_format, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, encoding_errors, dialect, on_bad_lines, delim_whitespace, low_memory, memory_map, float_precision, storage_options, dtype_backend)
   1013 kwds_defaults = _refine_defaults_read(
   1014     dialect,
   1015     delimiter,
   (...)
   1022     dtype_backend=dtype_backend,
   1023 )
   1024 kwds.update(kwds_defaults)
-> 1026 return _read(filepath_or_buffer, kwds)

File ~/miniforge3/envs/neural_data_science/lib/python3.12/site-packages/pandas/io/parsers/readers.py:620, in _read(filepath_or_buffer, kwds)
    617 _validate_names(kwds.get("names", None))
    619 # Create the parser.
--> 620 parser = TextFileReader(filepath_or_buffer, **kwds)
    622 if chunksize or iterator:
    623     return parser

File ~/miniforge3/envs/neural_data_science/lib/python3.12/site-packages/pandas/io/parsers/readers.py:1620, in TextFileReader.__init__(self, f, engine, **kwds)
   1617     self.options["has_index_names"] = kwds["has_index_names"]
   1619 self.handles: IOHandles | None = None
-> 1620 self._engine = self._make_engine(f, self.engine)

File ~/miniforge3/envs/neural_data_science/lib/python3.12/site-packages/pandas/io/parsers/readers.py:1880, in TextFileReader._make_engine(self, f, engine)
   1878     if "b" not in mode:
   1879         mode += "b"
-> 1880 self.handles = get_handle(
   1881     f,
   1882     mode,
   1883     encoding=self.options.get("encoding", None),
   1884     compression=self.options.get("compression", None),
   1885     memory_map=self.options.get("memory_map", False),
   1886     is_text=is_text,
   1887     errors=self.options.get("encoding_errors", "strict"),
   1888     storage_options=self.options.get("storage_options", None),
   1889 )
   1890 assert self.handles is not None
   1891 f = self.handles.handle

File ~/miniforge3/envs/neural_data_science/lib/python3.12/site-packages/pandas/io/common.py:873, in get_handle(path_or_buf, mode, encoding, compression, memory_map, is_text, errors, storage_options)
    868 elif isinstance(handle, str):
    869     # Check whether the filename is to be opened in binary mode.
    870     # Binary mode does not support 'encoding' and 'newline'.
    871     if ioargs.encoding and "b" not in ioargs.mode:
    872         # Encoding
--> 873         handle = open(
    874             handle,
    875             ioargs.mode,
    876             encoding=ioargs.encoding,
    877             errors=errors,
    878             newline="",
    879         )
    880     else:
    881         # Binary mode
    882         handle = open(handle, ioargs.mode)

FileNotFoundError: [Errno 2] No such file or directory: 'gapminder_gdp_europe.csv'

More scary error messages! This time, the last line is the most informative. It’s telling us that it can’t find the file we asked it to read. That’s because we haven’t told it where to look for the file. When we just list the name of a file, pd.read_csv() assumes that the file is in the same directory as the notebook. But in this case, the file is in a subdirectory called data. So let’s add that to our prompt, and see what happens.

# import the pandas library as pd and then read the gapminder data file for Europe. 
# The file is in a subfolder called data
import pandas as pd
gapminder_europe = pd.read_csv('data/gapminder_gdp_europe.csv', index_col='country')

Tip

In many cases, Copilot might not have enough information to correctly guess the name of a file you want to load. In such cases, you could include the name of the file, as well as its path, in your prompt.

Now we’re getting somewhere! Let’s view the first few lines of the DataFrame to make sure it looks right.

# show me the first few lines of the gapminder europe data
print(gapminder_europe.head())
                        gdpPercap_1952  gdpPercap_1957  gdpPercap_1962  \
country                                                                  
Albania                    1601.056136     1942.284244     2312.888958   
Austria                    6137.076492     8842.598030    10750.721110   
Belgium                    8343.105127     9714.960623    10991.206760   
Bosnia and Herzegovina      973.533195     1353.989176     1709.683679   
Bulgaria                   2444.286648     3008.670727     4254.337839   

                        gdpPercap_1967  gdpPercap_1972  gdpPercap_1977  \
country                                                                  
Albania                    2760.196931     3313.422188     3533.003910   
Austria                   12834.602400    16661.625600    19749.422300   
Belgium                   13149.041190    16672.143560    19117.974480   
Bosnia and Herzegovina     2172.352423     2860.169750     3528.481305   
Bulgaria                   5577.002800     6597.494398     7612.240438   

                        gdpPercap_1982  gdpPercap_1987  gdpPercap_1992  \
country                                                                  
Albania                    3630.880722     3738.932735     2497.437901   
Austria                   21597.083620    23687.826070    27042.018680   
Belgium                   20979.845890    22525.563080    25575.570690   
Bosnia and Herzegovina     4126.613157     4314.114757     2546.781445   
Bulgaria                   8224.191647     8239.854824     6302.623438   

                        gdpPercap_1997  gdpPercap_2002  gdpPercap_2007  
country                                                                 
Albania                    3193.054604     4604.211737     5937.029526  
Austria                   29095.920660    32417.607690    36126.492700  
Belgium                   27561.196630    30485.883750    33692.605080  
Bosnia and Herzegovina     4766.355904     6018.975239     7446.298803  
Bulgaria                   5970.388760     7696.777725    10680.792820  

Again, we used a pretty minimal prompt to get the output. That is, we didn’t need to use the exact name of the DataFrame (gapminder_europe), nor the exact name of the method (head()).

The code that we got in developing this lesson was print(gapminder_europe.head()). However, the output of passing a DataFrame to print() is not formatted as nicely as if we ask the DataFrame to print itself. So let’s try modifying our prompt to get nicely formatted output:

# show me the first few lines of the gapminder europe data
# format the table nicely
print(gapminder_europe.head().to_string())
                        gdpPercap_1952  gdpPercap_1957  gdpPercap_1962  gdpPercap_1967  gdpPercap_1972  gdpPercap_1977  gdpPercap_1982  gdpPercap_1987  gdpPercap_1992  gdpPercap_1997  gdpPercap_2002  gdpPercap_2007
country                                                                                                                                                                                                               
Albania                    1601.056136     1942.284244     2312.888958     2760.196931     3313.422188     3533.003910     3630.880722     3738.932735     2497.437901     3193.054604     4604.211737     5937.029526
Austria                    6137.076492     8842.598030    10750.721110    12834.602400    16661.625600    19749.422300    21597.083620    23687.826070    27042.018680    29095.920660    32417.607690    36126.492700
Belgium                    8343.105127     9714.960623    10991.206760    13149.041190    16672.143560    19117.974480    20979.845890    22525.563080    25575.570690    27561.196630    30485.883750    33692.605080
Bosnia and Herzegovina      973.533195     1353.989176     1709.683679     2172.352423     2860.169750     3528.481305     4126.613157     4314.114757     2546.781445     4766.355904     6018.975239     7446.298803
Bulgaria                   2444.286648     3008.670727     4254.337839     5577.002800     6597.494398     7612.240438     8224.191647     8239.854824     6302.623438     5970.388760     7696.777725    10680.792820

Still not what we wanted (athough at least each row in the DataFrame is now one row in the output) — and Copilot didn’t have any alternative suggestions. This is where prompt engineering becomes important. Let’s try a different prompt.

# show me the first few lines of the gapminder europe data
# the output should be a table
gapminder_europe.head()
gdpPercap_1952 gdpPercap_1957 gdpPercap_1962 gdpPercap_1967 gdpPercap_1972 gdpPercap_1977 gdpPercap_1982 gdpPercap_1987 gdpPercap_1992 gdpPercap_1997 gdpPercap_2002 gdpPercap_2007
country
Albania 1601.056136 1942.284244 2312.888958 2760.196931 3313.422188 3533.003910 3630.880722 3738.932735 2497.437901 3193.054604 4604.211737 5937.029526
Austria 6137.076492 8842.598030 10750.721110 12834.602400 16661.625600 19749.422300 21597.083620 23687.826070 27042.018680 29095.920660 32417.607690 36126.492700
Belgium 8343.105127 9714.960623 10991.206760 13149.041190 16672.143560 19117.974480 20979.845890 22525.563080 25575.570690 27561.196630 30485.883750 33692.605080
Bosnia and Herzegovina 973.533195 1353.989176 1709.683679 2172.352423 2860.169750 3528.481305 4126.613157 4314.114757 2546.781445 4766.355904 6018.975239 7446.298803
Bulgaria 2444.286648 3008.670727 4254.337839 5577.002800 6597.494398 7612.240438 8224.191647 8239.854824 6302.623438 5970.388760 7696.777725 10680.792820

That did it! We just needed to be a bit more specific about what we wanted to do, by asking for the output as a table.

It’s worth noting that others working through this lesson got gapminder_europe.head() to the original prompt above, but we kept this series of outputs to illustrate how the wording of the prompt can affect the results.

Split-Apply-Combine#

In the previous chapter, we learned about the split-apply-combine strategy for data analysis. That’s when we split the data into different groups (e.g., according to a particular variable), apply an operation to each group separately, and then combine the results back into a single table. Let’s try that again, but this time we’ll use Copilot to help us.

Let’s replicate the split-apply-combine analysis we did in the previous chapter, where we calculate the average GDP per capita for each of four regions in Europe: Northern Europe, Southern Europe, Eastern Europe, and Western Europe. We’ll start with a very high-level prompt, and see if we get what we want.

Note that in the code cell below, I only typed the first three-line prompt. Copilot generated the rest of the code, including a series of prompts (comments) to help me fill in the details. In order to get it to generate additional prompts and code, I had to hit enter twice after the first prompt. Apparently Copilot is insistent that your code be nicely formatted, with empty lines between different sections of code that do different steps. This is a good thing, but if you forget to hit Enter twice, you may get nothing an think Copilot isn’t working. Just hit Enter twice, and you should get the prompts you need. Otherwise, try going back to the end of your prompt and hitting Enter again.

# use split-apply-combine to calculate the average gdp per capita 
# for each of four regions in Europe: Northern Europe, Southern Europe, 
# Eastern Europe, and Western Europe

# split the data into four regions
regions = ['Northern Europe', 'Southern Europe', 'Eastern Europe', 'Western Europe']
grouped = gapminder_europe.groupby(['group', 'year'])

# apply the mean function to calculate the average gdp per capita for each region
gdp_per_capita = grouped['gdpPercap'].mean()

# combine the results into a new table
gdp_per_capita.head()
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[7], line 7
      1 # use split-apply-combine to calculate the average gdp per capita 
      2 # for each of four regions in Europe: Northern Europe, Southern Europe, 
      3 # Eastern Europe, and Western Europe
      4 
      5 # split the data into four regions
      6 regions = ['Northern Europe', 'Southern Europe', 'Eastern Europe', 'Western Europe']
----> 7 grouped = gapminder_europe.groupby(['group', 'year'])
      9 # apply the mean function to calculate the average gdp per capita for each region
     10 gdp_per_capita = grouped['gdpPercap'].mean()

File ~/miniforge3/envs/neural_data_science/lib/python3.12/site-packages/pandas/core/frame.py:9183, in DataFrame.groupby(self, by, axis, level, as_index, sort, group_keys, observed, dropna)
   9180 if level is None and by is None:
   9181     raise TypeError("You have to supply one of 'by' and 'level'")
-> 9183 return DataFrameGroupBy(
   9184     obj=self,
   9185     keys=by,
   9186     axis=axis,
   9187     level=level,
   9188     as_index=as_index,
   9189     sort=sort,
   9190     group_keys=group_keys,
   9191     observed=observed,
   9192     dropna=dropna,
   9193 )

File ~/miniforge3/envs/neural_data_science/lib/python3.12/site-packages/pandas/core/groupby/groupby.py:1329, in GroupBy.__init__(self, obj, keys, axis, level, grouper, exclusions, selection, as_index, sort, group_keys, observed, dropna)
   1326 self.dropna = dropna
   1328 if grouper is None:
-> 1329     grouper, exclusions, obj = get_grouper(
   1330         obj,
   1331         keys,
   1332         axis=axis,
   1333         level=level,
   1334         sort=sort,
   1335         observed=False if observed is lib.no_default else observed,
   1336         dropna=self.dropna,
   1337     )
   1339 if observed is lib.no_default:
   1340     if any(ping._passed_categorical for ping in grouper.groupings):

File ~/miniforge3/envs/neural_data_science/lib/python3.12/site-packages/pandas/core/groupby/grouper.py:1043, in get_grouper(obj, key, axis, level, sort, observed, validate, dropna)
   1041         in_axis, level, gpr = False, gpr, None
   1042     else:
-> 1043         raise KeyError(gpr)
   1044 elif isinstance(gpr, Grouper) and gpr.key is not None:
   1045     # Add key to exclusions
   1046     exclusions.add(gpr.key)

KeyError: 'group'

Prompt Engineering — Writing Better Prompts#

Can you figure out why this error happened? Ideally, you should be able to do this by re-reading the code the Copilot generated. But, the error message also provides guidance, because it tells us there is an error on line 7 (this information is near the top of the error message), and the error is KeyError: 'group'. That means that gapminder_europe doesn’t have a column named group. Indeed, if we look at line 7 we see it’s attempting to group the data by two columns, year and group. But gapminder_europe doesn’t have a column named group, which is why we got an error.

There are actually a few problems with this code. Not only does group not exist as a column, but it’s not a great name for a column, because the column should indicate region of Europe. So a better name would be region. Another issue is that the .groupby() method was given another variable, year. But our instructions didn’t say anything about year, so we don’t want to group by year. We just want to group by region.

Let’s try fix those issues through better prompt engineering. One thing you’ll encounter frequently is that you give Copilot too big a job, and doing what you’re asking will require a number of intermediate steps. Copilot isn’t great about long-term planning. Remember that LLMs are just advanced versions of the same kind of language model that suggests words when you’re typing on your phone. It’s often good at predicting the next word you want to type, but if you keep just picking the “best” suggestion, you will get a sentence of gibberish more times than not. The same is true with Copilot. If you give it a big job, it will often generate code that doesn’t work, because it’s trying to do too much at once.

So you need to break the job down into smaller steps. In this case, we’re doing split-apply-combine, so let’s try writing our own prompts for each of those conceptual steps. So let’s start with the first step, ‘split’, and see if we can get Copilot to generate the code to group the data by region.

# create a new column in the gapminder_europe data frame called 'region'. 
# Label each country as belonging to 'Northern Europe', 'Southern Europe',
# 'Eastern Europe', or 'Western Europe'

# create a new column in the gapminder_europe data frame called 'region'.
# Label each country as belonging to 'Northern Europe', 'Southern Europe',  
# 'Eastern Europe', or 'Western Europe'
gapminder_europe['region'] = 'Western Europe'

AI may be seeming less magical now. Firstly, it repeated the entire prompt before generating any code. Secondly, it has labelled every country as Western Europe, which is wrong. In the previous chapter, we had to manually create lists for each region label, containing the names of each country in that region. We could do that here to help Copilot along, but hopefully AI is smart enough to know what geographical region each European country is in. This is a reasonable prediction in this case, especially because the open-source Gapminder dataset is widely used in teaching data science, so there should be many versions of this specific example in Copilot’s training set. We just need to figure out how to get our prompting right.

Break the Problem Down into Smaller Steps#

Although our prompt may be readily interpretable by a human brain, in terms of communicating what we want, apparently it’s too complex for Copilot’s AI. If we think about the task we’re asking Copilot to do, there are several sub-tasks. That is, in one prompt we’re asking Copilot to create four different lists. What if, instead, we ask for one list at a time?

It turns out that if we start with the prompt:

# create a list of the countries in Northern Europe

Copilot will generate the code to create a list of the countries in Northern Europe.

northern_europe = ['Denmark', 'Finland', 'Iceland', 'Norway', 'Sweden']

That’s a good start. But we need to do this for each region. So if we hit Enter a couple of times to get a blank line, and then type the prompt:

# create a list of the countries in 

Copilot is smart enough to complete this prompt with Southern Europe, and generate the code to create a list of the countries in Southern Europe.

Even better, if we accept that code and hit Enter two more times, Copilot will actually generate the complete next prompt for us:

# create a list of the countries in Eastern Europe

And if we accept that one, it generates the code for us, and generates the next prompt:

# create a list of the countries in Western Europe

Once we accept that last prompt, and the code it generates, and again hit Enter twice, Copilot really shines. It generates the same prompts we saw above, and the code to create a new column called region, and then starts by labelling all countries as Western Europe. It generated this code earlier, but it didn’t work because we hadn’t first used the right prompts to get the four lists of countries in each region. Copilot then goes on to generate code to change the region labels for each country, based on the lists we created above.

However, when we try to run this code, we get an error:

# create a list of the countries in Northern Europe
northern_europe = ['Denmark', 'Finland', 'Iceland', 'Norway', 'Sweden']

# create a list of the countries in Southern Europe
southern_europe = ['Albania', 'Bosnia and Herzegovina', 'Croatia', 'Greece', 'Italy', 'Malta', 'Montenegro', 'Portugal', 'Serbia', 'Slovenia', 'Spain']

# create a list of the countries in Eastern Europe
eastern_europe = ['Belarus', 'Bulgaria', 'Czech Republic', 'Hungary', 'Poland', 'Romania', 'Slovak Republic']

# create a list of the countries in Western Europe
western_europe = ['Austria', 'Belgium', 'France', 'Germany', 'Ireland', 'Luxembourg', 'Netherlands', 'United Kingdom']

# create a new column in the gapminder_europe data frame called 'region'.
# Label each country as belonging to 'Northern Europe', 'Southern Europe',
# 'Eastern Europe', or 'Western Europe'
gapminder_europe['region'] = 'Western Europe'

# label the countries in Northern Europe as 'Northern Europe'
gapminder_europe.loc[northern_europe, 'region'] = 'Northern Europe'

# label the countries in Southern Europe as 'Southern Europe'
gapminder_europe.loc[southern_europe, 'region'] = 'Southern Europe'

# label the countries in Eastern Europe as 'Eastern Europe'
gapminder_europe.loc[eastern_europe, 'region'] = 'Eastern Europe'
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[9], line 22
     19 gapminder_europe.loc[northern_europe, 'region'] = 'Northern Europe'
     21 # label the countries in Southern Europe as 'Southern Europe'
---> 22 gapminder_europe.loc[southern_europe, 'region'] = 'Southern Europe'
     24 # label the countries in Eastern Europe as 'Eastern Europe'
     25 gapminder_europe.loc[eastern_europe, 'region'] = 'Eastern Europe'

File ~/miniforge3/envs/neural_data_science/lib/python3.12/site-packages/pandas/core/indexing.py:907, in _LocationIndexer.__setitem__(self, key, value)
    905     maybe_callable = com.apply_if_callable(key, self.obj)
    906     key = self._check_deprecated_callable_usage(key, maybe_callable)
--> 907 indexer = self._get_setitem_indexer(key)
    908 self._has_valid_setitem_indexer(key)
    910 iloc = self if self.name == "iloc" else self.obj.iloc

File ~/miniforge3/envs/neural_data_science/lib/python3.12/site-packages/pandas/core/indexing.py:774, in _LocationIndexer._get_setitem_indexer(self, key)
    771 if isinstance(key, tuple):
    772     with suppress(IndexingError):
    773         # suppress "Too many indexers"
--> 774         return self._convert_tuple(key)
    776 if isinstance(key, range):
    777     # GH#45479 test_loc_setitem_range_key
    778     key = list(key)

File ~/miniforge3/envs/neural_data_science/lib/python3.12/site-packages/pandas/core/indexing.py:989, in _LocationIndexer._convert_tuple(self, key)
    985 @final
    986 def _convert_tuple(self, key: tuple) -> tuple:
    987     # Note: we assume _tupleize_axis_indexer has been called, if necessary.
    988     self._validate_key_length(key)
--> 989     keyidx = [self._convert_to_indexer(k, axis=i) for i, k in enumerate(key)]
    990     return tuple(keyidx)

File ~/miniforge3/envs/neural_data_science/lib/python3.12/site-packages/pandas/core/indexing.py:1522, in _LocIndexer._convert_to_indexer(self, key, axis)
   1520         return key
   1521     else:
-> 1522         return self._get_listlike_indexer(key, axis)[1]
   1523 else:
   1524     try:

File ~/miniforge3/envs/neural_data_science/lib/python3.12/site-packages/pandas/core/indexing.py:1558, in _LocIndexer._get_listlike_indexer(self, key, axis)
   1555 ax = self.obj._get_axis(axis)
   1556 axis_name = self.obj._get_axis_name(axis)
-> 1558 keyarr, indexer = ax._get_indexer_strict(key, axis_name)
   1560 return keyarr, indexer

File ~/miniforge3/envs/neural_data_science/lib/python3.12/site-packages/pandas/core/indexes/base.py:6200, in Index._get_indexer_strict(self, key, axis_name)
   6197 else:
   6198     keyarr, indexer, new_indexer = self._reindex_non_unique(keyarr)
-> 6200 self._raise_if_missing(keyarr, indexer, axis_name)
   6202 keyarr = self.take(indexer)
   6203 if isinstance(key, Index):
   6204     # GH 42790 - Preserve name from an Index

File ~/miniforge3/envs/neural_data_science/lib/python3.12/site-packages/pandas/core/indexes/base.py:6252, in Index._raise_if_missing(self, key, indexer, axis_name)
   6249     raise KeyError(f"None of [{key}] are in the [{axis_name}]")
   6251 not_found = list(ensure_index(key)[missing_mask.nonzero()[0]].unique())
-> 6252 raise KeyError(f"{not_found} not in index")

KeyError: "['Malta'] not in index"

The error message tells us that ['Malta'] not in index. Remember that the index is a set of labels for the rows of the DataFrame. If you look back at the code we used, country was set as the index when we read the CSV file into a DataFrame. So this error is telling us that Malta is not in the index.

If we look at the code that generated the error, we can see that Malta is listed in the countries of Southern Europe. The error we get suggests that data for Malta don’t exist in the DataFrame. Let’s check that. We can get Copilot to help, by writing a prompt asking it to check if Malta is in the index of the DataFrame.

# check if 'Malta' is in gapminder_europe
print('Malta' in gapminder_europe.index)
False

To correct this, we can remove Malta from the Southern Europe list. Hopefully you remember how to remove a particular item from a list, from a previous lesson. But there’s so much to remember! You would probably need to go back and look it up, or Google it. But as you might anticipate, we can prompt Copilot to do this, too!

As I wrote the prompt below, however, Copilot made many incorrect guesses as to what I wanted. Initially rather than southern_europe it suggested gapminder_europe — but that would be completely wrong, because the issue is that Malta isn’t in gapminder_europe! When I started typing s rather than accepting gapminder_europe, Copilot suggested south_europe – but if you look at the code above, Copilot had named that list southern_europe. So, even though Copilot can be sensitive to the context of the file you’re working in, it’s still not perfect, and may suggest erroneous code such as the wrong variable name. If you’re not paying attention, and understanding the code you’re generating with Copilot, it’s highly likely that you will make a lot of errors. But after manually editing the Copilot-generated code (changing south_europe to southern_europe), Copilot did produce a correct suggestion for how to remove Malta from the southern_europe list:

# remove 'Malta' from southern_europe
southern_europe.remove('Malta')

An alternative approach – and probably a better one in practice – would have been to go back to the cell that labelled all the regions, and manually remove Malta from southern_europe. That would ensure that our notebook file had clean, functional, and accurate code, rather than the ugly history of our attempts to generate that code. However, the point here is to illustrate the process of working with Copilot, so we are deliberately not doing that.

Now we want to re-run the code above. Let’s cut and paste it from the cell above into the cell below, and then run it. Make sure not to copy over the code that define the lists of countries in each region, because those are already defined, and we don’t want to add Malta back into southern_europe!

# create a new column in the gapminder_europe data frame called 'region'.
# Label each country as belonging to 'Northern Europe', 'Southern Europe',
# 'Eastern Europe', or 'Western Europe'
gapminder_europe['region'] = 'Western Europe'

# label the countries in Northern Europe as 'Northern Europe'
gapminder_europe.loc[northern_europe, 'region'] = 'Northern Europe'

# label the countries in Southern Europe as 'Southern Europe'
gapminder_europe.loc[southern_europe, 'region'] = 'Southern Europe'

# label the countries in Eastern Europe as 'Eastern Europe'
gapminder_europe.loc[eastern_europe, 'region'] = 'Eastern Europe'
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[12], line 13
     10 gapminder_europe.loc[southern_europe, 'region'] = 'Southern Europe'
     12 # label the countries in Eastern Europe as 'Eastern Europe'
---> 13 gapminder_europe.loc[eastern_europe, 'region'] = 'Eastern Europe'

File ~/miniforge3/envs/neural_data_science/lib/python3.12/site-packages/pandas/core/indexing.py:907, in _LocationIndexer.__setitem__(self, key, value)
    905     maybe_callable = com.apply_if_callable(key, self.obj)
    906     key = self._check_deprecated_callable_usage(key, maybe_callable)
--> 907 indexer = self._get_setitem_indexer(key)
    908 self._has_valid_setitem_indexer(key)
    910 iloc = self if self.name == "iloc" else self.obj.iloc

File ~/miniforge3/envs/neural_data_science/lib/python3.12/site-packages/pandas/core/indexing.py:774, in _LocationIndexer._get_setitem_indexer(self, key)
    771 if isinstance(key, tuple):
    772     with suppress(IndexingError):
    773         # suppress "Too many indexers"
--> 774         return self._convert_tuple(key)
    776 if isinstance(key, range):
    777     # GH#45479 test_loc_setitem_range_key
    778     key = list(key)

File ~/miniforge3/envs/neural_data_science/lib/python3.12/site-packages/pandas/core/indexing.py:989, in _LocationIndexer._convert_tuple(self, key)
    985 @final
    986 def _convert_tuple(self, key: tuple) -> tuple:
    987     # Note: we assume _tupleize_axis_indexer has been called, if necessary.
    988     self._validate_key_length(key)
--> 989     keyidx = [self._convert_to_indexer(k, axis=i) for i, k in enumerate(key)]
    990     return tuple(keyidx)

File ~/miniforge3/envs/neural_data_science/lib/python3.12/site-packages/pandas/core/indexing.py:1522, in _LocIndexer._convert_to_indexer(self, key, axis)
   1520         return key
   1521     else:
-> 1522         return self._get_listlike_indexer(key, axis)[1]
   1523 else:
   1524     try:

File ~/miniforge3/envs/neural_data_science/lib/python3.12/site-packages/pandas/core/indexing.py:1558, in _LocIndexer._get_listlike_indexer(self, key, axis)
   1555 ax = self.obj._get_axis(axis)
   1556 axis_name = self.obj._get_axis_name(axis)
-> 1558 keyarr, indexer = ax._get_indexer_strict(key, axis_name)
   1560 return keyarr, indexer

File ~/miniforge3/envs/neural_data_science/lib/python3.12/site-packages/pandas/core/indexes/base.py:6200, in Index._get_indexer_strict(self, key, axis_name)
   6197 else:
   6198     keyarr, indexer, new_indexer = self._reindex_non_unique(keyarr)
-> 6200 self._raise_if_missing(keyarr, indexer, axis_name)
   6202 keyarr = self.take(indexer)
   6203 if isinstance(key, Index):
   6204     # GH 42790 - Preserve name from an Index

File ~/miniforge3/envs/neural_data_science/lib/python3.12/site-packages/pandas/core/indexes/base.py:6252, in Index._raise_if_missing(self, key, indexer, axis_name)
   6249     raise KeyError(f"None of [{key}] are in the [{axis_name}]")
   6251 not_found = list(ensure_index(key)[missing_mask.nonzero()[0]].unique())
-> 6252 raise KeyError(f"{not_found} not in index")

KeyError: "['Belarus'] not in index"

Another error, but it looks like the same problem as before: this time, Belarus is not in the DataFrame.

Why is this happening? Well, we assumed that when Copilot generated those lists of countries in each region, it would use the countries in the gapminder_europe DataFrame. But in reality, we don’t know how it generated those lists. Maybe it was based on data used to train Copilot that had a larger list of countries in Europe, rather than just the countries in the DataFrame. And again, Copilot was not accurately sensitive to the context of our notebook file.

Anyway, we now know how to fix this error, based on what we did above for Malta. But what next? How many more countries did Copilot “hallucinate” were in our DataFrame? We can keep repeating this process until we get no errors, but that seems tedious, and could take a long time (especially if we were working with a larger data set). Let’s see if we can tweak our prompts from above to generate more accurate lists of countries in each region.

More Prompt Engineering#

My first attempt at a prompt was:

# create a list of the countries in Northern Europe, limited to the countries in the gapminder_europe data frame

However, this generated the code below, which is obviously wrong. First of all, it’s circular, in that it’s defining a variable northern_europe based on selecting rows in the DataFrame that are in a variable with the same name (northern_europe).

northern_europe = gapminder_europe.loc[northern_europe]

The larger problem with this code is that it’s using the northern_europe list, but the issue we’re trying to fix is that that list is incorrect. So we need a better prompt. Again, we need to break the problem down into smaller steps. Let’s start with the first step, and see if we can get Copilot to generate the code to create a list of the countries in Northern Europe,as it did before. Then, in a second step, let’s ask it to remove the countries not in the DataFrame. I tried this:

# create a list of countries in northern europe. Then, remove the names of any 
# countries in the list that are not in the gapminder_europe data frame

But Copilot only generated code for the first step. So, then I cut the second step from the first prompt, and made it a second step. As you see below, this generated code that ran with no errors.

# create a list of countries in northern europe. 
northern_europe = ['Belgium', 'Denmark', 'Finland', 'Iceland', 'Ireland', 'Luxembourg', 'Netherlands', 'Norway', 'Sweden', 'United Kingdom']

# remove the names of any countries in the list northern_europe that are not in the gapminder_europe data frame
northern_europe = [country for country in northern_europe if country in gapminder_europe.index]

Note that the last line of code above is using list comprehension. This is useful and efficient, but it can be a bit challenging to read and understand if you’re not familiar with it.

The list comprehension above is equivalent to this code:

tmp = []
for country in northern_europe:
    if country in gapminder_europe.index:
        tmp.append(country)

northern_europe = tmp

In this code, we have to create a temporary variable tmp to store the new results as we iterate through the items in the list northern_europe. In contrast, list comprehension works all at once, rather than iteratively over each item in the list. Therefore it’s OK to modify the list while we’re iterating over it, because we’re not actually iterating over it.

Of course, with Copilot if the code works, we don’t necessarily need to understand it. But it’s still good practice to actually understand what your code is doing. If Copilot generates code that you don’t understand, you should take the time to learn something new!

Check Your Results#

Even if we understand what the code is doing, we should double-check whether the list got changed! We can check by viewing the list. You could ask Copilot to do this, but sometimes for simple things it’s faster just to type the code yourself:

print(northern_europe)
['Belgium', 'Denmark', 'Finland', 'Iceland', 'Ireland', 'Netherlands', 'Norway', 'Sweden', 'United Kingdom']

Cross-checking this list against the original one, we can see that Luxembourg is no longer in the list. That suggests that Copilot did indeed remove the countries not in the DataFrame. Let’s try this for the other regions. For completeness – having all the code for one conceptual step in one cell – we’ll copy and paste the Northern Europe code from above, then generate additional prompts for the other regions. Copilot catches on to the repetitive nature of this process quickly, and should generate most of the prompts and code for you. (remember to hit Enter twice after each line of code that you accept, to get the next prompt). In fact, after Copilot finished generating prompts for each region, it again generated the prompts to create a new region column in our DataFrame, and populate it with the region labels.

# create a list of countries in northern europe. 
northern_europe = ['Belgium', 'Denmark', 'Finland', 'Iceland', 'Ireland', 'Luxembourg', 'Netherlands', 'Norway', 'Sweden', 'United Kingdom']

# remove the names of any countries in the list northern_europe that are not in the gapminder_europe data frame
northern_europe = [country for country in northern_europe if country in gapminder_europe.index]

# create a list of countries in southern europe.
southern_europe = ['Albania', 'Bosnia and Herzegovina', 'Croatia', 'Greece', 'Italy', 'Montenegro', 'Portugal', 'Serbia', 'Slovenia', 'Spain']

# remove the names of any countries in the list southern_europe that are not in the gapminder_europe data frame
southern_europe = [country for country in southern_europe if country in gapminder_europe.index]

# create a list of countries in eastern europe.
eastern_europe = ['Belarus', 'Bulgaria', 'Czech Republic', 'Hungary', 'Poland', 'Romania', 'Slovak Republic']

# remove the names of any countries in the list eastern_europe that are not in the gapminder_europe data frame
eastern_europe = [country for country in eastern_europe if country in gapminder_europe.index]

# create a list of countries in western europe.
western_europe = ['Austria', 'France', 'Germany', 'Switzerland']

# remove the names of any countries in the list western_europe that are not in the gapminder_europe data frame
western_europe = [country for country in western_europe if country in gapminder_europe.index]

# create a new column in the gapminder_europe data frame called 'region'.
# Label each country as belonging to 'Northern Europe', 'Southern Europe',
# 'Eastern Europe', or 'Western Europe'
gapminder_europe['region'] = 'Western Europe'

# label the countries in Northern Europe as 'Northern Europe'
gapminder_europe.loc[northern_europe, 'region'] = 'Northern Europe'

# label the countries in Southern Europe as 'Southern Europe'
gapminder_europe.loc[southern_europe, 'region'] = 'Southern Europe'

# label the countries in Eastern Europe as 'Eastern Europe'
gapminder_europe.loc[eastern_europe, 'region'] = 'Eastern Europe'

This time the code ran without errors. We can spot-check our region labels by getting a random sample of rows from the DataFrame:

# generate a random sample of rows from the gapminder_europe data frame
gapminder_europe.sample(5)
gdpPercap_1952 gdpPercap_1957 gdpPercap_1962 gdpPercap_1967 gdpPercap_1972 gdpPercap_1977 gdpPercap_1982 gdpPercap_1987 gdpPercap_1992 gdpPercap_1997 gdpPercap_2002 gdpPercap_2007 region
country
Netherlands 8941.571858 11276.193440 12790.849560 15363.251360 18794.745670 21209.05920 21399.46046 23651.32361 26790.949610 30246.13063 33724.75778 36797.93332 Northern Europe
Austria 6137.076492 8842.598030 10750.721110 12834.602400 16661.625600 19749.42230 21597.08362 23687.82607 27042.018680 29095.92066 32417.60769 36126.49270 Western Europe
Ireland 5210.280328 5599.077872 6631.597314 7655.568963 9530.772896 11150.98113 12618.32141 13872.86652 17558.815550 24521.94713 34077.04939 40675.99635 Northern Europe
Slovak Republic 5074.659104 6093.262980 7481.107598 8412.902397 9674.167626 10922.66404 11348.54585 12037.26758 9498.467723 12126.23065 13638.77837 18678.31435 Eastern Europe
Norway 10095.421720 11653.973040 13450.401510 16361.876470 18965.055510 23311.34939 26298.63531 31540.97480 33965.661150 41283.16433 44683.97525 49357.19017 Northern Europe

Since new columns are added to the right side of a DataFrame, you’ll need to scroll to the right to see the new region column, but it is indeed there, and the labels in the sample are correct. If you want more reassurance, you could re-run the cell to get a new random sample, or increase the number of samples from 5 to a larger number.

You could also modify your prompt to show only the region column, and then view ta random sample of rows:

# show a random sample of rows from the gapminder_europe data frame, and only the region column
gapminder_europe['region'].sample(5)
country
Netherlands       Northern Europe
Portugal          Southern Europe
Finland           Northern Europe
United Kingdom    Northern Europe
Czech Republic     Eastern Europe
Name: region, dtype: object

Note

If you look at the code that Copilot generated above, and compare it to what we used in the pandas lesson previous chapter, you’ll see that it’s quite similar in that both use the gapminder_europe.loc method to label countries according to region. However, the process of generating it required some trial and error, and some manual intervention. This isn’t actually different from what you would probably be doing – especially as a novice coder – if you were writing the code yourself. But, perhaps ironically, part of the code that we had to prompt Copilot to generate, was to fix errors in the code that it generated.

At the same time, if you look back at the pandas lesson, the lists of countries in each region were defined manually — and in fact there was no explanation as to how those lists were created. Those lists were created manually by someone reviewing the list of country names in the DataFrame, and using their knowledge of geography (or maybe an internet search) to assign the countries to the labels. In contrast, Copilot generated those lists for us (based on the vast corpus of data it was trained on), which saved us a lot of time. The fact that we had to write additional code to clean up the lists is a small price to pay for that time savings. And indeed, cleaning data is a routine part of data science, and it’s good to be able to do it in code, rather than manually.

Critically Evaluating Copilot’s Code#

Although our previous pandas lesson and Copilot both used the gapminder_europe.loc method to label countries, there is one important difference between what we showed in the previous lesson, and Copilot’s code. Copilot first labelled all countries as Western Europe, and then changed the label for non-Western countries. In contrast, in the previous lesson we used the same gapminder_europe.loc method to label Western countries as Copilot did here for the other regions. This illustrates how different code can be used to accomplish the same task.

However, just because two approaches do the same job (or appear to), there is some danger to the approach that Copilot took here. In labelling all the countries first as Western Europe, the assumption is that the other labels we use will cover all of the countries that are not Western Europe. However, what if there was a country that was in the DataFrame, but not in any of Copilot’s lists of regions? Those countries would be labeled as Western Europe, which might be incorrect. In that case, the code Copilot generated would create bugs that would affect the outcome of any analyses applied to the data. This is a good example of why it’s important to read and understand the code you’re using, and not just blindly accept what Copilot generates.

At the very least, a good data scientist will assume the worst of their, or AI-generated, code, and test it to make sure it’s doing what it’s supposed to do. In this case, since the concern is that a country not in Western Europe may get that label, we can check by comparing the countries with that label in the DataFrame, with the list western_europe.

Using Copilot to Check its Own Work#

Fortunately, Copilot can help us check it’s own work – it even correctly generated a lot of the detailed prompt you see below! The prompt below also shows that you can use complex, multi-step prompts with Copilot, and it will generate code for each step. The difference between the complex prompt here, and some that failed above, is that we clearly end each step with a period, and start the next step as a new sentence.

# list all of the countries in the gapminder_europe data frame labelled as "Western Europe".
# Then compare this list to the contents of the western_europe list. Print the names of any
# countries that are in the western_europe list but not in the gapminder_europe data frame.
print(set(western_europe) - set(gapminder_europe[gapminder_europe['region'] == 'Western Europe'].index))
set()

The output shows that there are no countries labelled as Western Europe that are not in the western_europe list. So, Copilot’s code is correct. But, it is important to always check – whether it’s your own code, or AI-generated.

Finally, note that the code that Copilot did generate there is pretty complex. At this point in your development as a coder, do you think you could have written that? And if so, how long do you think it would have taken? Copilot generated that code in a few seconds. This is the promise of AI-assisted coding. At the same time, as a learner and a critical user of AI-generated code, you should take the time to understand what the code is doing, and why. In this case, you can do internet searches for the components of that code that you may not recognize, such as the set() function and the .index method (``).

Summary#

  • GitHub Copilot is a great coding partner. On its own, it often generates errors — including both fatal Python errors, and more sneaky errors that may not be obvious — cases where the code runs without errors, but there are flaws in the logic of the code.

  • In spite of the risk of errors errors, Copilot can save you time by quickly generating code, and often generating things that would take you significant time to look up.

  • The best approach to using Copilot is to break down your coding tasks into small steps, and prompt Copilot to generate code for each step. Always be conscious of what you’re trying to do and be clear in the expected outputs of that code.

  • Always actually read the code that Copilot generates and be sure you understand it. Based on that understanding, confirm the logic of what it’s doing, and ensure it’s what you expect.

  • Always test the results of the code Copilot generates, to confirm that it’s doing what you expect.

  • This approach will help you both code more efficiently, and become a better coder yourself. Professional coders have eagerly adopted Copilot due to the efficiencies it creates — but there is a learning curve, and ultimately it is just another tool in your toolbox, not the only tool you need.