Data Files and pandas DataFrames#
In the previous section we tried some very basic things with Copilot. Now let’s try a more complex example. In the previous chapter, we worked with the Gapminder dataset, which is a CSV file. We used the pandas
library to read the CSV file into a DataFrame
, and then used the DataFrame
to do some analysis. Let’s do that again, but this time we’ll use Copilot to help us.
First, let’s prompt Copilot to read the CSV file containing the data for Europe. I’ve deliberately started with a very minimal prompt, to help illustrate how the wording of the prompt can affect the results.
# read the gapminder data file for Europe
gapminder_europe = pd.read_csv('gapminder_gdp_europe.csv', index_col='country')
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
Cell In[1], line 2
1 # read the gapminder data file for Europe
----> 2 gapminder_europe = pd.read_csv('gapminder_gdp_europe.csv', index_col='country')
NameError: name 'pd' is not defined
That generated a scary looking error message! This is human error, not AI error. But before we sort that out, let’s marvel at what Copilot did there. We gave it a pretty poor prompt (“the gapminder data for europe”), and it guessed that the file was named gapminder_europe.csv
. It also guessed that we wanted to use the pandas
library (pd
) to read the CSV file into a DataFrame
. That’s pretty impressive!
Fortunately, the error is pretty self-explanatory. If we want to work with pandas, we always need to first import the pandas library. By convention we give it the alias pd
. So let’s add that to our prompt, and see what happens. Note that Copilot may generate code one line at a time, so you may have to accept the suggestion for the first line of code before it generates the next line of code.
# import the pandas library as pd and then read the gapminder data file for Europe
import pandas as pd
gapminder_europe = pd.read_csv('gapminder_gdp_europe.csv', index_col='country')
---------------------------------------------------------------------------
FileNotFoundError Traceback (most recent call last)
Cell In[2], line 3
1 # import the pandas library as pd and then read the gapminder data file for Europe
2 import pandas as pd
----> 3 gapminder_europe = pd.read_csv('gapminder_gdp_europe.csv', index_col='country')
File ~/miniforge3/envs/neural_data_science/lib/python3.12/site-packages/pandas/io/parsers/readers.py:1026, in read_csv(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, date_format, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, encoding_errors, dialect, on_bad_lines, delim_whitespace, low_memory, memory_map, float_precision, storage_options, dtype_backend)
1013 kwds_defaults = _refine_defaults_read(
1014 dialect,
1015 delimiter,
(...)
1022 dtype_backend=dtype_backend,
1023 )
1024 kwds.update(kwds_defaults)
-> 1026 return _read(filepath_or_buffer, kwds)
File ~/miniforge3/envs/neural_data_science/lib/python3.12/site-packages/pandas/io/parsers/readers.py:620, in _read(filepath_or_buffer, kwds)
617 _validate_names(kwds.get("names", None))
619 # Create the parser.
--> 620 parser = TextFileReader(filepath_or_buffer, **kwds)
622 if chunksize or iterator:
623 return parser
File ~/miniforge3/envs/neural_data_science/lib/python3.12/site-packages/pandas/io/parsers/readers.py:1620, in TextFileReader.__init__(self, f, engine, **kwds)
1617 self.options["has_index_names"] = kwds["has_index_names"]
1619 self.handles: IOHandles | None = None
-> 1620 self._engine = self._make_engine(f, self.engine)
File ~/miniforge3/envs/neural_data_science/lib/python3.12/site-packages/pandas/io/parsers/readers.py:1880, in TextFileReader._make_engine(self, f, engine)
1878 if "b" not in mode:
1879 mode += "b"
-> 1880 self.handles = get_handle(
1881 f,
1882 mode,
1883 encoding=self.options.get("encoding", None),
1884 compression=self.options.get("compression", None),
1885 memory_map=self.options.get("memory_map", False),
1886 is_text=is_text,
1887 errors=self.options.get("encoding_errors", "strict"),
1888 storage_options=self.options.get("storage_options", None),
1889 )
1890 assert self.handles is not None
1891 f = self.handles.handle
File ~/miniforge3/envs/neural_data_science/lib/python3.12/site-packages/pandas/io/common.py:873, in get_handle(path_or_buf, mode, encoding, compression, memory_map, is_text, errors, storage_options)
868 elif isinstance(handle, str):
869 # Check whether the filename is to be opened in binary mode.
870 # Binary mode does not support 'encoding' and 'newline'.
871 if ioargs.encoding and "b" not in ioargs.mode:
872 # Encoding
--> 873 handle = open(
874 handle,
875 ioargs.mode,
876 encoding=ioargs.encoding,
877 errors=errors,
878 newline="",
879 )
880 else:
881 # Binary mode
882 handle = open(handle, ioargs.mode)
FileNotFoundError: [Errno 2] No such file or directory: 'gapminder_gdp_europe.csv'
More scary error messages! This time, the last line is the most informative. It’s telling us that it can’t find the file we asked it to read. That’s because we haven’t told it where to look for the file. When we just list the name of a file, pd.read_csv()
assumes that the file is in the same directory as the notebook. But in this case, the file is in a subdirectory called data
. So let’s add that to our prompt, and see what happens.
# import the pandas library as pd and then read the gapminder data file for Europe.
# The file is in a subfolder called data
import pandas as pd
gapminder_europe = pd.read_csv('data/gapminder_gdp_europe.csv', index_col='country')
Tip
In many cases, Copilot might not have enough information to correctly guess the name of a file you want to load. In such cases, you could include the name of the file, as well as its path, in your prompt.
Now we’re getting somewhere! Let’s view the first few lines of the DataFrame
to make sure it looks right.
# show me the first few lines of the gapminder europe data
print(gapminder_europe.head())
gdpPercap_1952 gdpPercap_1957 gdpPercap_1962 \
country
Albania 1601.056136 1942.284244 2312.888958
Austria 6137.076492 8842.598030 10750.721110
Belgium 8343.105127 9714.960623 10991.206760
Bosnia and Herzegovina 973.533195 1353.989176 1709.683679
Bulgaria 2444.286648 3008.670727 4254.337839
gdpPercap_1967 gdpPercap_1972 gdpPercap_1977 \
country
Albania 2760.196931 3313.422188 3533.003910
Austria 12834.602400 16661.625600 19749.422300
Belgium 13149.041190 16672.143560 19117.974480
Bosnia and Herzegovina 2172.352423 2860.169750 3528.481305
Bulgaria 5577.002800 6597.494398 7612.240438
gdpPercap_1982 gdpPercap_1987 gdpPercap_1992 \
country
Albania 3630.880722 3738.932735 2497.437901
Austria 21597.083620 23687.826070 27042.018680
Belgium 20979.845890 22525.563080 25575.570690
Bosnia and Herzegovina 4126.613157 4314.114757 2546.781445
Bulgaria 8224.191647 8239.854824 6302.623438
gdpPercap_1997 gdpPercap_2002 gdpPercap_2007
country
Albania 3193.054604 4604.211737 5937.029526
Austria 29095.920660 32417.607690 36126.492700
Belgium 27561.196630 30485.883750 33692.605080
Bosnia and Herzegovina 4766.355904 6018.975239 7446.298803
Bulgaria 5970.388760 7696.777725 10680.792820
Again, we used a pretty minimal prompt to get the output. That is, we didn’t need to use the exact name of the DataFrame
(gapminder_europe
), nor the exact name of the method (head()
).
The code that we got in developing this lesson was print(gapminder_europe.head())
. However, the output of passing a DataFrame
to print()
is not formatted as nicely as if we ask the DataFrame
to print itself. So let’s try modifying our prompt to get nicely formatted output:
# show me the first few lines of the gapminder europe data
# format the table nicely
print(gapminder_europe.head().to_string())
gdpPercap_1952 gdpPercap_1957 gdpPercap_1962 gdpPercap_1967 gdpPercap_1972 gdpPercap_1977 gdpPercap_1982 gdpPercap_1987 gdpPercap_1992 gdpPercap_1997 gdpPercap_2002 gdpPercap_2007
country
Albania 1601.056136 1942.284244 2312.888958 2760.196931 3313.422188 3533.003910 3630.880722 3738.932735 2497.437901 3193.054604 4604.211737 5937.029526
Austria 6137.076492 8842.598030 10750.721110 12834.602400 16661.625600 19749.422300 21597.083620 23687.826070 27042.018680 29095.920660 32417.607690 36126.492700
Belgium 8343.105127 9714.960623 10991.206760 13149.041190 16672.143560 19117.974480 20979.845890 22525.563080 25575.570690 27561.196630 30485.883750 33692.605080
Bosnia and Herzegovina 973.533195 1353.989176 1709.683679 2172.352423 2860.169750 3528.481305 4126.613157 4314.114757 2546.781445 4766.355904 6018.975239 7446.298803
Bulgaria 2444.286648 3008.670727 4254.337839 5577.002800 6597.494398 7612.240438 8224.191647 8239.854824 6302.623438 5970.388760 7696.777725 10680.792820
Still not what we wanted (athough at least each row in the DataFrame is now one row in the output) — and Copilot didn’t have any alternative suggestions. This is where prompt engineering becomes important. Let’s try a different prompt.
# show me the first few lines of the gapminder europe data
# the output should be a table
gapminder_europe.head()
gdpPercap_1952 | gdpPercap_1957 | gdpPercap_1962 | gdpPercap_1967 | gdpPercap_1972 | gdpPercap_1977 | gdpPercap_1982 | gdpPercap_1987 | gdpPercap_1992 | gdpPercap_1997 | gdpPercap_2002 | gdpPercap_2007 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
country | ||||||||||||
Albania | 1601.056136 | 1942.284244 | 2312.888958 | 2760.196931 | 3313.422188 | 3533.003910 | 3630.880722 | 3738.932735 | 2497.437901 | 3193.054604 | 4604.211737 | 5937.029526 |
Austria | 6137.076492 | 8842.598030 | 10750.721110 | 12834.602400 | 16661.625600 | 19749.422300 | 21597.083620 | 23687.826070 | 27042.018680 | 29095.920660 | 32417.607690 | 36126.492700 |
Belgium | 8343.105127 | 9714.960623 | 10991.206760 | 13149.041190 | 16672.143560 | 19117.974480 | 20979.845890 | 22525.563080 | 25575.570690 | 27561.196630 | 30485.883750 | 33692.605080 |
Bosnia and Herzegovina | 973.533195 | 1353.989176 | 1709.683679 | 2172.352423 | 2860.169750 | 3528.481305 | 4126.613157 | 4314.114757 | 2546.781445 | 4766.355904 | 6018.975239 | 7446.298803 |
Bulgaria | 2444.286648 | 3008.670727 | 4254.337839 | 5577.002800 | 6597.494398 | 7612.240438 | 8224.191647 | 8239.854824 | 6302.623438 | 5970.388760 | 7696.777725 | 10680.792820 |
That did it! We just needed to be a bit more specific about what we wanted to do, by asking for the output as a table.
It’s worth noting that others working through this lesson got gapminder_europe.head()
to the original prompt above, but we kept this series of outputs to illustrate how the wording of the prompt can affect the results.
Split-Apply-Combine#
In the previous chapter, we learned about the split-apply-combine strategy for data analysis. That’s when we split the data into different groups (e.g., according to a particular variable), apply an operation to each group separately, and then combine the results back into a single table. Let’s try that again, but this time we’ll use Copilot to help us.
Let’s replicate the split-apply-combine analysis we did in the previous chapter, where we calculate the average GDP per capita for each of four regions in Europe: Northern Europe, Southern Europe, Eastern Europe, and Western Europe. We’ll start with a very high-level prompt, and see if we get what we want.
Note that in the code cell below, I only typed the first three-line prompt. Copilot generated the rest of the code, including a series of prompts (comments) to help me fill in the details. In order to get it to generate additional prompts and code, I had to hit enter twice after the first prompt. Apparently Copilot is insistent that your code be nicely formatted, with empty lines between different sections of code that do different steps. This is a good thing, but if you forget to hit Enter
twice, you may get nothing an think Copilot isn’t working. Just hit Enter
twice, and you should get the prompts you need. Otherwise, try going back to the end of your prompt and hitting Enter
again.
# use split-apply-combine to calculate the average gdp per capita
# for each of four regions in Europe: Northern Europe, Southern Europe,
# Eastern Europe, and Western Europe
# split the data into four regions
regions = ['Northern Europe', 'Southern Europe', 'Eastern Europe', 'Western Europe']
grouped = gapminder_europe.groupby(['group', 'year'])
# apply the mean function to calculate the average gdp per capita for each region
gdp_per_capita = grouped['gdpPercap'].mean()
# combine the results into a new table
gdp_per_capita.head()
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
Cell In[7], line 7
1 # use split-apply-combine to calculate the average gdp per capita
2 # for each of four regions in Europe: Northern Europe, Southern Europe,
3 # Eastern Europe, and Western Europe
4
5 # split the data into four regions
6 regions = ['Northern Europe', 'Southern Europe', 'Eastern Europe', 'Western Europe']
----> 7 grouped = gapminder_europe.groupby(['group', 'year'])
9 # apply the mean function to calculate the average gdp per capita for each region
10 gdp_per_capita = grouped['gdpPercap'].mean()
File ~/miniforge3/envs/neural_data_science/lib/python3.12/site-packages/pandas/core/frame.py:9183, in DataFrame.groupby(self, by, axis, level, as_index, sort, group_keys, observed, dropna)
9180 if level is None and by is None:
9181 raise TypeError("You have to supply one of 'by' and 'level'")
-> 9183 return DataFrameGroupBy(
9184 obj=self,
9185 keys=by,
9186 axis=axis,
9187 level=level,
9188 as_index=as_index,
9189 sort=sort,
9190 group_keys=group_keys,
9191 observed=observed,
9192 dropna=dropna,
9193 )
File ~/miniforge3/envs/neural_data_science/lib/python3.12/site-packages/pandas/core/groupby/groupby.py:1329, in GroupBy.__init__(self, obj, keys, axis, level, grouper, exclusions, selection, as_index, sort, group_keys, observed, dropna)
1326 self.dropna = dropna
1328 if grouper is None:
-> 1329 grouper, exclusions, obj = get_grouper(
1330 obj,
1331 keys,
1332 axis=axis,
1333 level=level,
1334 sort=sort,
1335 observed=False if observed is lib.no_default else observed,
1336 dropna=self.dropna,
1337 )
1339 if observed is lib.no_default:
1340 if any(ping._passed_categorical for ping in grouper.groupings):
File ~/miniforge3/envs/neural_data_science/lib/python3.12/site-packages/pandas/core/groupby/grouper.py:1043, in get_grouper(obj, key, axis, level, sort, observed, validate, dropna)
1041 in_axis, level, gpr = False, gpr, None
1042 else:
-> 1043 raise KeyError(gpr)
1044 elif isinstance(gpr, Grouper) and gpr.key is not None:
1045 # Add key to exclusions
1046 exclusions.add(gpr.key)
KeyError: 'group'
Prompt Engineering — Writing Better Prompts#
Can you figure out why this error happened? Ideally, you should be able to do this by re-reading the code the Copilot generated. But, the error message also provides guidance, because it tells us there is an error on line 7 (this information is near the top of the error message), and the error is KeyError: 'group'
. That means that gapminder_europe
doesn’t have a column named group
. Indeed, if we look at line 7 we see it’s attempting to group the data by two columns, year
and group
. But gapminder_europe
doesn’t have a column named group
, which is why we got an error.
There are actually a few problems with this code. Not only does group
not exist as a column, but it’s not a great name for a column, because the column should indicate region of Europe. So a better name would be region
. Another issue is that the .groupby()
method was given another variable, year
. But our instructions didn’t say anything about year, so we don’t want to group by year. We just want to group by region.
Let’s try fix those issues through better prompt engineering. One thing you’ll encounter frequently is that you give Copilot too big a job, and doing what you’re asking will require a number of intermediate steps. Copilot isn’t great about long-term planning. Remember that LLMs are just advanced versions of the same kind of language model that suggests words when you’re typing on your phone. It’s often good at predicting the next word you want to type, but if you keep just picking the “best” suggestion, you will get a sentence of gibberish more times than not. The same is true with Copilot. If you give it a big job, it will often generate code that doesn’t work, because it’s trying to do too much at once.
So you need to break the job down into smaller steps. In this case, we’re doing split-apply-combine, so let’s try writing our own prompts for each of those conceptual steps. So let’s start with the first step, ‘split’, and see if we can get Copilot to generate the code to group the data by region.
# create a new column in the gapminder_europe data frame called 'region'.
# Label each country as belonging to 'Northern Europe', 'Southern Europe',
# 'Eastern Europe', or 'Western Europe'
# create a new column in the gapminder_europe data frame called 'region'.
# Label each country as belonging to 'Northern Europe', 'Southern Europe',
# 'Eastern Europe', or 'Western Europe'
gapminder_europe['region'] = 'Western Europe'
AI may be seeming less magical now. Firstly, it repeated the entire prompt before generating any code. Secondly, it has labelled every country as Western Europe
, which is wrong. In the previous chapter, we had to manually create lists for each region label, containing the names of each country in that region. We could do that here to help Copilot along, but hopefully AI is smart enough to know what geographical region each European country is in. This is a reasonable prediction in this case, especially because the open-source Gapminder dataset is widely used in teaching data science, so there should be many versions of this specific example in Copilot’s training set. We just need to figure out how to get our prompting right.
Break the Problem Down into Smaller Steps#
Although our prompt may be readily interpretable by a human brain, in terms of communicating what we want, apparently it’s too complex for Copilot’s AI. If we think about the task we’re asking Copilot to do, there are several sub-tasks. That is, in one prompt we’re asking Copilot to create four different lists. What if, instead, we ask for one list at a time?
It turns out that if we start with the prompt:
# create a list of the countries in Northern Europe
Copilot will generate the code to create a list of the countries in Northern Europe.
northern_europe = ['Denmark', 'Finland', 'Iceland', 'Norway', 'Sweden']
That’s a good start. But we need to do this for each region. So if we hit Enter
a couple of times to get a blank line, and then type the prompt:
# create a list of the countries in
Copilot is smart enough to complete this prompt with Southern Europe
, and generate the code to create a list of the countries in Southern Europe.
Even better, if we accept that code and hit Enter
two more times, Copilot will actually generate the complete next prompt for us:
# create a list of the countries in Eastern Europe
And if we accept that one, it generates the code for us, and generates the next prompt:
# create a list of the countries in Western Europe
Once we accept that last prompt, and the code it generates, and again hit Enter
twice, Copilot really shines. It generates the same prompts we saw above, and the code to create a new column called region
, and then starts by labelling all countries as Western Europe
. It generated this code earlier, but it didn’t work because we hadn’t first used the right prompts to get the four lists of countries in each region. Copilot then goes on to generate code to change the region labels for each country, based on the lists we created above.
However, when we try to run this code, we get an error:
# create a list of the countries in Northern Europe
northern_europe = ['Denmark', 'Finland', 'Iceland', 'Norway', 'Sweden']
# create a list of the countries in Southern Europe
southern_europe = ['Albania', 'Bosnia and Herzegovina', 'Croatia', 'Greece', 'Italy', 'Malta', 'Montenegro', 'Portugal', 'Serbia', 'Slovenia', 'Spain']
# create a list of the countries in Eastern Europe
eastern_europe = ['Belarus', 'Bulgaria', 'Czech Republic', 'Hungary', 'Poland', 'Romania', 'Slovak Republic']
# create a list of the countries in Western Europe
western_europe = ['Austria', 'Belgium', 'France', 'Germany', 'Ireland', 'Luxembourg', 'Netherlands', 'United Kingdom']
# create a new column in the gapminder_europe data frame called 'region'.
# Label each country as belonging to 'Northern Europe', 'Southern Europe',
# 'Eastern Europe', or 'Western Europe'
gapminder_europe['region'] = 'Western Europe'
# label the countries in Northern Europe as 'Northern Europe'
gapminder_europe.loc[northern_europe, 'region'] = 'Northern Europe'
# label the countries in Southern Europe as 'Southern Europe'
gapminder_europe.loc[southern_europe, 'region'] = 'Southern Europe'
# label the countries in Eastern Europe as 'Eastern Europe'
gapminder_europe.loc[eastern_europe, 'region'] = 'Eastern Europe'
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
Cell In[9], line 22
19 gapminder_europe.loc[northern_europe, 'region'] = 'Northern Europe'
21 # label the countries in Southern Europe as 'Southern Europe'
---> 22 gapminder_europe.loc[southern_europe, 'region'] = 'Southern Europe'
24 # label the countries in Eastern Europe as 'Eastern Europe'
25 gapminder_europe.loc[eastern_europe, 'region'] = 'Eastern Europe'
File ~/miniforge3/envs/neural_data_science/lib/python3.12/site-packages/pandas/core/indexing.py:907, in _LocationIndexer.__setitem__(self, key, value)
905 maybe_callable = com.apply_if_callable(key, self.obj)
906 key = self._check_deprecated_callable_usage(key, maybe_callable)
--> 907 indexer = self._get_setitem_indexer(key)
908 self._has_valid_setitem_indexer(key)
910 iloc = self if self.name == "iloc" else self.obj.iloc
File ~/miniforge3/envs/neural_data_science/lib/python3.12/site-packages/pandas/core/indexing.py:774, in _LocationIndexer._get_setitem_indexer(self, key)
771 if isinstance(key, tuple):
772 with suppress(IndexingError):
773 # suppress "Too many indexers"
--> 774 return self._convert_tuple(key)
776 if isinstance(key, range):
777 # GH#45479 test_loc_setitem_range_key
778 key = list(key)
File ~/miniforge3/envs/neural_data_science/lib/python3.12/site-packages/pandas/core/indexing.py:989, in _LocationIndexer._convert_tuple(self, key)
985 @final
986 def _convert_tuple(self, key: tuple) -> tuple:
987 # Note: we assume _tupleize_axis_indexer has been called, if necessary.
988 self._validate_key_length(key)
--> 989 keyidx = [self._convert_to_indexer(k, axis=i) for i, k in enumerate(key)]
990 return tuple(keyidx)
File ~/miniforge3/envs/neural_data_science/lib/python3.12/site-packages/pandas/core/indexing.py:1522, in _LocIndexer._convert_to_indexer(self, key, axis)
1520 return key
1521 else:
-> 1522 return self._get_listlike_indexer(key, axis)[1]
1523 else:
1524 try:
File ~/miniforge3/envs/neural_data_science/lib/python3.12/site-packages/pandas/core/indexing.py:1558, in _LocIndexer._get_listlike_indexer(self, key, axis)
1555 ax = self.obj._get_axis(axis)
1556 axis_name = self.obj._get_axis_name(axis)
-> 1558 keyarr, indexer = ax._get_indexer_strict(key, axis_name)
1560 return keyarr, indexer
File ~/miniforge3/envs/neural_data_science/lib/python3.12/site-packages/pandas/core/indexes/base.py:6200, in Index._get_indexer_strict(self, key, axis_name)
6197 else:
6198 keyarr, indexer, new_indexer = self._reindex_non_unique(keyarr)
-> 6200 self._raise_if_missing(keyarr, indexer, axis_name)
6202 keyarr = self.take(indexer)
6203 if isinstance(key, Index):
6204 # GH 42790 - Preserve name from an Index
File ~/miniforge3/envs/neural_data_science/lib/python3.12/site-packages/pandas/core/indexes/base.py:6252, in Index._raise_if_missing(self, key, indexer, axis_name)
6249 raise KeyError(f"None of [{key}] are in the [{axis_name}]")
6251 not_found = list(ensure_index(key)[missing_mask.nonzero()[0]].unique())
-> 6252 raise KeyError(f"{not_found} not in index")
KeyError: "['Malta'] not in index"
The error message tells us that ['Malta'] not in index
. Remember that the index is a set of labels for the rows of the DataFrame. If you look back at the code we used, country
was set as the index when we read the CSV file into a DataFrame. So this error is telling us that Malta
is not in the index.
If we look at the code that generated the error, we can see that Malta
is listed in the countries of Southern Europe. The error we get suggests that data for Malta don’t exist in the DataFrame. Let’s check that. We can get Copilot to help, by writing a prompt asking it to check if Malta is in the index of the DataFrame.
# check if 'Malta' is in gapminder_europe
print('Malta' in gapminder_europe.index)
False
To correct this, we can remove Malta from the Southern Europe
list. Hopefully you remember how to remove a particular item from a list, from a previous lesson. But there’s so much to remember! You would probably need to go back and look it up, or Google it. But as you might anticipate, we can prompt Copilot to do this, too!
As I wrote the prompt below, however, Copilot made many incorrect guesses as to what I wanted. Initially rather than southern_europe
it suggested gapminder_europe
— but that would be completely wrong, because the issue is that Malta
isn’t in gapminder_europe
! When I started typing s
rather than accepting gapminder_europe
, Copilot suggested south_europe
– but if you look at the code above, Copilot had named that list southern_europe
. So, even though Copilot can be sensitive to the context of the file you’re working in, it’s still not perfect, and may suggest erroneous code such as the wrong variable name. If you’re not paying attention, and understanding the code you’re generating with Copilot, it’s highly likely that you will make a lot of errors. But after manually editing the Copilot-generated code (changing south_europe
to southern_europe
), Copilot did produce a correct suggestion for how to remove Malta
from the southern_europe
list:
# remove 'Malta' from southern_europe
southern_europe.remove('Malta')
An alternative approach – and probably a better one in practice – would have been to go back to the cell that labelled all the regions, and manually remove Malta
from southern_europe
. That would ensure that our notebook file had clean, functional, and accurate code, rather than the ugly history of our attempts to generate that code. However, the point here is to illustrate the process of working with Copilot, so we are deliberately not doing that.
Now we want to re-run the code above. Let’s cut and paste it from the cell above into the cell below, and then run it. Make sure not to copy over the code that define the lists of countries in each region, because those are already defined, and we don’t want to add Malta
back into southern_europe
!
# create a new column in the gapminder_europe data frame called 'region'.
# Label each country as belonging to 'Northern Europe', 'Southern Europe',
# 'Eastern Europe', or 'Western Europe'
gapminder_europe['region'] = 'Western Europe'
# label the countries in Northern Europe as 'Northern Europe'
gapminder_europe.loc[northern_europe, 'region'] = 'Northern Europe'
# label the countries in Southern Europe as 'Southern Europe'
gapminder_europe.loc[southern_europe, 'region'] = 'Southern Europe'
# label the countries in Eastern Europe as 'Eastern Europe'
gapminder_europe.loc[eastern_europe, 'region'] = 'Eastern Europe'
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
Cell In[12], line 13
10 gapminder_europe.loc[southern_europe, 'region'] = 'Southern Europe'
12 # label the countries in Eastern Europe as 'Eastern Europe'
---> 13 gapminder_europe.loc[eastern_europe, 'region'] = 'Eastern Europe'
File ~/miniforge3/envs/neural_data_science/lib/python3.12/site-packages/pandas/core/indexing.py:907, in _LocationIndexer.__setitem__(self, key, value)
905 maybe_callable = com.apply_if_callable(key, self.obj)
906 key = self._check_deprecated_callable_usage(key, maybe_callable)
--> 907 indexer = self._get_setitem_indexer(key)
908 self._has_valid_setitem_indexer(key)
910 iloc = self if self.name == "iloc" else self.obj.iloc
File ~/miniforge3/envs/neural_data_science/lib/python3.12/site-packages/pandas/core/indexing.py:774, in _LocationIndexer._get_setitem_indexer(self, key)
771 if isinstance(key, tuple):
772 with suppress(IndexingError):
773 # suppress "Too many indexers"
--> 774 return self._convert_tuple(key)
776 if isinstance(key, range):
777 # GH#45479 test_loc_setitem_range_key
778 key = list(key)
File ~/miniforge3/envs/neural_data_science/lib/python3.12/site-packages/pandas/core/indexing.py:989, in _LocationIndexer._convert_tuple(self, key)
985 @final
986 def _convert_tuple(self, key: tuple) -> tuple:
987 # Note: we assume _tupleize_axis_indexer has been called, if necessary.
988 self._validate_key_length(key)
--> 989 keyidx = [self._convert_to_indexer(k, axis=i) for i, k in enumerate(key)]
990 return tuple(keyidx)
File ~/miniforge3/envs/neural_data_science/lib/python3.12/site-packages/pandas/core/indexing.py:1522, in _LocIndexer._convert_to_indexer(self, key, axis)
1520 return key
1521 else:
-> 1522 return self._get_listlike_indexer(key, axis)[1]
1523 else:
1524 try:
File ~/miniforge3/envs/neural_data_science/lib/python3.12/site-packages/pandas/core/indexing.py:1558, in _LocIndexer._get_listlike_indexer(self, key, axis)
1555 ax = self.obj._get_axis(axis)
1556 axis_name = self.obj._get_axis_name(axis)
-> 1558 keyarr, indexer = ax._get_indexer_strict(key, axis_name)
1560 return keyarr, indexer
File ~/miniforge3/envs/neural_data_science/lib/python3.12/site-packages/pandas/core/indexes/base.py:6200, in Index._get_indexer_strict(self, key, axis_name)
6197 else:
6198 keyarr, indexer, new_indexer = self._reindex_non_unique(keyarr)
-> 6200 self._raise_if_missing(keyarr, indexer, axis_name)
6202 keyarr = self.take(indexer)
6203 if isinstance(key, Index):
6204 # GH 42790 - Preserve name from an Index
File ~/miniforge3/envs/neural_data_science/lib/python3.12/site-packages/pandas/core/indexes/base.py:6252, in Index._raise_if_missing(self, key, indexer, axis_name)
6249 raise KeyError(f"None of [{key}] are in the [{axis_name}]")
6251 not_found = list(ensure_index(key)[missing_mask.nonzero()[0]].unique())
-> 6252 raise KeyError(f"{not_found} not in index")
KeyError: "['Belarus'] not in index"
Another error, but it looks like the same problem as before: this time, Belarus
is not in the DataFrame.
Why is this happening? Well, we assumed that when Copilot generated those lists of countries in each region, it would use the countries in the gapminder_europe
DataFrame. But in reality, we don’t know how it generated those lists. Maybe it was based on data used to train Copilot that had a larger list of countries in Europe, rather than just the countries in the DataFrame. And again, Copilot was not accurately sensitive to the context of our notebook file.
Anyway, we now know how to fix this error, based on what we did above for Malta. But what next? How many more countries did Copilot “hallucinate” were in our DataFrame? We can keep repeating this process until we get no errors, but that seems tedious, and could take a long time (especially if we were working with a larger data set). Let’s see if we can tweak our prompts from above to generate more accurate lists of countries in each region.
More Prompt Engineering#
My first attempt at a prompt was:
# create a list of the countries in Northern Europe, limited to the countries in the gapminder_europe data frame
However, this generated the code below, which is obviously wrong. First of all, it’s circular, in that it’s defining a variable northern_europe
based on selecting rows in the DataFrame that are in a variable with the same name (northern_europe
).
northern_europe = gapminder_europe.loc[northern_europe]
The larger problem with this code is that it’s using the northern_europe
list, but the issue we’re trying to fix is that that list is incorrect. So we need a better prompt. Again, we need to break the problem down into smaller steps. Let’s start with the first step, and see if we can get Copilot to generate the code to create a list of the countries in Northern Europe,as it did before. Then, in a second step, let’s ask it to remove the countries not in the DataFrame. I tried this:
# create a list of countries in northern europe. Then, remove the names of any
# countries in the list that are not in the gapminder_europe data frame
But Copilot only generated code for the first step. So, then I cut the second step from the first prompt, and made it a second step. As you see below, this generated code that ran with no errors.
# create a list of countries in northern europe.
northern_europe = ['Belgium', 'Denmark', 'Finland', 'Iceland', 'Ireland', 'Luxembourg', 'Netherlands', 'Norway', 'Sweden', 'United Kingdom']
# remove the names of any countries in the list northern_europe that are not in the gapminder_europe data frame
northern_europe = [country for country in northern_europe if country in gapminder_europe.index]
Note that the last line of code above is using list comprehension. This is useful and efficient, but it can be a bit challenging to read and understand if you’re not familiar with it.
The list comprehension above is equivalent to this code:
tmp = []
for country in northern_europe:
if country in gapminder_europe.index:
tmp.append(country)
northern_europe = tmp
In this code, we have to create a temporary variable tmp
to store the new results as we iterate through the items in the list northern_europe
. In contrast, list comprehension works all at once, rather than iteratively over each item in the list. Therefore it’s OK to modify the list while we’re iterating over it, because we’re not actually iterating over it.
Of course, with Copilot if the code works, we don’t necessarily need to understand it. But it’s still good practice to actually understand what your code is doing. If Copilot generates code that you don’t understand, you should take the time to learn something new!
Check Your Results#
Even if we understand what the code is doing, we should double-check whether the list got changed! We can check by viewing the list. You could ask Copilot to do this, but sometimes for simple things it’s faster just to type the code yourself:
print(northern_europe)
['Belgium', 'Denmark', 'Finland', 'Iceland', 'Ireland', 'Netherlands', 'Norway', 'Sweden', 'United Kingdom']
Cross-checking this list against the original one, we can see that Luxembourg
is no longer in the list. That suggests that Copilot did indeed remove the countries not in the DataFrame. Let’s try this for the other regions. For completeness – having all the code for one conceptual step in one cell – we’ll copy and paste the Northern Europe code from above, then generate additional prompts for the other regions. Copilot catches on to the repetitive nature of this process quickly, and should generate most of the prompts and code for you. (remember to hit Enter
twice after each line of code that you accept, to get the next prompt). In fact, after Copilot finished generating prompts for each region, it again generated the prompts to create a new region
column in our DataFrame, and populate it with the region labels.
# create a list of countries in northern europe.
northern_europe = ['Belgium', 'Denmark', 'Finland', 'Iceland', 'Ireland', 'Luxembourg', 'Netherlands', 'Norway', 'Sweden', 'United Kingdom']
# remove the names of any countries in the list northern_europe that are not in the gapminder_europe data frame
northern_europe = [country for country in northern_europe if country in gapminder_europe.index]
# create a list of countries in southern europe.
southern_europe = ['Albania', 'Bosnia and Herzegovina', 'Croatia', 'Greece', 'Italy', 'Montenegro', 'Portugal', 'Serbia', 'Slovenia', 'Spain']
# remove the names of any countries in the list southern_europe that are not in the gapminder_europe data frame
southern_europe = [country for country in southern_europe if country in gapminder_europe.index]
# create a list of countries in eastern europe.
eastern_europe = ['Belarus', 'Bulgaria', 'Czech Republic', 'Hungary', 'Poland', 'Romania', 'Slovak Republic']
# remove the names of any countries in the list eastern_europe that are not in the gapminder_europe data frame
eastern_europe = [country for country in eastern_europe if country in gapminder_europe.index]
# create a list of countries in western europe.
western_europe = ['Austria', 'France', 'Germany', 'Switzerland']
# remove the names of any countries in the list western_europe that are not in the gapminder_europe data frame
western_europe = [country for country in western_europe if country in gapminder_europe.index]
# create a new column in the gapminder_europe data frame called 'region'.
# Label each country as belonging to 'Northern Europe', 'Southern Europe',
# 'Eastern Europe', or 'Western Europe'
gapminder_europe['region'] = 'Western Europe'
# label the countries in Northern Europe as 'Northern Europe'
gapminder_europe.loc[northern_europe, 'region'] = 'Northern Europe'
# label the countries in Southern Europe as 'Southern Europe'
gapminder_europe.loc[southern_europe, 'region'] = 'Southern Europe'
# label the countries in Eastern Europe as 'Eastern Europe'
gapminder_europe.loc[eastern_europe, 'region'] = 'Eastern Europe'
This time the code ran without errors. We can spot-check our region labels by getting a random sample of rows from the DataFrame:
# generate a random sample of rows from the gapminder_europe data frame
gapminder_europe.sample(5)
gdpPercap_1952 | gdpPercap_1957 | gdpPercap_1962 | gdpPercap_1967 | gdpPercap_1972 | gdpPercap_1977 | gdpPercap_1982 | gdpPercap_1987 | gdpPercap_1992 | gdpPercap_1997 | gdpPercap_2002 | gdpPercap_2007 | region | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
country | |||||||||||||
Norway | 10095.421720 | 11653.973040 | 13450.401510 | 16361.876470 | 18965.055510 | 23311.349390 | 26298.635310 | 31540.974800 | 33965.661150 | 41283.164330 | 44683.975250 | 49357.190170 | Northern Europe |
Bosnia and Herzegovina | 973.533195 | 1353.989176 | 1709.683679 | 2172.352423 | 2860.169750 | 3528.481305 | 4126.613157 | 4314.114757 | 2546.781445 | 4766.355904 | 6018.975239 | 7446.298803 | Southern Europe |
Bulgaria | 2444.286648 | 3008.670727 | 4254.337839 | 5577.002800 | 6597.494398 | 7612.240438 | 8224.191647 | 8239.854824 | 6302.623438 | 5970.388760 | 7696.777725 | 10680.792820 | Eastern Europe |
France | 7029.809327 | 8662.834898 | 10560.485530 | 12999.917660 | 16107.191710 | 18292.635140 | 20293.897460 | 22066.442140 | 24703.796150 | 25889.784870 | 28926.032340 | 30470.016700 | Western Europe |
Germany | 7144.114393 | 10187.826650 | 12902.462910 | 14745.625610 | 18016.180270 | 20512.921230 | 22031.532740 | 24639.185660 | 26505.303170 | 27788.884160 | 30035.801980 | 32170.374420 | Western Europe |
Since new columns are added to the right side of a DataFrame, you’ll need to scroll to the right to see the new region
column, but it is indeed there, and the labels in the sample are correct. If you want more reassurance, you could re-run the cell to get a new random sample, or increase the number of samples from 5 to a larger number.
You could also modify your prompt to show only the region
column, and then view ta random sample of rows:
# show a random sample of rows from the gapminder_europe data frame, and only the region column
gapminder_europe['region'].sample(5)
country
Croatia Southern Europe
Iceland Northern Europe
France Western Europe
Portugal Southern Europe
Hungary Eastern Europe
Name: region, dtype: object
Note
If you look at the code that Copilot generated above, and compare it to what we used in the pandas lesson previous chapter, you’ll see that it’s quite similar in that both use the gapminder_europe.loc
method to label countries according to region. However, the process of generating it required some trial and error, and some manual intervention. This isn’t actually different from what you would probably be doing – especially as a novice coder – if you were writing the code yourself. But, perhaps ironically, part of the code that we had to prompt Copilot to generate, was to fix errors in the code that it generated.
At the same time, if you look back at the pandas lesson, the lists of countries in each region were defined manually — and in fact there was no explanation as to how those lists were created. Those lists were created manually by someone reviewing the list of country names in the DataFrame, and using their knowledge of geography (or maybe an internet search) to assign the countries to the labels. In contrast, Copilot generated those lists for us (based on the vast corpus of data it was trained on), which saved us a lot of time. The fact that we had to write additional code to clean up the lists is a small price to pay for that time savings. And indeed, cleaning data is a routine part of data science, and it’s good to be able to do it in code, rather than manually.
Critically Evaluating Copilot’s Code#
Although our previous pandas lesson and Copilot both used the gapminder_europe.loc
method to label countries, there is one important difference between what we showed in the previous lesson, and Copilot’s code. Copilot first labelled all countries as Western Europe
, and then changed the label for non-Western countries. In contrast, in the previous lesson we used the same gapminder_europe.loc
method to label Western countries as Copilot did here for the other regions. This illustrates how different code can be used to accomplish the same task.
However, just because two approaches do the same job (or appear to), there is some danger to the approach that Copilot took here. In labelling all the countries first as Western Europe
, the assumption is that the other labels we use will cover all of the countries that are not Western Europe
. However, what if there was a country that was in the DataFrame, but not in any of Copilot’s lists of regions? Those countries would be labeled as Western Europe
, which might be incorrect. In that case, the code Copilot generated would create bugs that would affect the outcome of any analyses applied to the data. This is a good example of why it’s important to read and understand the code you’re using, and not just blindly accept what Copilot generates.
At the very least, a good data scientist will assume the worst of their, or AI-generated, code, and test it to make sure it’s doing what it’s supposed to do. In this case, since the concern is that a country not in Western Europe may get that label, we can check by comparing the countries with that label in the DataFrame, with the list western_europe
.
Using Copilot to Check its Own Work#
Fortunately, Copilot can help us check it’s own work – it even correctly generated a lot of the detailed prompt you see below! The prompt below also shows that you can use complex, multi-step prompts with Copilot, and it will generate code for each step. The difference between the complex prompt here, and some that failed above, is that we clearly end each step with a period, and start the next step as a new sentence.
# list all of the countries in the gapminder_europe data frame labelled as "Western Europe".
# Then compare this list to the contents of the western_europe list. Print the names of any
# countries that are in the western_europe list but not in the gapminder_europe data frame.
print(set(western_europe) - set(gapminder_europe[gapminder_europe['region'] == 'Western Europe'].index))
set()
The output shows that there are no countries labelled as Western Europe
that are not in the western_europe
list. So, Copilot’s code is correct. But, it is important to always check – whether it’s your own code, or AI-generated.
Finally, note that the code that Copilot did generate there is pretty complex. At this point in your development as a coder, do you think you could have written that? And if so, how long do you think it would have taken? Copilot generated that code in a few seconds. This is the promise of AI-assisted coding. At the same time, as a learner and a critical user of AI-generated code, you should take the time to understand what the code is doing, and why. In this case, you can do internet searches for the components of that code that you may not recognize, such as the set()
function and the .index
method (``).
Summary#
GitHub Copilot is a great coding partner. On its own, it often generates errors — including both fatal Python errors, and more sneaky errors that may not be obvious — cases where the code runs without errors, but there are flaws in the logic of the code.
In spite of the risk of errors errors, Copilot can save you time by quickly generating code, and often generating things that would take you significant time to look up.
The best approach to using Copilot is to break down your coding tasks into small steps, and prompt Copilot to generate code for each step. Always be conscious of what you’re trying to do and be clear in the expected outputs of that code.
Always actually read the code that Copilot generates and be sure you understand it. Based on that understanding, confirm the logic of what it’s doing, and ensure it’s what you expect.
Always test the results of the code Copilot generates, to confirm that it’s doing what you expect.
This approach will help you both code more efficiently, and become a better coder yourself. Professional coders have eagerly adopted Copilot due to the efficiencies it creates — but there is a learning curve, and ultimately it is just another tool in your toolbox, not the only tool you need.