# Data Cleaning - Dealing with Outliers

## Contents

# Data Cleaning - Dealing with Outliers#

An important first step in data science is cleaning up your data. This includes identifying and removing data points that may be due to errors in data collection, or other factors that could lead us to make erroneous conclusions about the data. This is *not* the same as “cooking” your data — modifying it specifically to get the results you want. Data cleaning involves a set of methods that are specifically designed to identify and remove data points that are objectively anomalous — that is, you have objective reasons for removing them. Data cleaning steps should always be documented when you write up a study, and should be included in the code you share for analyses so that your procedures are transparent and reproducible.

## Outlier Identification and Removal#

**Outliers** are data points that are “significantly different” from the majority of data points in a data set. “Significantly different” is in quotes here, because this doesn’t not necessarily mean we use statistical testing to define outliers, and also that there several different definitions of “significant” that ca be used to identify outliers.

### Tukey’s method#

John Tukey, who was a driving force in the development of exploratory data analysis, defined outliers as points more than \(1.5 x IQR\). Recall that IQR is the *inter-quartile range*, which captures 50% of the data points in a data set. In a box plot (which Tukey invented), the IQR is the “box”, and 1.5 IQR are the extent of the “whiskers” in each direction from the median. We noted earlier that in box plots, outliers are represented as individual points in the plot, beyond the whiskers. Since Tukey invented box plots and this definition of outliers, it’s not surprising that they are identical. In general, outliers defined in this way will comprise approximately 1% of the data, if the data are approximately normally distributed.

*Image from Jhguch and shared under the Creative Commons Attribution-Share Alike 2.5 Generic license, via Wikimedia Commons*

### Standard scores#

An alternative method is to use standard scores. This is done by applying the ** z transform** to the data. This transform’s formula is:

$z = \frac{x - \mu}{\sigma}$

Where \(x\) is an individual data point, \(\mu\) is the mean of the data set, and \(\sigma\) is the standard deviation. In other words, the *z* transform takes a data set and transforms it so that its mean is zero (by subtracting the mean from every value), and its standard deviation as 1.

Put another way, after we apply a *z* transform to a data set, the numerical values in the data set are not in their original units (be that milliseconds, or dollars, or whatever) but in units of standard deviations. So in a *z* transformed data set, a data point with a value of 1 would have had an original value that was 1 standard deviation from the mean. This is where the “standard” in “standard scores” comes from.

Applying a *z* transform is also called *normalizing* the data set, because the *z* transform assumes that the data come from a normal distribution (also known as a Gaussian distribution, or the *bell curve*). The nice thing about standardizing a data set is that the values have fairly straightforward interpretations. In a normal distribution:

68.27% of all values fall within ±1 standard deviation of the mean (\(1 \sigma\)),

95.45% of all values fall within ±2 standard deviations of the mean (\(2 \sigma\)), and

99.73% of all values fall within ±3 standard deviations of the mean (\(3 \sigma\))

*Image from M. W. Toews, and shared under the Creative Commons Attribution-Share Alike 2.5 Generic license, via Wikimedia Commons*

This means that if we define outliers as those values *z* > 3 or *z* < -3, then we will tend to reject about 0.25% of a data set. Sometimes, *z* ±2.5 \(\sigma\) is used instead. This will tend to reject more data points (typically around 2%).

## Identifying outliers#

First, we’ll load in the same Flanker data set we used in the *Repeated Measures* lesson:

```
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
df = pd.read_csv('data/flanker_rt_data.csv')
```

### Visualizing the distribution#

Let’s generate a box plot of the entire data set, organized by condition. This is the same as the first box plot in the *Repeated Measures* lesson.

```
sns.catplot(kind='box',
data=df,
x='flankers', y='rt')
plt.show()
```

Let’s also look at the distribution of values with a kernel density estimate:

```
sns.displot(kind='kde',
data=df,
x='rt', hue='flankers')
plt.show()
```

### Idenfity outliers in a DataFrame#

To define values based on the IQR, we first need to calculate the IQR. pandas doesn’t have a method for this specifically, but we can use the pandas `.quantile()`

method with the argument 0.25 to reference the lower end of the IQR (the 0.25 quantile means the point below which 25% of data values lie), and 0.75 for the upper end of the IQR. Since IQR is defined as the range between the 25th and 75th quantiles, we can compute it be subtracting the difference between the two:

```
q1 = df['rt'].quantile(0.25)
q3 = df['rt'].quantile(0.75)
iqr = q3 - q1
```

The variable `iqr`

represents the size of the box in the box plot of the data. Recall that our definition of outliers is 1.5 x IQR — the length of the whiskers in the box plot.

We can now define the threshold for values we well call “outliers” (i.e., the length of the whiskers) based on `iqr`

:””

```
out_low = q1 - 1.5 * iqr
out_high = q3 + 1.5 * iqr
```

So `out_low`

(the end of the bottom whisker) is 1.5 x IQR below q1 (the lower end of the box), and likewise for `out_high`

.

Recall from the pandas lesson that we can select values in a DataFrame using a Boolean mask. We can do that here to identify the rows of the DataFrame where the RT values are lower than 1.5 x IQR:

```
df.loc[:, 'rt'] < out_low
```

```
0 False
1 False
2 False
3 False
4 False
...
4284 False
4285 False
4286 False
4287 False
4288 False
Name: rt, Length: 4289, dtype: bool
```

The result will have values of `True`

for those data points that are “low” outliers, and `False`

for all other data points. We can count the number of these by passing the mask to `np.sum()`

(since `True`

= `1`

and `False`

= `0`

)

```
np.sum(df.loc[:, 'rt'] < out_low)
```

```
10
```

We can do the same to identify outliers at the high end of the range of values, using the `>`

operator this time:

```
np.sum(df.loc[:, 'rt'] > out_high)
```

```
110
```

Consistent with what we saw in the box plot, there are more outliers at the high (slow RT) end of the distribution than at the low (fast RT) end.

#### Marking data points as outliers#

We’ll worry about what to do with outliers later — after we’ve described standard scores. For now, let’s make a column in `df`

called `outliers_tukey`

that is a Boolean mask: values of `True`

in rows with outlier RT values, and `False`

for all other rows.

To do this, we want to combine the two masking operations above `< out_low`

and `> out_high`

. We can do that using the OR `|`

operator, as long as we include each of our masking operations in parentheses (this last bit is important, but easy to forget!):

```
(df.loc[:, 'rt'] < out_low) | (df.loc[:, 'rt'] > out_high)
```

```
0 False
1 False
2 False
3 False
4 False
...
4284 False
4285 False
4286 False
4287 False
4288 True
Name: rt, Length: 4289, dtype: bool
```

It’s a bit hard to confirm that this works from the output above, but if we pass that whole expression to `np.sum()`

we can see that the number of all `True`

values is the same as the total of the `out_low`

and `out_high`

values reported above:

```
np.sum((df.loc[:, 'rt'] < out_low) | (df.loc[:, 'rt'] > out_high))
```

```
120
```

To actually use this expression to create a new column, we just have to assign the result to a new column name in `df`

:

```
df['outliers_tukey'] = (df.loc[:, 'rt'] < out_low) | (df.loc[:, 'rt'] > out_high)
df.head()
```

participant | block | trial | flankers | rt | outliers_tukey | |
---|---|---|---|---|---|---|

0 | s25 | 1 | 1 | congruent | 0.652072 | False |

1 | s25 | 1 | 2 | neutral | 0.644872 | False |

2 | s25 | 1 | 3 | congruent | 0.445182 | False |

3 | s25 | 1 | 4 | neutral | 0.525200 | False |

4 | s25 | 1 | 5 | congruent | 0.533328 | False |

And double-check the number, using the pandas `.sum()`

method:

```
df['outliers_tukey'].sum()
```

```
120
```

### Standard scores (*z* transform)#

We can convert scores to standard scores very easily, using the `zscore()`

function in `scipy.stats`

:

```
from scipy import stats
df['rt_z'] = stats.zscore(df['rt'])
df.head()
```

participant | block | trial | flankers | rt | outliers_tukey | rt_z | |
---|---|---|---|---|---|---|---|

0 | s25 | 1 | 1 | congruent | 0.652072 | False | 1.526800 |

1 | s25 | 1 | 2 | neutral | 0.644872 | False | 1.456483 |

2 | s25 | 1 | 3 | congruent | 0.445182 | False | -0.493736 |

3 | s25 | 1 | 4 | neutral | 0.525200 | False | 0.287737 |

4 | s25 | 1 | 5 | congruent | 0.533328 | False | 0.367119 |

Now we can identify outliers as those rows where `rt_z > 2.5`

or `rt_z < -2.5`

. We’ll define a variable `z_thresh`

because that way we can later change it to a different value (like 3) if we desire.

```
z_thresh = 2.5
(df['rt_z'] > z_thresh) | (df['rt_z'] < z_thresh)
```

```
0 True
1 True
2 True
3 True
4 True
...
4284 True
4285 True
4286 True
4287 True
4288 True
Name: rt_z, Length: 4289, dtype: bool
```

More simply, we can take the absolute value of *z* (which ignores the `-`

sign):

```
np.abs(df['rt_z']) > z_thresh
```

```
0 False
1 False
2 False
3 False
4 False
...
4284 False
4285 False
4286 False
4287 False
4288 True
Name: rt_z, Length: 4289, dtype: bool
```

The number of outliers identified this way is:

```
np.sum(np.abs(df['rt_z']) > z_thresh)
```

```
111
```

You’ll note that there are actually fewer outliers defined this way as using the 1.5 x IQR (Tukey) method. Why is this? As noted earlier, Tukey’s threshold and *z* >= 2.5 are only approximately the same, so we expect minor differences. However, we noted that the 1.5 x IQR range is a bit *wider* than *z* > 2.5, meaning it should capture more data points and *fewer* values should be identified as outliers by Tukey’s method. Yet yere, more are. The reason for this is that both Tukey’s and the standard score methods assume our data are normally distributed, when in fact they are skewed, with more values at the high (slow RT) end than at the low (fast RT) end. Since the data don’t fit the assumptions of the methods we used, we get unexpected results.

#### Write column of standard score outliers#

We’ll worry about skewness a little later, but for now we will add a column marking outliers as defined this way, similarly to what we did above for the Tukey method:

```
df['outliers_z'] = np.abs(df['rt_z']) > z_thresh
df.sample(6)
```

participant | block | trial | flankers | rt | outliers_tukey | rt_z | outliers_z | |
---|---|---|---|---|---|---|---|---|

968 | s24 | 1 | 10 | congruent | 0.468974 | False | -0.261377 | False |

1284 | s2 | 1 | 13 | neutral | 0.533529 | False | 0.369081 | False |

981 | s24 | 1 | 23 | neutral | 0.820825 | True | 3.174884 | True |

3524 | s20 | 2 | 1 | congruent | 0.493003 | False | -0.026707 | False |

682 | s11 | 2 | 11 | neutral | 0.437073 | False | -0.572931 | False |

211 | s18 | 2 | 20 | congruent | 0.422508 | False | -0.715177 | False |

The above `.sample()`

is not likely to happen to sample outlier data points, but we can confirm their presence with:

```
df[df['outliers_z'] == True]
```

participant | block | trial | flankers | rt | outliers_tukey | rt_z | outliers_z | |
---|---|---|---|---|---|---|---|---|

26 | s25 | 1 | 27 | incongruent | 0.757437 | True | 2.555820 | True |

37 | s25 | 2 | 6 | neutral | 0.917223 | True | 4.116328 | True |

117 | s25 | 4 | 22 | incongruent | 0.817285 | True | 3.140311 | True |

227 | s18 | 3 | 4 | neutral | 0.806528 | True | 3.035256 | True |

350 | s16 | 1 | 32 | incongruent | 0.762060 | True | 2.600971 | True |

... | ... | ... | ... | ... | ... | ... | ... | ... |

4208 | s15 | 3 | 16 | incongruent | 0.897641 | True | 3.925088 | True |

4231 | s15 | 4 | 7 | congruent | 0.881503 | True | 3.767480 | True |

4262 | s15 | 5 | 6 | incongruent | 0.853477 | True | 3.493770 | True |

4279 | s15 | 5 | 23 | neutral | 0.845636 | True | 3.417194 | True |

4288 | s15 | 5 | 32 | congruent | 0.789597 | True | 2.869903 | True |

111 rows × 8 columns

## Why are outliers a problem?#

Outliers are a problem because they have disproportionate *leverage* on statistical properties of the data, such as the mean and variance. We can see this by creating a small, simulated data set from a normal distribution, then replacing one value with either an extremely different or just slightly different data point. Let’s do this, and compare the means and standard deviations in each case. We use NumPy’s `np.random.default_rng().normal()`

function to do this:

```
# Create array with mean=0, standard deviation=1, and 10 data values
data = np.random.default_rng().normal(0, 1, 10)
print("Data:", str(data))
print("Mean of data with no outliers:", str(round(np.mean(data), 5)))
print("Standard deviation of data with no outliers:", str(round(np.std(data), 5)))
```

```
Data: [ 0.25954254 0.36116305 -1.02546356 -0.56279951 0.09641304 0.35646715
-0.60834364 1.63139323 0.38277995 0.4364362 ]
Mean of data with no outliers: 0.13276
Standard deviation of data with no outliers: 0.69963
```

```
data_not_extreme = np.copy(data)
# replace last value in array with outlier
data_not_extreme[-1] = 3
print("Data:", str(data_not_extreme))
print("Mean of data with minor outlier:", str(round(np.mean(data_not_extreme), 5)))
print("Standard deviation of data with minor outlier:", str(round(np.std(data_not_extreme), 5)))
```

```
Data: [ 0.25954254 0.36116305 -1.02546356 -0.56279951 0.09641304 0.35646715
-0.60834364 1.63139323 0.38277995 3. ]
Mean of data with minor outlier: 0.38912
Standard deviation of data with minor outlier: 1.11204
```

```
data_extreme = np.copy(data)
# replace last value in array with outlier
data_extreme[-1] = 10
print("Data:", str(data_extreme))
print("Mean of data with extreme outlier:", str(round(np.mean(data_extreme), 5)))
print("Standard deviation of data with extreme outlier:", str(round(np.std(data_extreme), 5)))
```

```
Data: [ 0.25954254 0.36116305 -1.02546356 -0.56279951 0.09641304 0.35646715
-0.60834364 1.63139323 0.38277995 10. ]
Mean of data with extreme outlier: 1.08912
Standard deviation of data with extreme outlier: 3.0499
```

From the above we can see that a larger outlier has a larger impact on both the mean and variance of the data. Also remember, when looking at the information above, that the true population mean of the data (as specified in the arguments to the random number generator) is 0 and the true standard deviation is 1. Using this normal distribution makes it easy to see that outliers bias our statistics away from the true population mean and variance.

The impact of outliers is often less extreme — in the above examples, our data set comprised only 10 values, and so making one an outlier meant that 10% of data points were outliers — whereas by the definitions above, we would normally expect only 1–2% of data points to be outliers.

If we re-run the simulations with a data set of 100 values and one outlier, you can see that the impact of even an extreme outlier is much reduced:

```
# Create array with mean=0, standard deviation=1, and 10 data values
data = np.random.default_rng().normal(0, 1, 100)
print("Data:", str(data))
print("Mean of data with no outliers:", str(round(np.mean(data), 5)))
print("Standard deviation of data with no outliers:", str(round(np.std(data), 5)))
```

```
Data: [ 1.79175605 -0.13788061 0.18860817 -0.64303306 0.77041065 0.49992583
0.20181919 -1.03533068 1.51880189 0.62223574 -0.02498828 1.86274493
0.83092308 -0.60835739 0.51663178 1.26417624 0.22164047 -1.07434415
-1.59784203 0.23141663 -1.23426212 0.9822327 -0.00617076 -0.34620639
0.83003377 -0.6381199 -0.89787135 1.01044357 1.07958402 -1.18069111
-0.13491858 -0.97316477 0.84339441 1.09909136 1.79749871 0.7877921
-0.47598965 1.96998548 -0.54075773 -0.78278793 0.06031915 1.26001383
-0.80217383 0.61627246 0.90049405 -1.91922491 -1.00304275 -0.71810077
0.13691494 -0.21624961 0.03415257 -0.19710905 1.259412 -0.56357804
0.82372467 1.99971872 -0.61135225 -0.6566064 -0.86792126 0.04889141
-0.4717682 -0.13676944 0.27381307 0.25144976 -0.06409195 0.07517452
-0.10256327 -1.41001619 0.15468849 -2.77737203 -0.37899992 -1.40682385
-0.99621654 0.50603459 0.50086719 -0.75261271 0.37754644 -0.61457792
-0.60959798 -0.44516364 0.77810344 0.87572161 0.64272727 0.29696818
-1.40888214 -0.73510556 -0.53209524 -1.10223436 2.15786526 1.16346519
0.32525076 -2.05483367 -0.90650126 -0.57652673 -0.2867165 -0.26157165
0.45398846 -0.7592461 -1.26273333 1.15147649]
Mean of data with no outliers: -0.01895
Standard deviation of data with no outliers: 0.95845
```

```
data_extreme = np.copy(data)
# replace last value in array with outlier
data_extreme[-1] = 10
print("Data:", str(data_extreme))
print("Mean of data with extreme outlier:", str(round(np.mean(data_extreme), 5)))
print("Standard deviation of data with extreme outlier:", str(round(np.std(data_extreme), 5)))
```

```
Data: [ 1.79175605e+00 -1.37880611e-01 1.88608167e-01 -6.43033064e-01
7.70410653e-01 4.99925834e-01 2.01819192e-01 -1.03533068e+00
1.51880189e+00 6.22235740e-01 -2.49882759e-02 1.86274493e+00
8.30923076e-01 -6.08357395e-01 5.16631783e-01 1.26417624e+00
2.21640467e-01 -1.07434415e+00 -1.59784203e+00 2.31416627e-01
-1.23426212e+00 9.82232701e-01 -6.17076195e-03 -3.46206393e-01
8.30033772e-01 -6.38119903e-01 -8.97871347e-01 1.01044357e+00
1.07958402e+00 -1.18069111e+00 -1.34918580e-01 -9.73164773e-01
8.43394415e-01 1.09909136e+00 1.79749871e+00 7.87792102e-01
-4.75989651e-01 1.96998548e+00 -5.40757726e-01 -7.82787927e-01
6.03191504e-02 1.26001383e+00 -8.02173834e-01 6.16272458e-01
9.00494046e-01 -1.91922491e+00 -1.00304275e+00 -7.18100769e-01
1.36914943e-01 -2.16249607e-01 3.41525687e-02 -1.97109046e-01
1.25941200e+00 -5.63578045e-01 8.23724670e-01 1.99971872e+00
-6.11352247e-01 -6.56606400e-01 -8.67921259e-01 4.88914135e-02
-4.71768203e-01 -1.36769443e-01 2.73813074e-01 2.51449757e-01
-6.40919476e-02 7.51745201e-02 -1.02563267e-01 -1.41001619e+00
1.54688494e-01 -2.77737203e+00 -3.78999916e-01 -1.40682385e+00
-9.96216536e-01 5.06034587e-01 5.00867187e-01 -7.52612713e-01
3.77546438e-01 -6.14577915e-01 -6.09597982e-01 -4.45163645e-01
7.78103441e-01 8.75721608e-01 6.42727266e-01 2.96968179e-01
-1.40888214e+00 -7.35105559e-01 -5.32095235e-01 -1.10223436e+00
2.15786526e+00 1.16346519e+00 3.25250760e-01 -2.05483367e+00
-9.06501260e-01 -5.76526735e-01 -2.86716503e-01 -2.61571650e-01
4.53988462e-01 -7.59246105e-01 -1.26273333e+00 1.00000000e+01]
Mean of data with extreme outlier: 0.06954
Standard deviation of data with extreme outlier: 1.37873
```

The impact of an individual outlier is thus dependent on the size of the data set, and indeed in many data sets comprising a large number of data points, the impact of one outlier (or even 1% of the data values being outliers) can be quite minimal.

However, the impact of an outlier is also proportionate to its size, relative to the other values in the data set. Here we use 100 as the anomalous value, rather than 10, again in a data set with 100 values:

```
data_extreme = np.copy(data)
# replace last value in array with outlier
data_extreme[-1] = 100
print("Data:", str(data_extreme))
print("Mean of data with extreme outlier:", str(round(np.mean(data_extreme), 5)))
print("Standard deviation of data with extreme outlier:", str(round(np.std(data_extreme), 5)))
```

```
Data: [ 1.79175605e+00 -1.37880611e-01 1.88608167e-01 -6.43033064e-01
7.70410653e-01 4.99925834e-01 2.01819192e-01 -1.03533068e+00
1.51880189e+00 6.22235740e-01 -2.49882759e-02 1.86274493e+00
8.30923076e-01 -6.08357395e-01 5.16631783e-01 1.26417624e+00
2.21640467e-01 -1.07434415e+00 -1.59784203e+00 2.31416627e-01
-1.23426212e+00 9.82232701e-01 -6.17076195e-03 -3.46206393e-01
8.30033772e-01 -6.38119903e-01 -8.97871347e-01 1.01044357e+00
1.07958402e+00 -1.18069111e+00 -1.34918580e-01 -9.73164773e-01
8.43394415e-01 1.09909136e+00 1.79749871e+00 7.87792102e-01
-4.75989651e-01 1.96998548e+00 -5.40757726e-01 -7.82787927e-01
6.03191504e-02 1.26001383e+00 -8.02173834e-01 6.16272458e-01
9.00494046e-01 -1.91922491e+00 -1.00304275e+00 -7.18100769e-01
1.36914943e-01 -2.16249607e-01 3.41525687e-02 -1.97109046e-01
1.25941200e+00 -5.63578045e-01 8.23724670e-01 1.99971872e+00
-6.11352247e-01 -6.56606400e-01 -8.67921259e-01 4.88914135e-02
-4.71768203e-01 -1.36769443e-01 2.73813074e-01 2.51449757e-01
-6.40919476e-02 7.51745201e-02 -1.02563267e-01 -1.41001619e+00
1.54688494e-01 -2.77737203e+00 -3.78999916e-01 -1.40682385e+00
-9.96216536e-01 5.06034587e-01 5.00867187e-01 -7.52612713e-01
3.77546438e-01 -6.14577915e-01 -6.09597982e-01 -4.45163645e-01
7.78103441e-01 8.75721608e-01 6.42727266e-01 2.96968179e-01
-1.40888214e+00 -7.35105559e-01 -5.32095235e-01 -1.10223436e+00
2.15786526e+00 1.16346519e+00 3.25250760e-01 -2.05483367e+00
-9.06501260e-01 -5.76526735e-01 -2.86716503e-01 -2.61571650e-01
4.53988462e-01 -7.59246105e-01 -1.26273333e+00 1.00000000e+02]
Mean of data with extreme outlier: 0.96954
Standard deviation of data with extreme outlier: 9.99829
```

Now the impact of this 1% of the data is much more extreme. In some data sets, it is unlikely or even impossible to have very extreme outliers. For example, in the Flanker study we’re working with, the lower bound of RT is 0, and the upper bound is 1000 ms, because if a participant didn’t make a response iwthin 1 s of the stimulus, the trial automatically ended without recording a response. However, in some data sets more extreme values are possible. For example, a technical problem with a neuroimaging system, such as EEG or single-unit recording electrodes, could result in very extreme values relative to normal physiologically-plausible values.

## Which method to use?#

If you look at the image idescribing the Tukey method above, you can see that the range of the distribution covered by \(1.5 x IQR\) is just slightly larger than 2.5 \(\sigma\). So, the Tukey method is roughly equivalent to using a *z* threshold of ±2.5. The \(1.5 x IQR\) range is slightly wider than *z* = 2.5, and so may on average identify slightly fewer data points as outliers, but the result is probably negligible in practice. However, many researchers use *z* > 3 as their threshold for outliers, which will tend to identify fewer outliers than the Tukey method.

Since box plots are a great — and well-known — way of visualizing data distributions, there is a logic to using the \(1.5 x IQR\) method to identify outliers since the results will be consistent. However, the approximate equivalence between the Tukey method and *z* = 2.5 is *only true if the data are normally distributed*. For skewed data, like RT data, this method may exclude too many outliers. In particular, it will tend to exclude more long RT values than short RT values, even though many of the long RT values are likely valid (especially in a case like the present experiment, where trials were cut off at 1 s, preventing really long RTs).

### An empirical study addressing outlier removal with RT data#

Berger and Kiefer (2021) performed a study in which they simulated many samples of reaction times, and introduced either more- or less-extreme outliers into the data. Essentially this was a more sophisticated version of the simulations shown above, and using the natural properties of RT data, and simulating an experiment with different conditions. Berger and Kiefer compared Tukey’s method, the *z* transform method using thresholds of both 2 and 3, and several other methods, as well as simply leaving outliers in the data. Since this was simulated data, the authors knew the true values of the mean and standard deviation in each data set, which allowed them to compare each method to the “truth” — something we never have the luxury of knowing in experimental psychology and neuroscience.

Berger and Kiefer found that the *z* transform method was the best choice. With RT data, as noted above, Tukey’s method excluded too many data points on the long-RT side of the distribution. This had two negative impacts: firstly, it made the data appear more normally distributed than it really was, and secondly, it biased the statistical analyses towards false positives (Type 1 errors) — finding statistically significant differences between conditions when they were not present in the “true” data. Berger and Kiefer also observed that *not* rejecting outliers was the worst choice. The outliers led to a high rate of false negative results (Type 2 errors), meaning that true differences between conditions were not found to be statistically significant, due to the presence of outliers.

The *z* transform was not perfect (and we cannot expect any method to be), but it was more balanced in not rejecting disproportionately more long RTs than short RTs, rejecting fewer data points overall, and biasing the statistical analyses the least. Since in general, with experimental data we want to only exclude truly anomalous data points, I suggest using a *z* threshold of 3 for outlier removal. Using a threshold of 2 or 2.5 will naturally exclude more data points, increasing the likelihood that valid data is excluded.

It’s also important to note that trying different outlier removal methods on a real data set, or even different *z* thresholds to define outliers, introduces a component of *researcher bias* (or *researcher degrees of freedom*) into the analysis. Once you start doing this, it because easy to choose a method based on what “looks right” — which is, ultimately, what the researcher *thinks* the data should look like, not the true nature of the data. For this reason, parameters such as outlier definition method and threshold should be chosen prior to performing the analyses (ideally, preregistering the plan even before running the experiment).

## What to do with outliers#

We have three basic options for outliers:

Ignore them — just include them in the data and all analyses

Remove them completely from the DataFrame

Replace them with non-outlier values

### 1. Ignore them#

Some scientists fundamentally do not believe in rejecting outliers. In working with data from human or animal participants, one can argue that any behavioural response is possibly valid, and outliers may represent interesting data points for any number of reasons (e.g., an exceptionally slow RT may mean the person spent a lot of time thinking about the stimulus and trying to decide on a response).

Another reason to consider not removing outliers (or using some other definition of outliers) is if the data are not actually normally distributed. Indeed, the RT data we have been working with are not normally distributed, they are skewed - an issue which we will address below. Skew is a relatively minor violation of normality. More generally, although the normal distribution is very common in nature, not all data is normally distributed, and so it is worth examining the distribution of your data (one reason we spend so much time looking at box plots, histograms, and other visualizations of distributions).

On the other hand, outliers can arise for reasons that really aren’t of experimental interest (McCormack & Wright, 1964). For instance, in the present data set we have a few very short RTs from true measurement error. In the box plot of the data, we can see 2–3 data points in each flanker condition below 100–200 ms. It is nearly impossible for a typical human to respond to a stimulus in less than this amount of time, particularly when some decision (like which way an arrow is pointing) is required first (Luce, 1991; Whelan, 2008). So rather than being scientifically interesting, these outliers probably reflect trials when the participant accidentally pressed a response button. Likewise, exceptionally long RTs could reflect trials on which the participant thought they pressed the button then realized they didn’t, trials where the participant got distracted, or sneezed and didn’t see the stimulus, etc.. As noted above, Berger and Kiefer (2021) found that not rejecting outliers decreased the likelihood of finding statistically significant differences, when those differences were actually present in the data.

My general advice is not to ignore outliers, unless you have strong, principled reasons as to why outliers might be interesting (or not actually outliers, if the distribution is not normal) and how you would interpret them.

### 2. Remove them from the data set#

Removing data points — known as **trimming** or **truncation** — is acceptable in some cases, but problematic in others. For repeated measures data, it’s often fine to remove outliers at the individual participant level, since we have many other samples of the measurement (i.e., the repeated measures, such as trials). On the other hand, if we only have one measurement for a particular “unit” of the experimental design (e.g., one measurement from each participant, in each condition), then we might not want to completely lose that data point. This is especially true for data that will be analyzed statistically with a *t* test or ANOVA, because these methods cannot handle missing values.

### 3. Replace them with non-outlier values#

Another approach to dealing with outliers is to replace them with a less-outlying value. This could be the mean value, or the most extreme value that is not considered an outlier. Replacing an outlier with the mean value basically “neutralizes” that data point, since it is equivalent to the mean it doesn’t alter the mean of the data points. However, it does artificially reduce the variance in the data a bit, because variance is, by definition, the amount by which any data point differs from the mean. But if you are only replacing 1–2% of your data values, the impact on variance should be fairly small. Also note that, properly, the mean that is used as the replacement value should be computed only from all of the *non-outlier* data values. This is because a major reason to replace or remove outliers in the first place is that they have disproportionate influence on the mean.

Replacing the outlier with the most-extreme non-outlier value will preserve variance contributed by that data point, but reduce the variance relative to retaining the outlier value. Likewise, the impact on the mean will not be neutral, but it will be rather small. Note that an assumption of replacing the outlier value with another, relatively extreme (relative to the mean), value is that the outlier was actually a valid measurement, perhaps with some noise in it. Since we don’t know for sure what the true value of the data was, in principle it seems better to replace outliers with the mean.

## Replacing outliers with mean values#

Based on the information above, we will approach outliers in the Flanker data set by applying a *z* transform, defining outliers as those values with \(\lvert z \lvert \gt 3\), and replace these values with the mean (where the mean is calculated excluding the outlier values).

First, as we did above, we’ll define `z_thresh`

as 3 and fill the `outliers_z`

column with `True`

where \(\lvert z \lvert \gt 3\) and `False`

for all other rows.

```
z_thresh = 3
df['outliers_z'] = np.abs(df['rt_z']) > z_thresh
```

Next we’ll create a column containing the mean RT. To exclude the outlier values in computing this, we select only the rows where `outliers_z`

is False:

```
df['rt_mean'] = df[df['outliers_z'] == False]['rt'].mean()
df.head()
```

participant | block | trial | flankers | rt | outliers_tukey | rt_z | outliers_z | rt_mean | |
---|---|---|---|---|---|---|---|---|---|

0 | s25 | 1 | 1 | congruent | 0.652072 | False | 1.526800 | False | 0.49191 |

1 | s25 | 1 | 2 | neutral | 0.644872 | False | 1.456483 | False | 0.49191 |

2 | s25 | 1 | 3 | congruent | 0.445182 | False | -0.493736 | False | 0.49191 |

3 | s25 | 1 | 4 | neutral | 0.525200 | False | 0.287737 | False | 0.49191 |

4 | s25 | 1 | 5 | congruent | 0.533328 | False | 0.367119 | False | 0.49191 |

Finally, we fill a new column, `rt_clean`

, with either the original RT value (for non-outliers), or the mean RT (for outliers). The way that we create a new pandas column, contingent on the values of another column, is using the `np.where()`

function. This is essentially a way of running an `if-else`

conditional statement on each row in a pandas column. `np.where()`

takes three arguments, which are (in order), the conditional statement, the value to fill in the new column if the conditional is true, and finally the value to fill in if the conditional is false.

So the code below tests, for each row, whether `outliers_z`

is `True`

. If it is (i.e., an outlier), the mean RT is filled in; otherwise the original RT value for that trial is copied to the `clean_rt`

column. The three arguments to `np.where()`

are each shown on their own line, to make it a little more readable:

```
df['rt_clean'] = np.where(df['outliers_z'] == True,
df['rt_mean'],
df['rt'])
df.head()
```

participant | block | trial | flankers | rt | outliers_tukey | rt_z | outliers_z | rt_mean | rt_clean | |
---|---|---|---|---|---|---|---|---|---|---|

0 | s25 | 1 | 1 | congruent | 0.652072 | False | 1.526800 | False | 0.49191 | 0.652072 |

1 | s25 | 1 | 2 | neutral | 0.644872 | False | 1.456483 | False | 0.49191 | 0.644872 |

2 | s25 | 1 | 3 | congruent | 0.445182 | False | -0.493736 | False | 0.49191 | 0.445182 |

3 | s25 | 1 | 4 | neutral | 0.525200 | False | 0.287737 | False | 0.49191 | 0.525200 |

4 | s25 | 1 | 5 | congruent | 0.533328 | False | 0.367119 | False | 0.49191 | 0.533328 |

Although the first few rows in this DataFrame were not identified as outliers, we can confirm that this procedure worked by filtering the DataFrame to only show us rows where `outliers_z`

is `True`

:

```
df[df['outliers_z'] == True]
```

participant | block | trial | flankers | rt | outliers_tukey | rt_z | outliers_z | rt_mean | rt_clean | |
---|---|---|---|---|---|---|---|---|---|---|

37 | s25 | 2 | 6 | neutral | 0.917223 | True | 4.116328 | True | 0.49191 | 0.49191 |

117 | s25 | 4 | 22 | incongruent | 0.817285 | True | 3.140311 | True | 0.49191 | 0.49191 |

227 | s18 | 3 | 4 | neutral | 0.806528 | True | 3.035256 | True | 0.49191 | 0.49191 |

518 | s22 | 2 | 7 | incongruent | 0.817863 | True | 3.145955 | True | 0.49191 | 0.49191 |

841 | s8 | 2 | 11 | congruent | 0.056265 | True | -4.291995 | True | 0.49191 | 0.49191 |

... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |

4195 | s15 | 3 | 3 | congruent | 0.853705 | True | 3.495996 | True | 0.49191 | 0.49191 |

4208 | s15 | 3 | 16 | incongruent | 0.897641 | True | 3.925088 | True | 0.49191 | 0.49191 |

4231 | s15 | 4 | 7 | congruent | 0.881503 | True | 3.767480 | True | 0.49191 | 0.49191 |

4262 | s15 | 5 | 6 | incongruent | 0.853477 | True | 3.493770 | True | 0.49191 | 0.49191 |

4279 | s15 | 5 | 23 | neutral | 0.845636 | True | 3.417194 | True | 0.49191 | 0.49191 |

66 rows × 10 columns

### Reporting outlier rejections#

In any report describing your data, you should state:

the method you used to identify outliers

how you dealt with outliers

what proportion of the data were identified as outliers, and thus treated

To get this last value, we can simply compute the number of `True`

values in `outliers_z`

, divided by the total number of trials (rows):

```
outliers = df['outliers_z'].sum() / df['outliers_z'].count()
# Convert proportion to percent, and round to 2 decimal places
print(str(round(outliers * 100, 2)), "% of the data identified as outliers and replaced with mean value.")
```

```
1.54 % of the data identified as outliers and replaced with mean value.
```

## Working with nested data#

All of the above was correct, for data without nesting or repeated measures. The reason for this caveat is that with nested data, what constitutes an “outlier” for one source of data (e.g., one participant) may be a very reasonable value for another person. This is perhaps most intuitive if we think about an experiment comparing older and younger adults. With age, RTs on average get longer. So slow RT that is an outlier for an 18 year old university student may be about the average RT for someone who is 80 years old.

To properly address the nested structure of data in identifying and correcting outliers, we can use the same steps as above, but using the split-apply-combine approach provided by pandas `.groupby()`

method. In other words, we split the data by participant, apply the outlier detection and correction steps, and then combine the data into a new column in the DataFrame.

The code below is probably the most complex that we’ve worked with so far. Take your time going through this, and try to understand it as fully as you can. It may seem like a convoluted series of steps, but they are a logical sequence and each one is optimized for pandas. This is in fact pandas’ suggested approach for outlier correction

Since we’ve added lots of columns above, let’s start fresh by re-loading the DataFrame from the file:

```
df = pd.read_csv('data/flanker_rt_data.csv')
```

### Compute *z* scores by participant#

Now let’s split by participant and apply the *z* transform to the split data. Pandas has a useful method, `.transform()`

, that allows us to run a non-pandas function on a selection from a DataFrame. So we group by participant, select only the `rt`

column to operate on, and then `.transform(stats.zscore)`

and assign this to a new column in the DataFrame, `rt_z`

. Note that when we pass a function to `.apply()`

, we *do not* put parentheses after the name of the function we are using. That is, if we were running `stats.zscore`

on its own (see above), we would use `stats.zscore(df['rt'])`

. But since the we have already selected the data that we are passing with `.transform()`

, we don’t use the parentheses with the function call.

```
df['rt_z'] = df.groupby('participant')['rt'].transform(stats.zscore)
df.sample(12)
```

participant | block | trial | flankers | rt | rt_z | |
---|---|---|---|---|---|---|

2075 | s9 | 1 | 16 | incongruent | 0.393590 | -0.570383 |

1308 | s2 | 2 | 5 | neutral | 0.434128 | -0.549747 |

1735 | s13 | 5 | 19 | neutral | 0.480952 | 0.398947 |

4262 | s15 | 5 | 6 | incongruent | 0.853477 | 2.777256 |

2666 | s4 | 5 | 1 | neutral | 0.460397 | 0.763855 |

37 | s25 | 2 | 6 | neutral | 0.917223 | 4.631054 |

797 | s11 | 5 | 31 | neutral | 0.425538 | -0.345903 |

3759 | s17 | 4 | 13 | neutral | 0.394254 | -1.094923 |

1155 | s21 | 2 | 11 | neutral | 0.566813 | 1.201670 |

2650 | s4 | 4 | 17 | incongruent | 0.377910 | -0.762814 |

922 | s8 | 4 | 28 | congruent | 0.477410 | 0.131951 |

3269 | s27 | 3 | 31 | neutral | 0.486155 | 0.283878 |

### Compute mean RT from valid trials for each participant#

Likewise, we need to compute the mean RT within participants, so again we could split by participant and then compute the within-participant mean, this time passing `np.mean`

to `.transform()`

:

```
df['rt_mean'] = df.groupby('participant')['rt'].transform(np.mean)
```

But actually, doing the above computes the mean across all trials for each participant — including the outlier trials. We can compute the mean across only the non-outlier trials by adding a filter on `df`

prior to the `groupby`

operation:

```
df['rt_mean'] = df[np.abs(df['rt_z']) <= z_thresh].groupby('participant')['rt'].transform(np.mean)
df.head()
```

participant | block | trial | flankers | rt | rt_z | rt_mean | |
---|---|---|---|---|---|---|---|

0 | s25 | 1 | 1 | congruent | 0.652072 | 1.822298 | 0.475143 |

1 | s25 | 1 | 2 | neutral | 0.644872 | 1.746028 | 0.475143 |

2 | s25 | 1 | 3 | congruent | 0.445182 | -0.369295 | 0.475143 |

3 | s25 | 1 | 4 | neutral | 0.525200 | 0.478337 | 0.475143 |

4 | s25 | 1 | 5 | congruent | 0.533328 | 0.564439 | 0.475143 |

### Fill in mean values for outlier rows#

There’s a catch with this however — since we grouped *filtered* `np.abs(df['rt_z']) <= z_thresh`

data, this only fills in the *non-outlier* rows with mean RT, because those are the only rows that get passed to the `.transform()`

method. The `mean_RT`

rows where *z* exceeds outlier threshold are filled with `NaN`

, since pandas doesn’t know what else to do with them. We can see this if we ask for a sample of just the rows that exceed the outlier threshold:

```
# view sample of rows to confirm RT varies by participant
df[df['rt_z'] > z_thresh].sample(10)
```

participant | block | trial | flankers | rt | rt_z | rt_mean | |
---|---|---|---|---|---|---|---|

1524 | s1 | 3 | 32 | incongruent | 0.777096 | 4.349914 | NaN |

350 | s16 | 1 | 32 | incongruent | 0.762060 | 3.576242 | NaN |

1151 | s21 | 2 | 7 | incongruent | 0.750666 | 3.288323 | NaN |

3058 | s23 | 2 | 10 | congruent | 0.953989 | 3.714217 | NaN |

3118 | s23 | 4 | 8 | neutral | 0.919488 | 3.395270 | NaN |

227 | s18 | 3 | 4 | neutral | 0.806528 | 3.368613 | NaN |

2016 | s12 | 4 | 20 | neutral | 0.814067 | 3.895140 | NaN |

2695 | s4 | 5 | 30 | incongruent | 0.657840 | 4.418124 | NaN |

1078 | s24 | 4 | 28 | neutral | 0.888508 | 3.114528 | NaN |

1878 | s7 | 5 | 8 | incongruent | 0.965726 | 3.095147 | NaN |

To fix this, we need to group the data by participant again, except this time not excluding outlier rows. Then, we need to replace all NaNs in the `rt_mean`

column with the mean RT for that participant.

To do this, we run another `.groupby()`

+ `.transform()`

operation, but this time applying a more complex transformation using a **lambda function**. Lambda functions are basically a way of writing a small Python function inside a command. They are used when what you need to do is a bit more complex or specific than can be done with an existing method or function.

In this case, what we want to do is group `df`

by participant ID, then fill in the NaN values with the mean RT from all non-NaN values for that participant. We do this by applying two operations in sequence: first we use `np.nanmean()`

to compute the mean over all non-NaN RT values for that participant, and then we pass that value to `.fillna()`

to replace any NaN values for that participant with the mean RT. This works because the `x`

in the lambda function is a variable that refers to the entire grouped data set that we pass to it:

```
df['rt_mean'] = df.groupby('participant')['rt_mean'].transform(lambda x: x.fillna(np.nanmean(x)))
# Check by again printing a smaple of outlier rows
df[df['rt_z'] > z_thresh].sample(10)
```

participant | block | trial | flankers | rt | rt_z | rt_mean | |
---|---|---|---|---|---|---|---|

2454 | s14 | 3 | 13 | incongruent | 0.797524 | 5.013169 | 0.446461 |

1299 | s2 | 1 | 28 | neutral | 0.837920 | 4.414282 | 0.474832 |

766 | s11 | 4 | 31 | neutral | 0.749664 | 4.908618 | 0.443325 |

4231 | s15 | 4 | 7 | congruent | 0.881503 | 3.046180 | 0.557696 |

3612 | s20 | 4 | 25 | neutral | 0.754067 | 3.132075 | 0.489359 |

3319 | s27 | 5 | 17 | incongruent | 0.773004 | 3.873254 | 0.457454 |

2579 | s4 | 2 | 10 | incongruent | 0.589781 | 3.158490 | 0.416534 |

3118 | s23 | 4 | 8 | neutral | 0.919488 | 3.395270 | 0.542556 |

4077 | s5 | 4 | 13 | incongruent | 0.742321 | 3.207646 | 0.499061 |

4208 | s15 | 3 | 16 | incongruent | 0.897641 | 3.201033 | 0.557696 |

If you don’t fully understand the lambda function above, that’s OK. It is fairly advanced relative to most of what we’ve covered so far. However, it’s introduced here because it is the proper and efficient way to replace outlier values. It’s also worth knowing about lambda functions because you will start seeing them more in the future!

### Create final output column#

Next, we want to create a column called `rt_clean`

that contains the original RT for non-outlier trials, and the mean RT for that participant for outlier trials. We can use `np.where()`

to find outliers and replace them with each individual’s mean RT. Below, we do this in one line of code, which is more compact and efficient than we did above. Earlier, we first created a column, `outliers_z`

and then used that to determine which rows would be filled with the original RT versus the mean RT. However, creating a column in our DataFrame that we will never use again is not efficient use of resources. Here we simply use the threshold test as our conditional.

```
df['rt_clean'] = np.where(np.abs(df['rt_z']) > z_thresh,
df['rt_mean'],
df['rt'])
df.sample(10)
```

participant | block | trial | flankers | rt | rt_z | rt_mean | rt_clean | |
---|---|---|---|---|---|---|---|---|

285 | s18 | 4 | 30 | congruent | 0.521391 | 0.070464 | 0.513467 | 0.521391 |

2158 | s9 | 4 | 4 | neutral | 0.391304 | -0.603053 | 0.430409 | 0.391304 |

496 | s22 | 1 | 17 | incongruent | 0.441949 | 0.003149 | 0.431980 | 0.441949 |

495 | s22 | 1 | 16 | incongruent | 0.722015 | 3.428305 | 0.431980 | 0.431980 |

4016 | s5 | 2 | 16 | incongruent | 0.526269 | 0.340851 | 0.499061 | 0.526269 |

3784 | s17 | 5 | 6 | neutral | 0.501446 | 0.347897 | 0.473696 | 0.501446 |

3004 | s6 | 5 | 20 | incongruent | 0.521189 | 0.376526 | 0.466651 | 0.521189 |

2031 | s12 | 5 | 4 | neutral | 0.446116 | -0.567060 | 0.490843 | 0.446116 |

53 | s25 | 2 | 22 | neutral | 0.425260 | -0.580330 | 0.475143 | 0.425260 |

3513 | s20 | 1 | 22 | incongruent | 0.618098 | 1.487136 | 0.489359 | 0.618098 |

Since a random sample of 10 rows may not show any where *z* actually exceeded our outlier threshold, we can use filtering to view only those rows where the absolute value of *z* was greater than our threshold, to confirm that in those rows `rt_clean`

is the same as `rt_mean`

:

```
df[np.abs(df['rt_z']) > z_thresh].sample(10)
```

participant | block | trial | flankers | rt | rt_z | rt_mean | rt_clean | |
---|---|---|---|---|---|---|---|---|

2076 | s9 | 1 | 17 | incongruent | 0.705617 | 3.889038 | 0.430409 | 0.430409 |

3118 | s23 | 4 | 8 | neutral | 0.919488 | 3.395270 | 0.542556 | 0.542556 |

3571 | s20 | 3 | 16 | neutral | 0.841455 | 4.189282 | 0.489359 | 0.489359 |

667 | s11 | 1 | 28 | incongruent | 0.701472 | 4.127361 | 0.443325 | 0.443325 |

4208 | s15 | 3 | 16 | incongruent | 0.897641 | 3.201033 | 0.557696 | 0.557696 |

2314 | s26 | 3 | 32 | neutral | 0.937082 | 4.420926 | 0.550074 | 0.550074 |

37 | s25 | 2 | 6 | neutral | 0.917223 | 4.631054 | 0.475143 | 0.475143 |

609 | s22 | 5 | 2 | incongruent | 0.713883 | 3.328852 | 0.431980 | 0.431980 |

843 | s8 | 2 | 13 | neutral | 0.041767 | -3.394478 | 0.477679 | 0.477679 |

350 | s16 | 1 | 32 | incongruent | 0.762060 | 3.576242 | 0.454536 | 0.454536 |

Finally we’ll print out the proportion of data identified as outliers, by computing the number of *z* values that exceed our threshold, relative to the total number of *z* values (i.e., the number of rows in the DataFrame):

```
outliers = df[df['rt_z'] > z_thresh]['rt_z'].count() / df['rt_z'].count()
# Convert proportion to percent, and round to 2 decimal places
print(str(round(outliers * 100, 2)), "% of the data identified as outliers and replaced with mean value.")
```

```
1.17 % of the data identified as outliers and replaced with mean value.
```

## Visualizing the cleaned data#

### Box plots#

#### Before outlier correction#

Let’s generate a box plot of the original RT values for reference (this is the same as the one plotted at the top of this lesson).

```
sns.catplot(kind='box',
data=df,
x='flankers', y='rt')
plt.show()
```

#### After outlier correction#

Below, we can see that the number of short RT outliers is reduced, and those that remain are very close to the whiskers (and also greater than 200 ms). At the same time, the number of outliers at the long RT end of the distribution is also reduced, but there are still more of them than at the short end (which is reasonable, for RT data).

```
sns.catplot(kind='box',
data=df,
x='flankers', y='rt_clean')
plt.show()
```

### Kernel density estimates#

#### Before outlier correction#

```
sns.displot(kind='kde',
data=df,
x='rt', hue='flankers')
plt.show()
```

#### After outlier correction#

We can see that the distributions look less “peaky” in the middle with long tails. The long tails were caused by the few outlier data points at extreme ends of the 0–1 s range.

```
sns.displot(kind='kde',
data=df,
x='rt_clean', hue='flankers')
plt.show()
```