Exploratory Data Analysis: Outliers

#EDA #Outliers #Tukey's fences #boxplot #Python #Pandas

Today I'd like to shed some light on outliers, as identifying them is one of the important steps of EDA (exploratory data analysis). Before we start, though, let's recall what EDA actually is. Personally, I really like the definition given in an article from the International Encyclopedia of Education: EDA is an approach to finding information and generating ideas by looking at the data in many different ways. As this definition suggests, EDA is a very creative process, and it requires some domain knowledge as well.

OK, how about outliers? What are they? How do we spot them, and why? Well, if we are to look at the data in many different ways, checking for outliers is one of them. An outlier is a data point that differs significantly from other observations*. By that definition, outliers are worth a closer look: we should try to figure out why they are present in the dataset. They may be due to measurement or processing errors, or they may reflect actual variation in the population.

What does it mean for an outlier to differ significantly? There are a couple of methods for detecting outliers, and the choice between them is rather subjective. One option is Tukey's fences. The first step is to calculate the interquartile range (IQR), which is the difference between the upper and lower quartiles (Q3 - Q1), or in other words the difference between the 75th and 25th percentiles. According to this method, every observation that falls below Q1 - k * IQR or above Q3 + k * IQR is considered an outlier, where 'k' is a nonnegative constant. For regular outliers k is usually 1.5; however, we can use 3 instead of 1.5 when we want to detect only the so-called "far outs". *

Based on these criteria we can filter our dataset so that it doesn't contain outliers. Let's see how we can do it in Pandas:
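A minimal sketch of the starting point, assuming a small made-up sample. The values below are hypothetical and hand-picked (they are not the original data), chosen so that the results match the ones discussed in this post; the column name feature_A follows the text.

```python
import pandas as pd

# Hypothetical, hand-picked observations of feature_A (illustrative only).
observations = [10, 12, 14, 16, 18, 19, 20, 21, 22, 24, 26, 39, 44]
df = pd.DataFrame({"feature_A": observations})

# Summary statistics: count, mean, std, min, quartiles and max.
summary = df["feature_A"].describe()
print(summary)
```

The 25% and 75% rows of the summary are Q1 and Q3, the two quartiles that Tukey's fences are built from.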

Once we have created a dataframe with some random observations of feature A, we can proceed with calculating the interquartile range, constructing Tukey's fences and finally filtering out the outliers:
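These steps could look roughly like the sketch below. It assumes the same hypothetical, hand-picked values for feature_A (not the original data); the variable names are my own, and Series.quantile uses its default linear interpolation.

```python
import pandas as pd

# Hypothetical, hand-picked observations of feature_A (illustrative only).
df = pd.DataFrame(
    {"feature_A": [10, 12, 14, 16, 18, 19, 20, 21, 22, 24, 26, 39, 44]}
)

k = 1.5  # Tukey's constant for regular outliers

# Quartiles and interquartile range.
q1 = df["feature_A"].quantile(0.25)
q3 = df["feature_A"].quantile(0.75)
iqr = q3 - q1

# Tukey's fences.
lower_fence = q1 - k * iqr
upper_fence = q3 + k * iqr

# Boolean mask: True for observations outside the fences.
is_outlier = (df["feature_A"] < lower_fence) | (df["feature_A"] > upper_fence)

outliers = df[is_outlier]      # rows flagged as outliers
df_filtered = df[~is_outlier]  # dataset without the outliers

print(f"IQR = {iqr}, fences = [{lower_fence}, {upper_fence}]")
print(outliers)
```

Keeping the mask in its own variable makes it easy to inspect the outliers first and only then decide whether to drop them.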

We can see that, based on Tukey's fences with k = 1.5, we have two outliers (44 and 39). We can also visualize them using a boxplot. We need to import the seaborn library and compare the results:
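A minimal plotting sketch, again assuming the hypothetical, hand-picked values for feature_A rather than the original data. The marker used for outliers may differ between seaborn versions.

```python
import pandas as pd
import seaborn as sns

# Hypothetical, hand-picked observations of feature_A (illustrative only).
df = pd.DataFrame(
    {"feature_A": [10, 12, 14, 16, 18, 19, 20, 21, 22, 24, 26, 39, 44]}
)

# Horizontal boxplot; 'whis' is left at its default value of 1.5.
ax = sns.boxplot(x=df["feature_A"])
ax.set_title("feature_A, whis=1.5 (default)")

# In a notebook the figure is displayed automatically; in a plain script,
# import matplotlib.pyplot as plt and call plt.show().
```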

The boxplot confirms the presence of the two outliers. They are represented by the two small diamonds beyond the right whisker. It is worth mentioning that the constant 'k' is set to 1.5 by default and is controlled by the 'whis' argument. If for some reason we would like to identify data points that are 'far out' instead of regular outliers, we just need to set 'whis' to 3. In this case our feature_A has no 'far outs': the right whisker extends and now reaches the maximum observed value (44):
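A sketch of the same plot with the wider fences, once more assuming the hypothetical, hand-picked values for feature_A. The fences for k = 3 are also computed by hand, so that the picture can be checked against the numbers.

```python
import pandas as pd
import seaborn as sns

# Hypothetical, hand-picked observations of feature_A (illustrative only).
df = pd.DataFrame(
    {"feature_A": [10, 12, 14, 16, 18, 19, 20, 21, 22, 24, 26, 39, 44]}
)

# Tukey's fences with k = 3, which flag only the "far out" points.
k = 3
q1 = df["feature_A"].quantile(0.25)
q3 = df["feature_A"].quantile(0.75)
iqr = q3 - q1
lower_fence = q1 - k * iqr
upper_fence = q3 + k * iqr

far_outs = df[(df["feature_A"] < lower_fence) | (df["feature_A"] > upper_fence)]
print(f"fences = [{lower_fence}, {upper_fence}], far outs: {len(far_outs)}")

# Same boxplot as before, but with whis=3 instead of the default 1.5.
ax = sns.boxplot(x=df["feature_A"], whis=3)
ax.set_title("feature_A, whis=3")
```

With these values the upper fence moves from 36 to 48, so both 39 and 44 fall inside it and no point is drawn separately.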

To sum up, boxplots are very useful charts for visualizing outliers. In the Seaborn library they employ Tukey's fences method and allow us to adjust the 'k' constant depending on our needs. It is up to us what to do with the outliers. In some cases they can do real harm when left unmanaged (for example, regression models are sensitive to outliers); in others we can keep them.

Sources:

*https://en.wikipedia.org/wiki/Outlier
