
The most useful Pandas functions for cleaning your data

  • staniszradek
  • Jan 10, 2024
  • 3 min read

Updated: Jun 25, 2024


There are a few important steps to take before we start the actual analysis of our data. Typically we want to check how big our dataset is, what types our features are, or whether there are any missing values. Luckily, Pandas offers great tools to do all of that quickly.


One of them is the DataFrame.info() method, which gives us a quick overview of our data. It lists all the columns along with their types and counts how many non-null values there are in each column. This way we get quite a few important facts about our dataset with just one command.
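
For example, here is a quick look at it in action (using the `tips` dataset that appears later in this post):

import pandas as pd
import plotly.express as px

df = px.data.tips()
df.info()  # column names, dtypes and non-null counts in a single call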


If we notice that some columns have the wrong type, we should change it to facilitate our future analysis. As an example, let's say our dataframe has a column with dates and we notice that its type is, for some reason, object (i.e. string). If we leave it unchanged, we won't be able to pull a day, month or year out of it later on, which will obviously limit our EDA. So we need to convert it to a datetime type using the pd.to_datetime() function. Once we have done that, we can easily manipulate it using pandas' datetime accessor `.dt`, for example:


pandas.Series.dt.year - pulls the year out of a date

pandas.Series.dt.day_name() - returns the name of the day of the week, etc.
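
A minimal sketch, using a made-up column of dates stored as strings:

import pandas as pd

df = pd.DataFrame({'order_date': ['2024-01-10', '2024-06-25']})
df['order_date'] = pd.to_datetime(df['order_date'])  # object -> datetime64

df['order_date'].dt.year        # 2024, 2024
df['order_date'].dt.day_name()  # 'Wednesday', 'Tuesday'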


We apply the same process to numerical data that is formatted as an object; in this case we use pd.to_numeric().
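
Again a minimal sketch, with a made-up column of numbers stored as strings:

import pandas as pd

df = pd.DataFrame({'price': ['19.99', '5.49', 'n/a']})
df['price'] = pd.to_numeric(df['price'], errors='coerce')  # unparseable values become NaN
df['price'].dtype  # float64

The errors='coerce' option quietly turns values that cannot be parsed (like 'n/a' above) into NaN, so we can handle them together with the other missing values.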


Another pandas function related to time is pd.to_timedelta(). It is particularly useful when we have features describing the start and end time of an event and we are interested in calculating its duration. Converting these features into the timedelta format makes that really simple.
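
For instance (a sketch with made-up start and end times stored as strings):

import pandas as pd

df = pd.DataFrame({'start': ['08:00:00', '09:15:00'],
                   'end':   ['08:45:00', '11:00:00']})

df['duration'] = pd.to_timedelta(df['end']) - pd.to_timedelta(df['start'])
df['duration']  # 0 days 00:45:00, 0 days 01:45:00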


Having converted our features into the desired types/formats, the next step is to identify which columns have missing values. This is an important step, especially when preparing our data for modeling. Depending on the size of our dataset and the context of our analysis, we may want to either remove the observations with missing data, which in Pandas we can do with the dropna() function, or fill them in using a specified strategy. Dropping observations with missing values may lead to worse model performance, as we get rid of potentially useful information. To check which columns have missing values we apply the isnull() method. Let's see how it works with the `tips` dataset from Plotly:


import pandas as pd
import plotly.express as px

df = px.data.tips()
df.isnull().sum()  # number of missing values per column

This code returns the number of missing values in each column of our dataframe:


total_bill    0
tip           0
sex           0
smoker        0
day           0
time          0
size          0
dtype: int64


We can see that the `tips` dataset has no missing values.

If we decide to fill some values in, there's the fillna() method available. A common strategy is to impute the mean or the median in place of the missing data points. Let's remove the first value from the `tip` feature:


import numpy as np

df.loc[0, 'tip'] = np.nan  # replace the first tip with a missing value
df['tip'].head()

output:


0     NaN
1    1.66
2    3.50
3    3.31
4    3.61
Name: tip, dtype: float64

Now let's calculate and impute the mean in place of NaN:


mean_value = df['tip'].mean()  # NaN values are ignored when computing the mean
df['tip'] = df['tip'].fillna(mean_value)
df['tip'].head()

output:


0    3.006461
1    1.660000
2    3.500000
3    3.310000
4    3.610000
Name: tip, dtype: float64

As shown above, the first row of the `tip` column was filled in with the mean value.

Typically we are not 100% sure which approach will result in better model performance, so it's a good idea to evaluate our chosen metric with both solutions.
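
A minimal sketch of such a comparison, assuming scikit-learn is available and using LinearRegression purely as a stand-in model:

import numpy as np
import pandas as pd
import plotly.express as px
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

df = px.data.tips()
df.loc[0, 'total_bill'] = np.nan  # simulate a missing value in a feature

y = df['tip']
scores = {}
for strategy in ['mean', 'median']:
    X = df[['total_bill', 'size']].copy()
    fill = X['total_bill'].mean() if strategy == 'mean' else X['total_bill'].median()
    X['total_bill'] = X['total_bill'].fillna(fill)
    scores[strategy] = cross_val_score(LinearRegression(), X, y, cv=5).mean()

scores  # pick the strategy with the better score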


At this stage it is also a good idea to check for duplicates. Pandas offers two methods to handle them: duplicated(), which returns a series of boolean values indicating which rows are duplicated, and drop_duplicates(), which gets rid of those duplicates.
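
For example:

import pandas as pd
import plotly.express as px

df = px.data.tips()
df.duplicated().sum()     # count of fully duplicated rows
df = df.drop_duplicates() # keep only the first occurrence of each row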


Sometimes we may want to change the names of the columns, or change their case from lower to upper and vice versa. If there are a lot of columns to be changed, it's convenient to use the rename() function.
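
A sketch with hypothetical new names for two of the `tips` columns (the last line shows an alternative for changing case in bulk):

import pandas as pd
import plotly.express as px

df = px.data.tips()

# rename selected columns with an old-name -> new-name mapping
df = df.rename(columns={'total_bill': 'bill_total', 'size': 'party_size'})

# change the case of all column names at once
df.columns = df.columns.str.upper()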


Now our DataFrame is more or less prepared for the next steps, including EDA. In the next post we'll continue this process, focusing on outliers.






