Why do we split our datasets?

  • staniszradek
  • Feb 6
  • 3 min read

#Data Preparation #ML #Scikit-learn



Introduction


When building machine learning models, we must assess their performance on unseen data. To achieve this, we divide the dataset into training and test sets. The training set is used to train the model, while the test set evaluates how well the model generalizes to new data. This process helps us detect overfitting and gives a realistic estimate of how the model will perform on real-world data.


Splitting Data Manually with Pandas


Let’s consider a hypothetical dataset containing customer information:

import pandas as pd

data = pd.DataFrame({
    'customer_id': range(1, 11),
    'age': [25, 34, 45, 23, 35, 46, 29, 31, 39, 41],
    'spend': [200, 500, 700, 150, 400, 800, 250, 350, 600, 650]
})

When splitting this dataset manually into training and test sets, we need to make sure the process is random. To achieve this, we can use sample():

train = data.sample(frac=0.8, random_state=42)

  • frac=0.8 specifies that 80% of the data should be used for training.

  • random_state=42 ensures reproducibility of the split.


Now that we have sampled 80% of our dataset and assigned it to the training set, we need to define our test set (the remaining 20%). We can do this as follows:

test = data.drop(train.index)

This line simply drops the rows that were sampled in the previous step, leaving the remaining 20% of rows as the test set.

Let's visualize the results:
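
One simple way is to print all three DataFrames; sorting the sampled sets by index (a small addition on top of the code above) makes them easier to compare with the full dataset:

print(data)                # Full dataset (10 rows)
print(train.sort_index())  # Training set (8 randomly sampled rows)
print(test.sort_index())   # Test set (the remaining 2 rows)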

We can see the original dataset (Full dataset) was split according to our needs (size + randomness) into Training and Test sets respectively.


Splitting Data with train_test_split


A more efficient way to split data is by using train_test_split from sklearn.model_selection:

from sklearn.model_selection import train_test_split

train, test = train_test_split(data, test_size=0.2, random_state=42)

The result of the train_test_split function is equivalent to what we achieved manually: a random, reproducible 80/20 split in a single line.
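
As a quick sanity check, we can confirm the 80/20 proportions and that no row ended up in both sets:

print(len(train), len(test))                 # 8 2
print(train.index.intersection(test.index))  # empty index: the sets don't overlap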


Key arguments:


  • test_size: Proportion of data allocated to the test set (e.g., 0.2 for 20%).

  • random_state: Ensures reproducibility.

  • shuffle: Determines whether the data should be shuffled before splitting (default: True).


Shuffling the data before splitting is usually what we want, which is why this argument defaults to True. However, there are cases where preserving the temporal order is crucial (e.g., time series). In those cases, we set shuffle=False:


train, test = train_test_split(data, test_size=0.2, shuffle=False)            


Now we can see that the last two rows from the full dataset were assigned to the Test set (not 2 random rows!). This is an important difference to keep in mind when working with time series, as it allows us to predict the future based on the past, and not the other way around :)
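We can confirm this by printing the test set; with our 10-row dataset and test_size=0.2, it should contain exactly the last two rows:

print(test)  # customer_id 9 and 10: the last two rows of the full dataset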


  • stratify: Maintains the same proportion of target values in both sets.


Let's modify our original dataset a little bit to explore what the stratify argument really does. First, let's make it bigger (100 rows) and add an imbalanced binary target (roughly 15% ones, 85% zeros):

import numpy as np

data = pd.DataFrame({
    'customer_id': range(1, 101),
    'age': np.random.randint(18, 65, size=100),
    'spend': np.random.randint(100, 1000, size=100)
})
# Adding an imbalanced binary target variable (roughly 15% ones, 85% zeros)
data['target'] = np.where(np.random.rand(100) < 0.15, 1, 0)

This is how it looks now:
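
A quick preview of the first rows with head():

print(data.head())  # customer_id, age, spend and the new target column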
Let's confirm class distribution in our target feature:
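
value_counts() gives the counts per class, and with normalize=True the proportions:

print(data['target'].value_counts())                # counts of zeros and ones
print(data['target'].value_counts(normalize=True))  # roughly 0.85 / 0.15 (exact values vary without a fixed seed)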
As discussed earlier, splitting our dataset is usually meant to be random. If so, we can't be sure that the proportions of zeros and ones in both sets are going to be similar. This can be an issue when evaluating our model. To mitigate it, we can use the stratify argument. Let's see an example:

# Splitting without stratify
train_no_strat, test_no_strat = train_test_split(data, test_size=0.2, random_state=200)

# Splitting with stratify (same random_state, so the comparison is fair)
train_strat, test_strat = train_test_split(data, test_size=0.2, stratify=data['target'], random_state=200)

Let's see our target distributions in both training and test sets with and without stratifying:
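
A minimal sketch for comparing the share of ones in each of the four sets (the mean of a 0/1 column equals the proportion of ones):

for name, subset in [('train (no stratify)', train_no_strat),
                     ('test (no stratify)', test_no_strat),
                     ('train (stratify)', train_strat),
                     ('test (stratify)', test_strat)]:
    print(name, subset['target'].mean())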
We can clearly see that without stratifying, the proportions of zeros and ones in the training and test sets can be far from similar, purely due to randomness!

In extreme cases, this can even have an impact on our model's performance.


Conclusion


  • Splitting a dataset ensures model evaluation on unseen data.

  • sample() allows manual splitting in Pandas.

  • train_test_split is a more efficient method with useful parameters like test_size, stratify, and shuffle.

  • Special cases like imbalanced data require stratify, while time series data should not be shuffled.

  • Using stratify ensures that the class distribution in training and test sets remains consistent, avoiding biased evaluations.

