Lecture 11 - Data Exploration and Preprocessing


11.1 Exploratory Data Analysis

Exploratory Data Analysis (EDA) is an important step in every data science project, and consists of several exploratory steps aimed at obtaining a better understanding of the data.

EDA typically includes:

  • inspecting the summary statistics of the data,

  • checking for missing values and adopting an appropriate strategy for handling them,

  • checking the distribution of the features and whether there are correlations between features,

  • understanding which features are important and worth keeping and which ones are less important.

To provide an example of EDA, we will use the Titanic dataset which can be loaded from the Seaborn datasets.

[1]:
# import libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
[2]:
titanic = sns.load_dataset('titanic')

Let’s check the basic information about the data. There are 891 rows (samples) and 15 columns (features). We can see below the data types of each column.

[3]:
titanic.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   survived     891 non-null    int64
 1   pclass       891 non-null    int64
 2   sex          891 non-null    object
 3   age          714 non-null    float64
 4   sibsp        891 non-null    int64
 5   parch        891 non-null    int64
 6   fare         891 non-null    float64
 7   embarked     889 non-null    object
 8   class        891 non-null    category
 9   who          891 non-null    object
 10  adult_male   891 non-null    bool
 11  deck         203 non-null    category
 12  embark_town  889 non-null    object
 13  alive        891 non-null    object
 14  alone        891 non-null    bool
dtypes: bool(2), category(2), float64(2), int64(4), object(5)
memory usage: 80.7+ KB

Let’s display the first five rows and the last five rows. As we can notice, each row contains the data for one passenger on the Titanic.

[4]:
titanic.head()
[4]:
survived pclass sex age sibsp parch fare embarked class who adult_male deck embark_town alive alone
0 0 3 male 22.0 1 0 7.2500 S Third man True NaN Southampton no False
1 1 1 female 38.0 1 0 71.2833 C First woman False C Cherbourg yes False
2 1 3 female 26.0 0 0 7.9250 S Third woman False NaN Southampton yes True
3 1 1 female 35.0 1 0 53.1000 S First woman False C Southampton yes False
4 0 3 male 35.0 0 0 8.0500 S Third man True NaN Southampton no True
[5]:
titanic.tail()
[5]:
survived pclass sex age sibsp parch fare embarked class who adult_male deck embark_town alive alone
886 0 2 male 27.0 0 0 13.00 S Second man True NaN Southampton no True
887 1 1 female 19.0 0 0 30.00 S First woman False B Southampton yes True
888 0 3 female NaN 1 2 23.45 S Third woman False NaN Southampton no False
889 1 1 male 26.0 0 0 30.00 C First man True C Cherbourg yes True
890 0 3 male 32.0 0 0 7.75 Q Third man True NaN Queenstown no True

Let’s also see the summary statistics. Recall that statistics are shown only for the columns with numerical data.

[6]:
titanic.describe()
[6]:
survived pclass age sibsp parch fare
count 891.000000 891.000000 714.000000 891.000000 891.000000 891.000000
mean 0.383838 2.308642 29.699118 0.523008 0.381594 32.204208
std 0.486592 0.836071 14.526497 1.102743 0.806057 49.693429
min 0.000000 1.000000 0.420000 0.000000 0.000000 0.000000
25% 0.000000 2.000000 20.125000 0.000000 0.000000 7.910400
50% 0.000000 3.000000 28.000000 0.000000 0.000000 14.454200
75% 1.000000 3.000000 38.000000 1.000000 0.000000 31.000000
max 1.000000 3.000000 80.000000 8.000000 6.000000 512.329200

Let’s assume that our task is to predict whether a passenger, such as the one shown in the next cell, survived. I.e., we will take the survived column to be the target, and we will implement a classification algorithm to predict it based on the other columns in the dataset.

[7]:
titanic.head(1)
[7]:
survived pclass sex age sibsp parch fare embarked class who adult_male deck embark_town alive alone
0 0 3 male 22.0 1 0 7.25 S Third man True NaN Southampton no False

Explore Column Information

Let’s first inspect some of the columns in the dataset.

For instance, we can use the following code to find how many passengers survived and how many died.

[8]:
titanic['survived'].value_counts()
[8]:
0    549
1    342
Name: survived, dtype: int64

It is often easier to understand the data if we plot the values. The pandas library provides basic plotting functions. For instance, in the next cell we create a bar plot by calling plot(kind='bar') directly on the pandas object. The plotting syntax in pandas is somewhat different from the matplotlib functions, and admittedly, the plotting functionality in pandas is limited. We will later learn about the Seaborn library, which also works directly with DataFrames and provides improved visualizations in comparison to pandas plots.

[9]:
titanic['survived'].value_counts().plot(kind='bar')
[9]:
<Axes: >
[Figure: bar plot of the survived value counts]

Notice that there is a column called alive which duplicates the survived column. We need to remove this column from the data, otherwise the classifier will just read the target off that column when making predictions, and will achieve 100% accuracy.

[10]:
titanic['alive'].value_counts()
[10]:
no     549
yes    342
Name: alive, dtype: int64
[11]:
titanic.drop(['alive'], axis=1, inplace=True)
[12]:
# verify the change
titanic.head()
[12]:
survived pclass sex age sibsp parch fare embarked class who adult_male deck embark_town alone
0 0 3 male 22.0 1 0 7.2500 S Third man True NaN Southampton False
1 1 1 female 38.0 1 0 71.2833 C First woman False C Cherbourg False
2 1 3 female 26.0 0 0 7.9250 S Third woman False NaN Southampton True
3 1 1 female 35.0 1 0 53.1000 S First woman False C Southampton False
4 0 3 male 35.0 0 0 8.0500 S Third man True NaN Southampton True

Also, there are two columns called class and pclass. Let’s examine how many passengers are in each class.

[13]:
titanic['pclass'].value_counts()
[13]:
3    491
1    216
2    184
Name: pclass, dtype: int64
[14]:
titanic['class'].value_counts()
[14]:
Third     491
First     216
Second    184
Name: class, dtype: int64
[15]:
# compare the two columns
p_class = titanic[['pclass', 'class']]
p_class.head()
[15]:
pclass class
0 3 Third
1 1 First
2 3 Third
3 1 First
4 3 Third

It seems that both of these columns contain the same information, except that one is numeric and the other contains text. Let’s drop the class column.

[16]:
titanic.drop(['class'], axis=1, inplace=True)
[17]:
# verify the change
titanic.head()
[17]:
survived pclass sex age sibsp parch fare embarked who adult_male deck embark_town alone
0 0 3 male 22.0 1 0 7.2500 S man True NaN Southampton False
1 1 1 female 38.0 1 0 71.2833 C woman False C Cherbourg False
2 1 3 female 26.0 0 0 7.9250 S woman False NaN Southampton True
3 1 1 female 35.0 1 0 53.1000 S woman False C Southampton False
4 0 3 male 35.0 0 0 8.0500 S man True NaN Southampton True

Let’s explore the two columns embarked and embark_town.

[18]:
titanic['embarked'].value_counts()
[18]:
S    644
C    168
Q     77
Name: embarked, dtype: int64
[19]:
titanic['embark_town'].value_counts()
[19]:
Southampton    644
Cherbourg      168
Queenstown      77
Name: embark_town, dtype: int64

They contain the same information, therefore, let’s drop embarked.

[20]:
titanic.drop(['embarked'], axis=1, inplace=True)
[21]:
# verify the change
titanic.head()
[21]:
survived pclass sex age sibsp parch fare who adult_male deck embark_town alone
0 0 3 male 22.0 1 0 7.2500 man True NaN Southampton False
1 1 1 female 38.0 1 0 71.2833 woman False C Cherbourg False
2 1 3 female 26.0 0 0 7.9250 woman False NaN Southampton True
3 1 1 female 35.0 1 0 53.1000 woman False C Southampton False
4 0 3 male 35.0 0 0 8.0500 man True NaN Southampton True

Let’s plot the occurrences in embark_town in a bar plot.

[22]:
titanic['embark_town'].value_counts().plot(kind='bar')
plt.show()
[Figure: bar plot of the embark_town value counts]

Also, let’s check how many men and women there are in the dataset.

[23]:
titanic['sex'].value_counts()
[23]:
male      577
female    314
Name: sex, dtype: int64
[24]:
titanic['sex'].value_counts().plot(kind='bar')
[24]:
<Axes: >
[Figure: bar plot of the sex value counts]
[25]:
titanic['who'].value_counts()
[25]:
man      537
woman    271
child     83
Name: who, dtype: int64

We can note that the column who is similar to the sex column, but it also includes a separate category for children.

As an exercise, let’s visualize the categories of the column who with a pie chart.

[26]:
titanic.who.value_counts().plot(kind='pie')
plt.show()
[Figure: pie chart of the who value counts]

There is another column adult_male which is similar to, but different from, sex and who.

[27]:
titanic['adult_male'].value_counts()
[27]:
True     537
False    354
Name: adult_male, dtype: int64

Missing Data

Let’s check which columns have data missing.

[28]:
titanic.isnull().sum()
[28]:
survived         0
pclass           0
sex              0
age            177
sibsp            0
parch            0
fare             0
who              0
adult_male       0
deck           688
embark_town      2
alone            0
dtype: int64

There are missing data in the age, deck, and embark_town columns.

The deck column has too many missing values, and the deck on which a passenger’s cabin was located is probably not very important for the task of predicting the survived passengers, thus, let’s drop it.

[29]:
titanic.drop(['deck'], axis=1, inplace=True)
[30]:
# verify the change
titanic.head()
[30]:
survived pclass sex age sibsp parch fare who adult_male embark_town alone
0 0 3 male 22.0 1 0 7.2500 man True Southampton False
1 1 1 female 38.0 1 0 71.2833 woman False Cherbourg False
2 1 3 female 26.0 0 0 7.9250 woman False Southampton True
3 1 1 female 35.0 1 0 53.1000 woman False Southampton False
4 0 3 male 35.0 0 0 8.0500 man True Southampton True

Since only 2 rows have missing values in embark_town, let’s remove those two rows. In the next cell we used dropna to remove only those rows that have missing values in the embark_town column.

Recall again that to drop columns in pandas we use axis=1 and to drop rows we use axis=0.

[31]:
titanic.dropna(subset=['embark_town'], axis=0, inplace=True)

We can notice that the number of rows was reduced to 889 from the original 891, because of the removed 2 rows.

[32]:
# verify the change
titanic.shape
[32]:
(889, 11)

Now we have NaN values only in the age column.

[33]:
titanic.isnull().sum()
[33]:
survived         0
pclass           0
sex              0
age            177
sibsp            0
parch            0
fare             0
who              0
adult_male       0
embark_town      0
alone            0
dtype: int64

There are several ways to deal with this. One is to replace the missing values in the age column with the average age of the passengers, or with some other value (e.g., 0 in some cases).
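
For example, a single column could be filled with a constant or with a robust statistic such as the median. A quick sketch (shown for illustration only and not applied to the data here):

# Illustrative sketch (not applied here): alternative fills for the 'age' column
age_filled_zero = titanic['age'].fillna(0)                            # replace missing ages with 0
age_filled_median = titanic['age'].fillna(titanic['age'].median())    # replace missing ages with the median age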

Let’s first explore the first option, and create a new DataFrame called titanic_filled in which the missing values in the age column are replaced with the average age. For this purpose we will use the fillna method, passing the column means computed with mean().

[34]:
# fill the missing values with the column means (numeric_only avoids a FutureWarning)
titanic_filled = titanic.fillna(titanic.mean(axis=0, numeric_only=True))

To verify the above, let’s display several rows, one of which has a missing value for the age, and below them the corresponding rows of the DataFrame with the filled values. We can note that the average age is about 29.64 years.

[35]:
titanic.head(8)
[35]:
survived pclass sex age sibsp parch fare who adult_male embark_town alone
0 0 3 male 22.0 1 0 7.2500 man True Southampton False
1 1 1 female 38.0 1 0 71.2833 woman False Cherbourg False
2 1 3 female 26.0 0 0 7.9250 woman False Southampton True
3 1 1 female 35.0 1 0 53.1000 woman False Southampton False
4 0 3 male 35.0 0 0 8.0500 man True Southampton True
5 0 3 male NaN 0 0 8.4583 man True Queenstown True
6 0 1 male 54.0 0 0 51.8625 man True Southampton True
7 0 3 male 2.0 3 1 21.0750 child False Southampton False
[36]:
titanic_filled.head(8)
[36]:
survived pclass sex age sibsp parch fare who adult_male embark_town alone
0 0 3 male 22.000000 1 0 7.2500 man True Southampton False
1 1 1 female 38.000000 1 0 71.2833 woman False Cherbourg False
2 1 3 female 26.000000 0 0 7.9250 woman False Southampton True
3 1 1 female 35.000000 1 0 53.1000 woman False Southampton False
4 0 3 male 35.000000 0 0 8.0500 man True Southampton True
5 0 3 male 29.642093 0 0 8.4583 man True Queenstown True
6 0 1 male 54.000000 0 0 51.8625 man True Southampton True
7 0 3 male 2.000000 3 1 21.0750 child False Southampton False

We can observe now that there are no missing values in the titanic_filled DataFrame.

[37]:
titanic_filled.isnull().sum()
[37]:
survived       0
pclass         0
sex            0
age            0
sibsp          0
parch          0
fare           0
who            0
adult_male     0
embark_town    0
alone          0
dtype: int64

Another alternative is to drop the rows with the missing values for the age. Let’s explore this strategy as well.

In general, we can try both strategies and check which one produces better results with the classification algorithm.

After dropping the rows with missing values, 712 rows remain in the dataset. The method reset_index will change the index to range from 0 to 711. If we didn’t reset the index, the index values would still be the original labels between 0 and 890, with gaps for the dropped rows.

[38]:
titanic.dropna(inplace=True)
[39]:
titanic.reset_index(inplace=True)
[40]:
titanic.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 712 entries, 0 to 711
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   index        712 non-null    int64
 1   survived     712 non-null    int64
 2   pclass       712 non-null    int64
 3   sex          712 non-null    object
 4   age          712 non-null    float64
 5   sibsp        712 non-null    int64
 6   parch        712 non-null    int64
 7   fare         712 non-null    float64
 8   who          712 non-null    object
 9   adult_male   712 non-null    bool
 10  embark_town  712 non-null    object
 11  alone        712 non-null    bool
dtypes: bool(2), float64(2), int64(5), object(3)
memory usage: 57.1+ KB
[41]:
titanic.head()
[41]:
index survived pclass sex age sibsp parch fare who adult_male embark_town alone
0 0 0 3 male 22.0 1 0 7.2500 man True Southampton False
1 1 1 1 female 38.0 1 0 71.2833 woman False Cherbourg False
2 2 1 3 female 26.0 0 0 7.9250 woman False Southampton True
3 3 1 1 female 35.0 1 0 53.1000 woman False Southampton False
4 4 0 3 male 35.0 0 0 8.0500 man True Southampton True
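
As a side note, reset_index keeps the old index as a new index column by default, which is why an index column appears above; passing drop=True discards it instead. A small sketch with a toy DataFrame:

# Sketch: drop=True discards the old index instead of adding it as an extra column
demo = pd.DataFrame({'a': [10, 20, 30]}, index=[5, 7, 9])
demo.reset_index(drop=True)   # RangeIndex 0, 1, 2 and no extra 'index' column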

Checking Feature Distribution

Let’s check the distribution of the numerical columns in the dataset, by plotting the histograms. For the columns with categorical data, such as survived and pclass, the histograms are equivalent to bar plots, and are less helpful.

[42]:
titanic[['survived','pclass','age','sibsp','parch','fare']].hist(bins=10)
plt.tight_layout()
plt.show()
[Figure: histograms of the survived, pclass, age, sibsp, parch, and fare columns]

Or, we can inspect the distribution of each feature.

[43]:
titanic['age'].plot(kind='hist', bins=30)
plt.show()
[Figure: histogram of the age column]

If we wish, we can use the groupby function in pandas, for instance, to show the counts for the pclass column grouped by survived.

[44]:
titanic.groupby('survived')['pclass'].value_counts().plot(kind="bar")
plt.show()
[Figure: bar plot of the pclass counts grouped by survived]

As we mentioned earlier, although the pandas library provides some functionality for plotting directly from DataFrames, there are other plotting libraries that provide improved graphs. Among the most popular is Seaborn. A few plots created with Seaborn are shown below.

[45]:
sns.countplot(data=titanic, x='survived', hue='pclass')
plt.show()
[Figure: count plot of survived with pclass as hue]
[46]:
sns.countplot(data=titanic, x='survived', hue='sex')
plt.show()
[Figure: count plot of survived with sex as hue]

In the next figure we can see a scatter plot of fare versus age, colored by passenger class. As expected, the fares in the first class were more expensive, in comparison to the second and third classes.

[47]:
sns.scatterplot(data=titanic, x='age', y='fare', hue='pclass', palette='viridis')
plt.show()
[Figure: scatter plot of fare versus age, colored by pclass]

Checking Correlated Features

Checking the correlations between the features in a dataset can help to identify similarities between the features. If two features are highly correlated, they contain similar or perhaps the same information, and removing one of them will have little effect on the performance of the classification algorithm.

We use the corr() method in pandas to check the correlation between the columns. Based on the correlation values, there are no highly correlated columns. One of the reasons is that we have already dropped several columns that were similar or identical to other columns.

[48]:
correlation = titanic.corr(numeric_only=True)
correlation
[48]:
index survived pclass age sibsp parch fare adult_male alone
index 1.000000 0.029526 -0.035609 0.033681 -0.082704 -0.011672 0.009655 0.024069 0.059677
survived 0.029526 1.000000 -0.356462 -0.082446 -0.015523 0.095265 0.266100 -0.551151 -0.199741
pclass -0.035609 -0.356462 1.000000 -0.365902 0.065187 0.023666 -0.552893 0.094635 0.150576
age 0.033681 -0.082446 -0.365902 1.000000 -0.307351 -0.187896 0.093143 0.286543 0.195766
sibsp -0.082704 -0.015523 0.065187 -0.307351 1.000000 0.383338 0.139860 -0.313016 -0.629408
parch -0.011672 0.095265 0.023666 -0.187896 0.383338 1.000000 0.206624 -0.365580 -0.577109
fare 0.009655 0.266100 -0.552893 0.093143 0.139860 0.206624 1.000000 -0.177446 -0.262799
adult_male 0.024069 -0.551151 0.094635 0.286543 -0.313016 -0.365580 -0.177446 1.000000 0.400718
alone 0.059677 -0.199741 0.150576 0.195766 -0.629408 -0.577109 -0.262799 0.400718 1.000000

If we observe the correlations for the column survived, we can see that it is most correlated with adult_male, pclass, fare, and alone, and less correlated with age, sibsp (number of siblings/spouses aboard), and parch (number of parents/children aboard).

[49]:
correlation['survived']
[49]:
index         0.029526
survived      1.000000
pclass       -0.356462
age          -0.082446
sibsp        -0.015523
parch         0.095265
fare          0.266100
adult_male   -0.551151
alone        -0.199741
Name: survived, dtype: float64
[50]:
correlation['fare']
[50]:
index         0.009655
survived      0.266100
pclass       -0.552893
age           0.093143
sibsp         0.139860
parch         0.206624
fare          1.000000
adult_male   -0.177446
alone        -0.262799
Name: fare, dtype: float64

The following heatmap shows the correlation between all features in a graph with a colorbar.

[51]:
sns.heatmap(correlation, annot=True)
plt.show()
[Figure: heatmap of the correlation matrix]

11.2 Preprocessing Numerical Data

Tabular data can be classified into two main categories:

  • Numerical data: a quantity represented by a real or integer number.

  • Categorical data: a discrete value, typically represented by string labels taken from a finite list of possible choices, but it can also be represented by numbers from a discrete set of possible choices.

Most machine learning algorithms are sensitive to the range of values of the numerical inputs, and expect the input features to be scaled before processing. Feature scaling transforms the numerical features into a small common range of values.

Common feature scaling techniques for numerical features include:

  • Normalization

  • Standardization

  • Robust scaling

Whether or not a machine learning model requires scaling of the features depends on the model family. Linear models, such as logistic regression, generally benefit from scaling the features, while other models such as tree-based models (e.g., decision trees, random forests) do not need such preprocessing.

11.2.1 Normalization

Normalization is a scaling technique that transforms numerical features into a range of values between 0 and 1. When we work with features that have different ranges of values, normalizing the features can be a good practice. For example, if one feature (column) ranges from 100 to 1,000, and another ranges from 0.05 to 0.2, we can scale them so that they both range from 0 to 1.

Normalization is performed using the following formula, where \(X_{min}\) is the minimum value of the feature \(X\), and \(X_{max}\) is the maximum value of \(X\).

\[X_{norm} = \frac {X-X_{min}} {X_{max}-X_{min}}\]

To illustrate normalization, we will use a smaller dataset called tips, available in Seaborn.

[52]:
tip_data = sns.load_dataset('tips')
tip_data.head()
[52]:
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4

Let’s separate all numerical features from the above data into a new DataFrame original_features.

[53]:
original_features = tip_data[['total_bill', 'tip', 'size']]

To perform normalization, we will use the scikit-learn library, which provides the MinMaxScaler transformer to scale the data to the range between 0 and 1. That is the default range, but we can also select an arbitrary range to scale the data (a sketch is shown further below).

The fit method in the code below first fits the scaler to the data, i.e., for this task it calculates the minimum and maximum values of each column.

Afterwards, the transform method scales the data, i.e., it substitutes the calculated minimum and maximum values of each column into the formula above to obtain the scaled values.

The syntax is scaler.fit(data) and scaler.transform(data).

[54]:
from sklearn.preprocessing import MinMaxScaler

minmax_scaler = MinMaxScaler()

minmax_scaler.fit(original_features)

normalized_features = minmax_scaler.transform(original_features)
[55]:
# Show the first five rows in 'total_bill', 'tip', and 'size'
normalized_features[:5]
[55]:
array([[0.29157939, 0.00111111, 0.2       ],
       [0.1522832 , 0.07333333, 0.4       ],
       [0.3757855 , 0.27777778, 0.4       ],
       [0.43171345, 0.25666667, 0.2       ],
       [0.45077503, 0.29      , 0.6       ]])
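
As a sanity check, the same scaling can be computed directly with pandas by applying the normalization formula column-wise (a quick sketch):

# Sketch: manual min-max normalization, equivalent to MinMaxScaler with the default range
manual_norm = (original_features - original_features.min()) / (original_features.max() - original_features.min())
manual_norm.head()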

The scikit-learn library also provides a combined method fit_transform, which calls fit and then transform in one step. This is more convenient, and for some transformers slightly more efficient, than calling fit and transform separately. The syntax is scaler.fit_transform(data).

[56]:
normalized_features_2 = minmax_scaler.fit_transform(original_features)
[57]:
# Show the first five rows in 'total_bill', 'tip', and 'size'
normalized_features_2[:5]
[57]:
array([[0.29157939, 0.00111111, 0.2       ],
       [0.1522832 , 0.07333333, 0.4       ],
       [0.3757855 , 0.27777778, 0.4       ],
       [0.43171345, 0.25666667, 0.2       ],
       [0.45077503, 0.29      , 0.6       ]])

The output of MinMaxScaler() is a NumPy array, with the values in each column scaled in the range between 0 and 1.
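
As mentioned above, a range other than [0, 1] can be selected through the feature_range argument; a short sketch:

# Sketch: scale the features to the range [-1, 1] instead of the default [0, 1]
minmax_scaler_pm1 = MinMaxScaler(feature_range=(-1, 1))
normalized_features_pm1 = minmax_scaler_pm1.fit_transform(original_features)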

11.2.2 Standardization

Standardization is another scaling technique, where numerical features are rescaled to have a mean (\(\mu\)) of 0 and a standard deviation (\(\sigma\)) of 1.

The formula for standardization is as follows:

\[X_{std} = \frac {X - \mu} {\sigma}\]

where \(X_{std}\) is the standardized feature, \(X\) is the original feature, \(\mu\) is the mean of the feature, and \(\sigma\) is the standard deviation.

When should we standardize the features? Standardization is typically preferred when the training data approximately follows a normal (Gaussian) distribution; if the data does not have a normal distribution, then normalization is a preferable scaling technique to standardization.

With some machine learning algorithms the performance will be the same whether the features are scaled using normalization or standardization, but with other algorithms there can be a difference in performance. Therefore, in some cases, we can try both feature scaling techniques, especially if we are not sure about the distribution of the data.

Standardization is implemented in scikit-learn with StandardScaler. Similar to MinMaxScaler, we can either use the syntax scaler.fit_transform(data) or the syntax with scaler.fit(data) and scaler.transform(data). We will explain the difference between these two syntaxes in the next lectures, when we introduce the concepts of training and testing datasets.

[58]:
from sklearn.preprocessing import StandardScaler

std_scaler = StandardScaler()

standardized_features = std_scaler.fit_transform(original_features)
[59]:
standardized_features[:5]
[59]:
array([[-0.31471131, -1.43994695, -0.60019263],
       [-1.06323531, -0.96920534,  0.45338292],
       [ 0.1377799 ,  0.36335554,  0.45338292],
       [ 0.4383151 ,  0.22575414, -0.60019263],
       [ 0.5407447 ,  0.4430195 ,  1.50695847]])

We can inspect the mean and variance of the original data using the mean_ and var_ attributes. The convention in scikit-learn is that if an attribute is learned from the data, its name ends with an underscore, as in mean_ and var_ for the StandardScaler.

[60]:
# The mean of each feature in the original data
std_scaler.mean_
[60]:
array([19.78594262,  2.99827869,  2.56967213])
[61]:
# The variance of each feature in the original data
std_scaler.var_
[61]:
array([78.92813149,  1.90660851,  0.9008835 ])

As expected, each column in the scaled data standardized_features has zero mean and unit standard deviation.

[62]:
# The mean of each feature in the scaled data
np.round(standardized_features.mean(axis=0))
[62]:
array([-0.,  0., -0.])
[63]:
# The standard deviation of each feature in the scaled data
np.round(standardized_features.std(axis=0))
[63]:
array([1., 1., 1.])
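
For reference, the same standardization can be reproduced manually with pandas (a sketch; note that ddof=0 is needed to match the population standard deviation used by StandardScaler):

# Sketch: manual standardization, equivalent to StandardScaler
manual_std = (original_features - original_features.mean()) / original_features.std(ddof=0)
manual_std.head()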

It is easy to confuse normalization with standardization. Note that standardization does not make the data normally distributed: it only rescales the features to have mean 0 and standard deviation 1, without changing the shape of their distribution. Therefore, pay attention and try not to confuse these two scaling techniques.

11.2.3 Robust Scaling

Scikit-learn provides another scaling method called RobustScaler, which is more suitable when the data contain many outliers.

RobustScaler applies a scaling similar to standardization, but it uses the median and the interquartile range (IQR) instead of the mean and standard deviation to scale the features. This makes it less sensitive to extreme values or outliers in the data. Recall that the interquartile range (IQR) is the difference between the 3rd quartile (75th percentile) and the 1st quartile (25th percentile).

Because of that, RobustScaler is also more suitable for datasets with non-normal distributions.

[64]:
from sklearn.preprocessing import RobustScaler

rob_scaler = RobustScaler()
robust_scaled_features = rob_scaler.fit_transform(original_features)
[65]:
robust_scaled_features[:5]
[65]:
array([[-0.07467532, -1.2096    ,  0.        ],
       [-0.69155844, -0.7936    ,  1.        ],
       [ 0.29823748,  0.384     ,  1.        ],
       [ 0.54591837,  0.2624    ,  0.        ],
       [ 0.63033395,  0.4544    ,  2.        ]])

We can confirm that the columns in the scaled data have 0 median.

[66]:
np.round(np.median(robust_scaled_features, axis=0))
[66]:
array([-0.,  0.,  0.])
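
For reference, the same scaling can be reproduced manually with the median and the IQR (a sketch, assuming the default quantile range of RobustScaler):

# Sketch: manual robust scaling using the median and the interquartile range
q1 = original_features.quantile(0.25)
q3 = original_features.quantile(0.75)
manual_robust = (original_features - original_features.median()) / (q3 - q1)
manual_robust.head()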

The MinMaxScaler, StandardScaler, and RobustScaler in scikit-learn are also called transformers, since they are used to perform various data transformations on the original dataset before feeding the data into a machine learning model. Transformers are an essential part of data processing in scikit-learn pipelines.

11.3 Preprocessing Categorical Data

Categorical data contain a limited number of discrete categories. An example is the feature who in the titanic dataset that has three categories: man, woman, and child.

In many cases, categorical features have text values, and they need to be converted into numerical values in order to be processed by machine learning algorithms.

We will look into the following techniques for converting categorical features into numerical features:

  • Mapping method

  • Ordinal encoding

  • Label encoding

  • Pandas dummies

  • One-hot encoding

The first three techniques produce a single number for each category, and the last two techniques produce a one-hot matrix.

We are going to use again the Titanic dataset, which has several categorical features.

[67]:
titanic.head()
[67]:
index survived pclass sex age sibsp parch fare who adult_male embark_town alone
0 0 0 3 male 22.0 1 0 7.2500 man True Southampton False
1 1 1 1 female 38.0 1 0 71.2833 woman False Cherbourg False
2 2 1 3 female 26.0 0 0 7.9250 woman False Southampton True
3 3 1 1 female 35.0 1 0 53.1000 woman False Southampton False
4 4 0 3 male 35.0 0 0 8.0500 man True Southampton True

11.3.1 Mapping Method

The mapping method is a straightforward way to encode categorical features when there are few categories. For instance, for the who feature, we will create a dictionary map_dict whose keys are the three categories man, woman, and child, mapped to the numerical values 0, 1, and 2.

[68]:
titanic['who'].value_counts()
[68]:
man      413
woman    216
child     83
Name: who, dtype: int64
[69]:
map_dict = {'man': 0, 'woman': 1, 'child': 2}

The map() method is applied next to replace each key with its value in the who column.

[70]:
titanic['who'] = titanic['who'].map(map_dict)

Now the who feature is numerical: all instances of the category man were replaced with 0, and the same applies to the other two categories.

[71]:
titanic.head()
[71]:
index survived pclass sex age sibsp parch fare who adult_male embark_town alone
0 0 0 3 male 22.0 1 0 7.2500 0 True Southampton False
1 1 1 1 female 38.0 1 0 71.2833 1 False Cherbourg False
2 2 1 3 female 26.0 0 0 7.9250 1 False Southampton True
3 3 1 1 female 35.0 1 0 53.1000 1 False Southampton False
4 4 0 3 male 35.0 0 0 8.0500 0 True Southampton True
[72]:
# verify that value_counts remained the same after the mapping
titanic['who'].value_counts()
[72]:
0    413
1    216
2     83
Name: who, dtype: int64

11.3.2 Ordinal Encoding

Ordinal encoding can be implemented with the OrdinalEncoder in scikit-learn, which will automatically encode each category with a different numerical value. This method is often preferred, since it is automated and less prone to errors.

Let’s apply it to the columns alone and adult_male.

[73]:
titanic['alone'].value_counts()
[73]:
True     402
False    310
Name: alone, dtype: int64
[74]:
titanic['adult_male'].value_counts()
[74]:
True     413
False    299
Name: adult_male, dtype: int64
[75]:
from sklearn.preprocessing import OrdinalEncoder

categs_feats = titanic[['adult_male', 'alone']]

encoder = OrdinalEncoder()

categs_encoded = encoder.fit_transform(categs_feats)

The output of the OrdinalEncoder is a NumPy array categs_encoded, shown below.

[76]:
categs_encoded
[76]:
array([[1., 0.],
       [0., 0.],
       [0., 1.],
       ...,
       [0., 1.],
       [1., 1.],
       [1., 1.]])

In the next cell, we convert the NumPy array categs_encoded into a pandas DataFrame. In this line, columns=categs_feats.columns specifies the column names for the DataFrame, and index=categs_feats.index specifies the row index for the DataFrame.

Note below that the text values in the columns alone and adult_male have been replaced with numeric values 0 or 1.

[77]:
titanic[['adult_male', 'alone']] = pd.DataFrame(categs_encoded, columns=categs_feats.columns, index=categs_feats.index)
titanic.head()
[77]:
index survived pclass sex age sibsp parch fare who adult_male embark_town alone
0 0 0 3 male 22.0 1 0 7.2500 0 1.0 Southampton 0.0
1 1 1 1 female 38.0 1 0 71.2833 1 0.0 Cherbourg 0.0
2 2 1 3 female 26.0 0 0 7.9250 1 0.0 Southampton 1.0
3 3 1 1 female 35.0 1 0 53.1000 1 0.0 Southampton 0.0
4 4 0 3 male 35.0 0 0 8.0500 0 1.0 Southampton 1.0
[78]:
# verify that value_counts remained the same after the mapping
titanic['alone'].value_counts()
[78]:
1.0    402
0.0    310
Name: alone, dtype: int64

We can also check the applied mapping between the categories and the numerical values via the attribute categories_.

[79]:
encoder.categories_
[79]:
[array([False,  True]), array([False,  True])]

Note that OrdinalEncoder cannot handle missing values, and if we try to apply it to a column with missing values, we will get an error.

Also, we need to be careful when applying this encoding strategy, because by default, OrdinalEncoder uses a lexicographical strategy to map string category labels to integers. For instance, suppose the dataset has a categorical variable named "size" with categories such as “S”, “M”, “L”, “XL”, and we would like the integer representation to respect the meaning of the sizes by mapping them to increasing integers such as 0, 1, 2, 3. However, the lexicographical strategy used by default would map the labels “S”, “M”, “L”, “XL” to 2, 1, 0, 3, by following the alphabetical order. To avoid that, we can pass a list with the expected order of the categories for each feature via the categories argument (e.g., encoder = OrdinalEncoder(categories=[[False, True]]) for the column alone).
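
For instance, the desired ordering could be passed explicitly, as in the following sketch with a hypothetical size column (not part of the Titanic data):

# Sketch with a hypothetical 'size' feature: enforce the order S < M < L < XL
sizes = pd.DataFrame({'size': ['S', 'XL', 'M', 'L']})
size_encoder = OrdinalEncoder(categories=[['S', 'M', 'L', 'XL']])
size_encoder.fit_transform(sizes)   # -> [[0.], [3.], [1.], [2.]]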

If a categorical variable does not carry any meaningful order information, then we can consider using one-hot encoding described in the sections below.

11.3.3 Label Encoding

Label encoding is used to encode categorical values in the target label column with the LabelEncoder in scikit-learn.

In this case, the target label column survived has numerical values, and it does not need to be encoded. Therefore, let’s apply the LabelEncoder to the embark_town column.

[80]:
from sklearn.preprocessing import LabelEncoder

embtown_feat = titanic[['embark_town']]

label_encoder = LabelEncoder()

embtown_encoded = label_encoder.fit_transform(embtown_feat)
C:\Users\vakanski\anaconda3\Lib\site-packages\sklearn\preprocessing\_label.py:114: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  y = column_or_1d(y, warn=True)

The warning above appears because LabelEncoder expects a one-dimensional array (e.g., titanic['embark_town']) rather than a DataFrame; the result is still correct. The output of LabelEncoder is also a NumPy array. Let’s convert it to a pandas DataFrame, and add it as a new column embark_town_ord.

[81]:
titanic['embark_town_ord'] = pd.DataFrame(embtown_encoded, columns=embtown_feat.columns, index=embtown_feat.index)

titanic.head()
[81]:
index survived pclass sex age sibsp parch fare who adult_male embark_town alone embark_town_ord
0 0 0 3 male 22.0 1 0 7.2500 0 1.0 Southampton 0.0 2
1 1 1 1 female 38.0 1 0 71.2833 1 0.0 Cherbourg 0.0 0
2 2 1 3 female 26.0 0 0 7.9250 1 0.0 Southampton 1.0 2
3 3 1 1 female 35.0 1 0 53.1000 1 0.0 Southampton 0.0 2
4 4 0 3 male 35.0 0 0 8.0500 0 1.0 Southampton 1.0 2
[82]:
label_encoder.classes_
[82]:
array(['Cherbourg', 'Queenstown', 'Southampton'], dtype=object)
[83]:
titanic['embark_town_ord'].value_counts()
[83]:
2    554
0    130
1     28
Name: embark_town_ord, dtype: int64

11.3.4 Pandas Dummies

Pandas provides a function get_dummies that can also be used to handle categorical features. This function creates a new column for each category found in a given column. For example, let’s apply it to the feature sex.

[84]:
titanic.head()
[84]:
index survived pclass sex age sibsp parch fare who adult_male embark_town alone embark_town_ord
0 0 0 3 male 22.0 1 0 7.2500 0 1.0 Southampton 0.0 2
1 1 1 1 female 38.0 1 0 71.2833 1 0.0 Cherbourg 0.0 0
2 2 1 3 female 26.0 0 0 7.9250 1 0.0 Southampton 1.0 2
3 3 1 1 female 35.0 1 0 53.1000 1 0.0 Southampton 0.0 2
4 4 0 3 male 35.0 0 0 8.0500 0 1.0 Southampton 1.0 2
[85]:
dummies = pd.get_dummies(titanic['sex'])
[86]:
titanic = pd.concat([titanic.drop('sex',axis=1),dummies], axis=1)

Note that new columns female and male with 0 or 1 values were added to the right of the DataFrame.

[87]:
titanic.head()
[87]:
index survived pclass age sibsp parch fare who adult_male embark_town alone embark_town_ord female male
0 0 0 3 22.0 1 0 7.2500 0 1.0 Southampton 0.0 2 0 1
1 1 1 1 38.0 1 0 71.2833 1 0.0 Cherbourg 0.0 0 1 0
2 2 1 3 26.0 0 0 7.9250 1 0.0 Southampton 1.0 2 1 0
3 3 1 1 35.0 1 0 53.1000 1 0.0 Southampton 0.0 2 1 0
4 4 0 3 35.0 0 0 8.0500 0 1.0 Southampton 1.0 2 0 1

This type of encoding is also called one-hot encoding: each category (unique value) in the column sex becomes a column, and for each row (sample), a 1 indicates the category to which it belongs.
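
Note that the female and male columns carry redundant information, since one is the complement of the other. If desired, get_dummies accepts drop_first=True to keep only one dummy column per feature, as in this small sketch:

# Sketch: drop_first=True keeps a single dummy column for a two-category feature
pd.get_dummies(pd.Series(['male', 'female', 'female'], name='sex'), drop_first=True)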

11.3.5 One-Hot Encoding

Scikit-learn provides a transformer OneHotEncoder that converts a feature into a one-hot matrix. As with the pandas dummies, additional columns corresponding to the values of the given categories are created.

Let’s apply it to embark_town.

[88]:
titanic['embark_town'].value_counts()
[88]:
Southampton    554
Cherbourg      130
Queenstown      28
Name: embark_town, dtype: int64
[89]:
titanic.head()
[89]:
index survived pclass age sibsp parch fare who adult_male embark_town alone embark_town_ord female male
0 0 0 3 22.0 1 0 7.2500 0 1.0 Southampton 0.0 2 0 1
1 1 1 1 38.0 1 0 71.2833 1 0.0 Cherbourg 0.0 0 1 0
2 2 1 3 26.0 0 0 7.9250 1 0.0 Southampton 1.0 2 1 0
3 3 1 1 35.0 1 0 53.1000 1 0.0 Southampton 0.0 2 1 0
4 4 0 3 35.0 0 0 8.0500 0 1.0 Southampton 1.0 2 0 1
[90]:
from sklearn.preprocessing import OneHotEncoder

one_hot = OneHotEncoder()

town_encoded = one_hot.fit_transform(titanic[['embark_town']])
[91]:
one_hot.categories_
[91]:
[array(['Cherbourg', 'Queenstown', 'Southampton'], dtype=object)]
[92]:
town_encoded
[92]:
<712x3 sparse matrix of type '<class 'numpy.float64'>'
        with 712 stored elements in Compressed Sparse Row format>

The output of OneHotEncoder is a sparse matrix. We will need to convert it into a NumPy array first, and afterward we can convert it into a pandas DataFrame.

[93]:
town_encoded = town_encoded.toarray()
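
Alternatively, the encoder can be asked to return a dense array directly; a sketch (note that the argument is named sparse in older scikit-learn versions and sparse_output from version 1.2 on):

# Sketch: request a dense NumPy array directly (scikit-learn 1.2+ uses sparse_output)
one_hot_dense = OneHotEncoder(sparse_output=False)
town_encoded_dense = one_hot_dense.fit_transform(titanic[['embark_town']])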
[94]:
columns = list(one_hot.categories_)

town_df = pd.DataFrame(town_encoded, columns=columns)

town_df.head()
[94]:
Cherbourg Queenstown Southampton
0 0.0 0.0 1.0
1 1.0 0.0 0.0
2 0.0 0.0 1.0
3 0.0 0.0 1.0
4 0.0 0.0 1.0
[95]:
titanic.drop('embark_town',axis=1, inplace=True)
[96]:
titanic[['Cherbourg', 'Queenstown', 'Southampton']] = town_df
[97]:
titanic.head()
[97]:
index survived pclass age sibsp parch fare who adult_male alone embark_town_ord female male Cherbourg Queenstown Southampton
0 0 0 3 22.0 1 0 7.2500 0 1.0 0.0 2 0 1 0.0 0.0 1.0
1 1 1 1 38.0 1 0 71.2833 1 0.0 0.0 0 1 0 1.0 0.0 0.0
2 2 1 3 26.0 0 0 7.9250 1 0.0 1.0 2 1 0 0.0 0.0 1.0
3 3 1 1 35.0 1 0 53.1000 1 0.0 0.0 2 1 0 0.0 0.0 1.0
4 4 0 3 35.0 0 0 8.0500 0 1.0 1.0 2 0 1 0.0 0.0 1.0

Choosing an encoding strategy

Choosing an encoding strategy depends on the models used and the type of categories (i.e., ordinal vs. nominal). In general, One-Hot Encoding is the strategy typically used when the downstream models are linear models, while Ordinal Encoding is often a good strategy with tree-based models.

With ordinal encoding, there is an order in the resulting categories, e.g. 0 < 1 < 2 (called ordinal categories). The impact of violating this ordering assumption is dependent on the downstream models. Linear models will be impacted by misordered categories, while tree-based models will not.

One-hot encoding is applied when the ordering of the categories is not important. Such categories are also called nominal categories. This encoding can cause computational inefficiency in tree-based models when there is a high number of categories, and because of this, it is not recommended for use with these models.

11.4 Combining Numerical and Categorical Features

Now let’s prepare the numerical and categorical data in the titanic dataset and train a classification model.

First, assign the survived column to be the target label y.

[98]:
y = titanic['survived']

We will use the other columns to be data features X, therefore let’s drop the survived column and the index column.

It is very important to always remove the index column from the data used for training a model. If we leave the index column in the data, the model can learn to associate the target labels with the index of each data point (row).

[99]:
X = titanic.drop(['survived', 'index'], axis=1)
[100]:
X
[100]:
pclass age sibsp parch fare who adult_male alone embark_town_ord female male Cherbourg Queenstown Southampton
0 3 22.0 1 0 7.2500 0 1.0 0.0 2 0 1 0.0 0.0 1.0
1 1 38.0 1 0 71.2833 1 0.0 0.0 0 1 0 1.0 0.0 0.0
2 3 26.0 0 0 7.9250 1 0.0 1.0 2 1 0 0.0 0.0 1.0
3 1 35.0 1 0 53.1000 1 0.0 0.0 2 1 0 0.0 0.0 1.0
4 3 35.0 0 0 8.0500 0 1.0 1.0 2 0 1 0.0 0.0 1.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
707 3 39.0 0 5 29.1250 1 0.0 0.0 1 1 0 0.0 1.0 0.0
708 2 27.0 0 0 13.0000 0 1.0 1.0 2 0 1 0.0 0.0 1.0
709 1 19.0 0 0 30.0000 1 0.0 1.0 2 1 0 0.0 0.0 1.0
710 1 26.0 0 0 30.0000 0 1.0 1.0 0 0 1 1.0 0.0 0.0
711 3 32.0 0 0 7.7500 0 1.0 1.0 1 0 1 0.0 1.0 0.0

712 rows × 14 columns

We will first split the dataset into a training and test dataset. This step will be explained in more detail in the next lectures. The objective of this lecture is to learn how to preprocess the data and prepare it for model fitting.

[101]:
from sklearn.model_selection import train_test_split

# split into train & test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=123, stratify=y)
[102]:
print('Training data inputs', X_train.shape)
print('Training labels', y_train.shape)
print('Testing data inputs', X_test.shape)
print('Testing labels', y_test.shape)
Training data inputs (534, 14)
Training labels (534,)
Testing data inputs (178, 14)
Testing labels (178,)

We will apply standard scaling to the training and test datasets. Note that the scaler is fit only on the training data, and the same transformation is then applied to both the training and test data.

[103]:
# Apply StandardScaler
scaler = StandardScaler()

scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

Now, let’s apply the k-Nearest Neighbors classifier from the scikit-learn library. The function fit() is used to fit the model to the data.

[104]:
from sklearn import neighbors

# create the model
knn_model = neighbors.KNeighborsClassifier(n_neighbors=5)

# fit the model
knn_model.fit(X_train_scaled, y_train)
[104]:
KNeighborsClassifier()

The function score() is used to compute the accuracy of the model. The model can predict whether a passenger survived with an accuracy of 82.58%.

[105]:
# score on test set
accuracy = knn_model.score(X_test_scaled, y_test)
print('The test accuracy of k-Nearest Neighbors is {0:5.2f} %'.format(accuracy*100))
The test accuracy of k-Nearest Neighbors is 82.58 %
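
As a side note, the scaler and the classifier can also be chained in a scikit-learn Pipeline, which applies the same steps as above; a short sketch (it should give the same test accuracy):

# Sketch: chain the preprocessing and the model with a Pipeline
from sklearn.pipeline import make_pipeline

knn_pipeline = make_pipeline(StandardScaler(), neighbors.KNeighborsClassifier(n_neighbors=5))
knn_pipeline.fit(X_train, y_train)
print('Pipeline test accuracy: {0:5.2f} %'.format(knn_pipeline.score(X_test, y_test)*100))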

Next, let’s use the trained model to predict the survived labels for the first 10 passengers in the X_test dataset.

[106]:
# Make predictions on the test data
y_pred = knn_model.predict(X_test_scaled[:10])
[107]:
# Show the predictions
y_pred
[107]:
array([0, 0, 0, 1, 1, 1, 1, 0, 1, 0], dtype=int64)
[108]:
# Show the actual values from the 'survived' column
np.array(y_test[:10])
[108]:
array([0, 1, 0, 0, 0, 1, 1, 0, 1, 0], dtype=int64)

As we can see, the model correctly predicted the target label for 7 of the first 10 samples.

We will have a separate lecture on scikit-learn in which we will explain in more detail how to perform classification with machine learning models.

