Lecture 11 - Data Exploration and Preprocessing
11.1 Exploratory Data Analysis
Exploratory Data Analysis (EDA) is an important step in every data science project, and involves a series of steps aimed at obtaining a better understanding of the data.
EDA typically includes: inspecting the summary statistics of the data, checking whether there are missing values and adopting an appropriate strategy for handling them, checking the distribution of the features and whether there is a correlation between features, and understanding which features are important and worth keeping and which ones are less important.
To provide an example of EDA, we will use the Titanic dataset, which can be loaded from the Seaborn datasets.
[1]:
# import libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
[2]:
titanic = sns.load_dataset('titanic')
Let’s check the basic information about the data. There are 891 rows (samples) and 15 columns (features). We can see below the data types of each column.
[3]:
titanic.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 survived 891 non-null int64
1 pclass 891 non-null int64
2 sex 891 non-null object
3 age 714 non-null float64
4 sibsp 891 non-null int64
5 parch 891 non-null int64
6 fare 891 non-null float64
7 embarked 889 non-null object
8 class 891 non-null category
9 who 891 non-null object
10 adult_male 891 non-null bool
11 deck 203 non-null category
12 embark_town 889 non-null object
13 alive 891 non-null object
14 alone 891 non-null bool
dtypes: bool(2), category(2), float64(2), int64(4), object(5)
memory usage: 80.7+ KB
Let’s display the first five rows and the last five rows. As we can notice, each row contains the data for one passenger on the Titanic.
[4]:
titanic.head()
[4]:
survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.2500 | S | Third | man | True | NaN | Southampton | no | False |
1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | woman | False | C | Cherbourg | yes | False |
2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.9250 | S | Third | woman | False | NaN | Southampton | yes | True |
3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1000 | S | First | woman | False | C | Southampton | yes | False |
4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.0500 | S | Third | man | True | NaN | Southampton | no | True |
[5]:
titanic.tail()
[5]:
survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
886 | 0 | 2 | male | 27.0 | 0 | 0 | 13.00 | S | Second | man | True | NaN | Southampton | no | True |
887 | 1 | 1 | female | 19.0 | 0 | 0 | 30.00 | S | First | woman | False | B | Southampton | yes | True |
888 | 0 | 3 | female | NaN | 1 | 2 | 23.45 | S | Third | woman | False | NaN | Southampton | no | False |
889 | 1 | 1 | male | 26.0 | 0 | 0 | 30.00 | C | First | man | True | C | Cherbourg | yes | True |
890 | 0 | 3 | male | 32.0 | 0 | 0 | 7.75 | Q | Third | man | True | NaN | Queenstown | no | True |
Let’s also see the summary statistics. Recall that statistics are shown only for the columns with numerical data.
[6]:
titanic.describe()
[6]:
survived | pclass | age | sibsp | parch | fare | |
---|---|---|---|---|---|---|
count | 891.000000 | 891.000000 | 714.000000 | 891.000000 | 891.000000 | 891.000000 |
mean | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
std | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
min | 0.000000 | 1.000000 | 0.420000 | 0.000000 | 0.000000 | 0.000000 |
25% | 0.000000 | 2.000000 | 20.125000 | 0.000000 | 0.000000 | 7.910400 |
50% | 0.000000 | 3.000000 | 28.000000 | 0.000000 | 0.000000 | 14.454200 |
75% | 1.000000 | 3.000000 | 38.000000 | 1.000000 | 0.000000 | 31.000000 |
max | 1.000000 | 3.000000 | 80.000000 | 8.000000 | 6.000000 | 512.329200 |
Let’s assume that our task is to predict whether each passenger survived. I.e., we will take the survived column to be the target, and we will implement a classification algorithm to predict it based on the other columns in the dataset.
[7]:
titanic.head(1)
[7]:
survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | S | Third | man | True | NaN | Southampton | no | False |
Explore Column Information
Let’s first inspect some of the columns in the dataset.
For instance, we can use the following code to find how many passengers survived and how many died.
[8]:
titanic['survived'].value_counts()
[8]:
0 549
1 342
Name: survived, dtype: int64
It is often easier to understand the data if we plot the values. The pandas library provides basic plotting functions. For instance, in the next cell we created a bar plot by using plot(kind='bar') directly in pandas. The syntax for plotting in pandas is somewhat different from the matplotlib functions, and admittedly, the plotting functionality in pandas is limited. We will learn later about the Seaborn library, which allows plotting directly from DataFrames and provides improved visualizations in comparison to pandas plots.
[9]:
titanic['survived'].value_counts().plot(kind='bar')
[9]:
<Axes: >
Notice that there is a column called alive, which duplicates the survived column. We need to remove this column from the data, otherwise the classifier will just use it to make predictions for the survived passengers, and will achieve 100% accuracy.
[10]:
titanic['alive'].value_counts()
[10]:
no 549
yes 342
Name: alive, dtype: int64
[11]:
titanic.drop(['alive'], axis=1, inplace=True)
[12]:
# verify the change
titanic.head()
[12]:
survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alone | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.2500 | S | Third | man | True | NaN | Southampton | False |
1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | woman | False | C | Cherbourg | False |
2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.9250 | S | Third | woman | False | NaN | Southampton | True |
3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1000 | S | First | woman | False | C | Southampton | False |
4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.0500 | S | Third | man | True | NaN | Southampton | True |
Also, there are two columns called class and pclass. Let’s examine the counts of passengers in these columns.
[13]:
titanic['pclass'].value_counts()
[13]:
3 491
1 216
2 184
Name: pclass, dtype: int64
[14]:
titanic['class'].value_counts()
[14]:
Third 491
First 216
Second 184
Name: class, dtype: int64
[15]:
# compare the two columns
p_class = titanic[['pclass', 'class']]
p_class.head()
[15]:
pclass | class | |
---|---|---|
0 | 3 | Third |
1 | 1 | First |
2 | 3 | Third |
3 | 1 | First |
4 | 3 | Third |
It seems that both of these columns contain the same information, except that one is numeric and the other contains text. Let’s drop the class column.
[16]:
titanic.drop(['class'], axis=1, inplace=True)
[17]:
# verify the change
titanic.head()
[17]:
survived | pclass | sex | age | sibsp | parch | fare | embarked | who | adult_male | deck | embark_town | alone | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.2500 | S | man | True | NaN | Southampton | False |
1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | woman | False | C | Cherbourg | False |
2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.9250 | S | woman | False | NaN | Southampton | True |
3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1000 | S | woman | False | C | Southampton | False |
4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.0500 | S | man | True | NaN | Southampton | True |
Let’s explore the two columns embarked and embark_town.
[18]:
titanic['embarked'].value_counts()
[18]:
S 644
C 168
Q 77
Name: embarked, dtype: int64
[19]:
titanic['embark_town'].value_counts()
[19]:
Southampton 644
Cherbourg 168
Queenstown 77
Name: embark_town, dtype: int64
They contain the same information, therefore, let’s drop embarked.
[20]:
titanic.drop(['embarked'], axis=1, inplace=True)
[21]:
# verify the change
titanic.head()
[21]:
survived | pclass | sex | age | sibsp | parch | fare | who | adult_male | deck | embark_town | alone | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.2500 | man | True | NaN | Southampton | False |
1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | woman | False | C | Cherbourg | False |
2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.9250 | woman | False | NaN | Southampton | True |
3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1000 | woman | False | C | Southampton | False |
4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.0500 | man | True | NaN | Southampton | True |
Let’s plot the occurrences in embark_town in a bar plot.
[22]:
titanic['embark_town'].value_counts().plot(kind='bar')
plt.show()
Also, let’s check how many men and women there are in the dataset.
[23]:
titanic['sex'].value_counts()
[23]:
male 577
female 314
Name: sex, dtype: int64
[24]:
titanic['sex'].value_counts().plot(kind='bar')
[24]:
<Axes: >
[25]:
titanic['who'].value_counts()
[25]:
man 537
woman 271
child 83
Name: who, dtype: int64
We can note that the column who is similar to the sex column, but it also places children into a separate category.
As an exercise, let’s show the categories of the column who using a pie chart to visualize their values.
[26]:
titanic.who.value_counts().plot(kind='pie')
plt.show()
There is another column, adult_male, which is similar to, but different from, sex and who.
[27]:
titanic['adult_male'].value_counts()
[27]:
True 537
False 354
Name: adult_male, dtype: int64
Missing Data
Let’s check which columns have data missing.
[28]:
titanic.isnull().sum()
[28]:
survived 0
pclass 0
sex 0
age 177
sibsp 0
parch 0
fare 0
who 0
adult_male 0
deck 688
embark_town 2
alone 0
dtype: int64
There are missing data in the age, deck, and embark_town columns.
The deck column has missing values in most of the rows (only 203 of 891 are present), and the deck on which the passenger’s cabin was located is probably not very important for the task of predicting the survived passengers, thus, let’s drop it.
[29]:
titanic.drop(['deck'], axis=1, inplace=True)
[30]:
# verify the change
titanic.head()
[30]:
survived | pclass | sex | age | sibsp | parch | fare | who | adult_male | embark_town | alone | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.2500 | man | True | Southampton | False |
1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | woman | False | Cherbourg | False |
2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.9250 | woman | False | Southampton | True |
3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1000 | woman | False | Southampton | False |
4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.0500 | man | True | Southampton | True |
Since only 2 values are missing in embark_town, let’s remove those two rows. In the next cell we used dropna to remove only the rows that have missing values in the embark_town column.
Recall again that to drop columns in pandas we use axis=1, and to drop rows we use axis=0.
[31]:
titanic.dropna(subset=['embark_town'], axis=0, inplace=True)
We can notice that the number of rows was reduced from the original 891 to 889, because of the 2 removed rows.
[32]:
# verify the change
titanic.shape
[32]:
(889, 11)
Now we have NaN values only in the age column.
[33]:
titanic.isnull().sum()
[33]:
survived 0
pclass 0
sex 0
age 177
sibsp 0
parch 0
fare 0
who 0
adult_male 0
embark_town 0
alone 0
dtype: int64
There are several ways to deal with this. One is to replace the missing values in the age column with the average age of the passengers, or with some other value (e.g., 0 in some cases).
Let’s first explore the first option, and create a new DataFrame called titanic_filled in which the missing values in the age column are replaced with the average age. For this purpose we will pass the column means to the method fillna, which will fill in the missing values.
[34]:
titanic_filled = titanic.fillna(titanic.mean(axis=0))
C:\Users\vakanski\AppData\Local\Temp\ipykernel_10716\1546117764.py:1: FutureWarning: The default value of numeric_only in DataFrame.mean is deprecated. In a future version, it will default to False. In addition, specifying 'numeric_only=None' is deprecated. Select only valid columns or specify the value of numeric_only to silence this warning.
titanic_filled = titanic.fillna(titanic.mean(axis=0))
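As the warning above suggests, we can avoid it by computing the mean only over the numeric columns. A minimal sketch, assuming a pandas version that supports the numeric_only argument of mean:
# compute the mean only over numeric columns to avoid the FutureWarning
titanic_filled = titanic.fillna(titanic.mean(numeric_only=True))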
To verify the above, let’s display several rows that have missing values for the age, and let’s display below the DataFrame with the filled values. We can note that the average age is 29.64 years.
[35]:
titanic.head(8)
[35]:
survived | pclass | sex | age | sibsp | parch | fare | who | adult_male | embark_town | alone | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.2500 | man | True | Southampton | False |
1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | woman | False | Cherbourg | False |
2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.9250 | woman | False | Southampton | True |
3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1000 | woman | False | Southampton | False |
4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.0500 | man | True | Southampton | True |
5 | 0 | 3 | male | NaN | 0 | 0 | 8.4583 | man | True | Queenstown | True |
6 | 0 | 1 | male | 54.0 | 0 | 0 | 51.8625 | man | True | Southampton | True |
7 | 0 | 3 | male | 2.0 | 3 | 1 | 21.0750 | child | False | Southampton | False |
[36]:
titanic_filled.head(8)
[36]:
survived | pclass | sex | age | sibsp | parch | fare | who | adult_male | embark_town | alone | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 3 | male | 22.000000 | 1 | 0 | 7.2500 | man | True | Southampton | False |
1 | 1 | 1 | female | 38.000000 | 1 | 0 | 71.2833 | woman | False | Cherbourg | False |
2 | 1 | 3 | female | 26.000000 | 0 | 0 | 7.9250 | woman | False | Southampton | True |
3 | 1 | 1 | female | 35.000000 | 1 | 0 | 53.1000 | woman | False | Southampton | False |
4 | 0 | 3 | male | 35.000000 | 0 | 0 | 8.0500 | man | True | Southampton | True |
5 | 0 | 3 | male | 29.642093 | 0 | 0 | 8.4583 | man | True | Queenstown | True |
6 | 0 | 1 | male | 54.000000 | 0 | 0 | 51.8625 | man | True | Southampton | True |
7 | 0 | 3 | male | 2.000000 | 3 | 1 | 21.0750 | child | False | Southampton | False |
We can observe now that there are no missing values in the titanic_filled DataFrame.
[37]:
titanic_filled.isnull().sum()
[37]:
survived 0
pclass 0
sex 0
age 0
sibsp 0
parch 0
fare 0
who 0
adult_male 0
embark_town 0
alone 0
dtype: int64
Another alternative is to drop the rows with missing values for the age. Let’s explore this strategy as well.
In general, we can try both strategies and check which one produces better results with the classification algorithm.
After dropping the rows with missing values, 712 rows remain in the dataset. The method reset_index will change the index column to range from 0 to 711. If we didn’t reset the index, the index values would still range between 0 and 890, with gaps for the dropped rows.
[38]:
titanic.dropna(inplace=True)
[39]:
titanic.reset_index(inplace=True)
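Note that by default reset_index keeps the old index as a new column named index, which we will have to drop later before training a model. A minimal sketch of an alternative that discards the old index right away:
# alternative: discard the old index instead of keeping it as a column
# titanic.reset_index(drop=True, inplace=True)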
[40]:
titanic.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 712 entries, 0 to 711
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 index 712 non-null int64
1 survived 712 non-null int64
2 pclass 712 non-null int64
3 sex 712 non-null object
4 age 712 non-null float64
5 sibsp 712 non-null int64
6 parch 712 non-null int64
7 fare 712 non-null float64
8 who 712 non-null object
9 adult_male 712 non-null bool
10 embark_town 712 non-null object
11 alone 712 non-null bool
dtypes: bool(2), float64(2), int64(5), object(3)
memory usage: 57.1+ KB
[41]:
titanic.head()
[41]:
index | survived | pclass | sex | age | sibsp | parch | fare | who | adult_male | embark_town | alone | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.2500 | man | True | Southampton | False |
1 | 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | woman | False | Cherbourg | False |
2 | 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.9250 | woman | False | Southampton | True |
3 | 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1000 | woman | False | Southampton | False |
4 | 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.0500 | man | True | Southampton | True |
Checking Feature Distribution
Let’s check the distribution of the numerical columns in the dataset by plotting their histograms. For the columns with categorical data, such as survived and pclass, the histograms are equivalent to bar plots, and are less helpful.
[42]:
titanic[['survived','pclass','age','sibsp','parch','fare']].hist(bins=10)
plt.tight_layout()
plt.show()
Or, we can inspect the distribution of each feature.
[43]:
titanic['age'].plot(kind='hist', bins=30)
plt.show()
If we wish, we can use the groupby function in pandas, for instance, to show the counts for the pclass column grouped by survived.
[44]:
titanic.groupby('survived')['pclass'].value_counts().plot(kind="bar")
plt.show()
As we mentioned earlier, although the pandas library provides some functionality for plotting directly from DataFrames, there are other plotting libraries that provide improved graphs. Among the most popular is Seaborn. A few plots created with Seaborn are shown below.
[45]:
sns.countplot(data=titanic, x='survived', hue='pclass')
plt.show()
[46]:
sns.countplot(data=titanic, x='survived', hue='sex')
plt.show()
In the next figure we can see a scatter plot of age against fare, colored by pclass. As expected, the fare in the first class was more expensive, in comparison to the second and third classes.
[47]:
sns.scatterplot(data=titanic, x='age', y='fare', hue='pclass', palette='viridis')
plt.show()
11.2 Preprocessing Numerical Data
Tabular data can be classified into two main categories:
Numerical data: a quantity represented by a real or integer number.
Categorical data: a discrete value, typically represented by string labels taken from a finite list of possible choices, but it is also possible to be represented by numbers from a discrete set of possible choices.
Most machine learning algorithms are sensitive to the range of values that are used for numerical inputs, and expect the input features to be scaled before processing. Feature scaling transforms the numerical features into a small range of values.
Common feature scaling techniques for numerical features include:
Normalization
Standardization
Robust scaling
Whether or not a machine learning model requires scaling of the features depends on the model family. Linear models, such as logistic regression, generally benefit from scaling the features, while other models such as tree-based models (i.e., decision trees, random forests) do not need such preprocessing.
11.2.1 Normalization
Normalization is a scaling technique that transforms numerical features into a range of values between 0 and 1. When we work with features that have different ranges of values, normalizing the features can be a good practice. For example, if we have one feature (column) in the range from 100-1000, and another feature varies from 0.05-0.2, we can scale them so that they both have a range of values from 0 to 1.
Normalizing data is performed using the following formula, where \(X_{min}\) is the minimum value of feature \(X\), and \(X_{max}\) is the maximum value of \(X\):

\(X_{norm} = \frac{X - X_{min}}{X_{max} - X_{min}}\)
For illustration purposes, we will use a smaller dataset called tips, available in Seaborn.
[52]:
tip_data = sns.load_dataset('tips')
tip_data.head()
[52]:
total_bill | tip | sex | smoker | day | time | size | |
---|---|---|---|---|---|---|---|
0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 |
3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |
Let’s separate all numerical features from the above data into a new DataFrame original_features.
[53]:
original_features = tip_data[['total_bill', 'tip', 'size']]
To perform normalization, we will use the scikit-learn library, which provides the function MinMaxScaler() to scale the data to the range between 0 and 1. That is the default range, but we can also select an arbitrary range to scale the data.
The fit method in the code below first fits the data, i.e., for this task it calculates the minimum and maximum values for each column. Afterwards, the transform method scales the data, i.e., it substitutes the calculated minimum and maximum values for each column into the above formula to obtain the scaled values. The syntax is scaler.fit(data) and scaler.transform(data).
[54]:
from sklearn.preprocessing import MinMaxScaler
minmax_scaler = MinMaxScaler()
minmax_scaler.fit(original_features)
normalized_features = minmax_scaler.transform(original_features)
[55]:
# Show the first five rows in 'total_bill', 'tip', and 'size'
normalized_features[:5]
[55]:
array([[0.29157939, 0.00111111, 0.2 ],
[0.1522832 , 0.07333333, 0.4 ],
[0.3757855 , 0.27777778, 0.4 ],
[0.43171345, 0.25666667, 0.2 ],
[0.45077503, 0.29 , 0.6 ]])
The scikit-learn library also provides a combined method fit_transform, which first calls fit and then transform in one step. This can be more efficient than calling fit and transform separately. The syntax is scaler.fit_transform(data).
[56]:
normalized_features_2 = minmax_scaler.fit_transform(original_features)
[57]:
# Show the first five rows in 'total_bill', 'tip', and 'size'
normalized_features_2[:5]
[57]:
array([[0.29157939, 0.00111111, 0.2 ],
[0.1522832 , 0.07333333, 0.4 ],
[0.3757855 , 0.27777778, 0.4 ],
[0.43171345, 0.25666667, 0.2 ],
[0.45077503, 0.29 , 0.6 ]])
The output of MinMaxScaler() is a NumPy array, with the values in each column scaled to the range between 0 and 1.
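As a quick sanity check, we can verify that each scaled column spans the range [0, 1], and reproduce the result directly from the normalization formula:
# each scaled column spans exactly [0, 1]
print(normalized_features.min(axis=0))   # -> [0. 0. 0.]
print(normalized_features.max(axis=0))   # -> [1. 1. 1.]

# manual normalization: (X - X_min) / (X_max - X_min)
manual = (original_features - original_features.min()) / (original_features.max() - original_features.min())
print(np.allclose(manual, normalized_features))   # expected: True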
11.2.2 Standardization
Standardization is another scaling technique, where numerical features are rescaled to have a mean (\(\mu\)) of 0 and a standard deviation (\(\sigma\)) of 1.
The formula for standardization is as follows:

\(X_{std} = \frac{X - \mu}{\sigma}\)

where \(X_{std}\) is the standardized feature, \(X\) is the original feature, \(\mu\) is the mean of the feature, and \(\sigma\) is the standard deviation.
When should we standardize the features? Standardization is most appropriate when we know that the training data has a normal (Gaussian) distribution. If the data does not have a normal distribution, then normalization is a preferable scaling technique to standardization.
With some machine learning algorithms, the performance will be the same whether the features are scaled with normalization or standardization, but with other algorithms there can be a difference in performance. Therefore, in some cases we can try both feature scaling techniques, especially if we are not sure about the distribution of the data.
Standardization is implemented in scikit-learn with StandardScaler. Similar to MinMaxScaler, we can either use the syntax scaler.fit_transform(data), or the syntax with scaler.fit(data) and scaler.transform(data). We will explain the difference between these two syntaxes in the next lectures, when we introduce the concepts of training and testing datasets.
[58]:
from sklearn.preprocessing import StandardScaler
std_scaler = StandardScaler()
standardized_features = std_scaler.fit_transform(original_features)
[59]:
standardized_features[:5]
[59]:
array([[-0.31471131, -1.43994695, -0.60019263],
[-1.06323531, -0.96920534, 0.45338292],
[ 0.1377799 , 0.36335554, 0.45338292],
[ 0.4383151 , 0.22575414, -0.60019263],
[ 0.5407447 , 0.4430195 , 1.50695847]])
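We can reproduce these values directly from the formula. Note that StandardScaler uses the population standard deviation (ddof=0), while pandas defaults to the sample standard deviation (ddof=1), so we pass ddof=0 in this check:
# manual standardization: (X - mu) / sigma, with ddof=0 to match StandardScaler
manual = (original_features - original_features.mean()) / original_features.std(ddof=0)
print(np.allclose(manual, standardized_features))   # expected: True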
We can inspect the mean and variance of the original data using the mean_ and var_ attributes. The convention in scikit-learn is that if an attribute is learned from the data, its name ends with an underscore, as in mean_ and var_ for the StandardScaler.
[60]:
# The mean of each feature in the original data
std_scaler.mean_
[60]:
array([19.78594262, 2.99827869, 2.56967213])
[61]:
# The variance of each feature in the original data
std_scaler.var_
[61]:
array([78.92813149, 1.90660851, 0.9008835 ])
As expected, each column in the scaled data standardized_features has zero mean and unit variance.
[62]:
# The mean of each feature in the scaled data
np.round(standardized_features.mean(axis=0))
[62]:
array([-0., 0., -0.])
[63]:
# The standard deviation of each feature in the scaled data
np.round(standardized_features.std(axis=0))
[63]:
array([1., 1., 1.])
It is easy to confuse normalization with standardization. Standardization rescales the data to have mean 0 and standard deviation 1 (it does not change the shape of the distribution), while normalization rescales the data to the range between 0 and 1, so pay attention and try not to confuse these two scaling techniques.
11.2.3 Robust Scaling
Scikit-learn provides another scaling method called RobustScaler, which is more suitable when the data contain many outliers.
RobustScaler applies a similar scaling to standardization, but it uses the median and the interquartile range (IQR) instead of the mean and standard deviation to scale the features. This makes it less sensitive to extreme values or outliers in the data. Recall that the interquartile range (IQR) is the difference between the 3rd quartile (75th percentile) and the 1st quartile (25th percentile).
Because of that, RobustScaler is also more suitable for datasets with non-normal distributions.
[64]:
from sklearn.preprocessing import RobustScaler
rob_scaler = RobustScaler()
robust_scaled_features = rob_scaler.fit_transform(original_features)
[65]:
robust_scaled_features[:5]
[65]:
array([[-0.07467532, -1.2096 , 0. ],
[-0.69155844, -0.7936 , 1. ],
[ 0.29823748, 0.384 , 1. ],
[ 0.54591837, 0.2624 , 0. ],
[ 0.63033395, 0.4544 , 2. ]])
We can confirm that the columns in the scaled data have a median of 0.
[66]:
np.round(np.median(robust_scaled_features, axis=0))
[66]:
array([-0., 0., 0.])
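We can also reproduce the RobustScaler output manually. A minimal sketch, assuming the default settings (centering by the median, scaling by the IQR):
# manual robust scaling: (X - median) / IQR
q1 = original_features.quantile(0.25)
q3 = original_features.quantile(0.75)
manual = (original_features - original_features.median()) / (q3 - q1)
print(np.allclose(manual, robust_scaled_features))   # expected: True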
The MinMaxScaler, StandardScaler, and RobustScaler in scikit-learn are also called transformers, since they are used to perform various data transformations on the original dataset before feeding the data into a machine learning model. Transformers are an essential part of the data processing in scikit-learn’s pipelines.
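For example, a transformer can be chained with a model into a single Pipeline object, so that scaling and model fitting happen in one call. A minimal sketch (LogisticRegression is chosen here only for illustration):
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# chain a scaler (transformer) and a classifier into a single estimator
pipe = Pipeline([('scaler', StandardScaler()), ('model', LogisticRegression())])
# pipe.fit(X_train, y_train) would scale the data and then fit the model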
11.3 Preprocessing Categorical Data
Categorical data contain a limited number of discrete categories. An example is the feature who in the titanic dataset, which has three categories: man, woman, and child.
In many cases, categorical features have text values, and they need to be converted into numerical values in order to be processed by machine learning algorithms.
We will look into the following techniques for converting categorical features into numerical features:
Mapping method
Ordinal encoding
Label encoding
Pandas dummies
One-hot encoding
The first three techniques produce a single number for each category, and the last two techniques produce a one-hot matrix.
We are going to use the Titanic dataset again, since it has several categorical features.
[67]:
titanic.head()
[67]:
index | survived | pclass | sex | age | sibsp | parch | fare | who | adult_male | embark_town | alone | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.2500 | man | True | Southampton | False |
1 | 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | woman | False | Cherbourg | False |
2 | 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.9250 | woman | False | Southampton | True |
3 | 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1000 | woman | False | Southampton | False |
4 | 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.0500 | man | True | Southampton | True |
11.3.1 Mapping Method
The mapping method is a straightforward way to encode categorical features when there are few categories. For instance, for the who feature, we will create a dictionary map_dict whose keys are the three categories man, woman, and child, mapped to the numerical values 0, 1, and 2.
[68]:
titanic['who'].value_counts()
[68]:
man 413
woman 216
child 83
Name: who, dtype: int64
[69]:
map_dict = {'man': 0, 'woman': 1, 'child': 2}
The map() method is applied next to map the keys to values in the who column.
[70]:
titanic['who'] = titanic['who'].map(map_dict)
Now the who feature is numerical: all instances of the class man were replaced with 0, and the same applies to the other two classes. Note that any value not present in map_dict would be replaced with NaN, so the dictionary must cover all categories in the column.
[71]:
titanic.head()
[71]:
index | survived | pclass | sex | age | sibsp | parch | fare | who | adult_male | embark_town | alone | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.2500 | 0 | True | Southampton | False |
1 | 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | 1 | False | Cherbourg | False |
2 | 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.9250 | 1 | False | Southampton | True |
3 | 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1000 | 1 | False | Southampton | False |
4 | 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.0500 | 0 | True | Southampton | True |
[72]:
# verify that value_counts remained the same after the mapping
titanic['who'].value_counts()
[72]:
0 413
1 216
2 83
Name: who, dtype: int64
11.3.2 Ordinal Encoding
Ordinal encoding can be implemented with the OrdinalEncoder in scikit-learn, which will automatically encode each category with a different numerical value. This method is often preferred, since it is automated and less prone to errors.
Let’s apply it to the columns alone and adult_male.
[73]:
titanic['alone'].value_counts()
[73]:
True 402
False 310
Name: alone, dtype: int64
[74]:
titanic['adult_male'].value_counts()
[74]:
True 413
False 299
Name: adult_male, dtype: int64
[75]:
from sklearn.preprocessing import OrdinalEncoder
categs_feats = titanic[['adult_male', 'alone']]
encoder = OrdinalEncoder()
categs_encoded = encoder.fit_transform(categs_feats)
The output of the OrdinalEncoder is a NumPy array categs_encoded, shown below.
[76]:
categs_encoded
[76]:
array([[1., 0.],
[0., 0.],
[0., 1.],
...,
[0., 1.],
[1., 1.],
[1., 1.]])
In the next cell, we will convert the NumPy array categs_encoded into a pandas DataFrame. In this line, columns=categs_feats.columns specifies the column names for the DataFrame, and index=categs_feats.index specifies the row index for the DataFrame.
Note below that the boolean values in the columns alone and adult_male have been replaced with the numeric values 0 and 1.
[77]:
titanic[['adult_male', 'alone']] = pd.DataFrame(categs_encoded, columns=categs_feats.columns, index=categs_feats.index)
titanic.head()
[77]:
index | survived | pclass | sex | age | sibsp | parch | fare | who | adult_male | embark_town | alone | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.2500 | 0 | 1.0 | Southampton | 0.0 |
1 | 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | 1 | 0.0 | Cherbourg | 0.0 |
2 | 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.9250 | 1 | 0.0 | Southampton | 1.0 |
3 | 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1000 | 1 | 0.0 | Southampton | 0.0 |
4 | 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.0500 | 0 | 1.0 | Southampton | 1.0 |
[78]:
# verify that value_counts remained the same after the encoding
titanic['alone'].value_counts()
[78]:
1.0 402
0.0 310
Name: alone, dtype: int64
We can also check the applied mapping between the categories and the numerical values via the attribute categories_.
[79]:
encoder.categories_
[79]:
[array([False, True]), array([False, True])]
Note that OrdinalEncoder cannot handle missing values, and if we try to apply it to a column with missing values, we will get an error.
Also, we need to be careful when applying this encoding strategy, because by default, OrdinalEncoder uses a lexicographical strategy to map string category labels to integers. For instance, suppose the dataset has a categorical variable named "size" with categories such as “S”, “M”, “L”, “XL”, and we would like the integer representation to respect the meaning of the sizes by mapping them to increasing integers such as 0, 1, 2, 3. However, the default lexicographical strategy would map the labels “S”, “M”, “L”, “XL” to 2, 1, 0, 3, by following the alphabetical order. To avoid that, we can pass a list with the expected order of the categories for each feature via the categories argument, as shown in the sketch below.
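A minimal sketch with a hypothetical size column:
from sklearn.preprocessing import OrdinalEncoder

# hypothetical 'size' column with a meaningful category order
sizes = pd.DataFrame({'size': ['S', 'XL', 'M', 'L']})
size_encoder = OrdinalEncoder(categories=[['S', 'M', 'L', 'XL']])
print(size_encoder.fit_transform(sizes))   # -> [[0.], [3.], [1.], [2.]]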
If a categorical variable does not carry any meaningful order information, then we can consider using one-hot encoding described in the sections below.
11.3.3 Label Encoding
Label encoding is used to encode categorical values in the target label column with the LabelEncoder in scikit-learn.
In this case, the target label column survived has numerical values, and it does not need to be encoded. Therefore, let’s apply the LabelEncoder to the embark_town column instead.
[80]:
from sklearn.preprocessing import LabelEncoder
embtown_feat = titanic[['embark_town']]
label_encoder = LabelEncoder()
embtown_encoded = label_encoder.fit_transform(embtown_feat)
C:\Users\vakanski\anaconda3\Lib\site-packages\sklearn\preprocessing\_label.py:114: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
y = column_or_1d(y, warn=True)
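The warning above appears because LabelEncoder expects a one-dimensional array, while titanic[['embark_town']] is a single-column DataFrame. Passing the column as a Series avoids the warning:
# passing a 1-d Series (single brackets) instead of a DataFrame avoids the warning
embtown_encoded = label_encoder.fit_transform(titanic['embark_town'])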
The output of the LabelEncoder is also a NumPy array. Let’s convert it to a pandas DataFrame, and add it as a new column embark_town_ord.
[81]:
titanic['embark_town_ord'] = pd.DataFrame(embtown_encoded, columns=embtown_feat.columns, index=embtown_feat.index)
titanic.head()
[81]:
index | survived | pclass | sex | age | sibsp | parch | fare | who | adult_male | embark_town | alone | embark_town_ord | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.2500 | 0 | 1.0 | Southampton | 0.0 | 2 |
1 | 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | 1 | 0.0 | Cherbourg | 0.0 | 0 |
2 | 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.9250 | 1 | 0.0 | Southampton | 1.0 | 2 |
3 | 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1000 | 1 | 0.0 | Southampton | 0.0 | 2 |
4 | 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.0500 | 0 | 1.0 | Southampton | 1.0 | 2 |
[82]:
label_encoder.classes_
[82]:
array(['Cherbourg', 'Queenstown', 'Southampton'], dtype=object)
[83]:
titanic['embark_town_ord'].value_counts()
[83]:
2 554
0 130
1 28
Name: embark_town_ord, dtype: int64
11.3.4 Pandas Dummies
Pandas provides a function get_dummies that can also be used to handle categorical features. This function creates new columns based on the number of available categories in a target column. For example, let’s apply it to the feature sex.
[84]:
titanic.head()
[84]:
index | survived | pclass | sex | age | sibsp | parch | fare | who | adult_male | embark_town | alone | embark_town_ord | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.2500 | 0 | 1.0 | Southampton | 0.0 | 2 |
1 | 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | 1 | 0.0 | Cherbourg | 0.0 | 0 |
2 | 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.9250 | 1 | 0.0 | Southampton | 1.0 | 2 |
3 | 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1000 | 1 | 0.0 | Southampton | 0.0 | 2 |
4 | 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.0500 | 0 | 1.0 | Southampton | 1.0 | 2 |
[85]:
dummies = pd.get_dummies(titanic['sex'])
[86]:
titanic = pd.concat([titanic.drop('sex',axis=1),dummies], axis=1)
Note that new columns female and male with 0 or 1 values were added to the right of the DataFrame.
[87]:
titanic.head()
[87]:
index | survived | pclass | age | sibsp | parch | fare | who | adult_male | embark_town | alone | embark_town_ord | female | male | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 3 | 22.0 | 1 | 0 | 7.2500 | 0 | 1.0 | Southampton | 0.0 | 2 | 0 | 1 |
1 | 1 | 1 | 1 | 38.0 | 1 | 0 | 71.2833 | 1 | 0.0 | Cherbourg | 0.0 | 0 | 1 | 0 |
2 | 2 | 1 | 3 | 26.0 | 0 | 0 | 7.9250 | 1 | 0.0 | Southampton | 1.0 | 2 | 1 | 0 |
3 | 3 | 1 | 1 | 35.0 | 1 | 0 | 53.1000 | 1 | 0.0 | Southampton | 0.0 | 2 | 1 | 0 |
4 | 4 | 0 | 3 | 35.0 | 0 | 0 | 8.0500 | 0 | 1.0 | Southampton | 1.0 | 2 | 0 | 1 |
This type of encoding is also called one-hot encoding: each category (unique value) in the column sex became a column, and for each row (sample), a 1 indicates the category to which it belongs.
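Note that the female and male columns are redundant, since each one fully determines the other. The get_dummies function accepts a drop_first argument that keeps only one dummy column per feature. A minimal sketch on a small hypothetical Series:
# drop the first category, so only one dummy column remains
pd.get_dummies(pd.Series(['male', 'female', 'female']), drop_first=True)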
11.3.5 One-Hot Encoding
Scikit-learn provides a function OneHotEncoder that converts a feature into a one-hot matrix. As with the pandas dummies, additional columns corresponding to the values of the given categories are created.
Let’s apply it to embark_town.
[88]:
titanic['embark_town'].value_counts()
[88]:
Southampton 554
Cherbourg 130
Queenstown 28
Name: embark_town, dtype: int64
[89]:
titanic.head()
[89]:
index | survived | pclass | age | sibsp | parch | fare | who | adult_male | embark_town | alone | embark_town_ord | female | male | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 3 | 22.0 | 1 | 0 | 7.2500 | 0 | 1.0 | Southampton | 0.0 | 2 | 0 | 1 |
1 | 1 | 1 | 1 | 38.0 | 1 | 0 | 71.2833 | 1 | 0.0 | Cherbourg | 0.0 | 0 | 1 | 0 |
2 | 2 | 1 | 3 | 26.0 | 0 | 0 | 7.9250 | 1 | 0.0 | Southampton | 1.0 | 2 | 1 | 0 |
3 | 3 | 1 | 1 | 35.0 | 1 | 0 | 53.1000 | 1 | 0.0 | Southampton | 0.0 | 2 | 1 | 0 |
4 | 4 | 0 | 3 | 35.0 | 0 | 0 | 8.0500 | 0 | 1.0 | Southampton | 1.0 | 2 | 0 | 1 |
[90]:
from sklearn.preprocessing import OneHotEncoder
one_hot = OneHotEncoder()
town_encoded = one_hot.fit_transform(titanic[['embark_town']])
[91]:
one_hot.categories_
[91]:
[array(['Cherbourg', 'Queenstown', 'Southampton'], dtype=object)]
[92]:
town_encoded
[92]:
<712x3 sparse matrix of type '<class 'numpy.float64'>'
with 712 stored elements in Compressed Sparse Row format>
The output of OneHotEncoder is a sparse matrix. We will need to convert it into a NumPy array first, and afterward we can convert it into a pandas DataFrame.
[93]:
town_encoded = town_encoded.toarray()
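Alternatively, recent versions of scikit-learn can return a dense array directly, which makes the conversion step unnecessary. A sketch, assuming scikit-learn 1.2 or later (where the argument is named sparse_output):
from sklearn.preprocessing import OneHotEncoder

# assumption: scikit-learn >= 1.2 supports the sparse_output argument
one_hot_dense = OneHotEncoder(sparse_output=False)
town_dense = one_hot_dense.fit_transform(titanic[['embark_town']])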
[94]:
columns = one_hot.categories_[0]  # the category names learned by the encoder
town_df = pd.DataFrame(town_encoded, columns=columns)
town_df.head()
[94]:
Cherbourg | Queenstown | Southampton | |
---|---|---|---|
0 | 0.0 | 0.0 | 1.0 |
1 | 1.0 | 0.0 | 0.0 |
2 | 0.0 | 0.0 | 1.0 |
3 | 0.0 | 0.0 | 1.0 |
4 | 0.0 | 0.0 | 1.0 |
[95]:
titanic.drop('embark_town',axis=1, inplace=True)
[96]:
titanic[['Cherbourg', 'Queenstown', 'Southampton']] = town_df
[97]:
titanic.head()
[97]:
index | survived | pclass | age | sibsp | parch | fare | who | adult_male | alone | embark_town_ord | female | male | Cherbourg | Queenstown | Southampton | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 3 | 22.0 | 1 | 0 | 7.2500 | 0 | 1.0 | 0.0 | 2 | 0 | 1 | 0.0 | 0.0 | 1.0 |
1 | 1 | 1 | 1 | 38.0 | 1 | 0 | 71.2833 | 1 | 0.0 | 0.0 | 0 | 1 | 0 | 1.0 | 0.0 | 0.0 |
2 | 2 | 1 | 3 | 26.0 | 0 | 0 | 7.9250 | 1 | 0.0 | 1.0 | 2 | 1 | 0 | 0.0 | 0.0 | 1.0 |
3 | 3 | 1 | 1 | 35.0 | 1 | 0 | 53.1000 | 1 | 0.0 | 0.0 | 2 | 1 | 0 | 0.0 | 0.0 | 1.0 |
4 | 4 | 0 | 3 | 35.0 | 0 | 0 | 8.0500 | 0 | 1.0 | 1.0 | 2 | 0 | 1 | 0.0 | 0.0 | 1.0 |
Choosing an encoding strategy
Choosing an encoding strategy depends on the models used and the type of categories (i.e., ordinal vs. nominal). In general, one-hot encoding is the preferred strategy when the downstream models are linear models, while ordinal encoding is often a good strategy with tree-based models.
With ordinal encoding, there is an order in the resulting categories, e.g., 0 < 1 < 2 (called ordinal categories). The impact of violating this ordering assumption depends on the downstream model: linear models will be affected by misordered categories, while tree-based models will not.
One-hot encoding is applied when the ordering of the categories is not important. Such categories are also called nominal categories. This encoding can cause computational inefficiency in tree-based models when there is a high number of categories, and because of this, it is not recommended for such models.
11.4 Combining Numerical and Categorical Features
Now let’s prepare the numerical and categorical data in the titanic dataset and train a classification model.
First, assign the survived column to be the target label y.
[98]:
y = titanic['survived']
We will use the other columns as the data features X, therefore let’s drop the survived column and the index column.
It is very important to always remove the index column from the data used for training a model. If we leave the index column in the data, the model can learn to associate the target labels with the index of each data point (row).
[99]:
X = titanic.drop(['survived', 'index'], axis=1)
[100]:
X
[100]:
pclass | age | sibsp | parch | fare | who | adult_male | alone | embark_town_ord | female | male | Cherbourg | Queenstown | Southampton | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 3 | 22.0 | 1 | 0 | 7.2500 | 0 | 1.0 | 0.0 | 2 | 0 | 1 | 0.0 | 0.0 | 1.0 |
1 | 1 | 38.0 | 1 | 0 | 71.2833 | 1 | 0.0 | 0.0 | 0 | 1 | 0 | 1.0 | 0.0 | 0.0 |
2 | 3 | 26.0 | 0 | 0 | 7.9250 | 1 | 0.0 | 1.0 | 2 | 1 | 0 | 0.0 | 0.0 | 1.0 |
3 | 1 | 35.0 | 1 | 0 | 53.1000 | 1 | 0.0 | 0.0 | 2 | 1 | 0 | 0.0 | 0.0 | 1.0 |
4 | 3 | 35.0 | 0 | 0 | 8.0500 | 0 | 1.0 | 1.0 | 2 | 0 | 1 | 0.0 | 0.0 | 1.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
707 | 3 | 39.0 | 0 | 5 | 29.1250 | 1 | 0.0 | 0.0 | 1 | 1 | 0 | 0.0 | 1.0 | 0.0 |
708 | 2 | 27.0 | 0 | 0 | 13.0000 | 0 | 1.0 | 1.0 | 2 | 0 | 1 | 0.0 | 0.0 | 1.0 |
709 | 1 | 19.0 | 0 | 0 | 30.0000 | 1 | 0.0 | 1.0 | 2 | 1 | 0 | 0.0 | 0.0 | 1.0 |
710 | 1 | 26.0 | 0 | 0 | 30.0000 | 0 | 1.0 | 1.0 | 0 | 0 | 1 | 1.0 | 0.0 | 0.0 |
711 | 3 | 32.0 | 0 | 0 | 7.7500 | 0 | 1.0 | 1.0 | 1 | 0 | 1 | 0.0 | 1.0 | 0.0 |
712 rows × 14 columns
We will first split the dataset into a training and test dataset. This step will be explained in more detail in the next lectures. The objective of this lecture is to learn how to preprocess the data and prepare it for model fitting.
[101]:
from sklearn.model_selection import train_test_split
# split into train & test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=123, stratify=y)
[102]:
print('Training data inputs', X_train.shape)
print('Training labels', y_train.shape)
print('Testing data inputs', X_test.shape)
print('Testing labels', y_test.shape)
Training data inputs (534, 14)
Training labels (534,)
Testing data inputs (178, 14)
Testing labels (178,)
We will apply standard scaling to the training and test datasets. Note that the scaler is fit only on the training data, and the learned statistics are then used to transform both the training and the test data, so that no information from the test set leaks into the preprocessing.
[103]:
# Apply StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
Now, let’s apply the k-Nearest Neighbors classifier from the scikit-learn library. The function fit() is used to fit the model to the data.
[104]:
from sklearn import neighbors
# create the model
knn_model = neighbors.KNeighborsClassifier(n_neighbors=5)
# fit the model
knn_model.fit(X_train_scaled, y_train)
[104]:
KNeighborsClassifier()
The function score() is used to compute the accuracy of the model. The model can predict whether a passenger survived with an accuracy of 82.58%.
[105]:
# score on test set
accuracy = knn_model.score(X_test_scaled, y_test)
print('The test accuracy of k-Nearest Neighbors is {0:5.2f} %'.format(accuracy*100))
The test accuracy of k-Nearest Neighbors is 82.58 %
Next, let’s use the trained model to predict the survived labels for the first 10 passengers in the X_test dataset.
[106]:
# Make predictions on the test data
y_pred = knn_model.predict(X_test_scaled[:10])
[107]:
# Show the predictions
y_pred
[107]:
array([0, 0, 0, 1, 1, 1, 1, 0, 1, 0], dtype=int64)
[108]:
# Show the actual values from the 'survived' column
np.array(y_test[:10])
[108]:
array([0, 1, 0, 0, 0, 1, 1, 0, 1, 0], dtype=int64)
As we can see, the model correctly predicted the target label for 7 of the first 10 samples.
We will have a separate lecture on scikit-learn, in which we will explain in more detail how to perform classification with machine learning models.
References
Complete Machine Learning Package, Jean de Dieu Nyandwi, available at: https://github.com/Nyandwi/machine_learning_complete.
Advanced Python for Data Science, University of Cincinnati, available at: https://github.com/uc-python/advanced-python-datasci.
Python Machine Learning (2nd Ed.) Code Repository, Sebastian Raschka, available at: https://github.com/rasbt/python-machine-learning-book-2nd-edition.