2. Supervised classification

In this ML tutorial, we will explore what is probably the most famous data set in data analysis - the Iris data set (also known as the “Hello, world” of machine learning).

The Iris data set is a table in which each row holds four measurements (the length and the width of the sepals and petals), the so-called “features” or “attributes”, together with the name of the Iris species, the so-called “class”. It consists of 150 instances.

In this tutorial, we want to train a model that predicts the class from the features (i.e. the width and length of the sepals and petals). In other words, the “target variable”, or desired output, is the species of the Iris. The trained model should also reach an acceptable accuracy on new data that it has not seen during training.

This data set is well suited for starting your ML career, because it has a well-balanced class distribution and no missing values. This means you hardly need to invest any time in data preparation. Good data preparation is usually one of the most important steps in data analysis, and its many possibilities and complexities can be overwhelming for a beginner.

So, let’s get started. :)

First, we need to import some packages.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score # for cross-validation score
from sklearn.model_selection import StratifiedKFold # k-fold for cross-validation score
from sklearn import svm # support vector machine algorithm
from sklearn.neighbors import KNeighborsClassifier # K nearest neighbors algorithm
from sklearn.linear_model import LogisticRegression # logistic regression algorithm
from sklearn.tree import DecisionTreeClassifier # decision tree algorithm
from sklearn import metrics # for evaluating the model

2.1. Iris data set

Now we need our data set, which is available online in the UC Irvine Machine Learning Repository. We define the path and pass it to the read_csv function from the pandas package. We also specify the column names in the read_csv call, because the file itself has no header row. This organizes the output a bit better and lets us access the individual columns via these names.

path_to_data = 'http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
columns = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'class']
iris = pd.read_csv(path_to_data, names = columns)
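If the UCI server cannot be reached, a similar table can be built from the copy of the data set that ships with scikit-learn. The following is only a sketch under that assumption: the variable name iris_local is made up for this example, and a couple of measurements in the bundled copy are known to differ slightly from the UCI file.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Assumption: scikit-learn's bundled copy is an acceptable stand-in for the UCI file.
bunch = load_iris()

# Same column names as in the read_csv call above.
columns = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
iris_local = pd.DataFrame(bunch.data, columns=columns)

# Map the integer targets (0, 1, 2) to the UCI-style species names.
iris_local['class'] = ['Iris-' + str(bunch.target_names[t]) for t in bunch.target]

print(iris_local.shape)
```

The resulting table has the same column layout as the one read from the repository, so the remaining steps should work on it in the same way.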

Let’s have a first look into the data.

print(iris)
     sepal_length  sepal_width  petal_length  petal_width           class
0             5.1          3.5           1.4          0.2     Iris-setosa
1             4.9          3.0           1.4          0.2     Iris-setosa
2             4.7          3.2           1.3          0.2     Iris-setosa
3             4.6          3.1           1.5          0.2     Iris-setosa
4             5.0          3.6           1.4          0.2     Iris-setosa
..            ...          ...           ...          ...             ...
145           6.7          3.0           5.2          2.3  Iris-virginica
146           6.3          2.5           5.0          1.9  Iris-virginica
147           6.5          3.0           5.2          2.0  Iris-virginica
148           6.2          3.4           5.4          2.3  Iris-virginica
149           5.9          3.0           5.1          1.8  Iris-virginica

[150 rows x 5 columns]

So, let’s check the minimum and maximum values of the sepal lengths.

iris['sepal_length'].max()
7.9
iris['sepal_length'].min()
4.3

You can use the methods describe and info to get more information about your data. This is especially useful for getting e.g. the number of instances, some statistical values and information about null values in the data.

iris.describe()
sepal_length sepal_width petal_length petal_width
count 150.000000 150.000000 150.000000 150.000000
mean 5.843333 3.054000 3.758667 1.198667
std 0.828066 0.433594 1.764420 0.763161
min 4.300000 2.000000 1.000000 0.100000
25% 5.100000 2.800000 1.600000 0.300000
50% 5.800000 3.000000 4.350000 1.300000
75% 6.400000 3.300000 5.100000 1.800000
max 7.900000 4.400000 6.900000 2.500000
iris.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   sepal_length  150 non-null    float64
 1   sepal_width   150 non-null    float64
 2   petal_length  150 non-null    float64
 3   petal_width   150 non-null    float64
 4   class         150 non-null    object 
dtypes: float64(4), object(1)
memory usage: 6.0+ KB
iris['class'].value_counts()
Iris-setosa        50
Iris-versicolor    50
Iris-virginica     50
Name: class, dtype: int64
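The check for null values can also be made explicit. The sketch below uses a small made-up table (toy, with one deliberately missing value) to show what a missing entry would look like in such a count.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table with one deliberately missing value.
toy = pd.DataFrame({
    'petal_length': [1.4, np.nan, 5.1],
    'petal_width': [0.2, 1.3, 1.8],
})

# isna() marks missing entries; sum() counts them per column.
missing_per_column = toy.isna().sum()
print(missing_per_column)
```

Applied to the Iris table, iris.isna().sum() should report zero for every column, in line with the "150 non-null" entries printed by info.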

2.2. Splitting into training and test set

Ok, now we know a bit more about the data set. We see that there are no missing data, no NaNs or other corrupted data points. Next, let’s split our data set into training and test sets.

train, test = train_test_split(iris, test_size = 0.3)

The keyword test_size gives the fraction of the data that should be withheld for the test set (here 0.3, i.e. 30 %).
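Because train_test_split shuffles the rows randomly, every run produces a different split, and the numbers shown in this tutorial will not be reproduced exactly. A minimal sketch of how to fix the split, assuming a made-up stand-in table (toy) and an arbitrarily chosen seed of 42:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the iris table: 150 rows, one feature, three classes.
toy = pd.DataFrame({
    'feature': np.arange(150, dtype=float),
    'class': np.repeat(['a', 'b', 'c'], 50),
})

# random_state (42 is an arbitrary choice) fixes the shuffling.
train_a, test_a = train_test_split(toy, test_size=0.3, random_state=42)
train_b, test_b = train_test_split(toy, test_size=0.3, random_state=42)

print(len(train_a), len(test_a))
print(test_a.index.equals(test_b.index))
```

Both calls select the same rows, so results based on such a split can be repeated.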

train.describe()
sepal_length sepal_width petal_length petal_width
count 105.000000 105.000000 105.00000 105.000000
mean 5.857143 3.062857 3.77619 1.200000
std 0.868892 0.447054 1.81153 0.773603
min 4.300000 2.000000 1.10000 0.100000
25% 5.100000 2.800000 1.50000 0.300000
50% 5.700000 3.000000 4.40000 1.300000
75% 6.400000 3.400000 5.10000 1.800000
max 7.900000 4.400000 6.90000 2.500000
test.describe()
sepal_length sepal_width petal_length petal_width
count 45.000000 45.000000 45.000000 45.000000
mean 5.811111 3.033333 3.717778 1.195556
std 0.732127 0.404520 1.668326 0.746798
min 4.400000 2.200000 1.000000 0.100000
25% 5.200000 2.800000 1.700000 0.400000
50% 5.800000 3.000000 4.100000 1.300000
75% 6.400000 3.300000 5.100000 1.600000
max 7.400000 4.000000 6.100000 2.500000
train['class'].value_counts()
Iris-virginica     39
Iris-setosa        36
Iris-versicolor    30
Name: class, dtype: int64

We see that the distribution of the classes in the training set does not resemble the distribution in the original data set, where all of the different Iris species are equally distributed. We can use the keyword stratify in train_test_split to ensure the same distribution.

train, test = train_test_split(iris, test_size = 0.3, stratify = iris['class'])
train['class'].value_counts()
Iris-setosa        35
Iris-versicolor    35
Iris-virginica     35
Name: class, dtype: int64
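The effect of stratify is easier to see on imbalanced labels. The following sketch uses made-up labels (90 samples of class 'a', 10 of class 'b') and is not part of the Iris workflow itself.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 90 samples of class 'a', 10 of class 'b'.
labels = np.array(['a'] * 90 + ['b'] * 10)
features = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    features, labels, test_size=0.3, stratify=labels, random_state=0
)

# With stratification both parts keep the 9:1 ratio of the full data.
train_counts = {str(k): int(v) for k, v in zip(*np.unique(y_tr, return_counts=True))}
test_counts = {str(k): int(v) for k, v in zip(*np.unique(y_te, return_counts=True))}
print(train_counts, test_counts)
```

Without stratify, the rare class could by chance end up over- or under-represented in the test set.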

Save each class in an individual variable - it’s more convenient to access the different classes this way.

setosa = train[train['class']=='Iris-setosa']
virginica = train[train['class']=='Iris-virginica']
versicolor = train[train['class']=='Iris-versicolor']

We can apply the same methods (describe and info) here as before.

setosa.describe()
sepal_length sepal_width petal_length petal_width
count 35.000000 35.000000 35.000000 35.000000
mean 4.960000 3.422857 1.468571 0.240000
std 0.328275 0.341352 0.185934 0.103469
min 4.300000 2.900000 1.000000 0.100000
25% 4.750000 3.200000 1.400000 0.200000
50% 5.000000 3.400000 1.500000 0.200000
75% 5.100000 3.650000 1.500000 0.300000
max 5.700000 4.400000 1.900000 0.500000
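As an alternative to one variable per species, pandas can also summarize all classes at once with groupby. A small sketch on a made-up miniature table (toy_train), not on the real training set:

```python
import pandas as pd

# Hypothetical miniature training set with two species.
toy_train = pd.DataFrame({
    'petal_length': [1.4, 1.5, 4.7, 4.5],
    'petal_width': [0.2, 0.2, 1.4, 1.5],
    'class': ['Iris-setosa', 'Iris-setosa', 'Iris-versicolor', 'Iris-versicolor'],
})

# One row of mean values per species, without a separate variable per class.
class_means = toy_train.groupby('class').mean()
print(class_means)
```

The same call on the real training set, train.groupby('class').mean(), would give one row per Iris species.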

2.3. Plotting the data

Now let’s make some plots to get a better feeling for the Iris data.

First, we will make two scatter plots - the sepal length versus the sepal width, and the petal length versus the petal width. We will also color the species differently.

fig = plt.figure(figsize=(6,12))
ax1 = fig.add_subplot(211)
ax2 = fig.add_subplot(212)

ax1.scatter(setosa['sepal_length'], setosa['sepal_width'], c='b', label='setosa')
ax1.scatter(versicolor['sepal_length'], versicolor['sepal_width'], c='g', label='versicolor')
ax1.scatter(virginica['sepal_length'], virginica['sepal_width'], c='orange', label='virginica')
ax1.set_xlabel('sepal length [cm]')
ax1.set_ylabel('sepal width [cm]')
ax1.legend(loc='lower right')

ax2.scatter(setosa['petal_length'], setosa['petal_width'], c='b', label='setosa')
ax2.scatter(versicolor['petal_length'], versicolor['petal_width'], c='g', label='versicolor')
ax2.scatter(virginica['petal_length'], virginica['petal_width'], c='orange', label='virginica')
ax2.set_xlabel('petal length [cm]')
ax2.set_ylabel('petal width [cm]')
ax2.legend(loc='lower right');

ax1.grid(True, linewidth=0.1, color='#000000', linestyle='-')
ax2.grid(True, linewidth=0.1, color='#000000', linestyle='-')

plt.show()
[Figure: scatter plots of sepal length vs. sepal width (top) and petal length vs. petal width (bottom), colored by species]

In the petal plot, we can clearly see three clusters. It seems that the petal features are better suited to distinguish the species than the sepal features.

Let’s have a look at the distribution of the four features with histograms.

n_bins = 10
fig2 = plt.figure(figsize=(12,12))
ax3 = fig2.add_subplot(421)
ax4 = fig2.add_subplot(422)
ax5 = fig2.add_subplot(423)
ax6 = fig2.add_subplot(424)

ax3.hist(train['sepal_length'], bins=n_bins, edgecolor='black', linewidth=1.1);
ax3.set_xlabel('sepal length [cm]')

ax4.hist(train['petal_length'], bins=n_bins, edgecolor='black', linewidth=1.1);
ax4.set_xlabel('petal length [cm]')

ax5.hist(train['sepal_width'], bins=n_bins, edgecolor='black', linewidth=1.1);
ax5.set_xlabel('sepal width [cm]')

ax6.hist(train['petal_width'], bins=n_bins, edgecolor='black', linewidth=1.1);
ax6.set_xlabel('petal width [cm]')

fig2.tight_layout(pad=1.0);

plt.show()
[Figure: histograms of sepal length, petal length, sepal width and petal width in the training set]

Again, we see that there is a group of smaller values for the petal lengths and widths. As a last plot, let’s draw a so-called violin plot.

plt.figure(figsize=(15,10))
plt.subplot(2,2,1)
sns.violinplot(y='class', x='sepal_length', data=train, inner='quartile')
plt.subplot(2,2,2)
sns.violinplot(y='class', x='sepal_width', data=train, inner='quartile')
plt.subplot(2,2,3)
sns.violinplot(y='class', x='petal_length', data=train, inner='quartile')
plt.subplot(2,2,4)
sns.violinplot(y='class', x='petal_width', data=train, inner='quartile')
plt.tight_layout(pad=1.0);
[Figure: violin plots of the four features per species]

And again, we recognize that Iris setosa has the lowest values for petal lengths and widths.

2.4. Correlation and feature selection

Correlation between the features plays an important role. If many features are strongly correlated, it is often not advisable to use all of them for training, since redundant features add dimensions without adding information and can reduce the accuracy of the model. Feature selection before training an algorithm is therefore an important step.

Let’s have a look at the correlation matrix for our training data set. To get the correlation, we use the method corr on our data set. To plot the correlation matrix, we use heatmap from the seaborn package.

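train_corr = train.drop(columns='class').corr() # drop the non-numeric class column; recent pandas versions raise an error otherwise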
print(train_corr['sepal_length'].sort_values(ascending=False))
sepal_length    1.000000
petal_length    0.874768
petal_width     0.808879
sepal_width    -0.110165
Name: sepal_length, dtype: float64
plt.figure(figsize=(9, 7))
sns.heatmap(train_corr, annot=True, cmap='coolwarm')
<AxesSubplot:>
[Figure: heatmap of the correlation matrix of the training set]

We see that the two sepal features are only weakly correlated with each other (about -0.11 in this split), whereas the petal features are strongly correlated. Highly correlated features carry redundant information, which increases the dimensionality without adding much value. This can reduce the accuracy of a trained model.
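For reference, the value that corr reports by default is the Pearson correlation coefficient. The sketch below computes it by hand and compares it with NumPy's result; the helper name pearson and the measurements are made up for illustration and are not taken from the Iris table.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation: covariance divided by the product of the standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# Hypothetical measurements, not taken from the Iris table.
length = [1.4, 1.5, 4.7, 4.5, 6.0]
width = [0.2, 0.2, 1.4, 1.5, 2.5]

r_manual = pearson(length, width)
r_numpy = np.corrcoef(length, width)[0, 1]
print(r_manual, r_numpy)
```

A value close to 1 or -1 means that one feature can almost be predicted from the other with a straight line, which is what makes it redundant.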

As a first step, let’s train the model with all of the features. Then we make a selection of the features and investigate the effects on the outcome.

2.5. Algorithms

Python’s sklearn provides many algorithms for different purposes. We show here only a few of the available algorithms suitable for classification problems.

First, let’s specify the feature vector, usually denoted with X (= independent variables), and the label vector, usually denoted with y (= dependent variables).

X_train = train[['sepal_length','sepal_width','petal_length','petal_width']] 
y_train = train['class']

print(X_train.shape,y_train.shape)

X_test = test[['sepal_length','sepal_width','petal_length','petal_width']]
y_test = test['class']

print(X_test.shape,y_test.shape)
(105, 4) (105,)
(45, 4) (45,)

Next, let’s train our models and make some predictions.

To train the model, we will use the fit method. This will be done on the training data, X_train, and the training output, y_train. To make a prediction with the trained model, we will then use the predict method, which will be done on the test data, X_test.

2.5.1. Support Vector Machine - SVM

model_svm = svm.SVC(gamma='auto') # specify the model/algorithm
model_svm.fit(X_train, y_train) # train the model with the training data
y_pred_svm = model_svm.predict(X_test) # make the prediction with the test data

print('%s: %f' %('accuracy (SVM)', metrics.accuracy_score(y_test, y_pred_svm)))

kfold = StratifiedKFold(n_splits=5, shuffle=False)
cv_score_svm = cross_val_score(model_svm, X_train, y_train, scoring='accuracy', cv=kfold)

print('%s: %f (%f)' %('accuracy cross-validation (standard deviation)', cv_score_svm.mean(), cv_score_svm.std()))
accuracy (SVM): 0.955556
accuracy cross-validation (standard deviation): 0.980952 (0.023328)

The accuracy of the prediction on the test set can be calculated with accuracy_score and serves as a final evaluation of how good your model is.

If we are interested in the cross-validation score, we can use StratifiedKFold to split the training set into n_splits folds and then calculate the accuracy with cross_val_score. We get one accuracy value per fold and can then compute some statistics (e.g. mean or standard deviation) on these results.
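To see what StratifiedKFold actually produces, the folds can be inspected directly. This is a sketch on made-up labels (y_toy) that mimic the stratified training set with 35 samples per class; the feature values are placeholders.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical labels shaped like the stratified training set: 35 samples per class.
y_toy = np.repeat(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'], 35)
X_toy = np.zeros((len(y_toy), 4))  # feature values do not affect how the folds are built

kfold = StratifiedKFold(n_splits=5, shuffle=False)

fold_sizes = []
for train_idx, val_idx in kfold.split(X_toy, y_toy):
    # Each iteration holds out one fold for validation and trains on the other four.
    classes, counts = np.unique(y_toy[val_idx], return_counts=True)
    fold_sizes.append((len(train_idx), len(val_idx), counts.tolist()))

print(fold_sizes)
```

Each of the five iterations trains on 84 samples and validates on 21, with 7 samples of every species in the validation fold. cross_val_score fits and scores the model once per iteration.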

Just a few words on accuracy: A high accuracy value does not always mean that a model performs well. This is especially true for imbalanced data sets, where a model that always predicts the majority class can reach a high accuracy without ever recognizing the minority class. It can also matter for balanced data sets, because a single accuracy value does not show which classes get confused with each other.

All in all, it is never wrong to use different metrics for comparing models and for evaluating their accuracies and performances.
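The point about imbalanced data can be illustrated with a made-up example: a "model" that always predicts the majority class. The labels below are hypothetical and have nothing to do with the Iris data.

```python
from sklearn import metrics

# Hypothetical imbalanced test set: 95 negatives, 5 positives.
y_true = ['neg'] * 95 + ['pos'] * 5

# A "model" that ignores its input and always predicts the majority class.
y_always_neg = ['neg'] * 100

acc = metrics.accuracy_score(y_true, y_always_neg)
rec_pos = metrics.recall_score(y_true, y_always_neg, pos_label='pos')
print(acc, rec_pos)
```

The accuracy is 0.95, although not a single positive sample is found (recall 0.0 for the positive class). Metrics such as recall, precision or the confusion matrix make this kind of failure visible.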

cls = ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica']
conf_mat = metrics.confusion_matrix(y_test, y_pred_svm, labels=cls) # true labels first: rows are true classes, columns are predicted classes
print(conf_mat)
[[15  0  0]
 [ 0 14  1]
 [ 0  1 14]]
df_cm = pd.DataFrame(conf_mat, index=cls, columns=cls)
plt.figure(figsize = (8,6))
sns.heatmap(df_cm, annot=True)
<AxesSubplot:>
[Figure: heatmap of the SVM confusion matrix]
print(metrics.classification_report(y_test, y_pred_svm)) # arguments: true labels first, then predictions
                 precision    recall  f1-score   support

    Iris-setosa       1.00      1.00      1.00        15
Iris-versicolor       0.93      0.93      0.93        15
 Iris-virginica       0.93      0.93      0.93        15

       accuracy                           0.96        45
      macro avg       0.96      0.96      0.96        45
   weighted avg       0.96      0.96      0.96        45
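The numbers in this report follow directly from the confusion matrix. As a sketch, the per-class precision, recall and F1 score can be recomputed with NumPy from the SVM matrix printed above (in scikit-learn's convention, rows are true classes and columns are predicted classes):

```python
import numpy as np

# SVM confusion matrix from above (rows: true classes, columns: predicted classes).
conf_mat = np.array([[15, 0, 0],
                     [0, 14, 1],
                     [0, 1, 14]])

correct = np.diag(conf_mat)
precision = correct / conf_mat.sum(axis=0)  # correct / all samples predicted as that class
recall = correct / conf_mat.sum(axis=1)     # correct / all samples truly of that class
f1 = 2 * precision * recall / (precision + recall)
accuracy = correct.sum() / conf_mat.sum()

print(np.round(precision, 2), np.round(recall, 2), np.round(f1, 2), round(float(accuracy), 2))
```

The rounded values match the report: 1.00 for Iris-setosa, 0.93 for the other two species, and an overall accuracy of 0.96.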

2.5.2. Logistic Regression

model_lr = LogisticRegression(solver='lbfgs', max_iter=1000) # multi_class is left at its default; the keyword is deprecated in recent scikit-learn versions
model_lr.fit(X_train, y_train)
y_pred_lr = model_lr.predict(X_test)

print('%s: %f' %('accuracy (Logistic Regression)', metrics.accuracy_score(y_test, y_pred_lr)))

kfold = StratifiedKFold(n_splits=5, shuffle=False)
cv_score_lr = cross_val_score(model_lr, X_train, y_train, scoring='accuracy', cv=kfold)
print('%s: %f (%f)' %('accuracy cross-validation (standard deviation)', cv_score_lr.mean(), cv_score_lr.std()))
accuracy (Logistic Regression): 0.977778
accuracy cross-validation (standard deviation): 0.971429 (0.038095)
print(metrics.classification_report(y_test, y_pred_lr)) # arguments: true labels first, then predictions
                 precision    recall  f1-score   support

    Iris-setosa       1.00      1.00      1.00        15
Iris-versicolor       1.00      0.93      0.97        15
 Iris-virginica       0.94      1.00      0.97        15

       accuracy                           0.98        45
      macro avg       0.98      0.98      0.98        45
   weighted avg       0.98      0.98      0.98        45

2.5.3. Decision Tree

model_dt = DecisionTreeClassifier()
model_dt.fit(X_train, y_train)
y_pred_dt = model_dt.predict(X_test)

print('%s: %f' %('accuracy (Decision Tree)', metrics.accuracy_score(y_test, y_pred_dt)))

kfold = StratifiedKFold(n_splits=5, shuffle=False)
cv_score_dt = cross_val_score(model_dt, X_train, y_train, cv=kfold)
print('%s: %f (%f)' %('accuracy cross-validation (standard deviation)', cv_score_dt.mean(), cv_score_dt.std()))
accuracy (Decision Tree): 0.977778
accuracy cross-validation (standard deviation): 0.961905 (0.035635)
print(metrics.classification_report(y_test, y_pred_dt)) # arguments: true labels first, then predictions
                 precision    recall  f1-score   support

    Iris-setosa       1.00      1.00      1.00        15
Iris-versicolor       1.00      0.93      0.97        15
 Iris-virginica       0.94      1.00      0.97        15

       accuracy                           0.98        45
      macro avg       0.98      0.98      0.98        45
   weighted avg       0.98      0.98      0.98        45

2.5.4. K-nearest Neighbors

model_knn = KNeighborsClassifier() # default: n_neighbors = 5
model_knn.fit(X_train, y_train)
y_pred_knn = model_knn.predict(X_test)

print('%s: %f' %('accuracy (K-nearest Neighbors)', metrics.accuracy_score(y_test, y_pred_knn)))

kfold = StratifiedKFold(n_splits=5, shuffle=False)
cv_score_knn = cross_val_score(model_knn, X_train, y_train, cv=kfold)
print('%s: %f (%f)' %('accuracy cross-validation (standard deviation)', cv_score_knn.mean(), cv_score_knn.std()))
accuracy (K-nearest Neighbors): 0.977778
accuracy cross-validation (standard deviation): 0.961905 (0.035635)
print(metrics.classification_report(y_test, y_pred_knn)) # arguments: true labels first, then predictions
                 precision    recall  f1-score   support

    Iris-setosa       1.00      1.00      1.00        15
Iris-versicolor       1.00      0.93      0.97        15
 Iris-virginica       0.94      1.00      0.97        15

       accuracy                           0.98        45
      macro avg       0.98      0.98      0.98        45
   weighted avg       0.98      0.98      0.98        45
model_knn = KNeighborsClassifier(n_neighbors = 4) # use 4 neighbors instead of the default 5
model_knn.fit(X_train, y_train)
y_pred_knn = model_knn.predict(X_test)

print('%s: %f' %('accuracy (K-nearest Neighbors)', metrics.accuracy_score(y_test, y_pred_knn)))

kfold = StratifiedKFold(n_splits=5, shuffle=False)
cv_score_knn = cross_val_score(model_knn, X_train, y_train, cv=kfold)
print('%s: %f (%f)' %('accuracy cross-validation (standard deviation)', cv_score_knn.mean(), cv_score_knn.std()))
accuracy (K-nearest Neighbors): 0.977778
accuracy cross-validation (standard deviation): 0.961905 (0.019048)
print(metrics.classification_report(y_test, y_pred_knn)) # arguments: true labels first, then predictions
                 precision    recall  f1-score   support

    Iris-setosa       1.00      1.00      1.00        15
Iris-versicolor       1.00      0.93      0.97        15
 Iris-virginica       0.94      1.00      0.97        15

       accuracy                           0.98        45
      macro avg       0.98      0.98      0.98        45
   weighted avg       0.98      0.98      0.98        45
cls = ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica']
conf_mat = metrics.confusion_matrix(y_test, y_pred_knn, labels=cls) # true labels first: rows are true classes, columns are predicted classes
print(conf_mat)
[[15  0  0]
 [ 0 14  1]
 [ 0  0 15]]
df_cm = pd.DataFrame(conf_mat, index=cls, columns=cls)
plt.figure(figsize = (10,7))
sns.heatmap(df_cm, annot=True)
<AxesSubplot:>
[Figure: heatmap of the K-nearest Neighbors confusion matrix]
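Instead of trying values for n_neighbors one at a time, several values can be compared in a loop using the cross-validation score. The following is only a sketch: it assumes scikit-learn's bundled copy of the Iris data as a stand-in for the training set, and the range of 1 to 10 neighbors is an arbitrary choice.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Assumption: scikit-learn's bundled Iris copy stands in for the training set used above.
X_all, y_all = load_iris(return_X_y=True)

kfold = StratifiedKFold(n_splits=5, shuffle=False)

mean_scores = {}
for k in range(1, 11):  # the range 1..10 is an arbitrary choice
    model = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(model, X_all, y_all, scoring='accuracy', cv=kfold)
    mean_scores[k] = scores.mean()

# Pick the number of neighbors with the highest mean cross-validation accuracy.
best_k = max(mean_scores, key=mean_scores.get)
print(best_k, mean_scores[best_k])
```

In the workflow of this tutorial, such a comparison would be run on X_train and y_train only, so that the test set remains untouched for the final evaluation.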