Machine Learning: Part 1 - Naive Bayes

Introduction

This is the first in a series of posts that will look at how to write simple implementations of Machine Learning classifiers using Python. It assumes a basic familiarity with Machine Learning concepts, and will only provide a very brief introduction to the underlying theory before tackling the implementation and providing some examples to demonstrate its use, using freely-available data from one or more sources.

Background

The Naive Bayes classifier is a simple classification technique based on Bayes' theorem:

$$ P(A | B) = \frac{P(B | A) P(A)}{P(B)} $$

Specifically, given a vector of features $\bold{x} = (x_1, ..., x_n)$, the probability of the class outcome $C_k$ given $\bold{x}$ is:

$$ p(C_k \space | \space \bold{x}) = \frac{p(C_k) \space p(\bold{x} \space | \space C_k)}{p(\bold{x})} $$

Using the "naive" assumption that features are independent, and after some manipulation, we get the following:

$$ p(C_k \space | \space x_1, ..., x_n) = \frac{1}{Z} p(C_k) \space \displaystyle\prod_{i=1}^n p(x_i \space | \space C_k) $$

where $Z = p(\bold{x})$, which is constant for a given input and hence can be omitted:

$$ p(C_k \space | \space x_1, ..., x_n) \propto p(C_k) \space \displaystyle\prod_{i=1}^n p(x_i \space | \space C_k) $$

The classifier selects the outcome, or class, with the highest probability, so as to minimize the probability of misclassification:

$$ \hat{y} = \argmax_{k \in \lbrace 1, ..., K \rbrace } \space p(C_k) \space \displaystyle\prod_{i=1}^n p(x_i \space | \space C_k) $$

Each $x_i$ can be treated as coming from a normal (Gaussian) distribution, so $p(x_i \space | \space C_k)$ can be calculated by plugging $x_i$ into the normal density, using the mean and variance of that feature computed over the training data belonging to class $C_k$.
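Concretely, writing $\mu_{ik}$ and $\sigma_{ik}^2$ (notation introduced here for convenience) for the mean and variance of feature $i$ over the training samples belonging to class $C_k$:

$$ p(x_i \space | \space C_k) = \frac{1}{\sqrt{2 \pi \sigma_{ik}^2}} \space \exp \left( -\frac{(x_i - \mu_{ik})^2}{2 \sigma_{ik}^2} \right) $$

This is exactly what the gaussian method in the implementation below computes.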

Implementation

With this in mind, we can construct a simple classifier as follows:

    import numpy as np


    class NaiveBayes:

        def fit(self, X, y):
            labels, counts = np.unique(y, return_counts = True)

            self.priors = np.array([value / float(y.shape[0]) for value in counts])
            self.mean   = np.array([[X[y == i, j].mean() for j in range(X.shape[1])] for i in labels])
            self.var    = np.array([[X[y == i, j].var()  for j in range(X.shape[1])] for i in labels])

        def predict(self, X):
            return np.array([np.argmax(self.posterior(X[row])) for row in range(X.shape[0])])

        def posterior(self, X):
            posteriors = []

            for label, prior in enumerate(self.priors):
                posterior = prior

                for feature in range(X.shape[0]):
                    posterior *= self.gaussian(X[feature], self.mean[label, feature], self.var[label, feature])

                posteriors.append(posterior)

            return posteriors

        def gaussian(self, X, mean, var):
            return np.exp(-np.square(X - mean) / (2 * var)) / np.sqrt(2 * np.pi * var)

Going through each method in turn, we start with fit:

    def fit(self, X, y):
        labels, counts = np.unique(y, return_counts = True)

        self.priors = np.array([value / float(y.shape[0]) for value in counts])
        self.mean   = np.array([[X[y == i, j].mean() for j in range(X.shape[1])] for i in labels])
        self.var    = np.array([[X[y == i, j].var()  for j in range(X.shape[1])] for i in labels])

In the above method, we first calculate the priors - the $p(C_k)$ part of our equation - and then the statistics - mean and variance - for the data segmented by label and feature, to be used later.
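As a quick sanity check, here is what the fitted attributes look like on a tiny, made-up data set (the values are hypothetical and chosen only to illustrate the shapes):

    nb = NaiveBayes()
    nb.fit(np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 4.0], [4.0, 5.0]]), np.array([0, 0, 1, 1]))

    print(nb.priors.shape)  # (2,)   - one prior per label
    print(nb.mean.shape)    # (2, 2) - mean per label and feature
    print(nb.var.shape)     # (2, 2) - variance per label and feature

Moving on to predict: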

    def predict(self, X):
        return np.array([np.argmax(self.posterior(X[row])) for row in range(X.shape[0])])

In the above method we take, for each row, the index of the largest posterior, which corresponds to the (encoded) outcome or class with the highest probability. The posteriors are calculated as follows:

    def posterior(self, X):
        posteriors = []

        for label, prior in enumerate(self.priors):
            posterior = prior

            for feature in range(X.shape[0]):
                posterior *= self.gaussian(X[feature], self.mean[label, feature], self.var[label, feature])

            posteriors.append(posterior)

        return posteriors

In the above method, for each label / outcome, we multiply the prior by the product of the class-conditional likelihoods $p(x_i \space | \space C_k)$, one per feature, and store the result for that label. We assume a Gaussian distribution when calculating each likelihood:

    def gaussian(self, X, mean, var):
        return np.exp(-np.square(X - mean) / (2 * var)) / np.sqrt(2 * np.pi * var)
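One practical caveat that the implementation above does not address: multiplying many small densities can underflow once the number of features grows. A common remedy is to sum log-densities instead. A minimal sketch of an alternative, log-space posterior method (a hypothetical extension, not part of the classifier above) might look like this:

    def log_posterior(self, X):
        # Hypothetical alternative to posterior: sum log-densities rather
        # than multiplying densities, to avoid numerical underflow.
        log_posteriors = []

        for label, prior in enumerate(self.priors):
            log_posterior = np.log(prior)

            for feature in range(X.shape[0]):
                mean, var = self.mean[label, feature], self.var[label, feature]
                log_posterior += -np.square(X[feature] - mean) / (2 * var) - 0.5 * np.log(2 * np.pi * var)

            log_posteriors.append(log_posterior)

        return log_posteriors

Because the logarithm is monotonic, taking the argmax of these values yields the same predictions as the original posterior method.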

Example

Let's see the above classifier in action. First, download the iris dataset (four numeric feature columns followed by a species label). Then load the file into memory using numpy:

    X = np.loadtxt('./data/csv/iris.csv', delimiter = ',', usecols = (0, 1, 2, 3))
    Y = np.loadtxt('./data/csv/iris.csv', delimiter = ',', usecols = 4, dtype = str)

You will need to make use of a label encoder to convert the string labels into integer values for convenience - for example, Iris-setosa should be mapped to 0 and so on (you can use the label encoder provided by scikit-learn):

    from sklearn import preprocessing

    le = preprocessing.LabelEncoder()
    le.fit(Y)
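As a quick sanity check (assuming the label column uses the usual Iris-setosa, Iris-versicolor and Iris-virginica names, as the report further down suggests), the encoder assigns integers to the classes in sorted order:

    print(le.classes_)                # ['Iris-setosa' 'Iris-versicolor' 'Iris-virginica']
    print(le.transform(le.classes_))  # [0 1 2]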

Split the data into train and test sets (you can use the model selection library provided by scikit-learn):

    from sklearn import model_selection

    X, X_t, Y, Y_t = model_selection.train_test_split(X, Y, train_size = 0.75)
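Note that train_test_split shuffles the data randomly, so the exact split - and the numbers reported below - will vary from run to run. If you want reproducible output, you can pin the seed via the random_state parameter, for example:

    X, X_t, Y, Y_t = model_selection.train_test_split(X, Y, train_size = 0.75, random_state = 0)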

Then fit the training data, and make predictions with the test data:

    nb = NaiveBayes()
    nb.fit(X, le.transform(Y))
    Y_hat = le.inverse_transform(nb.predict(X_t))

And finally, print the results:

    print('Result: {:.2f}% Correct'.format(100 * (Y_t == Y_hat).sum() / float(len(Y_t))))

Which should output something like the following (the exact figure depends on the random train / test split):

    Result: 100.00% Correct
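Equivalently, scikit-learn can compute the accuracy for you via its metrics module:

    from sklearn import metrics

    print('Result: {:.2f}% Correct'.format(100 * metrics.accuracy_score(Y_t, Y_hat)))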

You can also look at a classification report, and / or a confusion matrix:

    print(metrics.classification_report(Y_t, Y_hat))
    print(metrics.confusion_matrix(Y_t, Y_hat))

Which should output something like the following:

                     precision    recall  f1-score   support

        Iris-setosa       1.00      1.00      1.00        11
    Iris-versicolor       1.00      1.00      1.00        13
     Iris-virginica       1.00      1.00      1.00        14

           accuracy                           1.00        38
          macro avg       1.00      1.00      1.00        38
       weighted avg       1.00      1.00      1.00        38


    [[11  0  0]
     [ 0 13  0]
     [ 0  0 14]]

Conclusion

We have seen how easy it is to write a simple Naive Bayes classifier in Python that performs well against a standard dataset. If you would like to play with the code for this example, and other examples in this series, you are welcome to visit this repository.

This series is a work in progress and will be edited and changed over time.
