
Naive Bayes Implementation using Pandas

Fri 13 January 2017

Naive Bayes

Naive Bayes is one of the simplest classification algorithms in machine learning. As the name suggests, it is based on Bayes' theorem.

While working on my thesis using Probabilistic Programming, I often read about models and how they compared against the Naive Bayes classifier. Despite its simplicity, Naive Bayes is a solid baseline classifier that every machine learning student should know. But I never had an opportunity to fully understand this simple tool, mainly because I used it as a black box through the many implementations available, the most famous being the one from scikit-learn.

I was inspired by Sebastian Raschka's argument for implementing machine learning algorithms from scratch, and I completely agree that it improves your learning. So here I start with a simple implementation of Naive Bayes, and in a later post I will revisit it from the Probabilistic Programming perspective.

Bayes Theorem

$$ P(A|B) = \frac{P(B|A) P(A) }{P(B)}$$

The above formula can be reinterpreted in machine learning terms of features and class labels. Classification is the problem of assigning a class label based on the features provided.

$$ P(class | features ) = \frac{P(features | class) P(class)}{P(features)}$$

For example, suppose we need to classify a person's sex based on their height and weight. Here class = {male, female} and features = {height, weight}, and the formula can be rewritten as:

$$ P(sex | height, weight ) = \frac{P(height, weight | sex) P(sex)}{P(height,weight)}$$
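As a quick sanity check with made-up numbers (purely illustrative assumptions, not real data): suppose P(male) = P(female) = 0.5, P(height=180, weight=80 | male) = 0.2 and P(height=180, weight=80 | female) = 0.1. Then

$$ P(male \mid height{=}180, weight{=}80) = \frac{0.2 \times 0.5}{0.2 \times 0.5 + 0.1 \times 0.5} \approx 0.67 $$

and we would classify this person as male.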

Based on this information, let's go ahead and implement the algorithm on a discretized problem using vectorized pandas operations. To understand what vectorization of a problem means, please read From Python to Numpy.

Dataset

Let's look at an example problem. The dataset that we will be using is shown below.

Given the above dataset, if we are presented with the hypothesis

{"Age": '<=30', "Income": 'medium', "Student": 'yes', "Credit_Rating": 'fair'}

what is the probability that this person will or will not buy a computer?

In [2]:
# Load the dataset
import pandas as pd

data = pd.read_csv('./naive_bayes_dataset.csv')
print(data)
      Age  Income Student Credit_Rating Buys_Computer
0    <=30    high      no          fair            no
1    <=30    high      no     excellent            no
2   31-40    high      no          fair           yes
3     >40  medium      no          fair           yes
4     >40     low     yes          fair           yes
5     >40     low     yes     excellent            no
6   31-40     low     yes     excellent           yes
7    <=30  medium      no          fair            no
8    <=30     low     yes          fair           yes
9     >40  medium     yes          fair           yes
10   <=30  medium     yes     excellent           yes
11  31-40  medium      no     excellent           yes
12  31-40    high     yes          fair           yes
13    >40  medium      no     excellent            no
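If you don't have the CSV file, the same DataFrame can be constructed inline; this is just a convenience sketch reproducing the table above (the CSV path in the cell is assumed to be local):

import pandas as pd

data = pd.DataFrame({
    'Age':           ['<=30', '<=30', '31-40', '>40', '>40', '>40', '31-40',
                      '<=30', '<=30', '>40', '<=30', '31-40', '31-40', '>40'],
    'Income':        ['high', 'high', 'high', 'medium', 'low', 'low', 'low',
                      'medium', 'low', 'medium', 'medium', 'medium', 'high', 'medium'],
    'Student':       ['no', 'no', 'no', 'no', 'yes', 'yes', 'yes',
                      'no', 'yes', 'yes', 'yes', 'no', 'yes', 'no'],
    'Credit_Rating': ['fair', 'excellent', 'fair', 'fair', 'fair', 'excellent', 'excellent',
                      'fair', 'fair', 'fair', 'excellent', 'excellent', 'fair', 'excellent'],
    'Buys_Computer': ['no', 'no', 'yes', 'yes', 'yes', 'no', 'yes',
                      'no', 'yes', 'yes', 'yes', 'yes', 'yes', 'no'],
})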

Formulation

$$ P(Buys computer | Age, Income, Student, Credit rating) = \frac{P(Age, Income, Student, Credit rating | Buys computer) P(Buys computer) } {P(Age, Income, Student, Credit rating)}$$

The "naive" part of the Naive Bayes algorithm is the assumption that the features are mutually independent given the class (which is rarely true in practice). But it allows us to simplify the mathematics:

$$ P(Age, Income, Student, Credit rating | Buys computer) = P(Age | Buys computer) * P(Income | Buys computer) * P(Student | Buys computer) * P(Credit rating | Buys computer)$$

Calculating the prior

prior = P(Buys computer)

P(Buys computer = c) = number of observations with label c / total number of observations

We need both P(Buys computer = yes) and P(Buys computer = no). We use the groupby function from pandas:

In [69]:
prior = data.groupby('Buys_Computer').size().div(len(data))
print(prior)
Buys_Computer
no     0.357143
yes    0.642857
dtype: float64
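As a side note, pandas offers an equivalent one-liner for the prior:

prior = data['Buys_Computer'].value_counts(normalize=True)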

Calculating likelihood

A likelihood table is generated for each feature of the dataset. Basically, the likelihood is the probability of observing each feature value given the class label.

$$ P(Age | Buys computer) $$
$$ P(Income | Buys computer) $$
$$ P(Student | Buys computer) $$
$$ P(Credit rating | Buys computer) $$
In [81]:
# P(feature | class) = count(class, feature value) / N, divided by P(class)
likelihood = {}
likelihood['Credit_Rating'] = data.groupby(['Buys_Computer', 'Credit_Rating']).size().div(len(data)).div(prior)
likelihood['Age'] = data.groupby(['Buys_Computer', 'Age']).size().div(len(data)).div(prior)
likelihood['Income'] = data.groupby(['Buys_Computer', 'Income']).size().div(len(data)).div(prior)
likelihood['Student'] = data.groupby(['Buys_Computer', 'Student']).size().div(len(data)).div(prior)

print (likelihood)
{'Credit_Rating': Buys_Computer  Credit_Rating
no             excellent        0.600000
               fair             0.400000
yes            excellent        0.333333
               fair             0.666667
dtype: float64, 'Age': Buys_Computer  Age  
no             <=30     0.600000
               >40      0.400000
yes            31-40    0.444444
               <=30     0.222222
               >40      0.333333
dtype: float64, 'Student': Buys_Computer  Student
no             no         0.800000
               yes        0.200000
yes            no         0.333333
               yes        0.666667
dtype: float64, 'Income': Buys_Computer  Income
no             high      0.400000
               low       0.200000
               medium    0.400000
yes            high      0.222222
               low       0.333333
               medium    0.444444
dtype: float64}
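The four nearly identical groupby calls can also be collapsed into a dictionary comprehension over the feature columns; a sketch equivalent to the cell above:

features = [c for c in data.columns if c != 'Buys_Computer']
likelihood = {f: data.groupby(['Buys_Computer', f]).size().div(len(data)).div(prior)
              for f in features}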

Calculating posterior

We need to predict whether a person will buy a computer given the following new information:

{"Age":'<=30', "Income":"medium", "Student":'yes' , "Credit_Rating":'fair'}

Substituting the values from the likelihood tables into the Bayes formula, we get:

In [77]:
# Probability that the person will buy
p_yes = likelihood['Age']['yes']['<=30'] * likelihood['Income']['yes']['medium'] * \
        likelihood['Student']['yes']['yes'] * likelihood['Credit_Rating']['yes']['fair'] \
        * prior['yes']

# Probability that the person will NOT buy
p_no = likelihood['Age']['no']['<=30'] * likelihood['Income']['no']['medium'] * \
       likelihood['Student']['no']['yes'] * likelihood['Credit_Rating']['no']['fair'] \
       * prior['no']

print ('Yes : ', p_yes)
print ('No :  ', p_no)
Yes :  0.028218694885361544
No :   0.0068571428571428551

As we can see, there is a higher probability that the person will buy the computer.

Note: We don't need to calculate the denominator of the Bayes formula, since in the end we only compare the probabilities for the two classes, and dividing both by the same number doesn't change the comparison.
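If calibrated posterior probabilities are needed, normalizing by the evidence (the sum of the two joint probabilities) is a one-liner using the values computed above:

# posterior = joint / evidence
print('P(yes | features) =', p_yes / (p_yes + p_no))  # ~0.80
print('P(no  | features) =', p_no / (p_yes + p_no))   # ~0.20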

Using Sklearn

We now solve the same problem using the Naive Bayes classifier implemented in the sklearn library.

In [30]:
from sklearn.preprocessing import LabelEncoder

# Encode each categorical column as integers (labels are numbered in alphabetical order)
encoded_data = data.apply(LabelEncoder().fit_transform)
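To see how each column gets encoded, we can inspect the fitted encoders; a small sketch:

for col in data.columns:
    le = LabelEncoder().fit(data[col])
    print(col, dict(zip(le.classes_, le.transform(le.classes_))))

For Age this prints {'31-40': 0, '<=30': 1, '>40': 2}, which is why the query below encodes to [1, 2, 1, 1].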
In [29]:
from sklearn.naive_bayes import MultinomialNB
import numpy as np
clf = MultinomialNB()
clf.fit(encoded_data.drop(['Buys_Computer'], axis=1), encoded_data['Buys_Computer'])

# {"Age":'<=30', "Income":"medium", "Student":'yes' , "Credit_Rating":'fair'}
# The data is encoded as [1,2,1,1]
X = np.array([1,2,1,1])
print (clf._joint_log_likelihood(X.reshape(1,-1)))
print ("Prediction of : ", clf.predict(X.reshape(1,-1)))
[[-8.29709436 -7.15971488]]
Prediction of :  [1]

Thus even with sklearn the answer is YES; the predicted label 1 is the encoded value for 'yes'. Note that sklearn works with the log-likelihood rather than the likelihood, for numerical stability.
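To get normalized class probabilities instead of raw joint log-likelihoods, sklearn also provides predict_proba:

print(clf.predict_proba(X.reshape(1, -1)))  # columns ordered as in clf.classes_

Keep in mind that MultinomialNB treats the encoded integers as counts, so its numbers will not exactly match the hand-computed ones above.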

The beauty of Naive Bayes on a discretized feature set is that it involves just counting and multiplication. The algorithm can also be extended to continuous feature variables; for those we need to decide which probability distribution to use for each feature and compute its likelihood.
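For instance, under a Gaussian assumption the per-class likelihood of a continuous feature is just the normal density at the observed value. A minimal sketch with hypothetical height data (the numbers below are made up for illustration):

import numpy as np
from scipy.stats import norm

# hypothetical heights (cm) of people who bought a computer
heights_yes = np.array([170.0, 165.0, 180.0, 175.0])
mu, sigma = heights_yes.mean(), heights_yes.std(ddof=1)

# likelihood P(height = 172 | Buys_Computer = yes) as a Gaussian density
print(norm.pdf(172, loc=mu, scale=sigma))

This is essentially what sklearn's GaussianNB does under the hood.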

Thus we have implemented the Naive Bayes algorithm using the vectorization properties of the pandas library.

In the next blog post we will solve the same Naive Bayes problem using Probabilistic Programming techniques.

