From b481f2c1c72538c6bb74fb949ab4ef171b4302a0 Mon Sep 17 00:00:00 2001 From: Pedro Fontana Date: Tue, 6 Apr 2021 16:37:22 -0300 Subject: [PATCH 1/6] confusion matrix added, prediction code updated --- 04_naive_bayes/04_naive_bayes.jl | 113 +++++++++++++++++++++++++------ 1 file changed, 91 insertions(+), 22 deletions(-) diff --git a/04_naive_bayes/04_naive_bayes.jl b/04_naive_bayes/04_naive_bayes.jl index 505571df..6cf2b713 100644 --- a/04_naive_bayes/04_naive_bayes.jl +++ b/04_naive_bayes/04_naive_bayes.jl @@ -239,26 +239,25 @@ Finally we arrived to the point of actually testing our model. This is what the "

 # ╔═╡ 4e470cba-2850-11eb-3563-cd9ead36f468
-function spam_filter_accurracy(x_test, y_test, model::BayesSpamFilter, α, tol=200)
-	N = length(y_test)
-	predictions = Array{Int64,1}(undef, N)
-	correct = 0
-	for i in 1:N
-		email = string([repeat(string(word, " "), n) for (word, n) in zip(model.vocabulary, x_test[:,i])]...)
-		p_ham, p_spam = spam_predict(email, model, α, tol)
-		if p_ham > p_spam
-			predictions[i] = 0
-			if y_test[i] == 0
-				correct += 1
-			end
-		else
-			predictions[i] = 1
-			if y_test[i] == 1
-				correct += 1
-			end
-		end
-	end
-	return correct/N
+# This function classifies each email as ham (0) or spam (1)
+function get_predictions(x_test, y_test, model::BayesSpamFilter, α, tol=200)
+	N = length(y_test)
+	predictions = Array{Int64, 1}(undef, N)
+	for i in 1:N
+		email = string([repeat(string(word, " "), n) for (word, n) in zip(model.vocabulary, x_test[:, i])]...)
+		pham, pspam = spam_predict(email, model, α, tol)
+		pred = argmax([pham, pspam]) - 1
+		predictions[i] = pred
+	end
+	return predictions
+end
+
+# ╔═╡ da3a76fc-96ee-11eb-2990-9902266f9e9c
+function spam_filter_accurracy(predictions, actual)
+	N = length(predictions)
+	correct = sum(predictions .== actual)
+	accuracy = correct / N
+	return accuracy
 end

 # ╔═╡ 71cc0158-29e3-11eb-0206-8d29109f858f
 md"
 As you can see below, the model (at least under this simple metric) is performing very well! An accurray of almost 0.95 is quite astonishing for a model so *naive* and simple, but it works!
 "
+# ╔═╡ 1f06c2d4-96ef-11eb-11e4-87f86b9d28f1
+predictions = get_predictions(x_test, y_test, spam_filter, 1)
+
 # ╔═╡ aa9f7ea4-2850-11eb-33e2-ade40fd0a360
-spam_filter_accurracy(x_test, y_test, spam_filter, 1)
+spam_filter_accurracy(predictions, y_test)
+
+# ╔═╡ bc6b59a0-96eb-11eb-08e0-87d26b1d1d44
+md"But let's take into account one more point.
+Our model classifieds mail into spam or ham. And the amount of ham mails is considerably higher than the spam ones. We can calculated this percentage:
+"
+
+# ╔═╡ a75fc9e2-970e-11eb-1e45-b14df45e0ccd
+sum(predictions)/length(predictions)
+
+# ╔═╡ 66fbbd2e-96f6-11eb-0de6-0f7efe4fd3a1
+md"Classification problems where there is an unequal distribution of classes in the dataset are called imbalanced classification problems.
+
+So a good way to see how our model is performing is to construct a confusion matrix.
+A confusion matrix is an N x N matrix, where N is the number of target classes.
+The matrix compares the actual target values with those predicted by our model.
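Since accuracy alone can flatter a model on an imbalanced dataset like this one, it is worth checking what a trivial baseline would score. A small Python sketch (the class balance here is illustrative, not the chapter's dataset):

```python
# With roughly 70% ham (0) and 30% spam (1), a "classifier" that
# always predicts ham already looks deceptively accurate.
actual = [0] * 70 + [1] * 30          # illustrative class balance
always_ham = [0] * len(actual)        # baseline: predict ham for everything

# Same accuracy formula as the notebook: correct predictions / total
accuracy = sum(p == a for p, a in zip(always_ham, actual)) / len(actual)
print(accuracy)  # 0.7, yet this baseline catches zero spam
```

A confusion matrix exposes exactly this failure mode, because the spam a model misses shows up as its own cell instead of being averaged away.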
+Let's construct one for our model:
+"
+
+# ╔═╡ bdd5f9c0-96fb-11eb-252d-d976eedf81e9
+function spam_filter_confusion_matrix(y_test, predictions)
+
+	# Create the confusion matrix and fill in its counts
+	confusion_matrix = [0 0; 0 0]
+
+	confusion_matrix[1,1] = sum(isequal(y_test[i], 0) & isequal(predictions[i], 0) for i in 1:length(y_test))
+	confusion_matrix[1,2] = sum(isequal(y_test[i], 1) & isequal(predictions[i], 0) for i in 1:length(y_test))
+	confusion_matrix[2,1] = sum(isequal(y_test[i], 0) & isequal(predictions[i], 1) for i in 1:length(y_test))
+	confusion_matrix[2,2] = sum(isequal(y_test[i], 1) & isequal(predictions[i], 1) for i in 1:length(y_test))
+
+	# Convert the confusion matrix into a DataFrame
+	confusion_df = DataFrame(Prediction = String[],
+				Ham_mail = Int64[],
+				Spam_mail = Int64[])
+
+	confusion_df = vcat(confusion_df, DataFrame(Prediction = "Model predicted Ham", Ham_mail = confusion_matrix[1,1], Spam_mail = confusion_matrix[1,2]))
+	confusion_df = vcat(confusion_df, DataFrame(Prediction = "Model predicted Spam", Ham_mail = confusion_matrix[2,1], Spam_mail = confusion_matrix[2,2]))
+
+	return confusion_df
+end
+
+# ╔═╡ 66863c2a-96f6-11eb-0cc4-0573fe5b890e
+confusion_matrix = spam_filter_confusion_matrix(y_test[:], predictions)
+
+# ╔═╡ 894d8cd8-970d-11eb-0c37-4997f3d1b85b
+md"Now we can calculate the accuracy of the model for each category separately.
+"
+
+# ╔═╡ 6a3a8f08-9708-11eb-0343-2bb2055ef097
+ham_accuracy = confusion_matrix[1,"Ham_mail"]/(confusion_matrix[1,"Ham_mail"] + confusion_matrix[2,"Ham_mail"])
+
+# ╔═╡ 2ef4aa8a-96fd-11eb-27dd-c32b885e51f8
+spam_accuracy = confusion_matrix[2,"Spam_mail"]/(confusion_matrix[1,"Spam_mail"] + confusion_matrix[2,"Spam_mail"])

 # ╔═╡ 3ce2070a-7156-11eb-1204-3d1850c7abee
 md"
 ### Summary
 In this chapter, we have used a naive-bayes approach to build a simple email spam filter. First, the dataset and the theoretical framework were introduced.
Using Bayes' theorem and the data available, we assigned probability of belonging to a spam or ham email to each word of the email dataset. The probability of a new email being classified as spam is therefore the product of the probabilities of each of its constituent words. Later, the data was pre-processed and a struct was defined for the spam filter object. Functions were then implemented to fit the spam filter object to the data.
-Finally, a metric for evaluating the accuracy of the model was implemented, giving a result of approximately $0.95$.
+Finally, we evaluated our model's performance by calculating its accuracy and building a confusion matrix.
 "

 # ╔═╡ 96834844-6d45-11eb-39a5-737dd8e43cb1
@@ -334,8 +393,18 @@ md"
 # ╠═4328faac-2850-11eb-3978-f9ccbf409a8a
 # ╟─89ee9bea-29e0-11eb-37a6-b16988b0a187
 # ╠═4e470cba-2850-11eb-3563-cd9ead36f468
+# ╠═da3a76fc-96ee-11eb-2990-9902266f9e9c
 # ╟─71cc0158-29e3-11eb-0206-8d29109f858f
+# ╠═1f06c2d4-96ef-11eb-11e4-87f86b9d28f1
 # ╠═aa9f7ea4-2850-11eb-33e2-ade40fd0a360
+# ╟─bc6b59a0-96eb-11eb-08e0-87d26b1d1d44
+# ╠═a75fc9e2-970e-11eb-1e45-b14df45e0ccd
+# ╟─66fbbd2e-96f6-11eb-0de6-0f7efe4fd3a1
+# ╠═bdd5f9c0-96fb-11eb-252d-d976eedf81e9
+# ╠═66863c2a-96f6-11eb-0cc4-0573fe5b890e
+# ╟─894d8cd8-970d-11eb-0c37-4997f3d1b85b
+# ╠═6a3a8f08-9708-11eb-0343-2bb2055ef097
+# ╠═2ef4aa8a-96fd-11eb-27dd-c32b885e51f8
 # ╟─3ce2070a-7156-11eb-1204-3d1850c7abee
 # ╟─96834844-6d45-11eb-39a5-737dd8e43cb1
 # ╟─603fcd02-8b6f-11eb-3290-d3f1b70dadfe
From 6b22c0e961d69c6b1db31825a199c6801b2e8ec9 Mon Sep 17 00:00:00 2001 From: Pedro Fontana Date: Tue, 6 Apr 2021 16:45:58 -0300 Subject: [PATCH 2/6] html added --- 04_naive_bayes/04_naive_bayes.jl | 7 ++--- docs/04_naive_bayes.jl.html | 44 +++++++++++++++++--------------- 2 files changed, 28 insertions(+), 23 deletions(-) diff --git a/04_naive_bayes/04_naive_bayes.jl b/04_naive_bayes/04_naive_bayes.jl index 6cf2b713..3400fbc2 100644 --- a/04_naive_bayes/04_naive_bayes.jl +++ b/04_naive_bayes/04_naive_bayes.jl @@ -262,7
+262,7 @@ end

 # ╔═╡ 71cc0158-29e3-11eb-0206-8d29109f858f
 md"
-As you can see below, the model (at least under this simple metric) is performing very well! An accurray of almost 0.95 is quite astonishing for a model so *naive* and simple, but it works!
+As you can see below, the model (at least under this simple metric) is performing very well! An accuracy of about 0.95 is quite astonishing for a model so *naive* and simple, but it works!
 "

 # ╔═╡ 1f06c2d4-96ef-11eb-11e4-87f86b9d28f1
@@ -272,8 +272,9 @@ predictions = get_predictions(x_test, y_test, spam_filter, 1)
 spam_filter_accurracy(predictions, y_test)

 # ╔═╡ bc6b59a0-96eb-11eb-08e0-87d26b1d1d44
-md"But let's take into account one more point.
-Our model classifieds mail into spam or ham. And the amount of ham mails is considerably higher than the spam ones. We can calculated this percentage:
+md"But we have to take one more thing into account.
+Our model classifies emails as spam or ham, and the number of ham emails is considerably higher than the number of spam ones.
+We can calculate this proportion:
 "

 # ╔═╡ a75fc9e2-970e-11eb-1e45-b14df45e0ccd
diff --git a/docs/04_naive_bayes.jl.html b/docs/04_naive_bayes.jl.html index 4acf3b5d..521865ee 100644 --- a/docs/04_naive_bayes.jl.html +++ b/docs/04_naive_bayes.jl.html @@ -149,19 +149,19 @@

To do list

+

To do list

We are currently working on:

  • Explain how to measure the model performance when the categories are not balanced #85.

-
14.6 μs

Naive Bayes: Spam or Ham?

-
5.8 μs
34.3 s

We all hate spam emails. How can Bayes help us with this? What we will be introducing in this chapter is a simple yet effective way of using Bayesian probability to make a spam filter of emails based on their content.

+
20.0 μs

Naive Bayes: Spam or Ham?

+
3.0 μs
2.3 ms

We all hate spam emails. How can Bayes help us with this? What we will be introducing in this chapter is a simple yet effective way of using Bayesian probability to make a spam filter of emails based on their content.

There are many possible origins of the 'Spam' word. Some people suggest Spam is a satirized way to refer to 'fake meat'. Hence, in the context of emails, this would just mean 'fake emails'. It makes sense, but the real story is another one. The origin of this term can be tracked to the 1970s, where the British surreal comedy troupe Monty Python gave life to it in a sketch of their Monty Python's Flying Circus series. In the sketch, a customer wants to make an order in a restaurant, but all the restaurant's items have spam in them. As the waitress describes the food, she repeats thw word spam, and as this happens, a group of Vikings sitting on another table nearby start singing 'Spam, spam, spam, spam, spam, spam, spam, spam, lovely spam! Wonderful spam! until they are told to shut up.

-
1.4 ms
2.7 s

Although the exact moment where this was first translated to different types of internet messages such as emails or chat messages can't be stated clearly, it is a well known fact that users in each of these messaging instances chose the word 'spam' as a reference to Monty Python's sketch, where spam was itself something unwanted, popping all over the menu and annoyingly trying to drown out the conversation.

-
6.4 μs
364 ms

Now that we had made some historical overview of the topic, we can start designing our spam filter. One of the most important things for the filter to work properly will we to feed it with some good training data. What do we mean by this? In this context, we mean to have a large enough corpus of emails classified as spam or ham (that's the way no-spam emails are called!), that the emails are collected from an heterogeneous group of persons (spam and ham emails will be not be the same from a software developer, a social scientist or a graphics designer), and that the proportion of spam vs. ham in our data is somewhat representative of the real proportion of mails we recieve. Fortunately, there are a lot of very good datasets available online. We will be using one from Kaggle, a community of data science enthusiasts and practitioners who publish datasets, make competitions and share their knowledge.

+
6.8 μs
3.0 s

Although the exact moment where this was first translated to different types of internet messages such as emails or chat messages can't be stated clearly, it is a well known fact that users in each of these messaging instances chose the word 'spam' as a reference to Monty Python's sketch, where spam was itself something unwanted, popping all over the menu and annoyingly trying to drown out the conversation.

+
4.0 μs
209 ms

Now that we had made some historical overview of the topic, we can start designing our spam filter. One of the most important things for the filter to work properly will we to feed it with some good training data. What do we mean by this? In this context, we mean to have a large enough corpus of emails classified as spam or ham (that's the way no-spam emails are called!), that the emails are collected from an heterogeneous group of persons (spam and ham emails will be not be the same from a software developer, a social scientist or a graphics designer), and that the proportion of spam vs. ham in our data is somewhat representative of the real proportion of mails we recieve. Fortunately, there are a lot of very good datasets available online. We will be using one from Kaggle, a community of data science enthusiasts and practitioners who publish datasets, make competitions and share their knowledge.

This dataset is already a bit pre-processed, as you will probably notice. It consists of 5172 emails, represented by the rows of a matrix or DataFrame. Each column represents a word from the 3000 most frequent words in all mails, and picking a row and a column will tell us how many times a given word appears in a particulara email. The last column indicates a 0 for ham emails and 1 for spam. Let's give it a look:

-
6.4 μs
raw_df
Email No.thetoectandforofamore
StringInt64Int64Int64Int64Int64Int64Int64
1
"Email 1"
0
0
1
0
0
0
2
2
"Email 2"
8
13
24
6
6
2
102
3
"Email 3"
0
0
1
0
0
0
8
4
"Email 4"
0
5
22
0
5
1
51
5
"Email 5"
7
6
17
1
5
2
57
6
"Email 6"
4
5
1
4
2
3
45
7
"Email 7"
5
3
1
3
2
1
37
8
"Email 8"
0
2
2
3
1
2
21
9
"Email 9"
2
2
3
0
0
1
18
10
"Email 10"
4
4
35
0
1
0
49
more
5172
"Email 5172"
22
24
5
1
6
5
148
12.4 s

What we are facing here is a classification problem, and we will code from scratch and use a supervised learning algorithm to find a solution with the help of Bayes' theorem. In particular, we will be using naive Bayes. What we are going to do is to treat each email just as a collection of words. The particular relationship between words and the context will not be taken into account here. Our strategy will be to estimate a probability of an incoming email of being ham or spam and making a decision based on that. Our general approach can be summarized as:

+
4.9 μs
raw_df
Email No.thetoectandforofamore
StringInt64Int64Int64Int64Int64Int64Int64
1
"Email 1"
0
0
1
0
0
0
2
2
"Email 2"
8
13
24
6
6
2
102
3
"Email 3"
0
0
1
0
0
0
8
4
"Email 4"
0
5
22
0
5
1
51
5
"Email 5"
7
6
17
1
5
2
57
6
"Email 6"
4
5
1
4
2
3
45
7
"Email 7"
5
3
1
3
2
1
37
8
"Email 8"
0
2
2
3
1
2
21
9
"Email 9"
2
2
3
0
0
1
18
10
"Email 10"
4
4
35
0
1
0
49
more
5172
"Email 5172"
22
24
5
1
6
5
148
776 ms

What we are facing here is a classification problem, and we will code from scratch and use a supervised learning algorithm to find a solution with the help of Bayes' theorem. In particular, we will be using naive Bayes. What we are going to do is to treat each email just as a collection of words. The particular relationship between words and the context will not be taken into account here. Our strategy will be to estimate a probability of an incoming email of being ham or spam and making a decision based on that. Our general approach can be summarized as:

P(spam|email)P(email|spam)P(spam)

P(ham|email)P(email|ham)P(ham)

Where we use sign instead of = sign because the denominator from Bayes' theorem is missing, but we won't need to calculate it as it is the same for both probabilities and all we are going to care about is a comparation of these two probabilities.

@@ -177,21 +177,25 @@

P(email|ham)=i=1nP(wordi|ham)

The multiplication of each of the word probabilities here stands from the supposition that all the words in the email are statistically independent. We have to stress that this is not necessarily true, and most likely false. Words in a language are never independent from one another, but this simple assumption seems to be enough for the level of complexity our problem requires.

Let's start building a solution for our problem and the details will be discussed later.

-
16.2 μs
1.2 s

First, we would like to filter some words that are very common in the english language, such as articles and pronouns, and that will most likely add noise rather than information to our classification algorithm. For this we will use two Julia packages that are specially designed for working with texts of any type. These are Languages.jl and TextAnalysis.jl. A good practice when dealing with models that learn from data like the one we are going to implement, is to divide our data in two: a training set and a testing set. We need to measure how good our model is performing, so we will train it with some data, and test it with some other data the model has never seen. This way we may be sure that the model is not tricking us. In Julia, the package MLDataUtils has some nice functionalities for data manipulations like this. We will use the functions splitobs to split our dataset in a train set and a test set and shuffleobs to randomize the order of our data in the split. It is important also to pass a labels array to our split function so that it knows how to properly split our dataset.

-
11.7 μs
211 ms

Now that we have our data clean and splitted for training and testing, let's return to the details of the calculations. The probability of a particular word, given that we have a spam email, can be calculated like so,

+
26.0 μs
171 ms

First, we would like to filter some words that are very common in the english language, such as articles and pronouns, and that will most likely add noise rather than information to our classification algorithm. For this we will use two Julia packages that are specially designed for working with texts of any type. These are Languages.jl and TextAnalysis.jl. A good practice when dealing with models that learn from data like the one we are going to implement, is to divide our data in two: a training set and a testing set. We need to measure how good our model is performing, so we will train it with some data, and test it with some other data the model has never seen. This way we may be sure that the model is not tricking us. In Julia, the package MLDataUtils has some nice functionalities for data manipulations like this. We will use the functions splitobs to split our dataset in a train set and a test set and shuffleobs to randomize the order of our data in the split. It is important also to pass a labels array to our split function so that it knows how to properly split our dataset.

+
11.8 μs
308 μs

Now that we have our data clean and splitted for training and testing, let's return to the details of the calculations. The probability of a particular word, given that we have a spam email, can be calculated like so,

P(wordi|spam)=Nwordi|spam+αNspam+αNvocabulary

P(wordi|ham)=Nwordi|ham+αNham+αNvocabulary

With this formulas in mind, we now know exactly what we have to calculate from our data. We are going to need the numbers Nwordi|spam and Nwordi|ham for each word, that is, the number of times that a given word wi is used in the spam and ham categories, respectively. Then Nspam and Nham are the total number of times that words are used in the spam and ham categories (considering all the repetitions of the same words too), and finally, Nvocabulary is the total number of unique words in the dataset. α is just a smoothing parameter, so that probability of words that, for example, are not in the spam categoriy don't give 0 probability.

As all this information will be particular for our dataset, so a clever way to aggregate all this is tu use a Julia struct, and we can define the attributes of the struct that we will be using over and over for the prediction. Below we can see the implementation. The relevant attributes of the struct will be wordscountham and wordscountspam, two dictionaries containing the frequency of appearance of each word in the ham and spam datasets, N_ham and N_spam the total number of words appearing in each category, and finally vocabulary, an array with all the unique words in our dataset. The line BayesSpamFilter() = new() is just the constructor of this struct. When we instantiate the filter, all the attributes will be undefined and we will have to define some functions to fill this variables with values relevant to our particular problem.

-
12.7 μs
923 μs

Now we are going to proceed to define some functions that will be important for our filter implementation. The function word_data below will help for counting the occurrencies of each word in ham and spam categories.

-
4.1 μs
words_count (generic function with 2 methods)
68.2 μs

Next, we will define the fit! function for our spam filter struct. We are using the bang(!) convention for the functions that modify in-place their arguments, in this case, the spam filter struc itself. This will be the function that will fit our model to the data, a typical procedure in Data Science and Machine Learning areas. This fit function will use mainly the words_count function defined before to fill all the undefined parameters in the filter's struct.

-
5.3 μs
fit! (generic function with 1 method)
31.3 μs

Now it is time to instantiate our spam filter and fit the model to the data. We do this with our training data so then we can measure how well it is working in our test data.

-
2.7 μs
499 ms

We are now almost ready to make some predictions and test our model. The function below is just the implementation of the formula TAL that we have already talked about. It will be used internally by the next function defined, spam_predict, which will recieve a new email –the one we would want to classify as spam or ham–, our fitted model, and two parameters, α which we have already discussed in the formula for P(wordi|spam) and P(wordi|ham), and tol. We saw that the calculation for P(email|spam) and P(email|ham) required the multiplication of each P(wordi|spam) and P(wordi|ham) term. When mails are too large, i.e., they have a lot of words, this multiplication may lead to very small probabilities, up to the point that the computer interprets those probabilities as zero. This can't happen, as we need values of P(email|spam) and P(email|ham) that are largar than zero so we can multiply them by P(spam) and P(ham) respectively and compare these values to make a prediction. The parameter tol is the maximum tolerance for the number of unique words in an email. If this number is greater than the parameter tol, only the most frequent words will be considered and the rest will be neglected. How many of these most frequent words? the first 'tol' most frequent words!

-
10.8 μs
word_spam_probability (generic function with 1 method)
25.0 μs
spam_predict (generic function with 2 methods)
71.8 μs
0
967 ms

Finally we arrived to the point of actually testing our model. This is what the function below is all about. We feed it with our model fitted with the training data, and the test data we had splitted at the beginning, as well as with the labels of the classification of this data. This function makes a prediction for each email in our test data, using the values of our model and then checks if the prediction was right. We count all the correct predictions and then we divide this number by the total amount of mails, giving us an accurracy measurement.

-
5.4 μs
spam_filter_accurracy (generic function with 2 methods)
69.3 μs

As you can see below, the model (at least under this simple metric) is performing very well! An accurray of almost 0.95 is quite astonishing for a model so naive and simple, but it works!

-
5.6 μs
0.9536082474226805
15.8 s

Summary

-

In this chapter, we have used a naive-bayes approach to build a simple email spam filter. First, the dataset and the theoretical framework were introduced. Using Bayes' theorem and the data available, we assigned probability of belonging to a spam or ham email to each word of the email dataset. The probability of a new email being classified as spam is therefore the product of the probabilities of each of its constituent words. Later, the data was pre-processed and a struct was defined for the spam filter object. Functions were then implemented to fit the spam filter object to the data. Finally, a metric for evaluating the accuracy of the model was implemented, giving a result of approximately 0.95.

-
4.4 μs

References

+
11.5 μs
1.1 ms

Now we are going to proceed to define some functions that will be important for our filter implementation. The function word_data below will help for counting the occurrencies of each word in ham and spam categories.

+
4.1 μs
words_count (generic function with 2 methods)
76.8 μs

Next, we will define the fit! function for our spam filter struct. We are using the bang(!) convention for the functions that modify in-place their arguments, in this case, the spam filter struc itself. This will be the function that will fit our model to the data, a typical procedure in Data Science and Machine Learning areas. This fit function will use mainly the words_count function defined before to fill all the undefined parameters in the filter's struct.

+
5.5 μs
fit! (generic function with 1 method)
31.6 μs

Now it is time to instantiate our spam filter and fit the model to the data. We do this with our training data so then we can measure how well it is working in our test data.

+
2.8 μs
445 ms

We are now almost ready to make some predictions and test our model. The function below is just the implementation of the formula TAL that we have already talked about. It will be used internally by the next function defined, spam_predict, which will recieve a new email –the one we would want to classify as spam or ham–, our fitted model, and two parameters, α which we have already discussed in the formula for P(wordi|spam) and P(wordi|ham), and tol. We saw that the calculation for P(email|spam) and P(email|ham) required the multiplication of each P(wordi|spam) and P(wordi|ham) term. When mails are too large, i.e., they have a lot of words, this multiplication may lead to very small probabilities, up to the point that the computer interprets those probabilities as zero. This can't happen, as we need values of P(email|spam) and P(email|ham) that are largar than zero so we can multiply them by P(spam) and P(ham) respectively and compare these values to make a prediction. The parameter tol is the maximum tolerance for the number of unique words in an email. If this number is greater than the parameter tol, only the most frequent words will be considered and the rest will be neglected. How many of these most frequent words? the first 'tol' most frequent words!

+
7.6 μs
word_spam_probability (generic function with 1 method)
26.4 μs
spam_predict (generic function with 2 methods)
74.9 μs
0
57.6 ms

Finally we arrived to the point of actually testing our model. This is what the function below is all about. We feed it with our model fitted with the training data, and the test data we had splitted at the beginning, as well as with the labels of the classification of this data. This function makes a prediction for each email in our test data, using the values of our model and then checks if the prediction was right. We count all the correct predictions and then we divide this number by the total amount of mails, giving us an accurracy measurement.

+
3.0 μs
get_predictions (generic function with 2 methods)
63.9 μs
spam_filter_accurracy (generic function with 1 method)
41.5 μs

As you can see below, the model (at least under this simple metric) is performing very well! An accurray of about 0.95 is quite astonishing for a model so naive and simple, but it works!

+
2.2 ms
predictions
14.8 s
0.9503865979381443
28.0 μs

But we have to take into account one more thing. Our model classify mails into spam or ham and the amount of ham mails is considerably higher than the spam ones. We can calculated this percentage:

+
12.5 μs
0.3021907216494845
890 ns

This classification problemas where there is an unequal distribution of classes in the dataset are called Imbalanced classification problems.

+

So a good way to see how our model is performing is to construct a confusion matrix. A Confusion matrix is an N x N matrix, where N is the number of target classes. The matrix compares the actual target values with those predicted by the our model. Lets construct one for our model:

+
10.3 μs
splam_filter_confusion_matrix (generic function with 1 method)
203 μs
confusion_matrix
PredictionHam_mailSpam_mail
StringInt64Int64
1
"Model predicted Ham"
1054
29
2
"Model predicted Spam"
48
421
144 μs

So now we can calculate the accuracy of the model segmented by category.

+
10.8 μs
ham_accuracy
0.956442831215971
4.7 μs
spam_accuracy
0.9355555555555556
4.4 μs

Summary

+

In this chapter, we have used a naive-bayes approach to build a simple email spam filter. First, the dataset and the theoretical framework were introduced. Using Bayes' theorem and the data available, we assigned probability of belonging to a spam or ham email to each word of the email dataset. The probability of a new email being classified as spam is therefore the product of the probabilities of each of its constituent words. Later, the data was pre-processed and a struct was defined for the spam filter object. Functions were then implemented to fit the spam filter object to the data. Finally, we evaluated our model performance calculating the accuracy and making a confusion matrix.

+
20.1 μs
2.5 ms

Give us feedback

+
13.7 μs

Give us feedback

This book is currently in a beta version. We are looking forward to getting feedback and criticism:

  • Submit a GitHub issue here.

    @@ -215,8 +219,8 @@

Thank you!

-
26.1 μs
10.0 μs
+
9.7 μs
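The rendered cells rely on the chapter's smoothed word probabilities, P(wordᵢ|class) = (N_wordᵢ|class + α)/(N_class + α·N_vocabulary), and on multiplying many such terms, which is why the notebook needs the `tol` cutoff to dodge numeric underflow. A common alternative is to sum logarithms instead; a hedged Python sketch (all names and counts here are illustrative, not the notebook's API):

```python
import math

def word_log_prob(word_count, class_total, vocab_size, alpha=1):
    # Laplace-smoothed log P(word | class), following the chapter's formula
    return math.log((word_count + alpha) / (class_total + alpha * vocab_size))

def classify(email_words, counts_ham, counts_spam, n_ham, n_spam, vocab_size,
             p_ham_prior, p_spam_prior):
    # Summing logs instead of multiplying raw probabilities avoids underflow
    log_ham = math.log(p_ham_prior)
    log_spam = math.log(p_spam_prior)
    for w in email_words:
        log_ham += word_log_prob(counts_ham.get(w, 0), n_ham, vocab_size)
        log_spam += word_log_prob(counts_spam.get(w, 0), n_spam, vocab_size)
    return 0 if log_ham > log_spam else 1  # 0 = ham, 1 = spam

# Toy usage with made-up word counts
counts_ham = {"meeting": 30, "project": 20}
counts_spam = {"winner": 25, "free": 25}
pred = classify(["free", "winner"], counts_ham, counts_spam,
                n_ham=50, n_spam=50, vocab_size=4,
                p_ham_prior=0.7, p_spam_prior=0.3)
print(pred)  # 1: classified as spam
```

Because only the comparison of the two scores matters, working in log space gives the same decision as the product form without ever producing a zero.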
From dfad540a0054ef272bcce8e14d07b23cb1cfab77 Mon Sep 17 00:00:00 2001 From: Pedro Fontana Date: Tue, 6 Apr 2021 16:54:38 -0300 Subject: [PATCH 3/6] Added github ribbons and deleted issue --- 04_naive_bayes/04_naive_bayes.jl | 4 ++-- docs/04_naive_bayes.jl.html | 12 +++++------- 2 files changed, 7 insertions(+), 9 deletions(-) diff --git a/04_naive_bayes/04_naive_bayes.jl b/04_naive_bayes/04_naive_bayes.jl index 3400fbc2..0843d136 100644 --- a/04_naive_bayes/04_naive_bayes.jl +++ b/04_naive_bayes/04_naive_bayes.jl @@ -21,8 +21,8 @@ md"### To do list

 We are currently working on:
-* Explain how to measure the model performance when the categories are not balanced [#85](https://github.com/unbalancedparentheses/data_science_in_julia_for_hackers/issues/85).
-"
+
+";

 # ╔═╡ 0de04b90-2835-11eb-1369-01c64bc38c42
 md"
diff --git a/docs/04_naive_bayes.jl.html b/docs/04_naive_bayes.jl.html index 521865ee..5b18482a 100644 --- a/docs/04_naive_bayes.jl.html +++ b/docs/04_naive_bayes.jl.html @@ -9,6 +9,8 @@
From dfad540a0054ef272bcce8e14d07b23cb1cfab77 Mon Sep 17 00:00:00 2001 From: Pedro Fontana Date: Tue, 6 Apr 2021 16:54:38 -0300 Subject: [PATCH 3/6] Added github ribbons and deleted issue --- 04_naive_bayes/04_naive_bayes.jl | 4 ++-- docs/04_naive_bayes.jl.html | 12 +++++------- 2 files changed, 7 insertions(+), 9 deletions(-) diff --git a/04_naive_bayes/04_naive_bayes.jl b/04_naive_bayes/04_naive_bayes.jl index 3400fbc2..0843d136 100644 --- a/04_naive_bayes/04_naive_bayes.jl +++ b/04_naive_bayes/04_naive_bayes.jl @@ -21,8 +21,8 @@ md"### To do list We are currently working on: -* Explain how to measure the model performance when the categories are not balanced [#85](https://github.com/unbalancedparentheses/data_science_in_julia_for_hackers/issues/85). -" + +"; # ╔═╡ 0de04b90-2835-11eb-1369-01c64bc38c42 md" diff --git a/docs/04_naive_bayes.jl.html b/docs/04_naive_bayes.jl.html index 521865ee..5b18482a 100644 --- a/docs/04_naive_bayes.jl.html +++ b/docs/04_naive_bayes.jl.html @@ -9,6 +9,8 @@ + + -

To do list

-

We are currently working on:

-
    -
  • Explain how to measure the model performance when the categories are not balanced #85.

    -
  • -
-
20.0 μs

Naive Bayes: Spam or Ham?

+
+ Fork me on GitHub +
131 μs

Naive Bayes: Spam or Ham?

3.0 μs
2.3 ms

We all hate spam emails. How can Bayes help us with this? What we will be introducing in this chapter is a simple yet effective way of using Bayesian probability to make a spam filter of emails based on their content.

There are many possible origins of the 'Spam' word. Some people suggest Spam is a satirized way to refer to 'fake meat'. Hence, in the context of emails, this would just mean 'fake emails'. It makes sense, but the real story is another one. The origin of this term can be tracked to the 1970s, where the British surreal comedy troupe Monty Python gave life to it in a sketch of their Monty Python's Flying Circus series. In the sketch, a customer wants to make an order in a restaurant, but all the restaurant's items have spam in them. As the waitress describes the food, she repeats thw word spam, and as this happens, a group of Vikings sitting on another table nearby start singing 'Spam, spam, spam, spam, spam, spam, spam, spam, lovely spam! Wonderful spam! until they are told to shut up.

6.8 μs
3.0 s

Although the exact moment where this was first translated to different types of internet messages such as emails or chat messages can't be stated clearly, it is a well known fact that users in each of these messaging instances chose the word 'spam' as a reference to Monty Python's sketch, where spam was itself something unwanted, popping all over the menu and annoyingly trying to drown out the conversation.

From 472c67bd1d88655102eec75fb342db68b4e3a6d7 Mon Sep 17 00:00:00 2001 From: Pedro Fontana Date: Thu, 8 Apr 2021 18:04:25 -0300 Subject: [PATCH 4/6] fixed typos --- 04_naive_bayes/04_naive_bayes.jl | 33 ++++++++++++++++---------------- 1 file changed, 17 insertions(+), 16 deletions(-) diff --git a/04_naive_bayes/04_naive_bayes.jl b/04_naive_bayes/04_naive_bayes.jl index 0843d136..bed0c8c0 100644 --- a/04_naive_bayes/04_naive_bayes.jl +++ b/04_naive_bayes/04_naive_bayes.jl @@ -34,7 +34,7 @@ md" We all hate spam emails. How can Bayes help us with this? What we will be introducing in this chapter is a simple yet effective way of using Bayesian probability to make a spam filter of emails based on their content. There are many possible origins of the 'Spam' word. Some people suggest Spam is a satirized way to refer to 'fake meat'. Hence, in the context of emails, this would just mean 'fake emails'. It makes sense, but the real story is another one. -The origin of this term can be tracked to the 1970s, where the British surreal comedy troupe Monty Python gave life to it in a sketch of their *Monty Python's Flying Circus* series. In the sketch, a customer wants to make an order in a restaurant, but all the restaurant's items have *spam* in them. As the waitress describes the food, she repeats thw word spam, and as this happens, a group of Vikings sitting on another table nearby start singing '*Spam, spam, spam, spam, spam, spam, spam, spam, lovely spam! Wonderful spam!* until they are told to shut up. +The origin of this term can be tracked to the 1970s, where the British surreal comedy troupe Monty Python gave life to it in a sketch of their *Monty Python's Flying Circus* series. In the sketch, a customer wants to make an order in a restaurant, but all the restaurant's items have *spam* in them. 
As the waitress describes the food, she repeats the word spam, and as this happens, a group of Vikings sitting on another table nearby start singing '*Spam, spam, spam, spam, spam, spam, spam, spam, lovely spam! Wonderful spam!* until they are told to shut up. " # ╔═╡ 78c91c38-4eae-11eb-3569-07af96eb6881 @@ -49,10 +49,11 @@ imresize(load("./imgs/I-don-t-like-spam.jpg"), (300, 500)) # ╔═╡ c898651a-4eac-11eb-26ac-ddb1885afc13 md" -Now that we had made some historical overview of the topic, we can start designing our spam filter. One of the most important things for the filter to work properly will we to feed it with some good training data. What do we mean by this? In this context, we mean to have a large enough corpus of emails classified as spam or ham (that's the way no-spam emails are called!), that the emails are collected from an heterogeneous group of persons (spam and ham emails will be not be the same from a software developer, a social scientist or a graphics designer), and that the proportion of spam vs. ham in our data is somewhat representative of the real proportion of mails we recieve. +Now that we have made some historical overview of the topic, we can start designing our spam filter. +One of the most important things for the filter to work properly will be to feed it with some good training data. What do we mean by this? In this context, we mean to have a large enough corpus of emails classified as spam or ham (that's the way no-spam emails are called!), that the emails are collected from an heterogeneous group of persons (spam and ham emails will be not be the same from a software developer, a social scientist or a graphics designer), and that the proportion of spam vs. ham in our data is somewhat representative of the real proportion of mails we receive. Fortunately, there are a lot of very good datasets available online. 
We will be using one from [Kaggle](https://www.kaggle.com/balaka18/email-spam-classification-dataset-csv), a community of data science enthusiasts and practitioners who publish datasets, make competitions and share their knowledge. -This dataset is already a bit pre-processed, as you will probably notice. It consists of 5172 emails, represented by the rows of a matrix or DataFrame. Each column represents a word from the 3000 most frequent words in all mails, and picking a row and a column will tell us how many times a given word appears in a particulara email. The last column indicates a 0 for ham emails and 1 for spam. Let's give it a look: +This dataset is already a bit pre-processed, as you will probably notice. It consists of 5172 emails, represented by the rows of a matrix or DataFrame. Each column represents a word from the 3000 most frequent words in all mails, and picking a row and a column will tell us how many times a given word appears in a particular email. The last column indicates a 0 for ham emails and 1 for spam. Let's give it a look: " # ╔═╡ 4f79bc6c-2835-11eb-3ac9-5d49e01ee5d4 @@ -65,7 +66,7 @@ What we are facing here is a **classification** problem, and we will code from s $P(spam|email) \propto P(email|spam)P(spam)$ $P(ham|email) \propto P(email|ham)P(ham)$ -Where we use $\propto$ sign instead of $=$ sign because the denominator from Bayes' theorem is missing, but we won't need to calculate it as it is the same for both probabilities and all we are going to care about is a comparation of these two probabilities. +Where we use $\propto$ sign instead of $=$ sign because the denominator from Bayes' theorem is missing, but we won't need to calculate it as it is the same for both probabilities and all we are going to care about is a comparison of these two probabilities. So what do $P(email|spam)$ and $P(email|ham)$ mean and how do we calculate them? 
To answer this question, we have to remember that we are interpreting each email just as a collection of words, with no importance on their order within the text. In this naive approach, the semantics are not taken into account. In this scope, the conditional probability $P(email|spam)$ just means the probability that a given email can be generated with the collection of words that appear in the spam category of our data. If this still sounds a bit confusing, let's make a quick example. Consider for a moment that our training spam set of emails consists just of these three emails: @@ -79,7 +80,7 @@ Also consider we have a new email and we want to ask ourselves what $P(email|spa new email: 'apply and win all this products!' -As we already said, $P(email|spam)$ stands for the plausibility of the new email being generated by the words we encountered in our training spam email set. We can see that words like 'win' –which in our training set appears in the form of 'won', but there is a standard technique in linguistics named **Lemmatisation**, which groups together inflected forms of a word, letting us consider 'win' and 'won' as the same word– and 'product' appear rather commonly in our training data. So we will expect $P(email|spam)$ to be relatively high in this fake and simple example, as it contains words that are repeated among our spam emails data. +As we already said, $P(email|spam)$ stands for the plausibility of the new email being generated by the words we encountered in our training spam email set. We can see that words like 'win' –which in our training set appears in the form of 'won', but there is a standard technique in linguistics named **Lemmatization**, which groups together inflected forms of a word, letting us consider 'win' and 'won' as the same word– and 'product' appear rather commonly in our training data. 
So we will expect $P(email|spam)$ to be relatively high in this fake and simple example, as it contains words that are repeated among our spam emails data. Let's make all this discussion a bit more explicitly mathematical. The simplest way to write this in a mathematical way is to take each word appearing in the email and calculate the probability of it appearing in spam emails and ham emails. Then, we do this for each word in the email and finally multiply them, @@ -125,10 +126,10 @@ Now that we have our data clean and splitted for training and testing, let's ret $P(word_i|spam) = \frac{N_{word_i|spam} + \alpha}{N_{spam} + \alpha N_{vocabulary}}$ $P(word_i|ham) = \frac{N_{word_i|ham} + \alpha}{N_{ham} + \alpha N_{vocabulary}}$ -With this formulas in mind, we now know exactly what we have to calculate from our data. We are going to need the numbers $N_{word_i|spam}$ and $N_{word_i|ham}$ for each word, that is, the number of times that a given word $w_i$ is used in the spam and ham categories, respectively. Then $N_{spam}$ and $N_{ham}$ are the total number of times that words are used in the spam and ham categories (considering all the repetitions of the same words too), and finally, $N_{vocabulary}$ is the total number of unique words in the dataset. $α$ is just a smoothing parameter, so that probability of words that, for example, are not in the spam categoriy don't give 0 probability. +With these formulas in mind, we now know exactly what we have to calculate from our data. We are going to need the numbers $N_{word_i|spam}$ and $N_{word_i|ham}$ for each word, that is, the number of times that a given word $w_i$ is used in the spam and ham categories, respectively. Then $N_{spam}$ and $N_{ham}$ are the total number of times that words are used in the spam and ham categories (considering all the repetitions of the same words too), and finally, $N_{vocabulary}$ is the total number of unique words in the dataset. 
$α$ is just a smoothing parameter, so that probability of words that, for example, are not in the spam category don't give 0 probability. -As all this information will be particular for our dataset, so a clever way to aggregate all this is tu use a Julia *struct*, and we can define the attributes of the struct that we will be using over and over for the prediction. Below we can see the implementation. The relevant attributes of the struct will be *words_count_ham* and *words_count_spam*, two dictionaries containing the frequency of appearance of each word in the ham and spam datasets, *N_ham* and *N_spam* the total number of words appearing in each category, and finally *vocabulary*, an array with all the unique words in our dataset. -The line *BayesSpamFilter() = new()* is just the constructor of this struct. When we instantiate the filter, all the attributes will be undefined and we will have to define some functions to fill this variables with values relevant to our particular problem. +As all this information will be particular for our dataset, so a clever way to aggregate all this is to use a Julia *struct*, and we can define the attributes of the struct that we will be using over and over for the prediction. Below we can see the implementation. The relevant attributes of the struct will be *words_count_ham* and *words_count_spam*, two dictionaries containing the frequency of appearance of each word in the ham and spam datasets, *N_ham* and *N_spam* the total number of words appearing in each category, and finally *vocabulary*, an array with all the unique words in our dataset. +The line *BayesSpamFilter() = new()* is just the constructor of this struct. When we instantiate the filter, all the attributes will be undefined and we will have to define some functions to fill these variables with values relevant to our particular problem. 
" # ╔═╡ 018f2c24-28e1-11eb-1de2-53ad33f4fd61 @@ -143,7 +144,7 @@ end # ╔═╡ 4a067d1e-28e5-11eb-3a2a-232cedcb83b6 md" -Now we are going to proceed to define some functions that will be important for our filter implementation. The function *word_data* below will help for counting the occurrencies of each word in ham and spam categories. +Now we are going to proceed to define some functions that will be important for our filter implementation. The function *word_data* below will help for counting the occurrences of each word in ham and spam categories. " # ╔═╡ f409ac78-284f-11eb-349a-d3314219032c @@ -186,7 +187,7 @@ end # ╔═╡ 3b5cd01c-29d8-11eb-2260-a3f029106a08 md" -We are now almost ready to make some predictions and test our model. The function below is just the implementation of the formula TAL that we have already talked about. It will be used internally by the next function defined, *spam_predict*, which will recieve a new email –the one we would want to classify as spam or ham–, our fitted model, and two parameters, α which we have already discussed in the formula for $P(word_i|spam)$ and $P(word_i|ham)$, and *tol*. We saw that the calculation for $P(email|spam)$ and $P(email|ham)$ required the multiplication of each $P(word_i|spam)$ and $P(word_i|ham)$ term. When mails are too large, i.e., they have a lot of words, this multiplication may lead to very small probabilities, up to the point that the computer interprets those probabilities as zero. This can't happen, as we need values of $P(email|spam)$ and $P(email|ham)$ that are largar than zero so we can multiply them by $P(spam)$ and $P(ham)$ respectively and compare these values to make a prediction. The parameter *tol* is the maximum tolerance for the number of unique words in an email. If this number is greater than the parameter *tol*, only the most frequent words will be considered and the rest will be neglected. How many of these most frequent words? the first '*tol*' most frequent words! 
+We are now almost ready to make some predictions and test our model. The function below is just the implementation of the formula TAL that we have already talked about. It will be used internally by the next function defined, *spam_predict*, which will receive a new email –the one we would want to classify as spam or ham–, our fitted model, and two parameters, α which we have already discussed in the formula for $P(word_i|spam)$ and $P(word_i|ham)$, and *tol*. We saw that the calculation for $P(email|spam)$ and $P(email|ham)$ required the multiplication of each $P(word_i|spam)$ and $P(word_i|ham)$ term. When mails are too large, i.e., they have a lot of words, this multiplication may lead to very small probabilities, up to the point that the computer interprets those probabilities as zero. This can't happen, as we need values of $P(email|spam)$ and $P(email|ham)$ that are larger than zero so we can multiply them by $P(spam)$ and $P(ham)$ respectively and compare these values to make a prediction. The parameter *tol* is the maximum tolerance for the number of unique words in an email. If this number is greater than the parameter *tol*, only the most frequent words will be considered and the rest will be neglected. How many of these most frequent words? the first '*tol*' most frequent words! " # ╔═╡ 327eca7e-2850-11eb-22a0-3b25c80c3a10 @@ -235,7 +236,7 @@ end # ╔═╡ 89ee9bea-29e0-11eb-37a6-b16988b0a187 md" -Finally we arrived to the point of actually testing our model. This is what the function below is all about. We feed it with our model fitted with the training data, and the test data we had splitted at the beginning, as well as with the labels of the classification of this data. This function makes a prediction for each email in our test data, using the values of our model and then checks if the prediction was right. We count all the correct predictions and then we divide this number by the total amount of mails, giving us an accurracy measurement. 
+Finally we arrived to the point of actually testing our model. This is what the function below is all about. We feed it with our model fitted with the training data, and the test data we had splitted at the beginning, as well as with the labels of the classification of this data. This function makes a prediction for each email in our test data, using the values of our model and then checks if the prediction was right. We count all the correct predictions and then we divide this number by the total amount of mails, giving us an accuracy measurement. " # ╔═╡ 4e470cba-2850-11eb-3563-cd9ead36f468 @@ -253,7 +254,7 @@ function get_predictions(x_test, y_test, model::BayesSpamFilter, α, tol=200) end # ╔═╡ da3a76fc-96ee-11eb-2990-9902266f9e9c -function spam_filter_accurracy(predictions, actual) +function spam_filter_accuracy(predictions, actual) N = length(predictions) correct = sum(predictions .== actual) accuracy = correct /N @@ -262,18 +263,18 @@ end # ╔═╡ 71cc0158-29e3-11eb-0206-8d29109f858f md" -As you can see below, the model (at least under this simple metric) is performing very well! An accurray of about 0.95 is quite astonishing for a model so *naive* and simple, but it works! +As you can see below, the model (at least under this simple metric) is performing very well! An accuracy of about 0.95 is quite astonishing for a model so *naive* and simple, but it works! " # ╔═╡ 1f06c2d4-96ef-11eb-11e4-87f86b9d28f1 predictions = get_predictions(x_test, y_test, spam_filter, 1) # ╔═╡ aa9f7ea4-2850-11eb-33e2-ade40fd0a360 -spam_filter_accurracy(predictions, y_test) +spam_filter_accuracy(predictions, y_test) # ╔═╡ bc6b59a0-96eb-11eb-08e0-87d26b1d1d44 md"But we have to take into account one more thing. -Our model classify mails into spam or ham and the amount of ham mails is considerably higher than the spam ones. +Our model classifies mails into spam or ham and the amount of ham mails is considerably higher than the spam ones. 
We can calculate this percentage: " @@ -393,7 +394,7 @@ md" # ╠═3fe86ada-2850-11eb-12db-cf51560e9f75 # ╠═4328faac-2850-11eb-3978-f9ccbf409a8a # ╟─89ee9bea-29e0-11eb-37a6-b16988b0a187 -# ╠═4e470cba-2850-11eb-3563-cd9ead36f468 +# ╟─4e470cba-2850-11eb-3563-cd9ead36f468 # ╠═da3a76fc-96ee-11eb-2990-9902266f9e9c # ╟─71cc0158-29e3-11eb-0206-8d29109f858f # ╠═1f06c2d4-96ef-11eb-11e4-87f86b9d28f1 From 20074c6cd9b26aaa7781fadc10aa06ebd4d0d2d3 Mon Sep 17 00:00:00 2001 From: Pedro Fontana Date: Thu, 8 Apr 2021 18:06:46 -0300 Subject: [PATCH 5/6] html updated --- docs/04_naive_bayes.jl.html | 73 ++++++++++++++++++------------------- 1 file changed, 36 insertions(+), 37 deletions(-) diff --git a/docs/04_naive_bayes.jl.html b/docs/04_naive_bayes.jl.html index 5b18482a..ca4fa9d5 100644 --- a/docs/04_naive_bayes.jl.html +++ b/docs/04_naive_bayes.jl.html @@ -9,7 +9,6 @@ - -


This dataset is already a bit pre-processed, as you will probably notice. It consists of 5172 emails, represented by the rows of a matrix or DataFrame. Each column represents a word from the 3000 most frequent words in all mails, and picking a row and a column will tell us how many times a given word appears in a particular email. The last column indicates a 0 for ham emails and 1 for spam. Let's give it a look:

raw_df — the first rows of the dataset (Email No. is a String column; the word-count columns are Int64, and only the first few of the 3000 word columns are shown):

| Email No. | the | to | ect | and | for | of | a | … |
|-----------|-----|-----|-----|-----|-----|-----|-----|-----|
| Email 1 | 0 | 0 | 1 | 0 | 0 | 0 | 2 | … |
| Email 2 | 8 | 13 | 24 | 6 | 6 | 2 | 102 | … |
| Email 3 | 0 | 0 | 1 | 0 | 0 | 0 | 8 | … |
| Email 4 | 0 | 5 | 22 | 0 | 5 | 1 | 51 | … |
| Email 5 | 7 | 6 | 17 | 1 | 5 | 2 | 57 | … |
| Email 6 | 4 | 5 | 1 | 4 | 2 | 3 | 45 | … |
| Email 7 | 5 | 3 | 1 | 3 | 2 | 1 | 37 | … |
| Email 8 | 0 | 2 | 2 | 3 | 1 | 2 | 21 | … |
| Email 9 | 2 | 2 | 3 | 0 | 0 | 1 | 18 | … |
| Email 10 | 4 | 4 | 35 | 0 | 1 | 0 | 49 | … |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | |
| Email 5172 | 22 | 24 | 5 | 1 | 6 | 5 | 148 | … |

What we are facing here is a classification problem, and we will code from scratch and use a supervised learning algorithm to find a solution with the help of Bayes' theorem. In particular, we will be using naive Bayes. What we are going to do is to treat each email just as a collection of words; the particular relationship between words and their context will not be taken into account here. Our strategy will be to estimate the probability of an incoming email being ham or spam and make a decision based on that. Our general approach can be summarized as:

-

P(spam|email)P(email|spam)P(spam)

-

P(ham|email)P(email|ham)P(ham)

-

Where we use sign instead of = sign because the denominator from Bayes' theorem is missing, but we won't need to calculate it as it is the same for both probabilities and all we are going to care about is a comparison of these two probabilities.

-

So what do P(email|spam) and P(email|ham) mean and how do we calculate them? To answer this question, we have to remember that we are interpreting each email just as a collection of words, with no importance on their order within the text. In this naive approach, the semantics are not taken into account. In this scope, the conditional probability P(email|spam) just means the probability that a given email can be generated with the collection of words that appear in the spam category of our data. If this still sounds a bit confusing, let's make a quick example. Consider for a moment that our training spam set of emails consists just of these three emails:


email 1: 'are you interested in buying my product?'

email 2: 'congratulations! you've won 1000!'

email 3: 'check out this product!'

Also consider we have a new email, and we want to ask ourselves what P(email|spam) is for it. This new email looks like this:

new email: 'apply and win all this products!'

As we already said, P(email|spam) stands for the plausibility of the new email being generated by the words we encountered in our training spam email set. We can see that words like 'win' –which in our training set appears in the form of 'won', but there is a standard technique in linguistics named Lemmatization, which groups together inflected forms of a word, letting us consider 'win' and 'won' as the same word– and 'product' appear rather commonly in our training data. So we will expect P(email|spam) to be relatively high in this fake and simple example, as it contains words that are repeated among our spam emails data.

Let's make all this discussion a bit more explicitly mathematical. The simplest way to write this is to take each word appearing in the email, calculate the probability of it appearing in spam emails and in ham emails, and finally multiply these probabilities over all the words in the email,

$P(email|spam) = \prod_{i=1}^{n} P(word_i|spam)$

$P(email|ham) = \prod_{i=1}^{n} P(word_i|ham)$

The multiplication of the individual word probabilities here stems from the assumption that all the words in the email are statistically independent. We have to stress that this is not necessarily true, and most likely false. Words in a language are never independent from one another, but this simple assumption seems to be enough for the level of complexity our problem requires.
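The product above can be sketched in a few lines of Julia. The word-probability tables here are made-up numbers for illustration only, not values estimated from the chapter's dataset:

```julia
# Hypothetical per-word probabilities, for illustration only.
p_word_spam = Dict("win" => 0.05, "product" => 0.04, "meeting" => 0.001)
p_word_ham  = Dict("win" => 0.002, "product" => 0.01, "meeting" => 0.03)

# Naive independence assumption: P(email|class) is the product of P(word_i|class).
p_email(words, p_word) = prod(get(p_word, w, 1.0) for w in words)

words = ["win", "product"]
p_email(words, p_word_spam) > p_email(words, p_word_ham)  # the email leans spam
```

In practice, multiplying many small probabilities underflows to zero; as discussed later, the notebook mitigates this by limiting how many unique words of an email are considered.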

Let's start building a solution for our problem and the details will be discussed later.


First, we would like to filter out some words that are very common in the English language, such as articles and pronouns, which will most likely add noise rather than information to our classification algorithm. For this we will use two Julia packages that are specially designed for working with texts of any type: Languages.jl and TextAnalysis.jl. A good practice when dealing with models that learn from data, like the one we are going to implement, is to divide our data in two: a training set and a testing set. We need to measure how well our model is performing, so we will train it with some data and test it with other data the model has never seen. This way we can be sure that the model is not tricking us. In Julia, the package MLDataUtils has some nice functionalities for data manipulations like this. We will use the function splitobs to split our dataset into a train set and a test set, and shuffleobs to randomize the order of our data before the split. It is also important to pass a labels array to our split function so that it knows how to split our dataset properly.
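To make the idea concrete, here is a minimal stand-in for what shuffleobs and splitobs accomplish, written with only the standard library. The function name and the 70/30 ratio are illustrative choices, not the notebook's exact code:

```julia
using Random

# Shuffle column indices, then cut them into train and test portions.
# X is a words × emails count matrix; y holds the 0/1 label of each email.
function train_test_split(X, y; at = 0.7, rng = MersenneTwister(0))
    idx = shuffle(rng, 1:length(y))
    cut = floor(Int, at * length(y))
    X[:, idx[1:cut]], y[idx[1:cut]], X[:, idx[cut+1:end]], y[idx[cut+1:end]]
end

X = rand(0:5, 3000, 10)   # fake dataset: 3000 word counts for 10 emails
y = rand(0:1, 10)
x_train, y_train, x_test, y_test = train_test_split(X, y)
```

Shuffling before cutting matters: if the dataset stores all the ham emails first, an unshuffled split would put almost no spam in the test set.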


Now that we have our data cleaned and split for training and testing, let's return to the details of the calculations. The probability of a particular word, given that we have a spam email, can be calculated like so,

$P(word_i|spam) = \frac{N_{word_i|spam} + \alpha}{N_{spam} + \alpha N_{vocabulary}}$

$P(word_i|ham) = \frac{N_{word_i|ham} + \alpha}{N_{ham} + \alpha N_{vocabulary}}$

With these formulas in mind, we now know exactly what we have to calculate from our data. We are going to need the numbers $N_{word_i|spam}$ and $N_{word_i|ham}$ for each word, that is, the number of times that a given word $w_i$ is used in the spam and ham categories, respectively. Then $N_{spam}$ and $N_{ham}$ are the total number of times that words are used in the spam and ham categories (considering all the repetitions of the same words too), and finally, $N_{vocabulary}$ is the total number of unique words in the dataset. $\alpha$ is just a smoothing parameter, so that words that, for example, never appear in the spam category are not assigned zero probability.
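The smoothed estimate translates directly into code. A sketch with made-up counts follows; the function name is ours, not the notebook's:

```julia
# Laplace-smoothed P(word|class): (count + α) / (N_class + α * N_vocabulary)
word_prob(count, N_class, N_vocab; α = 1) = (count + α) / (N_class + α * N_vocab)

word_prob(0, 5_000, 3_000)    # a word never seen in the class still gets probability > 0
word_prob(120, 5_000, 3_000)
```

With α = 0, a single unseen word would zero out the whole product for $P(email|class)$; the smoothing keeps every factor strictly positive.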

+
17.7 μs
219 ms

First, we would like to filter some words that are very common in the english language, such as articles and pronouns, and that will most likely add noise rather than information to our classification algorithm. For this we will use two Julia packages that are specially designed for working with texts of any type. These are Languages.jl and TextAnalysis.jl. A good practice when dealing with models that learn from data like the one we are going to implement, is to divide our data in two: a training set and a testing set. We need to measure how good our model is performing, so we will train it with some data, and test it with some other data the model has never seen. This way we may be sure that the model is not tricking us. In Julia, the package MLDataUtils has some nice functionalities for data manipulations like this. We will use the functions splitobs to split our dataset in a train set and a test set and shuffleobs to randomize the order of our data in the split. It is important also to pass a labels array to our split function so that it knows how to properly split our dataset.


Now that we have our data cleaned and split for training and testing, let's return to the details of the calculations. The probability of a particular word, given that we have a spam email, can be calculated like so:

$P(word_i|spam) = \frac{N_{word_i|spam} + \alpha}{N_{spam} + \alpha N_{vocabulary}}$

and analogously for ham emails:

$P(word_i|ham) = \frac{N_{word_i|ham} + \alpha}{N_{ham} + \alpha N_{vocabulary}}$

With these formulas in mind, we now know exactly what we have to calculate from our data. We are going to need the numbers $N_{word_i|spam}$ and $N_{word_i|ham}$ for each word, that is, the number of times a given word $word_i$ is used in the spam and ham categories, respectively. Then $N_{spam}$ and $N_{ham}$ are the total number of words used in the spam and ham categories (counting repetitions of the same word too), and finally, $N_{vocabulary}$ is the total number of unique words in the dataset. $\alpha$ is just a smoothing parameter, so that words that, for example, never appear in the spam category don't get assigned a probability of 0.
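As a quick sketch of this formula (with illustrative names, not the notebook's actual code), the smoothed probability can be computed like this:

```julia
# Laplace-smoothed probability of a word given a category (spam or ham).
# word_count:     times the word appears in that category's emails
# category_count: total number of words in that category
# vocab_size:     number of unique words in the whole dataset
# α:              smoothing parameter, so unseen words don't get probability 0
function word_probability(word_count, category_count, vocab_size, α=1)
    return (word_count + α) / (category_count + α * vocab_size)
end

# A word never seen in the spam category still gets a small nonzero probability:
word_probability(0, 10_000, 5_000)  # 1/15000, small but greater than 0
```

Without the α term, a single unseen word would force the whole product of probabilities to zero.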

All this information is particular to our dataset, so a clever way to aggregate it is to use a Julia struct, with attributes that we will be using over and over for prediction. Below we can see the implementation. The relevant attributes of the struct are words_count_ham and words_count_spam, two dictionaries containing the frequency of appearance of each word in the ham and spam datasets, N_ham and N_spam, the total number of words appearing in each category, and finally vocabulary, an array with all the unique words in our dataset. The line BayesSpamFilter() = new() is just the constructor of this struct. When we instantiate the filter, all the attributes will be undefined and we will have to define some functions to fill these variables with values relevant to our particular problem.
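A minimal sketch of such a struct could look like the following (field names follow the description above and may differ slightly from the notebook's code):

```julia
mutable struct BayesSpamFilter
    words_count_ham::Dict{String, Int64}   # word frequencies in ham emails
    words_count_spam::Dict{String, Int64}  # word frequencies in spam emails
    N_ham::Int64                           # total number of words in ham emails
    N_spam::Int64                          # total number of words in spam emails
    vocabulary::Array{String, 1}           # all unique words in the dataset
    BayesSpamFilter() = new()              # inner constructor: fields start undefined
end

spam_filter = BayesSpamFilter()  # attributes stay undefined until we fit the model
```

Accessing an undefined field at this point would throw an UndefRefError, which is why we need fitting functions to populate the struct before predicting.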



Now we are going to proceed to define some functions that will be important for our filter implementation. The function words_count below will help us count the occurrences of each word in the ham and spam categories.
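As an illustrative sketch of the idea (the notebook's words_count operates on the word-count matrix, but the counting logic is the same):

```julia
# Count how many times each word occurs across a set of tokenized emails.
function count_occurrences(emails::Vector{Vector{String}})
    counts = Dict{String, Int64}()
    for email in emails, word in email
        counts[word] = get(counts, word, 0) + 1
    end
    return counts
end

count_occurrences([["free", "money", "money"], ["money", "now"]])
# Dict with "money" => 3, "free" => 1, "now" => 1
```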


Next, we will define the fit! function for our spam filter struct. We are using the bang (!) convention for functions that modify their arguments in place, in this case the spam filter struct itself. This will be the function that fits our model to the data, a typical procedure in data science and machine learning. This fit function will mainly use the words_count function defined before to fill in all the undefined parameters of the filter's struct.
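The idea of fit! can be sketched as follows. This is a self-contained toy version with illustrative names (SketchFilter, count_words), not the notebook's exact implementation:

```julia
mutable struct SketchFilter             # stand-in for BayesSpamFilter
    words_count_ham::Dict{String, Int64}
    words_count_spam::Dict{String, Int64}
    N_ham::Int64
    N_spam::Int64
    vocabulary::Vector{String}
    SketchFilter() = new()
end

# Count the words of the emails whose label matches `label` (0 = ham, 1 = spam).
function count_words(emails, labels, label)
    counts = Dict{String, Int64}()
    for (email, l) in zip(emails, labels)
        l == label || continue
        for w in email
            counts[w] = get(counts, w, 0) + 1
        end
    end
    return counts
end

# fit!: fill in all the undefined fields of the filter from the training data.
function fit!(model::SketchFilter, emails, labels)
    model.words_count_ham  = count_words(emails, labels, 0)
    model.words_count_spam = count_words(emails, labels, 1)
    model.N_ham  = sum(values(model.words_count_ham))
    model.N_spam = sum(values(model.words_count_spam))
    model.vocabulary = unique(reduce(vcat, emails))
    return model
end

model = fit!(SketchFilter(), [["hi", "friend"], ["free", "money", "money"]], [0, 1])
model.N_spam  # 3
```

After fit! runs, every field of the struct is defined and the model is ready to make predictions.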


Now it is time to instantiate our spam filter and fit the model to the data. We do this with our training data so that we can then measure how well the model performs on the test data.


We are now almost ready to make some predictions and test our model. The function below is just the implementation of the formula we have already talked about. It will be used internally by the next function defined, spam_predict, which receives a new email (the one we want to classify as spam or ham), our fitted model, and two parameters: α, which we have already discussed in the formulas for P(wordᵢ|spam) and P(wordᵢ|ham), and tol. We saw that the calculation of P(email|spam) and P(email|ham) requires the multiplication of each P(wordᵢ|spam) and P(wordᵢ|ham) term. When emails are very long, i.e., they have a lot of words, this multiplication may lead to very small probabilities, up to the point that the computer rounds them to zero. This can't happen, as we need values of P(email|spam) and P(email|ham) greater than zero so we can multiply them by P(spam) and P(ham) respectively and compare the results to make a prediction. The parameter tol is the maximum number of unique words of an email that we will consider: if the email has more unique words than tol, only the tol most frequent words are taken into account and the rest are neglected.
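A sketch of this prediction step (with illustrative names; the notebook's spam_predict differs in details) shows how tol caps the number of unique words considered:

```julia
# Unnormalized P(category|email): the class prior times the product of the
# per-word probabilities, keeping only the `tol` most frequent unique words.
function category_probability(email_words, word_probs::Dict{String, Float64},
                              prior; tol=200)
    counts = Dict{String, Int64}()
    for w in email_words
        counts[w] = get(counts, w, 0) + 1
    end
    # sort unique words by frequency and keep at most `tol` of them
    kept = sort(collect(counts); by=p -> p.second, rev=true)
    kept = kept[1:min(tol, length(kept))]
    p = float(prior)
    for (word, n) in kept
        p *= get(word_probs, word, 1.0)^n
    end
    return p
end

probs = Dict("money" => 0.1, "hi" => 0.5)
category_probability(["money", "money", "hi"], probs, 0.5)  # 0.5 * 0.1^2 * 0.5
```

Computing this for both categories and picking the larger value gives the predicted class.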


Finally we arrive at the point of actually testing our model. This is what the functions below are all about. We feed them with our model fitted to the training data, the test data we split off at the beginning, and the corresponding labels. A prediction is made for each email in the test data using the values stored in our model, and then we check whether the prediction was right. We count all the correct predictions and divide by the total number of emails, giving us an accuracy measurement.
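The accuracy computation itself boils down to comparing the prediction vector with the true labels:

```julia
# Accuracy: the fraction of predictions that match the actual labels.
function spam_filter_accuracy(predictions, actual)
    N = length(predictions)
    correct = sum(predictions .== actual)
    return correct / N
end

spam_filter_accuracy([0, 1, 1, 0], [0, 1, 0, 0])  # 0.75
```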


Here we can see how our model classifies the test emails: 0 for ham and 1 for spam.


As you can see below, the model (at least under this simple metric) is performing very well! An accuracy of about 0.95 is quite astonishing for a model so naive and simple, but it works!

0.9574742268041238

But we have to take into account one more thing. Our model classifies emails into spam or ham, and the number of ham emails is considerably higher than the number of spam ones. Let's see what proportion of the predictions are spam:

0.367012117554204

So only about 37% of the emails in the test set are classified as spam; the rest are ham.


Classification problems like this one, where there is an unequal distribution of classes in the dataset, are called imbalanced classification problems.

A good way to see how our model is performing in this situation is to construct a confusion matrix. A confusion matrix is an N x N matrix, where N is the number of target classes. It compares the actual target values with those predicted by our model. Let's construct one for our model:


confusion_matrix:

Prediction              Ham mail    Spam mail
Model predicted Ham         1079           36
Model predicted Spam          30          407
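Building such a matrix takes only a few lines. This is a sketch of the idea for our binary case (0 = ham, 1 = spam), not the notebook's exact code:

```julia
# Rows: predicted class (ham, spam); columns: actual class (ham, spam).
function confusion_matrix(predictions, actual)
    M = zeros(Int64, 2, 2)
    for (pred, act) in zip(predictions, actual)
        M[pred + 1, act + 1] += 1
    end
    return M
end

confusion_matrix([0, 1, 1, 0], [0, 1, 0, 0])
# 2×2 Matrix{Int64}:
#  2  0
#  1  1
```

The diagonal holds the correct predictions; the off-diagonal entries are the two kinds of mistakes (ham flagged as spam, and spam that slipped through).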

So now we can calculate the accuracy of the model segmented by category.

ham_accuracy = 0.9729486023444545
spam_accuracy = 0.9187358916478555
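These per-category accuracies come straight from the confusion matrix: each is the diagonal count divided by its column total, i.e., the number of emails actually in that class. With the values from the table above:

```julia
M = [1079 36; 30 407]  # confusion matrix: rows are predicted, columns are actual

ham_accuracy  = M[1, 1] / (M[1, 1] + M[2, 1])  # correctly predicted ham / actual ham,  ≈ 0.973
spam_accuracy = M[2, 2] / (M[1, 2] + M[2, 2])  # correctly predicted spam / actual spam, ≈ 0.919
```

Note that the spam accuracy is noticeably lower than the overall accuracy suggested, which is exactly the kind of detail the confusion matrix reveals on imbalanced data.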

Summary

In this chapter, we have used a naive Bayes approach to build a simple email spam filter. First, the dataset and the theoretical framework were introduced. Using Bayes' theorem and the available data, we assigned to each word in the email dataset a probability of appearing in a spam or ham email. The probability of a new email being spam is then computed from the product of the probabilities of its constituent words, together with the prior probability of spam. Later, the data was pre-processed and a struct was defined for the spam filter object. Functions were then implemented to fit the spam filter to the data. Finally, we evaluated the model's performance by calculating its accuracy and building a confusion matrix.


References



Give us feedback

This book is currently in a beta version. We are looking forward to getting feedback and criticism:

  • Submit a GitHub issue here.


Thank you!
