Email Classifier using Mahout on Hadoop


Machine learning has three branches, and one of them is called "Classification".

What is classification?

Classification is a supervised learning technique that learns from existing categorised documents (i.e. the training data set) and then tries to predict the category of previously unseen data.

Examples include predicting diseases, filtering spam email and detecting fraudulent bank transactions.


What is supervised learning?

Supervised learning is a "Machine Learning" technique in which the system is given a training dataset together with the correct result for each record, and builds its concepts from them. For example: Naive Bayes Classifier.

As humans, we have probably been doing supervised learning without realising it. We do not open mails with the subject line "YOU WON THE LOTTERY" or "CHEAP MEDICINES". From prior experience, we know that these words in the subject line indicate that the email is SPAM. The words need not appear in the same sequence every time; the sequence keeps changing, but we have seen enough emails with similar wordings to recognise them.

Supervised learning functions in a similar manner. When building a classifier, for example an "email spam classifier", we train it using data which has already been labelled as "Spam" or "Non-Spam", and then use that classifier to make predictions on unseen emails.
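To make the idea concrete, here is a minimal sketch of a Naive Bayes style classifier written with awk. It is a toy and not Mahout's implementation: the four training subject lines, the function names train_data and classify, and the use of add-one smoothing are all assumptions made for illustration.

```shell
# Hypothetical labelled training data, one "label<TAB>subject line" per row.
train_data() {
  printf 'spam\tYOU WON THE LOTTERY\n'
  printf 'spam\tCHEAP MEDICINES ONLINE\n'
  printf 'ham\tMeeting agenda for Monday\n'
  printf 'ham\tLunch on Friday\n'
}

# classify SUBJECT: prints the label with the highest Naive Bayes score.
classify() {
  train_data | awk -F '\t' -v subject="$1" '
    {
      # Training: count documents per label and words per label.
      docs[$1]++
      n = split(tolower($2), words, "[^a-z]+")
      for (i = 1; i <= n; i++) {
        if (words[i] == "") continue
        count[$1, words[i]]++
        total[$1]++
        vocab[words[i]] = 1
      }
    }
    END {
      vocab_size = 0
      for (w in vocab) vocab_size++
      n = split(tolower(subject), words, "[^a-z]+")
      best = ""
      for (label in docs) {
        # log P(label) + sum of log P(word | label), with add-one smoothing
        score = log(docs[label] / NR)
        for (i = 1; i <= n; i++) {
          if (words[i] == "") continue
          score += log((count[label, words[i]] + 1) / (total[label] + vocab_size))
        }
        if (best == "" || score > best_score) { best = label; best_score = score }
      }
      print best
    }'
}

classify "You won cheap lottery tickets"   # prints: spam
classify "Agenda for Monday lunch"         # prints: ham
```

The word order in the unseen subject line does not matter here, only which words appear, which matches the intuition described above.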

The following steps are involved in building a classifier:

1) Get/build the training set

To build a classifier, we need training data that is similar to the actual data to be classified. A point to note here is that the classifier can only be as good as its training data. For example, for an email spam classifier we will require the subject lines and their labels (spam/non-spam).

2) Selecting the features/dimensions

Once we have the dataset, we need to select the features/dimensions that will be used to build the classification model. For example: a) for an email spam filter, it could be the words in the subject line; b) for bank transactions, it could be the amount, account number, location of the transaction, etc.

3) Dimension reduction/data preparation

Once we have identified the dimensions, we need to bring the data into a format that can be used with the algorithm. At this stage we can also split the input data set into training and test datasets.

4) Build and train the classifier

Build a classifier and train it using training data set.

5) Validate

Once we have the classifier ready, run it on the test data set and verify that it works well. If not, we might have to change the selected model or features.
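The split into training and test data mentioned in the steps above can be sketched in a few lines of shell. This is only an illustration under stated assumptions: the function name split_dataset, the file names and the fixed random seed are hypothetical, and each input line is assumed to hold one labelled record.

```shell
# split_dataset INPUT TEST_PCT TRAIN_OUT TEST_OUT
# Sends roughly TEST_PCT percent of the lines of INPUT to TEST_OUT and the
# rest to TRAIN_OUT.
split_dataset() {
  : > "$3"; : > "$4"   # make sure both outputs exist, even if one stays empty
  awk -v pct="$2" -v train_out="$3" -v test_out="$4" '
    BEGIN { srand(42) }                  # fixed seed so the split is repeatable
    {
      if (rand() * 100 < pct) print > test_out
      else                    print > train_out
    }' "$1"
}

# Usage with a hypothetical file of 100 labelled subject lines.
workdir=$(mktemp -d)
i=1
while [ "$i" -le 100 ]; do
  printf 'ham\tsubject line %d\n' "$i"
  i=$((i + 1))
done > "$workdir/all.tsv"

split_dataset "$workdir/all.tsv" 15 "$workdir/train.tsv" "$workdir/test.tsv"
wc -l "$workdir/train.tsv" "$workdir/test.tsv"
```

Because every record is assigned at random, the test set holds about 15% of the records rather than exactly 15%. The classifier never sees the test records during training, which is what makes the validation step meaningful.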

Here is an email classifier built on Mahout which uses a free email data set. The source website provides the data already classified into spam and non-spam (termed 'ham').

We can use Mahout to build a Naive Bayes classifier to classify the emails.

  1. Download the spam and ham corpus
curl -O <URL of 20030228_spam.tar.bz2>
curl -O <URL of 20030228_easy_ham.tar.bz2>

  2. Extract them; we will end up with two directories, spam and easy_ham
tar xvf 20030228_spam.tar.bz2
tar xvf 20030228_easy_ham.tar.bz2

  3. Create a directory for the dataset.
mkdir dataset

  4. Move the spam and easy_ham directories into dataset.
mv easy_ham/ spam/ dataset/

  5. Copy the dataset to HDFS:
hadoop fs -put dataset dataset

  6. Convert the dataset into a SequenceFile.
mahout seqdirectory -i dataset -o dataset-seq

  7. Convert the SequenceFile into vectors.
mahout seq2sparse -i dataset-seq -o dataset-vectors -lnorm -nv -wt tfidf

  8. Split the dataset into two datasets, one for training and one for testing. The split is random: 85% of the records go to training and 15% to testing.
mahout split -i dataset-vectors/tfidf-vectors --trainingOutput train-vectors --testOutput test-vectors --randomSelectionPct 15 --overwrite --sequenceFiles -xm sequential

  9. Train the classifier:
mahout trainnb -i train-vectors -el -o model -li labelindex -ow -c

  10. Test against the test dataset:
mahout testnb -i test-vectors -m model -l labelindex -ow -o testing-test -c
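The vectorisation step above asks for TF-IDF weights (-wt tfidf). As a rough illustration of what such a weight expresses, here is a small awk sketch that uses the textbook formula tf * ln(N / df). The exact formula Mahout applies may differ, and the function name tfidf and the three sample documents are assumptions made for this example.

```shell
# tfidf: reads "doc_id<TAB>text" lines on stdin (one document per line) and
# prints "doc_id<TAB>term<TAB>weight", where weight = tf * ln(N / df).
tfidf() {
  awk -F '\t' '
    {
      doc[NR] = $1
      n = split(tolower($2), words, "[^a-z]+")
      for (i = 1; i <= n; i++) {
        w = words[i]
        if (w == "") continue
        if (!((NR, w) in tf)) df[w]++   # first time this term is seen in this document
        tf[NR, w]++
      }
    }
    END {
      for (key in tf) {
        split(key, part, SUBSEP)        # key is "line number SUBSEP term"
        printf "%s\t%s\t%.4f\n", doc[part[1]], part[2], tf[key] * log(NR / df[part[2]])
      }
    }' | sort
}

# Three hypothetical documents.
printf 'spam1\tcheap cheap medicines\nspam2\tcheap lottery\nham1\tmeeting agenda\n' | tfidf
```

In this sample, "cheap" occurs in two of the three documents, so each occurrence carries less weight than a word such as "lottery" that occurs in only one. That is the property that makes TF-IDF useful for telling spam vocabulary apart from common words.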


Output Confusion Matrix

a      b      <-- Classified as
382    1      |  383   a = easy_ham
1      69     |  70    b = spam

Interpretations of the Matrix

  • Out of 453 emails, 451 were classified correctly
  • 382 were ham and were correctly classified as ham (true positive, taking ham as the positive class)
  • 69 were spam and were correctly classified as spam (true negative)
  • 1 record was spam, but it has been classified as ham (false positive)
  • 1 record was ham, but it has been classified as spam (false negative)

This matrix shows that the classifier classified the test data set with 99.5585% accuracy (451 of 453 emails).
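The accuracy figure can be checked directly from the four cells of the matrix. The sketch below is one way to do the arithmetic; the function name metrics is hypothetical, ham is taken as the positive class, and precision and recall are added as an assumption about what a reader might also want to see.

```shell
# metrics TP FN FP TN: prints accuracy, precision and recall as percentages.
metrics() {
  awk -v tp="$1" -v fneg="$2" -v fpos="$3" -v tn="$4" 'BEGIN {
    total = tp + fneg + fpos + tn
    printf "accuracy  %.4f\n", 100 * (tp + tn) / total
    printf "precision %.4f\n", 100 * tp / (tp + fpos)
    printf "recall    %.4f\n", 100 * tp / (tp + fneg)
  }'
}

# 382 ham kept as ham, 1 ham flagged as spam,
# 1 spam let through as ham, 69 spam caught.
metrics 382 1 1 69
# accuracy  99.5585
# precision 99.7389
# recall    99.7389
```

Accuracy is (382 + 69) / 453. With only 70 spam emails among the 453, accuracy alone can look flattering, so it helps to read the per-class figures alongside it.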

