Lesson 2 - SVM

In [1]:
"""
This is the code to accompany the Lesson 2 (SVM) mini-project.

Use a SVM to identify emails from the Enron corpus by their authors:    
Sara has label 0
Chris has label 1

"""
import os
import sys
from time import time
os.chdir("C:\\Users\\PR043\\OneDrive for Business\\Training\\Datacamp\\Python\\Udacity\\Machine Learning\\ud120-projects\\tools")
sys.path.append(r"../tools/")
from email_preprocess import preprocess
import numpy
from sklearn.svm import SVC

### features_train and features_test are the features for the training
### and testing datasets, respectively
### labels_train and labels_test are the corresponding item labels
features_train, features_test, labels_train, labels_test = preprocess()
no. of Chris training emails: 7936
no. of Sara training emails: 7884

Go to the svm directory to find the starter code (svm/svm_author_id.py). Import, create, train and make predictions with the sklearn SVC classifier. When creating the classifier, use a linear kernel (if you forget this step, you will be unpleasantly surprised by how long the classifier takes to train). What is the accuracy of the classifier?

In [2]:
#########################################################
clf = SVC(kernel='linear', C=1)
t0 = time()
clf.fit(features_train, labels_train)
print "training time:", round(time()-t0, 3), "s"
t0 = time()
print(sum(clf.predict(features_test) == 1))   # number of emails predicted as Chris (label 1)
print "prediction time:", round(time()-t0, 3), "s"
print(clf.score(features_test, labels_test))  # accuracy on the test set
#########################################################
training time: 219.316 s
881
prediction time: 22.984 s
0.984072810011
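
A side note on the evaluation pattern used throughout this notebook: the cell above calls clf.predict twice (once for the class count, once implicitly inside clf.score). A minimal sketch that predicts once and derives both numbers from the same array, using sklearn.metrics.accuracy_score:

from sklearn.metrics import accuracy_score
pred = clf.predict(features_test)                     # predict once
print("Chris (1) predictions: %d" % sum(pred == 1))   # count of class-1 predictions
print("accuracy: %s" % accuracy_score(labels_test, pred))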

One way to speed up an algorithm is to train it on a smaller training dataset. The tradeoff is that the accuracy almost always goes down when you do this. Let’s explore this more concretely: add in the following two lines immediately before training your classifier.

features_train = features_train[:len(features_train)/100]
labels_train = labels_train[:len(labels_train)/100]

These lines effectively slice the training dataset down to 1% of its original size, tossing out 99% of the training data. You can leave all other code unchanged. What’s the accuracy now?
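
One caveat: these slicing lines rely on Python 2's integer division, where len(...)/100 yields an int. If you run this under Python 3 (an assumption; this notebook was run under Python 2), use floor division instead:

features_train = features_train[:len(features_train)//100]  # // forces integer division in Python 3
labels_train = labels_train[:len(labels_train)//100]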

In [4]:
features_train = features_train[:len(features_train)/100]
labels_train = labels_train[:len(labels_train)/100]

clf = SVC(kernel='linear', C=1)
t0 = time()
clf.fit(features_train, labels_train)
print "training time:", round(time()-t0, 3), "s"
t0 = time()
print(sum(clf.predict(features_test) == 1))
print "prediction time:", round(time()-t0, 3), "s"
print(clf.score(features_test, labels_test))
training time: 0.143 s
1046
prediction time: 1.392 s
0.884527872582

Keep the training set slice code from the last quiz, so that you are still training on only 1% of the full training set. Change the kernel of your SVM to “rbf”. What’s the accuracy now, with this more complex kernel?

In [5]:
clf = SVC(kernel='rbf', C=1)
t0 = time()
clf.fit(features_train, labels_train)
print "training time:", round(time()-t0, 3), "s"
t0 = time()
print(sum(clf.predict(features_test) == 1))
print "prediction time:", round(time()-t0, 3), "s"
print(clf.score(features_test, labels_test))
training time: 0.154 s
1540
prediction time: 1.535 s
0.616040955631

Keep the training set size and rbf kernel from the last quiz, but try several values of C (say, 10.0, 100., 1000., and 10000.). Which one gives the best accuracy?

In [6]:
clf = SVC(kernel='rbf', C=10)
t0 = time()
clf.fit(features_train, labels_train)
print "training time:", round(time()-t0, 3), "s"
t0 = time()
print(sum(clf.predict(features_test) == 1))
print "prediction time:", round(time()-t0, 3), "s"
print(clf.score(features_test, labels_test))
training time: 0.147 s
1540
prediction time: 1.589 s
0.616040955631
In [7]:
clf = SVC(kernel='rbf', C=100)
t0 = time()
clf.fit(features_train, labels_train)
print "training time:", round(time()-t0, 3), "s"
t0 = time()
print(sum(clf.predict(features_test) == 1))
print "prediction time:", round(time()-t0, 3), "s"
print(clf.score(features_test, labels_test))
training time: 0.154 s
1540
prediction time: 1.542 s
0.616040955631
In [8]:
clf = SVC(kernel='rbf', C=1000)
t0 = time()
clf.fit(features_train, labels_train)
print "training time:", round(time()-t0, 3), "s"
t0 = time()
print(sum(clf.predict(features_test) == 1))
print "prediction time:", round(time()-t0, 3), "s"
print(clf.score(features_test, labels_test))
training time: 0.147 s
1177
prediction time: 1.488 s
0.821387940842
In [9]:
clf = SVC(kernel='rbf', C=10000)
t0 = time()
clf.fit(features_train, labels_train)
print "training time:", round(time()-t0, 3), "s"
t0 = time()
print(sum(clf.predict(features_test) == 1))
print "prediction time:", round(time()-t0, 3), "s"
print(clf.score(features_test, labels_test))
training time: 0.132 s
1018
prediction time: 1.233 s
0.892491467577

C=10000 gives the best accuracy.
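
The four cells above can also be collapsed into a single sweep. A minimal sketch, assuming features_train and labels_train still hold the 1% slice:

for c in [10., 100., 1000., 10000.]:
    clf = SVC(kernel='rbf', C=c)                 # rbf kernel, varying C
    clf.fit(features_train, labels_train)
    acc = clf.score(features_test, labels_test)  # accuracy on the test set
    print("C = %s, accuracy = %s" % (c, acc))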

Now that you’ve optimized C for the RBF kernel, go back to using the full training set. In general, having a larger training set will improve the performance of your algorithm, so (by tuning C and training on a large dataset) we should get a fairly optimized result. What is the accuracy of the optimized SVM?

In [10]:
features_train, features_test, labels_train, labels_test = preprocess()
t0 = time()
clf.fit(features_train, labels_train)   # clf still holds the rbf kernel with C=10000 from In [9]
print "training time:", round(time()-t0, 3), "s"
t0 = time()
print(sum(clf.predict(features_test) == 1))
print "prediction time:", round(time()-t0, 3), "s"
print(clf.score(features_test, labels_test))
no. of Chris training emails: 7936
no. of Sara training emails: 7884
training time: 149.075 s
877
prediction time: 14.902 s
0.990898748578

What class does your SVM (0 or 1, corresponding to Sara and Chris respectively) predict for element 10 of the test set? The 26th? The 50th? (Use the RBF kernel, C=10000, and 1% of the training set. Normally you'd get the best results using the full training set, but we found that using 1% sped up the computation considerably and did not change our results, so feel free to use that shortcut here.) And just to be clear, the data point numbers given here (10, 26, 50) assume a zero-indexed list, so the correct answer for element #100 would be found using something like answer = predictions[100].

In [11]:
features_train = features_train[:len(features_train)/100]
labels_train = labels_train[:len(labels_train)/100]
t0 = time()
clf.fit(features_train, labels_train)   # refit on the 1% slice; clf is still rbf, C=10000
print "training time:", round(time()-t0, 3), "s"
t0 = time()
print(sum(clf.predict(features_test) == 1))
print "prediction time:", round(time()-t0, 3), "s"
print(clf.score(features_test, labels_test))
training time: 0.154 s
1018
prediction time: 1.233 s
0.892491467577
In [14]:
pred = clf.predict(features_test)
print "Predictions for elements 10, 26 and 50 are:", pred[10], pred[26], pred[50]
Predictions for elements 10, 26 and 50 are: 1 0 1
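
Since label 0 is Sara and label 1 is Chris (per the docstring at the top), a small mapping makes the predictions readable. The names dict below is illustrative, not part of the starter code:

names = {0: "Sara", 1: "Chris"}
for i in (10, 26, 50):
    print("element %d: %s" % (i, names[pred[i]]))  # e.g. "element 10: Chris"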

There are over 1700 test events--how many are predicted to be in the “Chris” (1) class? (Use the RBF kernel, C=10000., and the full training set.)

In [16]:
features_train, features_test, labels_train, labels_test = preprocess()
clf.fit(features_train, labels_train)   # rbf kernel, C=10000, full training set
print "Number of events predicted in the Chris class:", sum(clf.predict(features_test) == 1)
no. of Chris training emails: 7936
no. of Sara training emails: 7884
Number of events predicted in the Chris class: 877
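
To sanity-check the complementary count, one can tally both classes from a single prediction pass. A sketch using the same fitted classifier:

pred = clf.predict(features_test)
print("total test events: %d" % len(pred))
print("Chris (1): %d, Sara (0): %d" % (sum(pred == 1), sum(pred == 0)))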