"""
This is the code to accompany the Lesson 2 (SVM) mini-project.
Use an SVM to identify emails from the Enron corpus by their authors:
Sara has label 0
Chris has label 1
"""
import os
import sys
from time import time
# NOTE: this absolute path is machine-specific; point it at your own ud120-projects checkout
os.chdir("C:\\Users\\PR043\\OneDrive for Business\\Training\\Datacamp\\Python\\Udacity\\Machine Learning\\ud120-projects\\tools")
sys.path.append(r"../tools/")
from email_preprocess import preprocess
import numpy
from sklearn.svm import SVC
### features_train and features_test are the features for the training
### and testing datasets, respectively
### labels_train and labels_test are the corresponding item labels
features_train, features_test, labels_train, labels_test = preprocess()
Go to the svm directory to find the starter code (svm/svm_author_id.py). Import, create, train and make predictions with the sklearn SVC classifier. When creating the classifier, use a linear kernel (if you forget this step, you will be unpleasantly surprised by how long the classifier takes to train). What is the accuracy of the classifier?
#########################################################
clf = SVC(kernel="linear", C=1)
t0 = time()
clf.fit(features_train, labels_train)
print("training time:", round(time() - t0, 3), "s")
t0 = time()
pred = clf.predict(features_test)
print("prediction time:", round(time() - t0, 3), "s")
print("predicted as Chris (1):", int((pred == 1).sum()))
print("accuracy:", clf.score(features_test, labels_test))
#########################################################
One way to speed up an algorithm is to train it on a smaller training dataset. The tradeoff is that the accuracy almost always goes down when you do this. Let’s explore this more concretely: add in the following two lines immediately before training your classifier.
features_train = features_train[:len(features_train) // 100]
labels_train = labels_train[:len(labels_train) // 100]
These lines effectively slice the training dataset down to 1% of its original size, tossing out 99% of the training data. You can leave all other code unchanged. What’s the accuracy now?
features_train = features_train[:len(features_train) // 100]  # // keeps the slice index an integer
labels_train = labels_train[:len(labels_train) // 100]
clf = SVC(kernel="linear", C=1)
t0 = time()
clf.fit(features_train, labels_train)
print("training time:", round(time() - t0, 3), "s")
t0 = time()
pred = clf.predict(features_test)
print("prediction time:", round(time() - t0, 3), "s")
print("predicted as Chris (1):", int((pred == 1).sum()))
print("accuracy:", clf.score(features_test, labels_test))
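A note on the slicing above: in Python 3, plain `/` returns a float, which cannot be used as a slice index, so floor division (`//`) is required (in Python 2 both behave the same on ints). A minimal illustration with a toy list:

```python
data = list(range(1000))

# Floor division yields an int that is valid as a slice bound
n = len(data) // 100
small = data[:n]
print(len(small))  # → 10
```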
Keep the training set slice code from the last quiz, so that you are still training on only 1% of the full training set. Change the kernel of your SVM to “rbf”. What’s the accuracy now, with this more complex kernel?
clf = SVC(kernel="rbf", C=1)
t0 = time()
clf.fit(features_train, labels_train)
print("training time:", round(time() - t0, 3), "s")
t0 = time()
pred = clf.predict(features_test)
print("prediction time:", round(time() - t0, 3), "s")
print("predicted as Chris (1):", int((pred == 1).sum()))
print("accuracy:", clf.score(features_test, labels_test))
Keep the training set size and rbf kernel from the last quiz, but try several values of C (say, 10.0, 100.0, 1000.0, and 10000.0). Which one gives the best accuracy?
clf = SVC(kernel="rbf", C=10)
t0 = time()
clf.fit(features_train, labels_train)
print("training time:", round(time() - t0, 3), "s")
t0 = time()
pred = clf.predict(features_test)
print("prediction time:", round(time() - t0, 3), "s")
print("predicted as Chris (1):", int((pred == 1).sum()))
print("accuracy:", clf.score(features_test, labels_test))

clf = SVC(kernel="rbf", C=100)
t0 = time()
clf.fit(features_train, labels_train)
print("training time:", round(time() - t0, 3), "s")
t0 = time()
pred = clf.predict(features_test)
print("prediction time:", round(time() - t0, 3), "s")
print("predicted as Chris (1):", int((pred == 1).sum()))
print("accuracy:", clf.score(features_test, labels_test))

clf = SVC(kernel="rbf", C=1000)
t0 = time()
clf.fit(features_train, labels_train)
print("training time:", round(time() - t0, 3), "s")
t0 = time()
pred = clf.predict(features_test)
print("prediction time:", round(time() - t0, 3), "s")
print("predicted as Chris (1):", int((pred == 1).sum()))
print("accuracy:", clf.score(features_test, labels_test))

clf = SVC(kernel="rbf", C=10000)
t0 = time()
clf.fit(features_train, labels_train)
print("training time:", round(time() - t0, 3), "s")
t0 = time()
pred = clf.predict(features_test)
print("prediction time:", round(time() - t0, 3), "s")
print("predicted as Chris (1):", int((pred == 1).sum()))
print("accuracy:", clf.score(features_test, labels_test))
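The four near-identical blocks above can be condensed into a loop over candidate C values. This is a minimal sketch using a synthetic dataset from `sklearn.datasets.make_classification` as a stand-in for the Enron features (which aren't loaded here):

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Toy stand-in for the Enron features/labels
X, y = make_classification(n_samples=200, n_features=10, random_state=42)
X_train, y_train = X[:150], y[:150]
X_test, y_test = X[150:], y[150:]

best_C, best_acc = None, 0.0
for C in (10.0, 100.0, 1000.0, 10000.0):
    clf = SVC(kernel="rbf", C=C)
    clf.fit(X_train, y_train)
    acc = clf.score(X_test, y_test)
    print("C=%8.1f  accuracy=%.3f" % (C, acc))
    if acc > best_acc:
        best_C, best_acc = C, acc

print("best C:", best_C)
```

On the actual Enron features the loop reproduces the same sweep as the blocks above; only the dataset differs.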
C=10000 gives the best accuracy.
Now that you’ve optimized C for the RBF kernel, go back to using the full training set. In general, having a larger training set will improve the performance of your algorithm, so (by tuning C and training on a large dataset) we should get a fairly optimized result. What is the accuracy of the optimized SVM?
features_train, features_test, labels_train, labels_test = preprocess()
clf = SVC(kernel="rbf", C=10000)  # re-create the classifier with the best C found above
t0 = time()
clf.fit(features_train, labels_train)
print("training time:", round(time() - t0, 3), "s")
t0 = time()
pred = clf.predict(features_test)
print("prediction time:", round(time() - t0, 3), "s")
print("predicted as Chris (1):", int((pred == 1).sum()))
print("accuracy:", clf.score(features_test, labels_test))
What class does your SVM (0 or 1, corresponding to Sara and Chris respectively) predict for element 10 of the test set? The 26th? The 50th? (Use the RBF kernel, C=10000, and 1% of the training set. Normally you'd get the best results using the full training set, but we found that using 1% sped up the computation considerably and did not change our results--so feel free to use that shortcut here.) And just to be clear, the data point numbers that we give here (10, 26, 50) assume a zero-indexed list. So the correct answer for element #100 would be found using something like answer=predictions[100]
features_train = features_train[:len(features_train) // 100]
labels_train = labels_train[:len(labels_train) // 100]
clf = SVC(kernel="rbf", C=10000)
t0 = time()
clf.fit(features_train, labels_train)
print("training time:", round(time() - t0, 3), "s")
t0 = time()
pred = clf.predict(features_test)
print("prediction time:", round(time() - t0, 3), "s")
print("predicted as Chris (1):", int((pred == 1).sum()))
print("accuracy:", clf.score(features_test, labels_test))
print("Predictions for elements 10, 26, and 50:", pred[10], pred[26], pred[50])
There are over 1700 test events--how many are predicted to be in the “Chris” (1) class? (Use the RBF kernel, C=10000., and the full training set.)
features_train, features_test, labels_train, labels_test = preprocess()
clf = SVC(kernel="rbf", C=10000)  # best parameters found above
clf.fit(features_train, labels_train)
print("Number of events predicted in the Chris class:", int((clf.predict(features_test) == 1).sum()))
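As a sanity check on the counting idiom `(predictions == 1).sum()` used above, here is a tiny self-contained example on a toy prediction array (assuming NumPy):

```python
import numpy as np

pred = np.array([1, 0, 1, 1, 0, 1])  # toy predictions: 1 = Chris, 0 = Sara
n_chris = int((pred == 1).sum())     # boolean mask summed = count of 1s
print(n_chris)  # → 4
```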