Support Vector Machine



Support Vector Machine (SVM) is an algorithm that we can use to better identify a boundary to classify a data point. Instead of a singular line, the SVM algorithm finds an optimal hyperplane in an N-dimensional space (N = the number of features) that separates the data points with the largest margin possible to distinctly classify them. The example below shows a points in a 2 dimensional space with a number of linear classifiers that could potetntially separate the data (left graph), but the SVM algorithm can provide the optimal plane of separation (right graph).

(This image was found on towardsdatascience.com)

After preprocessing our Titanic dataset, we trained an SVM classifier using the scikit-learn python package to classify people in the dataset on whether or not they survived the disaster. When testing this model, we received the following results:


Precision is the ratio of the correctly predicted positive (or negative) observations to the total predicted positive (or negative) observations. Based on the table above, this means that 67% of the people our model predicted to survive actually did, and 55% of the people our model predicted to die actually did. Recall is the ratio of the correctly predicted positive (or negative) observations to the total number of actual positive (or negative) observations. Based on our results, this means that our model correctly classified 83% of the actual survivors in the dataset, and 33% of the people who died. The F1-Score is the harmonic average of the precision and recall scores. An F1-Score of 1 is the best score a model can get, and an F1-Score of 0 is the worst.

Overall, it appears the SVM model is much better at predicting those who would survive the disaster than who would not. The last result presented in the table above is the accuracy score of the SVM model. The accuracy score of 0.64 means that overall, our model made correct predictions of the data 64% of the time.