Final Analysis


The goal of this project was to test five machine learning models on how accurately they could predict whether or not a person would survive or die during the sinking of the titanic. As such, after training and testing each of our models we compared each of their accuracy scores. As the results show, the logistic regression classifier was the most accurate of the models we trained, and K-Nearest Neighbors was the least. Here is a table of the results:


While cleaning the data we noticed that there were many null values for age in the dataset. Since we thought age may be an important feature, we wanted to keep it in our models but had to consider different ways to deal with the null age values. In our primary analysis (results presented above), we decided to simply remove all rows in the dataset that contained null age values. As a secondary analysis, instead of dropping null age values, we wanted to see how our models would perform if we replaced them with random age values. We used the maximum and minimum ages in the dataset to determine the range of values to randomly pick from. We then used this dataset to train and test the top three most accurate models from our primary analysis. After comparing the accuracy scores of the new models, the random forest classifier was the most accurate we tested. Here is a table of those results:


After finding the most accurate classifiers, we decided to put our own information and some characters from the film (Rose and Jack) into our models and predict whether or not we would have survived or died. Here are the results (the logistic regression model has null ages removed, the random forest model has null ages replaced with random ages):



Do you see any patterns?

For The Future
There are a number of things that we could have done differently in this project that would be worth exploring in the future. We chose to use two different methods to manage null values in the age feature of our models, but there are other methods we could try. Additionally, we could do a deeper analysis on each feature to determine if any features may be related to one another when it comes to predictive capacity. We could also run this experiment again with a neural network added to the list of models we test. Finally, with more time we would have liked to create a user interaction on the page where people could input values to run through one of the models we created and make their own predictions!