Gathering the data
We found the Titanic data set on Kaggle. The data was already split into training and test sets, but we found that data cleaning was still necessary before we could properly run machine learning algorithms.
The data files included the following information:
Field Name | Field Description
pclass | Ticket Class - A proxy for socio-economic status
sex | Gender of the passenger
age | Age of the passenger in years
sibsp | Number of siblings/spouses of the passenger aboard the Titanic
parch | Number of parents/children of the passenger aboard the Titanic
ticket | Ticket Number of the passenger
fare | Amount of the ticket fare of the passenger in dollars
cabin | Cabin number of the passenger
embarked | Port of Embarkation of the passenger
name | Name of the passenger
The training set also included the “Survived” column (0 for perished, 1 for survived), which serves as the target for the supervised learning model.
From the dataset, we determined the "features" for our machine learning models by including inputs that
would have an impact on our prediction. We also considered data quality when choosing the features.
Here are the features we selected:
pclass
sex
age
sibsp
parch
fare
We removed name, cabin, and embarked, since these data elements did not have an impact on the prediction.
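As a minimal sketch of this selection step, the snippet below keeps only the six chosen features from a pandas DataFrame. The tiny `train` frame is made-up stand-in data, and the lowercase column names follow the field table above; the actual Kaggle file may use different capitalization, so treat the names as assumptions.

```python
import pandas as pd

# Made-up stand-in rows for illustration only; not the real Kaggle data.
train = pd.DataFrame({
    "survived": [0, 1, 1],
    "pclass": [3, 1, 3],
    "name": ["Passenger A", "Passenger B", "Passenger C"],
    "sex": ["male", "female", "female"],
    "age": [22.0, 38.0, 26.0],
    "sibsp": [1, 1, 0],
    "parch": [0, 0, 0],
    "ticket": ["T1", "T2", "T3"],
    "fare": [7.25, 71.28, 7.92],
    "cabin": [None, "C85", None],
    "embarked": ["S", "C", "S"],
})

# The six features selected in the text above.
FEATURES = ["pclass", "sex", "age", "sibsp", "parch", "fare"]

# Selecting by name drops every other column in one step.
X_train = train[FEATURES].copy()
print(X_train.columns.tolist())
```

Selecting the columns to keep, rather than listing the ones to drop, means any extra column in the file is excluded automatically.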
Establish and clean the datasets
Since the datasets were already split, we only needed to drop columns and reshape the data. We ensured that the target, the survived column, was in the y_train set.
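One way this step might look, assuming the training data sits in a pandas DataFrame with a lowercase `survived` column: the target is pulled out and reshaped into a single column, and the remaining columns become the inputs. The frame below is made-up stand-in data, and the `(-1, 1)` reshape is a guess at what "reshape" refers to.

```python
import pandas as pd

# Made-up stand-in rows for illustration only.
train = pd.DataFrame({
    "survived": [0, 1, 1, 0],
    "pclass": [3, 1, 3, 2],
    "fare": [7.25, 71.28, 7.92, 13.00],
})

# Target: the survived column, reshaped to one column with one row per passenger.
y_train = train["survived"].to_numpy().reshape(-1, 1)

# Inputs: everything except the target.
X_train = train.drop(columns=["survived"])

print(X_train.shape, y_train.shape)
```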
We found 87 records without ages in the testing set, so we needed to clean the data in the notebook.
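The text does not say how the missing ages were handled, so the following is only a sketch of one common approach: count the missing values, then fill them with the median age from the training set. The frames are made-up stand-in data, and median imputation is an assumption rather than the method used in the notebook.

```python
import numpy as np
import pandas as pd

# Made-up stand-in rows for illustration only.
X_train = pd.DataFrame({"age": [22.0, 38.0, 26.0, 35.0]})
X_test = pd.DataFrame({"age": [34.5, np.nan, 62.0, np.nan]})

# Count the records without an age.
missing_ages = int(X_test["age"].isna().sum())
print(f"Records without an age: {missing_ages}")

# Assumed strategy: fill gaps with the training-set median so that
# no information from the test set is used to choose the fill value.
train_median_age = X_train["age"].median()
X_test_clean = X_test.assign(age=X_test["age"].fillna(train_median_age))
```

Dropping the affected rows is another option, but on a test set that would leave some passengers without a prediction.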
Encoding and Scaling
Within the file, the sex values were male or female. To prepare the column for the model, we encoded it
using pandas' get_dummies.
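A small sketch of the encoding, using `pandas.get_dummies` on a made-up frame. The lowercase `sex` column name follows the field table and is an assumption about the actual file.

```python
import pandas as pd

# Made-up stand-in rows for illustration only.
X = pd.DataFrame({
    "sex": ["male", "female", "female"],
    "age": [22.0, 38.0, 26.0],
})

# One-hot encode the sex column; dtype=int gives 0/1 values instead of booleans.
X_encoded = pd.get_dummies(X, columns=["sex"], dtype=int)
print(X_encoded)
```

The single `sex` column becomes two columns, `sex_female` and `sex_male`. When the training and test sets are encoded separately, it is worth checking that both end up with the same columns.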
We then scaled the data using StandardScaler() so that each feature would carry a comparable weight in the model.
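As a rough sketch of the scaling step, the snippet below uses scikit-learn's `StandardScaler` on made-up age and fare values. Fitting on the training set and reusing the same scaler for the test set is common practice and an assumption here; the text does not say how the scaler was fitted.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up stand-in values for illustration only: columns are age and fare.
X_train = np.array([[22.0, 7.25], [38.0, 71.28], [26.0, 7.92], [35.0, 53.10]])
X_test = np.array([[34.5, 7.83], [47.0, 7.00]])

# Learn each column's mean and standard deviation from the training data only.
scaler = StandardScaler().fit(X_train)

# Apply the same transformation to both sets.
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.mean(axis=0), X_train_scaled.std(axis=0))
```

After scaling, each training column has a mean of about 0 and a standard deviation of about 1, so a large-valued field such as fare does not dominate a small-valued one such as pclass.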
Now our data is ready for our machine learning models!
Data Source: https://www.kaggle.com/c/titanic/data