Preparing the Data for Machine Learning Models

Gathering the data

We found the Titanic data set on Kaggle. The data was already split into training and test sets, but we found that data cleaning was still necessary before we could properly run machine learning algorithms.

The data files included the following information:

Field Name	Field Description
pclass	Ticket Class - A proxy for socio-economic status
sex	Gender of the passenger
age	Age of the passenger in years
sibsp	Number of siblings/spouses of the passenger aboard the Titanic
parch	Number of parents/children of the passenger aboard the Titanic
ticket	Ticket Number of the passenger
fare	Amount of the ticket fare of the passenger in dollars
cabin	Cabin number of the passenger
embarked	Port of Embarkation of the passenger
name	Name of the passenger

The training set included the “Survived” column (0 for perished, 1 for survived) to assist with the supervised learning model.

From the dataset, we determined the "features" for our machine learning models by including inputs that would have an impact on our prediction. We also considered data quality when choosing the features.

Here are the features we selected:

sex

age

We removed name, cabin, and embarked, since these data elements did not have an impact on the prediction.

Establish and clean the datasets

Since the datasets were already split into the x and y sets, we only needed to drop columns and reshape. We ensured that the target, the survived column, was in the y_train set. We also found 87 records without ages in the testing set, so we needed to clean those records in the notebook.
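The column dropping and age cleanup described above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the tiny DataFrame is made-up stand-in data, the lowercase column names follow the field table above (the Kaggle files may capitalize them differently), and filling missing ages with the median is an assumed strategy, since the approach used in the notebook is not stated here.

```python
import pandas as pd

# Made-up stand-in for the training file (illustrative rows only).
train = pd.DataFrame({
    "survived": [0, 1, 1, 0],
    "name": ["A", "B", "C", "D"],
    "sex": ["male", "female", "female", "male"],
    "age": [22.0, 38.0, None, 35.0],
    "cabin": [None, "C85", None, None],
    "embarked": ["S", "C", "S", "S"],
})

# Keep the target in y_train and drop it, plus the unused columns, from X_train.
y_train = train["survived"]
X_train = train.drop(columns=["survived", "name", "cabin", "embarked"])

# Count the records without an age before cleaning them.
missing_ages = int(X_train["age"].isna().sum())

# Assumption: fill missing ages with the median of the known training ages.
age_median = X_train["age"].median()
X_train["age"] = X_train["age"].fillna(age_median)
```

If this approach were used, the same training-set median would normally be applied to the missing ages in the testing set, so that both sets are cleaned consistently.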

Encoding and Scaling

Within the file, the Sex values were either male or female. To prepare this column for the model, we encoded it using the pandas get_dummies function. We then scaled the data using StandardScaler() so that the features would be weighted comparably in the model.
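As a rough sketch of the encoding and scaling steps, the example below applies get_dummies and StandardScaler to a small made-up frame. The data values, the variable names, and the choice to fit the scaler on the training set only are assumptions for illustration, not details taken from the notebook.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up cleaned feature sets (illustrative values only).
X_train = pd.DataFrame({
    "sex": ["male", "female", "female", "male"],
    "age": [22.0, 38.0, 35.0, 35.0],
})
X_test = pd.DataFrame({
    "sex": ["female", "male"],
    "age": [30.0, 40.0],
})

# One-hot encode the sex column; dtype=float keeps every column numeric.
X_train_enc = pd.get_dummies(X_train, columns=["sex"], dtype=float)
X_test_enc = pd.get_dummies(X_test, columns=["sex"], dtype=float)

# Make sure the test columns match the training columns in name and order.
X_test_enc = X_test_enc.reindex(columns=X_train_enc.columns, fill_value=0.0)

# Fit the scaler on the training data, then apply the same scaling to the test data.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_enc)
X_test_scaled = scaler.transform(X_test_enc)
```

Fitting the scaler on the training set and reusing it for the testing set keeps information from the test data out of the preprocessing step.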

Now our data is ready for our machine learning models!

Data Source: https://www.kaggle.com/c/titanic/data