Preparing the Data for Machine Learning Models



Gathering the data


We found the Titanic data set on Kaggle. The data was already split into training and test sets, but it still needed cleaning before we could properly run machine learning algorithms on it.

The data files included the following information:

Field Name   Field Description
pclass       Ticket class, a proxy for socio-economic status
sex          Gender of the passenger
age          Age of the passenger in years
sibsp        Number of the passenger's siblings/spouses aboard the Titanic
parch        Number of the passenger's parents/children aboard the Titanic
ticket       Ticket number of the passenger
fare         Amount of the passenger's ticket fare in dollars
cabin        Cabin number of the passenger
embarked     Port of embarkation of the passenger
name         Name of the passenger


The training set also included the “Survived” column (0 for perished, 1 for survived), which served as the target for our supervised learning models.

From the dataset, we chose the "features" for our machine learning models by keeping the inputs that we expected to have an impact on the prediction. Data quality also factored into which features we chose.

Here are the features we selected:
  • pclass
  • sex
  • age
  • sibsp
  • parch
  • fare

We removed name, cabin, and embarked, since these data elements did not have an impact on the prediction.
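
The feature selection above can be sketched in pandas as follows. This is a minimal illustration rather than the notebook's actual code: the tiny `train` DataFrame is made-up sample data, and the lowercase column names follow the field table above (the real Kaggle CSV headers may be capitalized differently).

```python
import pandas as pd

# Hypothetical three-row sample using the field names from the table above.
train = pd.DataFrame({
    "survived": [0, 1, 1],
    "pclass": [3, 1, 3],
    "name": ["Passenger A", "Passenger B", "Passenger C"],
    "sex": ["male", "female", "female"],
    "age": [22.0, 38.0, 26.0],
    "sibsp": [1, 1, 0],
    "parch": [0, 0, 0],
    "fare": [7.25, 71.28, 7.93],
    "cabin": [None, "C85", None],
    "embarked": ["S", "C", "S"],
})

# Keep only the selected features; every other column is left out.
feature_columns = ["pclass", "sex", "age", "sibsp", "parch", "fare"]
X_train = train[feature_columns]

# The target for supervised learning is the survived column.
y_train = train["survived"]

print(X_train.columns.tolist())
```

Selecting the columns to keep, instead of listing the ones to drop, makes the feature set explicit in one place.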

    Establishing and cleaning the datasets


    Since the datasets were already in the x and y sets, we only needed to drop columns and reshape. We ensured that the target, the survived column, was in the y_train set. We also found 87 records without ages in the testing set, so we needed to clean those records in the notebook.
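
One way to find and handle the missing ages is sketched below. The write-up does not say how the records were cleaned, so filling with the training-set median is purely an assumption made for illustration; dropping the rows would be another option. The sample values are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the real sets have many more rows and columns.
X_train = pd.DataFrame({"age": [22.0, 38.0, 26.0, 35.0],
                        "fare": [7.25, 71.28, 7.93, 53.10]})
X_test = pd.DataFrame({"age": [34.5, np.nan, 62.0, np.nan],
                       "fare": [7.83, 7.00, 9.69, 8.66]})

# Count the records in the testing set that have no age.
missing_ages = int(X_test["age"].isna().sum())
print(f"Records without ages: {missing_ages}")

# Assumed strategy: fill the gaps with the median age of the training set,
# so no information from the testing set leaks into the fill value.
train_median_age = X_train["age"].median()
X_test_clean = X_test.fillna({"age": train_median_age})
```

Computing the fill value from the training data only keeps the testing set untouched by its own statistics.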

    Encoding and Scaling


    Within the file, the Sex data values were male or female. To prep the column for the model, we encoded it using the pandas get_dummies function. We then scaled the data with StandardScaler() so that no feature would carry extra weight in the model simply because of its larger numeric range.
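
The encoding and scaling steps might look roughly like this. It is a sketch under assumptions, not the notebook's exact code: the sample rows are invented, the column names follow the field table, and fitting the scaler on the training set only (then reusing it on the testing set) is a common practice that the write-up does not explicitly confirm.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical sample data with a subset of the selected features.
X_train = pd.DataFrame({
    "pclass": [3, 1, 3, 1],
    "sex": ["male", "female", "female", "male"],
    "age": [22.0, 38.0, 26.0, 35.0],
    "fare": [7.25, 71.28, 7.93, 53.10],
})
X_test = pd.DataFrame({
    "pclass": [3, 2],
    "sex": ["male", "female"],
    "age": [34.5, 47.0],
    "fare": [7.83, 7.00],
})

# One-hot encode the sex column into sex_female / sex_male indicator columns.
X_train_enc = pd.get_dummies(X_train, columns=["sex"], dtype=float)
X_test_enc = pd.get_dummies(X_test, columns=["sex"], dtype=float)

# Make sure both sets have the same columns in the same order.
X_test_enc = X_test_enc.reindex(columns=X_train_enc.columns, fill_value=0.0)

# Standardize each feature to zero mean and unit variance.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_enc)  # learn mean/std on train
X_test_scaled = scaler.transform(X_test_enc)        # reuse them on test
```

After scaling, a large-valued column such as fare sits on the same scale as a small-valued one such as pclass.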

    Now our data is ready for our machine learning models!

    Data Source: https://www.kaggle.com/c/titanic/data