Training & Test Data
After cleaning your dataset, the next job is to split the data into two segments for training and testing, also known as split validation. The first split of data is the training data, which is the initial reserve of data used to develop your model. After you have developed a model based on patterns extracted from the training data, you can test the model on the remaining data, which is called the test data. The test data should be used to assess model performance rather than optimize the model.
When splitting the data, the ratio of the two splits should be approximately 70/30 or 80/20. This means that your training data should account for 70 to 80 percent of the rows in your dataset, with the remaining 20 to 30 percent of rows left for your test data. It’s vital to split your data by rows and not by columns, so that both segments contain the full set of variables.
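As a concrete illustration, below is a minimal sketch of a 70/30 row-wise split using scikit-learn's train_test_split function. The dataset, feature matrix X, target vector y, and random seed are placeholders rather than part of the original example.

```python
# A minimal sketch of a 70/30 row-wise split using scikit-learn.
# X and y are dummy stand-ins for your cleaned dataset.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(100, 4)              # 100 rows, 4 feature columns (dummy data)
y = np.random.randint(0, 2, size=100)   # dummy binary target

# test_size=0.3 reserves 30% of the rows for testing; every column
# stays with its row, so the split is by rows, not by columns.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

print(X_train.shape, X_test.shape)  # (70, 4) (30, 4)
```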
Other options for splitting the data include a three-way split with a validation set and k-fold validation.
As the test data cannot be used to build and optimize the model, data scientists sometimes use a third independent dataset called the validation set. After building an initial model with the training set, the validation set can be fed into the prediction model and used as feedback to optimize the model’s hyperparameters. The test set is then used to assess the prediction error of the final model.
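To make the three-way split concrete, the sketch below carves the rows into training, validation, and test sets by calling scikit-learn's train_test_split twice. The 60/20/20 ratio and the dummy data are illustrative assumptions, not a prescribed recipe.

```python
# A sketch of a three-way split (roughly 60/20/20) built from two
# successive train_test_split calls; the exact ratios are illustrative.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(100, 4)
y = np.random.randint(0, 2, size=100)

# First carve off 20% of the rows as the final test set.
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Then split the remainder into training and validation sets
# (0.25 of the remaining 80% = 20% of the original rows).
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42
)

# The validation set guides hyperparameter tuning; the test set is
# touched only once, to estimate the final model's prediction error.
```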
K-fold validation involves randomly assigning the data to k equal-sized buckets and reserving one of those buckets for testing at each round. In each round, the model is trained on the remaining (k-1) buckets and its performance is measured against the reserved test bucket. The process is repeated k times so that every bucket serves as the test bucket once, and the results are averaged to produce an overall estimate of model performance.
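A minimal k-fold sketch is shown below, assuming scikit-learn's cross_val_score helper and a logistic regression model chosen purely for illustration.

```python
# A minimal sketch of k-fold validation with k=5 using scikit-learn's
# cross_val_score; the LogisticRegression model is just an example.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X = np.random.rand(100, 4)
y = np.random.randint(0, 2, size=100)

model = LogisticRegression()

# cv=5 splits the rows into 5 equal-sized buckets (folds). In each of
# the 5 rounds, one bucket is held out for testing and the model is
# trained on the other 4; the 5 scores are then averaged.
scores = cross_val_score(model, X, y, cv=5)
print(scores.mean())
```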