What are training and test data sets?

I am getting started on Kaggle.

I have just gone through various data science and machine learning competitions.

I have seen that for every competition they have uploaded training data, test data and original data.

Can someone explain what these are and how we use these datasets when solving a problem?

There are 3 best solutions below

0
On

Training data: Used to train the AI.
Test data: Used to assess how well the AI performs after it has learned from the training data.
Original data: The full data set that the training and test data are taken from.

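When doing machine learning, the AI has to be trained in some way. This is why we break the data up and give the AI a subset of the original data (the training data) so that it can learn. We then test its knowledge with the test data, which it did not see during training. Once that is done, we can feed it the original data and see how it does, keeping in mind that it will tend to look better on the part it was trained on than on genuinely unseen data.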

0
On

In ML, the original data set is divided into a training set and a test set (and sometimes a cross-validation set as well).

Training set: The data set you use to fit the parameters for your algorithm.

Test set: The data set you use to evaluate how accurate the fitted parameters of your algorithm are.

The training/test split is usually 80%/20% or 70%/30%. It is advisable to shuffle the original data set before making the split. Remember that in ML the error will almost always be lower on the data set that was used to fit the parameters, so never evaluate your algorithm using the training set.
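
As a rough illustration of the advice above, here is a small sketch that shuffles a toy data set, makes an 80%/20% split, and compares the error on both parts. Everything in it is made up for the example: the data is synthetic, the helper names (`shuffled_split`, `nearest_neighbour_predict`, `mean_squared_error`) are hypothetical, and the "model" is a deliberately simple nearest-neighbour lookup chosen because it memorises its training data, not because it is a recommended algorithm.

```python
import random


def shuffled_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of rows, then cut it into (train, test)."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def nearest_neighbour_predict(train, x):
    """Predict y for x by copying the y of the closest training x."""
    return min(train, key=lambda row: abs(row[0] - x))[1]


def mean_squared_error(train, rows):
    """Mean squared error of the nearest-neighbour model on rows."""
    errors = [(nearest_neighbour_predict(train, x) - y) ** 2 for x, y in rows]
    return sum(errors) / len(errors)


# Synthetic data (an assumption for this sketch): y = 2x plus random noise.
noise = random.Random(1)
data = [(i / 100, 2 * (i / 100) + noise.gauss(0, 0.3)) for i in range(100)]

train, test = shuffled_split(data, test_fraction=0.2)

train_error = mean_squared_error(train, train)  # evaluated on data it has seen
test_error = mean_squared_error(train, test)    # evaluated on unseen data

print("train size:", len(train), "test size:", len(test))
print("training error:", train_error)
print("test error:", test_error)
```

Because the lookup has memorised every training point, its training error tells you nothing about how it handles new data; only the test error does.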

0
On

To evaluate how well a trained model performs on unseen data, you need to split the original data into separate training and test sets.

from sklearn.model_selection import train_test_split
# features_all (feature matrix) and pred_var (target values) are assumed to be already loaded
X_train, X_test, y_train, y_test = train_test_split(
    features_all, pred_var, test_size=0.3, random_state=42)

With this you randomly split the feature and target arrays into 30% test data and 70% training data. Then you fit your regression model as follows:

from sklearn.linear_model import LinearRegression
reg = LinearRegression()
reg.fit(X_train, y_train)  # fit regressor to training data
y_pred = reg.predict(X_test)  # predict on test data
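
Once you have y_pred, the usual next step is to compare it against y_test. Here is a minimal sketch of one common score, the mean squared error, written in plain Python so it runs on its own. The two lists are made-up numbers standing in for y_test and y_pred, and the helper is only illustrative; scikit-learn ships its own `sklearn.metrics.mean_squared_error`, and `reg.score(X_test, y_test)` gives the R² of a fitted regressor.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between targets and predictions."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up values standing in for y_test and y_pred from the snippet above.
y_test_example = [3.0, 5.0, 7.0]
y_pred_example = [2.0, 5.0, 9.0]

test_mse = mean_squared_error(y_test_example, y_pred_example)
print("test MSE:", test_mse)
```

A lower value means the predictions are closer to the held-out targets.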

Hope this helps.