WEBVTT

00:00.120 --> 00:00.920
Welcome back.

00:00.960 --> 00:03.840
Let's continue with data preprocessing.

00:03.840 --> 00:08.120
And let's split the data into test and train sets.

00:08.160 --> 00:19.360
Here, as we did in the previous videos, X train, X test, y train, and y test equals

00:19.400 --> 00:21.560
train_test_split.

00:21.600 --> 00:28.280
This is a function that we used before from the scikit-learn library.

00:28.280 --> 00:30.640
And we used it in the previous videos.

00:30.640 --> 00:38.720
And in the introductory section of this course, for splitting the data into training and testing sets.

00:38.880 --> 00:42.720
We have the inputs X and y.

00:43.040 --> 00:48.600
X is the features: all columns except mpg.

00:48.880 --> 00:54.840
So here the features are all columns except mpg.

00:54.920 --> 01:01.560
Also, we have the target y: the mpg values we want to predict.

01:01.720 --> 01:09.640
The output would be X train, X test, y train, and y test.

01:09.760 --> 01:17.790
X train: features for training, 80% of the data. X test: features for testing, 20% of the data.

01:17.830 --> 01:26.070
y train: the corresponding mpg, miles per gallon, the mileage values for training.

01:26.070 --> 01:31.390
And y test: the corresponding mileage values for testing.

01:31.510 --> 01:34.870
Okay, so we have two inputs, X and y.

01:35.070 --> 01:40.510
Those are the features and target that are passed as parameters to train_test_split.

01:40.510 --> 01:47.870
And the output would be four values: X train, X test, y train, and y test.
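
NOTE
A minimal sketch of the split described above, not the course's actual notebook. The DataFrame is a synthetic stand-in with 392 rows, and the names df, horsepower, weight, X_train and so on, as well as test_size=0.2 and random_state=42, are assumptions for illustration.
```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
# Hypothetical stand-in for the course's car dataset: 392 rows of synthetic values.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "mpg": rng.uniform(10, 45, 392),
    "horsepower": rng.uniform(50, 230, 392),
    "weight": rng.uniform(1600, 5100, 392),
})
X = df.drop(columns=["mpg"])  # features: every column except mpg
y = df["mpg"]                 # target: the mpg values we want to predict
# test_size=0.2 keeps 20% of the rows for testing; random_state is an assumed seed.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)
```
The function returns the four pieces in the order X train, X test, y train, y test, so the unpacking order on the left matters.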

01:47.910 --> 01:56.590
We split the data to teach the machine learning model, and the model learns patterns and relationships

01:56.790 --> 01:58.390
from this data.

01:58.430 --> 02:02.710
So this is a very important step for splitting the data.

02:02.710 --> 02:05.310
And here we have the training set.

02:05.510 --> 02:09.550
So the training set is X train and y train.

02:09.830 --> 02:14.630
And the testing set is the X test and y test variables.

02:14.630 --> 02:19.510
The training set is used to teach the ML model.

02:19.710 --> 02:25.390
And the ML model learns patterns and relationships from this data.

02:25.430 --> 02:25.910
Y.

02:28.300 --> 02:37.460
The testing set is used to evaluate the model's performance. It tests how well the model generalizes to unseen

02:37.460 --> 02:38.060
data.

02:38.220 --> 02:40.260
Preventing cheating.

02:40.380 --> 02:44.820
The model has not seen this data during training.

02:44.860 --> 02:54.900
To see the size of the training and testing data, we use the shape of X train and X test. Run the cell.

02:55.100 --> 03:04.460
We have 313 rows used for training, and X test has 79 rows.

03:04.700 --> 03:13.420
So our testing data contains 79 rows and the training data is 313.
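
NOTE
A quick arithmetic check of the row counts mentioned above, as a sketch. It assumes 392 rows in total (313 + 79), a 20% test share, and that the test count is rounded up, which as far as I know is how train_test_split handles a fractional test size.
```python
import math
n_rows = 392          # 313 + 79, the row counts reported in the video
test_fraction = 0.2   # the 20% test share mentioned earlier
n_test = math.ceil(test_fraction * n_rows)  # 78.4 rounds up to 79
n_train = n_rows - n_test                   # the remaining rows go to training
print(n_train, n_test)
```
This is why the split is not exactly 80/20: 392 rows cannot be divided evenly, so the test set gets the extra row.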

03:13.460 --> 03:19.740
The training set is used for building the model, and the test set is used for evaluating the model.

03:19.860 --> 03:29.980
This split is essential for building reliable machine learning models that work well on new, unseen

03:30.020 --> 03:30.540
data.

03:30.580 --> 03:32.220
It helps with preventing overfitting,

03:32.540 --> 03:34.020
honest

03:34.260 --> 03:37.340
evaluation, and model selection.
