WEBVTT

00:00.080 --> 00:01.080
Welcome back.

00:01.120 --> 00:09.120
We used the NumPy library to generate random numbers and create a linear relationship between size and price.

00:09.320 --> 00:16.840
Then we used a DataFrame from the pandas library to create this table of size and price.

00:17.120 --> 00:28.120
Also, we used the Matplotlib library to visualize the generated data in this

00:28.160 --> 00:28.880
chart.

00:29.040 --> 00:35.600
So this chart displays only the first ten results of the data that we created.

00:35.600 --> 00:38.120
Also, this table does the same job.

00:38.280 --> 00:38.600
This is

00:38.600 --> 00:45.160
thanks to the head function, which gets the first ten rows.

00:45.200 --> 00:55.160
Now let's dive deep into machine learning: how to use this data to train and prepare

00:55.440 --> 00:56.560
the model.

00:56.600 --> 00:59.600
The first step is to prepare the data.

00:59.600 --> 01:02.310
We split the data into two parts:

01:02.310 --> 01:05.910
a training set to train the model

01:05.950 --> 01:09.150
and a testing set to evaluate performance.

01:09.270 --> 01:13.630
We use train_test_split from scikit-learn.

01:13.670 --> 01:14.990
Let me show you guys.

01:15.030 --> 01:19.630
First, we create two variables, X and y.

01:19.950 --> 01:22.190
X equals data.

01:22.230 --> 01:22.910
Data.

01:23.150 --> 01:26.990
Do you remember data? It is the DataFrame that we created before.

01:26.990 --> 01:34.030
So if we scroll up, we see that we stored all the data inside the data variable.

01:34.030 --> 01:36.110
In order to access the size,

01:36.150 --> 01:45.430
we use these square brackets, and y equals data, accessing the price in thousands

01:45.430 --> 01:46.670
of dollars.
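
NOTE
A minimal sketch of the selection just described. The DataFrame below is a
hypothetical stand-in for the one built earlier, and the column names "size"
and "price" are assumptions; the names in the course notebook may differ.
```python
import numpy as np
import pandas as pd
# Hypothetical stand-in data: 100 rows with a roughly linear size/price relationship.
rng = np.random.default_rng(0)
size = rng.uniform(500, 3500, 100)
price = 0.1 * size + rng.normal(0, 10, 100)  # price in thousands of dollars
data = pd.DataFrame({"size": size, "price": price})
# Double square brackets keep X as a 2-D DataFrame (the feature).
X = data[["size"]]
# Single square brackets return a 1-D Series (the target).
y = data["price"]
print(type(X).__name__, X.shape)
print(type(y).__name__, y.shape)
```
Keeping X two-dimensional matters because scikit-learn estimators expect
features shaped as (rows, columns), while the target can stay one-dimensional.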

01:46.710 --> 01:56.070
Now we've created those two variables. Again, guys, X is the feature and y is the target.

01:56.110 --> 01:58.710
Here we have the split.

01:58.790 --> 02:01.780
We have to split the data first.

02:01.820 --> 02:12.140
X_train, X_test, y_train, y_test equals train_test_split of X, y, test_size, and random_state.

02:12.300 --> 02:14.500
What does this mean?

02:14.660 --> 02:25.780
train_test_split is a function from sklearn.model_selection that splits your dataset into a training

02:25.820 --> 02:33.620
set and a testing set. The training set, X_train and y_train, is used to train the model.

02:33.740 --> 02:42.540
The testing set, X_test and y_test, is used for evaluating the model's performance.

02:42.580 --> 02:52.180
The parameters are X, y, test_size, and random_state. X and y are the two inputs, the DataFrame and the

02:52.180 --> 02:52.940
Series,

02:52.980 --> 03:04.360
the feature and the target. A test_size of 0.2 means 20% of the data will be for testing and 80% for training.

03:04.360 --> 03:06.760
So let me write this note down.

03:06.960 --> 03:17.480
0.2 means 20% of the data will be for testing and 80% for training.

03:17.600 --> 03:23.880
Setting random_state to 42 ensures that the split is reproducible.

03:24.120 --> 03:27.960
You get the same split every time you run it.
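
NOTE
A small sketch of the reproducibility claim, using hypothetical toy arrays in
place of the course data. It calls train_test_split twice with the same
random_state and compares the resulting test sets.
```python
import numpy as np
from sklearn.model_selection import train_test_split
# Hypothetical toy data: 100 samples with one feature each.
X = np.arange(100).reshape(-1, 1)
y = np.arange(100)
# First split: 20% of the rows go to the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# Second split with the same random_state.
X_train_again, X_test_again, y_train_again, y_test_again = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# With a fixed random_state, both calls select the same rows.
same_split = np.array_equal(X_test, X_test_again)
print(same_split)
```
Any fixed integer gives a repeatable split; 42 is only a common convention.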

03:28.000 --> 03:29.880
Okay, it's very simple.

03:29.920 --> 03:34.040
Now let's check the number of samples.

03:34.040 --> 03:36.680
Training samples:

03:37.000 --> 03:42.200
X_train.shape[0]. Testing samples: X_test.shape[0].

03:42.320 --> 03:44.360
X_train.shape[0],

03:44.360 --> 03:45.960
here for the training set,

03:45.960 --> 03:50.520
gives the number of rows, the samples, in the training set.

03:50.520 --> 03:56.280
So it gives the number of rows in the training set.

03:56.280 --> 04:00.960
And this line gives the number of samples in the test set.

04:01.310 --> 04:08.190
This helps confirm that the split worked as expected. Let's run it, and here we go.

04:08.230 --> 04:20.070
The training samples are 80% and the testing samples are 20%. Okay, so this is how we split the data into testing

04:20.070 --> 04:22.270
and training sets.
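
NOTE
A sketch of the sample-count check, assuming a hypothetical dataset of 100
rows (the transcript does not state the actual number of rows). The first
element of .shape is the number of rows in each set.
```python
import numpy as np
from sklearn.model_selection import train_test_split
# Hypothetical placeholder data: 100 rows, one feature column.
X = np.zeros((100, 1))
y = np.zeros(100)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# shape[0] is the row count, i.e. the number of samples in each set.
n_train = X_train.shape[0]
n_test = X_test.shape[0]
print(f"Training samples: {n_train}")  # 80 for 100 rows
print(f"Testing samples: {n_test}")    # 20 for 100 rows
```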

04:22.310 --> 04:31.590
Again, as a quick summary of what we've done: X is the feature, the house size, and

04:31.590 --> 04:37.910
the y variable is the target, which is the price. We split off 20% for testing.

04:37.910 --> 04:46.910
So the test size is 20%, with 80% for training, and random_state equals 42.

04:46.950 --> 04:53.990
We print the number of samples in each set, and you get the same split every time you run it.

04:54.150 --> 04:58.350
Okay, this is how we split the data into testing and training sets.
