WEBVTT

00:00.080 --> 00:01.120
Welcome back.

00:01.120 --> 00:02.600
Now scroll down.

00:02.640 --> 00:05.880
We have a dataset of simulated house prices.

00:05.880 --> 00:13.160
We're going to create a simple dataset where the size in square feet is the feature on the x-axis,

00:13.440 --> 00:18.120
and the price in thousands is the target on the y-axis.

00:18.120 --> 00:21.480
Do you remember the feature and target terminology?

00:21.520 --> 00:31.120
If not, please go back to the previous lessons in the ML introduction section for more information.

00:31.320 --> 00:34.760
This is a supervised regression problem.

00:34.760 --> 00:44.520
Regression means predicting a continuous numerical value; supervised means learning from past examples where we know both the

00:44.520 --> 00:47.200
inputs and correct outputs.

00:47.240 --> 00:49.600
Okay, let me scroll down here.

00:49.600 --> 00:54.360
We have np.random.seed(42).

00:54.520 --> 00:54.920
Here.

00:54.960 --> 01:06.720
As the first step, we set the random seed for reproducibility, which means every time you run the code

01:06.720 --> 01:09.730
you'll get the same random numbers.

01:09.770 --> 01:16.370
Okay, so here we are getting the same random numbers every time we run the cell.
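
NOTE A minimal sketch of the reproducibility point above, assuming NumPy is imported as np. The helper name draw_with_seed is hypothetical and is not part of the lesson notebook.
```python
import numpy as np
def draw_with_seed(seed, n=3):
    # Reset NumPy's global random state, then draw n uniform numbers in [0, 1).
    np.random.seed(seed)
    return np.random.rand(n)
first = draw_with_seed(42)
second = draw_with_seed(42)
# Same seed, same numbers: the two draws are identical.
print(np.array_equal(first, second))
```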

01:16.410 --> 01:26.570
Then we have size = 2 * np.random.rand(100, 1) + 1.

01:26.610 --> 01:33.930
This generates 100 random numbers, arranged as 100 rows and one column.

01:33.970 --> 01:36.410
So let me explain everything.

01:36.450 --> 01:39.210
100 rows and one column.
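
NOTE A short sketch of the size line, under the assumption that the spoken call is np.random.rand(100, 1). It shows the resulting shape and the range that scaling by two and shifting by one produces.
```python
import numpy as np
np.random.seed(42)
# 100 uniform draws in [0, 1), scaled by 2 and shifted by 1, so values fall in [1, 3).
size = 2 * np.random.rand(100, 1) + 1
print(size.shape)  # (100, 1)
print(size.min() >= 1, size.max() < 3)
```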

01:39.250 --> 01:39.650
Okay.

01:39.690 --> 01:40.810
It's very simple.

01:41.010 --> 01:51.770
The third line is price equals 3 * size + 2, plus np.random.randn noise times 0.5.

01:51.930 --> 01:56.890
This is a linear relationship: three times size plus two.

01:56.930 --> 02:02.290
This is a linear relationship between size and price.

02:02.290 --> 02:06.210
So the price depends on size, which makes it predictable.

02:06.210 --> 02:14.060
For example, if the size equals two, then three times two is six, plus two, so the price would be eight.

02:14.060 --> 02:21.420
So as the size increases, the price increases. Now, the multiplication by 0.5 for the random numbers.

02:21.460 --> 02:28.740
The np.random.randn call generates random noise from a normal distribution with mean equal to zero,

02:28.740 --> 02:32.780
and standard deviation equal to one. Multiplying by 0.5

02:32.820 --> 02:34.420
reduces the noise level.

02:34.620 --> 02:40.220
Adding noise makes the data look more realistic, not perfectly linear.

02:40.580 --> 02:44.620
Okay, this is simple math,

02:44.620 --> 02:50.340
used to make our data look more realistic.

02:50.500 --> 02:59.020
In short, this code simulates a dataset where price depends on size in a roughly linear way, with some

02:59.020 --> 03:01.060
random noise added.
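
NOTE A minimal sketch of the price line, assuming the calls are np.random.rand(100, 1) and np.random.randn(100, 1). The separate noise variable is introduced here for illustration only.
```python
import numpy as np
np.random.seed(42)
size = 2 * np.random.rand(100, 1) + 1
# Gaussian noise with mean 0 and standard deviation 1, scaled down to standard deviation 0.5.
noise = np.random.randn(100, 1) * 0.5
price = 3 * size + 2 + noise
# Without noise, a size of 2 gives 3 * 2 + 2 = 8.
print(3 * 2 + 2)
# Removing the noise recovers the exact line 3 * size + 2.
print(np.allclose(price - noise, 3 * size + 2))
```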

03:01.100 --> 03:08.380
Okay, now let me explain the data visualization using the DataFrame.

03:08.380 --> 03:14.660
Here we have pd, which is the pandas library, then dot DataFrame.

03:14.860 --> 03:21.900
We talked about the DataFrame in the previous videos, and inside it we have np.concatenate and the

03:21.980 --> 03:25.380
columns argument. First, np.concatenate.

03:25.620 --> 03:29.780
The size has shape (100, 1).

03:29.820 --> 03:40.300
The price also has shape (100, 1), and np.concatenate joins the columns side by

03:40.300 --> 03:41.580
side column wise.

03:41.620 --> 03:44.700
Okay, and the result would have shape (100, 2).

03:44.740 --> 03:48.780
Each row is an array holding a size and a price.
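
NOTE A small sketch of the column-wise join described above. The axis=1 argument is an assumption, since the spoken description only says the arrays are joined side by side.
```python
import numpy as np
np.random.seed(42)
size = 2 * np.random.rand(100, 1) + 1
price = 3 * size + 2 + np.random.randn(100, 1) * 0.5
# axis=1 (assumed) joins the two (100, 1) arrays column-wise.
data = np.concatenate([size, price], axis=1)
print(data.shape)  # (100, 2)
```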

03:48.940 --> 03:54.260
Then we have pd.DataFrame.

03:54.620 --> 03:55.700
We talked about it.

03:55.700 --> 04:00.660
This converts the NumPy array into a pandas DataFrame.

04:00.700 --> 04:01.140
Let.

04:01.300 --> 04:03.780
Let me put this note here.

04:03.780 --> 04:05.540
So do you remember pandas?

04:05.740 --> 04:10.140
It is used for table-like structures and visualization.

04:10.140 --> 04:15.100
We also have the two columns, the size and the price.

04:15.100 --> 04:19.380
The size is in thousands of square feet.

04:19.620 --> 04:23.420
And the price is in thousands of dollars.

04:23.660 --> 04:28.470
Okay, then we print the first ten rows.

04:28.510 --> 04:31.110
Let me run and here we go.
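
NOTE A sketch of the full cell under stated assumptions. The column names "Size" and "Price", the axis=1 argument, and the use of head(10) are guesses at what the notebook contains, not confirmed by the narration.
```python
import numpy as np
import pandas as pd
np.random.seed(42)
size = 2 * np.random.rand(100, 1) + 1
price = 3 * size + 2 + np.random.randn(100, 1) * 0.5
# Column names are assumed for illustration.
df = pd.DataFrame(np.concatenate([size, price], axis=1), columns=["Size", "Price"])
# Show the first ten rows of the table.
print(df.head(10))
```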

04:31.430 --> 04:38.950
This is random data generated with np.random,

04:39.150 --> 04:40.270
from NumPy.

04:40.630 --> 04:45.990
We used NumPy's random number generation.

04:46.310 --> 04:47.550
Also we have.

04:47.590 --> 04:54.750
We have the size and the price. The size depends on random numbers, scaled by two and shifted by one.

04:54.990 --> 05:02.710
And the price depends on size times three plus two plus this random number times 0.5.

05:02.750 --> 05:03.110
Okay.

05:03.150 --> 05:10.630
So this is the linear relationship between the two columns, which are the x and y axes.

05:10.670 --> 05:12.990
This is the size in square feet.

05:12.990 --> 05:15.030
For example the first house is

05:15.230 --> 05:21.310
1,749 ft².

05:21.350 --> 05:27.790
The price is $7,290.
