WEBVTT

00:00.120 --> 00:00.920
Welcome back.

00:00.960 --> 00:03.840
Let's continue with data preprocessing.

00:04.120 --> 00:12.440
Here we need to convert the categorical origin column to a one-hot encoding, which is a crucial step

00:12.600 --> 00:16.800
for preparing categorical data for ML models.

00:16.960 --> 00:21.520
Don't worry, we're going to clarify everything in the next couple of minutes.

00:21.560 --> 00:23.800
We use pd.get_dummies here.

00:23.800 --> 00:29.480
We need to pass the parameters X, columns, and prefix.

00:29.520 --> 00:35.000
prefix is origin, columns is origin, and X is the data.

00:35.160 --> 00:44.240
One-hot encoding converts categorical variables into a binary format of zeros and ones that ML algorithms

00:44.240 --> 00:45.640
can understand.
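
NOTE
A minimal sketch of the pd.get_dummies call described above, not the lesson's actual notebook cell.
The toy DataFrame and its values are hypothetical, the capital X is an assumed variable name,
and dtype=int is an added assumption so the new columns show 0/1 on every pandas version.
```python
import pandas as pd
# Hypothetical toy data standing in for the lesson's dataset.
X = pd.DataFrame({"cylinders": [8, 4, 4, 6], "origin": [1, 3, 2, 1]})
# Replace the origin column with one 0/1 column per category.
# dtype=int is an assumption; recent pandas versions default to booleans.
X = pd.get_dummies(X, columns=["origin"], prefix="origin", dtype=int)
print(X.head())
```
With the default separator, the new columns are named origin_1, origin_2, and origin_3.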

00:45.680 --> 00:49.640
Here we have, for example, origin one.

00:49.760 --> 00:57.440
If we scroll up to our data, origin 1 means American, origin 2

00:57.480 --> 00:59.720
European, and origin 3

01:00.040 --> 01:01.400
Japanese.

01:01.600 --> 01:10.400
So if we look at the data again, you see in this column that 1 is American, 3 is Japanese, and 2 is European.

01:10.560 --> 01:11.080
Okay.

01:11.360 --> 01:22.440
So if we have data like this, we need to make the ML algorithm understand that

01:22.600 --> 01:26.280
1 represents American origin,

01:26.320 --> 01:29.320
2 European, and 3 Japanese.

01:29.480 --> 01:36.160
We need to transform this into something that ML can understand.

01:36.280 --> 01:43.480
So before converting, we have data like this: 1, 2, and 3 for the origin.

01:43.480 --> 01:51.520
And after converting with one-hot encoding, we're going to have three columns: origin_1,

01:51.520 --> 01:52.920
origin_2, and origin_3.

01:53.240 --> 01:58.630
In each column, one represents that it is true.

01:58.910 --> 02:01.270
Zero represents that it is false.
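
NOTE
To make the true/false idea concrete, here is a small hand-rolled sketch in plain Python.
It is for illustration only and is not how the lesson performs the encoding.
The sample values and the names origins, categories, and encoded are hypothetical.
```python
# Hypothetical origin codes: 1 = American, 2 = European, 3 = Japanese.
origins = [1, 3, 2, 1]
# Every distinct code becomes its own column.
categories = sorted(set(origins))
# For each row, a column holds 1 if the row has that code, otherwise 0.
encoded = [
    {f"origin_{c}": int(value == c) for c in categories}
    for value in origins
]
for value, row in zip(origins, encoded):
    print(value, row)
```
Each row ends up with exactly one 1, in the column that matches its original code.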

02:01.310 --> 02:02.550
Let me show you guys.

02:02.550 --> 02:09.670
So here we print "Features after one-hot encoding" and then print X.head().

02:09.950 --> 02:13.990
Let me run the first cells again.

02:13.990 --> 02:15.110
And here we go.

02:15.270 --> 02:20.710
We have cylinders, displacement, horsepower, weight, acceleration, and model year.

02:20.710 --> 02:22.430
Those are the same.

02:22.590 --> 02:30.310
But the newly created columns are origin_1, origin_2, and origin_3.

02:30.590 --> 02:34.230
origin_1 means that it is American.

02:34.230 --> 02:41.750
origin_2 means that it is European, and origin_3 represents Japanese origin.

02:41.790 --> 02:42.270
Okay.

02:42.510 --> 02:52.430
In this way, we can convert this column into something that is readable by ML algorithms.
