WEBVTT

00:00.080 --> 00:04.880
We have 398 rows and eight columns.

00:04.920 --> 00:06.080
This is our data.

00:06.360 --> 00:09.000
So let's understand the data.

00:09.040 --> 00:10.520
Create a new code cell.

00:10.720 --> 00:15.640
And here we need to get the basic information about the data set.

00:15.760 --> 00:23.440
Start by printing "Dataset info", then call the dataset.info() function and run.

00:23.440 --> 00:24.440
And here we go.

00:24.760 --> 00:37.640
We get a RangeIndex of 398 entries, from 0 to 397, and eight data columns in total.

00:37.640 --> 00:40.840
And here we have our data.

00:40.840 --> 00:46.960
We have the column names, the non-null count, and the data type.

00:46.960 --> 00:58.520
So MPG is float64, cylinders is integer, displacement is float, horsepower is float, weight is float, acceleration

00:58.520 --> 01:03.640
is float, model year is integer, and origin is integer.

01:03.640 --> 01:07.920
So now we understand the data types of our columns.

01:07.960 --> 01:09.160
Data types.

01:09.160 --> 01:15.640
We have five floats and three integers, and the memory usage is about 25 kilobytes.

01:15.840 --> 01:18.280
And this is our data.
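
NOTE A minimal sketch of the info() step described above, not the lesson's exact notebook cell. The three-row frame and its column names are hypothetical stand-ins for the 398-row data set. Note that info() prints its report itself and returns None, so wrapping it in print() would add a stray "None" line.
```python
import io
import pandas as pd
# Hypothetical stand-in rows; the data set in the lesson has 398 rows and eight columns.
dataset = pd.DataFrame({
    "mpg": [18.0, 15.0, 25.0],
    "cylinders": [8, 8, 4],
    "horsepower": [130.0, None, 95.0],
})
print("Dataset info")
dataset.info()  # writes the report to stdout and returns None
# Optionally capture the same report as text, for example to log it.
buffer = io.StringIO()
dataset.info(buf=buffer)
info_text = buffer.getvalue()
```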

01:18.520 --> 01:22.560
Now let's get the missing values.

01:22.560 --> 01:26.000
So here, print "Missing values".

01:26.000 --> 01:30.520
We print the missing values using dataset.isna().sum().

01:30.960 --> 01:39.640
This line of code prints the count of missing (not available, or not assigned) values for each column

01:39.640 --> 01:41.000
in the data frame.

01:41.120 --> 01:43.240
Run it again and here we go.

01:43.280 --> 01:46.000
The missing values for the MPG.

01:46.040 --> 01:46.840
Nothing.

01:46.880 --> 01:47.560
Cylinders.

01:47.560 --> 01:48.560
Displacement.

01:48.600 --> 01:53.240
Horsepower has six, and the others are zero.

01:53.560 --> 01:56.520
The data type is int64.

01:56.560 --> 02:05.040
If we go to the horsepower column, we will find six not-available values.

02:05.040 --> 02:15.150
So in dataset.isna().sum(), the isna() part creates a Boolean DataFrame of the same shape as dataset.

02:15.190 --> 02:23.390
Each cell contains True if the value is missing (not available) and False otherwise. Then

02:23.430 --> 02:24.030
.sum()

02:24.070 --> 02:29.670
sums the True values, which count as one, for each column.

02:29.670 --> 02:31.630
False counts as zero.

02:31.870 --> 02:38.670
It returns a Series with the column names as the index and the missing-value counts as the values.

02:38.670 --> 02:45.310
So in this case we get six missing values in the horsepower column.

02:45.310 --> 02:49.030
So if we go here and search our data set,

02:49.070 --> 02:54.510
We will get six missing values in the horsepower.
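
NOTE A small sketch of the two steps just described: isna() builds the Boolean mask and sum() counts the True cells per column. The four-row frame is a hypothetical stand-in with two missing horsepower values instead of the six in the lesson's data, and the names mask, missing_counts, and missing_rows are illustrative only.
```python
import numpy as np
import pandas as pd
# Hypothetical stand-in data with two missing horsepower values.
dataset = pd.DataFrame({
    "mpg": [18.0, 15.0, 25.0, 21.0],
    "horsepower": [130.0, np.nan, 95.0, np.nan],
})
mask = dataset.isna()        # Boolean DataFrame with the same shape as dataset
missing_counts = mask.sum()  # True counts as 1 and False as 0, summed per column
print("Missing values")
print(missing_counts)
# Select the rows where horsepower is missing, to inspect them in the data set.
missing_rows = dataset[dataset["horsepower"].isna()]
print(missing_rows)
```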

02:54.550 --> 02:55.110
Okay.

02:55.310 --> 03:00.710
We don't need to impute the missing values in this data.

03:00.710 --> 03:05.110
So I'm gonna delete those six values.

03:05.110 --> 03:14.190
To do that here we are going to create a new cell and handle missing values.

03:14.190 --> 03:21.030
Start by printing the shape before dropping:

03:21.190 --> 03:22.910
"Before dropping NA", then dataset

03:22.950 --> 03:23.790
.shape.

03:23.870 --> 03:24.830
Then dataset

03:24.870 --> 03:26.710
.dropna().

03:26.710 --> 03:28.230
And after dropping,

03:28.430 --> 03:30.110
give me the data shape again.

03:30.270 --> 03:30.870
Run!

03:30.870 --> 03:31.710
And here we go.

03:31.990 --> 03:38.110
Before dropping, we have 398 rows and eight columns.

03:38.230 --> 03:50.310
After dropping, we have 392 rows and eight columns, so the six rows with not-available horsepower values

03:50.550 --> 03:53.750
have been removed completely.

03:53.750 --> 03:59.630
So a complete row like this one is removed from our data set.

03:59.630 --> 04:14.030
So our data now contains 392 rows with eight columns, without any null or not-available values.
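
NOTE A hedged sketch of the drop step, using a hypothetical four-row frame rather than the lesson's data. It assumes the result of dropna() is assigned back to dataset. The transcript does not show whether the lesson reassigns or uses inplace=True, but without one of the two the original frame would be left unchanged.
```python
import numpy as np
import pandas as pd
# Hypothetical stand-in data with two rows that lack a horsepower value.
dataset = pd.DataFrame({
    "mpg": [18.0, 15.0, 25.0, 21.0],
    "horsepower": [130.0, np.nan, 95.0, np.nan],
})
print("Before dropping NA:", dataset.shape)
# dropna() removes every row that contains at least one missing value
# and returns a new DataFrame, so the result is assigned back.
dataset = dataset.dropna()
print("After dropping NA:", dataset.shape)
remaining_missing = int(dataset.isna().sum().sum())  # total missing cells left
```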
