WEBVTT

00:00.080 --> 00:01.040
Welcome back.

00:01.080 --> 00:06.240
We learned about StandardScaler and why it's very important.

00:06.480 --> 00:10.320
Now let's run this cell and everything should be fine.

00:10.480 --> 00:17.320
Also, we can check the TensorFlow version by using tf.__version__.

00:17.480 --> 00:18.760
Let me run again.

00:19.040 --> 00:19.760
There we go.

00:19.800 --> 00:27.360
TensorFlow is already installed and all of those libraries are added and imported correctly.

00:27.680 --> 00:33.000
Now let's move to the second step which is loading and exploring the data.

00:33.280 --> 00:41.360
We'll use the classic Auto MPG (miles per gallon) dataset from an online website.

00:41.480 --> 00:43.000
This website is

00:43.360 --> 00:50.840
archive.ics.uci.edu/ml/machine-

00:50.840 --> 00:52.080
learning-databases/

00:52.120 --> 00:55.360
auto-mpg/auto-mpg.data.

00:55.400 --> 00:56.600
This is the link.

00:56.720 --> 01:05.200
You can also get it from the resources folder, or from the notebook you import from the resources folder

01:05.200 --> 01:14.160
under each section. You'll notice there is a lot of data about the cars:

01:14.320 --> 01:18.000
the name of the car and the mileage.

01:18.000 --> 01:22.480
The mileage is the number that we need to predict.

01:22.520 --> 01:24.600
We have the number of cylinders.

01:24.640 --> 01:35.360
We have the displacement, horsepower, weight, acceleration, model year, and the origin.

01:35.360 --> 01:42.480
For the origin, 1 refers to American, 2 refers to European, and 3 refers to Japanese.
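
NOTE
A minimal sketch of turning the numeric origin codes into readable labels.
The column name "Origin" and the label strings are assumptions for illustration,
and the four sample values are made up rather than taken from the file.
```python
import pandas as pd
# Assumed mapping, following the 1/2/3 coding described in this cue.
origin_labels = {1: "American", 2: "European", 3: "Japanese"}
# Hypothetical sample of origin codes.
origin = pd.Series([1, 3, 2, 1], name="Origin")
# Series.map looks up each code in the dictionary.
labels = origin.map(origin_labels)
print(labels.tolist())
```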

01:42.520 --> 01:42.920
Okay.

01:42.960 --> 01:53.960
This is the data we're going to work with to develop a model that learns from all of this data

01:53.960 --> 01:58.120
and predicts the mileage in miles per gallon.

01:58.240 --> 01:58.720
Okay.

01:59.160 --> 02:01.360
So in order to load this,

02:01.400 --> 02:03.160
copy the link.

02:03.200 --> 02:06.720
Load the data by using the URL.

02:07.040 --> 02:16.680
So the special thing in this application and in this tutorial is that we're going to load real

02:16.680 --> 02:18.480
data from an online website.

02:18.480 --> 02:24.120
So this is real data, real statistics, and real numbers.

02:24.120 --> 02:27.360
So we're going to deal with those real numbers.

02:27.360 --> 02:29.520
So here this is the URL.

02:29.840 --> 02:32.680
Then we need to define the column names.

02:32.680 --> 02:42.880
So column_names: MPG (miles per gallon), cylinders, displacement, horsepower, weight, acceleration, model year,

02:43.080 --> 02:44.560
and origin.

02:44.600 --> 02:54.200
Here we need to read the CSV file from a URL into a pandas DataFrame with specific parsing options,

02:54.200 --> 02:58.040
as we did in the previous sections and in the previous application.

02:58.240 --> 03:04.160
We collect our data and store it inside the data frame.

03:04.360 --> 03:13.320
So to do that, we start with raw_dataset = pd.read_csv().

03:13.560 --> 03:17.160
Here we have six parameters.

03:17.160 --> 03:19.400
Let me start with the URL.

03:19.840 --> 03:25.800
The path to the CSV file could be a web URL or a local file path.

03:25.960 --> 03:34.400
You can also upload your data sample, the CSV file, here in the folders and files for this

03:34.400 --> 03:35.080
notebook.

03:35.280 --> 03:41.080
Or you can pass the link to your data, exactly like we did here.

03:41.080 --> 03:45.120
So I'm passing this URL here in the first parameter.

03:45.320 --> 03:48.400
The second parameter is names.

03:48.760 --> 03:52.200
This provides custom column names

03:52.200 --> 04:00.280
instead of using the first row of the file. column_names should be a list of strings containing your

04:00.280 --> 04:02.000
desired column names.

04:02.000 --> 04:08.210
So here the raw file has nine columns,

04:08.210 --> 04:13.890
but the car name is dropped, so I specify eight names for our columns.

04:14.010 --> 04:24.370
The third parameter, na_values='?', treats the question mark as a missing or not-available

04:24.370 --> 04:25.250
value.

04:25.290 --> 04:33.290
Any question mark in the dataset will be converted to NaN, the marker pandas uses for a missing value.

04:33.290 --> 04:40.690
So if we scroll down, you notice that in this row we have the question mark.

04:40.690 --> 04:44.570
There are also other missing values.

04:44.970 --> 04:47.250
So if you scroll down, here we go.

04:47.490 --> 04:49.890
This is another question mark.

04:49.890 --> 04:57.290
So those question marks will be treated as NaN, as not available.
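
NOTE
A minimal sketch of the na_values behaviour described above, assuming a made-up
three-value column instead of the real file. The column name "Horsepower" is
used only for illustration.
```python
import io
import pandas as pd
# Hypothetical single-column sample with one "?" placeholder.
text = "130.0\n?\n95.0\n"
# Without na_values, the "?" stays in the column as ordinary text.
as_text = pd.read_csv(io.StringIO(text), names=["Horsepower"])
# With na_values="?", the placeholder is parsed as NaN and the column is numeric.
as_nan = pd.read_csv(io.StringIO(text), names=["Horsepower"], na_values="?")
# Count the missing values in each version.
print(as_text["Horsepower"].isna().sum(), as_nan["Horsepower"].isna().sum())
```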

04:57.330 --> 05:07.970
The fourth parameter, comment='\t', treats tab characters as comment indicators.

05:08.250 --> 05:11.850
Any text after a tab character will be ignored,

05:11.890 --> 05:13.850
treated as a comment.

05:13.890 --> 05:19.930
The fifth parameter is sep=' ', a single space.

05:19.970 --> 05:26.730
It uses a single space as the delimiter, or separator, between columns.

05:27.010 --> 05:32.370
The file is space separated rather than comma separated.

05:32.530 --> 05:41.730
So instead of separating the columns by comma, we're going to use a single space as a separator between

05:41.890 --> 05:42.690
columns.

05:42.730 --> 05:50.410
The last parameter, skipinitialspace=True, skips any whitespace that appears immediately

05:50.410 --> 05:52.330
after the delimiter.

05:52.370 --> 05:56.010
It helps clean up extra spaces in the data.
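
NOTE
A minimal sketch of what skipinitialspace changes, assuming a made-up one-line
sample rather than the real file. With sep=" " alone, a run of several spaces
is read as several empty fields.
```python
import io
import pandas as pd
# Hypothetical sample line: two values separated by three spaces.
line = "18.0   8\n"
# Without skipinitialspace, the extra spaces become empty fields.
loose = pd.read_csv(io.StringIO(line), sep=" ", header=None)
# With skipinitialspace=True, spaces right after the delimiter are skipped.
tight = pd.read_csv(io.StringIO(line), sep=" ", header=None, skipinitialspace=True)
# Compare the number of columns each call produced.
print(loose.shape[1], tight.shape[1])
```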

05:56.050 --> 05:58.250
Okay, to recap, this code

05:58.570 --> 06:08.170
uses custom column names from the column_names list, treats the question mark as missing data,

06:08.210 --> 06:17.450
ignores anything after a tab, here, for example, as a comment, parses the values separated by spaces, and

06:17.490 --> 06:22.050
handles the extra space before 6.0 properly.

06:22.050 --> 06:29.770
This is commonly used for reading formatted datasets where the default CSV settings don't apply

06:29.810 --> 06:30.770
perfectly.

06:30.770 --> 06:37.930
And you can see we use pandas to read the CSV file using the read_csv function.
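
NOTE
A minimal sketch of the full read_csv call with the six arguments walked through
above. To stay runnable offline it reads a small in-memory sample that imitates
the file layout (space-padded values, then a tab and the car name), which is an
assumption for illustration. In the notebook the first argument is the URL.
```python
import io
import pandas as pd
# Hypothetical three-row sample in the same layout as the Auto MPG file.
sample = (
    '18.0   8   307.0      130.0      3504.      12.0   70  1\t"chevrolet chevelle malibu"\n'
    '25.0   4   98.00      ?          2046.      19.0   71  1\t"ford pinto"\n'
    '24.0   4   113.0      95.00      2372.      15.0   70  3\t"toyota corona mark ii"\n'
)
# Eight names: the car name after the tab is dropped by comment="\t".
column_names = ["MPG", "Cylinders", "Displacement", "Horsepower",
                "Weight", "Acceleration", "Model Year", "Origin"]
raw_dataset = pd.read_csv(
    io.StringIO(sample),      # stands in for the URL or a local path
    names=column_names,       # custom column names
    na_values="?",            # "?" is read as NaN
    comment="\t",             # ignore everything after a tab
    sep=" ",                  # space-separated values
    skipinitialspace=True,    # skip padding after each delimiter
)
print(raw_dataset.shape)
print(raw_dataset["Horsepower"].isna().sum())
```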

06:37.970 --> 06:42.610
Now, dataset = raw_dataset.copy().

06:42.810 --> 06:47.330
This line of code creates a copy of the data frame.

06:47.330 --> 06:50.090
What it does is create a deep copy.

06:50.130 --> 06:53.250
It makes a completely independent copy of raw_dataset.

06:53.290 --> 07:01.730
The new dataset variable points to a separate object in memory, and changes made to dataset

07:01.770 --> 07:05.610
will not affect raw_dataset, and vice versa.
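
NOTE
A minimal sketch of the independence that copy() gives, assuming a tiny made-up
data frame in place of the real one.
```python
import pandas as pd
# Hypothetical two-row stand-in for the loaded data.
raw_dataset = pd.DataFrame({"MPG": [18.0, 25.0], "Cylinders": [8, 4]})
# copy() defaults to deep=True, so the data itself is duplicated.
dataset = raw_dataset.copy()
# Change a value in the copy only.
dataset.loc[0, "MPG"] = 0.0
# The original keeps its value.
print(raw_dataset.loc[0, "MPG"], dataset.loc[0, "MPG"])
```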

07:05.650 --> 07:12.650
To print the dataset shape we use dataset.shape, and to print the dataset,

07:12.690 --> 07:14.330
the complete dataset,

07:14.370 --> 07:22.490
we use dataset. If we need only the first ten rows, we use the head function and

07:22.490 --> 07:25.650
pass the number of rows that we need.
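
NOTE
A minimal sketch of inspecting a data frame's shape and first rows, assuming a
made-up twelve-row frame in place of the real dataset.
```python
import pandas as pd
# Hypothetical stand-in data: twelve rows, two columns.
dataset = pd.DataFrame({"MPG": [float(i) for i in range(12)], "Cylinders": [4] * 12})
# shape is an attribute holding (rows, columns).
print(dataset.shape)
# head(n) returns the first n rows; the default is 5.
first_ten = dataset.head(10)
print(len(first_ten))
```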

07:25.850 --> 07:28.410
Run the cell and here we go.

07:28.650 --> 07:30.690
This is our data set shape.

07:30.690 --> 07:36.050
We have 398 rows with eight columns.

07:36.170 --> 07:36.690
Okay.

07:36.890 --> 07:44.250
So this is our dataset: 398 rows by eight columns.

07:44.250 --> 07:52.610
This is a moderately sized, real dataset that we're going to.
