WEBVTT

00:00.080 --> 00:01.000
Welcome back.

00:01.120 --> 00:05.160
Now let's talk about the input and output shapes of our model.

00:05.200 --> 00:11.480
We have the YOLO input: float32 [1, 640, 640, 3].

00:11.520 --> 00:17.000
This defines what kind of image you need to feed into the model.

00:17.240 --> 00:19.520
Float32 is the data type.

00:19.760 --> 00:26.120
The pixel values should be 32-bit floating point numbers.

00:26.160 --> 00:32.880
This usually means your image pixels need to be normalized.

00:32.960 --> 00:37.960
For example, raw pixel values range from 0 to 255.

00:38.080 --> 00:42.400
We need to normalize them to the range 0 to 1.
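
NOTE
A minimal sketch of that normalization, written in Python for illustration
(the lesson itself targets Android). The function name normalize_pixels is
hypothetical. It divides each 8-bit channel value by 255 so the result lies
in the range 0 to 1. Python floats are 64-bit; on device the values would be
stored as 32-bit floats.
```python
def normalize_pixels(pixels):
    """Scale 8-bit channel values (0..255) to floats in the range 0..1."""
    return [value / 255.0 for value in pixels]
# One RGB pixel given as three channel values.
print(normalize_pixels([0, 128, 255]))
```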

00:42.440 --> 00:46.160
Now let's talk about the array or the tensor.

00:46.280 --> 00:50.120
This is the shape, or dimensions, of the expected data array.

00:50.160 --> 00:54.520
It's read as batch size, height,

00:54.560 --> 00:56.320
width, channels.

00:56.440 --> 01:03.770
First of all, the number one here is the batch size.

01:04.050 --> 01:10.130
This model is configured to process one image at a time.

01:10.290 --> 01:17.170
640 and 640 are the height and the width of the image in pixels.

01:17.330 --> 01:19.810
Three is the number of channels.

01:19.850 --> 01:24.530
These are the three color channels: red, green, and blue (RGB).

01:24.810 --> 01:36.010
In simple terms, you must preprocess your input image to be a 640 by 640 pixel RGB image, normalized

01:36.010 --> 01:41.050
to Float32 values and submitted as a batch of one.
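
NOTE
A rough Python sketch of what the [1, 640, 640, 3] shape implies for the
input buffer. The helper names are hypothetical, and the flat row-major
layout (batch, then height, then width, then channels) is an assumption
based on the order in which the shape is read above.
```python
INPUT_SHAPE = (1, 640, 640, 3)  # batch, height, width, channels
BYTES_PER_FLOAT32 = 4
def element_count(shape):
    """Multiply all dimensions to get the number of values in the tensor."""
    count = 1
    for dim in shape:
        count *= dim
    return count
def flat_index(y, x, channel, width=640, channels=3):
    """Offset of one channel value in a flat row-major array (batch item 0)."""
    return (y * width + x) * channels + channel
print(element_count(INPUT_SHAPE))                      # number of float values
print(element_count(INPUT_SHAPE) * BYTES_PER_FLOAT32)  # buffer size in bytes
```
The byte count is the size an input buffer would need if every value is
stored as a 32-bit float.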

01:41.170 --> 01:44.810
Now let's talk about the output tensor.

01:45.010 --> 01:49.650
This is the raw, unprocessed output of the YOLO model.

01:49.890 --> 01:57.530
It contains all the potential detections and needs to be parsed to be useful.

01:57.530 --> 01:58.690
Float32.

01:58.730 --> 02:05.180
This is the data type of the output values. Next, [1, 5, 8400]:

02:05.220 --> 02:05.860
The shape.

02:06.100 --> 02:08.260
This is the most complex part.

02:08.300 --> 02:16.220
It's best understood as batch size, data per detection, and total possible detections.

02:16.380 --> 02:24.460
Let's start with the batch size: one corresponds to the one image we fed in, one image at a time.

02:24.500 --> 02:27.540
Five is the data per detection.

02:27.780 --> 02:35.780
This is the crucial part. For each of the 8400 potential detections,

02:36.060 --> 02:39.100
The model outputs five values.

02:39.340 --> 02:46.900
The composition of these five values depends on the number of classes your model was trained to detect.

02:46.940 --> 02:51.860
8400 is the total number of possible detections.

02:52.180 --> 02:58.780
This is the number of candidate boxes, or potential object locations, that the model checks.

02:58.780 --> 03:08.020
YOLO divides the image into a grid and then checks for objects at each cell at different

03:08.020 --> 03:08.780
scales.

03:08.780 --> 03:15.900
For a modern YOLOv8 model, this number is typically 80 times 80, plus 40 times 40, plus

03:15.900 --> 03:19.820
20 times 20, which equals 8400.

03:19.860 --> 03:23.380
And this is the total possible detections.
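
NOTE
The arithmetic above can be checked with a short Python sketch. The strides
8, 16, and 32 are an assumption: they are the downsampling factors that turn
a 640 pixel side into grids of 80, 40, and 20 cells. The function name is
hypothetical.
```python
INPUT_SIZE = 640
STRIDES = (8, 16, 32)  # assumed downsampling factors for the three scales
def candidates_per_scale(input_size, strides):
    """Return the number of grid cells at each scale."""
    return [(input_size // stride) ** 2 for stride in strides]
cells = candidates_per_scale(INPUT_SIZE, STRIDES)
print(cells, sum(cells))
```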

03:23.380 --> 03:27.020
The most important part is number five.

03:27.060 --> 03:30.620
We have the most common structure of these five values.

03:30.780 --> 03:33.180
We have center x.

03:33.220 --> 03:34.620
We talked about it.

03:34.620 --> 03:38.020
And center y.

03:38.380 --> 03:40.580
This is the second value.

03:40.820 --> 03:49.340
The third value is the width, and the fourth value is the height.

03:49.580 --> 03:52.740
And the fifth value is the class score.

03:52.980 --> 03:59.780
We're going to work with those values, the data per detection, in Android Studio.

03:59.780 --> 04:00.820
So don't worry.

04:01.060 --> 04:09.110
Center x and center y are the coordinates of the center of the detected bounding box.

04:09.230 --> 04:13.950
Width and height are the dimensions of the detected bounding box.

04:13.990 --> 04:20.910
YOLOv8 does not output a separate objectness score, that is, a confidence that this box contains any object at all.

04:20.990 --> 04:22.150
That's all.

04:22.750 --> 04:27.310
That's four values, with only one value left.

04:27.310 --> 04:32.030
It must be the class probability for that single class.

04:32.030 --> 04:34.870
There is no need for multiple class scores.
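
NOTE
A hedged Python sketch of reading one candidate out of a [1, 5, 8400]
output. It assumes the output arrives as one flat row-major array, so the
8400 center x values come first, then the 8400 center y values, and so on.
The function names and the tiny two-candidate demo data are hypothetical.
```python
NUM_VALUES = 5          # center x, center y, width, height, class score
NUM_CANDIDATES = 8400
def read_candidate(flat_output, i, num_candidates=NUM_CANDIDATES):
    """Return (cx, cy, w, h, score) for candidate i from a flat [1, 5, N] output."""
    return tuple(flat_output[c * num_candidates + i] for c in range(NUM_VALUES))
def to_corners(cx, cy, w, h):
    """Convert a center-format box to (left, top, right, bottom)."""
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)
# Demo with 2 candidates instead of 8400; each row holds one kind of value.
demo = [320.0, 100.0,   # center x of candidates 0 and 1
        240.0, 50.0,    # center y
        100.0, 20.0,    # width
        80.0, 10.0,     # height
        0.9, 0.1]       # class score
cx, cy, w, h, score = read_candidate(demo, 0, num_candidates=2)
print(to_corners(cx, cy, w, h), score)
```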

04:34.910 --> 04:49.150
Okay, so, in short, your model takes a 640 by 640 image and outputs 8400 candidate boxes for

04:49.150 --> 04:57.550
a single object class, which you must then filter and clean up to get the final useful detections.
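
NOTE
The lesson does not name the cleanup method at this point. One common
choice, shown here only as a sketch, is a score threshold followed by greedy
non-maximum suppression. The thresholds 0.5 and 0.45, the function names,
and the demo boxes are assumptions for illustration.
```python
def iou(a, b):
    """Intersection over union of two (left, top, right, bottom) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0
def filter_detections(boxes, scores, score_threshold=0.5, iou_threshold=0.45):
    """Drop low-score boxes, then keep the best box in each overlapping group."""
    order = sorted((i for i, s in enumerate(scores) if s >= score_threshold),
                   key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        # Keep box i only if it does not overlap too much with a kept box.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in kept):
            kept.append(i)
    return kept
boxes = [(100, 100, 200, 200), (105, 105, 205, 205), (400, 400, 500, 500), (0, 0, 50, 50)]
scores = [0.9, 0.8, 0.7, 0.2]
print(filter_detections(boxes, scores))  # indices of the boxes that survive
```
In the demo, the second box overlaps the first heavily and is suppressed,
and the last box falls below the score threshold.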

04:57.590 --> 04:58.070
Okay.

04:58.310 --> 05:01.910
This is what we're going to work with in Android.