WEBVTT

00:00.720 --> 00:05.440
And of course, computers didn't just use bytes for encoding and processing integers.

00:05.440 --> 00:13.320
They would also often store and process human-readable letters and numbers, and these are called characters.

00:18.280 --> 00:22.720
And early character encodings such as

00:23.280 --> 00:31.560
ASCII had settled on using seven bits per character.

00:37.360 --> 00:45.920
But this gave only a limited set of 128 possible characters.
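
NOTE
Editor's sketch of the arithmetic behind the 128-character limit, assuming the usual 7-bit ASCII code space.
The names ASCII_BITS, code_points and highest are illustrative only and do not come from the lecture.
```python
# Assumption: ASCII uses 7 bits per character.
ASCII_BITS = 7
# Each extra bit doubles the number of distinct bit patterns.
code_points = 2 ** ASCII_BITS
print(code_points)  # 128
# Code points therefore run from 0 to 127.
highest = code_points - 1
print(highest, format(highest, "07b"))  # 127 1111111
```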

00:47.040 --> 00:53.760
Now this allowed for encoding English-language letters and digits, as well as a few symbol characters

00:53.760 --> 01:00.290
and control characters, but could not represent many of the letters used in other languages.

01:01.970 --> 01:17.690
So the EBCDIC standard, using its eight-bit bytes, which you learned about in the previous

01:17.690 --> 01:18.250
lecture,

01:18.530 --> 01:25.890
chose a different character set entirely, with code pages for swapping to different languages,

01:25.890 --> 01:30.530
but ultimately this was too cumbersome and inflexible.

01:31.130 --> 01:35.290
So over time it became clear that...

01:35.610 --> 01:37.730
This was the first version, ASCII.

01:37.930 --> 01:41.210
Plain ASCII is still used sometimes, but not much.

01:43.050 --> 01:51.370
And after that, over time it became clear that we needed a truly universal character set supporting

01:51.370 --> 01:54.930
all the world's living languages and special symbols.

01:55.250 --> 01:57.370
And this culminated,

01:57.370 --> 02:10.140
as I can say, in support for all the world's living languages, and this created UTF,

02:10.900 --> 02:13.860
or the Unicode project.

02:13.860 --> 02:15.540
First, let's start with Unicode.

02:15.980 --> 02:20.380
So Unicode: universal code.

02:23.980 --> 02:30.980
The Unicode project began in 1987.

02:31.700 --> 02:34.540
Now, a few different Unicode encodings exist.

02:34.540 --> 02:41.620
But the dominant encoding used on the web is UTF-8.

02:41.660 --> 02:52.780
You have almost certainly encountered this when writing a document in Word or LibreOffice. Characters within

02:52.780 --> 03:01.470
the ASCII character set are included verbatim in UTF-8, while extended characters can spread

03:01.470 --> 03:03.870
out over multiple consecutive bytes.
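
NOTE
Editor's sketch of this point, using Python's built-in str.encode. The three sample characters (A, e with acute accent, euro sign) are arbitrary choices, not examples from the lecture.
It shows that an ASCII character stays one byte in UTF-8 while other characters take two or three bytes.
```python
# Sample characters: "A", U+00E9 (e with acute accent), U+20AC (euro sign).
samples = ["A", "\u00e9", "\u20ac"]
lengths = {}
for ch in samples:
    encoded = ch.encode("utf-8")
    lengths[ch] = len(encoded)
    # Print the code point, the UTF-8 bytes in hex, and the byte count.
    print(f"U+{ord(ch):04X}", encoded.hex(" "), len(encoded))
# U+0041 41 1
# U+00E9 c3 a9 2
# U+20AC e2 82 ac 3
# ASCII text produces identical bytes in ASCII and in UTF-8.
same = "ARM".encode("ascii") == "ARM".encode("utf-8")
print(same)  # True
```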

03:04.430 --> 03:10.950
Now, since characters are now encoded as bytes, we can represent each byte using two hexadecimal

03:10.950 --> 03:11.510
digits.

03:11.550 --> 03:18.550
Remember, one byte equals two hexadecimal digits.

03:21.110 --> 03:33.270
So as an ASCII character in UTF-8 is one byte, we can fully represent it with two hexadecimal digits.

03:33.630 --> 03:39.630
So we can represent these characters using two hexadecimal digits each.

03:39.670 --> 03:42.630
For example, the character A.

03:45.630 --> 03:49.070
Let's also use R and, let's say, M.

03:49.750 --> 03:54.190
So these characters are normally encoded with the octets here.

03:54.550 --> 04:07.630
So A is 0x41, R is 0x52, and M is 0x4D.
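
NOTE
Editor's sketch for checking these three values with Python's built-in ord, chr and format. The variable name codes is illustrative only.
```python
# Look up the code of each character and show it as two hex digits.
codes = {ch: format(ord(ch), "02X") for ch in "ARM"}
print(codes)  # {'A': '41', 'R': '52', 'M': '4D'}
# The 0x prefix marks a hexadecimal literal in Python as well.
print(0x41 == 65, chr(0x52), chr(0x4D))  # True R M
# One byte is always written as two hex digits.
print(bytes([0x41, 0x52, 0x4D]).hex())  # 41524d
```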

04:08.350 --> 04:09.510
Now, this 0x prefix:

04:09.510 --> 04:11.670
it means that this is a hexadecimal number.

04:15.390 --> 04:24.750
And each of these letters in UTF-8 has two hexadecimal digits, since each one is a single byte.

04:25.630 --> 04:31.430
Now, each hexadecimal digit can be encoded with a four-bit pattern.

04:31.430 --> 04:40.670
So each hex digit is four bits, ranging from 0000 to 1111.

04:41.830 --> 04:44.150
So in the A here.

04:46.790 --> 04:52.310
We have four and one. Four is 0100.

04:52.350 --> 04:56.800
You learned that in the previous lecture. And one is 0001.

04:57.360 --> 05:04.720
So this is the full A in UTF-8: 0100 0001.

05:05.440 --> 05:20.920
And in R, what do we have? Five is 0101, and two,

05:20.960 --> 05:26.160
in binary, is 0010.

05:26.600 --> 05:28.880
So this is a full character.

05:29.600 --> 05:35.320
And we also have 4D: four and D in hexadecimal.

05:35.480 --> 05:47.560
Four in binary is 0100, and D in binary is

05:48.040 --> 05:49.840
1101.

05:50.890 --> 05:54.690
And that is a full UTF-8 character as well.
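
NOTE
Editor's sketch of the hex-digit-to-bits step worked through above. The helper name nibbles is hypothetical and is not part of any library.
It splits one byte into its upper and lower four-bit halves, one half per hex digit.
```python
def nibbles(byte_value: int) -> tuple:
    """Split one byte into its high and low 4-bit patterns."""
    high = byte_value >> 4    # upper hex digit
    low = byte_value & 0x0F   # lower hex digit
    return format(high, "04b"), format(low, "04b")
for ch in "ARM":
    print(ch, hex(ord(ch)), *nibbles(ord(ch)))
# A 0x41 0100 0001
# R 0x52 0101 0010
# M 0x4d 0100 1101
```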

05:54.730 --> 06:02.050
Now, since two hexadecimal digits are required to encode an ASCII character, eight bits seem like

06:02.050 --> 06:08.370
the ideal unit for storing text in most written languages around the world, or a multiple of eight bits

06:08.370 --> 06:11.890
for characters that cannot be represented in eight bits alone.

06:12.090 --> 06:18.210
Now, using this pattern, we can more easily interpret the meaning of a long string of bits.

06:18.810 --> 06:23.850
Now, let's use some text here.

06:23.850 --> 06:25.610
So take this bit string.

06:28.170 --> 06:34.690
01000001

06:36.810 --> 06:40.610
01010010

06:41.570 --> 06:42.050
So.

06:42.090 --> 06:42.490
Yeah.

06:43.130 --> 06:45.970
And 01001101.

06:46.210 --> 06:51.380
So I'm asking you: in UTF-8,

06:53.620 --> 06:55.780
what do these bits stand for?

06:56.540 --> 07:01.500
So we will divide it. We know that UTF-8 is...

07:02.380 --> 07:03.140
Oops sorry.

07:03.900 --> 07:14.900
So we know that UTF-8 basically uses eight bits for each of these characters, which is one byte, or two hex digits.

07:15.300 --> 07:22.140
So this here means... oh, this looks familiar, right?

07:22.180 --> 07:25.380
So this here means A.

07:32.980 --> 07:33.500
Yeah.

07:33.500 --> 07:34.300
This here means

07:34.300 --> 07:34.540
R.

07:35.780 --> 07:37.340
And this looks familiar too:

07:40.100 --> 07:47.780
M. So in bits we have basically written A, R, M. Thank you for watching, and I'll be waiting for you in the next lecture.
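
NOTE
Editor's sketch of the decoding exercise, assuming the full bit string is the 24 bits for the three characters A, R and M discussed earlier.
The variable names bits, groups, data and text are illustrative only.
```python
bits = "010000010101001001001101"
# Cut the string into 8-bit groups, one group per byte.
groups = [bits[i:i + 8] for i in range(0, len(bits), 8)]
print(groups)  # ['01000001', '01010010', '01001101']
# Turn each group into a byte value, then decode the bytes as UTF-8.
data = bytes(int(group, 2) for group in groups)
text = data.decode("utf-8")
print(data.hex(" "), text)  # 41 52 4d ARM
```
Grouping by eight only works this directly because every character here is ASCII, so each one occupies exactly one byte.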