WEBVTT

00:00.120 --> 00:04.600
Another type of machine learning is reinforcement learning.

00:04.760 --> 00:06.680
Reinforcement learning,

00:06.720 --> 00:14.320
or RL, is a branch of machine learning that focuses on how agents can learn to make decisions through trial

00:14.320 --> 00:17.080
and error to maximize cumulative rewards.

00:17.320 --> 00:25.680
RL allows machines to learn by interacting with an environment and receiving feedback based on their

00:25.680 --> 00:26.480
actions.

00:26.640 --> 00:31.840
This feedback comes in the form of rewards or penalties.

00:31.880 --> 00:39.240
Reinforcement learning revolves around the idea that an agent, the learner or decision maker, interacts

00:39.240 --> 00:41.880
with an environment to achieve a goal.

00:42.040 --> 00:44.960
So this is the environment and the agent.

00:45.000 --> 00:46.560
This is the raw data.

00:46.600 --> 00:50.000
The raw data is given to the agent and the environment.

00:50.200 --> 00:54.120
Based on this data, the agent makes its prediction.

00:54.240 --> 01:00.840
The agent performs actions and receives feedback to optimize its decision making

01:00.840 --> 01:01.720
over time.

01:02.000 --> 01:05.960
The agent: the decision maker that performs the actions.

01:06.080 --> 01:11.000
Environment: the world or system in which the agent operates.

01:11.280 --> 01:20.480
State: the situation or condition the agent is currently in. Action: the possible moves or decisions

01:20.520 --> 01:28.040
the agent can make. Reward: the feedback or result from the environment based on the agent's

01:28.200 --> 01:29.520
action.

01:29.680 --> 01:39.360
Unlike supervised or unsupervised learning, machines learn by trial and error using reinforcement

01:39.360 --> 01:40.080
learning.

01:40.280 --> 01:49.520
So here, guys, adapting its approach to the situation based on previous experience helps the agent achieve

01:49.520 --> 01:50.720
the best outcome.

01:50.880 --> 01:56.000
This is the main summary of reinforcement learning.

01:56.120 --> 02:06.210
As a quick recap of how reinforcement learning works, the agent interacts iteratively with its environment

02:06.210 --> 02:15.210
in a feedback loop. The agent observes the current state of the environment, then chooses and performs an

02:15.210 --> 02:17.210
action based on its policy.

02:17.610 --> 02:26.010
The environment responds by transitioning to a new state and providing a reward or penalty.

02:26.170 --> 02:34.370
The agent updates its knowledge, that is, its policy or value function, based on the reward received

02:34.370 --> 02:36.170
and the new state.

02:36.330 --> 02:45.250
This cycle repeats, with the agent balancing exploration (trying new actions) and exploitation (using

02:45.250 --> 02:51.210
known good actions) to maximize the cumulative reward over time.
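
NOTE
The loop just described (observe state, choose action, receive reward, update, repeat) can be sketched roughly as below.
This is a minimal illustration, not the lecture's own code: the corridor environment, the names step, train, q_table
and greedy_policy, and all parameter values are hypothetical, and tabular Q-learning with epsilon-greedy action
selection is just one possible choice of update rule and exploration strategy.
```python
import random
#
# Hypothetical toy environment: a corridor of 5 cells; reaching cell 4 pays reward 1.
N_STATES, ACTIONS, GOAL = 5, (-1, +1), 4
#
def step(state, action):
    """Environment: return (next_state, reward, done) for one move."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    done = next_state == GOAL
    return next_state, (1.0 if done else 0.0), done
#
def train(episodes=300, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Tabular Q-learning with epsilon-greedy action selection (assumed setup)."""
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            if rng.random() < epsilon:
                # Explore: try a random action.
                action = rng.choice(ACTIONS)
            else:
                # Exploit: best known action, ties broken at random.
                action = max(ACTIONS, key=lambda a: (q[(state, a)], rng.random()))
            next_state, reward, done = step(state, action)
            # Update the value estimate from the reward and the new state.
            best_next = 0.0 if done else max(q[(next_state, a)] for a in ACTIONS)
            q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
            state = next_state
    return q
#
q_table = train()
# Greedy policy read off the learned values for every non-goal cell.
greedy_policy = {s: max(ACTIONS, key=lambda a: q_table[(s, a)]) for s in range(GOAL)}
```
In this sketch, greedy_policy maps each non-goal cell to the move the agent currently rates highest.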

02:51.330 --> 03:00.490
This process is mathematically framed as a Markov decision process (MDP), where future states depend

03:00.490 --> 03:07.210
only on the current state and action, not on the prior sequence of events.
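
NOTE
As a sketch of the Markov property just stated, write s_t for the state and a_t for the action at step t.
This notation is an assumption added here for illustration and does not come from the lecture.
The next state then depends on the full history only through the current state and action:
```latex
\[
P(s_{t+1} \mid s_t, a_t) = P(s_{t+1} \mid s_0, a_0, s_1, a_1, \dots, s_t, a_t)
\]
```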