You've tried to understand how Neural Networks actually learn, right?

%%{init: {'theme': 'dark'}}%% graph LR x1((x1)) --> h1((h1)) x1 --> h2((h2)) x2((x2)) --> h1 x2 --> h2 h1 --> y((Y)) h2 --> y

You open a textbook... you click on a video...

and within minutes, you're drowning.

Backpropagation!

Stochastic Gradient Descent!

The Chain Rule!

It feels like an impossibly complex black box.

A machine that just... performs magic.

And you're told to just accept that it works.

Here's the secret:

You've been taught this the wrong way around.

The entire engine that powers all of modern AI...

...is built on just a few simple, incredibly intuitive ideas.

Idea #1: The Valley

How do you find the bottom of a valley when you’re stuck in a thick fog?

Idea #2: The Blame Game

And how do you figure out who to "blame" when a team project goes wrong?

Once you truly grasp these two concepts...

...all of that scary math suddenly clicks into place.

It’s not a barrier; it's just the language we use to describe a logic you already understand.

So here’s my promise to you.

Give me 40 minutes, and you will MASTER how Neural Networks learn.

We will go step-by-step, building the entire learning process from scratch.

No skipped steps. No magic.

By the end of this video, you will have a deep, foundational understanding of Neural Networks.

You won't just know the buzzwords.

You will finally get that 'Aha!' moment.

Alright, let's get straight to the code.


INPUT: function f(x)
OUTPUT: argmin_x f(x)

FOR 100 iterations:
  gradient = f'(x)
  x = x - η × gradient
RETURN x

This tiny loop right here?

...is the beating heart of EVERY single AI system you have ever heard of.

ChatGPT, Midjourney, Self-Driving Cars... they ALL run on this exact logic.

So, what is it doing?

Every neural network is trying to get better by minimizing its "error"

—the gap between its guess and the right answer.

This algorithm is the engine that drives that error down toward its minimum, ideally close to zero.

The intuition is dead simple.

Remember that foggy hill I mentioned?

You're lost, you need to get to the bottom of the valley

but you can only see your own two feet.

What do you do?

You don't need to see the whole map. You just:

1. Feel the slope right where you're standing.

2. Take a small step in the steepest downhill direction.

3. Repeat.

That's it.

That's Gradient Descent.

You just feel the slope, take a step

and do it again and again until you reach the bottom where the ground is flat.

Our algorithm does the exact same thing, but with math.

Let's make this crystal clear with an example.

\[ f(x) = x^2 \]

The gradient is found with the derivative:

\[ f(x) = x^2 \implies f'(x) = 2x \]

At x = 3, slope is f'(3) = 2×3 = 6. Positive slope means "downhill" is to the left.

At x = -2, slope is f'(-2) = 2×(-2) = -4. Negative slope means "downhill" is to the right.

At x = 0, slope is f'(0) = 0. The ground is flat. You've arrived!

The Update Rule

\[ x_{new} = x_{old} - \eta \times f'(x) \]

This just automates the process. It subtracts the slope, forcing x to always move downhill towards the minimum.

Let's Watch It In Action

We'll start at a random spot, x = 3, and use a small step size (learning rate), η = 0.1.

Start: x = 3, Learning Rate: η = 0.1, Update Rule: x - 0.1 × (2x)

Iteration	Current x	f(x)=x²	Gradient f'(x)=2x	New x ← x - 0.1 × (2x)
0	3.000	9.000	6.000	3 - 0.1×6 = 2.400
1	2.400	5.760	4.800	2.4 - 0.1×4.8 = 1.920
2	1.920	3.686	3.840	1.92 - 0.1×3.84 = 1.536
3	1.536	2.359	3.072	1.536 - 0.1×3.072 = 1.229
...	...	...	...	...
10	0.322	0.104	0.644	0.322 - 0.1×0.644 = 0.258

Look at that! We started at x=3 with a huge error of 9. After just a few steps, it’s plummeting.

The algorithm is literally sliding down the curve of the parabola
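Want to check that table yourself? Here's a minimal Python sketch of the same loop. The function name `gradient_descent` and its parameters are just illustrative choices, not part of any library.

```python
def gradient_descent(grad, x0, lr=0.1, steps=10):
    """Repeat x <- x - lr * grad(x) and return every x visited."""
    xs = [x0]
    for _ in range(steps):
        xs.append(xs[-1] - lr * grad(xs[-1]))
    return xs


# f(x) = x^2, so the gradient is f'(x) = 2x
path = gradient_descent(lambda x: 2 * x, x0=3.0)

for i, x in enumerate(path):
    print(f"iteration {i:2d}: x = {x:.3f}, f(x) = {x * x:.3f}")
```

Each printed row should line up with the matching row of the table.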

But... here comes the plot twist.

Gradient descent has tunnel vision.

It only sees the slope directly under its feet.

What if the landscape isn't a simple valley?

\[ f(x) = x^4 - 4x^2 + x + 1 \]

The TRUE global minimum

A local minimum

If you start at `x = 0.5`...

The algorithm slides into the shallow valley and gets stuck.

It found an answer, but not the best answer.

But if you start at `x = -0.5`...

It finds the true, deep valley perfectly.

Same algorithm, different starting points, wildly different results.
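You can see this for yourself with a short Python sketch. The learning rate of 0.01 and the 2000 steps are assumptions picked so the run settles without overshooting; they are not values from the slides.

```python
def f(x):
    return x**4 - 4 * x**2 + x + 1


def df(x):
    # Derivative of f: 4x^3 - 8x + 1
    return 4 * x**3 - 8 * x + 1


def descend(x, lr=0.01, steps=2000):
    """Plain gradient descent from a given starting point."""
    for _ in range(steps):
        x -= lr * df(x)
    return x


right = descend(0.5)   # slides into the shallow valley
left = descend(-0.5)   # slides into the deep valley

print(f"start  0.5 -> x = {right:.3f}, f(x) = {f(right):.3f}")
print(f"start -0.5 -> x = {left:.3f}, f(x) = {f(left):.3f}")
```

The two runs end in different valleys, and the one that started at `-0.5` ends with the lower value of `f`.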

This raises a terrifying question.

A neural network has millions of parameters.

Creating an error landscape with billions of traps.

How could this work for GPT-4?

Here's the surprising, almost unbelievable answer:

...for huge neural networks, it almost doesn't matter.

The Surprising Luck of High Dimensions

✓ Most "local minima" are pretty good solutions

✓ Truly bad traps are incredibly rare

It's one of the luckiest coincidences in the history of AI,

and we're still trying to fully understand why.

Empirically, gradient descent just... works.

Now you understand the core engine.

But so far, our valley only has one dimension. We can only move left or right.

A real neural network is like a massive soundboard with a million knobs to tune.

How do you adjust all of them at once to find the perfect sound?

How do you figure out the slope in a million different directions simultaneously?

Alright, so we've mastered finding the bottom of a 1D valley. We can move left and right.

But a real neural network isn't a single slider.

It's a massive soundboard with millions of knobs.

How do you find the steepest downhill path...

...when 'downhill' is in a million different directions?

The answer is surprisingly elegant.

You don't.

Instead, you figure out the slope for each knob individually...

...as if it were the only one you were turning.

You focus on one knob at a time...

listen to its effect...

work out how it should change... and move to the next.

This is the core intuition behind one of the most important tools in machine learning:

The Partial Derivative.

When our function has multiple variables...

We can't just ask for "the slope"

We need to specify which direction

This uses the partial derivative symbol:

\[ \frac{\partial f}{\partial x_1} \]

"Slope in the x₁ direction only"

And for the other direction:

\[ \frac{\partial f}{\partial x_2} \]

"Slope in the x₂ direction only"

Our 3D Example: The Bowl

Let's upgrade our valley to 3D with the function:

\[ f(x_1, x_2) = x_1^2 + 2x_2^2 \]

This creates a beautiful, oval-shaped bowl. The lowest point is at (0, 0).

So how do we calculate these partial derivatives?

Here is the one magic rule you need to remember.

To find the partial derivative with respect to one variable...

...you treat ALL OTHER variables as if they are just constant numbers.

Let's find \( \frac{\partial f}{\partial x_1} \) for \( f = x_1^2 + 2x_2^2 \)

1. Pretend \(x_2\) is frozen. The \(2x_2^2\) term becomes a constant.

2. Constants disappear when we take derivatives.

3. Only \(x_1^2\) remains.

Answer: \( \frac{\partial f}{\partial x_1} = 2x_1\)

Now let's find \( \frac{\partial f}{\partial x_2} \) for \( f = x_1^2 + 2x_2^2 \)

1. Pretend \(x_1\) is frozen. The \(x_1^2\) term becomes a constant.

2. Constants disappear when we take derivatives.

3. Only \(2x_2^2\) remains.

Answer: \( \frac{\partial f}{\partial x_2} = 4x_2\)

You've just calculated a multi-dimensional gradient.
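If you don't trust the "freeze the other variable" rule, you can test it numerically. This is a minimal sketch using a central-difference estimate; the helper name `partial` and the step size `h` are illustrative assumptions.

```python
def f(x1, x2):
    return x1**2 + 2 * x2**2


def partial(func, point, index, h=1e-6):
    """Estimate the slope along one coordinate while holding the others fixed."""
    up = list(point)
    down = list(point)
    up[index] += h
    down[index] -= h
    return (func(*up) - func(*down)) / (2 * h)


point = (3.0, 2.0)
slope_x1 = partial(f, point, 0)  # compare with the formula 2 * x1
slope_x2 = partial(f, point, 1)  # compare with the formula 4 * x2

print(slope_x1, 2 * point[0])
print(slope_x2, 4 * point[1])
```

Nudging only one coordinate at a time is exactly the "treat everything else as a constant" idea, just done with numbers instead of algebra.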

With this tool, our Gradient Descent algorithm gets a simple upgrade.

Instead of one update rule, we now have one for each variable...

and here's the key: we apply them all at the same time.


INPUT: function f(x1,x2)
FOR 100 iterations:
  grad_x1 = ∂f/∂x1
  grad_x2 = ∂f/∂x2

  x1 = x1 - η × grad_x1
  x2 = x2 - η × grad_x2
RETURN (x1,x2)

Let's Navigate the 3D Bowl

Function: \( f(x_1, x_2) = x_1^2 + 2x_2^2 \)

Start at random point: (x1, x2) = (3, 2)

Learning rate: η = 0.1

Initial Error: \( f(3,2) = 3^2 + 2(2^2) = \mathbf{17} \)

Let's watch the algorithm work.

Let's Watch It In Action

Function: \( f(x_1, x_2) = x_1^2 + 2x_2^2 \), Start: (3, 2), Learning Rate: η = 0.1

Gradients: \( \frac{\partial f}{\partial x_1} = 2x_1 \), \( \frac{\partial f}{\partial x_2} = 4x_2 \)

Iter	x₁	x₂	f(x₁,x₂)	∂f/∂x₁=2x₁	∂f/∂x₂=4x₂	New (x₁,x₂) ← (x₁-0.1×2x₁, x₂-0.1×4x₂)
0	3.000	2.000	17.000	6.000	8.000	(2.40, 1.20) ← (3-0.1×6, 2-0.1×8)
1	2.400	1.200	8.640	4.800	4.800	(1.92, 0.72) ← (2.4-0.1×4.8, 1.2-0.1×4.8)
2	1.920	0.720	4.723	3.840	2.880	(1.54, 0.43) ← (1.92-0.1×3.84, 0.72-0.1×2.88)
3	1.536	0.432	2.733	3.072	1.728	(1.23, 0.26) ← (1.54-0.1×3.07, 0.43-0.1×1.73)
...	...	...	...	...	...	...
10	0.322	0.012	0.104	0.644	0.048	(0.26, 0.007) ← (0.32-0.1×0.64, 0.012-0.1×0.048)

Look at that beautiful convergence!

In the first step, the gradient tells it to move 0.6 in the `x1` direction and 0.8 in the `x2` direction...

...slashing the error in half.

As it gets closer to the bottom, the slopes get smaller, so it takes smaller, more careful steps.

`x1` and `x2` both shrink steadily towards zero, homing in on the minimum of our 3D bowl.
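Here's a minimal Python sketch of that two-variable descent, so you can rerun it yourself. The variable names are illustrative; the start point and learning rate are the ones used in the walkthrough.

```python
def grad(x1, x2):
    # Partial derivatives of f(x1, x2) = x1^2 + 2*x2^2
    return 2 * x1, 4 * x2


x1, x2, lr = 3.0, 2.0, 0.1
history = [(x1, x2)]

for _ in range(10):
    g1, g2 = grad(x1, x2)                   # both slopes come from the CURRENT point
    x1, x2 = x1 - lr * g1, x2 - lr * g2     # then both knobs move at the same time
    history.append((x1, x2))

for i, (a, b) in enumerate(history):
    print(f"iter {i:2d}: x1 = {a:.3f}, x2 = {b:.3f}, f = {a * a + 2 * b * b:.3f}")
```

Notice that both gradients are computed before either variable is changed. That is what "apply them all at the same time" means in practice.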

This is it. This is the fundamental technique for training a neural network.

We treat every single weight as its own "knob".

We calculate its partial derivative—its individual contribution to the total error...

...and then we nudge it slightly in the right direction.

Scale this up from 2 knobs to 2 million, and you have modern machine learning.

But this raises a new, much more subtle problem.

In our bowl example, `x1` and `x2` directly affected the final error.

In a deep neural network...

A weight in the first layer doesn't directly touch the final error.

Its influence travels through a long, complex chain.

How do you calculate the "blame" for a single knob when its effect is buried 20 layers deep?

This is the single biggest problem in deep learning...

...and its solution is one of the most elegant ideas in all of mathematics.

The Chain Rule.

The Formula

\[ \frac{dy}{dx} = \frac{dy}{du} \times \frac{du}{dx} \]

This simple formula is the masterstroke.

It's what makes deep learning possible.

And the intuition behind it?

It's literally a blame game.

Imagine your team's final presentation fails. That's the error.

To find out why, you trace the problem backward:

1. The presentation was bad...

2. ...because the slides were confusing. (50% blame)

3. ...because the data analysis was flawed. (80% blame)

4. ...because the data collection was sloppy. (90% blame)

To find out how much the initial data collector is responsible for the final failed presentation...

...you just multiply the blame at each step.

90% × 80% × 50% = 36%

The Chain Rule does exactly this.

It multiplies the influence at each link in the chain to find the total impact of a variable far, far away.

Let's see this mathematical blame game in action.

The Problem: A Nested Function

Look at this beast of a function:

\[ f(x_1, x_2) = ((2x_1 + x_2)^2 + 3x_2^2)^3 \]

This looks intimidating. But we can break it down.

First, we calculate \( u = 2x_1 + x_2 \)
Then, \( v = u^2 + 3x_2^2 \)
Finally, \( f = v^3 \)

The chain of influence is clear:

%%{init: {'theme': 'dark'}}%% graph LR A[(x1, x2)] --> B[u] B --> C[v] C --> D[f]

Finding \( \frac{\partial f}{\partial x_1} \): Tracing the Blame

\( f(x_1, x_2) = ((2x_1 + x_2)^2 + 3x_2^2)^3 \), where \( u = 2x_1 + x_2 \), \( v = u^2 + 3x_2^2 \), \( f = v^3 \)

Step	Question	Function	Derivative
1	How much does \(f\) blame \(v\)?	\( f = v^3 \)	\( 3v^2 \)
2	How much does \(v\) blame \(u\)?	\( v = u^2 + 3x_2^2 \)	\( 2u \)
3	How much does \(u\) blame \(x_1\)?	\( u = 2x_1 + x_2 \)	\( 2 \)

Total Blame: \( \frac{\partial f}{\partial x_1} = \frac{\partial f}{\partial v} \times \frac{\partial v}{\partial u} \times \frac{\partial u}{\partial x_1} = 3v^2 \times 2u \times 2 = \mathbf{12uv^2} \)

Now for \( \frac{\partial f}{\partial x_2} \). It's trickier!

\(x_2\) influences \(v\) in two ways: indirectly through \(u\), and directly.

%%{init: {'theme': 'dark'}}%% graph LR A(x2) --> B(u) B --> C(v) A -- direct --> C C --> D(f)

The total blame is just the sum of the blame from all paths.

Finding \( \frac{\partial f}{\partial x_2} \): Summing the Blame

\( \frac{\partial f}{\partial x_2} = \frac{\partial f}{\partial v} \times \frac{\partial v}{\partial x_2} \). We already know \( \frac{\partial f}{\partial v} = 3v^2 \), so all we need is \( \frac{\partial v}{\partial x_2} \), summed over both paths.

Path	Calculation	Result
Indirect: \(x_2 \rightarrow u \rightarrow v\)	\( \frac{\partial v}{\partial u} \times \frac{\partial u}{\partial x_2} = 2u \times 1 \)	\( 2u \)
Direct: \(x_2 \rightarrow v\)	\( \frac{\partial}{\partial x_2}(3x_2^2) \)	\( 6x_2 \)
Total \( \frac{\partial v}{\partial x_2} \)	Sum both paths	\( 2u + 6x_2 \)

Final Result: \( \frac{\partial f}{\partial x_2} = 3v^2 \times (2u + 6x_2) \)

And just like that...

the Chain Rule has untangled that complex nested function for us.
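A good habit: never trust a hand-derived gradient until you've checked it. This minimal Python sketch compares the two chain-rule results against a brute-force numerical slope. The test point `(0.5, -0.3)` and the step size `h` are arbitrary assumptions.

```python
def f(x1, x2):
    u = 2 * x1 + x2
    v = u**2 + 3 * x2**2
    return v**3


def analytic_grad(x1, x2):
    """Gradients from the chain rule: 12*u*v^2 and 3*v^2*(2u + 6*x2)."""
    u = 2 * x1 + x2
    v = u**2 + 3 * x2**2
    df_dv = 3 * v**2
    return df_dv * 2 * u * 2, df_dv * (2 * u + 6 * x2)


def numeric_grad(x1, x2, h=1e-6):
    """Central-difference slopes, one variable at a time."""
    d1 = (f(x1 + h, x2) - f(x1 - h, x2)) / (2 * h)
    d2 = (f(x1, x2 + h) - f(x1, x2 - h)) / (2 * h)
    return d1, d2


a1, a2 = analytic_grad(0.5, -0.3)
n1, n2 = numeric_grad(0.5, -0.3)
print(f"df/dx1: chain rule {a1:.5f}, numeric {n1:.5f}")
print(f"df/dx2: chain rule {a2:.5f}, numeric {n2:.5f}")
```

If the two columns agree to several decimal places, the blame was traced correctly along every path.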

Now, look what we can do.


FOR 100 iterations:
  # Calculate current values
  u = 2*x1 + x2
  v = u**2 + 3*x2**2
  
  # Calculate the blame for each variable
  grad_x1 = 12 * u * v**2
  grad_x2 = (3 * v**2) * (2*u + 6*x2)

  # Nudge each variable in the right direction
  x1 = x1 - η * grad_x1
  x2 = x2 - η * grad_x2
RETURN (x1,x2)

This is the key.

The Chain Rule gives us a systematic way to find the gradient for any variable...

...no matter how deeply it's buried inside a complex function.

This is the final piece of the puzzle.

1. We know how to go downhill (Gradient Descent).

2. We know how to find the slope for each knob (Partial Derivatives).

3. And now, we can trace blame through a long chain (The Chain Rule).

We have all the mathematical tools we need.

Now, it's time to stop playing with abstract functions.

Let's use these tools to build our very first, functioning AI brain...

...and watch it actually learn, right before your eyes.

So, what even is a 'Neural Network'?

Forget the hype. Forget the sci-fi.

It's just a collection of simple functions, called "neurons," organized in "layers."

Each neuron is just a tiny calculation. Nothing magical.


f(a,b) = a + b²

We stack them in layers, like an assembly line.

This creates a Forward Pass—a one-way flow of information from data to answer.

%%{init: {'theme': 'dark'}}%% graph TD A[Inputs] --> B(Layer 1) B --> C(Layer 2) C --> D[Final Answer]

Let's build one right now.

Our First AI Brain: The Architecture

Our network has 2 inputs, 2 hidden neurons, 1 output neuron, and 5 tunable weights (w1, w2, w3, w4, w5).

Layer 1 (Hidden):


neuron 1: h1 = (x1 + w1*x2)²
neuron 2: h2 = w2*x1*x2

Layer 2 (Output):


y_pred = w3*h1 + w4*h2 + w5

%%{init: {'theme': 'dark'}}%% graph LR x1((x1)) --> h1((h1)) x1 --> h2((h2)) x2((x2)) --> h1 x2 --> h2 h1 --> y((Y)) h2 --> y

The Task: Learn a New Function

Target Function: \( f(x_1,x_2) = 2x_1^2 + 3x_2 \)

We collect three training examples:

Input (x1, x2)	True Output (y_true)
(3, 2)	2×3² + 3×2 = 18 + 6 = 24
(1, 4)	2×1² + 3×4 = 2 + 12 = 14
(2, 1)	2×2² + 3×1 = 8 + 3 = 11

The Setup: An Untrained Brain

Training Examples: (3,2)→24, (1,4)→14, (2,1)→11

Architecture: \( h_1=(x_1+w_1 \times x_2)^2 \), \( h_2=w_2 \times x_1 \times x_2 \), \( y=w_3 \times h_1+w_4 \times h_2+w_5 \)

Initial Random Weights: w1=1, w2=2, w3=1, w4=1, w5=0

Our network is a blank slate. Let's see how badly it does.

%%{init: {'theme': 'dark'}}%% graph LR x1((x1)) --> h1((h1)) x1 --> h2((h2)) x2((x2)) --> h1 x2 --> h2 h1 --> y((Y)) h2 --> y

This is the Forward Pass.

We just feed inputs through and see what comes out.

No learning yet. Just pure calculation.

Forward Pass Results

Weights: w1=1, w2=2, w3=1, w4=1, w5=0

Formulas: \( h_1=(x_1+w_1 \times x_2)^2 \), \( h_2=w_2 \times x_1 \times x_2 \), \( y=w_3 \times h_1+w_4 \times h_2+w_5 \)

%%{init: {'theme': 'dark'}}%% graph LR x1((x1)) --> h1((h1)) x1 --> h2((h2)) x2((x2)) --> h1 x2 --> h2 h1 --> y((Y)) h2 --> y

Input (x1, x2)	True Output	h1	h2	Predicted Output (y)	Difference (y_true - y)
(3, 2)	24	(3+1×2)² = 5² = 25	2×3×2 = 12	1×25+1×12+0 = 37	24-37 = -13
(1, 4)	14	(1+1×4)² = 5² = 25	2×1×4 = 8	1×25+1×8+0 = 33	14-33 = -19
(2, 1)	11	(2+1×1)² = 3² = 9	2×2×1 = 4	1×9+1×4+0 = 13	11-13 = -2

Ouch. Not great.

For the second example, it was off by a massive 19 points.

But how do we quantify this failure?

We need a single number that tells us, overall, how wrong our network is.

We can't just add them up...

...because a `-19` and a `+19` would cancel out, making it look perfect when it's terrible.

The solution?

We SQUARE the differences.

Why Squaring Works:

1. Makes Every Error Positive

No more canceling out!

(-13)² = 169

2. Punishes Big Mistakes

Huge penalties for large errors!

Error 2 → Penalty 4
Error 19 → Penalty 361

Let's calculate the Total Squared Error.

Error = (-13)² + (-19)² + (-2)²

Error = 169 + 361 + 4

Total Squared Error = 534

There it is: 534
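Here's the whole Forward Pass and error calculation as a minimal Python sketch. The names `DATA`, `WEIGHTS`, and `forward` are illustrative; the formulas, weights, and training examples are the ones from the walkthrough.

```python
# Training examples: ((x1, x2), y_true)
DATA = [((3, 2), 24), ((1, 4), 14), ((2, 1), 11)]

# Initial weights (w1, w2, w3, w4, w5)
WEIGHTS = (1, 2, 1, 1, 0)


def forward(x1, x2, w):
    """Run one input through the hidden layer and the output neuron."""
    w1, w2, w3, w4, w5 = w
    h1 = (x1 + w1 * x2) ** 2
    h2 = w2 * x1 * x2
    return w3 * h1 + w4 * h2 + w5


predictions = [forward(x1, x2, WEIGHTS) for (x1, x2), _ in DATA]
total_squared_error = sum((y - p) ** 2 for (_, y), p in zip(DATA, predictions))

print(predictions)
print(total_squared_error)
```

Squaring happens inside the sum, so it makes no difference whether a prediction is too high or too low.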

That is our antagonist. Our enemy.

Our goal for the rest of this video...

...is to make this number—534—

...as close to zero as possible.

We've completed the Forward Pass.

Our network has made its first, terrible predictions, and we've measured its failure.

Now for the 'Aha!' moment.

We're going to take all the tools we've learned

—Gradient Descent, Partial Derivatives, and the Chain Rule—

and unleash them on this network.

Remember the "Blame Game" with the Chain Rule?

Tracing responsibility back through a complex chain?

And remember our complex nested function from that section?

\[ f = ((2x_1 + x_2)^2 + 3x_2^2)^3 \]

Here is the single most important insight of this entire video.

A Neural Network IS just a giant, nested function.

That's it. That's the whole secret.

Look at the comparison.

	Math Functions	Neural Networks
Structure	Nested operations `f(g(h(x)))`	Layered operations `Layer2(Layer1(inputs))`
Variables	We control `x1, x2`	We control ???
Target	Minimize `f(x1,x2)`	Minimize `Error(???)`

This raises the most critical question:

What are the "knobs" we are allowed to tune in our neural network? What are the variables?

Is it the inputs, `x1` and `x2`?

NO!

The inputs are the data we're given.

We can't change the fact that we were given `(x1=3, x2=2)`.

That's the problem we have to solve.

The ONLY things we have control over...

... are the weights. `w1, w2, w3, w4, w5`.

This is the breakthrough.

The Real Comparison

Math Functions

Variables we controlled:
x1, x2

Target: Minimize
f(x1, x2)

Neural Networks

Variables we control:
w1, w2, w3, w4, w5

Target: Minimize
Error(w1, w2, w3, w4, w5)

We're not trying to find the best inputs.

We're trying to find the best weights that turn the inputs into the correct outputs.

We already know exactly how to do this!

We use Gradient Descent to find the values of our variables...

...that minimize a function's output.

The process is identical.

It's time to play the Blame Game.

Let's focus on that first terrible prediction.

The Crime Scene

Input: `(3, 2)`, Target: `24`, Prediction: `37`, Error = (24 - 37)² = 169

Weights: w1=1, w2=2, w3=1, w4=1, w5=0

Layer 1 (Hidden):


neuron 1: h1 = (x1 + w1*x2)²
             = (3 + 1*2)² = 25
neuron 2: h2 = w2*x1*x2
             = 2*3*2 = 12

Layer 2 (Output):


y_pred = w3*h1 + w4*h2 + w5
       = 1*25 + 1*12 + 0 = 37

%%{init: {'theme': 'dark'}}%% graph LR x1((x1)) --> h1((h1)) x1 --> h2((h2)) x2((x2)) --> h1 x2 --> h2 h1 --> y((Y)) h2 --> y

Let's start the investigation with our first suspect: `w5`.

Its path to the error is short and direct:

Error ← y_pred ← w5

Investigating `w5` - The Crime Scene Context

Input: `(3, 2)`, Target: `24`, Prediction: `37`, Error = (24 - 37)² = 169

Weights: w1=1, w2=2, w3=1, w4=1, w5=0

Forward Pass:


x1 = 3, x2 = 2
w1 = 1, w2 = 2, w3 = 1,
w4 = 1, w5 = 0

h1 = (x1 + w1*x2)² 
   = (3 + 1*2)² = 5² = 25

h2 = w2*x1*x2 = 2*3*2 = 12

y_pred = w3*h1 + w4*h2 + w5 
       = 1*25 + 1*12 + 0 = 37

Error = (y_true - y_pred)² 
      = (24 - 37)² = 169

Path: Error ← y_pred ← w5

\[ \frac{\partial Error}{\partial w_5} = \frac{\partial Error}{\partial y_{pred}} \times \frac{\partial y_{pred}}{\partial w_5} \]

1. \( \frac{\partial Error}{\partial y_{pred}} \): deriv of \( (y_{true} - y_{pred})^2 \) w.r.t. `y_pred` is \( 2(y_{pred} - y_{true}) \). So, \( 2 \times (37 - 24) = \mathbf{26} \)

2. \( \frac{\partial y_{pred}}{\partial w_5} \): deriv of \( w_3 h_1 + w_4 h_2 + w_5 \) w.r.t. `w5` is 1 (because the rest are constants)

Total Blame for `w5` = \( 26 \times 1 = \mathbf{\color{#ff6b6b}{26}} \)

Now for a tougher suspect: `w1`

It's buried much deeper. Its path to the error is longer:

Error ← y_pred ← h1 ← w1

This means more chain rule steps!

\[ \frac{\partial Error}{\partial w_1} = \frac{\partial Error}{\partial y_{pred}} \times \frac{\partial y_{pred}}{\partial h_1} \times \frac{\partial h_1}{\partial w_1} \]

1. \( \frac{\partial Error}{\partial y_{pred}} \): Wait... we already calculated this. It's 26.

2. \( \frac{\partial y_{pred}}{\partial h_1} \): deriv of \( w_3 h_1 + w_4 h_2 + w_5 \) w.r.t. `h1` is just `w3` = 1.

3. \( \frac{\partial h_1}{\partial w_1} \): deriv of \( (x_1 + w_1 x_2)^2 \) w.r.t. `w1` is \( 2(x_1+w_1 x_2) \cdot x_2 \)
So, \( 2 \times (3 + 1 \times 2) \times 2 = 2 \times 5 \times 2 = \mathbf{20} \).

Total Blame for `w1` = \( 26 \times 1 \times 20 = \mathbf{\color{#ff6b6b}{520}} \)

Now stop.

Look at what we just did. Look closer.

\[ \frac{\partial Error}{\partial w_5} = {\color{#ff6b6b}{\frac{\partial Error}{\partial y_{pred}}}} \times \frac{\partial y_{pred}}{\partial w_5} = {\color{#ff6b6b}{26}} \times 1 = 26 \]

\[ \frac{\partial Error}{\partial w_1} = {\color{#ff6b6b}{\frac{\partial Error}{\partial y_{pred}}}} \times \frac{\partial y_{pred}}{\partial h_1} \times \frac{\partial h_1}{\partial w_1} = {\color{#ff6b6b}{26}} \times 1 \times 20 = 520 \]

Do you see it? The term `∂Error/∂y_pred = 26` is in BOTH calculations!

This is the multi-trillion dollar trick that makes deep learning efficient.

We don't recalculate everything from scratch for every weight.

We calculate the error gradient once...

...then propagate it backwards, reusing calculations.

This is Backpropagation.

It's not magic.

It's just being clever and not re-doing work you've already done.

Systematically Finding the Blame

\[ \frac{\partial Error}{\partial w_5} = {\color{#ff6b6b}{\frac{\partial Error}{\partial y_{pred}}}} \times \frac{\partial y_{pred}}{\partial w_5}, \quad \frac{\partial Error}{\partial w_4} = {\color{#ff6b6b}{\frac{\partial Error}{\partial y_{pred}}}} \times \frac{\partial y_{pred}}{\partial w_4}, \quad \frac{\partial Error}{\partial w_3} = {\color{#ff6b6b}{\frac{\partial Error}{\partial y_{pred}}}} \times \frac{\partial y_{pred}}{\partial w_3} \]

\[ \frac{\partial Error}{\partial w_2} = {\color{#ff6b6b}{\frac{\partial Error}{\partial y_{pred}}}} \times {\color{#4CAF50}{\frac{\partial y_{pred}}{\partial h_2}}} \times \frac{\partial h_2}{\partial w_2}, \quad \frac{\partial Error}{\partial w_1} = {\color{#ff6b6b}{\frac{\partial Error}{\partial y_{pred}}}} \times {\color{#00bfff}{\frac{\partial y_{pred}}{\partial h_1}}} \times \frac{\partial h_1}{\partial w_1} \]

Forward Pass: Computing the Output


# Inputs: x1, x2
# Weights: w1, w2, w3, w4, w5

# Hidden Layer
h1 = (x1 + w1*x2)²
h2 = w2*x1*x2

# Output Layer  
y_pred = w3*h1 + w4*h2 + w5

# Error
Error = (y_true - y_pred)²

%%{init: {'theme': 'dark'}}%% graph LR input1[ ] --> |"x1,x2"| h1((h1)) input2[ ] --> |"x1,x2"| h2((h2)) w1[w1] --> h1 w2[w2] --> h2 h1 --> y((Y)) h2 --> y w345[w3,w4,w5] --> y y --> |"y_true"| error((Error)) style input1 fill:transparent,stroke:transparent style input2 fill:transparent,stroke:transparent classDef weightNode fill:#4CAF50,stroke:#fff classDef variableNode fill:#333,stroke:#fff class w1,w2,w345 weightNode class h1,h2,y,error variableNode

Backward Pass: Computing Gradients


# Start with error gradient
∂E/∂y_pred = 2(y_pred - y_true)

# Output layer weights
∂E/∂w5 = ∂E/∂y_pred * 1
∂E/∂w4 = ∂E/∂y_pred * h2  
∂E/∂w3 = ∂E/∂y_pred * h1

# Hidden layer gradients
∂E/∂h1 = ∂E/∂y_pred * w3
∂E/∂h2 = ∂E/∂y_pred * w4

# Hidden layer weights  
∂E/∂w1 = ∂E/∂h1 * 2(x1+w1*x2)*x2
∂E/∂w2 = ∂E/∂h2 * x1*x2

%%{init: {'theme': 'dark'}}%% graph RL h1((h1)) --> |"x1,x2"| input1[ ] h2((h2)) --> |"x1,x2"| input2[ ] h1 --> |"∂h1/∂w1"| w1[w1] h2 --> |"∂h2/∂w2"| w2[w2] y((Y)) --> |"∂Y/∂h1"| h1 y --> |"∂Y/∂h2"| h2 y --> |"∂Y/∂w3,w4,w5"| w345[w3,w4,w5] error((Error)) --> |"∂E/∂Y"| y style input1 fill:transparent,stroke:transparent style input2 fill:transparent,stroke:transparent classDef weightNode fill:#4CAF50,stroke:#fff classDef variableNode fill:#333,stroke:#fff class w1,w2,w345 weightNode class h1,h2,y,error variableNode

Computing All Gradients: ∂E/∂w

Inputs: x1=3, x2=2, y_true=24

Weights: w1=1, w2=2, w3=1, w4=1, w5=0

Forward Pass: h1=25, h2=12, y_pred=37

Error: E=(37-24)²=169

Actual Calculations:

∂E/∂y_pred = 2(37 - 24) = 26

∂E/∂w5 = 26 × 1 = 26

∂E/∂w4 = 26 × 12 = 312

∂E/∂w3 = 26 × 25 = 650

∂E/∂h1 = 26×1 = 26

∂E/∂h2 = 26×1 = 26

∂E/∂w2 = 26 × (3×2) = 26 × 6 = 156

∂E/∂w1 = 26 × 2×(3+1×2)×2 = 26 × 2×5×2 = 520
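The backward pass translates almost line for line into code. This is a minimal Python sketch under the same formulas; the function names `forward` and `backward` are illustrative, not from any framework.

```python
def forward(x1, x2, w):
    """Return the hidden values and the prediction for one input."""
    w1, w2, w3, w4, w5 = w
    h1 = (x1 + w1 * x2) ** 2
    h2 = w2 * x1 * x2
    y_pred = w3 * h1 + w4 * h2 + w5
    return h1, h2, y_pred


def backward(x1, x2, y_true, w):
    """Return (dE/dw1, dE/dw2, dE/dw3, dE/dw4, dE/dw5) for one example."""
    w1, w2, w3, w4, w5 = w
    h1, h2, y_pred = forward(x1, x2, w)

    dE_dy = 2 * (y_pred - y_true)   # computed once, reused everywhere below

    # Output layer weights
    dE_dw5 = dE_dy * 1
    dE_dw4 = dE_dy * h2
    dE_dw3 = dE_dy * h1

    # Blame passed back to the hidden neurons
    dE_dh1 = dE_dy * w3
    dE_dh2 = dE_dy * w4

    # Hidden layer weights
    dE_dw1 = dE_dh1 * 2 * (x1 + w1 * x2) * x2
    dE_dw2 = dE_dh2 * x1 * x2

    return dE_dw1, dE_dw2, dE_dw3, dE_dw4, dE_dw5


grads = backward(3, 2, 24, (1, 2, 1, 1, 0))
print(grads)
```

The single `dE_dy` value feeds every line that follows. That reuse is the whole point of backpropagation.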

Gradient Descent: Using Our Computed Gradients

Learning Rate: η = 0.0001

Update Rule: new_weight = old_weight - η × gradient

Computed Gradients:

∂E/∂w5 = 26

∂E/∂w4 = 312

∂E/∂w3 = 650

∂E/∂w2 = 156

∂E/∂w1 = 520

Weight Updates:

Weight	Old	Calculation	New
w5	0	0 - 0.0001×26	-0.0026
w4	1	1 - 0.0001×312	0.9688
w3	1	1 - 0.0001×650	0.935
w2	2	2 - 0.0001×156	1.9844
w1	1	1 - 0.0001×520	0.948

That's it. The learning has happened.

We played the Blame Game, and we commanded each weight to adjust itself.

But... did it work?

This is the moment of truth.

Forward Pass with Updated Weights

Inputs: x1=3, x2=2, y_true=24

NEW Weights: w1=0.948, w2=1.9844, w3=0.935, w4=0.9688, w5=-0.0026

Formulas: h₁=(x₁+w₁×x₂)², h₂=w₂×x₁×x₂, y=w₃×h₁+w₄×h₂+w₅

Question: Did the weights improve our prediction?

Step-by-Step Calculation:

h₁ = (x₁ + w₁×x₂)²
= (3 + 0.948×2)² = (4.896)² = 23.97

h₂ = w₂×x₁×x₂
= 1.9844×3×2 = 11.91

y = w₃×h₁ + w₄×h₂ + w₅
= 0.935×23.97 + 0.9688×11.91 - 0.0026

y = 22.41 + 11.54 - 0.0026 = 33.95

%%{init: {'theme': 'dark'}}%% graph LR x1((x1)) --> h1((h1)) x1 --> h2((h2)) x2((x2)) --> h1 x2 --> h2 h1 --> y((Y)) h2 --> y

Improvement: 37 → 33.95 ✓

Let's compare.

	Before Learning	After ONE Update	Target
Prediction	37	33.95	24
How far off	13 units	~10 units	0

YES! It worked!

The prediction moved from 37 down to 33.95.

It got closer to the target of 24. The error got smaller.
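To rerun the moment of truth yourself, here's a minimal Python sketch of one full update on the `(3, 2) -> 24` example. The helper names are illustrative; the learning rate and starting weights are the ones used above.

```python
def forward(x1, x2, w):
    w1, w2, w3, w4, w5 = w
    h1 = (x1 + w1 * x2) ** 2
    h2 = w2 * x1 * x2
    return h1, h2, w3 * h1 + w4 * h2 + w5


def gradients(x1, x2, y_true, w):
    """Gradients of the squared error in the order (w1, w2, w3, w4, w5)."""
    w1, w2, w3, w4, w5 = w
    h1, h2, y_pred = forward(x1, x2, w)
    d_y = 2 * (y_pred - y_true)
    return (
        d_y * w3 * 2 * (x1 + w1 * x2) * x2,
        d_y * w4 * x1 * x2,
        d_y * h1,
        d_y * h2,
        d_y,
    )


lr = 0.0001
old_w = (1, 2, 1, 1, 0)
grads = gradients(3, 2, 24, old_w)

# new_weight = old_weight - lr * gradient, applied to every weight at once
new_w = tuple(w - lr * g for w, g in zip(old_w, grads))

old_pred = forward(3, 2, old_w)[2]
new_pred = forward(3, 2, new_w)[2]
print(f"before: {old_pred}, after: {new_pred:.2f}, target: 24")
```

Small differences in the last decimal place compared with the hand calculation come from rounding the intermediate values on the slide.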

This is not a guess. This is not magic.

This is intelligence emerging directly from mathematics.

Following the chain of blame backwards

Nudging parameters downhill

The network learns automatically

Repeat this process a thousand times, and it will get even closer.

A million times, and it will be nearly perfect.

But hang on.

Think about what we just did.

We trained our network on one single example.

`(3, 2) -> 24`

That's like studying for a final exam by memorizing the answer to a single practice question.

You might ace that one question, but you haven't learned the concept.

You'll fail the actual test.

To become truly intelligent, our network can't just memorize one data point.

It needs to find the underlying pattern that works for ALL of our examples.

So, the next logical question is:

How do you learn from the entire crowd of data at once?

There are two main strategies for this.

Online Learning

Look at one example.
Calculate the blame.
Update the weights immediately.
Move to the next example.

Batch Learning

Look at ALL examples.
Calculate the blame for each one.
Add up ALL the blame.
Update the weights ONCE, at the end.

The Algorithms Side-by-Side

Let's say we have N training examples: {(x₁, y₁), (x₂, y₂), ..., (xₙ, yₙ)}

Online Learning Formula:

For each example i = 1 to N:
  // Calculate gradient for ONE example
  gradient_i = ∇Loss(θ, xᵢ, yᵢ)
  // Update weights IMMEDIATELY
  θ = θ - η × gradient_i

Batch Learning Formula:

total_gradient = 0
For each example i = 1 to N:
  // Calculate gradient for this example
  gradient_i = ∇Loss(θ, xᵢ, yᵢ)
  // ADD to the running total
  total_gradient += gradient_i
// Update weights ONCE at the end
θ = θ - η × total_gradient

Online learning is impulsive.

It learns from one example and immediately changes its mind.

Batch learning is more deliberate.

It listens to the opinion of every single example before making a collective, democratic decision.
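To see the difference in isolation, here's a minimal Python sketch on a deliberately tiny, hypothetical model: a single parameter `theta` predicting `theta * x`. The model, the toy data, and the learning rate are all assumptions made for illustration; they are not the network from this video.

```python
def grad(theta, x, y):
    """Gradient of the squared error (theta * x - y)**2 with respect to theta."""
    return 2 * (theta * x - y) * x


def online_epoch(theta, data, lr):
    """Update after EVERY example, so later examples see the changed theta."""
    for x, y in data:
        theta -= lr * grad(theta, x, y)
    return theta


def batch_step(theta, data, lr):
    """Sum the gradients of ALL examples at the same theta, then update once."""
    total = sum(grad(theta, x, y) for x, y in data)
    return theta - lr * total


data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # toy data following y = 2x
theta_online = online_epoch(0.0, data, lr=0.01)
theta_batch = batch_step(0.0, data, lr=0.01)

print(theta_online, theta_batch)
```

Both move `theta` toward 2, but they land in slightly different places after one pass, because the online version changes its mind between examples.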

Let's try it.

Let's perform one update using Batch Learning and see what happens.

Batch Learning Setup

We'll start from our original, dumb weights:

w1=1, w2=2, w3=1, w4=1, w5=0

Step 1: The Forward Pass for ALL Examples

Example	(x1,x2)	y_true	h1	h2	y_pred	Error²
1 (already done)	(3, 2)	24	25	12	37	169
2	(1, 4)	14	25	8	33	361
3	(2, 1)	11	9	4	13	4

Total Error: 169 + 361 + 4 = 534

Step 2: Calculate Gradients for ALL Examples

We play the blame game for each example, but we don't update the weights yet.

Example	∂E/∂w1	∂E/∂w2	∂E/∂w3	∂E/∂w4	∂E/∂w5
1 (already done)	520	156	650	312	26
2	1520	152	950	304	38
3	24	8	36	16	4

Quick check for example 2: ∂E/∂y_pred = 2(33 - 14) = 38, so ∂E/∂w1 = 38 × 1 × 2×(1+1×4)×4 = 38 × 40 = 1520.

Quick check for example 3: ∂E/∂y_pred = 2(13 - 11) = 4, so ∂E/∂w2 = 4 × 1 × (2×1) = 8.

Step 3: Sum the Blame and Update ONCE

Example	∂E/∂w1	∂E/∂w2	∂E/∂w3	∂E/∂w4	∂E/∂w5
1 (already done)	520	156	650	312	26
2	1520	152	950	304	38
3	24	8	36	16	4
TOTAL	2064	316	1636	632	68

Now we use these total gradients to update our weights one time. (η = 0.0001)

Weight	Old	Total Gradient	Update	New Value
w1	1	2064	1 - 0.0001×2064	0.7936
w2	2	316	2 - 0.0001×316	1.9684
w3	1	1636	1 - 0.0001×1636	0.8364
w4	1	632	1 - 0.0001×632	0.9368
w5	0	68	0 - 0.0001×68	-0.0068

This update direction is a compromise.

It’s the combined direction that reduces the total error across ALL our data, not just one example.

But did it work?

Let's verify.

Verification: Forward Pass with Batch-Updated Weights

OLD Weights: w1=1, w2=2, w3=1, w4=1, w5=0

NEW Weights: w1=0.7936, w2=1.9684, w3=0.8364, w4=0.9368, w5=-0.0068

Architecture: h₁=(x₁+w₁×x₂)², h₂=w₂×x₁×x₂, y=w₃×h₁+w₄×h₂+w₅

Step-by-Step Forward Pass for ALL Examples

Example	Target	h₁ = (x₁+w₁×x₂)²	h₂ = w₂×x₁×x₂	y_pred = w₃×h₁+w₄×h₂+w₅	BEFORE	AFTER
(3, 2)	24	(3+0.7936×2)² = (4.5872)² = 21.0424	1.9684×3×2 = 11.8104	0.8364×21.0424 + 0.9368×11.8104 - 0.0068 = 17.5999 + 11.0640 - 0.0068 = 28.657	37 Error²: (24-37)² = 169	28.657 Error²: (24-28.657)² = 21.7
(1, 4)	14	(1+0.7936×4)² = (4.1744)² = 17.4256	1.9684×1×4 = 7.8736	0.8364×17.4256 + 0.9368×7.8736 - 0.0068 = 14.5748 + 7.3760 - 0.0068 = 21.944	33 Error²: (14-33)² = 361	21.944 Error²: (14-21.944)² = 63.1
(2, 1)	11	(2+0.7936×1)² = (2.7936)² = 7.8042	1.9684×2×1 = 3.9368	0.8364×7.8042 + 0.9368×3.9368 - 0.0068 = 6.5274 + 3.6880 - 0.0068 = 10.209	13 Error²: (11-13)² = 4	10.209 Error²: (11-10.209)² = 0.63

Total Error: 534 before, about 85.4 after.

🎉 ALL examples improved with a single batch update! 🎉
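
Here's a Python sketch of that whole batch update, in case you'd like to run it rather than take the arithmetic on trust. It recomputes the per-example blame with the chain rule, sums it, applies one update, and prints each example's squared error before and after. It assumes the architecture and starting weights from this walkthrough. The helper names (`forward`, `gradients`, `squared_errors`) are illustrative.

```python
def forward(w, x1, x2):
    w1, w2, w3, w4, w5 = w
    h1 = (x1 + w1 * x2) ** 2
    h2 = w2 * x1 * x2
    return h1, h2, w3 * h1 + w4 * h2 + w5


def gradients(w, x1, x2, y_true):
    """Chain rule for E = (y_pred - y_true)**2 on a single example."""
    w1, w2, w3, w4, w5 = w
    h1, h2, y_pred = forward(w, x1, x2)
    dE_dy = 2 * (y_pred - y_true)
    return (
        dE_dy * w3 * 2 * (x1 + w1 * x2) * x2,  # dE/dw1, through h1
        dE_dy * w4 * x1 * x2,                  # dE/dw2, through h2
        dE_dy * h1,                            # dE/dw3
        dE_dy * h2,                            # dE/dw4
        dE_dy,                                 # dE/dw5
    )


def squared_errors(w, examples):
    return [(y - forward(w, x1, x2)[2]) ** 2 for (x1, x2), y in examples]


examples = [((3, 2), 24), ((1, 4), 14), ((2, 1), 11)]
old_weights = (1.0, 2.0, 1.0, 1.0, 0.0)
eta = 0.0001

# Sum the blame over ALL examples without touching the weights.
total_gradient = [0.0] * 5
for (x1, x2), y_true in examples:
    for k, g in enumerate(gradients(old_weights, x1, x2, y_true)):
        total_gradient[k] += g

# One update, at the end.
new_weights = tuple(w - eta * g for w, g in zip(old_weights, total_gradient))

before = squared_errors(old_weights, examples)
after = squared_errors(new_weights, examples)
for (x, _), b, a in zip(examples, before, after):
    print(f"x={x}  error² before={b:.2f}  after={a:.2f}")
```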

One update. Works for everyone.

This is how a network learns patterns shared across its data, instead of just memorizing one example.

So, which one is better?

Pure Online Learning

Fast updates, but the learning is noisy and chaotic.

Full Batch Learning

Stable direction, but every single update needs a pass over the entire dataset, so it's incredibly slow and memory-hungry.

So what do we use in the real world?

We compromise.

We use Mini-batch Learning.

This is the sweet spot. It's the industry standard that powers virtually all modern AI.

Mini-batch:

1. Take a small batch (32, 64, or 256 examples)

2. Average the blame across that batch and update the weights

3. Repeat with the next batch

Mini-batch Learning Algorithm

For each mini-batch of size B (where B < N):
  total_gradient = 0
  For each example j in mini-batch:
    // Calculate gradient for this example
    gradient_j = ∇Loss(θ, xⱼ, yⱼ)
    // ADD to mini-batch total
    total_gradient += gradient_j
  // Update weights using the mini-batch average (dividing by B keeps the step size comparable as B changes)
  θ = θ - η × (total_gradient / B)
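
As a rough Python sketch of that loop: this again assumes a hypothetical one-weight model y = θ × x with squared-error loss and made-up data, not the network from this video, and `minibatch_epoch` is an illustrative name. Real training code usually also shuffles the data before each pass; that is left out here to keep the run deterministic.

```python
# Hypothetical toy setup (an assumption, not the video's network):
# a one-weight model y_pred = theta * x with squared-error loss.
data = [(float(x), 2.0 * x) for x in range(1, 7)]  # made-up pairs where y = 2x


def grad(theta, x, y):
    """Gradient of (theta * x - y)**2 with respect to theta."""
    return 2.0 * (theta * x - y) * x


def minibatch_epoch(theta, data, batch_size, eta):
    """One pass over the data, updating once per mini-batch."""
    for start in range(0, len(data), batch_size):
        batch = data[start:start + batch_size]
        total_gradient = sum(grad(theta, x, y) for x, y in batch)
        theta = theta - eta * (total_gradient / len(batch))  # average, then step
    return theta


theta = minibatch_epoch(theta=0.0, data=data, batch_size=2, eta=0.01)
print(f"theta after one epoch: {theta:.5f}")
```

With six examples and a batch size of 2, that's three updates per pass: more frequent than full batch, less jumpy than one example at a time.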

It's the best of both worlds.

The gradients are stable enough from the small crowd, and the memory usage is perfectly manageable.

This is how you train a network on a dataset of billions of images.

You feed it one small handful at a time.

So now we have a complete, practical learning loop that can handle massive datasets.

We have all the pieces.

The only question left is... how does this simple process, which we ran on a tiny network with only 5 weights...

...possibly scale up to train a monster like GPT-4, which is rumored to have over a trillion weights?

Is it really the same algorithm?

The answer is the most beautiful part of this entire story.

Our Tiny Network

5 weights

GPT-4

~1,760,000,000,000 weights (unofficial estimate; OpenAI has not published the number)

Identical Process:

✓ Forward Pass

✓ Backpropagation

✓ Gradient Descent

It's the same three-step dance.

Scale changes everything, and yet, it changes nothing.

The stunning truth is this:

You just mastered the core algorithm running inside every AI system on Earth.

ChatGPT, autonomous vehicles, medical diagnosis AI...

...they are all just scaled-up, engineered versions of what we just built together from scratch.

So, how do we get from our simple model to these massive ones?

The core principles don't change. Same engine, bigger scale.

1. More Layers and More Neurons

Our Network	Real Networks
2 layers	10-100+ layers
2 hidden neurons	Millions-billions of neurons
5 weights total	Billions to trillions of weights

This just means the "chain" for the Chain Rule gets much, much longer. The process is identical.
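
To make "a longer chain" concrete, here's a minimal Python sketch. It assumes a hypothetical stack of one-weight layers where each layer simply multiplies its input by its weight, which is far simpler than a real network, and the function names are illustrative. The point is that the backward loop is the same blame-passing step, repeated once per layer.

```python
def forward_chain(weights, x):
    """Run x through the layers, keeping every intermediate value."""
    activations = [x]
    for w in weights:
        activations.append(w * activations[-1])
    return activations


def backward_chain(weights, activations):
    """Gradient of the final output with respect to each layer's weight."""
    grads = [0.0] * len(weights)
    upstream = 1.0  # d(output)/d(output)
    for k in reversed(range(len(weights))):
        grads[k] = upstream * activations[k]  # local derivative: layer k's input
        upstream = upstream * weights[k]      # pass the blame one layer back
    return grads


weights = [2.0, 3.0, 4.0]  # three hypothetical layers
activations = forward_chain(weights, 1.0)
grads = backward_chain(weights, activations)
print(activations[-1], grads)
```

Adding a fourth or a hundredth layer only makes the `weights` list longer. Neither function changes.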

2. More Practical Activation Functions

Function	Formula	Gradient	Why It's Popular
ReLU	`max(0, x)`	1 if x>0, else 0	Simple, fast, and helps avoid vanishing gradients.
Sigmoid	`1/(1+e⁻ˣ)`	`sigmoid(x) * (1-sigmoid(x))`	Perfect for probabilities (0 to 1).
Our x²	`x²`	`2x`	Works, but `2x` gradient can explode.
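
The three rows of that table translate almost word for word into Python. This is only a scalar sketch with illustrative function names; real libraries apply the same formulas to whole arrays at once.

```python
import math


def relu(x):
    return max(0.0, x)


def relu_grad(x):
    return 1.0 if x > 0 else 0.0


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)


def square(x):
    return x * x


def square_grad(x):
    return 2.0 * x  # grows with x, which is why it can explode


for x in (-2.0, 0.5, 10.0):
    print(x, relu_grad(x), round(sigmoid_grad(x), 4), square_grad(x))
```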

3. Better Loss Functions for Different Jobs

Task	Loss Function	Formula	Example
Regression	Mean Squared Error	average of `(y_true - y_pred)²` over the examples	Predicting house prices
Classification	Cross-Entropy Loss	`-log(probability predicted for the correct class)`	Image recognition: 80% "cat"

The specific formula changes, but its job is always the same: give us a number to kick off the blame game.
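
Here's a small Python sketch of both losses, with illustrative function names. The cross-entropy version assumes you already have the probability the model assigned to the correct class, such as the 0.8 for "cat" in the table.

```python
import math


def mean_squared_error(y_true, y_pred):
    """Average of the squared gaps between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def cross_entropy(p_correct):
    """Loss for one example, given the probability assigned to the correct class."""
    return -math.log(p_correct)


# Regression: our three examples with the starting weights' predictions.
print(mean_squared_error([24, 14, 11], [37, 33, 13]))

# Classification: the model says 80% "cat" and the image really is a cat.
print(round(cross_entropy(0.8), 4))
```

The more confident the model is in the right answer, the closer the cross-entropy gets to zero. A confident wrong answer is punished heavily.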

And that's it. That's the whole secret.

You have seen the entire process.

From Midjourney to Tesla...

...they all use the exact same loop.

The Universal Pattern

1. Forward Pass: Make a guess.

2. Loss Function: Measure how wrong the guess is.

3. Backpropagation: Calculate the "blame" for every weight.

4. Gradient Descent: Nudge every weight in the right direction.

5. Repeat: Do this millions, or even billions, of times.

You started this video thinking neural networks were an impenetrable black box.

But now you know the truth.

It's not magic.

It's just a beautiful cascade of simple, intuitive ideas:

finding the bottom of a valley, tuning one knob at a time, and playing a clever game of blame.

You've mastered the fundamentals.

Welcome to the world of AI.
