You've tried to understand how Neural Networks actually learn, right?
%%{init: {'theme': 'dark'}}%%
graph LR
x1((x1)) --> h1((h1))
x1 --> h2((h2))
x2((x2)) --> h1
x2 --> h2
h1 --> y((Y))
h2 --> y
You open a textbook... you click on a video...
and within minutes, you're drowning.
Stochastic Gradient Descent!
It feels like an impossibly complex black box.
A machine that just... performs magic.
And you're told to just accept that it works.
Here's the secret:
You've been taught this the wrong way around.
The entire engine that powers all of modern AI...
...is built on just a few simple, incredibly intuitive ideas.
Idea #1: The Valley
How do you find the bottom of a valley when you’re stuck in a thick fog?
Idea #2: The Blame Game
And how do you figure out who to "blame" when a team project goes wrong?
Once you truly grasp these two concepts...
...all of that scary math suddenly clicks into place.
It’s not a barrier; it's just the language we use to describe a logic you already understand.
So here’s my promise to you.
Give me 40 minutes, and you will MASTER how Neural Networks learn.
We will go step-by-step, building the entire learning process from scratch.
No skipped steps. No magic.
By the end of this video, you will have a deep, foundational understanding of Neural Networks.
You won't just know the buzzwords.
You will finally get that 'Aha!' moment.
Alright, let's get straight to the code.
INPUT: function f(x)
OUTPUT: argmin_x f(x)
FOR 100 iterations:
gradient = f'(x)
x = x - η × gradient
RETURN x
This tiny loop right here?
...is the beating heart of EVERY single AI system you have ever heard of.
ChatGPT, Midjourney, Self-Driving Cars... they ALL run on this exact logic.
Every neural network is trying to get better by minimizing its "error"
—the gap between its guess and the right answer.
This algorithm is the engine that drives that error down toward zero.
The intuition is dead simple.
Remember that foggy hill I mentioned?
You're lost, you need to get to the bottom of the valley
but you can only see your own two feet.
What do you do?
You don't need to see the whole map. You just:
1. Feel the slope right where you're standing.
2. Take a small step in the steepest downhill direction.
3. Repeat.
That's it.
That's Gradient Descent.
You just feel the slope, take a step
and do it again and again until you reach the bottom where the ground is flat.
Our algorithm does the exact same thing, but with math.
Let's make this crystal clear with an example.
\[ f(x) = x^2 \]
The gradient is found with the derivative:
\[ f(x) = x^2 \implies f'(x) = 2x \]
At x = 3, slope is f'(3) = 2×3 = 6. Positive slope means "downhill" is to the left.
At x = -2, slope is f'(-2) = 2×(-2) = -4. Negative slope means "downhill" is to the right.
At x = 0, slope is f'(0) = 0. The ground is flat. You've arrived!
The Update Rule
\[ x_{new} = x_{old} - \eta \times f'(x) \]
This just automates the process. It subtracts a small multiple of the slope, so x moves downhill towards the minimum as long as the step size η is small enough.
Let's Watch It In Action
We'll start at a random spot, x = 3, and use a small step size (learning rate), η = 0.1.
Start: x = 3, Learning Rate: η = 0.1, Update Rule: x - 0.1 × (2x)
| Iteration | Current x | f(x)=x² | Gradient f'(x)=2x | New x ← x - 0.1 × (2x) |
| 0 | 3.000 | 9.000 | 6.000 | 3 - 0.1×6 = 2.400 |
| 1 | 2.400 | 5.760 | 4.800 | 2.4 - 0.1×4.8 = 1.920 |
| 2 | 1.920 | 3.686 | 3.840 | 1.92 - 0.1×3.84 = 1.536 |
| 3 | 1.536 | 2.359 | 3.072 | 1.536 - 0.1×3.072 = 1.229 |
| ... | ... | ... | ... | ... |
| 10 | 0.322 | 0.104 | 0.644 | 0.322 - 0.1×0.644 = 0.258 |
Look at that! We started at x=3 with a huge error of 9. After just a few steps, it’s plummeting.
The algorithm is literally sliding down the curve of the parabola.
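If you want to check the table yourself, here is a minimal Python sketch of the same loop. The function name `gradient_descent_1d` and its parameters are illustrative choices of mine, not part of any library.

```python
def gradient_descent_1d(grad, x0, learning_rate=0.1, steps=10):
    """Follow the negative slope for a fixed number of steps.

    Returns the list of every x visited, starting with x0.
    (Hypothetical helper, written only for this walkthrough.)
    """
    xs = [x0]
    for _ in range(steps):
        x = xs[-1]
        xs.append(x - learning_rate * grad(x))
    return xs


# f(x) = x**2 has derivative f'(x) = 2*x
path = gradient_descent_1d(lambda x: 2 * x, x0=3.0)
for i, x in enumerate(path):
    print(f"iteration {i:2d}: x = {x:.3f}, f(x) = {x * x:.3f}")
```

Each printed row should line up with the "Current x" and "f(x)" columns of the table above.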
But... here comes the plot twist.
Gradient descent has tunnel vision.
It only sees the slope directly under its feet.
What if the landscape isn't a simple valley?
\[ f(x) = x^4 - 4x^2 + x + 1 \]
The TRUE global minimum
A local minimum
If you start at x = 0.5...
The algorithm slides into the shallow valley and gets stuck.
It found an answer, but not the best answer.
But if you start at x = -0.5...
It finds the true, deep valley perfectly.
Same algorithm, different starting points, wildly different results.
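Here is a small sketch you can run to see the two outcomes for yourself. The learning rate of 0.01 and the 2000 steps are assumptions I picked so the descent stays stable on this steeper curve; the script above does not specify them.

```python
def f(x):
    return x**4 - 4 * x**2 + x + 1


def f_prime(x):
    # Derivative of f, term by term
    return 4 * x**3 - 8 * x + 1


def descend(x, learning_rate=0.01, steps=2000):
    """Plain gradient descent from a given starting point."""
    for _ in range(steps):
        x -= learning_rate * f_prime(x)
    return x


x_right = descend(0.5)   # starts just right of the hump
x_left = descend(-0.5)   # starts just left of the hump

print(f"start  0.5 -> x = {x_right:.3f}, f(x) = {f(x_right):.3f}")
print(f"start -0.5 -> x = {x_left:.3f}, f(x) = {f(x_left):.3f}")
```

The run that starts at 0.5 settles in the valley on the right, while the run that starts at -0.5 ends up in the deeper valley on the left.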
This raises a terrifying question.
A neural network has millions of parameters, creating an error landscape with potentially billions of traps.
How could this work for GPT-4?
Here's the surprising, almost unbelievable answer:
...for huge neural networks, it almost doesn't matter.
The Surprising Luck of High Dimensions
✓ Most "local minima" are pretty good solutions
✓ Truly bad traps are incredibly rare
It's one of the luckiest coincidences in the history of AI,
and we're still trying to fully understand why.
Empirically, gradient descent just... works.
Now you understand the core engine.
But so far, our valley only has one dimension. We can only move left or right.
A real neural network is like a massive soundboard with a million knobs to tune.
How do you adjust all of them at once to find the perfect sound?
How do you figure out the slope in a million different directions simultaneously?
Alright, so we've mastered finding the bottom of a 1D valley. We can move left and right.
But a real neural network isn't a single slider.
It's a massive soundboard with millions of knobs.
How do you find the steepest downhill path...
...when 'downhill' is in a million different directions?
The answer is surprisingly elegant.
You don't.
Instead, you figure out the slope for each knob individually...
...as if it were the only one you were turning.
You focus on one knob at a time...
listen to its effect...
adjust it... and move to the next.
This is the core intuition behind one of the most important tools in machine learning:
The Partial Derivative.
When our function has multiple variables...
We can't just ask for "the slope"
We need to specify which direction
This uses the partial derivative symbol:
\[ \frac{\partial f}{\partial x_1} \]
"Slope in the x₁ direction only"
And for the other direction:
\[ \frac{\partial f}{\partial x_2} \]
"Slope in the x₂ direction only"
Our 3D Example: The Bowl
Let's upgrade our valley to 3D with the function:
\[ f(x_1, x_2) = x_1^2 + 2x_2^2 \]
This creates a beautiful, oval-shaped bowl. The lowest point is at (0, 0).
So how do we calculate these partial derivatives?
Here is the one magic rule you need to remember.
To find the partial derivative with respect to one variable...
...you treat ALL OTHER variables as if they are just constant numbers.
Let's find \( \frac{\partial f}{\partial x_1} \) for \( f = x_1^2 + 2x_2^2 \)
1. Pretend \(x_2\) is frozen. The \(2x_2^2\) term becomes a constant.
2. Constants disappear when we take derivatives.
3. Only \(x_1^2\) remains.
Answer: \( \frac{\partial f}{\partial x_1} = 2x_1\)
Now let's find \( \frac{\partial f}{\partial x_2} \) for \( f = x_1^2 + 2x_2^2 \)
1. Pretend \(x_1\) is frozen. The \(x_1^2\) term becomes a constant.
2. Constants disappear when we take derivatives.
3. Only \(2x_2^2\) remains.
Answer: \( \frac{\partial f}{\partial x_2} = 4x_2\)
You've just calculated a multi-dimensional gradient.
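If you'd like a sanity check, the "freeze every other variable" rule can be tested numerically. This is only a sketch: the helper `partial` and the step size `h` are my own illustrative choices, and the test point (3, 2) is arbitrary.

```python
def f(x1, x2):
    return x1**2 + 2 * x2**2


def partial(func, args, index, h=1e-6):
    """Estimate the slope along ONE coordinate with a central difference.

    Every other coordinate stays frozen, exactly like the rule above.
    (Hypothetical helper for illustration only.)
    """
    up = list(args)
    down = list(args)
    up[index] += h
    down[index] -= h
    return (func(*up) - func(*down)) / (2 * h)


point = (3.0, 2.0)
print("numeric  df/dx1:", partial(f, point, 0), " formula 2*x1:", 2 * point[0])
print("numeric  df/dx2:", partial(f, point, 1), " formula 4*x2:", 4 * point[1])
```

The numeric estimates should agree with the formulas 2x₁ and 4x₂ to several decimal places.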
With this tool, our Gradient Descent algorithm gets a simple upgrade.
Instead of one update rule, we now have one for each variable...
and here's the key: we apply them all at the same time.
INPUT: function f(x1,x2)
FOR 100 iterations:
grad_x1 = ∂f/∂x1
grad_x2 = ∂f/∂x2
x1 = x1 - η × grad_x1
x2 = x2 - η × grad_x2
RETURN (x1,x2)
Let's Navigate the 3D Bowl
Function: \( f(x_1, x_2) = x_1^2 + 2x_2^2 \)
Start at random point: (x1, x2) = (3, 2)
Learning rate: η = 0.1
Initial Error: \( f(3,2) = 3^2 + 2(2^2) = \mathbf{17} \)
Let's watch the algorithm work.
Let's Watch It In Action
Function: \( f(x_1, x_2) = x_1^2 + 2x_2^2 \), Start: (3, 2), Learning Rate: η = 0.1
Gradients: \( \frac{\partial f}{\partial x_1} = 2x_1 \), \( \frac{\partial f}{\partial x_2} = 4x_2 \)
| Iter | x₁ | x₂ | f(x₁,x₂) | ∂f/∂x₁=2x₁ | ∂f/∂x₂=4x₂ | New (x₁,x₂) ← (x₁-0.1×2x₁, x₂-0.1×4x₂) |
| 0 | 3.000 | 2.000 | 17.000 | 6.000 | 8.000 | (2.40, 1.20) ← (3-0.1×6, 2-0.1×8) |
| 1 | 2.400 | 1.200 | 8.640 | 4.800 | 4.800 | (1.92, 0.72) ← (2.4-0.1×4.8, 1.2-0.1×4.8) |
| 2 | 1.920 | 0.720 | 4.723 | 3.840 | 2.880 | (1.54, 0.43) ← (1.92-0.1×3.84, 0.72-0.1×2.88) |
| 3 | 1.536 | 0.432 | 2.733 | 3.072 | 1.728 | (1.23, 0.26) ← (1.54-0.1×3.07, 0.43-0.1×1.73) |
| ... | ... | ... | ... | ... | ... | ... |
| 10 | 0.322 | 0.012 | 0.104 | 0.644 | 0.048 | (0.26, 0.007) ← (0.32-0.1×0.64, 0.012-0.1×0.048) |
Look at that beautiful convergence!
In the first step, the gradient tells it to move 0.6 in the `x1` direction and 0.8 in the `x2` direction...
...slashing the error in half.
As it gets closer to the bottom, the slopes get smaller, so it takes smaller, more careful steps.
`x1` and `x2` both shrink steadily towards zero, finding the minimum of our 3D bowl.
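The two-variable loop can be sketched in a few lines of Python. The names `grad_f` and `history` are illustrative only. The one detail worth noticing is that both slopes are measured before either variable moves, which is what "apply them all at the same time" means in practice.

```python
def f(x1, x2):
    return x1**2 + 2 * x2**2


def grad_f(x1, x2):
    """Partial derivatives of f: (df/dx1, df/dx2)."""
    return 2 * x1, 4 * x2


x1, x2 = 3.0, 2.0
eta = 0.1
history = [(x1, x2)]

for _ in range(10):
    g1, g2 = grad_f(x1, x2)
    # Simultaneous update: both gradients were computed at the OLD point
    x1, x2 = x1 - eta * g1, x2 - eta * g2
    history.append((x1, x2))

for i, (a, b) in enumerate(history):
    print(f"iter {i:2d}: x1 = {a:.3f}, x2 = {b:.3f}, f = {f(a, b):.3f}")
```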
This is it. This is the fundamental technique for training a neural network.
We treat every single weight as its own "knob".
We calculate its partial derivative—its individual contribution to the total error...
...and then we nudge it slightly in the right direction.
Scale this up from 2 knobs to 2 million, and you have modern machine learning.
But this raises a new, much more subtle problem.
In our bowl example, `x1` and `x2` directly affected the final error.
In a deep neural network...
A weight in the first layer doesn't directly touch the final error.
Its influence travels through a long, complex chain.
How do you calculate the "blame" for a single knob when its effect is buried 20 layers deep?
This is the single biggest problem in deep learning...
...and its solution is one of the most elegant ideas in all of mathematics.
The Chain Rule.
The Formula
\[ \frac{dy}{dx} = \frac{dy}{du} \times \frac{du}{dx} \]
This simple formula is the masterstroke.
It's what makes deep learning possible.
And the intuition behind it?
It's literally a blame game.
Imagine your team's final presentation fails. That's the error.
To find out why, you trace the problem backward:
1. The presentation was bad...
2. ...because the slides were confusing. (50% blame)
3. ...because the data analysis was flawed. (80% blame)
4. ...because the data collection was sloppy. (90% blame)
To find out how much the initial data collector is responsible for the final failed presentation...
...you just multiply the blame at each step.
90% × 80% × 50% = 36%
The Chain Rule does exactly this.
It multiplies the influence at each link in the chain to find the total impact of a variable far, far away.
Let's see this mathematical blame game in action.
The Problem: A Nested Function
Look at this beast of a function:
\[ f(x_1, x_2) = ((2x_1 + x_2)^2 + 3x_2^2)^3 \]
This looks intimidating. But we can break it down.
- First, we calculate \( u = 2x_1 + x_2 \)
- Then, \( v = u^2 + 3x_2^2 \)
- Finally, \( f = v^3 \)
The chain of influence is clear:
%%{init: {'theme': 'dark'}}%%
graph LR
A[(x1, x2)] --> B[u]
B --> C[v]
C --> D[f]
Finding \( \frac{\partial f}{\partial x_1} \): Tracing the Blame
\( f(x_1, x_2) = ((2x_1 + x_2)^2 + 3x_2^2)^3 \), where \( u = 2x_1 + x_2 \), \( v = u^2 + 3x_2^2 \), \( f = v^3 \)
| Step | Question | Function | Derivative |
| 1 | How much does \(f\) blame \(v\)? | \( f = v^3 \) | \( 3v^2 \) |
| 2 | How much does \(v\) blame \(u\)? | \( v = u^2 + 3x_2^2 \) | \( 2u \) |
| 3 | How much does \(u\) blame \(x_1\)? | \( u = 2x_1 + x_2 \) | \( 2 \) |
Total Blame: \( \frac{\partial f}{\partial x_1} = \frac{\partial f}{\partial v} \times \frac{\partial v}{\partial u} \times \frac{\partial u}{\partial x_1} = 3v^2 \times 2u \times 2 = \mathbf{12uv^2} \)
Now for \( \frac{\partial f}{\partial x_2} \). It's trickier!
\(x_2\) influences \(v\) in two ways: indirectly through \(u\), and directly.
%%{init: {'theme': 'dark'}}%%
graph LR
A(x2) --> B(u)
B --> C(v)
A -- direct --> C
C --> D(f)
The total blame is just the sum of the blame from all paths.
Finding \( \frac{\partial f}{\partial x_2} \): Summing the Blame
By the chain rule, \( \frac{\partial f}{\partial x_2} = \frac{\partial f}{\partial v} \times \frac{\partial v}{\partial x_2} \). We already know \( \frac{\partial f}{\partial v} = 3v^2 \), so we need \( \frac{\partial v}{\partial x_2} \), summed over both paths.
| Path | Calculation | Result |
| Indirect: \(x_2 \rightarrow u \rightarrow v\) | \( \frac{\partial v}{\partial u} \times \frac{\partial u}{\partial x_2} = 2u \times 1 \) | \( 2u \) |
| Direct: \(x_2 \rightarrow v\) | \( \frac{\partial}{\partial x_2}(3x_2^2) \) | \( 6x_2 \) |
| Total \( \frac{\partial v}{\partial x_2} \) | Sum both paths | \( 2u + 6x_2 \) |
Final Result: \( \frac{\partial f}{\partial x_2} = 3v^2 \times (2u + 6x_2) \)
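You don't have to take these two formulas on faith. The sketch below compares them against a brute-force slope estimate (a central difference). The test point (0.5, -0.3) and the helper names are arbitrary choices of mine for illustration.

```python
def f(x1, x2):
    u = 2 * x1 + x2
    v = u**2 + 3 * x2**2
    return v**3


def analytic_grad(x1, x2):
    """Gradients from the chain rule, exactly as derived above."""
    u = 2 * x1 + x2
    v = u**2 + 3 * x2**2
    df_dv = 3 * v**2
    df_dx1 = df_dv * 2 * u * 2            # f -> v -> u -> x1
    df_dx2 = df_dv * (2 * u + 6 * x2)     # indirect path + direct path
    return df_dx1, df_dx2


def numeric_grad(x1, x2, h=1e-6):
    """Brute-force slopes: wiggle one input at a time."""
    d1 = (f(x1 + h, x2) - f(x1 - h, x2)) / (2 * h)
    d2 = (f(x1, x2 + h) - f(x1, x2 - h)) / (2 * h)
    return d1, d2


point = (0.5, -0.3)  # hypothetical test point
print("chain rule:", analytic_grad(*point))
print("numeric   :", numeric_grad(*point))
```

If the two lines print nearly identical pairs, the blame was traced correctly.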
And just like that...
the Chain Rule has untangled that complex nested function for us.
Now, look what we can do.
FOR 100 iterations:
# Calculate current values
u = 2*x1 + x2
v = u**2 + 3*x2**2
# Calculate the blame for each variable
grad_x1 = 12 * u * v**2
grad_x2 = (3 * v**2) * (2*u + 6*x2)
# Nudge each variable in the right direction
x1 = x1 - η * grad_x1
x2 = x2 - η * grad_x2
RETURN (x1,x2)
This is the key.
The Chain Rule gives us a systematic way to find the gradient for any variable...
...no matter how deeply it's buried inside a complex function.
This is the final piece of the puzzle.
1. We know how to go downhill (Gradient Descent).
2. We know how to find the slope for each knob (Partial Derivatives).
3. And now, we can trace blame through a long chain (The Chain Rule).
We have all the mathematical tools we need.
Now, it's time to stop playing with abstract functions.
Let's use these tools to build our very first, functioning AI brain...
...and watch it actually learn, right before your eyes.
So, what even *is* a 'Neural Network'?
Forget the hype. Forget the sci-fi.
It's just a collection of simple functions, called "neurons," organized in "layers."
Each neuron is just a tiny calculation. Nothing magical.
f(a,b) = a + b²
We stack them in layers, like an assembly line.
This creates a Forward Pass—a one-way flow of information from data to answer.
%%{init: {'theme': 'dark'}}%%
graph TD
A[Inputs] --> B(Layer 1)
B --> C(Layer 2)
C --> D[Final Answer]
Let's build one right now.
Our First AI Brain: The Architecture
Our network has 2 inputs, 2 hidden neurons, 1 output neuron, and 5 tunable weights (w1, w2, w3, w4, w5).
Layer 1 (Hidden):
neuron 1: h1 = (x1 + w1*x2)²
neuron 2: h2 = w2*x1*x2
Layer 2 (Output):
y_pred = w3*h1 + w4*h2 + w5
%%{init: {'theme': 'dark'}}%%
graph LR
x1((x1)) --> h1((h1))
x1 --> h2((h2))
x2((x2)) --> h1
x2 --> h2
h1 --> y((Y))
h2 --> y
The Task: Learn a New Function
Target Function: \( f(x_1,x_2) = 2x_1^2 + 3x_2 \)
We collect three training examples:
| Input (x1, x2) | True Output (y_true) |
| (3, 2) | 2×3² + 3×2 = 18 + 6 = 24 |
| (1, 4) | 2×1² + 3×4 = 2 + 12 = 14 |
| (2, 1) | 2×2² + 3×1 = 8 + 3 = 11 |
The Setup: An Untrained Brain
Training Examples: (3,2)→24, (1,4)→14, (2,1)→11
Architecture: \( h_1=(x_1+w_1 \times x_2)^2 \), \( h_2=w_2 \times x_1 \times x_2 \), \( y=w_3 \times h_1+w_4 \times h_2+w_5 \)
Initial Random Weights: w1=1, w2=2, w3=1, w4=1, w5=0
Our network is a blank slate. Let's see how badly it does.
This is the Forward Pass.
We just feed inputs through and see what comes out.
No learning yet. Just pure calculation.
Forward Pass Results
Weights: w1=1, w2=2, w3=1, w4=1, w5=0
Formulas: \( h_1=(x_1+w_1 \times x_2)^2 \), \( h_2=w_2 \times x_1 \times x_2 \), \( y=w_3 \times h_1+w_4 \times h_2+w_5 \)
| Input (x1, x2) | True Output | h1 | h2 | Predicted Output (y) | Difference |
| (3, 2) | 24 | (3+1×2)² = 5² = 25 | 2×3×2 = 12 | 1×25+1×12+0 = 37 | 37-24 = 13 |
| (1, 4) | 14 | (1+1×4)² = 5² = 25 | 2×1×4 = 8 | 1×25+1×8+0 = 33 | 33-14 = 19 |
| (2, 1) | 11 | (2+1×1)² = 3² = 9 | 2×2×1 = 4 | 1×9+1×4+0 = 13 | 13-11 = 2 |
Ouch. Not great.
For the second example, it was off by a massive 19 points.
But how do we quantify this failure?
We need a single number that tells us, overall, how wrong our network is.
We can't just add them up...
...because a -19 and a +19 would cancel out, making it look perfect when it's terrible.
The solution?
We SQUARE the differences.
Why Squaring Works:
1. Makes Every Error Positive
No more canceling out!
(-13)² = 169
2. Punishes Big Mistakes
Huge penalties for large errors!
Error 2 → Penalty 4
Error 19 → Penalty 361
Let's calculate the Total Squared Error.
Error = (-13)² + (-19)² + (-2)²
Error = 169 + 361 + 4
Total Squared Error = 534
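The whole forward pass and the error calculation fit in a few lines of Python. This is a sketch of the toy network defined above; the function name `forward` and the way the weights are packed into a tuple are my own choices.

```python
def forward(x1, x2, w):
    """Forward pass of the toy network: 2 inputs, 2 hidden neurons, 1 output."""
    w1, w2, w3, w4, w5 = w
    h1 = (x1 + w1 * x2) ** 2
    h2 = w2 * x1 * x2
    return w3 * h1 + w4 * h2 + w5


# Training examples generated from the target 2*x1**2 + 3*x2
data = [((3, 2), 24), ((1, 4), 14), ((2, 1), 11)]
weights = (1, 2, 1, 1, 0)  # the initial weights from the setup

predictions = [forward(x1, x2, weights) for (x1, x2), _ in data]
total_squared_error = sum((y - p) ** 2 for (_, y), p in zip(data, predictions))

print("predictions:", predictions)
print("total squared error:", total_squared_error)
```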
There it is: 534
That is our antagonist. Our enemy.
Our goal for the rest of this video...
...is to make this number—534—
...as close to zero as possible.
We've completed the Forward Pass.
Our network has made its first, terrible predictions, and we've measured its failure.
Now for the 'Aha!' moment.
We're going to take all the tools we've learned
—Gradient Descent, Partial Derivatives, and the Chain Rule—
and unleash them on this network.
Remember the "Blame Game" with the Chain Rule?
Tracing responsibility back through a complex chain?
And remember our complex nested function from that section?
\[ f = ((2x_1 + x_2)^2 + 3x_2^2)^3 \]
Here is the single most important insight of this entire video.
A Neural Network IS just a giant, nested function.
That's it. That's the whole secret.
Look at the comparison.
| | Math Functions | Neural Networks |
| Structure | Nested operations: f(g(h(x))) | Layered operations: Layer2(Layer1(inputs)) |
| Variables | We control: x1, x2 | We control: ??? |
| Target | Minimize: f(x1,x2) | Minimize: Error(???) |
This raises the most critical question:
What are the "knobs" we are allowed to tune in our neural network? What are the variables?
Is it the inputs, x1 and x2?
NO!
The inputs are the data we're given.
We can't change the fact that we were given (x1=3, x2=2).
That's the problem we have to solve.
The ONLY things we have control over...
... are the weights. `w1, w2, w3, w4, w5`.
This is the breakthrough.
The Real Comparison
Math Functions
Variables we controlled:
x1, x2
Target: Minimize
f(x1, x2)
Neural Networks
Variables we control:
w1, w2, w3, w4, w5
Target: Minimize
Error(w1, w2, w3, w4, w5)
We're not trying to find the best inputs.
We're trying to find the best weights that turn the inputs into the correct outputs.
We already know exactly how to do this!
We use Gradient Descent to find the values of our variables...
...that minimize a function's output.
The process is identical.
It's time to play the Blame Game.
Let's focus on that first terrible prediction.
The Crime Scene
Input: `(3, 2)`, Target: `24`, Prediction: `37`, Error = (24 - 37)² = 169
Weights: w1=1, w2=2, w3=1, w4=1, w5=0
Layer 1 (Hidden):
neuron 1: h1 = (x1 + w1*x2)²
= (3 + 1*2)² = 25
neuron 2: h2 = w2*x1*x2
= 2*3*2 = 12
Layer 2 (Output):
y_pred = w3*h1 + w4*h2 + w5
= 1*25 + 1*12 + 0 = 37
Let's start the investigation with our first suspect: w5.
Its path to the error is short and direct:
Error ← y_pred ← w5
Investigating `w5` - The Crime Scene Context
Input: `(3, 2)`, Target: `24`, Prediction: `37`, Error = (24 - 37)² = 169
Weights: w1=1, w2=2, w3=1, w4=1, w5=0
Forward Pass:
x1 = 3, x2 = 2
w1 = 1, w2 = 2, w3 = 1,
w4 = 1, w5 = 0
h1 = (x1 + w1*x2)²
= (3 + 1*2)² = 5² = 25
h2 = w2*x1*x2 = 2*3*2 = 12
y_pred = w3*h1 + w4*h2 + w5
= 1*25 + 1*12 + 0 = 37
Error = (y_true - y_pred)²
= (24 - 37)² = 169
Path: Error ← y_pred ← w5
\[ \frac{\partial Error}{\partial w_5} = \frac{\partial Error}{\partial y_{pred}} \times \frac{\partial y_{pred}}{\partial w_5} \]
1. \( \frac{\partial Error}{\partial y_{pred}} \): the derivative of \( (y_{true} - y_{pred})^2 \) w.r.t. `y_pred` is \( 2(y_{pred} - y_{true}) \). So, \( 2 \times (37 - 24) = \mathbf{26} \).
2. \( \frac{\partial y_{pred}}{\partial w_5} \): deriv of \( w_3 h_1 + w_4 h_2 + w_5 \) w.r.t. `w5` is 1
(because the rest are constants)
Total Blame for `w5` = \( 26 \times 1 = \mathbf{\color{#ff6b6b}{26}} \)
Now for a tougher suspect: w1
It's buried much deeper. Its path to the error is longer:
Error ← y_pred ← h1 ← w1
This means more chain rule steps!
Investigating `w1` - The Crime Scene Context
Path: Error ← y_pred ← h1 ← w1
\[ \frac{\partial Error}{\partial w_1} = \frac{\partial Error}{\partial y_{pred}} \times \frac{\partial y_{pred}}{\partial h_1} \times \frac{\partial h_1}{\partial w_1} \]
1. \( \frac{\partial Error}{\partial y_{pred}} \): Wait... we already calculated this. It's 26.
2. \( \frac{\partial y_{pred}}{\partial h_1} \): deriv of \( w_3 h_1 + w_4 h_2 + w_5 \) w.r.t. `h1` is just `w3` = 1.
3. \( \frac{\partial h_1}{\partial w_1} \): deriv of \( (x_1 + w_1 x_2)^2 \) w.r.t. `w1` is \( 2(x_1+w_1 x_2) \cdot x_2 \)
So, \( 2 \times (3 + 1 \times 2) \times 2 = 2 \times 5 \times 2 = \mathbf{20} \).
Total Blame for `w1` = \( 26 \times 1 \times 20 = \mathbf{\color{#ff6b6b}{520}} \)
Now stop.
Look at what we just did. Look closer.
\[ \frac{\partial Error}{\partial w_5} = {\color{#ff6b6b}{\frac{\partial Error}{\partial y_{pred}}}} \times \frac{\partial y_{pred}}{\partial w_5} = {\color{#ff6b6b}{26}} \times 1 = 26 \]
\[ \frac{\partial Error}{\partial w_1} = {\color{#ff6b6b}{\frac{\partial Error}{\partial y_{pred}}}} \times \frac{\partial y_{pred}}{\partial h_1} \times \frac{\partial h_1}{\partial w_1} = {\color{#ff6b6b}{26}} \times 1 \times 20 = 520 \]
Do you see it? The term ∂Error/∂y_pred = 26 is in BOTH calculations!
This is the multi-trillion dollar trick that makes deep learning efficient.
We don't recalculate everything from scratch for every weight.
We calculate the error gradient once...
...then propagate it backwards, reusing calculations.
This is Backpropagation.
It's not magic.
It's just being clever and not re-doing work you've already done.
Systematically Finding the Blame
\[ \frac{\partial Error}{\partial w_5} = {\color{#ff6b6b}{\frac{\partial Error}{\partial y_{pred}}}} \times \frac{\partial y_{pred}}{\partial w_5}, \quad \frac{\partial Error}{\partial w_4} = {\color{#ff6b6b}{\frac{\partial Error}{\partial y_{pred}}}} \times \frac{\partial y_{pred}}{\partial w_4}, \quad \frac{\partial Error}{\partial w_3} = {\color{#ff6b6b}{\frac{\partial Error}{\partial y_{pred}}}} \times \frac{\partial y_{pred}}{\partial w_3} \]
\[ \frac{\partial Error}{\partial w_2} = {\color{#ff6b6b}{\frac{\partial Error}{\partial y_{pred}}}} \times {\color{#4CAF50}{\frac{\partial y_{pred}}{\partial h_2}}} \times \frac{\partial h_2}{\partial w_2}, \quad \frac{\partial Error}{\partial w_1} = {\color{#ff6b6b}{\frac{\partial Error}{\partial y_{pred}}}} \times {\color{#00bfff}{\frac{\partial y_{pred}}{\partial h_1}}} \times \frac{\partial h_1}{\partial w_1} \]
Forward Pass: Computing the Output
# Inputs: x1, x2
# Weights: w1, w2, w3, w4, w5
# Hidden Layer
h1 = (x1 + w1*x2)²
h2 = w2*x1*x2
# Output Layer
y_pred = w3*h1 + w4*h2 + w5
# Error
Error = (y_true - y_pred)²
%%{init: {'theme': 'dark'}}%%
graph LR
input1[ ] --> |"x1,x2"| h1((h1))
input2[ ] --> |"x1,x2"| h2((h2))
w1[w1] --> h1
w2[w2] --> h2
h1 --> y((Y))
h2 --> y
w345[w3,w4,w5] --> y
y --> |"y_true"| error((Error))
style input1 fill:transparent,stroke:transparent
style input2 fill:transparent,stroke:transparent
classDef weightNode fill:#4CAF50,stroke:#fff
classDef variableNode fill:#333,stroke:#fff
class w1,w2,w345 weightNode
class h1,h2,y,error variableNode
Backward Pass: Computing Gradients
# Start with error gradient
∂E/∂y_pred = 2(y_pred - y_true)
# Output layer weights
∂E/∂w5 = ∂E/∂y_pred * 1
∂E/∂w4 = ∂E/∂y_pred * h2
∂E/∂w3 = ∂E/∂y_pred * h1
# Hidden layer gradients
∂E/∂h1 = ∂E/∂y_pred * w3
∂E/∂h2 = ∂E/∂y_pred * w4
# Hidden layer weights
∂E/∂w1 = ∂E/∂h1 * 2(x1+w1*x2)*x2
∂E/∂w2 = ∂E/∂h2 * x1*x2
%%{init: {'theme': 'dark'}}%%
graph RL
h1((h1)) --> |"x1,x2"| input1[ ]
h2((h2)) --> |"x1,x2"| input2[ ]
h1 --> |"∂h1/∂w1"| w1[w1]
h2 --> |"∂h2/∂w2"| w2[w2]
y((Y)) --> |"∂Y/∂h1"| h1
y --> |"∂Y/∂h2"| h2
y --> |"∂Y/∂w3,w4,w5"| w345[w3,w4,w5]
error((Error)) --> |"∂E/∂Y"| y
style input1 fill:transparent,stroke:transparent
style input2 fill:transparent,stroke:transparent
classDef weightNode fill:#4CAF50,stroke:#fff
classDef variableNode fill:#333,stroke:#fff
class w1,w2,w345 weightNode
class h1,h2,y,error variableNode
Computing All Gradients: ∂E/∂w
Inputs: x1=3, x2=2, y_true=24
Weights: w1=1, w2=2, w3=1, w4=1, w5=0
Forward Pass: h1=25, h2=12, y_pred=37
Error: E=(37-24)²=169
Actual Calculations:
∂E/∂y_pred = 2(37 - 24) = 26
∂E/∂w5 = 26 × 1 = 26
∂E/∂w4 = 26 × 12 = 312
∂E/∂w3 = 26 × 25 = 650
∂E/∂h1 = 26×1 = 26
∂E/∂h2 = 26×1 = 26
∂E/∂w2 = 26 × (3×2) = 26 × 6 = 156
∂E/∂w1 = 26 × 2×(3+1×2)×2 = 26 × 2×5×2 = 520
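Here is the same backward pass written as a Python sketch, so you can see the reuse of `dE_dy` directly in code. The function name `forward_backward` and the variable names are mine; the formulas are the ones listed above.

```python
def forward_backward(x1, x2, y_true, w):
    """One forward pass, then one backward pass, for a single example.

    Returns (y_pred, error, gradients) with gradients ordered w1..w5.
    """
    w1, w2, w3, w4, w5 = w

    # Forward pass
    s = x1 + w1 * x2          # the value inside the square of h1
    h1 = s ** 2
    h2 = w2 * x1 * x2
    y_pred = w3 * h1 + w4 * h2 + w5
    error = (y_true - y_pred) ** 2

    # Backward pass: compute the error gradient ONCE, then reuse it
    dE_dy = 2 * (y_pred - y_true)
    dE_dw5 = dE_dy * 1
    dE_dw4 = dE_dy * h2
    dE_dw3 = dE_dy * h1
    dE_dh1 = dE_dy * w3
    dE_dh2 = dE_dy * w4
    dE_dw1 = dE_dh1 * 2 * s * x2
    dE_dw2 = dE_dh2 * x1 * x2

    return y_pred, error, (dE_dw1, dE_dw2, dE_dw3, dE_dw4, dE_dw5)


y_pred, error, grads = forward_backward(3, 2, 24, (1, 2, 1, 1, 0))
print("prediction:", y_pred, "error:", error)
print("gradients (w1..w5):", grads)
```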
Gradient Descent: Using Our Computed Gradients
Learning Rate: η = 0.0001
Update Rule: new_weight = old_weight - η × gradient
Computed Gradients:
∂E/∂w5 = 26
∂E/∂w4 = 312
∂E/∂w3 = 650
∂E/∂w2 = 156
∂E/∂w1 = 520
Weight Updates:
| Weight | Old | Calculation | New |
| w5 | 0 | 0 - 0.0001×26 | -0.0026 |
| w4 | 1 | 1 - 0.0001×312 | 0.9688 |
| w3 | 1 | 1 - 0.0001×650 | 0.935 |
| w2 | 2 | 2 - 0.0001×156 | 1.9844 |
| w1 | 1 | 1 - 0.0001×520 | 0.948 |
That's it. The learning has happened.
We played the Blame Game, and we commanded each weight to adjust itself.
But... did it work?
This is the moment of truth.
Forward Pass with Updated Weights
Inputs: x1=3, x2=2, y_true=24
NEW Weights: w1=0.948, w2=1.9844, w3=0.935, w4=0.9688, w5=-0.0026
Formulas: h₁=(x₁+w₁×x₂)², h₂=w₂×x₁×x₂, y=w₃×h₁+w₄×h₂+w₅
Question: Did the weights improve our prediction?
Step-by-Step Calculation:
h₁ = (x₁ + w₁×x₂)²
= (3 + 0.948×2)² = (4.896)² = 23.97
h₂ = w₂×x₁×x₂
= 1.9844×3×2 = 11.91
y = w₃×h₁ + w₄×h₂ + w₅
= 0.935×23.97 + 0.9688×11.91 - 0.0026
y = 22.41 + 11.54 - 0.0026 = 33.95
Improvement: 37 → 33.95 ✓
Let's compare.
| | Before Learning | After ONE Update | Target |
| Prediction | 37 | 33.95 | 24 |
| How far off | 13 units | ~10 units | 0 |
The prediction moved from 37 down to 33.95.
It got closer to the target of 24. The error got smaller.
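To replay this single learning step yourself, here is a short sketch. It hard-codes the gradients we computed by hand for the example (3, 2); the function name `predict` is an illustrative choice.

```python
def predict(x1, x2, w):
    """Forward pass of the toy network."""
    w1, w2, w3, w4, w5 = w
    h1 = (x1 + w1 * x2) ** 2
    h2 = w2 * x1 * x2
    return w3 * h1 + w4 * h2 + w5


weights = [1.0, 2.0, 1.0, 1.0, 0.0]               # w1..w5 before learning
gradients = [520.0, 156.0, 650.0, 312.0, 26.0]    # dE/dw1..dE/dw5 from the hand calculation
eta = 0.0001

# The update rule, applied to every weight at once
new_weights = [w - eta * g for w, g in zip(weights, gradients)]

before = predict(3, 2, weights)
after = predict(3, 2, new_weights)
print("new weights:", [round(w, 4) for w in new_weights])
print(f"prediction before: {before:.2f}, after: {after:.2f}, target: 24")
```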
This is not a guess. This is not magic.
This is intelligence emerging directly from mathematics.
Following the chain of blame backwards
Nudging parameters downhill
The network learns automatically
Repeat this process a thousand times, and it will get even closer.
A million times, and it will be nearly perfect.
But hang on.
Think about what we just did.
We trained our network on one single example.
`(3, 2) -> 24`
That's like studying for a final exam by memorizing the answer to a single practice question.
You might ace that one question, but you haven't learned the concept.
You'll fail the actual test.
To become truly intelligent, our network can't just memorize one data point.
It needs to find the underlying pattern that works for ALL of our examples.
So, the next logical question is:
How do you learn from the entire crowd of data at once?
There are two main strategies for this.
Online Learning
- Look at one example.
- Calculate the blame.
- Update the weights immediately.
- Move to the next example.
Batch Learning
- Look at ALL examples.
- Calculate the blame for each one.
- Add up ALL the blame.
- Update the weights ONCE, at the end.
The Algorithms Side-by-Side
Let's say we have N training examples: {(x₁, y₁), (x₂, y₂), ..., (xₙ, yₙ)}
Online Learning Formula:
For each example i = 1 to N:
// Calculate gradient for ONE example
gradient_i = ∇Loss(θ, xᵢ, yᵢ)
// Update weights IMMEDIATELY
θ = θ - η × gradient_i
Batch Learning Formula:
total_gradient = 0
For each example i = 1 to N:
// Calculate gradient for this example
gradient_i = ∇Loss(θ, xᵢ, yᵢ)
// ADD to the running total
total_gradient += gradient_i
// Update weights ONCE at the end
θ = θ - η × total_gradient
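To make the difference concrete, here is a runnable sketch of both strategies on a deliberately tiny stand-in problem: a one-weight model `y = theta * x` with squared error. That model, the data, and the learning rate are assumptions of mine for illustration; they are not the network from this video.

```python
def grad_loss(theta, x, y):
    """Gradient of (y - theta*x)**2 with respect to theta."""
    return 2 * (theta * x - y) * x


def online_epoch(theta, data, eta):
    """Update immediately after every single example."""
    for x, y in data:
        theta = theta - eta * grad_loss(theta, x, y)
    return theta


def batch_epoch(theta, data, eta):
    """Add up the gradients of all examples, then update once."""
    total_gradient = sum(grad_loss(theta, x, y) for x, y in data)
    return theta - eta * total_gradient


data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # hypothetical data following y = 2x
theta_online = online_epoch(0.0, data, eta=0.01)
theta_batch = batch_epoch(0.0, data, eta=0.01)
print("after one pass, online:", theta_online)
print("after one pass, batch :", theta_batch)
```

Both move the weight towards 2, but they land in slightly different places because the online version measures each gradient at a weight that has already been nudged.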
Online learning is impulsive.
It learns from one example and immediately changes its mind.
Batch learning is more deliberate.
It listens to the opinion of every single example before making a collective, democratic decision.
Let's try it.
Let's perform one update using Batch Learning and see what happens.
Batch Learning Setup
We'll start from our original, dumb weights:
w1=1, w2=2, w3=1, w4=1, w5=0
Step 1: The Forward Pass for ALL Examples
| Example | (x1,x2) | y_true | h1 | h2 | y_pred | Error² |
| 1 (already done) | (3, 2) | 24 | 25 | 12 | 37 | 169 |
| 2 | (1, 4) | 14 | 25 | 8 | 33 | 361 |
| 3 | (2, 1) | 11 | 9 | 4 | 13 | 4 |
Total Error: 169 + 361 + 4 = 534
Step 2: Calculate Gradients for ALL Examples
We play the blame game for each example, but we don't update the weights yet.
| Example | ∂E/∂w1 | ∂E/∂w2 | ∂E/∂w3 | ∂E/∂w4 | ∂E/∂w5 |
| 1 (already done) | 520 | 156 | 650 | 312 | 26 |
| 2 | 1520 | 152 | 950 | 304 | 38 |
| 3 | 24 | 8 | 36 | 16 | 4 |
For instance, for example 2: ∂E/∂y_pred = 2(33 - 14) = 38, so ∂E/∂w1 = 38 × 1 × 2×(1+1×4)×4 = 1520 and ∂E/∂w2 = 38 × 1 × (1×4) = 152.
Step 3: Sum the Blame and Update ONCE
| Example | ∂E/∂w1 | ∂E/∂w2 | ∂E/∂w3 | ∂E/∂w4 | ∂E/∂w5 |
| 1 (already done) | 520 | 156 | 650 | 312 | 26 |
| 2 | 1520 | 152 | 950 | 304 | 38 |
| 3 | 24 | 8 | 36 | 16 | 4 |
| TOTAL | 2064 | 316 | 1636 | 632 | 68 |
Now we use these total gradients to update our weights one time. (η = 0.0001)
| Weight | Old | Total Gradient | Update | New Value |
| w1 | 1 | 2064 | 1 - 0.0001×2064 | 0.7936 |
| w2 | 2 | 316 | 2 - 0.0001×316 | 1.9684 |
| w3 | 1 | 1636 | 1 - 0.0001×1636 | 0.8364 |
| w4 | 1 | 632 | 1 - 0.0001×632 | 0.9368 |
| w5 | 0 | 68 | 0 - 0.0001×68 | -0.0068 |
This update direction is a compromise.
It's the combined direction that reduces the error across ALL our data, not just one example.
But did it work?
Let's verify.
Verification: Forward Pass with Batch-Updated Weights
OLD Weights: w1=1, w2=2, w3=1, w4=1, w5=0
NEW Weights: w1=0.7936, w2=1.9684, w3=0.8364, w4=0.9368, w5=-0.0068
Architecture: h₁=(x₁+w₁×x₂)², h₂=w₂×x₁×x₂, y=w₃×h₁+w₄×h₂+w₅
Step-by-Step Forward Pass for ALL Examples
| Example | Target | h₁ = (x₁+w₁×x₂)² | h₂ = w₂×x₁×x₂ | y_pred = w₃×h₁+w₄×h₂+w₅ | BEFORE | AFTER |
| (3, 2) | 24 | (3+0.7936×2)² = (4.5872)² = 21.042 | 1.9684×3×2 = 11.810 | 0.8364×21.042 + 0.9368×11.810 - 0.0068 = 17.600 + 11.064 - 0.007 = 28.66 | 37, Error²: (24-37)² = 169 | 28.66, Error²: (24-28.66)² = 21.7 |
| (1, 4) | 14 | (1+0.7936×4)² = (4.1744)² = 17.426 | 1.9684×1×4 = 7.874 | 0.8364×17.426 + 0.9368×7.874 - 0.0068 = 14.575 + 7.376 - 0.007 = 21.94 | 33, Error²: (14-33)² = 361 | 21.94, Error²: (14-21.94)² = 63.0 |
| (2, 1) | 11 | (2+0.7936×1)² = (2.7936)² = 7.804 | 1.9684×2×1 = 3.937 | 0.8364×7.804 + 0.9368×3.937 - 0.0068 = 6.527 + 3.688 - 0.007 = 10.21 | 13, Error²: (11-13)² = 4 | 10.21, Error²: (11-10.21)² = 0.62 |
🎉 ALL examples improved with a single batch update! 🎉
ALL examples improved!
One update. Works for everyone.
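If you'd rather not trust hand arithmetic, the whole batch update can be recomputed in Python. This is a sketch under stated assumptions: it uses the architecture from the verification slide (h₁=(x₁+w₁×x₂)², h₂=w₂×x₁×x₂, y=w₃×h₁+w₄×h₂+w₅) and the squared error (y_pred - y_true)², and the function names `forward`, `gradients`, and `total_error` are illustrative. Treat it as a cross-check rather than the video's own code.

```python
def forward(w, x1, x2):
    """Forward pass of the tiny network: returns (h1, h2, y_pred)."""
    w1, w2, w3, w4, w5 = w
    h1 = (x1 + w1 * x2) ** 2
    h2 = w2 * x1 * x2
    return h1, h2, w3 * h1 + w4 * h2 + w5


def gradients(w, x1, x2, y_true):
    """Chain-rule gradients of (y_pred - y_true)**2 for one example."""
    w1, w2, w3, w4, w5 = w
    h1, h2, y_pred = forward(w, x1, x2)
    d_y = 2 * (y_pred - y_true)              # dE/dy_pred
    return [
        d_y * w3 * 2 * (x1 + w1 * x2) * x2,  # dE/dw1
        d_y * w4 * x1 * x2,                  # dE/dw2
        d_y * h1,                            # dE/dw3
        d_y * h2,                            # dE/dw4
        d_y,                                 # dE/dw5
    ]


def total_error(w, examples):
    """Sum of squared errors over all examples."""
    return sum((y - forward(w, x1, x2)[2]) ** 2 for (x1, x2), y in examples)


examples = [((3, 2), 24), ((1, 4), 14), ((2, 1), 11)]
weights = [1.0, 2.0, 1.0, 1.0, 0.0]
lr = 0.0001

# Add up the blame from every example before touching the weights.
total_gradient = [0.0] * 5
for (x1, x2), y_true in examples:
    for k, g in enumerate(gradients(weights, x1, x2, y_true)):
        total_gradient[k] += g

# One single update using the summed gradient.
new_weights = [w - lr * g for w, g in zip(weights, total_gradient)]

print("total gradient:", total_gradient)
print("error before:", total_error(weights, examples))
print("error after: ", total_error(new_weights, examples))
```

The printed total error drops after the single update, which is exactly the check the verification table performs by hand.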
This is the first step toward a network that generalizes: it has to fit the pattern shared by all the examples, not the quirks of one.
So, which one is better?
Pure Online Learning
Fast updates, but the learning is noisy and chaotic.
Full Batch Learning
Stable direction, but incredibly slow and memory-hungry.
So what do we use in the real world?
We use Mini-batch Learning.
This is the sweet spot. It's the industry standard that powers virtually all modern AI.
Mini-batch:
1. Take a small batch (32, 64, or 256 examples)
2. Update weights
3. Repeat with next batch
Mini-batch Learning Algorithm
For each mini-batch of size B (where B < N):
    total_gradient = 0
    For each example j in mini-batch:
        // Calculate gradient for this example
        gradient_j = ∇Loss(θ, xⱼ, yⱼ)
        // ADD to mini-batch total
        total_gradient += gradient_j
    // Update weights using mini-batch average
    θ = θ - η × (total_gradient / B)
It's the best of both worlds.
The gradients are stable enough from the small crowd, and the memory usage is perfectly manageable.
This is how you train a network on a dataset of billions of images.
You feed it one small handful at a time.
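The mini-batch loop looks like this as a small Python sketch. The one-weight model `y = theta * x`, the function names `minibatch_sgd` and `grad`, the toy data, and the batch size of 4 are all assumptions made for illustration. Real training code would normally lean on a framework, but the control flow is the same.

```python
import random


def minibatch_sgd(theta, data, grad_fn, lr, batch_size, epochs, seed=0):
    """Mini-batch gradient descent: one update per batch of examples."""
    rng = random.Random(seed)
    data = list(data)
    for _ in range(epochs):
        rng.shuffle(data)  # visit the examples in a new order each epoch
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            total_gradient = sum(grad_fn(theta, x, y) for x, y in batch)
            # Average over the batch so the step size does not depend on B.
            theta = theta - lr * total_gradient / len(batch)
    return theta


def grad(theta, x, y):
    """Gradient of (theta*x - y)**2 for a hypothetical one-weight model."""
    return 2 * (theta * x - y) * x


# Hypothetical toy data from y = 3x, with x = 0.1, 0.2, ..., 1.0.
data = [(i / 10, 3 * i / 10) for i in range(1, 11)]
theta = minibatch_sgd(0.0, data, grad, lr=0.1, batch_size=4, epochs=200)
print(theta)  # should end up close to 3
```

Shuffling each epoch is a common design choice so that every batch is a different small crowd. The last batch may be smaller than `batch_size`, which is why the sketch divides by `len(batch)` rather than by a fixed B.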
So now we have a complete, practical learning loop that can handle massive datasets.
We have all the pieces.
The only question left is... how does this simple process, which we ran on a tiny network with only 5 weights...
...possibly scale up to train a monster like GPT-4, which is reported to have over a trillion weights?
Is it really the same algorithm?
The answer is the most beautiful part of this entire story.
Our Tiny Network
5 weights
GPT-4
~1,760,000,000,000 weights (unofficial estimate; OpenAI has not published the figure)
Identical Process:
✓ Forward Pass
✓ Backpropagation
✓ Gradient Descent
It's the same three-step dance.
Scale changes everything, and yet, it changes nothing.
The stunning truth is this:
You just mastered the core algorithm running inside every AI system on Earth.
ChatGPT, autonomous vehicles, medical diagnosis AI...
...they are all just scaled-up, engineered versions of what we just built together from scratch.
So, how do we get from our simple model to these massive ones?
The core principles don't change. Same engine, bigger scale.
1. More Layers and More Neurons
| Our Network | Real Networks |
| 2 layers | 10-100+ layers |
| 2 hidden neurons | Millions-billions of neurons |
| 5 weights total | Millions to trillions of weights |
This just means the "chain" for the Chain Rule gets much, much longer. The process is identical.
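To see why a longer chain changes nothing about the process, here is a small Python sketch of the chain rule walking backward through several layers. It assumes a deliberately simple setup that is not from the video: one neuron per layer, a `tanh` activation, and made-up weights. The names `forward` and `backward` are illustrative.

```python
import math


def forward(weights, x):
    """Run x through a chain of layers a_k = tanh(w_k * a_{k-1})."""
    activations = [x]
    for w in weights:
        activations.append(math.tanh(w * activations[-1]))
    return activations


def backward(weights, activations):
    """Chain rule, walked from the output back to the first weight."""
    grads = [0.0] * len(weights)
    upstream = 1.0  # d(output)/d(output)
    for k in reversed(range(len(weights))):
        a_in, a_out = activations[k], activations[k + 1]
        local = 1.0 - a_out ** 2                  # derivative of tanh here
        grads[k] = upstream * local * a_in        # blame for this weight
        upstream = upstream * local * weights[k]  # pass blame one layer back
    return grads


weights = [0.5, -1.2, 0.8, 1.5]  # hypothetical 4-layer chain, one neuron each
acts = forward(weights, x=0.7)
grads = backward(weights, acts)

# Cross-check the first weight's gradient with a finite difference.
eps = 1e-6
bumped = [weights[0] + eps] + weights[1:]
numeric = (forward(bumped, 0.7)[-1] - acts[-1]) / eps
print(grads[0], numeric)
```

Adding more entries to `weights` makes the chain longer, but the `backward` loop does not change at all.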
2. More Practical Activation Functions
| Function | Formula | Gradient | Why It's Popular |
| ReLU | `max(0, x)` | 1 if x>0, else 0 | Simple, fast, and helps avoid vanishing gradients in deep networks. |
| Sigmoid | `1/(1+e⁻ˣ)` | `sigmoid(x) * (1-sigmoid(x))` | Perfect for probabilities (0 to 1). |
| Our x² | `x²` | `2x` | Works, but `2x` gradient can explode. |
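The table translates directly into code. This is a minimal sketch using only Python's standard library; the function names are illustrative.

```python
import math


def relu(x):
    return max(0.0, x)


def relu_grad(x):
    return 1.0 if x > 0 else 0.0


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)


def square(x):
    return x * x


def square_grad(x):
    return 2.0 * x


# Compare the gradients at a negative, a small, and a large input.
for x in (-2.0, 0.5, 10.0):
    print(x, relu_grad(x), round(sigmoid_grad(x), 5), square_grad(x))
```

At x = 10 the three behave very differently: the ReLU gradient stays at 1, the sigmoid gradient has shrunk to almost nothing, and the x² gradient has grown to 20.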
3. Better Loss Functions for Different Jobs
| Task | Loss Function | Formula | Example |
| Regression | Mean Squared Error | `(y_true - y_pred)²`, averaged over all examples | Predicting house prices |
| Classification | Cross-Entropy Loss | `-log(p)`, where `p` is the predicted probability of the correct class | Image recognition: 80% "cat" |
The specific formula changes, but its job is always the same: give us a number to kick off the blame game.
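Both losses fit in a few lines of Python. This is a sketch with illustrative function names; the regression numbers reuse the three starting predictions from the batch example, and the classification numbers follow the 80% "cat" case.

```python
import math


def mean_squared_error(y_true, y_pred):
    """Average of the squared differences over all examples."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def cross_entropy(prob_of_true_class):
    """Loss for one example, given the probability assigned to the correct class."""
    return -math.log(prob_of_true_class)


# Regression: targets and starting predictions from the batch example.
print(mean_squared_error([24, 14, 11], [37, 33, 13]))  # 534 / 3 = 178.0

# Classification: the model says 80% "cat" and the image really is a cat.
print(round(cross_entropy(0.8), 3))  # 0.223
# A much less confident correct answer is penalized far more.
print(round(cross_entropy(0.1), 3))  # 2.303
```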
And that's it. That's the whole secret.
You have seen the entire process.
From Midjourney to Tesla...
...they all use the exact same loop.
The Universal Pattern
1. Forward Pass: Make a guess.
2. Loss Function: Measure how wrong the guess is.
3. Backpropagation: Calculate the "blame" for every weight.
4. Gradient Descent: Nudge every weight in the right direction.
5. Repeat: Do this millions, or even billions, of times.
You started this video thinking neural networks were an impenetrable black box.
But now you know the truth.
It's not magic.
It's just a beautiful cascade of simple, intuitive ideas:
finding the bottom of a valley, tuning one knob at a time, and playing a clever game of blame.
You've mastered the fundamentals.
Welcome to the world of AI.