
Machine learning is optimization. Calculus is the tool that tells you how to move your parameters to make the loss go down.
This article is a practical bridge between calculus and ML. We will focus on the ideas you actually use: limits, derivatives, gradients, the chain rule, and why gradient descent works.
No heavy jargon. Just clear explanations, small examples, and the intuition you need to debug real training code.
Training means minimizing a loss function. The loss is a surface over all parameters. Calculus tells us which direction reduces that loss.
Even when the surface is high‑dimensional and messy, the local slope still gives useful guidance.
A derivative is defined using a limit. It measures how a function changes when the input changes by a tiny amount.
f'(x) = lim_{h→0} (f(x+h) - f(x)) / h

In ML, this tiny change is the small step you take in parameter space.
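You can watch the limit at work by plugging in a small but finite h. A minimal sketch, using an illustrative f(x) = x² whose true derivative is 2x:

```python
# Forward-difference approximation of the derivative.
# f(x) = x**2 is an illustrative choice; f'(x) = 2x exactly.

def numerical_derivative(f, x, h=1e-6):
    return (f(x + h) - f(x)) / h

f = lambda x: x ** 2
print(numerical_derivative(f, 3.0))   # ~6.0, matching f'(3) = 6
```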
A derivative is slope. If the slope is positive, moving right increases the loss. If it is negative, moving right decreases the loss.
Gradient descent simply moves in the opposite direction of the slope.
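One step is enough to see it. A minimal sketch on an illustrative 1D loss f(x) = (x − 2)²:

```python
# One gradient step on an illustrative 1D loss f(x) = (x - 2)**2.
# The slope at x = 5 is positive, so we step left, and the loss drops.

f = lambda x: (x - 2) ** 2
df = lambda x: 2 * (x - 2)    # the slope

x = 5.0
x_new = x - 0.1 * df(x)       # move opposite the slope (step size 0.1)
print(f(x), f(x_new))         # 9.0 -> 5.76
```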
Models have many parameters. You need a slope for each one. That is a partial derivative.
∇L = [∂L/∂w1, ∂L/∂w2, ..., ∂L/∂wn]

The gradient vector points uphill. We move downhill by subtracting it.
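As a sketch, take a hypothetical two-parameter loss L(w) = w1² + 3·w2²; each entry of the gradient is one partial derivative:

```python
import numpy as np

# Gradient of an illustrative two-parameter loss L(w) = w1**2 + 3*w2**2.
# Each entry is dL/dwi with the other parameter held fixed.

def grad(w):
    return np.array([2 * w[0], 6 * w[1]])

w = np.array([1.0, -2.0])
print(grad(w))   # [  2. -12.] -- points uphill, so we step the other way
```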
Locally, any smooth function looks linear. This is the key idea behind gradient descent.
If you take a small step opposite the gradient, the loss will usually go down.
w = w - lr * ∇L(w)
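In code, the whole algorithm is a few lines. A minimal sketch, reusing the illustrative quadratic bowl from above; the learning rate 0.1 and the 50 steps are arbitrary choices:

```python
import numpy as np

# Gradient descent: repeatedly apply w = w - lr * grad(w).
# Loss, learning rate, and step count are illustrative choices.

def loss(w):
    return w[0] ** 2 + 3 * w[1] ** 2

def grad(w):
    return np.array([2 * w[0], 6 * w[1]])

w = np.array([1.0, -2.0])
lr = 0.1
for _ in range(50):
    w = w - lr * grad(w)      # small step opposite the gradient

print(w, loss(w))             # w ends up near the minimum at [0, 0]
```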
Neural networks are functions of functions, and the chain rule is what lets us compute derivatives through that stack. Backprop is just the chain rule applied efficiently to every parameter.
If y = f(g(x)), then dy/dx = (dy/dg) * (dg/dx)
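Worked by hand for illustrative choices f(u) = u² and g(x) = 3x + 1; backprop does exactly this bookkeeping for every parameter:

```python
# Chain rule for y = f(g(x)) with illustrative f(u) = u**2, g(x) = 3x + 1.

x = 2.0
u = 3 * x + 1            # forward pass through g
dy_du = 2 * u            # local derivative of f at u
du_dx = 3                # local derivative of g
dy_dx = dy_du * du_dx    # chain rule: multiply local slopes
print(dy_dx)             # 42.0, since y = (3x+1)**2 gives dy/dx = 6*(3x+1)
```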
The Jacobian generalizes the gradient when the output is a vector, and the Hessian captures curvature. Most ML code does not compute Hessians directly, but it helps to know that curvature affects how fast you can learn.
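One concrete way curvature shows up: on an illustrative quadratic L(x) = 0.5·a·x², the second derivative is the constant a, and plain gradient descent only converges while lr < 2/a:

```python
# Curvature limits the step size. For L(x) = 0.5 * a * x**2 the second
# derivative is a, and gradient descent diverges once lr > 2 / a.
# a and the learning rates below are illustrative.

a = 10.0
grad = lambda x: a * x

for lr in (0.05, 0.25):        # 2 / a = 0.2, so 0.25 should blow up
    x = 1.0
    for _ in range(20):
        x = x - lr * grad(x)
    print(lr, x)               # lr=0.05 shrinks x; lr=0.25 diverges
```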
If the learning rate is too large, you overshoot. Too small, and training crawls.
Schedules, warmup, and adaptive optimizers are all ways to manage this trade‑off.
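As one sketch of the idea, a simple decay schedule takes large steps early and small steps late; the 1/(1 + kt) form and every constant here are illustrative, not a recommendation:

```python
# Learning-rate decay: lr shrinks over time. All constants are illustrative.

grad = lambda x: 2 * x          # gradient of the illustrative loss x**2

x, lr0, k = 5.0, 0.9, 0.1
for t in range(100):
    lr = lr0 / (1 + k * t)      # big steps early, careful steps late
    x = x - lr * grad(x)
print(x)                        # near the minimum at 0
```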
Regularization adds a penalty term to the loss. That changes the gradient and pulls parameters toward smaller values.
L2 adds λ||w||^2, L1 adds λ|w|. The derivatives of those penalties are simple and easy to implement.
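Those derivatives in code, with an illustrative λ and weights (note that |w| is not differentiable at 0, so sign(w) acts as a subgradient there):

```python
import numpy as np

# Gradients of the penalty terms alone. lam (λ) and w are illustrative.

lam = 0.01
w = np.array([0.5, -2.0, 0.0])

l2_grad = 2 * lam * w        # d/dw of lam * ||w||**2: proportional pull to 0
l1_grad = lam * np.sign(w)   # subgradient of lam * |w|: constant-size pull
print(l2_grad, l1_grad)
```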
Before reaching for fancier tools, run a few sanity checks:

- Overfit a tiny dataset to test your pipeline (sketched below)
- Plot loss curves to spot learning‑rate issues
- Watch for NaNs and exploding gradients
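A minimal version of the first check, assuming an illustrative one-parameter linear model and four synthetic points. If a model this simple cannot drive the loss toward zero on data it can memorize, the pipeline has a bug:

```python
import numpy as np

# Sanity check: overfit a tiny, perfectly learnable dataset.
# The data, model, and learning rate are all illustrative choices.

X = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * X                              # true slope is 2, so MSE can reach 0

w, lr = 0.0, 0.01
for step in range(500):
    pred = w * X
    grad = 2 * np.mean((pred - y) * X)   # derivative of MSE w.r.t. w
    w -= lr * grad
    if not np.isfinite(w):               # catches NaNs and exploding updates
        raise RuntimeError(f"diverged at step {step}")

print(w, np.mean((w * X - y) ** 2))      # w ≈ 2, loss ≈ 0
```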
Calculus is the engine behind machine learning. Limits define derivatives, derivatives form gradients, and gradients drive optimization.
Once you connect those ideas to training code, the math stops feeling abstract and starts feeling like a practical tool.