Gradient Descent
Gradient descent is a technique for solving the optimization problem of Empirical Risk Minimization (ERM), which simply means minimizing the average loss over our dataset.
- $\ell_i(\theta)$: loss for one sample (e.g. how wrong we are on one image)
- $L(\theta) = \frac{1}{N} \sum_{i=1}^{N} \ell_i(\theta)$: average loss over all $N$ training examples
- $\theta$: model parameters (weights)
To minimize $L(\theta)$, we move the parameters a little bit in the direction that makes the loss decrease fastest; that direction is given by the negative gradient of the loss with respect to $\theta$ (see the toy example below):
$\theta_{t+1} = \theta_t - \eta \, \nabla_\theta L(\theta_t)$
- $\nabla_\theta L(\theta)$: vector of partial derivatives (how much each weight affects the loss)
- $\eta$: learning rate (step size)
- $t$: iteration number
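
As a concrete illustration, here is a minimal sketch of this update rule on a toy one-parameter least-squares problem (the data, learning rate, and iteration count are made up for the example):

import numpy as np

# Toy data: y is roughly 3*x, so the best parameter is about 3 (made-up example)
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 3.0 * x + 0.1 * rng.normal(size=100)

theta = 0.0   # single model parameter
eta = 0.1     # learning rate

for t in range(100):
    # L(theta) = mean((theta*x - y)^2), so dL/dtheta = mean(2*(theta*x - y)*x)
    grad = np.mean(2.0 * (theta * x - y) * x)
    theta = theta - eta * grad        # theta_{t+1} = theta_t - eta * grad L(theta_t)

print(theta)   # converges to roughly 3.0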
Gradient Descent Variants
- Batch (Full) Gradient Descent
  - Compute the gradient using the whole dataset
  - Precise but slow: every sample must be processed before a single update (one update per epoch)
- Stochastic Gradient Descent (SGD)
  - Compute the gradient using one sample at a time
  - Noisy but much faster; the parameters are updated after every sample (one update per sample)
- Mini-Batch SGD
  - Compute the gradient on a small subset (e.g. 32 or 64 samples)
  - Faster than full batch and more stable than pure SGD (one update per batch); the three variants are sketched below
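
The sketch below (plain NumPy with an illustrative mean-squared-error loss; the function names are mine, not from the notes) shows that the three variants differ only in how much data feeds each gradient estimate:

import numpy as np

def grad(theta, X, Y):
    # gradient of the mean squared error L(theta) = mean((X @ theta - Y)**2)
    return 2.0 * X.T @ (X @ theta - Y) / len(Y)

def full_batch_step(theta, X, Y, eta):
    return theta - eta * grad(theta, X, Y)                      # whole dataset: one update per epoch

def sgd_step(theta, X, Y, eta):
    i = np.random.randint(len(Y))                               # one random sample: one update per sample
    return theta - eta * grad(theta, X[i:i+1], Y[i:i+1])

def minibatch_step(theta, X, Y, eta, batch_size=32):
    idx = np.random.choice(len(Y), batch_size, replace=False)   # small subset: one update per batch
    return theta - eta * grad(theta, X[idx], Y[idx])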

How to compute the gradient?
Finite differences require on the order of $n$ forward passes, where $n$ is the number of parameters (a numerical sketch follows this list).
- Forward difference: $\frac{\partial L}{\partial \theta_i} \approx \frac{L(\theta + \epsilon e_i) - L(\theta)}{\epsilon}$ (1 forward pass per parameter)
- Central difference: $\frac{\partial L}{\partial \theta_i} \approx \frac{L(\theta + \epsilon e_i) - L(\theta - \epsilon e_i)}{2\epsilon}$ (2 forward passes per parameter)
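
A minimal sketch of the central-difference check (the quadratic test function and epsilon are chosen just for illustration); note the two evaluations per parameter:

import numpy as np

def numerical_grad(L, theta, eps=1e-5):
    g = np.zeros_like(theta)
    for i in range(theta.size):                            # loop over parameters -> 2n evaluations total
        e = np.zeros_like(theta)
        e[i] = eps
        g[i] = (L(theta + e) - L(theta - e)) / (2 * eps)   # central difference
    return g

L = lambda th: np.sum(th ** 2)                  # test function with known gradient 2*theta
theta = np.array([1.0, -2.0, 0.5])
print(numerical_grad(L, theta))                 # approx. [2.0, -4.0, 1.0]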
Automatic Differentiation
Much more efficient: $O(1)$ per layer, so the full gradient costs only a small constant multiple of one forward pass. We compute derivatives exactly and efficiently by applying the chain rule programmatically.
- Forward-mode AD (propagates derivatives forward along with the computation; efficient when there are few inputs)
- Reverse-mode AD (aka backpropagation: propagates derivatives backward from the output; efficient when there is a single scalar output such as the loss), sketched below with PyTorch
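
For comparison, a minimal sketch of reverse-mode AD using PyTorch's autograd on the same toy quadratic as above; one backward pass returns every partial derivative exactly:

import torch

theta = torch.tensor([1.0, -2.0, 0.5], requires_grad=True)
L = (theta ** 2).sum()      # forward pass records the computation graph
L.backward()                # reverse-mode AD (backpropagation) through the graph
print(theta.grad)           # tensor([ 2., -4.,  1.]), i.e. exactly 2*theta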

PyTorch model
import torch
import torch.nn as nn
import torch.optim as optim

# Define model: a small fully connected network (784-dimensional inputs, 10 classes)
class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(784, 128)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = self.relu(self.fc1(x))
        return self.fc2(x)

# Instantiate model, loss, optimizer
model = Net()
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)

# Training loop: dataloader yields mini-batches (X, Y)
for X, Y in dataloader:
    optimizer.zero_grad()          # clear gradients from the previous iteration
    outputs = model(X)             # forward pass
    loss = criterion(outputs, Y)   # average loss over the mini-batch
    loss.backward()                # <-- backprop (reverse-mode AD)
    optimizer.step()               # <-- gradient descent update
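
To make the loop runnable end to end, dataloader just needs to yield (X, Y) mini-batches. A minimal sketch using random MNIST-shaped tensors (purely illustrative stand-in data, not part of the original notes), defined before the training loop:

from torch.utils.data import DataLoader, TensorDataset

X_fake = torch.randn(1024, 784)            # stand-in for flattened 28x28 images
Y_fake = torch.randint(0, 10, (1024,))     # stand-in class labels 0..9
dataloader = DataLoader(TensorDataset(X_fake, Y_fake), batch_size=64, shuffle=True)

In a real script this would be replaced by an actual dataset loader (e.g. torchvision's MNIST), but the training loop itself stays the same.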