Today, I will be talking about an autograd engine I have created. “Autograd” is short for automatic gradient. An autograd engine lets you implement the forward pass of a neural network and then automatically and efficiently run the backward pass of the training loop.
What is an Autograd Engine?
Autograd engines, like the one in PyTorch, work by keeping track of all operations performed on values, creating an underlying computational graph. After a series of operations results in a final value (like a loss function), you can call a function such as loss.backward(). The engine then uses the graph to run backpropagation, efficiently calculating the gradients of the loss function with respect to the network's parameters.
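For instance, here is a minimal PyTorch sketch of this workflow; the tensor value and the expression are arbitrary examples:

import torch

x = torch.tensor(2.0, requires_grad=True)  # a parameter tracked by autograd
loss = (3 * x + 1) ** 2                    # the forward pass builds the graph
loss.backward()                            # backpropagate through the graph
print(x.grad)                              # d(loss)/dx = 2*(3x + 1)*3 = 42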
A Simple Scalar-Based Engine
The autograd engine I created is a very simple one. It runs backpropagation on individual scalars rather than on tensors, as in PyTorch. This makes it an excellent exercise for educational purposes, though it is not as efficient as the optimized, tensor-based operations in the PyTorch API.
Scalar Wrapper: The Value Class
First, we create a wrapper class for scalar values:
import math

class Value:
    def __init__(self, data, _children=(), _op='', label=''):
        self.data = data
        self.grad = 0.0
        self._prev = set(_children)
        self._op = _op
        self.label = label
        self._backward = lambda: None

    def __repr__(self):
        return f"Value(data={self.data})"

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other), '+')
        def _backward():
            self.grad += 1.0 * out.grad
            other.grad += 1.0 * out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other), '*')
        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def __pow__(self, other):
        assert isinstance(other, (int, float)), "Only supporting int/float powers for now"
        out = Value(self.data ** other, (self,), f'**{other}')
        def _backward():
            self.grad += other * (self.data ** (other - 1)) * out.grad
        out._backward = _backward
        return out

    def exp(self):
        x = self.data
        out = Value(math.exp(x), (self,), 'exp')
        def _backward():
            self.grad += out.data * out.grad
        out._backward = _backward
        return out

    def tanh(self):
        x = self.data
        t = (math.exp(2 * x) - 1) / (math.exp(2 * x) + 1)
        out = Value(t, (self,), 'tanh')
        def _backward():
            self.grad += (1 - t ** 2) * out.grad
        out._backward = _backward
        return out

    def backward(self):
        topo = []
        visited = set()
        def build_topo(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build_topo(child)
                topo.append(v)
        build_topo(self)

        self.grad = 1.0
        for v in reversed(topo):
            v._backward()

    # Additional dunder methods for convenience
    def __radd__(self, other):  # other + self
        return self + other

    def __sub__(self, other):  # self - other
        return self + (-1 * other)

    def __rmul__(self, other):  # other * self
        return self * other

    def __truediv__(self, other):  # self / other
        return self * other**-1
How it Works
When we create Value objects and perform operations with them, the Value class automatically keeps track of the underlying computational graph. Each Value object knows which children created it and the operation used.
Note that value.data holds the actual numerical value of the node, while value.grad holds the derivative of the final output (the loss) with respect to that node's data.
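As a quick check of this bookkeeping (the numbers here are arbitrary):

a = Value(2.0, label='a')
b = Value(-3.0, label='b')
c = a * b; c.label = 'c'
print(c)        # Value(data=-6.0)
print(c._op)    # '*'  -- the operation that produced c
print(c._prev)  # {Value(data=2.0), Value(data=-3.0)} -- its children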
The Forward Pass
To execute the forward pass, we simply perform the calculations with our Value objects. This calculates the final result while simultaneously building the graph. Here is an example with a single neuron:
# Inputs x1, x2
x1 = Value(2.0, label='x1')
x2 = Value(0.0, label='x2')
# Weights w1, w2
w1 = Value(-3.0, label='w1')
w2 = Value(1.0, label='w2')
# Bias of the neuron
b = Value(6.8813735870195432, label='b')
# The forward pass: x1*w1 + x2*w2 + b
x1w1 = x1 * w1; x1w1.label = 'x1*w1'
x2w2 = x2 * w2; x2w2.label = 'x2*w2'
x1w1x2w2 = x1w1 + x2w2; x1w1x2w2.label = 'x1*w1 + x2*w2'
n = x1w1x2w2 + b; n.label = 'n'
o = n.tanh(); o.label = 'o'
The Backward Pass
The backward pass involves finding the derivative of the final output (the loss) with respect to each parameter in the network. The derivative tells us how a small change in a parameter affects the loss. For example, if dL/dw = 6 for some weight w, increasing w slightly will increase the loss. Since the goal of training is to minimize the loss, we adjust each parameter in the opposite direction of its gradient.
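In code, the resulting update is a single line (learning_rate is a hypothetical step size; computing w.grad is the job of the backward pass described next):

# step against the gradient; with dL/dw = 6 and learning_rate = 0.05,
# w.data decreases by 0.3
w.data += -learning_rate * w.grad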
To do this efficiently, we use the chain rule from calculus:
dz/dx = dz/dy ⋅ dy/dx
This allows us to calculate the derivative of the loss with respect to any node by using the derivative from the next node in the graph. In other words, we chain the derivative from later nodes back to earlier nodes using the "local derivative" of the current node.
This is handled by the _backward function defined within each operation. For example, in the __mul__ method, the _backward function propagates the gradient from the output out back to the inputs self and other.
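As a small illustration of the __mul__ rule (the operand values are arbitrary):

a = Value(2.0)
b = Value(-3.0)
c = a * b
c.grad = 1.0   # pretend c is the final output
c._backward()
print(a.grad)  # -3.0, i.e. b.data * c.grad
print(b.grad)  #  2.0, i.e. a.data * c.grad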
To ensure the gradients are computed in the correct order (from the end of the graph to the beginning), we perform a topological sort of the nodes. This creates a list of all nodes in the graph such that each node comes after the children that created it. By reversing this list and calling _backward() on each node, we automatically propagate the gradient from the final output all the way back to the initial inputs.
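Running the backward pass on the single-neuron example from above:

o.backward()
print(o.grad)   # 1.0, by definition (do/do)
print(n.grad)   # ~0.5, the local derivative 1 - tanh(n)**2
print(x1.grad)  # ~-1.5, i.e. w1.data * n.grad
print(w1.grad)  # ~1.0, i.e. x1.data * n.grad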
Building a Neural Network
With the Value engine, we can construct classes for Neuron, Layer, and MLP (Multi-Layer Perceptron) that look similar to those in PyTorch.
import random

class Neuron:
    def __init__(self, nin):
        self.w = [Value(random.uniform(-1, 1)) for _ in range(nin)]
        self.b = Value(random.uniform(-1, 1))

    def __call__(self, x):
        # w * x + b
        act = sum((wi * xi for wi, xi in zip(self.w, x)), self.b)
        out = act.tanh()
        return out

    def parameters(self):
        return self.w + [self.b]

class Layer:
    def __init__(self, nin, nout):
        self.neurons = [Neuron(nin) for _ in range(nout)]

    def __call__(self, x):
        outs = [n(x) for n in self.neurons]
        return outs[0] if len(outs) == 1 else outs

    def parameters(self):
        return [p for neuron in self.neurons for p in neuron.parameters()]

class MLP:
    def __init__(self, nin, nouts):
        sz = [nin] + nouts
        self.layers = [Layer(sz[i], sz[i+1]) for i in range(len(nouts))]

    def __call__(self, x):
        for layer in self.layers:
            x = layer(x)
        return x

    def parameters(self):
        return [p for layer in self.layers for p in layer.parameters()]
Training Example
Let's use our MLP class to solve a sample supervised learning problem.
Dataset
xs = [
    [2.0, 3.0, -1.0],
    [3.0, -1.0, 0.5],
    [0.5, 1.0, 1.0],
    [1.0, 1.0, -1.0],
]
ys = [1.0, -1.0, -1.0, 1.0]  # desired targets
Model
We can define an MLP with 3 input nodes, two hidden layers of 4 nodes each, and 1 output node.
n = MLP(3, [4, 4, 1])
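As a quick sanity check, each neuron owns one weight per input plus a bias, so this network has 4*(3+1) + 4*(4+1) + 1*(4+1) = 41 parameters:

print(len(n.parameters()))  # 41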
Since the MLP is randomly initialized, its initial predictions are also random. Using mean squared error as our loss function, we can now create a training loop.
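Concretely, the loop below minimizes the sum of squared errors, loss = Σᵢ (ŷᵢ − yᵢ)², where ŷᵢ is the network's prediction and yᵢ the target. This differs from the mean squared error only by a constant factor of 1/N, which does not change the minimizer.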
Training Loop
for k in range(200):
    # forward pass
    ypred = [n(x) for x in xs]
    loss = sum((yout - ygt)**2 for ygt, yout in zip(ys, ypred))

    # zero-grad
    for p in n.parameters():
        p.grad = 0.0

    # backward pass
    loss.backward()

    # update parameters
    for p in n.parameters():
        p.data += -0.05 * p.grad

    print(k, loss.data)
The value 0.05 is the learning rate. Over 200 iterations, the loss decreases significantly as the network learns the patterns in the data.
Losses Over Time
0 7.9771228298559995
1 5.9888998188204265
2 5.580189802221308
...
198 0.019069241439568268
199 0.018951243937658565
Final Predictions
The final predictions are now very close to the desired target values.
ypred:
[Value(data=0.9449173273954495),
Value(data=-0.9918931324804381),
Value(data=-0.9285516243623214),
Value(data=0.9273341373045709)]
Further Considerations
Batch gradient descent: We used batch gradient descent (using the entire dataset for each update). Other methods include stochastic gradient descent (one data point) and mini-batch gradient descent (a subset of data).
Loss Functions: We used mean squared error, but other loss functions like max-margin or binary cross-entropy could be used depending on the problem.
Regularization: Techniques like L2 regularization can be added to the loss function to help prevent the model from overfitting to the training data (see the sketch after this list).
Learning Rate: We used a constant learning rate, but implementing learning rate decay can help the model converge more effectively in later training cycles (also sketched below).
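As an illustration of the last two points, here is a sketch of how the training loop above could be extended with L2 regularization and a simple learning-rate decay; the strength alpha and the decay schedule are arbitrary choices, not part of the original code:

for k in range(200):
    # forward pass
    ypred = [n(x) for x in xs]
    data_loss = sum((yout - ygt)**2 for ygt, yout in zip(ys, ypred))

    # L2 regularization: penalize large parameters (alpha is a hypothetical strength)
    alpha = 1e-4
    reg_loss = alpha * sum(p * p for p in n.parameters())
    loss = data_loss + reg_loss

    # zero-grad, then backward pass
    for p in n.parameters():
        p.grad = 0.0
    loss.backward()

    # update with a simple linear learning-rate decay from 0.05 down to 0.005
    learning_rate = 0.05 * (1 - 0.9 * k / 199)
    for p in n.parameters():
        p.data += -learning_rate * p.grad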
Huge thanks to Andrej Karpathy!