Autograd Engine Deep Dive

July 3, 2025

Today, I will be talking about an autograd engine I have created. “Autograd” is short for Automatic Gradient. Autograd engines allow you to implement the forward pass of a neural network and then efficiently and automatically run the backward pass of the training cycle.

What is an Autograd Engine?

Autograd engines, like the one in PyTorch, work by keeping track of all operations performed on values, creating an underlying computational graph. After a series of operations results in a final value (like a loss function), you can call a function such as loss.backward(). The engine then uses the graph to run backpropagation, efficiently calculating the gradients of the loss function with respect to the network's parameters.
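
For instance, here is a minimal sketch of that workflow in PyTorch (the variable names are mine, but the API calls are standard):

import torch

x = torch.tensor(2.0, requires_grad=True)
y = x ** 2 + 3 * x   # forward pass; PyTorch records the computational graph
y.backward()         # backward pass through the recorded graph
print(x.grad)        # dy/dx = 2x + 3 = 7.0 at x = 2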

A Simple Scalar-Based Engine

The autograd engine I created is a very simple one: it runs backpropagation on individual scalars rather than on tensors, as PyTorch does. This makes it an excellent educational exercise, though it is far less efficient than PyTorch's optimized, tensor-based operations.

Scalar Wrapper: The Value Class

First, we create a wrapper class for scalar values:

import math

class Value:
    def __init__(self, data, _children=(), _op='', label=''):
        self.data = data
        self.grad = 0.0
        self._prev = set(_children)
        self._op = _op
        self.label = label
        self._backward = lambda: None

    def __repr__(self):
        return f"Value(data={self.data})"

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other), '+')
        def _backward():
            # Local derivative of addition is 1 for both inputs; gradients
            # accumulate with += so nodes used more than once are handled correctly
            self.grad += 1.0 * out.grad
            other.grad += 1.0 * out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other), '*')
        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def __pow__(self, other):
        assert isinstance(other, (int, float)), "Only supporting int/float powers for now"
        out = Value(self.data ** other, (self,), f'**{other}')
        def _backward():
            self.grad += other * (self.data ** (other - 1)) * out.grad
        out._backward = _backward
        return out

    def exp(self):
        x = self.data
        out = Value(math.exp(x), (self,), 'exp')
        def _backward():
            self.grad += out.data * out.grad
        out._backward = _backward
        return out

    def tanh(self):
        x = self.data
        t = (math.exp(2 * x) - 1) / (math.exp(2 * x) + 1)
        out = Value(t, (self,), 'tanh')
        def _backward():
            self.grad += (1 - t ** 2) * out.grad
        out._backward = _backward
        return out

    def backward(self):
        # Topologically sort the graph so each node appears after its children
        topo = []
        visited = set()
        def build_topo(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build_topo(child)
                topo.append(v)
        build_topo(self)
        # Seed the output gradient, then propagate backwards through the graph
        self.grad = 1.0
        for v in reversed(topo):
            v._backward()
            
    # Additional dunder methods for convenience
    def __radd__(self, other): # other + self
        return self + other

    def __sub__(self, other): # self - other
        return self + (-1 * other)

    def __rmul__(self, other): # other * self
        return self * other

    def __truediv__(self, other): # self / other
        return self * other**-1

How it Works

When we create Value objects and perform operations with them, the Value class automatically keeps track of the underlying computational graph. Each Value object knows which children created it and the operation used.

Note that value.data holds the actual numerical value of the node, while value.grad holds the derivative of the final output (the loss) with respect to that node's data.
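
A quick way to see this bookkeeping in action: after a single multiplication, the output node records both of its children and the operation that produced it.

a = Value(2.0, label='a')
b = Value(-3.0, label='b')
c = a * b; c.label = 'c'

print(c)        # Value(data=-6.0)
print(c._op)    # '*'
print(c._prev)  # {Value(data=2.0), Value(data=-3.0)} (a set, in either order)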

The Forward Pass

To execute the forward pass, we simply perform the calculations with our Value objects. This calculates the final result while simultaneously building the graph. Here is an example with a single neuron:

# Inputs x1, x2
x1 = Value(2.0, label='x1')
x2 = Value(0.0, label='x2')
# Weights w1, w2
w1 = Value(-3.0, label='w1')
w2 = Value(1.0, label='w2')
# Bias of the neuron
b = Value(6.8813735870195432, label='b')

# The forward pass: x1*w1 + x2*w2 + b
x1w1 = x1 * w1; x1w1.label = 'x1*w1'
x2w2 = x2 * w2; x2w2.label = 'x2*w2'
x1w1x2w2 = x1w1 + x2w2; x1w1x2w2.label = 'x1*w1 + x2*w2'
n = x1w1x2w2 + b; n.label = 'n'
o = n.tanh(); o.label = 'o'
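
With this particular bias, the pre-activation works out to n ≈ 0.8814, so the neuron outputs:

print(o.data)  # tanh(0.8814) ≈ 0.7071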

The Backward Pass

The backward pass involves finding the derivative of the final output (the loss) with respect to each parameter in the network. The derivative tells us how a small change in a parameter affects the loss. For example, if dL/dw = 6 for some weight w, increasing w slightly will increase the loss. Since the goal of training is to minimize loss, we adjust parameters in the opposite direction of their gradient.

To do this efficiently, we use the chain rule from calculus:

dz/dx = dz/dy ⋅ dy/dx

This allows us to calculate the derivative of the loss with respect to any node by using the derivative from the next node in the graph. In other words, we chain the derivative from later nodes back to earlier nodes using the "local derivative" of the current node.

This is handled by the _backward function defined within each operation. For example, in the __mul__ method, the _backward function propagates the gradient from the output out back to the inputs self and other.

To ensure the gradients are computed in the correct order (from the end of the graph to the beginning), we perform a topological sort of the nodes. This creates a list of all nodes in the graph such that each node comes after the children that created it. By reversing this list and calling _backward() on each node, we automatically propagate the gradient from the final output all the way back to the initial inputs.
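
Running the backward pass on the neuron from the forward-pass example, the gradients flow from o back through n to the weights and inputs. Since o ≈ 0.7071, the local derivative of tanh is 1 - o**2 ≈ 0.5:

o.backward()

print(w1.grad)  # ≈  1.0  (0.5 * x1.data = 0.5 *  2.0)
print(w2.grad)  # ≈  0.0  (0.5 * x2.data = 0.5 *  0.0)
print(x1.grad)  # ≈ -1.5  (0.5 * w1.data = 0.5 * -3.0)
print(x2.grad)  # ≈  0.5  (0.5 * w2.data = 0.5 *  1.0)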

Building a Neural Network

With the Value engine, we can construct classes for Neuron, Layer, and MLP (Multi-Layer Perceptron) that look similar to those in PyTorch.

import random

class Neuron:
  def __init__(self, nin):
    # Random weights and bias in [-1, 1]
    self.w = [Value(random.uniform(-1,1)) for _ in range(nin)]
    self.b = Value(random.uniform(-1,1))

  def __call__(self, x):
    # w * x + b, squashed through tanh
    act = sum((wi*xi for wi, xi in zip(self.w, x)), self.b)
    out = act.tanh()
    return out

  def parameters(self):
    return self.w + [self.b]

class Layer:
  def __init__(self, nin, nout):
    self.neurons = [Neuron(nin) for _ in range(nout)]
  
  def __call__(self, x):
    outs = [n(x) for n in self.neurons]
    return outs[0] if len(outs) == 1 else outs
      
  def parameters(self):
    return [p for neuron in self.neurons for p in neuron.parameters()]

class MLP:
  def __init__(self, nin, nouts):
    sz = [nin] + nouts
    self.layers = [Layer(sz[i], sz[i+1]) for i in range(len(nouts))]
  
  def __call__(self, x):
    for layer in self.layers:
      x = layer(x)
    return x

  def parameters(self):
    return [p for layer in self.layers for p in layer.parameters()]

Training Example

Let's use our MLP class to solve a sample supervised learning problem.

Dataset

xs = [
  [2.0, 3.0, -1.0],
  [3.0, -1.0, 0.5],
  [0.5, 1.0, 1.0],
  [1.0, 1.0, -1.0],
]
ys = [1.0, -1.0, -1.0, 1.0] # desired targets

Model

We can define an MLP with 3 input nodes, two hidden layers of 4 nodes each, and 1 output node.

n = MLP(3, [4, 4, 1])
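
As a sanity check, this model has 41 trainable parameters: 4·(3+1) weights and biases in the first hidden layer, 4·(4+1) in the second, and 1·(4+1) in the output layer.

print(len(n.parameters()))  # 41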

Since the MLP is randomly initialized, its initial predictions are also random. Using mean squared error as our loss function, we can now create a training loop.

Training Loop

for k in range(200):
  
    # forward pass
    ypred = [n(x) for x in xs]
    loss = sum((yout - ygt)**2 for ygt, yout in zip(ys, ypred))
    
    # zero-grad
    for p in n.parameters():
        p.grad = 0.0
    
    # backward pass
    loss.backward()
    
    # update parameters
    for p in n.parameters():
        p.data += -0.05 * p.grad
    
    print(k, loss.data)

The value 0.05 is the learning rate. Over 200 iterations, the loss decreases significantly as the network learns the patterns in the data.

Losses Over Time

0 7.9771228298559995
1 5.9888998188204265
2 5.580189802221308
...
198 0.019069241439568268
199 0.018951243937658565

Final Predictions

The final predictions are now very close to the desired target values.

ypred:
[Value(data=0.9449173273954495),
 Value(data=-0.9918931324804381),
 Value(data=-0.9285516243623214),
 Value(data=0.9273341373045709)]

Further Considerations

Batch gradient descent: We used batch gradient descent (using the entire dataset for each update). Other methods include stochastic gradient descent (one data point) and mini-batch gradient descent (a subset of data).
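
For instance, a mini-batch variant of the loop above might sample a random subset of the dataset each iteration (the batch size of 2 here is an arbitrary choice for our tiny dataset):

import random

batch_size = 2
idx = random.sample(range(len(xs)), batch_size)  # pick a random subset
ypred = [n(xs[i]) for i in idx]
loss = sum((yout - ys[i])**2 for i, yout in zip(idx, ypred))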

Loss Functions: We used mean squared error, but other loss functions like max-margin or binary cross-entropy could be used depending on the problem.

Regularization: Techniques like L2 regularization can be added to the loss function to help prevent the model from overfitting to the training data.
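
A minimal sketch with this engine, assuming a hand-picked strength alpha, adds the sum of squared parameters to the data loss before calling backward():

alpha = 1e-4  # regularization strength (hypothetical value)
reg_loss = alpha * sum(p * p for p in n.parameters())
total_loss = loss + reg_loss  # backpropagate through this instead of loss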

Learning Rate: We used a constant learning rate, but implementing learning rate decay can help the model converge more effectively in later training cycles.
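
One simple sketch, assuming an exponential schedule, recomputes the learning rate inside the training loop:

learning_rate = 0.05 * (0.99 ** k)  # shrinks each iteration
for p in n.parameters():
    p.data += -learning_rate * p.grad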

Huge thanks to Andrej Karpathy!
