Bigram Character-Level Language Model

July 4, 2025

Today, we will build a bigram character-level language model. Bigram models involve pairs of tokens—or in our case, pairs of characters. For example, in the word “Emma,” we can create the following bigrams:

em, mm, ma

In other words, we want to train a language model so that, when "emma" is in the training data and we feed the model the character 'e', it learns to predict 'm' with high probability.
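Here's a minimal sketch (using a throwaway word variable) of how those pairs line up in Python:

word = "emma"
for ch1, ch2 in zip(word, word[1:]):
    print(ch1, ch2)
# e m
# m m
# m a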

As the name implies, character-level language models operate on characters rather than words or tokens. We will train our model on a dataset of names, like "Emma." We will explore two different paradigms to build this model: the first calculates probabilities based on character frequencies in the training set, and the second uses a neural network, optimizing its weights with gradient descent.

Approach 1: Frequency Counting

Let's start with the first approach. We'll take a list of names and break each one into bigrams. For example, the name "emma" becomes:

.e, em, mm, ma, a.

Notice the special character ., which represents the start and end of a name. When we sample from our model later, it will signal that a name is complete by generating this . character.

Using this training set, we will create a 27x27 tensor (26 letters plus the special '.' character). The first dimension (rows) will represent the first character in a bigram, and the second dimension (columns) will represent the second character. We "train" this simple model by iterating through our bigrams and incrementing the count in the corresponding cell. For example, for the bigram e -> m, we would increment the count in the cell at the row for 'e' and the column for 'm'.

Here’s what the training looks like:

import torch

# Assumes 'words' is a list of names
N = torch.zeros((27, 27), dtype=torch.int32)

# Build the character vocabulary: letters get indices 1-26, '.' gets index 0
chars = sorted(list(set(''.join(words))))
stoi = {c:i+1 for i, c in enumerate(chars)}
stoi['.'] = 0
itos = {i: c for c, i in stoi.items()}

# Count bigrams, padding each name with '.' on both ends
for w in words:
    chs = ['.'] + list(w) + ['.']
    for c1, c2 in zip(chs, chs[1:]):
        ix1 = stoi[c1]
        ix2 = stoi[c2]
        N[ix1, ix2] += 1

Sampling

How do we use this model to generate new names? This process is called sampling. We convert the count matrix N into a probability matrix P by normalizing each row so it sums to 1. Then, starting from the '.' token, we repeatedly sample the next character from the current character's row until we sample '.' again, which marks the end of the name.

Here’s the code to sample from this simple model:

P = N.float()
P /= P.sum(1, keepdims=True)

g = torch.Generator().manual_seed(1010)

for i in range(20):
    out = []
    ix = 0  # start at the '.' token
    while True:
        p = P[ix]  # probability distribution over the character that follows ix
        ix = torch.multinomial(p, num_samples=1, replacement=True, generator=g).item()
        out.append(itos[ix])
        if ix == 0:  # sampled '.', so the name is complete
            break
    print(''.join(out))

Here are some names sampled from the trained model:

sovalbiarerendrit.
ca.
son.
vulue.
kelorarropra.
msli.

While these results aren't perfect, the names are more plausible than sequences of randomly selected characters:

snvflbpxgsxesdtxp.
vr.
sqqnvxxuevptzocarrupgroyslhmgskkobzijlltztbpkgadfhrtdduiuctinsxrqp.
pjza.
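For reference, a uniform baseline like the one above can be produced by sampling every character from a flat distribution over the 27 symbols. Here's a minimal sketch (not necessarily the exact code used for these samples), reusing the itos mapping from earlier:

g = torch.Generator().manual_seed(1010)
p_uniform = torch.ones(27) / 27  # every character equally likely

for i in range(4):
    out = []
    while True:
        ix = torch.multinomial(p_uniform, num_samples=1, replacement=True, generator=g).item()
        out.append(itos[ix])
        if ix == 0:  # '.' marks the end of the name
            break
    print(''.join(out))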

Evaluation

Now, let's evaluate our model. We will use the average negative log likelihood (NLL) as our loss function. Let's break this down: the likelihood is the product of the probabilities the model assigns to every bigram in the training data. Since a product of many small numbers is vanishingly small, we work with the log likelihood instead, which is the sum of the log probabilities. A perfect model would have a log likelihood of 0 (all probabilities equal to 1), and worse models are increasingly negative, so we negate it to get a loss we can minimize, then divide by the number of bigrams to get an average per bigram.

Here is the code for finding the average negative log likelihood:

log_likelihood = 0.0
n = 0
for w in words:
    chs = ['.'] + list(w) + ['.']
    for c1, c2 in zip(chs, chs[1:]):
        ix1 = stoi[c1]
        ix2 = stoi[c2]
        prob = P[ix1, ix2]
        logprob = torch.log(prob)
        log_likelihood += logprob
        n += 1

print(f"{log_likelihood=}")
nll = -log_likelihood
print(f"{nll=}")
print(f"Average NLL: {nll/n}")
log_likelihood=tensor(-559951.5625)
nll=tensor(559951.5625)
Average NLL: 2.4543561935424805

Model Smoothing

One more improvement: what happens if our model encounters a bigram it has never seen before, like "nq"? The count would be 0, leading to a probability of 0. The log of 0 is negative infinity, which makes our loss infinite. An infinite loss is highly undesirable. We shouldn't assign an impossible probability just because a bigram wasn't in our training data.

To solve this, we can apply model smoothing. A simple technique is to add 1 to every count in our N matrix before calculating probabilities. This ensures that no bigram has a zero probability, preventing an infinite loss.
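Here's a minimal sketch of add-one smoothing applied to the earlier normalization step:

P = (N + 1).float()          # add 1 to every count so no bigram has probability 0
P /= P.sum(1, keepdim=True)  # renormalize each row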

Approach 2: The Neural Network

Now, let's move to the second approach. We will frame our bigram character-level language model as a neural network and see that we end up with a very similar result.

Data Preparation

Our inputs (xs) and outputs (ys) are pairs of character indices. For "emma," the data is:

# Bigrams for 'emma': .e, em, mm, ma, a.
xs = tensor([ 0,  5, 13, 13,  1])
ys = tensor([ 5, 13, 13,  1,  0])
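Here's a minimal sketch of how these tensors can be built for a single name, assuming the stoi mapping defined earlier (the training loop below builds them for the whole dataset the same way):

xs, ys = [], []
chs = ['.'] + list("emma") + ['.']
for ch1, ch2 in zip(chs, chs[1:]):
    xs.append(stoi[ch1])
    ys.append(stoi[ch2])
xs = torch.tensor(xs)  # tensor([ 0,  5, 13, 13,  1])
ys = torch.tensor(ys)  # tensor([ 5, 13, 13,  1,  0])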

One-Hot Encoding

Can we simply feed these integer indices into a neural network? This approach is problematic. Neural networks work through mathematical operations, and feeding raw integer indices would imply an ordinal relationship between characters that doesn't exist (e.g., that 'z' is somehow "larger" than 'a').

Therefore, we use one-hot encoding. To feed a character into the network, we create a vector of 27 zeros and place a 1 at the position of that character's index. The output of our network will also be a layer of 27 neurons, where each neuron's activation represents the likelihood of the corresponding character being next.

This is how we one-hot encode the inputs:

import torch.nn.functional as F
xenc = F.one_hot(xs, num_classes=27).float()  # shape [5, 27]; cast to float for the matrix multiply

The Neural Network Model

For this simple network, we'll use a single fully connected linear layer with no non-linearities or biases. This is a simple model, but it's a great starting point.

The forward pass is a matrix multiplication:

# A 27x27 weight matrix for a layer of 27 neurons
W = torch.randn((27, 27))
logits = xenc @ W # Shape: [5, 27] for our 5 examples

The output, logits, contains log-counts. To convert these to probabilities, we exponentiate the logits to get positive counts and then normalize each row so it sums to 1; together, these two operations are known as the softmax function.
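Here's a minimal sketch of that conversion (the same two lines appear in the training loop below):

counts = logits.exp()                         # exponentiate the log-counts into positive counts
probs = counts / counts.sum(1, keepdim=True)  # normalize each row to sum to 1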

Training with Gradient Descent

The initial, untrained model has a high loss of around 3.77. We can train it by using gradient-based optimization to tweak the weights W and minimize the average negative log likelihood.

Here’s the training loop:

# Create the full dataset
xs, ys = [], []
for w in words:
  chs = ['.'] + list(w) + ['.']
  for ch1, ch2 in zip(chs, chs[1:]):
    ix1 = stoi[ch1]
    ix2 = stoi[ch2]
    xs.append(ix1)
    ys.append(ix2)
xs = torch.tensor(xs)
ys = torch.tensor(ys)
num = xs.nelement()

# Initialize the 'network' weights
g = torch.Generator().manual_seed(2147483647)
W = torch.randn((27, 27), generator=g, requires_grad=True)

# Gradient descent
for k in range(100):
  
  # Forward pass
  xenc = F.one_hot(xs, num_classes=27).float()
  logits = xenc @ W
  counts = logits.exp()
  probs = counts / counts.sum(1, keepdims=True)
  loss = -probs[torch.arange(num), ys].log().mean() + 0.01*(W**2).mean() # NLL Loss + Regularization
  
  # Backward pass
  W.grad = None
  loss.backward()
  
  # Update weights
  W.data += -50 * W.grad

Notes on the Neural Network

Regularization as Smoothing

The 0.01*(W**2).mean() term in our loss function is a regularization term. It incentivizes the weights to stay close to zero, which makes the model's probability distributions smoother. This is the neural network's equivalent of the additive smoothing we performed in the first approach.

Model Equivalence

When we perform the matrix multiplication xenc @ W, where xenc is a one-hot encoded vector, the operation is equivalent to simply selecting a row from the W matrix. Each row in W, therefore, contains the logits for the next character. This should sound familiar! The weight matrix W is functionally equivalent to our log-count matrix from the first approach. In fact, W.exp() is equivalent to our count matrix N.
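Here's a minimal sketch of that equivalence, using the trained W and the stoi mapping, with 'e' as an arbitrary example character:

ix = stoi['e']
xenc = F.one_hot(torch.tensor([ix]), num_classes=27).float()
# multiplying a one-hot row vector by W simply selects row ix of W
print(torch.allclose(xenc @ W, W[ix]))  # True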

And that's how we can use two different methods—explicit frequency counting and gradient-based optimization—to create two equivalent bigram character-level language models.
