I've been fascinated by the emergent capabilities that appear as we scale generative models. It surprised me that a simple objective like next-word prediction can produce something that behaves like a world model, and I wanted to understand exactly how that happens at the implementation level. So I worked through Andrej Karpathy's Zero to Hero series, coding each lecture rather than just watching, and documented my progress in this technical blog.
The blog builds the language model stack from first principles: an autograd engine, a bigram character-level model, an MLP, batch normalization, a WaveNet-inspired model, a character-level transformer, a BPE tokenizer, and finally a GPT-2 reproduction pretrained on 4 A100 GPUs.
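To give a flavor of where the series starts, here is a minimal sketch of a scalar autograd engine in the spirit of micrograd. The class and method names are my own illustrative choices, not code from the blog:

```python
# A tiny scalar autograd engine: each Value records its data, its gradient,
# and a closure that propagates gradients to its inputs via the chain rule.
class Value:
    def __init__(self, data, _children=()):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None
        self._prev = set(_children)

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other))
        def _backward():
            # d(out)/d(self) = d(out)/d(other) = 1
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other))
        def _backward():
            # Product rule: each input's gradient is scaled by the other input.
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def backward(self):
        # Topologically sort the computation graph, then apply
        # the chain rule in reverse order.
        topo, visited = [], set()
        def build(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build(child)
                topo.append(v)
        build(self)
        self.grad = 1.0
        for v in reversed(topo):
            v._backward()

a = Value(2.0)
b = Value(3.0)
c = a * b + a   # dc/da = b + 1 = 4, dc/db = a = 2
c.backward()
```

Everything PyTorch's autograd does at tensor granularity, this does at scalar granularity; the later posts in the series scale the same idea up.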
Full reproduction of the 124M-parameter GPT-2 model from scratch. Built the transformer class (blocks, MLP, causal multi-headed self-attention), loaded the released weights to verify correctness, then reset and trained from scratch. Implemented bfloat16 mixed precision, torch.compile, FlashAttention (manual integration), power-of-two tensor sizing, cosine LR decay with linear warmup, selective weight decay, gradient accumulation for 0.5M-token batches, and DDP across 4 A100 GPUs. Trained on the 10B-token FineWeb-Edu subset; evaluated on HellaSwag. The final model beats the reference GPT-2 124M eval score.
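The learning-rate schedule mentioned above (cosine decay with linear warmup) can be sketched in a few lines. The specific constants here are illustrative placeholders, not the values from the training run:

```python
import math

def get_lr(step, max_lr=6e-4, min_lr=6e-5, warmup_steps=700, max_steps=19000):
    """Cosine LR decay with linear warmup (constants are illustrative)."""
    if step < warmup_steps:
        # Linear warmup: ramp from ~0 up to max_lr over warmup_steps.
        return max_lr * (step + 1) / warmup_steps
    if step >= max_steps:
        # After the schedule ends, hold at the floor.
        return min_lr
    # Cosine decay: coeff goes smoothly from 1 to 0 as training progresses.
    decay_ratio = (step - warmup_steps) / (max_steps - warmup_steps)
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))
    return min_lr + coeff * (max_lr - min_lr)
```

The warmup avoids large, noisy updates early in training when gradients are poorly calibrated; the cosine tail lets the model settle into a minimum at a low learning rate rather than stopping abruptly.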
March 20, 2026 – Understanding how tokenization works in LLMs – Byte Pair Encoding, GPT-2/tiktoken patterns, special tokens, and why tokenization explains many LLM quirks.
February 21, 2026 – Building a character-level language model from scratch using transformers – from the simplest bigram model through self-attention, multi-head attention, feed-forward networks, residual connections, and layer norm.
February 19, 2026 – A deep dive into building a WaveNet-inspired MLP architecture, exploring hierarchical structures and improvements to the model.
August 28, 2025 – A deep dive into the inner workings of the backward pass in neural networks, implementing manual backpropagation at the tensor level to match PyTorch's autograd.
August 18, 2025 – A deep dive into Batch Normalization, a technique to stabilize and accelerate the training of deep neural networks.
July 22, 2025 – A character-level language model using a multi-layer perceptron to predict the next character in a sequence.
July 14, 2025 – An exploration of building a bigram character-level language model using both frequency counting and a neural network approach.
July 4, 2025 – A deep dive into building a simple autograd engine for neural networks, explaining automatic differentiation and backpropagation.
July 3, 2025 – A quick introduction to the Hugging Face Transformers library and how to get started with state-of-the-art NLP models.
June 8, 2025