LLM Lab

This is the second iteration of the llm lab notebook. Here’s where I save most of the current/ongoing discourse on large language models, from resources to blog posts to various other cool repositories and projects centered around LLMs/AI. Unlike the last post, I will pivot away from writing this note as a guide for beginners (at least explicitly).

Large Language Models

At this point, learning how LLMs work from scratch feels like learning how Linux works: beyond an individual’s capabilities, but interesting nonetheless. Among all the people/CS grads reading about how LLMs work, only a handful will ever go on to actually train an LLM (not a model, but an actual Large Language Model). Despite this, we are but knowledge seekers, and we seek the ultimate aphrodisiac. So, here goes.

We always start off with a brief history of deep learning, followed by the bitter lesson. A blog on Modern Neural Networks is an excellent read on the why of AI. Another great blog, which shows the sheer scale, effort and money required to actually train an LLM by walking us through DeepSeek’s progress: Modern LLMs with DeepSeek. LLMs are simulators gives a fresh perspective on what LLMs do and how they do it. Here are a few others: 1, 2, 3, 4, 5, 6 and 7. There are countless other blogs and papers on the reasoning capabilities of LLMs, but as far as I have read, these are pretty good.

Apart from the niche blog posts above, here’s some timeless content provided by Karpathy sensei:

This content is extremely good for a programmer wanting to know more about LLMs (though it can get a little advanced at times).

Technicals

Here, following Karpathy’s repositories, we dive a little deeper into the technical aspects of large language models. The technical side is divided into two: the mathematical groundwork/framework and the programming side of it all. Funnily enough, the programming aspect is itself subdivided into: training (pre and post, including RL), inference, and scale (distributed anything). The blog post above about DeepSeek briefly covers all these aspects in an excellent way (it even covers how they collect and organize data). Let’s start with:

Mathematical Framework

Ever since LLMs became popular, transformers have received a lot of attention (pun not intended, this has almost become too annoying). Most of the mathematical background can be picked up by reading the history, but more interesting is studying the emergent behaviour that appears during pre-training. Here are a few resources on the mathematical side:

The diversity in these resources points towards the lack of a unified front (or base?) for the mathematical/theoretical side of transformers and, by extension, large language models. While these resources are excellent, they aren’t of much pragmatic use except to pure researchers.
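For all the theory, the computation at the heart of every transformer is small enough to write out by hand. Here is a minimal sketch of single-head scaled dot-product attention in pure Python, on toy dimensions (real implementations use batched tensor ops and multiple heads, of course):

```python
import math

def softmax(xs):
    # Subtract the max before exponentiating, for numerical stability.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(K[0])
    out = []
    for q in Q:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in K]
        weights = softmax(scores)
        # Output row = weighted average of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

# Two queries attending over three key/value pairs, d_k = 2.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0], [2.0], [3.0]]
print(attention(Q, K, V))
```

Everything else (multi-head, masking, positional encodings) is scaffolding around this one weighted average.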

Programming

Vast amounts of money and effort have gone into the programming aspect of AI since, at the end of the day, it is more a sub-field of CS than of math. In the DeepSeek blog above, the sheer engineering effort and genius is on full display, where one might as well go on to say that we need more GPU engineers than deep learning researchers to advance this field. Here, apart from Karpathy’s repos, we have

Pre-Training

This was once the most researched stage of training any language model, but not anymore (hence the pre). Essentially requiring trillions of tokens to reach convergence, this is easily the most important stage for achieving grokking. Here are a few resources, though old, that capture the essence of training an LLM:
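Whatever the resource, the objective they all build on is the same: given a token sequence, predict each next token and minimize cross-entropy. A toy sketch of how the (input, target) pairs and the loss are formed; the model here is a hypothetical stand-in that returns a uniform distribution, standing in for a real network:

```python
import math

VOCAB_SIZE = 8

def model_probs(context):
    # Hypothetical stand-in for a real network: a uniform distribution
    # over the vocabulary, regardless of context.
    return [1.0 / VOCAB_SIZE] * VOCAB_SIZE

def next_token_loss(tokens):
    """Average cross-entropy of predicting tokens[t+1] from tokens[:t+1]."""
    total = 0.0
    for t in range(len(tokens) - 1):
        probs = model_probs(tokens[: t + 1])  # input: everything up to t
        target = tokens[t + 1]                # target: the next token
        total += -math.log(probs[target])
    return total / (len(tokens) - 1)

# With a uniform model, the loss is exactly log(vocab_size) ≈ 2.079 nats;
# pre-training is the process of pushing this number down over trillions
# of tokens.
print(next_token_loss([3, 1, 4, 1, 5]))
```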

Post-Training

This stage includes fine-tuning and (recently) reinforcement learning to turn the base model into a finished “product” to be used by the public. Following this came the new “reasoning” class of models, which use test-time scaling to achieve higher levels of reasoning ability, as seen in OpenAI’s o1 and o3, Google’s Gemini 2.0 Flash Thinking, and DeepSeek’s R1 models.
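The simplest form of test-time scaling is best-of-N sampling: spend more inference compute by drawing several candidate answers and keeping the one a verifier scores highest. A minimal sketch, where both `sample_answer` and `score` are hypothetical stand-ins for a real model and a real reward/verifier model:

```python
import random

def sample_answer(prompt, rng):
    # Hypothetical stand-in for sampling a completion from a model.
    return rng.randint(0, 100)

def score(prompt, answer):
    # Hypothetical stand-in for a reward/verifier model; here it just
    # prefers answers close to 42.
    return -abs(answer - 42)

def best_of_n(prompt, n, seed=0):
    """Spend more compute at inference: sample n candidates, keep the best."""
    rng = random.Random(seed)
    candidates = [sample_answer(prompt, rng) for _ in range(n)]
    return max(candidates, key=lambda a: score(prompt, a))

# More samples -> the chosen answer can only get better under the verifier.
for n in (1, 4, 64):
    print(n, best_of_n("What is 6 * 7?", n))
```

Real reasoning models internalise this search into long chains of thought, but the compute-for-quality trade is the same idea.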

Scale

The entire paradigm of current AI/LLMs hinges on one thing: scale. The unprecedented scale at which the initial language models were trained led to the discovery, and eventually the refinement, of the astonishing reasoning abilities of current frontier models. While we learn various techniques, formulae and math to understand how these models work and train, a massive gap in understanding appears the moment we peek at how they are trained in real life. This is where engineers come into play: people who deal with distributed systems, memory bandwidth and usage, data flow and distributed training, where thousands of GPUs need to be utilised efficiently to train a model over anywhere from days to weeks. Here are a few good resources that I have come across:
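The core trick underneath all of this is data parallelism: each worker computes gradients on its own shard of the data, then the gradients are averaged (an all-reduce) so every replica takes the same update. A pure-Python simulation of one such step, with no GPUs or frameworks involved, fitting a toy model y = w·x by MSE:

```python
def local_gradient(weights, shard):
    # Each "worker" computes the MSE gradient for y = w * x on its shard.
    g = [0.0] * len(weights)
    for x, y in shard:
        pred = weights[0] * x
        g[0] += 2 * (pred - y) * x / len(shard)
    return g

def all_reduce_mean(grads):
    """Average the per-worker gradients, as an all-reduce would across GPUs."""
    n = len(grads)
    return [sum(g[i] for g in grads) / n for i in range(len(grads[0]))]

def data_parallel_step(weights, shards, lr=0.01):
    # In a real system these gradient computations run in parallel.
    grads = [local_gradient(weights, s) for s in shards]
    avg = all_reduce_mean(grads)
    return [w - lr * g for w, g in zip(weights, avg)]

# Data for y = 3x, split across 4 simulated workers.
data = [(x, 3.0 * x) for x in range(1, 9)]
shards = [data[i::4] for i in range(4)]
w = [0.0]
for _ in range(200):
    w = data_parallel_step(w, shards)
print(w)  # converges toward [3.0], identical to single-worker training
```

Everything the engineering blogs agonise over, overlapping communication with compute, sharding optimizer state, pipeline stages, is about making this averaging step cheap at the scale of thousands of GPUs.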

Languages and Tools

Among the research that came out of the AI boom, a bunch of highly optimized libraries and technology stacks emerged, focused on streamlining the training and inference process for large language models.

For general usage: TorchLib (bindings in Python, C++, Go, Haskell, Rust), ONNX.js (inference on the web), SGLang, MLTon, JAX
For distributed training: Mesh-TensorFlow, torch.distributed, Apache MXNet, DeepSpeed, GPipe, Alpa, openR

Reinforcement Learning

While the above resources and blogs do mention and use RL, they barely scratch the surface. RL was a massive field before the era of token-generating autoregressive models. One of the best resources is OpenAI’s Spinning Up in Deep RL, which provides the sources and papers needed to completely grasp RL.
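The loop those resources build toward can be sketched in a few lines. A minimal tabular Q-learning agent on a toy corridor environment (my own toy setup, not from any of the linked material): walk right to reach the goal, with an epsilon-greedy policy and the standard bootstrapped update:

```python
import random

# Tiny corridor: states 0..4, start at 0, reward 1 for reaching state 4.
N_STATES, GOAL = 5, 4
ACTIONS = (-1, +1)  # move left or right

def step(state, action):
    nxt = min(max(state + action, 0), GOAL)
    reward = 1.0 if nxt == GOAL else 0.0
    return nxt, reward, nxt == GOAL

def q_learning(episodes=500, alpha=0.5, gamma=0.9, eps=0.1, seed=0):
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            # Epsilon-greedy action selection.
            a = rng.randrange(2) if rng.random() < eps else Q[s].index(max(Q[s]))
            nxt, r, done = step(s, ACTIONS[a])
            # Q-learning update: bootstrap off the best next-state value.
            target = r + (0.0 if done else gamma * max(Q[nxt]))
            Q[s][a] += alpha * (target - Q[s][a])
            s = nxt
    return Q

Q = q_learning()
policy = ["L" if q[0] > q[1] else "R" for q in Q[:GOAL]]
print(policy)  # learned policy: always move right toward the goal
```

The RL used on language models swaps the corridor for a prompt, the actions for tokens, and the table for the model’s weights, but the credit-assignment machinery is recognisably the same.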

Another interesting project to look up is PufferLib, an excellent RL library for training and testing agents. A famous application of RL is the Alpha series of models from DeepMind, including AlphaGo, AlphaStar, AlphaGeometry, AlphaQubit and AlphaFold.

These are some of the most interesting non-LLM applications of RL. Chess engines like Stockfish and Leela Chess Zero are also great case studies here.

Diffusion Models

The only class of mathematical framework that managed to beat transformers was diffusion models (though diffusion + transformer is a thing now). These models were the SOTA for image generation for a long time (and still are). Modern video and image generation models all rely on the diffusion training paradigm to generate realistic images. Stable Diffusion, Midjourney and Flux are some of the best image generation models out there. Resources for diffusion (or flow matching in general) are obscured by complicated differential equations and non-intuitive training pipelines. Despite there being tons of resources, this remains quite a challenging field to grasp. Here are some excellent resources: