
Chapter 4: Implementing a GPT Model from Scratch to Generate Text

 

Main Chapter Code

  • 01_main-chapter-code contains the main chapter code

 

Bonus Materials

  • 02_performance-analysis contains optional code analyzing the performance of the GPT model(s) implemented in the main chapter
  • 03_kv-cache implements a KV cache to speed up text generation during inference
  • ch05/07_gpt_to_llama contains a step-by-step guide for converting a GPT architecture implementation to Llama 3.2 and loads pretrained weights from Meta AI (it might be interesting to look at alternative architectures after completing chapter 4, but you can also save that for after reading chapter 5)
  • 04_gqa contains an introduction to Grouped-Query Attention (GQA), which is used by most modern LLMs (Llama 4, gpt-oss, Qwen3, Gemma 3, and many more) as an alternative to regular Multi-Head Attention (MHA)
  • 05_mla contains an introduction to Multi-Head Latent Attention (MLA), which is used by DeepSeek V3 as an alternative to regular Multi-Head Attention (MHA)
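
To give a rough sense of why the KV cache, GQA, and MLA materials above belong together, the sketch below estimates how much memory a KV cache needs under each attention variant. It is a back-of-the-envelope calculation, not code from the bonus folders: the function names and the model configuration (32 layers, 32 heads, head dimension 128, latent dimension 512) are hypothetical values chosen for illustration, and the MLA estimate is simplified to a single compressed latent vector per token and layer.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len,
                   batch_size=1, bytes_per_elem=2):
    """Estimate KV cache size for MHA or GQA.

    The leading factor 2 accounts for storing both keys and values.
    For regular MHA, n_kv_heads equals the number of attention heads;
    for GQA, several query heads share one key/value head, so
    n_kv_heads is smaller.
    """
    return (2 * batch_size * context_len * n_layers
            * n_kv_heads * head_dim * bytes_per_elem)


def mla_cache_bytes(n_layers, latent_dim, context_len,
                    batch_size=1, bytes_per_elem=2):
    """Simplified MLA estimate (assumption): one compressed latent
    vector of size latent_dim is cached per token and layer instead
    of full keys and values."""
    return batch_size * context_len * n_layers * latent_dim * bytes_per_elem


# Hypothetical configuration, bf16/fp16 storage (2 bytes per element)
n_layers, n_heads, head_dim, context_len = 32, 32, 128, 8192

mha = kv_cache_bytes(n_layers, n_kv_heads=n_heads, head_dim=head_dim,
                     context_len=context_len)
gqa = kv_cache_bytes(n_layers, n_kv_heads=8, head_dim=head_dim,
                     context_len=context_len)
mla = mla_cache_bytes(n_layers, latent_dim=512, context_len=context_len)

GiB = 1024 ** 3
print(f"MHA: {mha / GiB:.2f} GiB")  # MHA: 4.00 GiB
print(f"GQA: {gqa / GiB:.2f} GiB")  # GQA: 1.00 GiB
print(f"MLA: {mla / GiB:.2f} GiB")  # MLA: 0.25 GiB
```

With these assumed numbers, sharing each key/value head among four query heads cuts the cache to a quarter of the MHA size, and caching a 512-dimensional latent cuts it further. Actual savings depend on the real model configuration.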

In the video below, I provide a code-along session that covers some of the chapter contents as supplementary material.



Link to the video