AI Engineering From Scratch – Master AI Without Black Boxes

36,982 stars

AI Engineering From Scratch – No Black Boxes, Just Code

You’ve probably used OpenAI, Hugging Face, or LlamaIndex. They’re great, but they hide most of the internals behind abstractions. If you’re the kind of developer who wants to know exactly how an embedding works, what RAG retrieval looks like under the hood, or how to train a small transformer from scratch, you’ve likely wanted to pull back the curtain.

That’s exactly what this GitHub repo does. It’s a hands-on, code-first guide to building AI components from the ground up: no opaque libraries, no magic. Just Python, NumPy, and a lot of clear explanations.


What It Does

ai-engineering-from-scratch is a collection of Jupyter notebooks and scripts that walk you through building core AI/ML components from scratch. It covers:

  • Tokenization – byte-pair encoding and word-level tokenizers
  • Embeddings – Word2Vec, GloVe, and positional encodings
  • Transformers – attention mechanisms, multi-head attention, and a full transformer from scratch
  • Retrieval-Augmented Generation (RAG) – chunking, vector search, and a basic RAG pipeline
  • Fine-tuning – simple examples of adapting pretrained models
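
To make the tokenization item concrete, here is a minimal sketch of the byte-pair encoding training loop. It is not taken from the repo. The function names (`most_frequent_pair`, `merge_pair`, `learn_bpe`) are made up for illustration, and it merges characters rather than raw bytes to keep things short.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    # Ties go to the pair that was seen first (dict insertion order).
    return max(pairs, key=pairs.get) if pairs else None

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def learn_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a whitespace-split corpus."""
    # Assumption: each word starts as a tuple of single characters.
    words = dict(Counter(tuple(w) for w in corpus.split()))
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        words = merge_pair(words, pair)
        merges.append(pair)
    return merges, words

merges, words = learn_bpe("low low low lower lowest", num_merges=2)
print(merges)
```

Each round picks the most frequent adjacent pair and fuses it into one symbol, so frequent words collapse into single tokens while rare ones stay split into smaller pieces.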

Each component is built with readable Python, often using nothing more than NumPy and basic math. You can run everything locally, step through it line by line, and actually understand what’s happening.
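
As a rough idea of what "nothing more than NumPy" can look like, here is a minimal sketch of scaled dot-product attention with a causal mask. It is an illustration written for this write-up, not the repo's implementation. The function names and the toy input shapes are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax: subtract the max before exponentiating."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(q, k, v, mask=None):
    """Return (output, weights) for attention over the last two axes.

    q, k, v: arrays of shape (..., seq_len, d)
    mask:    optional boolean array; True means "this position may be attended to"
    """
    d_k = q.shape[-1]
    scores = q @ k.swapaxes(-1, -2) / np.sqrt(d_k)
    if mask is not None:
        # A large negative score becomes a weight of ~0 after softmax.
        scores = np.where(mask, scores, -1e9)
    weights = softmax(scores, axis=-1)
    return weights @ v, weights

# Toy self-attention: 4 tokens, 8-dimensional vectors, causal (lower-triangular) mask.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
causal = np.tril(np.ones((4, 4), dtype=bool))
out, weights = scaled_dot_product_attention(x, x, x, mask=causal)
print(out.shape, weights.round(2))
```

Printing `weights` shows one row per token. Each row sums to 1, and with the causal mask everything above the diagonal is zero.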


Why It’s Cool

Most tutorials stop at “use this library.” This one keeps going until “here’s the actual algorithm, line by line.” Here’s what makes it stand out:

  • No black boxes – Every layer of a transformer, every attention head, every embedding lookup is implemented in plain code. You can print the tensors, inspect the gradients, and see exactly what changes.
  • Educational by design – The code is heavily commented, with explanations that assume you know Python but not necessarily ML theory. It’s a learning tool, not a production framework.
  • Covers the whole pipeline – From tokenization to training to inference, you follow the full flow. You’re not learning isolated pieces; you’re building a mental model of how they connect.
  • RAG done right – The RAG section is particularly clean: it shows you how to chunk documents, create embeddings from scratch (not just use SentenceTransformers), and do a simple cosine similarity search without relying on FAISS or Elastic
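
The chunk, embed, and search flow from the RAG item can be sketched in a few lines. This is a hedged illustration, not the repo's code. The helper names are hypothetical, and the "embedding" is a plain bag-of-words count vector standing in for whatever learned embeddings the notebooks build.

```python
import numpy as np

def chunk_words(text, size=8, overlap=2):
    """Split text into word windows of `size` that share `overlap` words."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    words = text.split()
    if not words:
        return []
    step = size - overlap
    last_start = max(len(words) - overlap, 1)
    return [" ".join(words[i:i + size]) for i in range(0, last_start, step)]

def build_vocab(chunks):
    """Map each distinct lowercase token to a column index."""
    vocab = {}
    for chunk in chunks:
        for tok in chunk.lower().split():
            vocab.setdefault(tok, len(vocab))
    return vocab

def embed(text, vocab):
    """Assumption: bag-of-words counts as a stand-in for a real embedding."""
    vec = np.zeros(len(vocab))
    for tok in text.lower().split():
        if tok in vocab:
            vec[vocab[tok]] += 1.0
    return vec

def cosine_search(query_vec, matrix, top_k=2):
    """Return (row_index, cosine_similarity) for the top_k most similar rows."""
    norms = np.linalg.norm(matrix, axis=1) * np.linalg.norm(query_vec)
    sims = matrix @ query_vec / np.maximum(norms, 1e-12)  # guard against zero vectors
    order = np.argsort(-sims)[:top_k]
    return [(int(i), float(sims[i])) for i in order]

text = ("byte pair encoding merges frequent symbol pairs into tokens while "
        "attention weighs every token against the others and cosine similarity "
        "ranks stored chunks against a query")
chunks = chunk_words(text, size=8, overlap=2)
vocab = build_vocab(chunks)
matrix = np.stack([embed(c, vocab) for c in chunks])

for idx, score in cosine_search(embed("cosine similarity ranks chunks", vocab), matrix):
    print(round(score, 3), chunks[idx])
```

A brute-force matrix product like this is fine for a few thousand chunks. A dedicated vector index only starts to matter once the scan itself becomes the bottleneck.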

Last updated: Apr 28, 2026