RAG from First Principles, First Edition

This is the code repository for RAG from First Principles, First Edition, published by Packt.

Engineering retrieval-augmented generation systems with Python, LangChain, and LlamaIndex

Jia Huang

About the book

Most developers can spin up a RAG pipeline in an afternoon using LangChain or LlamaIndex. Far fewer understand why retrieval fails or how to fix it. This book is for those who want to go deeper. RAG from First Principles dismantles the retrieval-augmented generation stack layer by layer, explaining how documents are ingested and parsed, why chunking strategy directly affects answer quality, how embedding models encode meaning, what happens inside a vector database, and how sparse and dense retrieval interact in a hybrid system.

Written by Jia Huang, a research engineer and bestselling AI author, the book brings both research depth and production experience to one of AI's most critical engineering disciplines. It is structured as a progressive dialogue between a seasoned engineer and two students, which surfaces the questions practitioners actually ask.

Each chapter builds on the last, moving from data import and chunking through embedding selection, index design, hybrid search, and post-retrieval processing to response generation, evaluation, and advanced paradigms including GraphRAG, Agentic RAG, and Modular RAG. By the end, you'll have the architectural understanding to optimize, debug, and extend your RAG systems with confidence.

Key Learnings

Parse and ingest diverse document types like PDFs, tables, images, web pages, and structured data
Apply the right chunking strategy for your content type and retrieval goals
Select, compare, and fine-tune embedding models for your domain
Design vector indexes and choose the right similarity metrics for production use
Improve result quality with reranking methods including RRF, cross-encoders, and ColBERT
Integrate retrieval results into generation pipelines using prompt engineering and Self-RAG

Chapters

Data Import
Text Chunking
Information Embedding
Vector Storage
Pre-Retrieval Processing
Index Optimization
Retrieval Post-Processing
Response Generation
System Evaluation
Complex RAG Paradigms

English Inputs for the Practical RAG Manuscript

This directory stores English-language replacement inputs for the Chinese files that are actually referenced by the manuscript's code examples.

The package is self-contained: every replacement path in the table below points to a file or folder inside 99-EN, except for examples that are generated by code or loaded from the web. The original Chinese files under 90-文档-Data are unchanged. The local manuscript PDF is ignored by .gitignore and should not be committed or pushed.

How to Use

When localizing a manuscript code block, replace the original data/... path with the matching English path below. MANIFEST.csv contains the same mapping in CSV form.
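The replacement step can be sketched in a few lines of Python. This is only an illustration, not part of the repository: `PATH_MAP` and `localize_paths` are hypothetical names, and the two mapping entries are copied by hand from the Detailed Replacement Mapping table rather than read from MANIFEST.csv, whose column names are not shown here.

```python
import re

# Hypothetical mapping, hand-copied from two rows of the
# Detailed Replacement Mapping table.
PATH_MAP = {
    "data/黑神话/黑神话悟空的设定.txt":
        "99-EN/black-myth-wukong/black_myth_wukong_setting.txt",
    "data/黑神话": "99-EN/black-myth-wukong/",
}


def localize_paths(source: str, mapping: dict[str, str]) -> str:
    """Replace every original data path in `source` with its English path."""
    if not mapping:
        return source
    # Try longer paths first so a full file path wins over the
    # directory prefix it contains.
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(key) for key in keys))
    # Single pass: already-replaced text is never scanned again.
    return pattern.sub(lambda match: mapping[match.group(0)], source)


snippet = 'docs = TextLoader("data/黑神话/黑神话悟空的设定.txt").load()'
print(localize_paths(snippet, PATH_MAP))
```

Matching the longest path first matters because several directory entries in the table are prefixes of file entries; replacing the directory first would leave a half-translated file path behind.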

Chapter and Program Index

This table maps the manuscript chapters/programs to the English input files. The program names follow the repository's code filenames where the same example exists locally.

Manuscript chapter / program area	Program or code block	English input file or folder
Intro / Simple RAG	`00-简单RAG-SimpleRAG/01_01_LlamaIndex_5行代码.py`	`99-EN/black-myth-wukong/black_myth_wukong_setting.txt`
Intro / Simple RAG	`00-简单RAG-SimpleRAG/01_02_LlamaIndex_更换嵌入模型.py`	`99-EN/black-myth-wukong/black_myth_wukong_setting.txt`
Intro / Simple RAG	`00-简单RAG-SimpleRAG/01_03_LlamaIndex_更换生成模型.py`	`99-EN/black-myth-wukong/black_myth_wukong_setting.txt`
Intro / Simple RAG	`00-简单RAG-SimpleRAG/01_04_LlamaIndex_5行代码_DeepSeek.py`	`99-EN/black-myth-wukong/black_myth_wukong_setting.txt`
Intro / Simple RAG	`00-简单RAG-SimpleRAG/01_05_LlamaIndex_5行代码_Ollama.py`	`99-EN/black-myth-wukong/black_myth_wukong_setting.txt`
Intro / Simple RAG	`WebBaseLoader` examples for the Chinese Black Myth Wikipedia URL	`99-EN/black-myth-wukong/black_myth_wukong_wiki.txt`
Chapter 1 / TXT loading	`01-数据导入-DataLoading/01-简单文本读取/01-用LangChain读入txt文件.py`	`99-EN/black-myth-wukong/black_myth_wukong_setting.txt`
Chapter 1 / Directory loading	`01-数据导入-DataLoading/01-简单文本读取/03-01-用LangChain加载目录中所有文档.py`	`99-EN/black-myth-wukong/`
Chapter 1 / Directory loading	`01-数据导入-DataLoading/01-简单文本读取/03-02-用LangChain加载目录时指定参数.py`	`99-EN/black-myth-wukong/`
Chapter 1 / Directory loading	`01-数据导入-DataLoading/01-简单文本读取/03-03-用LangChain加载目录时更改工具.py`	`99-EN/black-myth-wukong/`
Chapter 1 / Directory loading	`01-数据导入-DataLoading/01-简单文本读取/03-04-用LangChain加载目录时跳过错误.py`	`99-EN/black-myth-wukong/`
Chapter 1 / LlamaIndex reader	`01-数据导入-DataLoading/01-简单文本读取/05-用LlamaIndex-加载目录文档.py`	`99-EN/black-myth-wukong/` and `99-EN/black-myth-wukong/black_myth_wukong_setting.txt`
Chapter 1 / Unstructured TXT	`01-数据导入-DataLoading/01-简单文本读取/07-使用Unstructured_v1.py`	`99-EN/black-myth-wukong/black_myth_wukong_setting.txt`
Chapter 1 / Unstructured TXT	`01-数据导入-DataLoading/01-简单文本读取/07-使用Unstructured_v2.py`	`99-EN/black-myth-wukong/black_myth_wukong_setting.txt`
Chapter 1 / JSON as text	`01-数据导入-DataLoading/02-结构化文档读取/01-LangChain-TextLoader-JSON.py`	`99-EN/black-myth-wukong/journey_to_the_west_characters.json`
Chapter 1 / JSONLoader	`01-数据导入-DataLoading/02-结构化文档读取/02-LangCHain-JSONLoader-JSON.py`	`99-EN/black-myth-wukong/black_myth_wukong_characters.json`
Chapter 1 / Markdown loader	`01-数据导入-DataLoading/02-结构化文档读取/04-LangChain-UnstructuredMarkdownLoader.py`	`99-EN/black-myth-wukong/black_myth_wukong_versions.md`
Chapter 1 / Image parsing	`01-数据导入-DataLoading/03-解析图文数据/01-Unstructured读图.py`	`99-EN/assets/black-myth-wukong/black_myth_wukong_english.jpg`
Chapter 1 / PPT parsing	`01-数据导入-DataLoading/03-解析图文数据/02-Unstructured读PPT.py`	`99-EN/black-myth-wukong/black_myth_wukong_slides.pptx`
Chapter 1 / Multimodal PDF-to-image	`01-数据导入-DataLoading/03-解析图文数据/03-大模型读取图文.py`	`99-EN/black-myth-wukong/black_myth_wukong_slides.pdf`
Chapter 1 / PDF parsing	`01-数据导入-DataLoading/04-PDF文件读取/01-使用PyPDF.py`	`99-EN/black-myth-wukong/black_myth_wukong_slides.pdf`
Chapter 1 / PDF parsing	`01-数据导入-DataLoading/04-PDF文件读取/02-使用PyMuPDF.py`	`99-EN/black-myth-wukong/black_myth_wukong_slides.pdf`
Chapter 1 / OCR PDF parsing	`01-数据导入-DataLoading/04-PDF文件读取/03-使用pytesseract+pdf2image.py`	`99-EN/black-myth-wukong/black_myth_wukong_slides.pdf`
Chapter 1 / LlamaParse PDF	`01-数据导入-DataLoading/04-PDF文件读取/04-使用LlamaParser.py`	`99-EN/black-myth-wukong/black_myth_wukong_slides.pdf`
Chapter 1 / Unstructured PDF	`01-数据导入-DataLoading/04-PDF文件读取/06-Unstrctured-使用partition函数解析PDF-v1.py`	`99-EN/black-myth-wukong/black_myth_wukong_slides.pdf`
Chapter 1 / Unstructured PDF	`01-数据导入-DataLoading/04-PDF文件读取/06-Unstrctured-使用partition函数解析PDF-v2.py`	`99-EN/black-myth-wukong/black_myth_wukong_slides.pdf`
Chapter 1 / PDF layout and parent-child parsing	`01-数据导入-DataLoading/04-PDF文件读取/05-LangChain-Unstrucured-PDF-*.py`	`99-EN/assets/shanxi-tourism/云冈石窟-en.pdf`
Chapter 1 / PDF layout and parent-child parsing	`01-数据导入-DataLoading/04-PDF文件读取/09-Parent-Child-Unstructured-*.py`	`99-EN/assets/shanxi-tourism/云冈石窟-en.pdf`
Chapter 1 / CSV loading	`01-数据导入-DataLoading/05-表格数据读取/01-01-导入CSV.py`	`99-EN/black-myth-wukong/black_myth_wukong.csv`
Chapter 2 / Character text splitting	`02-文本切块-DocChunking/01-LangChain-CharacterTextSplitter.py`	`99-EN/shanxi-tourism/yungang_grottoes.txt`
Chapter 2 / Recursive text splitting	`02-文本切块-DocChunking/02-LangChain-RecursiveharacterTextSplitter.py`	`99-EN/shanxi-tourism/yungang_grottoes.txt`
Chapter 2 / Semantic chunking	`02-文本切块-DocChunking/05-LlamaIndex-语义分块.py`	`99-EN/black-myth-wukong/black_myth_wukong_wiki.txt`
Chapter 3 / Recommendation embeddings	`03-向量嵌入-Embedding/01-openai-embedding-recomendation-system.py`	`99-EN/journey-of-extinction-husun/user_reviews.csv` and `99-EN/journey-of-extinction-husun/game_guide.json`
Chapter 3 / Jina clustering	`03-向量嵌入-Embedding/02-jina-embeddings-v3-clustering.py`	`99-EN/journey-of-extinction-husun/jina_games.csv`
Chapter 3 / Multimodal embedding	`03-向量嵌入-Embedding/05-多模态嵌入.py`	`99-EN/assets/multimodal/query_image.jpg`
Chapter 4 / Hybrid retrieval	`04-向量存储-VectorDB/混合检索/Milvus+BGE-M3混合检索-v1-极简.py`	`99-EN/journey-of-extinction-husun/battle_scenes.json`
Chapter 4 / Hybrid retrieval	`04-向量存储-VectorDB/混合检索/Milvus+BGE-M3混合检索-v2-细节.py`	`99-EN/journey-of-extinction-husun/battle_scenes.json`
Chapter 4 / Hybrid retrieval	`04-向量存储-VectorDB/混合检索/Milvus+BGE-M3混合检索-v3-重排.py`	`99-EN/journey-of-extinction-husun/battle_scenes.json`
Chapter 4 / Multimodal retrieval	`04-向量存储-VectorDB/多模态检索/Milvus+Visual-BGE多模态检索-*.py`	`99-EN/multimodal/metadata.json` and `99-EN/assets/multimodal/`
Chapter 5 / Query rewriting	`05-检索前处理-PreRetrieval/02-查询翻译/01-查询重写-*.py`	`99-EN/journey-of-extinction-husun/setting.txt` or `99-EN/black-myth-wukong/black_myth_wukong_setting.txt` depending on the manuscript variant
Chapter 5 / Query decomposition	`05-检索前处理-PreRetrieval/02-查询翻译/02-查询分解-*.py`	`99-EN/journey-of-extinction-husun/setting.txt` or `99-EN/black-myth-wukong/black_myth_wukong_setting.txt` depending on the manuscript variant
Chapter 5 / HyDE query expansion	`05-检索前处理-PreRetrieval/02-查询翻译/04-查询扩展-HyDE假设文档生成.py`	`99-EN/black-myth-wukong/black_myth_wukong_wiki.txt`
Chapter 5 / Text2SQL	`05-检索前处理-PreRetrieval/01-查询构建/Text2SQL/01-Text2SQL-创建数据库表.py`	No package file; the code creates `data/tourism.db`
Chapter 7 / RRF reranking	`07-检索后处理-PostRetrieval/01-重排/01-RRF重排.py`	`99-EN/shanxi-tourism/` and `99-EN/assets/shanxi-tourism/`
Chapter 7 / RankLLM reranking	`07-检索后处理-PostRetrieval/01-重排/05-RankLLM重排.py`	`99-EN/shanxi-tourism/yungang_grottoes.txt`
Chapter 7 / Sentence optimizer compression	`07-检索后处理-PostRetrieval/02-压缩/03-SentenceEmbeddingOptimizer压缩.py`	`99-EN/shanxi-tourism/` and `99-EN/assets/shanxi-tourism/`
Chapter 8 / Prompt template generation	`08-响应生成-Generation/02-通过提示词优化响应/01-使用提示模板明确生成目标.py`	`99-EN/black-myth-wukong/black_myth_wukong_setting.txt`
Chapter 8 / LlamaIndex output parsing	`08-响应生成-Generation/03-通过输出解析控制格式/02-LlamaIndex输出解析.py`	`99-EN/black-myth-wukong/black_myth_wukong_wiki.txt`
Chapter 10 / Weaviate multimodal search	`10-高级RAG-AdvanceRAG/05-MultiModalRAG/01-Weaviate-Multimodal-Search.py`	`99-EN/assets/multimodal/weaviate/` and `99-EN/assets/multimodal/query_image.jpg`

Detailed Replacement Mapping

Manuscript program or loader call	English input to use	Notes
`SimpleDirectoryReader("data/黑神话").load_data()`	`99-EN/black-myth-wukong/`	English replacement directory for the Black Myth examples.
`DirectoryLoader("./data/黑神话")`	`99-EN/black-myth-wukong/`	Use the directory for examples that load all local Black Myth documents.
`DirectoryLoader("data/黑神话", loader_cls=TextLoader)`	`99-EN/black-myth-wukong/`	Text-readable replacements are provided as `.txt`, `.md`, `.csv`, and `.json`.
`WebBaseLoader(web_paths=("https://zh.wikipedia.org/wiki/黑神话：悟空",))`	`99-EN/black-myth-wukong/black_myth_wukong_wiki.txt`	Offline English substitute for the Chinese Wikipedia loading examples. For live web loading, use the English Wikipedia page instead.
`TextLoader("data/黑神话/黑神话悟空的设定.txt")`	`99-EN/black-myth-wukong/black_myth_wukong_setting.txt`	Chapter 1 TXT loading example.
`SimpleDirectoryReader(input_files=["data/黑神话/黑神话悟空的设定.txt"])`	`99-EN/black-myth-wukong/black_myth_wukong_setting.txt`	LlamaIndex single-file loading example.
`text = "data/黑神话/黑神话悟空的设定.txt"`	`99-EN/black-myth-wukong/black_myth_wukong_setting.txt`	Unstructured text parsing example.
`TextLoader("data/西游记人物角色.json")`	`99-EN/black-myth-wukong/journey_to_the_west_characters.json`	JSON-as-text example.
`JSONLoader(file_path="data/黑神话/黑神话人物角色.json", ...)`	`99-EN/black-myth-wukong/black_myth_wukong_characters.json`	Structured JSON loading example.
`image_path = "data/黑神话/黑神话英文.jpg"`	`99-EN/assets/black-myth-wukong/black_myth_wukong_english.jpg`	Bundled copy of the existing English image asset.
`partition_ppt(filename="data/黑神话悟空 PPT.pptx")`	`99-EN/black-myth-wukong/black_myth_wukong_slides.pptx`	Lightweight English PPT replacement generated for the book example.
`partition_ppt(filename="data/黑神话/黑神话悟空 PPT.pptx")`	`99-EN/black-myth-wukong/black_myth_wukong_slides.pptx`	Same PPT example with a directory prefix.
`filename = "data/黑神话/黑神话悟空.pdf"`	`99-EN/black-myth-wukong/black_myth_wukong_slides.pdf`	Lightweight English PDF replacement generated for PDF parsing examples.
`fitz.open("data/黑神话/黑神话悟空.pdf")`	`99-EN/black-myth-wukong/black_myth_wukong_slides.pdf`	PyMuPDF example.
`convert_from_path("data/黑神话/黑神话悟空.pdf")`	`99-EN/black-myth-wukong/black_myth_wukong_slides.pdf`	PDF-to-image example.
`file_path = "data/黑神话/黑神话悟空.csv"`	`99-EN/black-myth-wukong/black_myth_wukong.csv`	CSV loading example.
`DirectoryLoader(path="data/黑神话", glob="**/*.csv")`	`99-EN/black-myth-wukong/black_myth_wukong.csv`	CSV directory loading example.
`markdown_path = "data/黑神话/黑神话版本介绍.md"`	`99-EN/black-myth-wukong/black_myth_wukong_versions.md`	Markdown loader example.
`markdown_path = "data/黑神话/黑悟空版本介绍.md"`	`99-EN/black-myth-wukong/black_myth_wukong_versions.md`	Same Markdown example with the repository filename variant.
`marker_single "data/山西文旅/云冈石窟-en.pdf"`	`99-EN/assets/shanxi-tourism/云冈石窟-en.pdf`	Bundled copy of the existing English PDF.
`file_path = ("data/山西文旅/云冈石窟-en.pdf")`	`99-EN/assets/shanxi-tourism/云冈石窟-en.pdf`	PDF structure extraction example.
`TextLoader("data/山西文旅/云冈石窟.txt")`	`99-EN/shanxi-tourism/yungang_grottoes.txt`	Shanxi tourism TXT loading example.
`SimpleDirectoryReader(input_files=["data/山西文旅/云冈石窟.txt"])`	`99-EN/shanxi-tourism/yungang_grottoes.txt`	LlamaIndex text splitting example.
`doc_dir = "./data/山西文旅"`	`99-EN/shanxi-tourism/` and `99-EN/assets/shanxi-tourism/`	Directory retrieval examples load both TXT and PDF files.
`SimpleDirectoryReader("data/山西文旅").load_data()`	`99-EN/shanxi-tourism/` and `99-EN/assets/shanxi-tourism/`	Sentence optimization and generation examples.
`SimpleDirectoryReader(input_files=["data/灭神纪/灭神纪设定.txt"])`	`99-EN/journey-of-extinction-husun/setting.txt`	Context/index example.
`TextLoader("data/灭神纪/设定.txt", encoding="utf-8")`	`99-EN/journey-of-extinction-husun/setting.txt`	Query rewrite and decomposition examples.
`TextLoader("data/灭神纪/情节片段.txt", encoding="utf-8")`	`99-EN/journey-of-extinction-husun/plot_fragments.txt`	Manuscript references this file, but no tracked Chinese source was found, so an English substitute is supplied.
`SimpleDirectoryReader("data/灭神纪").load_data()`	`99-EN/journey-of-extinction-husun/`	Directory loading example.
`SimpleDirectoryReader("data/灭神纪").load_data()` when the directory includes `人物角色.json`	`99-EN/journey-of-extinction-husun/characters.json`	English character JSON included for directory-level loading.
`pd.read_csv("data/灭神纪/用户评价.csv")`	`99-EN/journey-of-extinction-husun/user_reviews.csv`	Embedding recommendation example.
`open("data/灭神纪/游戏说明.json", "r")`	`99-EN/journey-of-extinction-husun/game_guide.json`	Embedding recommendation example.
`pd.read_csv("data/灭神纪/Jina游戏.csv")`	`99-EN/journey-of-extinction-husun/jina_games.csv`	Source repo uses `游戏描述.csv`; this is the English replacement for the manuscript's Jina clustering example.
`open("data/灭神纪/战斗场景.json", encoding="utf-8")`	`99-EN/journey-of-extinction-husun/battle_scenes.json`	Hybrid retrieval example.
`WukongDataset("data/多模态", "data/多模态/metadata.json")`	`99-EN/multimodal/metadata.json`	English metadata for the multimodal retrieval examples.
`query_image = "data/多模态/query_image.jpg"`	`99-EN/assets/multimodal/query_image.jpg`	Bundled copy of the existing query image.
`image_dir = "data/多模态/Weaviate"`	`99-EN/assets/multimodal/weaviate/`	Bundled image directory for Weaviate multimodal search examples.
`sqlite3.connect("data/tourism.db")`	Not included	The SQLite database is created by manuscript code rather than maintained as a Chinese input document.

Bundled Existing English Assets

These files were already English or did not require text localization, but they are copied into this package so editors can work from the zip alone:

99-EN/assets/shanxi-tourism/云冈石窟-en.pdf
99-EN/assets/shanxi-tourism/五台山-en.pdf
99-EN/assets/shanxi-tourism/佛光寺-en.pdf
99-EN/assets/shanxi-tourism/壶口瀑布-en.pdf
99-EN/assets/shanxi-tourism/山西-en.pdf
99-EN/assets/shanxi-tourism/平遥古城-en.pdf
99-EN/assets/shanxi-tourism/悬空寺-en.pdf
99-EN/assets/shanxi-tourism/晋祠-en.pdf
99-EN/assets/black-myth-wukong/black_myth_wukong_english.jpg
99-EN/assets/multimodal/01.jpg through 99-EN/assets/multimodal/09.jpg
99-EN/assets/multimodal/query_image.jpg
99-EN/assets/multimodal/weaviate/wukong_demon_fight.jpg
99-EN/assets/multimodal/weaviate/wukong_fire_attack.jpg
99-EN/assets/multimodal/weaviate/wukong_vs_white_bone_spirit.jpg

Coverage Notes

All local manuscript data paths found in the extracted code-oriented references are mapped above. The only intentional exceptions are:

Live web examples, such as WebBaseLoader(...), where black_myth_wukong_wiki.txt is provided as an offline English substitute.
Runtime-generated files, such as data/tourism.db, which are created by code rather than maintained as manuscript input files.
Image references without a tracked source file: the source multimodal metadata refers to data/多模态/10.jpg, but the tracked repository contains only 01.jpg through 09.jpg plus query_image.jpg. The English metadata keeps every referenced image path valid inside this package.
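One way to confirm that replacement paths resolve inside the package is a small existence check. The sketch below is an assumption-laden illustration rather than a tool shipped with the repository: `missing_inputs` and `EXPECTED` are hypothetical names, and the listed paths are a small sample taken from the tables above.

```python
from pathlib import Path


def missing_inputs(relative_paths, root="."):
    """Return the entries of `relative_paths` that do not exist under `root`."""
    base = Path(root)
    return [path for path in relative_paths if not (base / path).exists()]


# Hypothetical sample of replacement paths copied from the mapping tables.
EXPECTED = [
    "99-EN/multimodal/metadata.json",
    "99-EN/assets/multimodal/query_image.jpg",
    "99-EN/assets/multimodal/weaviate/",
]

if __name__ == "__main__":
    # Run from the repository root; prints nothing when every path exists.
    for path in missing_inputs(EXPECTED):
        print("missing:", path)
```

The same function could be pointed at the image paths listed in the English metadata file, once its structure is known, to catch references such as the 10.jpg case described above.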

Get to Know the Author

Jia Huang is a Lead Research Engineer at A*STAR (Agency for Science, Technology and Research), Singapore, where his work focuses on NLP, large language models, and applied AI engineering. With over twenty years of experience leading large-scale AI and data projects across government, finance, healthcare, and e-commerce, he brings an unusually practical lens to technically rigorous subjects. In recent years, his research has focused primarily on pre-trained large NLP models and FinTech applications. He is the author of six bestselling technical books, including Hands-on AI Agent Development for Large Model Applications, selected as one of the JD Best Books of 2024, and GPT: How Large Models Are Built, named CSDN's Most Influential IT Book of 2023. His online RAG engineering course has been completed by over 10,000 students.

