This is the code repository for RAG from First Principles, First Edition, published by Packt.
Jia Huang
Most developers can spin up a RAG pipeline in an afternoon using LangChain or LlamaIndex. Far fewer understand why retrieval fails or how to fix it. This book is for those who want to go deeper.

RAG from First Principles dismantles the retrieval-augmented generation stack layer by layer, explaining how documents are ingested and parsed, why chunking strategy directly impacts answer quality, how embedding models encode meaning, what happens inside a vector database, and how sparse and dense retrieval interact in a hybrid system. Written by Jia Huang, a research engineer and bestselling AI author, it brings both research depth and production experience to one of AI's most critical engineering disciplines.

Structured as a progressive dialogue between a seasoned engineer and two students, the book surfaces the questions practitioners actually ask. Each chapter builds on the last, covering topics from data import and chunking to embedding selection, index design, hybrid search, and post-retrieval processing, before moving on to response generation, evaluation, and advanced paradigms including GraphRAG, Agentic RAG, and Modular RAG. By the end, you'll have the architectural understanding to optimize, debug, and extend your RAG systems with confidence.
- Parse and ingest diverse document types like PDFs, tables, images, web pages, and structured data
- Apply the right chunking strategy for your content type and retrieval goals
- Select, compare, and fine-tune embedding models for your domain
- Design vector indexes and choose the right similarity metrics for production use
- Improve result quality with reranking methods including RRF, cross-encoders, and ColBERT
- Integrate retrieval results into generation pipelines using prompt engineering and Self-RAG
- Data Import
- Text Chunking
- Information Embedding
- Vector Storage
- Pre-Retrieval Processing
- Index Optimization
- Retrieval Post-Processing
- Response Generation
- System Evaluation
- Complex RAG Paradigms
This directory stores English-language replacement inputs for the Chinese files that are actually referenced by the manuscript's code examples.
The package is self-contained: every replacement path in the table below points to a file or folder inside 99-EN, except for examples that are generated by code or loaded from the web. The original Chinese files under 90-文档-Data are unchanged. The local manuscript PDF is ignored by .gitignore and should not be committed or pushed.
When localizing a manuscript code block, replace the original data/... path with the matching English path below. MANIFEST.csv contains the same mapping in CSV form.
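The path replacement described above can be sketched as a small helper. This is only an illustration, not a script shipped with the package: `PATH_MAP` and `localize_paths` are hypothetical names, and the dictionary holds just three rows copied from the mapping in this README. The column layout of `MANIFEST.csv` is not described here, so the sketch does not read it.

```python
# Illustrative excerpt of the mapping in this README (not the full table).
PATH_MAP = {
    "data/黑神话/黑神话悟空的设定.txt": "99-EN/black-myth-wukong/black_myth_wukong_setting.txt",
    "data/山西文旅/云冈石窟.txt": "99-EN/shanxi-tourism/yungang_grottoes.txt",
    "data/黑神话": "99-EN/black-myth-wukong/",
}


def localize_paths(code: str, path_map: dict) -> str:
    """Swap quoted manuscript data paths for their English replacements.

    Only whole string literals are replaced, so a directory entry such as
    "data/黑神话" never rewrites part of a longer file path.
    """
    for old, new in path_map.items():
        for prefix in ("", "./"):  # the manuscript uses both spellings
            for quote in ('"', "'"):
                literal = f"{quote}{prefix}{old}{quote}"
                code = code.replace(literal, f"{quote}{new}{quote}")
    return code


snippet = 'loader = TextLoader("data/黑神话/黑神话悟空的设定.txt")'
print(localize_paths(snippet, PATH_MAP))
# loader = TextLoader("99-EN/black-myth-wukong/black_myth_wukong_setting.txt")
```

Matching complete quoted literals, rather than bare substrings, keeps directory-level entries from clobbering file-level ones. Paths that are not in the mapping are left untouched.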
This table maps the manuscript chapters/programs to the English input files. The program names follow the repository's code filenames where the same example exists locally.
| Manuscript chapter / program area | Program or code block | English input file or folder |
|---|---|---|
| Intro / Simple RAG | `00-简单RAG-SimpleRAG/01_01_LlamaIndex_5行代码.py` | `99-EN/black-myth-wukong/black_myth_wukong_setting.txt` |
| Intro / Simple RAG | `00-简单RAG-SimpleRAG/01_02_LlamaIndex_更换嵌入模型.py` | `99-EN/black-myth-wukong/black_myth_wukong_setting.txt` |
| Intro / Simple RAG | `00-简单RAG-SimpleRAG/01_03_LlamaIndex_更换生成模型.py` | `99-EN/black-myth-wukong/black_myth_wukong_setting.txt` |
| Intro / Simple RAG | `00-简单RAG-SimpleRAG/01_04_LlamaIndex_5行代码_DeepSeek.py` | `99-EN/black-myth-wukong/black_myth_wukong_setting.txt` |
| Intro / Simple RAG | `00-简单RAG-SimpleRAG/01_05_LlamaIndex_5行代码_Ollama.py` | `99-EN/black-myth-wukong/black_myth_wukong_setting.txt` |
| Intro / Simple RAG | `WebBaseLoader` examples for the Chinese Black Myth Wikipedia URL | `99-EN/black-myth-wukong/black_myth_wukong_wiki.txt` |
| Chapter 1 / TXT loading | `01-数据导入-DataLoading/01-简单文本读取/01-用LangChain读入txt文件.py` | `99-EN/black-myth-wukong/black_myth_wukong_setting.txt` |
| Chapter 1 / Directory loading | `01-数据导入-DataLoading/01-简单文本读取/03-01-用LangChain加载目录中所有文档.py` | `99-EN/black-myth-wukong/` |
| Chapter 1 / Directory loading | `01-数据导入-DataLoading/01-简单文本读取/03-02-用LangChain加载目录时指定参数.py` | `99-EN/black-myth-wukong/` |
| Chapter 1 / Directory loading | `01-数据导入-DataLoading/01-简单文本读取/03-03-用LangChain加载目录时更改工具.py` | `99-EN/black-myth-wukong/` |
| Chapter 1 / Directory loading | `01-数据导入-DataLoading/01-简单文本读取/03-04-用LangChain加载目录时跳过错误.py` | `99-EN/black-myth-wukong/` |
| Chapter 1 / LlamaIndex reader | `01-数据导入-DataLoading/01-简单文本读取/05-用LlamaIndex-加载目录文档.py` | `99-EN/black-myth-wukong/` and `99-EN/black-myth-wukong/black_myth_wukong_setting.txt` |
| Chapter 1 / Unstructured TXT | `01-数据导入-DataLoading/01-简单文本读取/07-使用Unstructured_v1.py` | `99-EN/black-myth-wukong/black_myth_wukong_setting.txt` |
| Chapter 1 / Unstructured TXT | `01-数据导入-DataLoading/01-简单文本读取/07-使用Unstructured_v2.py` | `99-EN/black-myth-wukong/black_myth_wukong_setting.txt` |
| Chapter 1 / JSON as text | `01-数据导入-DataLoading/02-结构化文档读取/01-LangChain-TextLoader-JSON.py` | `99-EN/black-myth-wukong/journey_to_the_west_characters.json` |
| Chapter 1 / JSONLoader | `01-数据导入-DataLoading/02-结构化文档读取/02-LangCHain-JSONLoader-JSON.py` | `99-EN/black-myth-wukong/black_myth_wukong_characters.json` |
| Chapter 1 / Markdown loader | `01-数据导入-DataLoading/02-结构化文档读取/04-LangChain-UnstructuredMarkdownLoader.py` | `99-EN/black-myth-wukong/black_myth_wukong_versions.md` |
| Chapter 1 / Image parsing | `01-数据导入-DataLoading/03-解析图文数据/01-Unstructured读图.py` | `99-EN/assets/black-myth-wukong/black_myth_wukong_english.jpg` |
| Chapter 1 / PPT parsing | `01-数据导入-DataLoading/03-解析图文数据/02-Unstructured读PPT.py` | `99-EN/black-myth-wukong/black_myth_wukong_slides.pptx` |
| Chapter 1 / Multimodal PDF-to-image | `01-数据导入-DataLoading/03-解析图文数据/03-大模型读取图文.py` | `99-EN/black-myth-wukong/black_myth_wukong_slides.pdf` |
| Chapter 1 / PDF parsing | `01-数据导入-DataLoading/04-PDF文件读取/01-使用PyPDF.py` | `99-EN/black-myth-wukong/black_myth_wukong_slides.pdf` |
| Chapter 1 / PDF parsing | `01-数据导入-DataLoading/04-PDF文件读取/02-使用PyMuPDF.py` | `99-EN/black-myth-wukong/black_myth_wukong_slides.pdf` |
| Chapter 1 / OCR PDF parsing | `01-数据导入-DataLoading/04-PDF文件读取/03-使用pytesseract+pdf2image.py` | `99-EN/black-myth-wukong/black_myth_wukong_slides.pdf` |
| Chapter 1 / LlamaParse PDF | `01-数据导入-DataLoading/04-PDF文件读取/04-使用LlamaParser.py` | `99-EN/black-myth-wukong/black_myth_wukong_slides.pdf` |
| Chapter 1 / Unstructured PDF | `01-数据导入-DataLoading/04-PDF文件读取/06-Unstrctured-使用partition函数解析PDF-v1.py` | `99-EN/black-myth-wukong/black_myth_wukong_slides.pdf` |
| Chapter 1 / Unstructured PDF | `01-数据导入-DataLoading/04-PDF文件读取/06-Unstrctured-使用partition函数解析PDF-v2.py` | `99-EN/black-myth-wukong/black_myth_wukong_slides.pdf` |
| Chapter 1 / PDF layout and parent-child parsing | `01-数据导入-DataLoading/04-PDF文件读取/05-LangChain-Unstrucured-PDF-*.py` | `99-EN/assets/shanxi-tourism/云冈石窟-en.pdf` |
| Chapter 1 / PDF layout and parent-child parsing | `01-数据导入-DataLoading/04-PDF文件读取/09-Parent-Child-Unstructured-*.py` | `99-EN/assets/shanxi-tourism/云冈石窟-en.pdf` |
| Chapter 1 / CSV loading | `01-数据导入-DataLoading/05-表格数据读取/01-01-导入CSV.py` | `99-EN/black-myth-wukong/black_myth_wukong.csv` |
| Chapter 2 / Character text splitting | `02-文本切块-DocChunking/01-LangChain-CharacterTextSplitter.py` | `99-EN/shanxi-tourism/yungang_grottoes.txt` |
| Chapter 2 / Recursive text splitting | `02-文本切块-DocChunking/02-LangChain-RecursiveharacterTextSplitter.py` | `99-EN/shanxi-tourism/yungang_grottoes.txt` |
| Chapter 2 / Semantic chunking | `02-文本切块-DocChunking/05-LlamaIndex-语义分块.py` | `99-EN/black-myth-wukong/black_myth_wukong_wiki.txt` |
| Chapter 3 / Recommendation embeddings | `03-向量嵌入-Embedding/01-openai-embedding-recomendation-system.py` | `99-EN/journey-of-extinction-husun/user_reviews.csv` and `99-EN/journey-of-extinction-husun/game_guide.json` |
| Chapter 3 / Jina clustering | `03-向量嵌入-Embedding/02-jina-embeddings-v3-clustering.py` | `99-EN/journey-of-extinction-husun/jina_games.csv` |
| Chapter 3 / Multimodal embedding | `03-向量嵌入-Embedding/05-多模态嵌入.py` | `99-EN/assets/multimodal/query_image.jpg` |
| Chapter 4 / Hybrid retrieval | `04-向量存储-VectorDB/混合检索/Milvus+BGE-M3混合检索-v1-极简.py` | `99-EN/journey-of-extinction-husun/battle_scenes.json` |
| Chapter 4 / Hybrid retrieval | `04-向量存储-VectorDB/混合检索/Milvus+BGE-M3混合检索-v2-细节.py` | `99-EN/journey-of-extinction-husun/battle_scenes.json` |
| Chapter 4 / Hybrid retrieval | `04-向量存储-VectorDB/混合检索/Milvus+BGE-M3混合检索-v3-重排.py` | `99-EN/journey-of-extinction-husun/battle_scenes.json` |
| Chapter 4 / Multimodal retrieval | `04-向量存储-VectorDB/多模态检索/Milvus+Visual-BGE多模态检索-*.py` | `99-EN/multimodal/metadata.json` and `99-EN/assets/multimodal/` |
| Chapter 5 / Query rewriting | `05-检索前处理-PreRetrieval/02-查询翻译/01-查询重写-*.py` | `99-EN/journey-of-extinction-husun/setting.txt` or `99-EN/black-myth-wukong/black_myth_wukong_setting.txt`, depending on the manuscript variant |
| Chapter 5 / Query decomposition | `05-检索前处理-PreRetrieval/02-查询翻译/02-查询分解-*.py` | `99-EN/journey-of-extinction-husun/setting.txt` or `99-EN/black-myth-wukong/black_myth_wukong_setting.txt`, depending on the manuscript variant |
| Chapter 5 / HyDE query expansion | `05-检索前处理-PreRetrieval/02-查询翻译/04-查询扩展-HyDE假设文档生成.py` | `99-EN/black-myth-wukong/black_myth_wukong_wiki.txt` |
| Chapter 5 / Text2SQL | `05-检索前处理-PreRetrieval/01-查询构建/Text2SQL/01-Text2SQL-创建数据库表.py` | No package file; the code creates `data/tourism.db` |
| Chapter 7 / RRF reranking | `07-检索后处理-PostRetrieval/01-重排/01-RRF重排.py` | `99-EN/shanxi-tourism/` and `99-EN/assets/shanxi-tourism/` |
| Chapter 7 / RankLLM reranking | `07-检索后处理-PostRetrieval/01-重排/05-RankLLM重排.py` | `99-EN/shanxi-tourism/yungang_grottoes.txt` |
| Chapter 7 / Sentence optimizer compression | `07-检索后处理-PostRetrieval/02-压缩/03-SentenceEmbeddingOptimizer压缩.py` | `99-EN/shanxi-tourism/` and `99-EN/assets/shanxi-tourism/` |
| Chapter 8 / Prompt template generation | `08-响应生成-Generation/02-通过提示词优化响应/01-使用提示模板明确生成目标.py` | `99-EN/black-myth-wukong/black_myth_wukong_setting.txt` |
| Chapter 8 / LlamaIndex output parsing | `08-响应生成-Generation/03-通过输出解析控制格式/02-LlamaIndex输出解析.py` | `99-EN/black-myth-wukong/black_myth_wukong_wiki.txt` |
| Chapter 10 / Weaviate multimodal search | `10-高级RAG-AdvanceRAG/05-MultiModalRAG/01-Weaviate-Multimodal-Search.py` | `99-EN/assets/multimodal/weaviate/` and `99-EN/assets/multimodal/query_image.jpg` |

The following table maps individual manuscript programs and loader calls to the English input to use.

| Manuscript program or loader call | English input to use | Notes |
|---|---|---|
| `SimpleDirectoryReader("data/黑神话").load_data()` | `99-EN/black-myth-wukong/` | English replacement directory for the Black Myth examples. |
| `DirectoryLoader("./data/黑神话")` | `99-EN/black-myth-wukong/` | Use the directory for examples that load all local Black Myth documents. |
| `DirectoryLoader("data/黑神话", loader_cls=TextLoader)` | `99-EN/black-myth-wukong/` | Text-readable replacements are provided as `.txt`, `.md`, `.csv`, and `.json`. |
| `WebBaseLoader(web_paths=("https://zh.wikipedia.org/wiki/黑神话:悟空",))` | `99-EN/black-myth-wukong/black_myth_wukong_wiki.txt` | Offline English substitute for the Chinese Wikipedia loading examples. For live web loading, use the English Wikipedia page instead. |
| `TextLoader("data/黑神话/黑神话悟空的设定.txt")` | `99-EN/black-myth-wukong/black_myth_wukong_setting.txt` | Chapter 1 TXT loading example. |
| `SimpleDirectoryReader(input_files=["data/黑神话/黑神话悟空的设定.txt"])` | `99-EN/black-myth-wukong/black_myth_wukong_setting.txt` | LlamaIndex single-file loading example. |
| `text = "data/黑神话/黑神话悟空的设定.txt"` | `99-EN/black-myth-wukong/black_myth_wukong_setting.txt` | Unstructured text parsing example. |
| `TextLoader("data/西游记人物角色.json")` | `99-EN/black-myth-wukong/journey_to_the_west_characters.json` | JSON-as-text example. |
| `JSONLoader(file_path="data/黑神话/黑神话人物角色.json", ...)` | `99-EN/black-myth-wukong/black_myth_wukong_characters.json` | Structured JSON loading example. |
| `image_path = "data/黑神话/黑神话英文.jpg"` | `99-EN/assets/black-myth-wukong/black_myth_wukong_english.jpg` | Bundled copy of the existing English image asset. |
| `partition_ppt(filename="data/黑神话悟空 PPT.pptx")` | `99-EN/black-myth-wukong/black_myth_wukong_slides.pptx` | Lightweight English PPT replacement generated for the book example. |
| `partition_ppt(filename="data/黑神话/黑神话悟空 PPT.pptx")` | `99-EN/black-myth-wukong/black_myth_wukong_slides.pptx` | Same PPT example with a directory prefix. |
| `filename = "data/黑神话/黑神话悟空.pdf"` | `99-EN/black-myth-wukong/black_myth_wukong_slides.pdf` | Lightweight English PDF replacement generated for PDF parsing examples. |
| `fitz.open("data/黑神话/黑神话悟空.pdf")` | `99-EN/black-myth-wukong/black_myth_wukong_slides.pdf` | PyMuPDF example. |
| `convert_from_path("data/黑神话/黑神话悟空.pdf")` | `99-EN/black-myth-wukong/black_myth_wukong_slides.pdf` | PDF-to-image example. |
| `file_path = "data/黑神话/黑神话悟空.csv"` | `99-EN/black-myth-wukong/black_myth_wukong.csv` | CSV loading example. |
| `DirectoryLoader(path="data/黑神话", glob="**/*.csv")` | `99-EN/black-myth-wukong/black_myth_wukong.csv` | CSV directory loading example. |
| `markdown_path = "data/黑神话/黑神话版本介绍.md"` | `99-EN/black-myth-wukong/black_myth_wukong_versions.md` | Markdown loader example. |
| `markdown_path = "data/黑神话/黑悟空版本介绍.md"` | `99-EN/black-myth-wukong/black_myth_wukong_versions.md` | Same Markdown example with the repository filename variant. |
| `marker_single "data/山西文旅/云冈石窟-en.pdf"` | `99-EN/assets/shanxi-tourism/云冈石窟-en.pdf` | Bundled copy of the existing English PDF. |
| `file_path = ("data/山西文旅/云冈石窟-en.pdf")` | `99-EN/assets/shanxi-tourism/云冈石窟-en.pdf` | PDF structure extraction example. |
| `TextLoader("data/山西文旅/云冈石窟.txt")` | `99-EN/shanxi-tourism/yungang_grottoes.txt` | Shanxi tourism TXT loading example. |
| `SimpleDirectoryReader(input_files=["data/山西文旅/云冈石窟.txt"])` | `99-EN/shanxi-tourism/yungang_grottoes.txt` | LlamaIndex text splitting example. |
| `doc_dir = "./data/山西文旅"` | `99-EN/shanxi-tourism/` and `99-EN/assets/shanxi-tourism/` | Directory retrieval examples load both TXT and PDF files. |
| `SimpleDirectoryReader("data/山西文旅").load_data()` | `99-EN/shanxi-tourism/` and `99-EN/assets/shanxi-tourism/` | Sentence optimization and generation examples. |
| `SimpleDirectoryReader(input_files=["data/灭神纪/灭神纪设定.txt"])` | `99-EN/journey-of-extinction-husun/setting.txt` | Context/index example. |
| `TextLoader("data/灭神纪/设定.txt", encoding="utf-8")` | `99-EN/journey-of-extinction-husun/setting.txt` | Query rewrite and decomposition examples. |
| `TextLoader("data/灭神纪/情节片段.txt", encoding="utf-8")` | `99-EN/journey-of-extinction-husun/plot_fragments.txt` | Manuscript references this file, but no tracked Chinese source was found, so an English substitute is supplied. |
| `SimpleDirectoryReader("data/灭神纪").load_data()` | `99-EN/journey-of-extinction-husun/` | Directory loading example. |
| `SimpleDirectoryReader("data/灭神纪").load_data()` when the directory includes `人物角色.json` | `99-EN/journey-of-extinction-husun/characters.json` | English character JSON included for directory-level loading. |
| `pd.read_csv("data/灭神纪/用户评价.csv")` | `99-EN/journey-of-extinction-husun/user_reviews.csv` | Embedding recommendation example. |
| `open("data/灭神纪/游戏说明.json", "r")` | `99-EN/journey-of-extinction-husun/game_guide.json` | Embedding recommendation example. |
| `pd.read_csv("data/灭神纪/Jina游戏.csv")` | `99-EN/journey-of-extinction-husun/jina_games.csv` | Source repo uses `游戏描述.csv`; this is the English replacement for the manuscript's Jina clustering example. |
| `open("data/灭神纪/战斗场景.json", encoding="utf-8")` | `99-EN/journey-of-extinction-husun/battle_scenes.json` | Hybrid retrieval example. |
| `WukongDataset("data/多模态", "data/多模态/metadata.json")` | `99-EN/multimodal/metadata.json` | English metadata for the multimodal retrieval examples. |
| `query_image = "data/多模态/query_image.jpg"` | `99-EN/assets/multimodal/query_image.jpg` | Bundled copy of the existing query image. |
| `image_dir = "data/多模态/Weaviate"` | `99-EN/assets/multimodal/weaviate/` | Bundled image directory for Weaviate multimodal search examples. |
| `sqlite3.connect("data/tourism.db")` | Not included | The SQLite database is created by manuscript code rather than maintained as a Chinese input document. |

These files were already English or did not require text localization, but they are copied into this package so editors can work from the zip alone:
- `99-EN/assets/shanxi-tourism/云冈石窟-en.pdf`
- `99-EN/assets/shanxi-tourism/五台山-en.pdf`
- `99-EN/assets/shanxi-tourism/佛光寺-en.pdf`
- `99-EN/assets/shanxi-tourism/壶口瀑布-en.pdf`
- `99-EN/assets/shanxi-tourism/山西-en.pdf`
- `99-EN/assets/shanxi-tourism/平遥古城-en.pdf`
- `99-EN/assets/shanxi-tourism/悬空寺-en.pdf`
- `99-EN/assets/shanxi-tourism/晋祠-en.pdf`
- `99-EN/assets/black-myth-wukong/black_myth_wukong_english.jpg`
- `99-EN/assets/multimodal/01.jpg` through `99-EN/assets/multimodal/09.jpg`
- `99-EN/assets/multimodal/query_image.jpg`
- `99-EN/assets/multimodal/weaviate/wukong_demon_fight.jpg`
- `99-EN/assets/multimodal/weaviate/wukong_fire_attack.jpg`
- `99-EN/assets/multimodal/weaviate/wukong_vs_white_bone_spirit.jpg`

All local manuscript data paths found in the extracted code-oriented references are mapped above. The only intentional exceptions are:
- Live web examples, such as `WebBaseLoader(...)`, where `black_myth_wukong_wiki.txt` is provided as an offline English substitute.
- Runtime-generated files, such as `data/tourism.db`, which are created by code rather than maintained as manuscript input files.
- The source multimodal metadata refers to `data/多模态/10.jpg`, but the tracked repository contains `01.jpg` through `09.jpg` plus `query_image.jpg`. The English metadata keeps all referenced image paths valid inside this package.

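One way to spot-check that the mapped replacement paths are present is a short existence check. The sketch below is an assumption-laden illustration rather than a script in this package: `REPLACEMENTS` and `find_missing` are hypothetical names, the list is only a three-path sample from the tables above, and it assumes the check is run from the directory that contains `99-EN`.

```python
from pathlib import Path

# Sample of replacement paths copied from the mapping tables (not exhaustive).
REPLACEMENTS = [
    "99-EN/black-myth-wukong/black_myth_wukong_setting.txt",
    "99-EN/shanxi-tourism/yungang_grottoes.txt",
    "99-EN/assets/multimodal/query_image.jpg",
]


def find_missing(repo_root, replacements):
    """Return the replacement paths that do not exist under repo_root."""
    root = Path(repo_root)
    return [p for p in replacements if not (root / p).exists()]


if __name__ == "__main__":
    # Assumes the current directory is the one that contains 99-EN.
    missing = find_missing(".", REPLACEMENTS)
    if missing:
        print("Missing replacement inputs:")
        for path in missing:
            print("  " + path)
    else:
        print("All sampled replacement inputs are present.")
```

Entries that the tables mark as runtime-generated or web-loaded, such as `data/tourism.db`, would be left out of such a list because they are intentionally not part of the package.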
Jia Huang is a Lead Research Engineer at A*STAR (Agency for Science, Technology and Research), Singapore, where his work focuses on NLP, large language models, and applied AI engineering. With over twenty years of experience leading large-scale AI and data projects across government, finance, healthcare, and e-commerce, he brings an unusually practical lens to technically rigorous subjects. In recent years, his research has primarily focused on NLP pre-trained large models and FinTech applications. He is the author of six bestselling technical books, including Hands-on AI Agent Development for Large Model Applications, selected as one of the JD Best Books of 2024, and GPT: How Large Models Are Built, named CSDN's Most Influential IT Book of 2023. His online RAG engineering course has been completed by over 10,000 students.