PacktPublishing/RAG-from-First-Principles

RAG from First Principles, First Edition

This is the code repository for RAG from First Principles, First Edition, published by Packt.

Engineering retrieval-augmented generation systems with Python, LangChain, and LlamaIndex

Jia Huang

Free PDF       Graphic Bundle       Amazon      

About the book

Most developers can spin up a RAG pipeline in an afternoon using LangChain or LlamaIndex. Far fewer understand why retrieval fails or how to fix it. This book is for those who want to go deeper.

RAG from First Principles dismantles the retrieval-augmented generation stack layer by layer, explaining how documents are ingested and parsed, why chunking strategy directly impacts answer quality, how embedding models encode meaning, what happens inside a vector database, and how sparse and dense retrieval interact in a hybrid system. Written by Jia Huang, a research engineer and bestselling AI author, it brings both research depth and production experience to one of AI's most critical engineering disciplines.

Structured as a progressive dialogue between a seasoned engineer and two students, the book surfaces the questions practitioners actually ask. Each chapter builds on the last, covering topics from data import and chunking to embedding selection, index design, hybrid search, and post-retrieval processing, before moving on to response generation, evaluation, and advanced paradigms including GraphRAG, Agentic RAG, and Modular RAG. By the end, you'll have the architectural understanding to optimize, debug, and extend your RAG systems with confidence.

*Email sign-up and proof of purchase required

Key Learnings

  • Parse and ingest diverse document types like PDFs, tables, images, web pages, and structured data
  • Apply the right chunking strategy for your content type and retrieval goals
  • Select, compare, and fine-tune embedding models for your domain
  • Design vector indexes and choose the right similarity metrics for production use
  • Improve result quality with reranking methods including RRF, cross-encoders, and ColBERT
  • Integrate retrieval results into generation pipelines using prompt engineering and Self-RAG

Chapters

  1. Data Import
  2. Text Chunking
  3. Information Embedding
  4. Vector Storage
  5. Pre-Retrieval Processing
  6. Index Optimization
  7. Retrieval Post-Processing
  8. Response Generation
  9. System Evaluation
  10. Complex RAG Paradigms

English Inputs for the Practical RAG Manuscript

This directory stores English-language replacement inputs for the Chinese files that are actually referenced by the manuscript's code examples.

The package is self-contained: every replacement path in the table below points to a file or folder inside 99-EN, except for examples that are generated by code or loaded from the web. The original Chinese files under 90-文档-Data are unchanged. The local manuscript PDF is ignored by .gitignore and should not be committed or pushed.

How to Use

When localizing a manuscript code block, replace the original data/... path with the matching English path below. MANIFEST.csv contains the same mapping in CSV form.
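
The path swap described above can be scripted. The following is a minimal sketch, not part of the repository: it assumes MANIFEST.csv has two columns named original_path and english_path (hypothetical names; check the real header before use) and rewrites only quoted string literals that match a mapped path exactly. The helper names load_manifest and localize_paths are illustrative.

```python
import csv
import re

# Matches a single- or double-quoted string literal on one line.
_STRING_LITERAL = re.compile(r"""(["'])(.*?)\1""")


def load_manifest(lines, source_col="original_path", target_col="english_path"):
    """Build {original_path: english_path} from CSV lines.

    `lines` is any iterable of CSV text lines, such as an open file.
    The column names are assumptions; adjust them to the real header.
    """
    return {row[source_col]: row[target_col] for row in csv.DictReader(lines)}


def localize_paths(code, mapping):
    """Swap quoted paths in `code` that exactly match a key in `mapping`."""

    def swap(match):
        quote, literal = match.group(1), match.group(2)
        # Unmapped literals are returned unchanged.
        return f"{quote}{mapping.get(literal, literal)}{quote}"

    return _STRING_LITERAL.sub(swap, code)


# Example with two rows taken from the mapping tables in this README.
manifest_lines = [
    "original_path,english_path",
    "data/黑神话,99-EN/black-myth-wukong/",
    "data/黑神话/黑神话悟空的设定.txt,99-EN/black-myth-wukong/black_myth_wukong_setting.txt",
]
mapping = load_manifest(manifest_lines)
localized = localize_paths('TextLoader("data/黑神话/黑神话悟空的设定.txt")', mapping)
print(localized)
```

Matching whole string literals, rather than substrings, avoids turning an unmapped file under a mapped directory into a half-rewritten path. To use the real file, pass an open handle instead of the inline list, for example open("MANIFEST.csv", newline="", encoding="utf-8").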

Chapter and Program Index

This table maps the manuscript chapters/programs to the English input files. The program names follow the repository's code filenames where the same example exists locally.

| Manuscript chapter / program area | Program or code block | English input file or folder |
| --- | --- | --- |
| Intro / Simple RAG | 00-简单RAG-SimpleRAG/01_01_LlamaIndex_5行代码.py | 99-EN/black-myth-wukong/black_myth_wukong_setting.txt |
| Intro / Simple RAG | 00-简单RAG-SimpleRAG/01_02_LlamaIndex_更换嵌入模型.py | 99-EN/black-myth-wukong/black_myth_wukong_setting.txt |
| Intro / Simple RAG | 00-简单RAG-SimpleRAG/01_03_LlamaIndex_更换生成模型.py | 99-EN/black-myth-wukong/black_myth_wukong_setting.txt |
| Intro / Simple RAG | 00-简单RAG-SimpleRAG/01_04_LlamaIndex_5行代码_DeepSeek.py | 99-EN/black-myth-wukong/black_myth_wukong_setting.txt |
| Intro / Simple RAG | 00-简单RAG-SimpleRAG/01_05_LlamaIndex_5行代码_Ollama.py | 99-EN/black-myth-wukong/black_myth_wukong_setting.txt |
| Intro / Simple RAG | WebBaseLoader examples for the Chinese Black Myth Wikipedia URL | 99-EN/black-myth-wukong/black_myth_wukong_wiki.txt |
| Chapter 1 / TXT loading | 01-数据导入-DataLoading/01-简单文本读取/01-用LangChain读入txt文件.py | 99-EN/black-myth-wukong/black_myth_wukong_setting.txt |
| Chapter 1 / Directory loading | 01-数据导入-DataLoading/01-简单文本读取/03-01-用LangChain加载目录中所有文档.py | 99-EN/black-myth-wukong/ |
| Chapter 1 / Directory loading | 01-数据导入-DataLoading/01-简单文本读取/03-02-用LangChain加载目录时指定参数.py | 99-EN/black-myth-wukong/ |
| Chapter 1 / Directory loading | 01-数据导入-DataLoading/01-简单文本读取/03-03-用LangChain加载目录时更改工具.py | 99-EN/black-myth-wukong/ |
| Chapter 1 / Directory loading | 01-数据导入-DataLoading/01-简单文本读取/03-04-用LangChain加载目录时跳过错误.py | 99-EN/black-myth-wukong/ |
| Chapter 1 / LlamaIndex reader | 01-数据导入-DataLoading/01-简单文本读取/05-用LlamaIndex-加载目录文档.py | 99-EN/black-myth-wukong/ and 99-EN/black-myth-wukong/black_myth_wukong_setting.txt |
| Chapter 1 / Unstructured TXT | 01-数据导入-DataLoading/01-简单文本读取/07-使用Unstructured_v1.py | 99-EN/black-myth-wukong/black_myth_wukong_setting.txt |
| Chapter 1 / Unstructured TXT | 01-数据导入-DataLoading/01-简单文本读取/07-使用Unstructured_v2.py | 99-EN/black-myth-wukong/black_myth_wukong_setting.txt |
| Chapter 1 / JSON as text | 01-数据导入-DataLoading/02-结构化文档读取/01-LangChain-TextLoader-JSON.py | 99-EN/black-myth-wukong/journey_to_the_west_characters.json |
| Chapter 1 / JSONLoader | 01-数据导入-DataLoading/02-结构化文档读取/02-LangCHain-JSONLoader-JSON.py | 99-EN/black-myth-wukong/black_myth_wukong_characters.json |
| Chapter 1 / Markdown loader | 01-数据导入-DataLoading/02-结构化文档读取/04-LangChain-UnstructuredMarkdownLoader.py | 99-EN/black-myth-wukong/black_myth_wukong_versions.md |
| Chapter 1 / Image parsing | 01-数据导入-DataLoading/03-解析图文数据/01-Unstructured读图.py | 99-EN/assets/black-myth-wukong/black_myth_wukong_english.jpg |
| Chapter 1 / PPT parsing | 01-数据导入-DataLoading/03-解析图文数据/02-Unstructured读PPT.py | 99-EN/black-myth-wukong/black_myth_wukong_slides.pptx |
| Chapter 1 / Multimodal PDF-to-image | 01-数据导入-DataLoading/03-解析图文数据/03-大模型读取图文.py | 99-EN/black-myth-wukong/black_myth_wukong_slides.pdf |
| Chapter 1 / PDF parsing | 01-数据导入-DataLoading/04-PDF文件读取/01-使用PyPDF.py | 99-EN/black-myth-wukong/black_myth_wukong_slides.pdf |
| Chapter 1 / PDF parsing | 01-数据导入-DataLoading/04-PDF文件读取/02-使用PyMuPDF.py | 99-EN/black-myth-wukong/black_myth_wukong_slides.pdf |
| Chapter 1 / OCR PDF parsing | 01-数据导入-DataLoading/04-PDF文件读取/03-使用pytesseract+pdf2image.py | 99-EN/black-myth-wukong/black_myth_wukong_slides.pdf |
| Chapter 1 / LlamaParse PDF | 01-数据导入-DataLoading/04-PDF文件读取/04-使用LlamaParser.py | 99-EN/black-myth-wukong/black_myth_wukong_slides.pdf |
| Chapter 1 / Unstructured PDF | 01-数据导入-DataLoading/04-PDF文件读取/06-Unstrctured-使用partition函数解析PDF-v1.py | 99-EN/black-myth-wukong/black_myth_wukong_slides.pdf |
| Chapter 1 / Unstructured PDF | 01-数据导入-DataLoading/04-PDF文件读取/06-Unstrctured-使用partition函数解析PDF-v2.py | 99-EN/black-myth-wukong/black_myth_wukong_slides.pdf |
| Chapter 1 / PDF layout and parent-child parsing | 01-数据导入-DataLoading/04-PDF文件读取/05-LangChain-Unstrucured-PDF-*.py | 99-EN/assets/shanxi-tourism/云冈石窟-en.pdf |
| Chapter 1 / PDF layout and parent-child parsing | 01-数据导入-DataLoading/04-PDF文件读取/09-Parent-Child-Unstructured-*.py | 99-EN/assets/shanxi-tourism/云冈石窟-en.pdf |
| Chapter 1 / CSV loading | 01-数据导入-DataLoading/05-表格数据读取/01-01-导入CSV.py | 99-EN/black-myth-wukong/black_myth_wukong.csv |
| Chapter 2 / Character text splitting | 02-文本切块-DocChunking/01-LangChain-CharacterTextSplitter.py | 99-EN/shanxi-tourism/yungang_grottoes.txt |
| Chapter 2 / Recursive text splitting | 02-文本切块-DocChunking/02-LangChain-RecursiveharacterTextSplitter.py | 99-EN/shanxi-tourism/yungang_grottoes.txt |
| Chapter 2 / Semantic chunking | 02-文本切块-DocChunking/05-LlamaIndex-语义分块.py | 99-EN/black-myth-wukong/black_myth_wukong_wiki.txt |
| Chapter 3 / Recommendation embeddings | 03-向量嵌入-Embedding/01-openai-embedding-recomendation-system.py | 99-EN/journey-of-extinction-husun/user_reviews.csv and 99-EN/journey-of-extinction-husun/game_guide.json |
| Chapter 3 / Jina clustering | 03-向量嵌入-Embedding/02-jina-embeddings-v3-clustering.py | 99-EN/journey-of-extinction-husun/jina_games.csv |
| Chapter 3 / Multimodal embedding | 03-向量嵌入-Embedding/05-多模态嵌入.py | 99-EN/assets/multimodal/query_image.jpg |
| Chapter 4 / Hybrid retrieval | 04-向量存储-VectorDB/混合检索/Milvus+BGE-M3混合检索-v1-极简.py | 99-EN/journey-of-extinction-husun/battle_scenes.json |
| Chapter 4 / Hybrid retrieval | 04-向量存储-VectorDB/混合检索/Milvus+BGE-M3混合检索-v2-细节.py | 99-EN/journey-of-extinction-husun/battle_scenes.json |
| Chapter 4 / Hybrid retrieval | 04-向量存储-VectorDB/混合检索/Milvus+BGE-M3混合检索-v3-重排.py | 99-EN/journey-of-extinction-husun/battle_scenes.json |
| Chapter 4 / Multimodal retrieval | 04-向量存储-VectorDB/多模态检索/Milvus+Visual-BGE多模态检索-*.py | 99-EN/multimodal/metadata.json and 99-EN/assets/multimodal/ |
| Chapter 5 / Query rewriting | 05-检索前处理-PreRetrieval/02-查询翻译/01-查询重写-*.py | 99-EN/journey-of-extinction-husun/setting.txt or 99-EN/black-myth-wukong/black_myth_wukong_setting.txt depending on the manuscript variant |
| Chapter 5 / Query decomposition | 05-检索前处理-PreRetrieval/02-查询翻译/02-查询分解-*.py | 99-EN/journey-of-extinction-husun/setting.txt or 99-EN/black-myth-wukong/black_myth_wukong_setting.txt depending on the manuscript variant |
| Chapter 5 / HyDE query expansion | 05-检索前处理-PreRetrieval/02-查询翻译/04-查询扩展-HyDE假设文档生成.py | 99-EN/black-myth-wukong/black_myth_wukong_wiki.txt |
| Chapter 5 / Text2SQL | 05-检索前处理-PreRetrieval/01-查询构建/Text2SQL/01-Text2SQL-创建数据库表.py | No package file; the code creates data/tourism.db |
| Chapter 7 / RRF reranking | 07-检索后处理-PostRetrieval/01-重排/01-RRF重排.py | 99-EN/shanxi-tourism/ and 99-EN/assets/shanxi-tourism/ |
| Chapter 7 / RankLLM reranking | 07-检索后处理-PostRetrieval/01-重排/05-RankLLM重排.py | 99-EN/shanxi-tourism/yungang_grottoes.txt |
| Chapter 7 / Sentence optimizer compression | 07-检索后处理-PostRetrieval/02-压缩/03-SentenceEmbeddingOptimizer压缩.py | 99-EN/shanxi-tourism/ and 99-EN/assets/shanxi-tourism/ |
| Chapter 8 / Prompt template generation | 08-响应生成-Generation/02-通过提示词优化响应/01-使用提示模板明确生成目标.py | 99-EN/black-myth-wukong/black_myth_wukong_setting.txt |
| Chapter 8 / LlamaIndex output parsing | 08-响应生成-Generation/03-通过输出解析控制格式/02-LlamaIndex输出解析.py | 99-EN/black-myth-wukong/black_myth_wukong_wiki.txt |
| Chapter 10 / Weaviate multimodal search | 10-高级RAG-AdvanceRAG/05-MultiModalRAG/01-Weaviate-Multimodal-Search.py | 99-EN/assets/multimodal/weaviate/ and 99-EN/assets/multimodal/query_image.jpg |

Detailed Replacement Mapping

| Manuscript program or loader call | English input to use | Notes |
| --- | --- | --- |
| SimpleDirectoryReader("data/黑神话").load_data() | 99-EN/black-myth-wukong/ | English replacement directory for the Black Myth examples. |
| DirectoryLoader("./data/黑神话") | 99-EN/black-myth-wukong/ | Use the directory for examples that load all local Black Myth documents. |
| DirectoryLoader("data/黑神话", loader_cls=TextLoader) | 99-EN/black-myth-wukong/ | Text-readable replacements are provided as .txt, .md, .csv, and .json. |
| WebBaseLoader(web_paths=("https://zh.wikipedia.org/wiki/黑神话:悟空",)) | 99-EN/black-myth-wukong/black_myth_wukong_wiki.txt | Offline English substitute for the Chinese Wikipedia loading examples. For live web loading, use the English Wikipedia page instead. |
| TextLoader("data/黑神话/黑神话悟空的设定.txt") | 99-EN/black-myth-wukong/black_myth_wukong_setting.txt | Chapter 1 TXT loading example. |
| SimpleDirectoryReader(input_files=["data/黑神话/黑神话悟空的设定.txt"]) | 99-EN/black-myth-wukong/black_myth_wukong_setting.txt | LlamaIndex single-file loading example. |
| text = "data/黑神话/黑神话悟空的设定.txt" | 99-EN/black-myth-wukong/black_myth_wukong_setting.txt | Unstructured text parsing example. |
| TextLoader("data/西游记人物角色.json") | 99-EN/black-myth-wukong/journey_to_the_west_characters.json | JSON-as-text example. |
| JSONLoader(file_path="data/黑神话/黑神话人物角色.json", ...) | 99-EN/black-myth-wukong/black_myth_wukong_characters.json | Structured JSON loading example. |
| image_path = "data/黑神话/黑神话英文.jpg" | 99-EN/assets/black-myth-wukong/black_myth_wukong_english.jpg | Bundled copy of the existing English image asset. |
| partition_ppt(filename="data/黑神话悟空 PPT.pptx") | 99-EN/black-myth-wukong/black_myth_wukong_slides.pptx | Lightweight English PPT replacement generated for the book example. |
| partition_ppt(filename="data/黑神话/黑神话悟空 PPT.pptx") | 99-EN/black-myth-wukong/black_myth_wukong_slides.pptx | Same PPT example with a directory prefix. |
| filename = "data/黑神话/黑神话悟空.pdf" | 99-EN/black-myth-wukong/black_myth_wukong_slides.pdf | Lightweight English PDF replacement generated for PDF parsing examples. |
| fitz.open("data/黑神话/黑神话悟空.pdf") | 99-EN/black-myth-wukong/black_myth_wukong_slides.pdf | PyMuPDF example. |
| convert_from_path("data/黑神话/黑神话悟空.pdf") | 99-EN/black-myth-wukong/black_myth_wukong_slides.pdf | PDF-to-image example. |
| file_path = "data/黑神话/黑神话悟空.csv" | 99-EN/black-myth-wukong/black_myth_wukong.csv | CSV loading example. |
| DirectoryLoader(path="data/黑神话", glob="**/*.csv") | 99-EN/black-myth-wukong/black_myth_wukong.csv | CSV directory loading example. |
| markdown_path = "data/黑神话/黑神话版本介绍.md" | 99-EN/black-myth-wukong/black_myth_wukong_versions.md | Markdown loader example. |
| markdown_path = "data/黑神话/黑悟空版本介绍.md" | 99-EN/black-myth-wukong/black_myth_wukong_versions.md | Same Markdown example with the repository filename variant. |
| marker_single "data/山西文旅/云冈石窟-en.pdf" | 99-EN/assets/shanxi-tourism/云冈石窟-en.pdf | Bundled copy of the existing English PDF. |
| file_path = ("data/山西文旅/云冈石窟-en.pdf") | 99-EN/assets/shanxi-tourism/云冈石窟-en.pdf | PDF structure extraction example. |
| TextLoader("data/山西文旅/云冈石窟.txt") | 99-EN/shanxi-tourism/yungang_grottoes.txt | Shanxi tourism TXT loading example. |
| SimpleDirectoryReader(input_files=["data/山西文旅/云冈石窟.txt"]) | 99-EN/shanxi-tourism/yungang_grottoes.txt | LlamaIndex text splitting example. |
| doc_dir = "./data/山西文旅" | 99-EN/shanxi-tourism/ and 99-EN/assets/shanxi-tourism/ | Directory retrieval examples load both TXT and PDF files. |
| SimpleDirectoryReader("data/山西文旅").load_data() | 99-EN/shanxi-tourism/ and 99-EN/assets/shanxi-tourism/ | Sentence optimization and generation examples. |
| SimpleDirectoryReader(input_files=["data/灭神纪/灭神纪设定.txt"]) | 99-EN/journey-of-extinction-husun/setting.txt | Context/index example. |
| TextLoader("data/灭神纪/设定.txt", encoding="utf-8") | 99-EN/journey-of-extinction-husun/setting.txt | Query rewrite and decomposition examples. |
| TextLoader("data/灭神纪/情节片段.txt", encoding="utf-8") | 99-EN/journey-of-extinction-husun/plot_fragments.txt | Manuscript references this file, but no tracked Chinese source was found, so an English substitute is supplied. |
| SimpleDirectoryReader("data/灭神纪").load_data() | 99-EN/journey-of-extinction-husun/ | Directory loading example. |
| SimpleDirectoryReader("data/灭神纪").load_data() when the directory includes 人物角色.json | 99-EN/journey-of-extinction-husun/characters.json | English character JSON included for directory-level loading. |
| pd.read_csv("data/灭神纪/用户评价.csv") | 99-EN/journey-of-extinction-husun/user_reviews.csv | Embedding recommendation example. |
| open("data/灭神纪/游戏说明.json", "r") | 99-EN/journey-of-extinction-husun/game_guide.json | Embedding recommendation example. |
| pd.read_csv("data/灭神纪/Jina游戏.csv") | 99-EN/journey-of-extinction-husun/jina_games.csv | Source repo uses 游戏描述.csv; this is the English replacement for the manuscript's Jina clustering example. |
| open("data/灭神纪/战斗场景.json", encoding="utf-8") | 99-EN/journey-of-extinction-husun/battle_scenes.json | Hybrid retrieval example. |
| WukongDataset("data/多模态", "data/多模态/metadata.json") | 99-EN/multimodal/metadata.json | English metadata for the multimodal retrieval examples. |
| query_image = "data/多模态/query_image.jpg" | 99-EN/assets/multimodal/query_image.jpg | Bundled copy of the existing query image. |
| image_dir = "data/多模态/Weaviate" | 99-EN/assets/multimodal/weaviate/ | Bundled image directory for Weaviate multimodal search examples. |
| sqlite3.connect("data/tourism.db") | Not included | The SQLite database is created by manuscript code rather than maintained as a Chinese input document. |
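
Several rows above map one original directory to two English locations; for example, data/山西文旅 becomes 99-EN/shanxi-tourism/ plus 99-EN/assets/shanxi-tourism/. One way to handle this, sketched below under the assumption that the loader in use accepts an explicit file list, is to gather the files from both folders first. The helper collect_input_files and its default suffix filter are illustrative and not part of the repository.

```python
from pathlib import Path


def collect_input_files(directories, suffixes=(".txt", ".pdf")):
    """Return file paths from several directories as one flat list.

    Directories are visited in the order given; files inside each
    directory are sorted by name. Only files whose extension is in
    `suffixes` are kept (the default filter is an assumption).
    """
    files = []
    for directory in directories:
        for path in sorted(Path(directory).iterdir()):
            if path.is_file() and path.suffix.lower() in suffixes:
                files.append(str(path))
    return files
```

The resulting list can be passed to a loader form that takes explicit files, such as the SimpleDirectoryReader(input_files=[...]) call shown in the table, by calling collect_input_files(["99-EN/shanxi-tourism", "99-EN/assets/shanxi-tourism"]) from the folder that contains 99-EN.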

Bundled Existing English Assets

These files were already English or did not require text localization, but they are copied into this package so editors can work from the zip alone:

  • 99-EN/assets/shanxi-tourism/云冈石窟-en.pdf
  • 99-EN/assets/shanxi-tourism/五台山-en.pdf
  • 99-EN/assets/shanxi-tourism/佛光寺-en.pdf
  • 99-EN/assets/shanxi-tourism/壶口瀑布-en.pdf
  • 99-EN/assets/shanxi-tourism/山西-en.pdf
  • 99-EN/assets/shanxi-tourism/平遥古城-en.pdf
  • 99-EN/assets/shanxi-tourism/悬空寺-en.pdf
  • 99-EN/assets/shanxi-tourism/晋祠-en.pdf
  • 99-EN/assets/black-myth-wukong/black_myth_wukong_english.jpg
  • 99-EN/assets/multimodal/01.jpg through 99-EN/assets/multimodal/09.jpg
  • 99-EN/assets/multimodal/query_image.jpg
  • 99-EN/assets/multimodal/weaviate/wukong_demon_fight.jpg
  • 99-EN/assets/multimodal/weaviate/wukong_fire_attack.jpg
  • 99-EN/assets/multimodal/weaviate/wukong_vs_white_bone_spirit.jpg

Coverage Notes

All local manuscript data paths found in the extracted code-oriented references are mapped above. The only intentional exceptions are:

  • Live web examples, such as WebBaseLoader(...), where black_myth_wukong_wiki.txt is provided as an offline English substitute.
  • Runtime-generated files, such as data/tourism.db, which are created by code rather than maintained as manuscript input files.
  • Missing source assets: the source multimodal metadata refers to data/多模态/10.jpg, but the tracked repository contains only 01.jpg through 09.jpg plus query_image.jpg. The English metadata keeps all referenced image paths valid inside this package.
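
The coverage claim can be spot-checked mechanically. As a hedged sketch (the helper name find_missing_inputs and the choice to run it from the folder that contains 99-EN are assumptions, not part of the package), the function below reports which listed English paths are absent on disk.

```python
from pathlib import Path


def find_missing_inputs(english_paths, package_root="."):
    """Return the entries of `english_paths` that do not exist on disk.

    Each entry is resolved relative to `package_root`, so entries such as
    "99-EN/black-myth-wukong/" work when the root is the parent of 99-EN.
    Both files and directories count as present.
    """
    root = Path(package_root)
    return [entry for entry in english_paths if not (root / entry).exists()]
```

An empty result means every listed file or folder was found; entries that are generated at runtime, such as data/tourism.db, should be left out of the list being checked.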

Get to Know the Author

Jia Huang is a Lead Research Engineer at A*STAR (Agency for Science, Technology and Research), Singapore, where his work focuses on NLP, large language models, and applied AI engineering. With over twenty years of experience leading large-scale AI and data projects across government, finance, healthcare, and e-commerce, he brings an unusually practical lens to technically rigorous subjects. In recent years, his research has primarily focused on NLP pre-trained large models and FinTech applications. He is the author of six bestselling technical books, including Hands-on AI Agent Development for Large Model Applications, selected as one of JD Best Books of 2024, and GPT: How Large Models Are Built, named CSDN's Most Influential IT Book of 2023. His online RAG engineering course has been completed by over 10,000 students.
