Commit 38ca03f

Main commit
1 parent 5a9ff28 commit 38ca03f

12 files changed (+273, -1 lines changed)

.gitignore

+2
@@ -0,0 +1,2 @@
1+
db/
2+
.DS_Store

README.md

+210-1
@@ -1 +1,210 @@
1-
# chromaHackpack
1+
# Chroma Hackpack
2+
3+
# Overview
4+
5+
Chroma is a vector database. In this hackpack, we'll use it to implement
6+
retrieval augmented generation (RAG) – a technique for enhancing large language
7+
model (LLM) informational capabilities. Specifically, we'll build a chat bot to
8+
answer logistical questions about TreeHacks 2024.
9+
10+
## Motivation
11+
12+
LLMs like ChatGPT are capable
13+
of solving sophisticated tasks. However, their knowledge of current events and
14+
new information is often limited by training cut-off dates. Moreover, LLMs can
15+
exhibit hallucinatory behavior. In other words, LLMs have strong reasoning
16+
abilities but they often need the appropriate facts to reason with.
17+
18+
Retrieval-augmented generation is a technique for **retrieving** information and
19+
then providing it to the LLM to **augment** the content it next **generates**.
20+
This helps mitigate hallucination and supplements the LLM's existing knowledge
21+
with facts of the developer's choice.
22+
23+
In the case of our application, we'll use RAG to provide our chat bot with up-to-date
24+
information regarding TreeHacks 2024.
25+
26+
## How does RAG work?
27+
28+
Typically, an LLM responds directly to a user's query. Retrieval-augmented
29+
generation instead adds relevant facts to the query before the LLM sees it.
30+
31+
First, we select several documents containing information relevant to TreeHacks
32+
logistics. We then calculate embeddings for the document contents. Embeddings
33+
are vectors that represent the semantics of a given string. If two vector
34+
embeddings are similar, then we know the semantics of the two respective strings
35+
are also similar. These embeddings are all loaded into our vector database,
36+
Chroma.
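
To make the idea of "similar embeddings" concrete, here is a minimal sketch using cosine similarity, one common way to compare vectors. The three vectors below are made up for illustration; real embeddings have hundreds or thousands of dimensions, and the distance metric a given vector database uses may differ.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up 3-dimensional "embeddings" (assumption: purely illustrative values).
checkin = [0.9, 0.1, 0.0]
registration = [0.8, 0.2, 0.1]
pizza = [0.0, 0.2, 0.9]

sim_close = cosine_similarity(checkin, registration)  # semantically related
sim_far = cosine_similarity(checkin, pizza)           # semantically unrelated
print(sim_close > sim_far)
```

A higher score means the two vectors point in a more similar direction, which we interpret as the two strings having more similar meaning.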
37+
38+
Once our vector database is populated, we can begin querying it. When the user
39+
prompts our chat bot, the following occurs:
40+
41+
(1) Take in user input.
42+
(2) Pass the input's embedding into a vector database. Retrieve the `k` most
43+
similar vectors and their associated strings. Each of these strings represents
44+
the information that is most relevant to the user's query.
45+
(3) Pass the user's original input along with the information from the vector
46+
database into the LLM.
47+
(4) Return the LLM's output.
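
The four steps above can be sketched in plain Python. This is a toy illustration, not how Chroma or `langchain` implement it: `embed` is a hypothetical bag-of-words stand-in for a real embedding model, the stored strings are made-up placeholders, and `echo_llm` is a placeholder that simply returns its prompt.

```python
import math
from collections import Counter

def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())

def similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, store, k):
    """Step 2: return the k stored strings most similar to the query."""
    q = embed(query)
    return sorted(store, key=lambda doc: similarity(q, embed(doc)), reverse=True)[:k]

def build_prompt(query, context):
    """Step 3: combine the retrieved strings with the user's original input."""
    facts = "\n".join("- " + c for c in context)
    return "Use these facts:\n" + facts + "\n\nQuestion: " + query

def answer(query, store, llm, k=1):
    """Steps 1-4 end to end; `llm` is any callable mapping a prompt to text."""
    return llm(build_prompt(query, retrieve(query, store, k)))

# Made-up placeholder strings, not real event information.
store = [
    "check-in opens at the main entrance",
    "dinner is served in the dining hall",
    "the wifi password is posted at the help desk",
]
echo_llm = lambda prompt: prompt  # placeholder "LLM" that returns its prompt
result = answer("where is dinner served", store, echo_llm)
print(result)
```

In the real project, the embedding model, the similarity search, and the LLM call are each handled by a library, but the data flow is the same.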
48+
49+
![RAG with Chroma diagram.](https://docs.trychroma.com/img/hrm4.svg)
50+
51+
This framework is simple, but powerful. There are several ways to introduce
52+
additional sophistication into RAG, but for the purpose of this hackpack we'll
53+
focus on the basics.
54+
55+
# Project Walkthrough
56+
57+
## Step 0: Installing Dependencies
58+
59+
Ensure you have Python 3+ installed on your computer.
60+
61+
Download this repository and run `pip install -r requirements.txt`.
62+
63+
## Step 1: Setting up Chroma
64+
65+
Before we can process user queries, we must populate a Chroma vector database
66+
with embeddings.
67+
68+
### Pre-processing
69+
70+
We will use the `langchain` library to load and pre-process our data.
71+
72+
```python
73+
from langchain_community.document_loaders import DirectoryLoader
74+
from langchain.text_splitter import RecursiveCharacterTextSplitter
75+
```
76+
77+
First, we'll use the `DirectoryLoader` to load all the files from our
78+
`documents` folder. Then, we'll use the `RecursiveCharacterTextSplitter` to
79+
break each document down into a series of strings. Each string will have its own
80+
embedding and thus can be independently queried.
81+
82+
```python
83+
loader = DirectoryLoader('./documents')
84+
documents = loader.load()
85+
86+
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
87+
texts = text_splitter.split_documents(documents)
88+
```
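
To build intuition for `chunk_size` and `chunk_overlap`, here is a simplified sketch of fixed-size chunking. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that splitter tries to break on separators such as paragraph and line breaks); `split_fixed` is a hypothetical helper that only shows how the two parameters interact.

```python
def split_fixed(text, chunk_size, chunk_overlap=0):
    """Cut text into windows of chunk_size characters.

    Consecutive windows share chunk_overlap characters.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

no_overlap = split_fixed("abcdefghij", chunk_size=4)
with_overlap = split_fixed("abcdefghij", chunk_size=4, chunk_overlap=2)
print(no_overlap)
print(with_overlap)
```

Overlap repeats some text across neighboring chunks, which can help when a relevant sentence straddles a chunk boundary, at the cost of storing more embeddings.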
89+
90+
You'll notice that the `documents` folder comes pre-populated with
91+
TreeHacks-related documents. You may replace these documents if you'd like your RAG LLM to be
92+
fed a different set of information.
93+
94+
### Loading into Chroma
95+
96+
Setting up our Chroma database is very easy.
97+
98+
```python
99+
from langchain_community.vectorstores import Chroma
100+
from langchain_openai import OpenAIEmbeddings
101+
```
102+
103+
We will use `OpenAIEmbeddings` to embed our texts. However, we don't need to
104+
manually do this – Chroma will handle it. We simply declare our Chroma database
105+
with the texts and the embedding function.
106+
107+
```python
108+
embeddings = OpenAIEmbeddings()
109+
vectordb = Chroma.from_documents(documents=texts, embedding=embeddings, persist_directory='db')
110+
```
111+
112+
This will automatically produce a Chroma vector database containing all the text
113+
documents and their vector embeddings. If you'd like to use a [different
114+
embedding function](https://python.langchain.com/docs/integrations/text_embedding), you can easily replace it.
115+
116+
Before running this code, you will need to set up your API key. Use this
117+
[tutorial](https://platform.openai.com/docs/quickstart/step-2-set-up-your-api-key)
118+
provided by OpenAI.
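
If you'd like the script to fail early with a clear message when the key is missing, a small check like the following may help. This is a sketch under the assumption that the key is supplied through the `OPENAI_API_KEY` environment variable; `require_api_key` is a hypothetical helper, not part of any library.

```python
import os

def require_api_key(name="OPENAI_API_KEY", env=os.environ):
    """Return the API key stored under `name`, or raise if it is missing or empty."""
    key = env.get(name)
    if not key:
        raise RuntimeError(
            name + " is not set. Export it in your shell before running this script."
        )
    return key

# Typical use near the top of the script:
#   require_api_key()
```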
119+
120+
Moreover, notice that we pass a value for `persist_directory`. This tells Chroma
121+
to locally save the vector database to the folder `db`. By doing so, we can
122+
simply load in the vector database next time we run our program. This allows us
123+
to avoid recomputing the embeddings for all the documents.
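
One possible way to structure the "load if saved, otherwise build" decision is sketched below. `load_or_build` is a hypothetical helper; the `load` and `build` arguments are callables you would supply yourself, for example a function that opens the saved `db` folder and a function that calls `Chroma.from_documents` as shown above. Treating "the folder exists and is not empty" as "a database was saved" is an assumption of this sketch.

```python
import os
import tempfile

def load_or_build(persist_directory, load, build):
    """Call load() if persist_directory already holds data, else call build()."""
    if os.path.isdir(persist_directory) and os.listdir(persist_directory):
        return load()
    return build()

# Demonstration with stand-in callables instead of a real vector database.
with tempfile.TemporaryDirectory() as folder:
    # The folder starts empty, so the database would be built.
    first = load_or_build(folder, load=lambda: "loaded", build=lambda: "built")
    # Simulate a previous run having saved something into the folder.
    open(os.path.join(folder, "marker"), "w").close()
    second = load_or_build(folder, load=lambda: "loaded", build=lambda: "built")

print(first, second)
```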
124+
125+
## Step 2: Running Queries
126+
127+
Now that we've configured our Chroma database, we'd like to query it for the
128+
purpose of RAG. `langchain` gives us a pre-packaged object to do this.
129+
130+
```python
131+
from langchain_openai import OpenAI
132+
from langchain.chains import VectorDBQA
133+
```
134+
135+
The `VectorDBQA` chain automatically coordinates interactions between our LLM and
136+
Chroma vector database. We can declare it easily.
137+
138+
```python
139+
qa = VectorDBQA.from_chain_type(llm=OpenAI(), chain_type="stuff", vectorstore=vectordb)
140+
```
141+
142+
To ask it a question, we can simply call the `invoke` method.
143+
144+
```python
145+
query = "When does the hackathon start?"
146+
out = qa.invoke(query)
147+
print(out["result"])  # invoke returns a dict; the answer is under "result"
148+
```
149+
150+
This outputs: `The hackathon starts on Friday, February 16th at 9pm.`
151+
152+
(You may see slightly different output when running this script, since the LLM's output is
153+
non-deterministic. The content should be similar, however.)
154+
155+
## Step 3 (optional): User Interface
156+
157+
We have produced a minimal working instance of RAG! However, most users will
158+
probably want a friendlier user interface for development or everyday
159+
use.
160+
161+
To achieve this, we can use the `gradio` library.
162+
163+
```python
164+
import gradio as gr
165+
```
166+
167+
Gradio gives us a convenient chatbot template for which we only need to define the response
168+
logic. Let us first declare our chatbot response function.
169+
170+
```python
171+
def response(message, history):
172+
    h = ''
173+
174+
    for d in history:
175+
        h += 'User message: \'' + d[0] + '\', '
176+
        h += 'Bot message: \'' + d[1] + '\' \n'
177+
178+
    m = 'You are a chatbot meant to answer participant questions about TreeHacks, a hackathon. Here is the prior message history: \n' + h + '\nHere is the message you have just been given: ' + message
179+
    yield qa.run(m)
180+
```
181+
182+
This function accepts two variables: the most recent message from the user and a
183+
history of previous messages. We format the chat history into a single string
184+
such that our chatbot is always aware of the conversation's whole context.
185+
Although re-formatting this string on every function call is certainly not the most
186+
elegant or efficient approach, it will suffice for our proof-of-concept.
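
If you'd like something a little tidier, the history formatting could be pulled into its own function and built with `join`. This sketch assumes `history` is a list of `[user_message, bot_message]` pairs, as in the function above; some Gradio versions may pass history in a different shape. `format_history` and `build_message` are hypothetical helper names.

```python
def format_history(history):
    """Render a list of [user_message, bot_message] pairs as one string."""
    lines = [
        "User message: '" + user + "', Bot message: '" + bot + "'"
        for user, bot in history
    ]
    return "\n".join(lines)

def build_message(message, history):
    """Combine the chatbot's instructions, the prior history, and the new message."""
    return (
        "You are a chatbot meant to answer participant questions about "
        "TreeHacks, a hackathon. Here is the prior message history:\n"
        + format_history(history)
        + "\nHere is the message you have just been given: "
        + message
    )

example = build_message("Thanks!", [["Hi", "Hello! How can I help?"]])
print(example)
```

Keeping the formatting separate from the call to the chain makes it easy to inspect exactly what text the chatbot receives.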
187+
188+
Notice that we also use this formatting step to provide additional context
189+
regarding the chatbot's purpose. This is a simple technique for focusing the
190+
chatbot's responses.
191+
192+
To start our user interface, we can run the following line.
193+
194+
```python
195+
gr.ChatInterface(response).launch()
196+
```
197+
198+
You should see a local URL printed in the terminal. Use this to access the Gradio chat
199+
interface.
200+
201+
# Thanks
202+
Thanks to everyone at Chroma for supporting this hackpack and TreeHacks!
203+
204+
# Additional Resources
205+
- This hackpack is heavily derived from Harrison Chase's
206+
[chroma-langchain](https://github.com/hwchase17/chroma-langchain) demo repo.
207+
Please check it out!
208+
- Chroma has a variety of integrations and features, including multi-modal
209+
capabilities. Check out their [documentation](https://docs.trychroma.com/) to
210+
learn more.
+2
@@ -0,0 +1,2 @@
1+
<meta http-equiv="Content-Type" content="text/html; charset=utf-8"><link type="text/css" rel="stylesheet" href="resources/sheet.css" >
2+
<style type="text/css">.ritz .waffle a { color: inherit; }.ritz .waffle .s5{background-color:#d0e0e3;text-align:center;color:#000000;font-family:'docs-Calibri',Arial;font-size:12pt;vertical-align:middle;white-space:nowrap;direction:ltr;padding:0px 3px 0px 3px;}.ritz .waffle .s3{background-color:#f4cccc;text-align:center;color:#000000;font-family:'docs-Calibri',Arial;font-size:12pt;vertical-align:middle;white-space:nowrap;direction:ltr;padding:0px 3px 0px 3px;}.ritz .waffle .s6{background-color:#d9ead3;text-align:center;color:#000000;font-family:'docs-Calibri',Arial;font-size:12pt;vertical-align:middle;white-space:nowrap;direction:ltr;padding:0px 3px 0px 3px;}.ritz .waffle .s7{background-color:#fff2cc;text-align:center;color:#000000;font-family:'docs-Calibri',Arial;font-size:12pt;vertical-align:middle;white-space:nowrap;direction:ltr;padding:0px 3px 0px 3px;}.ritz .waffle .s2{background-color:#d5a6bd;text-align:center;color:#000000;font-family:'docs-Calibri',Arial;font-size:12pt;vertical-align:bottom;white-space:nowrap;direction:ltr;padding:0px 3px 0px 3px;}.ritz .waffle .s0{background-color:#ffffff;text-align:center;font-weight:bold;color:#000000;font-family:'docs-Calibri',Arial;font-size:12pt;vertical-align:middle;white-space:normal;overflow:hidden;word-wrap:break-word;direction:ltr;padding:0px 3px 0px 3px;}.ritz .waffle .s1{background-color:#ffffff;text-align:center;font-weight:bold;color:#000000;font-family:'docs-Calibri',Arial;font-size:12pt;vertical-align:bottom;white-space:nowrap;direction:ltr;padding:0px 3px 0px 3px;}.ritz .waffle .s4{background-color:#d0e0e3;text-align:center;color:#000000;font-family:'docs-Calibri',Arial;font-size:12pt;vertical-align:bottom;white-space:nowrap;direction:ltr;padding:0px 3px 0px 3px;}</style><div class="ritz grid-container" dir="ltr"><table class="waffle" cellspacing="0" cellpadding="0"><thead><tr><th class="row-header freezebar-origin-ltr"></th><th id="1872506343C0" style="width:99px;" 
class="column-headers-background">A</th><th id="1872506343C1" style="width:232px;" class="column-headers-background">B</th><th id="1872506343C2" style="width:226px;" class="column-headers-background">C</th></tr></thead><tbody><tr style="height: 19px"><th id="1872506343R0" style="height: 19px;" class="row-headers-background"><div class="row-header-wrapper" style="line-height: 19px">1</div></th><td class="s0" dir="ltr">Time</td><td class="s1" dir="ltr" colspan="2">Event</td></tr><tr style="height: 19px"><th id="1872506343R1" style="height: 19px;" class="row-headers-background"><div class="row-header-wrapper" style="line-height: 19px">2</div></th><td class="s0" dir="ltr">1:30 PM</td><td class="s2" dir="ltr" colspan="2">Closing Ceremony Video</td></tr><tr style="height: 19px"><th id="1872506343R2" style="height: 19px;" class="row-headers-background"><div class="row-header-wrapper" style="line-height: 19px">3</div></th><td class="s0" dir="ltr">1:35 PM</td><td class="s3" dir="ltr" colspan="2" rowspan="2">Welcome Hackers</td></tr><tr style="height: 19px"><th id="1872506343R3" style="height: 19px;" class="row-headers-background"><div class="row-header-wrapper" style="line-height: 19px">4</div></th><td class="s0" dir="ltr">1:40 PM</td></tr><tr style="height: 19px"><th id="1872506343R4" style="height: 19px;" class="row-headers-background"><div class="row-header-wrapper" style="line-height: 19px">5</div></th><td class="s0" dir="ltr">1:45 PM</td><td class="s4" dir="ltr" colspan="2">Oak Sponsor Prize Presentation</td></tr><tr style="height: 19px"><th id="1872506343R5" style="height: 19px;" class="row-headers-background"><div class="row-header-wrapper" style="line-height: 19px">6</div></th><td class="s0" dir="ltr">1:50 PM</td><td class="s4" dir="ltr" colspan="2">Willow Sponsors Prize Presentation</td></tr><tr style="height: 19px"><th id="1872506343R6" style="height: 19px;" class="row-headers-background"><div class="row-header-wrapper" style="line-height: 19px">7</div></th><td 
class="s0" dir="ltr">1:55 PM</td><td class="s4" dir="ltr" colspan="2">Redwood Sponsor Prize Presentation</td></tr><tr style="height: 19px"><th id="1872506343R7" style="height: 19px;" class="row-headers-background"><div class="row-header-wrapper" style="line-height: 19px">8</div></th><td class="s0" dir="ltr">2:00 PM</td><td class="s5" dir="ltr" colspan="2" rowspan="2">Cedar Sponsor Prize Presentation</td></tr><tr style="height: 19px"><th id="1872506343R8" style="height: 19px;" class="row-headers-background"><div class="row-header-wrapper" style="line-height: 19px">9</div></th><td class="s0" dir="ltr">2:05 PM</td></tr><tr style="height: 19px"><th id="1872506343R9" style="height: 19px;" class="row-headers-background"><div class="row-header-wrapper" style="line-height: 19px">10</div></th><td class="s0" dir="ltr">2:10 PM</td><td class="s5" dir="ltr" colspan="2" rowspan="2">TreeHacks Prize Presentation</td></tr><tr style="height: 19px"><th id="1872506343R10" style="height: 19px;" class="row-headers-background"><div class="row-header-wrapper" style="line-height: 19px">11</div></th><td class="s0" dir="ltr">2:15 PM</td></tr><tr style="height: 19px"><th id="1872506343R11" style="height: 19px;" class="row-headers-background"><div class="row-header-wrapper" style="line-height: 19px">12</div></th><td class="s0" dir="ltr">2:20 PM</td><td class="s6" dir="ltr" colspan="2">Hacker Presentation: Moonshot Prize</td></tr><tr style="height: 19px"><th id="1872506343R12" style="height: 19px;" class="row-headers-background"><div class="row-header-wrapper" style="line-height: 19px">13</div></th><td class="s0" dir="ltr">2:25 PM</td><td class="s3" dir="ltr" colspan="2" rowspan="2">End The Event</td></tr><tr style="height: 19px"><th id="1872506343R13" style="height: 19px;" class="row-headers-background"><div class="row-header-wrapper" style="line-height: 19px">14</div></th><td class="s0" dir="ltr">2:30 PM</td></tr><tr style="height: 19px"><th id="1872506343R14" style="height: 19px;" 
class="row-headers-background"><div class="row-header-wrapper" style="line-height: 19px">15</div></th><td class="s0" dir="ltr">2:35 PM</td><td class="s7" dir="ltr" colspan="2" rowspan="6">Prize Collection</td></tr><tr style="height: 19px"><th id="1872506343R15" style="height: 19px;" class="row-headers-background"><div class="row-header-wrapper" style="line-height: 19px">16</div></th><td class="s0" dir="ltr">2:40 PM</td></tr><tr style="height: 19px"><th id="1872506343R16" style="height: 19px;" class="row-headers-background"><div class="row-header-wrapper" style="line-height: 19px">17</div></th><td class="s0" dir="ltr">2:45 PM</td></tr><tr style="height: 19px"><th id="1872506343R17" style="height: 19px;" class="row-headers-background"><div class="row-header-wrapper" style="line-height: 19px">18</div></th><td class="s0" dir="ltr">2:50 PM</td></tr><tr style="height: 19px"><th id="1872506343R18" style="height: 19px;" class="row-headers-background"><div class="row-header-wrapper" style="line-height: 19px">19</div></th><td class="s0" dir="ltr">2:55 PM</td></tr><tr style="height: 19px"><th id="1872506343R19" style="height: 19px;" class="row-headers-background"><div class="row-header-wrapper" style="line-height: 19px">20</div></th><td class="s0" dir="ltr">3:00 PM</td></tr></tbody></table></div>

documents/Friday Schedule.html

+2
Large diffs are not rendered by default.
