Commit 445bf43

feat: add q&a bot
1 parent baef687 commit 445bf43

9 files changed, +1256 -1 lines changed

06-qa-bot/F1_QA_Assistant.ipynb

+944
Large diffs are not rendered by default.

06-qa-bot/README.md

+29
@@ -0,0 +1,29 @@
# Q&A Bot

A dynamic Q&A Bot using GPT-4.

<p align="center">
<img src="screenshot.png">
</p>

## Setup

You need to create a virtual env and install the packages listed in `requirements.txt`. You can then run Jupyter Notebooks in VS Code.

Follow these steps: [How to Work with Python Virtual Environments, Jupyter Notebooks and VS Code](https://python.plainenglish.io/how-to-work-with-python-virtual-environments-jupyter-notebooks-and-vs-code-536fac3d93a1).

You need to create a `.env` file with your `OPENAI_API_KEY`.

## Usage

Open `F1_QA_Assistant.ipynb`.

## Features

- scraping data from Wikipedia.
- generating embeddings for the 2022 Formula One season.
- turning the questions from users into embeddings.
- finding the K nearest neighbors to that embedding.
- including the matching texts in the prompt to expand GPT-4 knowledge.

Based on [Mastering OpenAI Python APIs: Unleash the Power of GPT4](https://www.udemy.com/course/mastering-openai/) by Colt Steele (2023).
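The last step in the feature list, including the matching texts in the prompt, might look roughly like the sketch below. It is not taken from the notebook (whose diff is not rendered here): `build_messages`, the system prompt wording, and the character budget are all hypothetical and only illustrate the idea. The section texts are placeholders, not real article content.

```python
from typing import Dict, List, Tuple


def build_messages(
    question: str,
    neighbors: List[Tuple[str, float]],
    max_context_chars: int = 2000,
) -> List[Dict[str, str]]:
    """Assemble chat messages that place retrieved texts ahead of the question.

    `neighbors` holds (text, similarity) pairs, most similar first.
    The character budget is a crude stand-in for a real token budget.
    """
    context_parts: List[str] = []
    used = 0
    for text, _score in neighbors:
        if used + len(text) > max_context_chars:
            break  # stop once the next text would exceed the budget
        context_parts.append(text)
        used += len(text)

    context = "\n\n".join(context_parts)
    return [
        {"role": "system", "content": "Answer using only the context below.\n\n" + context},
        {"role": "user", "content": question},
    ]


# Placeholder (text, similarity) pairs standing in for retrieved sections.
neighbors = [
    ("2022 Abu Dhabi Grand Prix: (text of the best-matching section)", 0.91),
    ("2022 Formula One World Championship: (text of another section)", 0.87),
]
messages = build_messages("Who won the last race of 2022?", neighbors)
```

The resulting `messages` list has the shape expected by a chat-completion call; a real implementation would probably budget in tokens rather than characters.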

06-qa-bot/cache.db

2.48 MB
Binary file not shown.

06-qa-bot/embeddings.db

46.6 MB
Binary file not shown.
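
`cache.db` and `embeddings.db` are presumably the SQLite files written by `memoize_to_sqlite` in `utilities.py`: its default filename is `cache.db`, and `get_embedding` is decorated with `filename="embeddings.db"`. Assuming that, the sketch below reproduces the same hash-keyed table against an in-memory database, so the caching behaviour can be observed without the binaries. `slow_square` and `calls` are hypothetical names used only for the demo.

```python
import hashlib
import json
import sqlite3

# In-memory database so the demo leaves no file behind.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE IF NOT EXISTS cache (hash TEXT PRIMARY KEY, result TEXT)")

calls = []  # records every real (non-cached) computation


def memoize(func):
    def wrapped(*args):
        # Key the cache on a SHA-256 hash of the positional arguments.
        arg_hash = hashlib.sha256(repr(tuple(args)).encode("utf-8")).hexdigest()
        row = conn.execute("SELECT result FROM cache WHERE hash = ?", (arg_hash,)).fetchone()
        if row is not None:
            return json.loads(row[0])  # cache hit: skip the computation
        result = func(*args)
        conn.execute("INSERT INTO cache (hash, result) VALUES (?, ?)", (arg_hash, json.dumps(result)))
        conn.commit()
        return result

    return wrapped


@memoize
def slow_square(x):
    calls.append(x)
    return x * x


first = slow_square(4)   # computed and stored
second = slow_square(4)  # served from the SQLite table
```

Note that the key is derived from the arguments only, so two different functions sharing one database file would collide on identical arguments.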

06-qa-bot/f1_2022.csv

+24
@@ -0,0 +1,24 @@
Link
2022_Formula_One_World_Championship
2022_Abu_Dhabi_Grand_Prix
2022_Sao_Paulo_Grand_Prix
2022_Mexico_City_Grand_Prix
2022_United_States_Grand_Prix
2022_Japanese_Grand_Prix
2022_Singapore_Grand_Prix
2022_Italian_Grand_Prix
2022_Dutch_Grand_Prix
2022_Belgian_Grand_Prix
2022_Hungarian_Grand_Prix
2022_French_Grand_Prix
2022_Austrian_Grand_Prix
2022_British_Grand_Prix
2022_Canadian_Grand_Prix
2022_Azerbaijan_Grand_Prix
2022_Monaco_Grand_Prix
2022_Spanish_Grand_Prix
2022_Miami_Grand_Prix
2022_Emilia_Romagna_Grand_Prix
2022_Australian_Grand_Prix
2022_Saudi_Arabian_Grand_Prix
2022_Bahrain_Grand_Prix
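
The `Link` column appears to hold Wikipedia article titles. How the notebook consumes them is not visible in this diff, so the following is only a guess at the first step: reading the column and turning each title into an English Wikipedia URL. The inline `csv_text` is a three-row excerpt of the file, used so the sketch runs on its own.

```python
import csv
import io

# A few rows copied from f1_2022.csv; the real file has 23 entries.
csv_text = """Link
2022_Formula_One_World_Championship
2022_Abu_Dhabi_Grand_Prix
2022_Bahrain_Grand_Prix
"""

# DictReader uses the first row ("Link") as the column name.
titles = [row["Link"] for row in csv.DictReader(io.StringIO(csv_text))]

# Assumption: titles map directly onto English Wikipedia article URLs.
urls = [f"https://en.wikipedia.org/wiki/{title}" for title in titles]
```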

06-qa-bot/f1_utilities.py

+86
@@ -0,0 +1,86 @@
import fnmatch
import os
import re
from dataclasses import dataclass
from typing import Generator, Iterable, List

import openai
import pandas as pd
import tiktoken
from dotenv import load_dotenv
from openai.embeddings_utils import cosine_similarity  # requires openai<1.0
from utilities import get_embedding, num_tokens_from_messages

# Thanks to http://www.oldmanumby.com/ for his remaster and conversion of the Dungeons
# and Dragons 5th Edition SRD (Systems Reference Document)
# https://github.com/OldManUmby/DND.SRD.Wiki

# Thanks to Wizards of the Coast for DnD and preserving its openness with the Open Gaming License.


@dataclass(frozen=True, repr=True)
class WikipediaPath:
    article: str
    header: str

    def __str__(self):
        return f"{self.article} - {self.header}"


@dataclass(frozen=True, repr=True)
class Section:
    """
    A section is anything that follows a header (`== ... ==`), or
    the entire article if it has no headers.
    """

    location: WikipediaPath
    text: str

    def __str__(self):
        return f"{self.location}:\n{self.text}"


def wikipedia_splitter(contents: str, article_title: str, token_limit: int, split_point_regexes: List[str]) -> Iterable[Section]:
    # Split a Wikipedia article on its `==` section headers.
    """
    Generate sections of Wikipedia pages.
    :param contents: The contents of the Wikipedia page
    :param article_title: The title of the article, to be included in the emitted section object
    :param token_limit: The maximum number of tokens to allow in a section
    :param split_point_regexes: A list of regexes to split on. The first one is the highest precedence.
        If we can't fit a section into the token limit, we'll split on the next lower regex.
    """
    split_point_regex = split_point_regexes[0]
    sections = re.split(split_point_regex, contents)

    if not sections[0].strip():
        # Remove the first section if it's empty (this happens when the article starts with a header line)
        sections.pop(0)
    else:
        # Otherwise: Wikipedia articles often begin with a section that has no `==` header.
        first_section = sections.pop(0)
        yield Section(location=WikipediaPath(article=article_title, header=article_title), text=first_section)

    # And now proceed into splitting sections based on the `==` header
    for section in sections:
        if not section.strip():
            # Skip empty sections.
            continue

        header = section.splitlines()[0].strip()
        if "=" in split_point_regex:
            # If we're splitting on equal-sign headers, then we need to remove the trailing equal signs
            header = re.sub(r"=+$", "", header).strip()

        # To better steer embeddings, we include the article's title and section name above the text.
        emit = Section(location=WikipediaPath(article=article_title, header=header), text=f"{article_title}: {section}")
        # NOTE: this compares a character count with token_limit, which is only a rough proxy for tokens.
        if len(str(section).replace("\n", " ")) > token_limit and len(split_point_regexes) > 1:
            print(f"Section is too long: {emit.location}, splitting")
            subtitle = f"{article_title} - {header}"
            # If the section is too long, split it on a lower precedence split point
            # (only while a lower-precedence regex remains; otherwise the section is emitted as is).
            yield from wikipedia_splitter(section, subtitle, token_limit, split_point_regexes[1:])
        else:
            yield emit
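
To make the header handling in `wikipedia_splitter` concrete, here is a small standalone sketch of the same `re.split` and trailing-`=` cleanup. The sample text and the regex `r"\n==\s"` are assumptions made for illustration; the regexes actually passed in by the notebook are not visible in this diff.

```python
import re

# Hypothetical wikitext-style input; not taken from a real article.
contents = """Intro paragraph before any header.
== Background ==
Some background text.
== Race ==
Race report text.
"""

# Assumed top-level split point: a newline followed by "==" and whitespace.
# A "=== Subsection ===" line does not match, because "==" is followed by "=".
split_point_regex = r"\n==\s"
parts = re.split(split_point_regex, contents)

intro = parts.pop(0)  # text before the first `==` header

sections = {}
for part in parts:
    header = part.splitlines()[0].strip()
    header = re.sub(r"=+$", "", header).strip()  # drop the trailing "=="
    sections[header] = part
```

Because the third-level headers are left untouched by this regex, a second, lower-precedence pattern such as `r"\n===\s"` could be supplied for sections that are still too long.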

06-qa-bot/screenshot.png

72.7 KB

06-qa-bot/utilities.py

+152
@@ -0,0 +1,152 @@
import hashlib
import json
import os
import sqlite3
import zipfile
from typing import Dict, List, Tuple, TypeVar

import numpy as np
import openai
import tiktoken
from openai.embeddings_utils import cosine_similarity  # requires openai<1.0
from openai.error import APIConnectionError, APIError, RateLimitError  # requires openai<1.0
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_random_exponential


def get_file_with_zip_fallback(file_name: str, zip_file_name: str) -> str:
    # Check if the CSV file exists
    if not os.path.exists(file_name):
        # If not, check if the ZIP file exists and unzip it
        if os.path.exists(zip_file_name):
            with zipfile.ZipFile(zip_file_name, "r") as zip_ref:
                zip_ref.extractall()
        else:
            raise ValueError(f"Neither {file_name} nor {zip_file_name} was found in the current directory.")

    # Read the contents of the CSV file
    with open(file_name, "r", encoding="utf-8") as file:
        contents = file.read()

    return contents


# https://github.com/openai/openai-cookbook/blob/main/examples/How_to_count_tokens_with_tiktoken.ipynb
def num_tokens_from_messages(messages: List[Dict], model: str) -> int:
    """Returns the number of tokens used by a list of messages."""
    try:
        encoding = tiktoken.encoding_for_model(model)
    except KeyError:
        encoding = tiktoken.get_encoding("cl100k_base")
    if model == "gpt-3.5-turbo":
        return num_tokens_from_messages(messages, model="gpt-3.5-turbo-0301")
    elif model == "gpt-4":
        return num_tokens_from_messages(messages, model="gpt-4-0314")
    elif model == "gpt-3.5-turbo-0301":
        tokens_per_message = 4  # every message follows <|start|>{role/name}\n{content}<|end|>\n
        tokens_per_name = -1  # if there's a name, the role is omitted
    elif model == "gpt-4-0314":
        tokens_per_message = 3
        tokens_per_name = 1
    else:
        raise NotImplementedError(
            f"""num_tokens_from_messages() is not implemented for model {model}. See https://github.com/openai/openai-python/blob/main/chatml.md for information on how messages are converted to tokens."""
        )
    num_tokens = 0
    for message in messages:
        num_tokens += tokens_per_message
        for key, value in message.items():
            num_tokens += len(encoding.encode(value))
            if key == "name":
                num_tokens += tokens_per_name
    num_tokens += 3  # every reply is primed with <|start|>assistant<|message|>
    return num_tokens


def memoize_to_sqlite(filename: str = "cache.db"):
    """
    Memoization decorator that caches the output of a method in a SQLite database.
    The database connection is persisted across calls.
    """
    db_conn = sqlite3.connect(filename)
    db_conn.execute("CREATE TABLE IF NOT EXISTS cache (hash TEXT PRIMARY KEY, result TEXT)")

    def memoize(func):
        def wrapped(*args):
            # Compute the hash of the arguments
            arg_hash = hashlib.sha256(repr(tuple(args)).encode("utf-8")).hexdigest()

            # Check if the result is already cached
            cursor = db_conn.cursor()
            cursor.execute("SELECT result FROM cache WHERE hash = ?", (arg_hash,))
            row = cursor.fetchone()
            if row is not None:
                print(f"Cached result found for {arg_hash}. Returning it.")
                return json.loads(row[0])

            # Compute the result and cache it
            result = func(*args)
            cursor.execute("INSERT INTO cache (hash, result) VALUES (?, ?)", (arg_hash, json.dumps(result)))
            db_conn.commit()

            return result

        return wrapped

    return memoize


# This is not optimized for massive reads and writes, but it's good enough for this example
@memoize_to_sqlite(filename="embeddings.db")
@retry(
    wait=wait_random_exponential(multiplier=1, max=30),
    stop=stop_after_attempt(3),
    retry=retry_if_exception_type(APIConnectionError) | retry_if_exception_type(APIError) | retry_if_exception_type(RateLimitError),
)
def get_embedding(text: str) -> List[float]:
    """
    :param text: The text to compute an embedding for
    :return: The embedding for the text
    """
    # Replace newlines, which can negatively affect performance.
    text_no_newlines = text.replace("\n", " ")
    print(f"Computing embedding for {text_no_newlines[:50]}")
    response = openai.Embedding.create(input=text_no_newlines, model="text-embedding-ada-002")
    embeddings = response["data"][0]["embedding"]
    return embeddings


T = TypeVar("T")  # Declare type variable


def get_n_nearest_neighbors(query_embedding: List[float], embeddings: Dict[T, List[float]], n: int) -> List[Tuple[T, float]]:
    """
    :param query_embedding: The embedding to find the nearest neighbors for
    :param embeddings: A dictionary of embeddings, where the keys are the entity type (e.g. Movie, Segment)
        and the values are that entity's embeddings
    :param n: The number of nearest neighbors to return
    :return: A list of tuples, where the first element is the entity and the second element is the cosine
        similarity between -1 and 1
    """

    # This is not optimized for rapid indexing, but it's good enough for this example
    # If you're using this in production, you should use a more efficient vector datastore such as
    # those mentioned specifically by OpenAI here
    #
    # https://platform.openai.com/docs/guides/embeddings/how-can-i-retrieve-k-nearest-embedding-vectors-quickly
    #
    # * Pinecone, a fully managed vector database
    # * Weaviate, an open-source vector search engine
    # * Redis as a vector database
    # * Qdrant, a vector search engine
    # * Milvus, a vector database built for scalable similarity search
    # * Chroma, an open-source embeddings store
    #

    target_embedding = np.array(query_embedding)

    similarities = [(segment, cosine_similarity(target_embedding, np.array(embedding))) for segment, embedding in embeddings.items()]

    # Sort by similarity and get the top n results
    nearest_neighbors = sorted(similarities, key=lambda x: x[1], reverse=True)[:n]

    return nearest_neighbors
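
`get_n_nearest_neighbors` relies on `cosine_similarity` from `openai.embeddings_utils`, a module that only exists in `openai` releases before 1.0. As a rough illustration of what the ranking does, the sketch below re-implements the similarity in plain Python so it runs without NumPy or the `openai` package. The `nearest` helper and the toy vectors are hypothetical and not part of the commit.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    # Dot product of the vectors divided by the product of their lengths.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(query: List[float], embeddings: Dict[str, List[float]], n: int) -> List[Tuple[str, float]]:
    # Score every stored vector against the query, then keep the n best.
    scored = [(key, cosine_similarity(query, vec)) for key, vec in embeddings.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:n]


# Toy 2-D vectors; real text-embedding-ada-002 vectors have 1536 dimensions.
toy = {"east": [1.0, 0.0], "north": [0.0, 1.0], "north-east": [1.0, 1.0]}
top_two = nearest([1.0, 0.1], toy, n=2)
```

The query points almost due east, so `east` ranks first and `north-east` second, while `north` is dropped. This brute-force scan is linear in the number of stored vectors, which matches the "good enough for this example" caveat in the source comments.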

README.md

+21-1
@@ -1,6 +1,6 @@
 # OpenAI Projects

-5 projects using OpenAI APIs with Python.
+6 projects using OpenAI APIs with Python.

 ## Setup

@@ -112,6 +112,26 @@ An embedding-powered movie recommendation algorithm using Nomic Atlas.
 - visualizing our embeddings with Atlas.
 - recommending movies using our embeddings.

+## Q&A Bot
+
+A dynamic Q&A Bot using GPT-4.
+
+[Check the 06-qa-bot folder](06-qa-bot)
+
+<p align="center">
+<a href="06-qa-bot">
+<img src="06-qa-bot/screenshot.png">
+</a>
+</p>
+
+### Features
+
+- scraping data from Wikipedia.
+- generating embeddings for the 2022 Formula One season.
+- turning the questions from users into embeddings.
+- finding the K nearest neighbors to that embedding.
+- including the matching texts in the prompt to expand GPT-4 knowledge.
+
 ## Playground

 [Check the playground](playground/) to understand the basics.
