Commit 9ffde58

Getting Started Improvements (#138)
* Update FastEmbed README.md
* Rewrite GettingStarted to use TextEmbedding instead of DefaultEmbedding
* Improve grammar
* Update Getting Started.ipynb with model information and document format
1 parent 5603fbe commit 9ffde58

2 files changed: +137 -138 lines changed

README.md

+19 -16
Original file line number | Diff line number | Diff line change
@@ -4,16 +4,13 @@ FastEmbed is a lightweight, fast, Python library built for embedding generation.
44

55
The default text embedding (`TextEmbedding`) model is Flag Embedding, the top model in the [MTEB](https://huggingface.co/spaces/mteb/leaderboard) leaderboard. It supports "query" and "passage" prefixes for the input text. Here is an example for [Retrieval Embedding Generation](https://qdrant.github.io/fastembed/examples/Retrieval_with_FastEmbed/) and how to use [FastEmbed with Qdrant](https://qdrant.github.io/fastembed/examples/Usage_With_Qdrant/).
66

7-
1. Light & Fast
8-
- Quantized model weights
9-
- ONNX Runtime, no PyTorch dependency
10-
- CPU-first design
11-
- Data-parallelism for encoding of large datasets
7+
## 📈 Why FastEmbed?
128

13-
2. Accuracy/Recall
14-
- Better than OpenAI Ada-002
15-
- Default is Flag Embedding, which is top of the [MTEB](https://huggingface.co/spaces/mteb/leaderboard) leaderboard
16-
- List of [supported models](https://qdrant.github.io/fastembed/examples/Supported_Models/) - including multilingual models
9+
1. Light: FastEmbed is a lightweight library with few external dependencies. We don't require a GPU and don't download GBs of PyTorch dependencies, and instead use the ONNX Runtime. This makes it a great candidate for serverless runtimes like AWS Lambda.
10+
11+
2. Fast: FastEmbed is designed for speed. We use the ONNX Runtime, which is faster than PyTorch. We also use data-parallelism for encoding large datasets.
12+
13+
3. Accurate: FastEmbed is better than OpenAI Ada-002. We also [support](https://qdrant.github.io/fastembed/examples/Supported_Models/) an ever-expanding set of models, including a few multilingual models.
1714

1815
## 🚀 Installation
1916

@@ -26,18 +23,24 @@ pip install fastembed
2623
## 📖 Quickstart
2724

2825
```python
26+
import numpy as np
2927
from fastembed import TextEmbedding
3028
from typing import List
31-
import numpy as np
3229

30+
# Example list of documents
3331
documents: List[str] = [
34-
"passage: Hello, World!",
35-
"query: Hello, World!", # these are two different embedding
36-
"passage: This is an example passage.",
37-
"fastembed is supported by and maintained by Qdrant." # You can leave out the prefix but it's recommended
32+
"This is built to be faster and lighter than other embedding libraries e.g. Transformers, Sentence-Transformers, etc.",
33+
"fastembed is supported by and maintained by Qdrant.",
3834
]
39-
embedding_model = TextEmbedding(model_name="BAAI/bge-base-en")
40-
embeddings: List[np.ndarray] = list(embedding_model.embed(documents)) # Note the list() call - this is a generator
35+
36+
# This will trigger the model download and initialization
37+
embedding_model = TextEmbedding()
38+
print("The model BAAI/bge-small-en-v1.5 is ready to use.")
39+
40+
embeddings_generator = embedding_model.embed(documents) # reminder this is a generator
41+
embeddings_list = list(embedding_model.embed(documents))
42+
# you can also convert the generator to a list, and that to a numpy array
43+
len(embeddings_list[0]) # Vector of 384 dimensions
4144
```
4245

4346
## Usage with Qdrant
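
A note on the README quickstart in this diff: it calls `embed` twice, once for `embeddings_generator` and once for `embeddings_list`. One likely reason is that a Python generator can only be consumed once. The sketch below illustrates that behaviour with a hypothetical stand-in function, `fake_embed`, which is not part of the fastembed API and only mimics the "yield one vector per document" shape described in the commit.

```python
from typing import Iterator, List

calls: List[str] = []  # records which documents have actually been processed


def fake_embed(documents: List[str], dim: int = 4) -> Iterator[List[float]]:
    """Hypothetical stand-in for an embedding call; NOT the fastembed API."""
    for doc in documents:
        calls.append(doc)  # side effect so laziness can be observed
        # Toy "vector": the document length repeated `dim` times.
        yield [float(len(doc))] * dim


documents = ["first document", "second document"]

gen = fake_embed(documents)
nothing_computed_yet = calls == []  # creating the generator does no work

first = next(gen)   # computes only the first vector
rest = list(gen)    # materializes whatever is left
leftover = list(gen)  # the generator is exhausted, so this is empty
```

Because `leftover` is empty, any code that needs the vectors more than once should either call the embedding function again or keep the materialized list.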

docs/Getting Started.ipynb

+118 -122
Original file line number | Diff line number | Diff line change
@@ -11,7 +11,9 @@
1111
"\n",
1212
"## Quick Start\n",
1313
"\n",
14-
"The fastembed package is designed to be easy to use. The main class is the `Embedding` class. It takes a list of strings as input and returns a list of vectors as output. The `Embedding` class is initialized with a model file."
14+
"The fastembed package is designed to be easy to use. We'll be using the `TextEmbedding` class. It takes a list of strings as input and returns a generator of vectors. If you're seeing generators for the first time, don't worry: you can convert one to a list using `list()`.\n",
15+
"\n",
16+
"> 💡 You can learn more about generators from [Python Wiki](https://wiki.python.org/moin/Generators)"
1517
]
1618
},
1719
{
@@ -21,15 +23,7 @@
2123
"metadata": {},
2224
"outputs": [],
2325
"source": [
24-
"!pip install fastembed --upgrade --quiet # Install fastembed "
25-
]
26-
},
27-
{
28-
"cell_type": "markdown",
29-
"id": "ed81d725",
30-
"metadata": {},
31-
"source": [
32-
"Make the necessary imports, initialize the `Embedding` class, and embed your data into vectors:"
26+
"!pip install -Uqq fastembed # Install fastembed"
3327
]
3428
},
3529
{
@@ -39,186 +33,188 @@
3933
"metadata": {},
4034
"outputs": [
4135
{
42-
"name": "stderr",
43-
"output_type": "stream",
44-
"text": [
45-
"100%|██████████| 76.7M/76.7M [00:05<00:00, 15.0MiB/s]\n",
46-
"100%|██████████| 3/3 [00:00<00:00, 455.37it/s]"
47-
]
36+
"data": {
37+
"application/vnd.jupyter.widget-view+json": {
38+
"model_id": "890cc3b969354eec8d149d143e301a7a",
39+
"version_major": 2,
40+
"version_minor": 0
41+
},
42+
"text/plain": [
43+
"Fetching 9 files: 0%| | 0/9 [00:00<?, ?it/s]"
44+
]
45+
},
46+
"metadata": {},
47+
"output_type": "display_data"
4848
},
4949
{
5050
"name": "stdout",
5151
"output_type": "stream",
5252
"text": [
53-
"(384,)\n"
53+
"The model BAAI/bge-small-en-v1.5 is ready to use.\n"
5454
]
5555
},
5656
{
57-
"name": "stderr",
58-
"output_type": "stream",
59-
"text": [
60-
"\n"
61-
]
57+
"data": {
58+
"text/plain": [
59+
"384"
60+
]
61+
},
62+
"execution_count": 2,
63+
"metadata": {},
64+
"output_type": "execute_result"
6265
}
6366
],
6467
"source": [
65-
"from typing import List\n",
6668
"import numpy as np\n",
67-
"from fastembed.embedding import DefaultEmbedding\n",
69+
"from fastembed import TextEmbedding\n",
70+
"from typing import List\n",
6871
"\n",
6972
"# Example list of documents\n",
7073
"documents: List[str] = [\n",
71-
" \"Hello, World!\",\n",
72-
" \"This is an example document.\",\n",
74+
" \"This is built to be faster and lighter than other embedding libraries e.g. Transformers, Sentence-Transformers, etc.\",\n",
7375
" \"fastembed is supported by and maintained by Qdrant.\",\n",
7476
"]\n",
75-
"# Initialize the DefaultEmbedding class\n",
76-
"embedding_model = DefaultEmbedding()\n",
77-
"embeddings: List[np.ndarray] = list(embedding_model.embed(documents))\n",
78-
"print(embeddings[0].shape)"
77+
"\n",
78+
"# This will trigger the model download and initialization\n",
79+
"embedding_model = TextEmbedding()\n",
80+
"print(\"The model BAAI/bge-small-en-v1.5 is ready to use.\")\n",
81+
"\n",
82+
"embeddings_generator = embedding_model.embed(documents) # reminder this is a generator\n",
83+
"embeddings_list = list(embedding_model.embed(documents))\n",
84+
" # you can also convert the generator to a list, and that to a numpy array\n",
85+
"len(embeddings_list[0]) # Vector of 384 dimensions"
7986
]
8087
},
8188
{
8289
"cell_type": "markdown",
83-
"id": "8c49ae50",
90+
"id": "d772190b",
8491
"metadata": {},
8592
"source": [
86-
"## Let's think step by step"
87-
]
88-
},
89-
{
90-
"cell_type": "markdown",
91-
"id": "92cf4b76",
92-
"metadata": {},
93-
"source": [
94-
"### Setup\n",
95-
"\n",
96-
"Importing the required classes and modules:"
93+
"> 💡 **Why do we use generators?**\n",
94+
"> \n",
95+
"> We use them mostly to save memory. Instead of loading all the vectors into memory at once, we can process them one by one, which is useful when you have a large dataset."
9796
]
9897
},
9998
{
10099
"cell_type": "code",
101100
"execution_count": 3,
102-
"id": "c0a6f634",
103-
"metadata": {},
104-
"outputs": [],
105-
"source": [
106-
"from typing import List\n",
107-
"import numpy as np\n",
108-
"from fastembed.embedding import DefaultEmbedding"
109-
]
110-
},
111-
{
112-
"cell_type": "markdown",
113-
"id": "3fd03a71",
101+
"id": "8a225cb8",
114102
"metadata": {},
103+
"outputs": [
104+
{
105+
"name": "stdout",
106+
"output_type": "stream",
107+
"text": [
108+
"Document: This is built to be faster and lighter than other embedding libraries e.g. Transformers, Sentence-Transformers, etc.\n",
109+
"Vector of type: <class 'numpy.ndarray'> with shape: (384,)\n",
110+
"Document: fastembed is supported by and maintained by Qdrant.\n",
111+
"Vector of type: <class 'numpy.ndarray'> with shape: (384,)\n"
112+
]
113+
}
114+
],
115115
"source": [
116-
"Notice that we are using the DefaultEmbedding -- which is a quantized, state of the Art Flag Embedding model which beats OpenAI's Embedding by a large margin. \n",
116+
"embeddings_generator = embedding_model.embed(documents) # reminder this is a generator\n",
117117
"\n",
118-
"### Prepare your Documents\n",
119-
"You can define a list of documents that you'd like to embed. These can be sentences, paragraphs, or even entire documents. \n",
120-
"\n",
121-
"#### Format of the Document List\n",
122-
"1. List of Strings: Your documents must be in a list, and each document must be a string\n",
123-
"2. For Retrieval Tasks: If you're working with queries and passages, you can add special labels to them:\n",
124-
"- **Queries**: Add \"query:\" at the beginning of each query string\n",
125-
"- **Passages**: Add \"passage:\" at the beginning of each passage string"
118+
"for doc, vector in zip(documents, embeddings_generator):\n",
119+
" print(\"Document:\", doc)\n",
120+
" print(f\"Vector of type: {type(vector)} with shape: {vector.shape}\")"
126121
]
127122
},
128123
{
129124
"cell_type": "code",
130-
"execution_count": 3,
131-
"id": "145a56ce",
125+
"execution_count": 4,
126+
"id": "769a1be9",
132127
"metadata": {},
133-
"outputs": [],
128+
"outputs": [
129+
{
130+
"data": {
131+
"text/plain": [
132+
"(2, 384)"
133+
]
134+
},
135+
"execution_count": 4,
136+
"metadata": {},
137+
"output_type": "execute_result"
138+
}
139+
],
134140
"source": [
135-
"# Example list of documents\n",
136-
"documents: List[str] = [\n",
137-
" \"passage: Hello, World!\",\n",
138-
" \"query: Hello, World!\", # these are two different embedding\n",
139-
" \"passage: This is an example passage.\",\n",
140-
" # You can leave out the prefix but it's recommended\n",
141-
" \"fastembed is supported by and maintained by Qdrant.\",\n",
142-
"]"
141+
"embeddings_list = np.array(\n",
142+
" list(embedding_model.embed(documents))\n",
143+
") # you can also convert the generator to a list, and that to a numpy array\n",
144+
"embeddings_list.shape"
143145
]
144146
},
145147
{
146148
"cell_type": "markdown",
147-
"id": "1cb3cc87",
149+
"id": "8c49ae50",
148150
"metadata": {},
149151
"source": [
150-
"### Load the Embedding Model Weights\n",
151-
"Next, initialize the Embedding class with the desired parameters. Here, \"BAAI/bge-small-en\" is the pre-trained model name, and max_length=512 is the maximum token length for each document.\n",
152+
"We're using [BAAI/bge-small-en-v1.5](https://huggingface.co/BAAI/bge-small-en-v1.5), a state-of-the-art Flag Embedding model. The model does better than OpenAI text-embedding-ada-002. We've made it even faster by converting it to ONNX format and quantizing the model for you.\n",
152153
"\n",
153-
"This will download the model weights, decompress to directory `local_cache` and load them into the Embedding class.\n",
154+
"#### Format of the Document List\n",
154155
"\n",
155-
"#### Initialize DefaultEmbedding\n",
156+
"1. List of Strings: Your documents must be in a list, and each document must be a string\n",
157+
"2. For Retrieval Tasks with our default: If you're working with queries and passages, you can add special labels to them:\n",
158+
"- **Queries**: Add \"query:\" at the beginning of each query string\n",
159+
"- **Passages**: Add \"passage:\" at the beginning of each passage string\n",
156160
"\n",
157-
"We will initialize Flag Embeddings with the model name and the maximum token length. That is the DefaultEmbedding class with the model name \"BAAI/bge-small-en\" and max_length=512."
158-
]
159-
},
160-
{
161-
"cell_type": "code",
162-
"execution_count": 4,
163-
"id": "272c8915",
164-
"metadata": {},
165-
"outputs": [],
166-
"source": [
167-
"embedding_model = DefaultEmbedding()"
168-
]
169-
},
170-
{
171-
"cell_type": "markdown",
172-
"id": "5549d501",
173-
"metadata": {},
174-
"source": [
175-
"### Embed your Documents\n",
161+
"## Beyond the default model\n",
176162
"\n",
177-
"Use the embed method of the embedding model to transform the documents into a List of np.array. The method returns a generator, so we cast it to a list to get the embeddings."
163+
"The default model is built for speed and efficiency. If you need a more accurate model, you can use the `TextEmbedding` class to load any model from our list of available models. You can find the list of available models using `TextEmbedding.list_supported_models()`."
178164
]
179165
},
180166
{
181167
"cell_type": "code",
182168
"execution_count": 5,
183-
"id": "8013eee9",
169+
"id": "2e9c8766",
184170
"metadata": {},
185171
"outputs": [
186172
{
187-
"name": "stderr",
188-
"output_type": "stream",
189-
"text": [
190-
"100%|██████████| 4/4 [00:00<00:00, 361.82it/s]\n"
191-
]
173+
"data": {
174+
"application/vnd.jupyter.widget-view+json": {
175+
"model_id": "9470ec542f3c4400a42452c2489a1abc",
176+
"version_major": 2,
177+
"version_minor": 0
178+
},
179+
"text/plain": [
180+
"Fetching 8 files: 0%| | 0/8 [00:00<?, ?it/s]"
181+
]
182+
},
183+
"metadata": {},
184+
"output_type": "display_data"
192185
}
193186
],
194187
"source": [
195-
"embeddings: List[np.ndarray] = list(embedding_model.embed(documents))"
196-
]
197-
},
198-
{
199-
"cell_type": "markdown",
200-
"id": "e5b5a6ad",
201-
"metadata": {},
202-
"source": [
203-
"You can print the shape of the embeddings to understand their dimensions. Typically, the shape will indicate the number of dimensions in the vector."
188+
"multilingual_large_model = TextEmbedding(\"intfloat/multilingual-e5-large\") # This can take a few minutes to download"
204189
]
205190
},
206191
{
207192
"cell_type": "code",
208193
"execution_count": 6,
209-
"id": "0d8c8e08",
194+
"id": "a9e70f0e",
210195
"metadata": {},
211196
"outputs": [
212197
{
213-
"name": "stdout",
214-
"output_type": "stream",
215-
"text": [
216-
"(384,)\n"
217-
]
198+
"data": {
199+
"text/plain": [
200+
"(4, 1024)"
201+
]
202+
},
203+
"execution_count": 6,
204+
"metadata": {},
205+
"output_type": "execute_result"
218206
}
219207
],
220208
"source": [
221-
"print(embeddings[0].shape) # (384,) or similar output"
209+
"np.array(list(multilingual_large_model.embed([\"Hello, world!\", \"你好世界\", \"¡Hola Mundo!\", \"नमस्ते!\"]))).shape # Vector of 1024 dimensions"
210+
]
211+
},
212+
{
213+
"cell_type": "markdown",
214+
"id": "64fe20ed",
215+
"metadata": {},
216+
"source": [
217+
"Next: Check out how to use FastEmbed with Qdrant for similarity search: [FastEmbed with Qdrant](https://qdrant.github.io/fastembed/examples/Usage_With_Qdrant/)"
222218
]
223219
}
224220
],
@@ -238,7 +234,7 @@
238234
"name": "python",
239235
"nbconvert_exporter": "python",
240236
"pygments_lexer": "ipython3",
241-
"version": "3.9.17"
237+
"version": "3.10.13"
242238
}
243239
},
244240
"nbformat": 4,
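
The "Format of the Document List" cell in the notebook diff says that queries and passages can be labelled with `query:` and `passage:` prefixes for retrieval tasks. A minimal sketch of how such labels might be applied before embedding follows. The helper `with_prefix` is hypothetical and not part of fastembed, and whether the prefixes help is an assumption that depends on the chosen model.

```python
from typing import List


def with_prefix(texts: List[str], kind: str) -> List[str]:
    """Prepend 'query: ' or 'passage: ' to each text (hypothetical helper)."""
    if kind not in ("query", "passage"):
        raise ValueError(f"kind must be 'query' or 'passage', got {kind!r}")
    return [f"{kind}: {text}" for text in texts]


# Label the two sides of a retrieval task differently.
queries = with_prefix(["Who maintains fastembed?"], "query")
passages = with_prefix(
    ["fastembed is supported by and maintained by Qdrant."], "passage"
)
```

The resulting lists are plain strings, so they can be passed to an embedding call in the same way as the unlabelled `documents` list used in the notebook.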
