|
11 | 11 | "\n",
|
12 | 12 | "## Quick Start\n",
|
13 | 13 | "\n",
|
14 |
| - "The fastembed package is designed to be easy to use. The main class is the `Embedding` class. It takes a list of strings as input and returns a list of vectors as output. The `Embedding` class is initialized with a model file." |
| 14 | + "The fastembed package is designed to be easy to use. We'll be using the `TextEmbedding` class. It takes a list of strings as input and returns a generator of vectors. If you're seeing generators for the first time, don't worry: you can convert one to a list using `list()`.\n", |
| 15 | + "\n", |
| 16 | + "> 💡 You can learn more about generators from the [Python Wiki](https://wiki.python.org/moin/Generators)" |
15 | 17 | ]
|
16 | 18 | },
|
17 | 19 | {
|
|
21 | 23 | "metadata": {},
|
22 | 24 | "outputs": [],
|
23 | 25 | "source": [
|
24 |
| - "!pip install fastembed --upgrade --quiet # Install fastembed " |
25 |
| - ] |
26 |
| - }, |
27 |
| - { |
28 |
| - "cell_type": "markdown", |
29 |
| - "id": "ed81d725", |
30 |
| - "metadata": {}, |
31 |
| - "source": [ |
32 |
| - "Make the necessary imports, initialize the `Embedding` class, and embed your data into vectors:" |
| 26 | + "!pip install -Uqq fastembed # Install fastembed" |
33 | 27 | ]
|
34 | 28 | },
|
35 | 29 | {
|
|
39 | 33 | "metadata": {},
|
40 | 34 | "outputs": [
|
41 | 35 | {
|
42 |
| - "name": "stderr", |
43 |
| - "output_type": "stream", |
44 |
| - "text": [ |
45 |
| - "100%|██████████| 76.7M/76.7M [00:05<00:00, 15.0MiB/s]\n", |
46 |
| - "100%|██████████| 3/3 [00:00<00:00, 455.37it/s]" |
47 |
| - ] |
| 36 | + "data": { |
| 37 | + "application/vnd.jupyter.widget-view+json": { |
| 38 | + "model_id": "890cc3b969354eec8d149d143e301a7a", |
| 39 | + "version_major": 2, |
| 40 | + "version_minor": 0 |
| 41 | + }, |
| 42 | + "text/plain": [ |
| 43 | + "Fetching 9 files: 0%| | 0/9 [00:00<?, ?it/s]" |
| 44 | + ] |
| 45 | + }, |
| 46 | + "metadata": {}, |
| 47 | + "output_type": "display_data" |
48 | 48 | },
|
49 | 49 | {
|
50 | 50 | "name": "stdout",
|
51 | 51 | "output_type": "stream",
|
52 | 52 | "text": [
|
53 |
| - "(384,)\n" |
| 53 | + "The model BAAI/bge-small-en-v1.5 is ready to use.\n" |
54 | 54 | ]
|
55 | 55 | },
|
56 | 56 | {
|
57 |
| - "name": "stderr", |
58 |
| - "output_type": "stream", |
59 |
| - "text": [ |
60 |
| - "\n" |
61 |
| - ] |
| 57 | + "data": { |
| 58 | + "text/plain": [ |
| 59 | + "384" |
| 60 | + ] |
| 61 | + }, |
| 62 | + "execution_count": 2, |
| 63 | + "metadata": {}, |
| 64 | + "output_type": "execute_result" |
62 | 65 | }
|
63 | 66 | ],
|
64 | 67 | "source": [
|
65 |
| - "from typing import List\n", |
66 | 68 | "import numpy as np\n",
|
67 |
| - "from fastembed.embedding import DefaultEmbedding\n", |
| 69 | + "from fastembed import TextEmbedding\n", |
| 70 | + "from typing import List\n", |
68 | 71 | "\n",
|
69 | 72 | "# Example list of documents\n",
|
70 | 73 | "documents: List[str] = [\n",
|
71 |
| - " \"Hello, World!\",\n", |
72 |
| - " \"This is an example document.\",\n", |
| 74 | + " \"This is built to be faster and lighter than other embedding libraries e.g. Transformers, Sentence-Transformers, etc.\",\n", |
73 | 75 | " \"fastembed is supported by and maintained by Qdrant.\",\n",
|
74 | 76 | "]\n",
|
75 |
| - "# Initialize the DefaultEmbedding class\n", |
76 |
| - "embedding_model = DefaultEmbedding()\n", |
77 |
| - "embeddings: List[np.ndarray] = list(embedding_model.embed(documents))\n", |
78 |
| - "print(embeddings[0].shape)" |
| 77 | + "\n", |
| 78 | + "# This will trigger the model download and initialization\n", |
| 79 | + "embedding_model = TextEmbedding()\n", |
| 80 | + "print(\"The model BAAI/bge-small-en-v1.5 is ready to use.\")\n", |
| 81 | + "\n", |
| 82 | + "embeddings_generator = embedding_model.embed(documents) # reminder this is a generator\n", |
| 83 | + "embeddings_list = list(embedding_model.embed(documents))\n", |
| 84 | + " # you can also convert the generator to a list, and that to a numpy array\n", |
| 85 | + "len(embeddings_list[0]) # Vector of 384 dimensions" |
79 | 86 | ]
|
80 | 87 | },
|
81 | 88 | {
|
82 | 89 | "cell_type": "markdown",
|
83 |
| - "id": "8c49ae50", |
| 90 | + "id": "d772190b", |
84 | 91 | "metadata": {},
|
85 | 92 | "source": [
|
86 |
| - "## Let's think step by step" |
87 |
| - ] |
88 |
| - }, |
89 |
| - { |
90 |
| - "cell_type": "markdown", |
91 |
| - "id": "92cf4b76", |
92 |
| - "metadata": {}, |
93 |
| - "source": [ |
94 |
| - "### Setup\n", |
95 |
| - "\n", |
96 |
| - "Importing the required classes and modules:" |
| 93 | + "> 💡 **Why do we use generators?**\n", |
| 94 | + "> \n", |
| 95 | + "> We use them mostly to save memory. Instead of loading all the vectors into memory at once, we can produce them one by one. This is useful when you have a large dataset and don't want to hold every vector in memory at the same time." |
97 | 96 | ]
|
98 | 97 | },
|
99 | 98 | {
|
100 | 99 | "cell_type": "code",
|
101 | 100 | "execution_count": 3,
|
102 |
| - "id": "c0a6f634", |
103 |
| - "metadata": {}, |
104 |
| - "outputs": [], |
105 |
| - "source": [ |
106 |
| - "from typing import List\n", |
107 |
| - "import numpy as np\n", |
108 |
| - "from fastembed.embedding import DefaultEmbedding" |
109 |
| - ] |
110 |
| - }, |
111 |
| - { |
112 |
| - "cell_type": "markdown", |
113 |
| - "id": "3fd03a71", |
| 101 | + "id": "8a225cb8", |
114 | 102 | "metadata": {},
|
| 103 | + "outputs": [ |
| 104 | + { |
| 105 | + "name": "stdout", |
| 106 | + "output_type": "stream", |
| 107 | + "text": [ |
| 108 | + "Document: This is built to be faster and lighter than other embedding libraries e.g. Transformers, Sentence-Transformers, etc.\n", |
| 109 | + "Vector of type: <class 'numpy.ndarray'> with shape: (384,)\n", |
| 110 | + "Document: fastembed is supported by and maintained by Qdrant.\n", |
| 111 | + "Vector of type: <class 'numpy.ndarray'> with shape: (384,)\n" |
| 112 | + ] |
| 113 | + } |
| 114 | + ], |
115 | 115 | "source": [
|
116 |
| - "Notice that we are using the DefaultEmbedding -- which is a quantized, state of the Art Flag Embedding model which beats OpenAI's Embedding by a large margin. \n", |
| 116 | + "embeddings_generator = embedding_model.embed(documents) # reminder this is a generator\n", |
117 | 117 | "\n",
|
118 |
| - "### Prepare your Documents\n", |
119 |
| - "You can define a list of documents that you'd like to embed. These can be sentences, paragraphs, or even entire documents. \n", |
120 |
| - "\n", |
121 |
| - "#### Format of the Document List\n", |
122 |
| - "1. List of Strings: Your documents must be in a list, and each document must be a string\n", |
123 |
| - "2. For Retrieval Tasks: If you're working with queries and passages, you can add special labels to them:\n", |
124 |
| - "- **Queries**: Add \"query:\" at the beginning of each query string\n", |
125 |
| - "- **Passages**: Add \"passage:\" at the beginning of each passage string" |
| 118 | + "for doc, vector in zip(documents, embeddings_generator):\n", |
| 119 | + " print(\"Document:\", doc)\n", |
| 120 | + " print(f\"Vector of type: {type(vector)} with shape: {vector.shape}\")" |
126 | 121 | ]
|
127 | 122 | },
|
128 | 123 | {
|
129 | 124 | "cell_type": "code",
|
130 |
| - "execution_count": 3, |
131 |
| - "id": "145a56ce", |
| 125 | + "execution_count": 4, |
| 126 | + "id": "769a1be9", |
132 | 127 | "metadata": {},
|
133 |
| - "outputs": [], |
| 128 | + "outputs": [ |
| 129 | + { |
| 130 | + "data": { |
| 131 | + "text/plain": [ |
| 132 | + "(2, 384)" |
| 133 | + ] |
| 134 | + }, |
| 135 | + "execution_count": 4, |
| 136 | + "metadata": {}, |
| 137 | + "output_type": "execute_result" |
| 138 | + } |
| 139 | + ], |
134 | 140 | "source": [
|
135 |
| - "# Example list of documents\n", |
136 |
| - "documents: List[str] = [\n", |
137 |
| - " \"passage: Hello, World!\",\n", |
138 |
| - " \"query: Hello, World!\", # these are two different embedding\n", |
139 |
| - " \"passage: This is an example passage.\",\n", |
140 |
| - " # You can leave out the prefix but it's recommended\n", |
141 |
| - " \"fastembed is supported by and maintained by Qdrant.\",\n", |
142 |
| - "]" |
| 141 | + "embeddings_list = np.array(\n", |
| 142 | + " list(embedding_model.embed(documents))\n", |
| 143 | + ") # you can also convert the generator to a list, and that to a numpy array\n", |
| 144 | + "embeddings_list.shape" |
143 | 145 | ]
|
144 | 146 | },
|
145 | 147 | {
|
146 | 148 | "cell_type": "markdown",
|
147 |
| - "id": "1cb3cc87", |
| 149 | + "id": "8c49ae50", |
148 | 150 | "metadata": {},
|
149 | 151 | "source": [
|
150 |
| - "### Load the Embedding Model Weights\n", |
151 |
| - "Next, initialize the Embedding class with the desired parameters. Here, \"BAAI/bge-small-en\" is the pre-trained model name, and max_length=512 is the maximum token length for each document.\n", |
| 152 | + "We're using [BAAI/bge-small-en-v1.5](https://huggingface.co/BAAI/bge-small-en-v1.5), a state-of-the-art Flag Embedding model. The model does better than OpenAI text-embedding-ada-002. We've made it even faster by converting it to ONNX format and quantizing the model for you.\n", |
152 | 153 | "\n",
|
153 |
| - "This will download the model weights, decompress to directory `local_cache` and load them into the Embedding class.\n", |
| 154 | + "#### Format of the Document List\n", |
154 | 155 | "\n",
|
155 |
| - "#### Initialize DefaultEmbedding\n", |
| 156 | + "1. List of Strings: Your documents must be in a list, and each document must be a string\n", |
| 157 | + "2. For Retrieval Tasks with our default model: If you're working with queries and passages, you can add special labels to them:\n", |
| 158 | + "- **Queries**: Add \"query:\" at the beginning of each query string\n", |
| 159 | + "- **Passages**: Add \"passage:\" at the beginning of each passage string\n", |
156 | 160 | "\n",
|
157 |
| - "We will initialize Flag Embeddings with the model name and the maximum token length. That is the DefaultEmbedding class with the model name \"BAAI/bge-small-en\" and max_length=512." |
158 |
| - ] |
159 |
| - }, |
160 |
| - { |
161 |
| - "cell_type": "code", |
162 |
| - "execution_count": 4, |
163 |
| - "id": "272c8915", |
164 |
| - "metadata": {}, |
165 |
| - "outputs": [], |
166 |
| - "source": [ |
167 |
| - "embedding_model = DefaultEmbedding()" |
168 |
| - ] |
169 |
| - }, |
170 |
| - { |
171 |
| - "cell_type": "markdown", |
172 |
| - "id": "5549d501", |
173 |
| - "metadata": {}, |
174 |
| - "source": [ |
175 |
| - "### Embed your Documents\n", |
| 161 | + "## Beyond the default model\n", |
176 | 162 | "\n",
|
177 |
| - "Use the embed method of the embedding model to transform the documents into a List of np.array. The method returns a generator, so we cast it to a list to get the embeddings." |
| 163 | + "The default model is built for speed and efficiency. If you need a more accurate model, you can use the `TextEmbedding` class to load any model from our list of available models. You can find the list of available models using `TextEmbedding.list_supported_models()`." |
178 | 164 | ]
|
179 | 165 | },
|
180 | 166 | {
|
181 | 167 | "cell_type": "code",
|
182 | 168 | "execution_count": 5,
|
183 |
| - "id": "8013eee9", |
| 169 | + "id": "2e9c8766", |
184 | 170 | "metadata": {},
|
185 | 171 | "outputs": [
|
186 | 172 | {
|
187 |
| - "name": "stderr", |
188 |
| - "output_type": "stream", |
189 |
| - "text": [ |
190 |
| - "100%|██████████| 4/4 [00:00<00:00, 361.82it/s]\n" |
191 |
| - ] |
| 173 | + "data": { |
| 174 | + "application/vnd.jupyter.widget-view+json": { |
| 175 | + "model_id": "9470ec542f3c4400a42452c2489a1abc", |
| 176 | + "version_major": 2, |
| 177 | + "version_minor": 0 |
| 178 | + }, |
| 179 | + "text/plain": [ |
| 180 | + "Fetching 8 files: 0%| | 0/8 [00:00<?, ?it/s]" |
| 181 | + ] |
| 182 | + }, |
| 183 | + "metadata": {}, |
| 184 | + "output_type": "display_data" |
192 | 185 | }
|
193 | 186 | ],
|
194 | 187 | "source": [
|
195 |
| - "embeddings: List[np.ndarray] = list(embedding_model.embed(documents))" |
196 |
| - ] |
197 |
| - }, |
198 |
| - { |
199 |
| - "cell_type": "markdown", |
200 |
| - "id": "e5b5a6ad", |
201 |
| - "metadata": {}, |
202 |
| - "source": [ |
203 |
| - "You can print the shape of the embeddings to understand their dimensions. Typically, the shape will indicate the number of dimensions in the vector." |
| 188 | + "multilingual_large_model = TextEmbedding(\"intfloat/multilingual-e5-large\") # This can take a few minutes to download" |
204 | 189 | ]
|
205 | 190 | },
|
206 | 191 | {
|
207 | 192 | "cell_type": "code",
|
208 | 193 | "execution_count": 6,
|
209 |
| - "id": "0d8c8e08", |
| 194 | + "id": "a9e70f0e", |
210 | 195 | "metadata": {},
|
211 | 196 | "outputs": [
|
212 | 197 | {
|
213 |
| - "name": "stdout", |
214 |
| - "output_type": "stream", |
215 |
| - "text": [ |
216 |
| - "(384,)\n" |
217 |
| - ] |
| 198 | + "data": { |
| 199 | + "text/plain": [ |
| 200 | + "(4, 1024)" |
| 201 | + ] |
| 202 | + }, |
| 203 | + "execution_count": 6, |
| 204 | + "metadata": {}, |
| 205 | + "output_type": "execute_result" |
218 | 206 | }
|
219 | 207 | ],
|
220 | 208 | "source": [
|
221 |
| - "print(embeddings[0].shape) # (384,) or similar output" |
| 209 | + "np.array(list(multilingual_large_model.embed([\"Hello, world!\", \"你好世界\", \"¡Hola Mundo!\", \"नमस्ते!\"]))).shape # Vector of 1024 dimensions" |
| 210 | + ] |
| 211 | + }, |
| 212 | + { |
| 213 | + "cell_type": "markdown", |
| 214 | + "id": "64fe20ed", |
| 215 | + "metadata": {}, |
| 216 | + "source": [ |
| 217 | + "Next: Check out how to use FastEmbed with Qdrant for similarity search: [FastEmbed with Qdrant](https://qdrant.github.io/fastembed/examples/Usage_With_Qdrant/)" |
222 | 218 | ]
|
223 | 219 | }
|
224 | 220 | ],
|
|
238 | 234 | "name": "python",
|
239 | 235 | "nbconvert_exporter": "python",
|
240 | 236 | "pygments_lexer": "ipython3",
|
241 |
| - "version": "3.9.17" |
| 237 | + "version": "3.10.13" |
242 | 238 | }
|
243 | 239 | },
|
244 | 240 | "nbformat": 4,
|
|
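The notebook text in this diff explains two ideas that a small standalone sketch may make more concrete: generators produce vectors one at a time rather than all at once, and retrieval inputs can carry a "query:" or "passage:" label. The sketch below uses only the standard library and does not call fastembed. `add_prefix`, `batched`, and `toy_embed` are hypothetical helpers invented for illustration, `toy_embed` is a stand-in that returns made-up two-number "vectors" instead of real embeddings, and the space after the colon in the prefix is an assumption.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def add_prefix(texts: Iterable[str], prefix: str) -> Iterator[str]:
    """Lazily prepend a retrieval label such as "query: " or "passage: "."""
    for text in texts:
        yield f"{prefix}{text}"


def batched(items: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Yield lists of at most batch_size items without materializing the input."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def toy_embed(texts: Iterable[str], batch_size: int = 2) -> Iterator[List[float]]:
    """Hypothetical stand-in for a model: yields one tiny fake vector per text."""
    for batch in batched(texts, batch_size):
        for text in batch:
            # Not a real embedding: just the text length and its space count.
            yield [float(len(text)), float(text.count(" "))]


documents = ["Hello, World!", "fastembed is maintained by Qdrant."]
passages = add_prefix(documents, "passage: ")

vectors = toy_embed(passages)  # a generator: nothing has been computed yet
first = next(vectors)          # pulls one batch through and yields one vector
remaining = list(vectors)      # list() drains whatever is left
```

The same consumption pattern (iterate, or wrap in `list()`) is what the notebook applies to the generator returned by `embedding_model.embed(documents)`.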