Best way for indexing a stream of data. #67

wammy19 · 2021-07-14T15:14:42Z


Hi again.

I'm wondering what your opinion is on the best way of indexing a constant stream of data.

I've experimented with using the API to index one document at a time, but I'm not too confident that this is robust, as I have to keep track of the ID on the client side. I need to make sure no document gets overwritten, and there doesn't seem to be any auto-increment of the ID with this method. Is there a way to have the ID auto-increment on Qdrant's side? Alternatively, Elasticsearch allows strings as IDs, meaning you can use things such as NanoID to generate random IDs that have a low probability of being regenerated.

Code:

vector_to_index = json.dumps({
   "upsert_points": {
      "points": [
         {
            "id": self.id_counter,
            "vector": encoded_document.json()['encoded_data'],
            "payload": document.get_data_as_dict()
         }
      ]
   }
})

self.id_counter += 1

requests.request(
    'POST',
    f'http://{self.qdrant_ip_address}:{self.qdrant_port}/collections/articles',
    headers={'Content-Type': 'application/json'},
    data=vector_to_index
)

I was also looking at the Python client's upload_collection() method, but it seems intended for the initial upload of a collection. This method is also throwing an error:

ValueError: not enough values to unpack (expected 2, got 1)

The error comes from line 133 of qdrant_client.py:
num_vectors, _dim = vectors.shape

The error makes sense, as I'm passing in an np.array with a shape of (768,) (768 because I'm using the BERT model for encoding), but I can't seem to get the data into the form the client wants. Any advice here would be helpful, but as I mentioned before, this method probably isn't intended for indexing one document at a time.

Code:

# Encoding happens in a separate Docker container, which returns a List[float].
encoded_document = requests.request(
    'POST',
    f'http://{self.encoding_service_ip}:{self.encoding_service_port}/encode',
    data=json.dumps(text_for_encoding)
)

# Convert to numpy array.
vectorized_data = np.array(encoded_document.json()['encoded_data'])

self.qdrant_client.upload_collection(
    collection_name='articles',
    vectors=vectorized_data,
    payload=document.get_data_as_dict(),
    ids=None,
    batch_size=100
)

Any response is much appreciated :)

generall (Maintainer) · 2021-07-14T16:30:33Z

Hi @wammy19, thanks for the interesting questions!

Regarding the IDs, Qdrant was intended to be used as an indexing engine, so in most applications it should be provided with existing IDs that you already have in your data storage. As you correctly noticed, this original ID could be a string, which Qdrant currently supports only as part of the payload, so you would still need to have an integer ID. In the future, we are planning to add support for string IDs as well: https://github.com/qdrant/qdrant/projects/1#card-52881607

What I can propose in your case depends on the number of vectors you are planning to index. Qdrant uses an unsigned 64-bit integer type for storing IDs, which gives you the range 0..18446744073709551615 (2**64 possible values). With a range this large, collisions are quite unlikely if you just use randomly generated IDs or hashes. https://stackoverflow.com/questions/22029012/probability-of-64bit-hash-code-collisions
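As a rough illustration of the suggestion above, here is a minimal sketch of two ways to produce a 64-bit integer ID on the client side. The function names random_point_id and hashed_point_id are hypothetical and not part of Qdrant or its client; the choice of SHA-256 truncated to 8 bytes is also just an assumption for the example.

```python
import hashlib
import secrets

U64_MAX = 2**64 - 1  # largest value an unsigned 64-bit integer can hold


def random_point_id() -> int:
    """Return a random integer in the range 0..2**64 - 1 (hypothetical helper)."""
    return secrets.randbits(64)


def hashed_point_id(external_id: str) -> int:
    """Derive a stable 64-bit integer from an existing string ID (hypothetical helper)."""
    digest = hashlib.sha256(external_id.encode("utf-8")).digest()
    # Keep only the first 8 bytes (64 bits) of the digest.
    return int.from_bytes(digest[:8], byteorder="big")


if __name__ == "__main__":
    print(random_point_id())
    print(hashed_point_id("article-123"))
```

The hashed variant gives the same integer for the same string every time, so re-indexing a document would target the same ID, while the original string could still be kept in the payload.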

According to the birthday approximation, the probability of at least one collision among n random IDs drawn from k possible values is about 1 - math.exp(-n**2 / (2 * k)). With k = 2**64, indexing 10 million vectors gives a collision probability of only about 0.000003.
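The estimate can be checked numerically. In the birthday approximation, math.exp(-n**2 / (2 * k)) is the probability of seeing no collision, so the collision probability is its complement. The sketch below assumes IDs are drawn uniformly at random from all 2**64 values; the function name collision_probability is made up for this example.

```python
import math


def collision_probability(n: int, bits: int = 64) -> float:
    """Approximate chance of at least one collision among n random IDs."""
    k = 2**bits  # number of possible ID values
    # -expm1(-x) equals 1 - exp(-x) but stays accurate when x is tiny.
    return -math.expm1(-(n**2) / (2 * k))


if __name__ == "__main__":
    for n in (10_000_000, 100_000_000, 1_000_000_000):
        print(f"{n:>13,} vectors: {collision_probability(n):.3e}")
```

Because the exponent grows with n squared, indexing ten times as many vectors makes a collision roughly a hundred times more likely, which is worth keeping in mind for very large collections.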

Regarding upload_collection: you are absolutely right that this method is meant for the initial creation of a collection. It should be given a list of IDs; otherwise it will use IDs starting from 0 each time.
The error appears because this method expects a matrix of vectors. You can convert your single vector into a one-row matrix as follows: vectorized_data = np.expand_dims(vectorized_data, axis=0)
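To make the shape change concrete, here is a small sketch using a placeholder vector of zeros in place of the real BERT encoding (the zeros are an assumption for illustration only). It shows that the two-value unpacking from the traceback works once the vector has been expanded.

```python
import numpy as np

# Placeholder for the 768-dimensional encoding returned by the encoder.
vectorized_data = np.zeros(768)
print(vectorized_data.shape)  # one-dimensional: a single axis of length 768

# Add a leading axis so the single vector becomes a one-row matrix.
matrix = np.expand_dims(vectorized_data, axis=0)
print(matrix.shape)  # two-dimensional: one row, 768 columns

# The unpacking that previously raised ValueError now succeeds.
num_vectors, dim = matrix.shape
```

An equivalent alternative is vectorized_data.reshape(1, -1), which produces the same one-row matrix.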

1 reply

wammy19 Jul 16, 2021
Author

Thanks for your reply and suggestion! :)
