
[Question] Memory footprint and is the project still maintained #12

Open

nemo83 opened this issue Sep 16, 2022 · 6 comments

@nemo83

nemo83 commented Sep 16, 2022

Hello,

I have been a long-time Spotify Annoy user, and I've recently come across HNSW. I'm a Java guy, so I obviously took a look at the full Java implementation, and then this project too.

I'm surprised by the performance, but even more by the memory footprint of this library, and I'm wondering if someone could validate the numbers I'm seeing.

I've loaded about 2.5 million tensors with 1024 dimensions, and from what I can see in JVisualVM, the memory consumption of an idle Spring Java API (just the index loaded) is about 500 MB. Is that possible? (see picture below)

[Screenshot: JVisualVM memory monitor, 2022-09-16 at 19:33]

Where is the tensor data stored? On disk?

I also have another question: is this project still maintained?

Thanks,
Gio

@hussamaa
Contributor

Hi Gio, hope you're well.

Back when we wrote the binding, we were mostly aiming for the lower query time that hnswlib provides. I personally don't remember figures we could use as a comparison, but I imagine the memory footprint would be similar to using the native library directly. If not, we could optimize that.

I haven't been working with hnswlib for a while, but from what I remember the references are kept in memory. You have the option to write the state of your index to disk if you want (and restore it later).
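
If it helps, it looks roughly like this with the binding (a sketch from memory; the exact class and method names may differ slightly from the current API):

```java
import com.stepstone.search.hnswlib.jna.Index;
import com.stepstone.search.hnswlib.jna.SpaceName;

import java.nio.file.Path;
import java.nio.file.Paths;

public class SaveRestoreSketch {
    public static void main(String[] args) {
        Path indexFile = Paths.get("index.dat");   // hypothetical path

        // Build an index in memory and write its state to disk.
        Index index = new Index(SpaceName.COSINE, 1024);
        index.initialize(1_000);                   // max number of elements
        index.addItem(new float[1024], 0);         // your embedding + its id
        index.save(indexFile);

        // Later (or in another process): restore the state from the file.
        Index restored = new Index(SpaceName.COSINE, 1024);
        restored.load(indexFile, 1_000);
    }
}
```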

Have you tried the pure Java implementation? How were your figures?

I don't think this project is being actively maintained anymore, but it should still work fine. hnswlib has released new updates and improvements which weren't tested with this binding. If you want, I could have a look into that.

Wishing you a great weekend.

Best regards,
Hussama

@nemo83
Author

nemo83 commented Sep 16, 2022

Thanks for the very quick reply,

I have indeed tested the Java implementation, but using double rather than float, and I could not fully build the index with 10 GB of heap.
I'm in the middle of rebuilding the full 2.5 million with float; the current projection is that 250k items take about 1.2 GB, so it should all fit in 12 GB.
This JNA implementation does everything without, apparently, ever exceeding 1 GB. I guess the native C code allocates and deallocates memory as it goes. What I don't understand is: when I dump the index to disk it is about 11 GB, but in memory it is less than 1 GB... where is the rest of it kept?

We need to replace Spotify Annoy, and after some very quick tests I can appreciate the power of HNSW in terms of performance and accuracy (the results are much better).
I need to pick the right HNSW library/framework to replace Annoy, and I was wondering if this project is still maintained, because it seems to be the best Java solution.

If you would be so kind as to test it with the latest hnswlib, that would be amazing, and I would be delighted to pair-program/review the code so that I could start learning and contributing.

I would be using this library in argusnft.com, an AI-powered NFT fake-detection platform. Our goal is to extract embeddings for all the (picture-based) NFTs from all the blockchains and load them into ANN indexes. A pretty huge objective. If you have recommendations or fancy a chat, let me know!

Enjoy the weekend too, and thanks for the super-quick reply.

☮️

@hussamaa
Contributor

hussamaa commented Sep 16, 2022

We also moved away from Annoy back then, because updating the index at runtime was not possible, among other problems.

Hmmmmmmmm, yeah, that sounds suspicious indeed 😛 but I believe if there were an issue we would have spotted it in production already. You're right, the memory allocation/freeing takes place as the native code runs (I can't guarantee there are no memory leaks), and when dumping the index it is most likely generating and writing out the entire state space and parameters.
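
As a rough sanity check on the numbers in this thread, the raw vector payload alone would be around 10 GB (back-of-the-envelope, assuming 32-bit floats), which lines up with your ~11 GB dump; that memory is allocated by the native library outside the JVM heap, which would explain why JVisualVM only shows the small Java-side footprint:

```java
public class MemoryEstimate {
    public static void main(String[] args) {
        long vectors = 2_500_000L;      // items mentioned in this thread
        long dims = 1_024L;             // dimensions per vector
        long bytesPerFloat = 4L;        // hnswlib stores vectors as 32-bit floats
        long payloadBytes = vectors * dims * bytesPerFloat;
        // Raw vectors only, before graph links and other overhead: ~10.2 GB
        System.out.printf("%.1f GB%n", payloadBytes / 1e9);
    }
}
```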

Talking about the Java one, I remember it took a while to build the indexes with large datasets (even in parallel), and the query time wasn't that performant either (in comparison to the native one), which made us try other solutions. Building the index natively (in Python) and restoring it in a few seconds in Java was also a big plus we got from using the binding.

That sounds like a cool project. I'm not a data science / ML hero; our wizard was @alexcarterkarsus. Alex, would you have any recommendations? Would hnswlib be your go-to library for Gio's use case?

--

I had a quick look and, since I left STST, I'm afraid I can't push updates to the library directly. I will fork it, upgrade hnswlib, and open a pull request once it's ready.

Which platform would you be using the library on? Win64? AMD64? ARM64?

Have a nice weekend! Take care!

@nemo83
Author

nemo83 commented Sep 17, 2022

Thanks again for the detailed response, it's really helping.

Nice to meet you @alexcarterkarsus !

I would have another couple of questions:

  1. What was the type of document you guys were indexing? And the tensor dimension?
  2. Is the index file dump interoperable among HNSW libraries (Java/JNA/C++/Python)?
  3. Where were you keeping the tensors for long-term storage? A relational DB? S3? Asking because we have 2.5M tensors now, and possibly close to 50M in just a couple of months, and we were wondering if you had any recommendations.

So many questions! But it's a very exciting space, and this project is giving me so much knowledge! Thank you!

@hussamaa
Contributor

hussamaa commented Sep 17, 2022

1: I'll leave that one to Alex 🧠 😛

2: yeah; they use the same code underneath, so it is indeed interoperable across languages 🙌 (a quick sketch after this list). I'm not so sure whether it is across architectures; I'm afraid it isn't.

3: in our use case, no; the model had to be periodically recreated due to the constant flow of new data (it was prepared separately and stored somewhere in the cloud). For yours, I can see it would be more static (keeping track of existing NFTs and adding new ones), right?
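
On point 2, loading an index that was written by the Python bindings (hnswlib's save_index) looks roughly like this from the Java side (a sketch from memory; the file name and capacity below are made up, and the binding's method names may have changed):

```java
import com.stepstone.search.hnswlib.jna.Index;
import com.stepstone.search.hnswlib.jna.QueryTuple;
import com.stepstone.search.hnswlib.jna.SpaceName;

import java.nio.file.Paths;

public class LoadNativeIndexSketch {
    public static void main(String[] args) {
        // Index previously built and saved by the Python bindings.
        Index index = new Index(SpaceName.COSINE, 1024);
        index.load(Paths.get("nft_embeddings.bin"), 2_500_000);  // hypothetical file / capacity

        // Query it from Java as usual.
        float[] queryEmbedding = new float[1024];                // your query vector
        QueryTuple top10 = index.knnQuery(queryEmbedding, 10);
    }
}
```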

Yeah, there are some challenges to keep in mind: the bigger the index, the higher the insertion and query times, and maybe fine-tuning the hnswlib parameters can help. Handling 7M items was one of our acceptance criteria, and it fit for us.
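
The knobs I mean are hnswlib's M, ef_construction and ef; something along these lines (illustrative values only, and the initialize/setEf signatures are from memory, so double-check them against the current binding):

```java
import com.stepstone.search.hnswlib.jna.Index;
import com.stepstone.search.hnswlib.jna.SpaceName;

public class TuningSketch {
    public static void main(String[] args) {
        Index index = new Index(SpaceName.COSINE, 1024);

        // Higher m / efConstruction usually mean better recall at the cost of
        // memory, build time and query time.
        int maxElements = 50_000_000;   // Gio's projected scale
        int m = 16;                     // graph links per node
        int efConstruction = 200;       // search width while building
        int randomSeed = 100;
        index.initialize(maxElements, m, efConstruction, randomSeed);

        // ef controls the query-time search width (recall vs. latency trade-off).
        index.setEf(64);
    }
}
```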

Not sure if you could organize and split it into different models?

@alexcarterkarsus

alexcarterkarsus commented Oct 11, 2022 via email
