Replies: 2 comments 1 reply
-
This is an interesting development, especially for systems like Qdrant that rely heavily on vector search and large embeddings. If TurboQuant delivers on its claims (3-bit quantization of KV caches with nearly zero accuracy loss), it could significantly reduce memory overhead for ANN search and make in-memory databases much more efficient.

That said, implementing something like this would require careful evaluation. First, the paper and blog focus on transformer models' KV caches, which may not translate directly to the embeddings used in vector search. We'd need to assess whether the same quantization approach applies to static embeddings or whether it's limited to dynamic attention-based scenarios.

From a production standpoint, ultra-low-bit quantization sometimes introduces hardware-specific constraints. For instance, 3-bit values might require custom CUDA kernels or hardware optimizations, since most GPUs are optimized for 8-bit or 16-bit operations. If we were to integrate this into Qdrant, we'd need to explore whether it's compatible with common SIMD optimizations, or with libraries like Faiss that rely on AVX instructions.

It could be worth testing this in a controlled experiment. You could quantize embeddings to 3 bits using their method and benchmark Qdrant's recall and query latency. If this really achieves minimal accuracy loss and significant memory savings, we can then evaluate integrating it as a configurable feature.

Curious if anyone else here has tested TurboQuant in an embedding-heavy workflow yet?
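For anyone who wants a starting point for that kind of experiment, here is a minimal sketch of the measurement loop. To be clear about the assumptions: this is NOT TurboQuant. The quantizer is a naive per-vector uniform 3-bit scalar quantizer used only as a stand-in, the data is random unit vectors rather than real embeddings, and the search is brute-force NumPy rather than Qdrant. The helper names (`quantize_3bit`, `dequantize`, `top_k`, `recall_at_k`) are made up for illustration.

```python
import numpy as np

def quantize_3bit(x):
    """Naive per-vector uniform quantization to 8 levels (3 bits).

    Stand-in only; this is not the TurboQuant algorithm.
    """
    lo = x.min(axis=1, keepdims=True)
    hi = x.max(axis=1, keepdims=True)
    scale = (hi - lo) / 7.0                      # 8 levels -> 7 steps
    scale = np.where(scale == 0, 1.0, scale)     # guard constant vectors
    codes = np.clip(np.round((x - lo) / scale), 0, 7).astype(np.uint8)
    return codes, lo, scale

def dequantize(codes, lo, scale):
    """Map 3-bit codes back to approximate float values."""
    return codes.astype(np.float32) * scale + lo

def top_k(queries, base, k):
    """Brute-force top-k neighbours by dot product."""
    scores = queries @ base.T
    return np.argsort(-scores, axis=1)[:, :k]

def recall_at_k(exact, approx):
    """Fraction of exact neighbours that the approximate search also found."""
    hits = sum(len(set(e) & set(a)) for e, a in zip(exact, approx))
    return hits / exact.size

# Synthetic unit vectors (assumption: real embeddings will behave differently).
rng = np.random.default_rng(0)
dim, k = 64, 10
base = rng.standard_normal((2000, dim)).astype(np.float32)
base /= np.linalg.norm(base, axis=1, keepdims=True)
queries = rng.standard_normal((50, dim)).astype(np.float32)
queries /= np.linalg.norm(queries, axis=1, keepdims=True)

codes, lo, scale = quantize_3bit(base)
approx_base = dequantize(codes, lo, scale)

exact = top_k(queries, base, k)
approx = top_k(queries, approx_base, k)
recall = recall_at_k(exact, approx)

# Ideal payload sizes, ignoring the per-vector lo/scale and any index overhead.
float32_bytes = dim * 4
packed_3bit_bytes = dim * 3 / 8
print(f"recall@{k} with naive 3-bit codes: {recall:.3f}")
print(f"bytes per vector: float32={float32_bytes}, packed 3-bit={packed_3bit_bytes:.0f}")
```

Note that the codes here are stored one per `uint8`, so the sketch does not actually realize the 3-bit footprint; packing them tightly is exactly where the SIMD and kernel concerns above would come in. Swapping in the paper's method and real embeddings, and running the queries through Qdrant instead of NumPy, would turn this into the benchmark described above.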
-
We're actively working on implementing TurboQuant, along with some Qdrant-specific enhancements. You can find a tracking issue here: #8670. More discussion here: #8524.
-
Google Research just published a blog post and paper about a new algorithm that quantizes the KV cache down to under 3 bits with close to zero accuracy loss.
Blog: https://research.google/blog/turboquant-redefining-ai-efficiency-with-extreme-compression/
Paper: https://arxiv.org/pdf/2504.19874
Website: https://turboquant.net
This could be huge if their claims are true, and MLX developers are already jumping on this:
https://x.com/Prince_Canuma/status/2036611007523512397
Thought I'd share the news here to see if Qdrant developers would be interested in adding this feature.
@timvisee @generall