Awesome work! Inspirational! Interested to learn more about:

1. How long did pre-training take, and what infra was used: an 8-GPU P4d/P4de/P5 instance?
2. Any reason to stick with the nvidia/NV-Embed model? Have you instead tried using the same LLaMA model itself for embedding, for example its last-token representation (roughly the pooling sketched below)? Would there be any extra benefit in using the same model for embedding as well as for decoding?
3. Does your approach also generalize to longer text, e.g. ~3k tokens, with respect to the embedding?
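For concreteness, this is the kind of last-token pooling I have in mind (the model name and pooling choice here are just illustrative, not anything from your repo):

```python
# Illustrative sketch only: use a decoder-only LM's final hidden state at the
# last token as a fixed-size text embedding (last-token pooling).
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder; any decoder-only LM
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

def last_token_embedding(text: str) -> torch.Tensor:
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, seq_len, hidden_dim)
    return hidden[0, -1]  # hidden state of the final token
```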
Thanks for your interest! To answer your questions:
Pre-training the projector took only a matter of hours on a 4-GPU machine, since we used a small corpus (wikitext).
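Conceptually, the projector is just a small module that maps the embedding model's output vector into the LLM's input-embedding space. A minimal sketch of that idea (the linear architecture, dimensions, and naming here are simplifications for illustration, not the exact code in our repo):

```python
# Minimal sketch of a projector mapping an embedding-model vector into the
# LLM's input-embedding space. The single linear layer is an illustrative
# assumption; see the repo for the actual module and training script.
import torch
import torch.nn as nn

class Projector(nn.Module):
    def __init__(self, embed_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Linear(embed_dim, llm_dim)

    def forward(self, text_embedding: torch.Tensor) -> torch.Tensor:
        # The projected vector can then be consumed by the LLM like a
        # "soft token" placed in the prompt.
        return self.proj(text_embedding)
```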
For embedding models we experimented with four: NV-Embed (Lee et al., 2024; nvidia/NV-Embed-v1), SFR (Meng et al., 2024; Salesforce/SFR-Embedding-2_R), Stella (Zhang, 2024; dunzhang/stella_en_1.5B_v5), and GTR-T5 (Ni et al., 2021; sentence-transformers/gtr-t5-base). You can find detailed comparisons among them in our paper. The takeaway is that stronger embedding models are more likely to deliver better downstream performance in Vector-ICL (see our analysis in Sec. 6.1).
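If you want to try swapping backbones yourself, the smaller GTR model is the easiest to load, e.g. something like the following (the larger models may additionally need trust_remote_code=True and model-specific query prompts, so check each model card):

```python
# Load one of the embedding backbones and encode some text; swap the model
# name to compare backbones on a downstream Vector-ICL task.
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("sentence-transformers/gtr-t5-base")
vectors = encoder.encode(["an example input to embed"])
print(vectors.shape)  # (1, 768) for gtr-t5-base
```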
I think so, but longer text might be harder to encode while still retaining the original information.