Running an ultra-large model like DeepSeek R1 is a pain - and since there is no good smaller model that shares its tokenizer, traditional speculative decoding won't work.
In this blog post, Universal Assisted Generation (UAG) was introduced, which allows speculative decoding across models with different tokenizers. The implementation is intuitive: a two-way translation of tokens so that each model sees the input it expects and the two can work together. This has already been merged into transformers - PR
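Conceptually, the translation can be sketched roughly like this (a rough sketch only, not the actual transformers internals - the helper names are made up, and the real implementation handles re-encoding boundaries more carefully):

```python
# Conceptual sketch of UAG's two-way token translation (hypothetical helpers,
# not the actual transformers internals). The bridge between the two
# vocabularies is plain text: decode with one tokenizer, re-encode with the other.
def draft_to_target(draft_ids, draft_tok, target_tok):
    # The draft model proposed these tokens; express them in the target
    # vocabulary so the big model can verify them.
    text = draft_tok.decode(draft_ids, skip_special_tokens=True)
    return target_tok.encode(text, add_special_tokens=False)

def target_to_draft(target_ids, target_tok, draft_tok):
    # Accepted tokens go back to the draft vocabulary so the small model
    # can keep drafting from the verified prefix.
    text = target_tok.decode(target_ids, skip_special_tokens=True)
    return draft_tok.encode(text, add_special_tokens=False)
```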
It would be very valuable if we could use the 1.5B distilled R1 to do speculative decoding for the full 671B model - that should greatly improve performance.
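On the transformers side, the call would look something like the sketch below (the Hub IDs are my assumption, and loading the 671B target with a plain `from_pretrained` is obviously not practical on one box - this just illustrates the API shape from the UAG PR):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed Hub IDs for illustration; the 671B target would need a
# multi-GPU / multi-node setup in practice.
target_ckpt = "deepseek-ai/DeepSeek-R1"
draft_ckpt = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"

target_tok = AutoTokenizer.from_pretrained(target_ckpt)
draft_tok = AutoTokenizer.from_pretrained(draft_ckpt)
target_model = AutoModelForCausalLM.from_pretrained(target_ckpt)
draft_model = AutoModelForCausalLM.from_pretrained(draft_ckpt)

inputs = target_tok("Why is the sky blue?", return_tensors="pt")

# Passing an assistant model plus *both* tokenizers is what switches
# assisted generation onto the cross-tokenizer (UAG) path.
outputs = target_model.generate(
    **inputs,
    assistant_model=draft_model,
    tokenizer=target_tok,
    assistant_tokenizer=draft_tok,
    max_new_tokens=256,
)
print(target_tok.decode(outputs[0], skip_special_tokens=True))
```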
I did a search of this repo and it seems this has not been discussed yet - hence this post.