Distributed inference of 70B AWQ model #2531

Merged 2 commits into OpenNMT:master from distribawq on Dec 4, 2023

Conversation

@vince62s (Member) commented on Dec 4, 2023

Now use InferenceEngine for translate.py.
Fix left padding for the target (see the padding sketch below).
Fix AWQ model loading when using GEMM (in_features and out_features are reversed; see the shape sketch below).
Extend the llama-like converter to handle AWQ-quantized models stored as safetensors.
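To illustrate the target-side padding change, here is a minimal sketch of left padding (illustrative only, not the PR's actual code): in decoder-only generation, shorter sequences get pad tokens prepended rather than appended, so the last real token of every row lines up for the next generation step.

```python
# Minimal left-padding sketch (illustrative, not the PR's code).
import torch

def left_pad(sequences, pad_id):
    """Left-pad variable-length token lists to a common length."""
    max_len = max(len(s) for s in sequences)
    return torch.tensor([[pad_id] * (max_len - len(s)) + s for s in sequences])

batch = left_pad([[5, 6, 7], [8, 9]], pad_id=0)
# tensor([[5, 6, 7],
#         [0, 8, 9]])
```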
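On the GEMM shape quirk: AWQ's GEMM kernels store the packed quantized weight as [in_features, out_features // pack_factor], i.e. transposed relative to a regular nn.Linear weight of shape [out_features, in_features]. A hedged sketch of recovering the true linear dimensions when loading a safetensors checkpoint; the file name and tensor key below are placeholders, not OpenNMT-py's actual keys.

```python
# Hedged sketch; the file name and tensor key are placeholders.
from safetensors.torch import load_file

W_BIT = 4
PACK_FACTOR = 32 // W_BIT  # eight 4-bit values packed per int32

tensors = load_file("model.safetensors")
qweight = tensors["model.layers.0.self_attn.q_proj.qweight"]  # int32, packed

# AWQ GEMM layout: [in_features, out_features // PACK_FACTOR], which is
# reversed relative to nn.Linear's [out_features, in_features].
in_features = qweight.shape[0]
out_features = qweight.shape[1] * PACK_FACTOR
```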

Tried this one: https://huggingface.co/TheBloke/Llama-2-70B-Chat-AWQ
Inference runs at 18 tok/sec on 2 GPUs (1x RTX 3090 + 1x RTX 4090).
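For reference, a hypothetical 2-GPU invocation; the distributed option names (world_size, gpu_ranks, parallel_mode) mirror OpenNMT-py's config conventions and are assumptions here, not flags verified against this PR.

```bash
# Hypothetical 2-GPU tensor-parallel run; flag names are assumptions.
python translate.py \
    -model llama2-70b-chat-awq.pt \
    -src input.txt -output pred.txt \
    -world_size 2 -gpu_ranks 0 1 \
    -parallel_mode tensor_parallel
```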

@vince62s vince62s merged commit 1e5ed31 into OpenNMT:master Dec 4, 2023
@vince62s vince62s deleted the distribawq branch December 14, 2023 10:24