KTransformers, pronounced as Quick Transformers, is designed to enhance your 🤗 <a href="https://github.com/huggingface/transformers">Transformers</a> experience with advanced kernel optimizations and placement/parallelism strategies.
<br/><br/>
KTransformers is a flexible, Python-centric framework designed with extensibility at its core.
By implementing and injecting an optimized module with a single line of code, users gain access to a Transformers-compatible
interface, RESTful APIs compliant with OpenAI and Ollama, and even a simplified ChatGPT-like web UI.
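
As a rough sketch of that injection workflow: the helper names (`optimize_and_load_gguf`, `prefill_and_generate`), module paths, and file paths below are assumptions drawn from the project's example scripts and may differ between versions.

```python
# Sketch of the single-line injection flow; module paths and file paths below
# are assumptions and may vary by KTransformers version.
import torch
from transformers import AutoConfig, AutoTokenizer, AutoModelForCausalLM
from ktransformers.optimize.optimize import optimize_and_load_gguf  # assumed module path
from ktransformers.util.utils import prefill_and_generate           # assumed module path

model_path = "deepseek-ai/DeepSeek-V2-Lite-Chat"        # hypothetical model id
gguf_path = "/path/to/DeepSeek-V2-Lite-Chat-GGUF"       # hypothetical GGUF weights dir
optimize_rule_path = "/path/to/DeepSeek-V2-Chat.yaml"   # hypothetical injection rule file

config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Build the model skeleton on the meta device (no weights materialized yet).
with torch.device("meta"):
    model = AutoModelForCausalLM.from_config(config, trust_remote_code=True)

# The single injection line: replace matching modules with optimized kernels
# and load the GGUF weights according to the YAML rule file.
optimize_and_load_gguf(model, optimize_rule_path, gguf_path, config)

# The patched model still exposes a Transformers-compatible interface.
input_ids = tokenizer("Hello, KTransformers!", return_tensors="pt").input_ids
generated = prefill_and_generate(model, tokenizer, input_ids.cuda(), max_new_tokens=128)
```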
<br/><br/>
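
The OpenAI-compatible REST endpoint mentioned above can be exercised with any standard client once a KTransformers server is running; a minimal sketch follows, assuming a local server (the port, route prefix, and model name are placeholders, not guaranteed defaults).

```python
# Minimal sketch of calling the OpenAI-compatible endpoint of a locally running
# KTransformers server. Host, port, and model name are assumptions; check your
# server configuration for the actual values.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:10002/v1", api_key="unused")  # placeholder URL
response = client.chat.completions.create(
    model="DeepSeek-V2-Lite-Chat",  # placeholder model name
    messages=[{"role": "user", "content": "Summarize what KTransformers does."}],
)
print(response.choices[0].message.content)
```
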
Our vision for KTransformers is to serve as a flexible platform for experimenting with innovative LLM inference optimizations. Please let us know if you need any other features.
<h2 id="Updates">🔥 Updates</h2>
* **Feb 10, 2025**: Support Deepseek-R1 and V3 on a single GPU (24GB VRAM) or multiple GPUs with 382GB DRAM, for up to 3~28x speedup. The detailed tutorial is [here](./doc/en/DeepseekR1_V3_tutorial.md).
* **Aug 28, 2024**: Support 1M context with the InternLM2.5-7B-Chat-1M model, using 24GB of VRAM and 150GB of DRAM. The detailed tutorial is [here](./doc/en/long_context_tutorial.md).
* **Aug 28, 2024**: Reduce DeepseekV2's required VRAM from 21GB to 11GB.
* **Aug 15, 2024**: Update the detailed [TUTORIAL](doc/en/injection_tutorial.md) on injection and multi-GPU usage.
* **Aug 14, 2024**: Support llamafile as a linear backend.
* **Aug 12, 2024**: Support multiple GPUs; support new models Mixtral 8\*7B and 8\*22B; support q2k, q3k, and q5k dequantization on GPU.