---
layout: blog_detail
title: "SGLang Joins PyTorch Ecosystem: Efficient LLM Serving Engine"
author: "SGLang Team"
hidden: true
---


{:style="max-width:400px; display: block; margin-left: auto; margin-right: auto"}

We’re thrilled to announce that the SGLang project has been integrated into the PyTorch ecosystem! This integration ensures that SGLang aligns with PyTorch’s standards and practices, providing developers with a reliable and community-supported framework for fast and flexible serving of LLMs.

To view the PyTorch Ecosystem, see the [PyTorch Landscape](https://landscape.pytorch.org/) and learn more about how projects can [join the PyTorch Ecosystem](https://github.com/pytorch-fdn/ecosystem).


## About SGLang

SGLang is a fast serving engine for large language models and vision language models. It makes interaction with models faster and more controllable by co-designing the backend runtime and the frontend language.

The core features include:

* **Fast Backend Runtime**: Provides efficient serving with RadixAttention for prefix caching, a zero-overhead CPU scheduler, continuous batching, token attention (paged attention), speculative decoding, tensor parallelism, chunked prefill, structured outputs, and quantization (FP8/INT4/AWQ/GPTQ).
* **Flexible Frontend Language**: Offers an intuitive interface for programming LLM applications, including chained generation calls, advanced prompting, control flow, multi-modal inputs, parallelism, and external interactions (see the sketch after this list).
* **Extensive Model Support**: Supports a wide range of generative models (Llama, Gemma, Mistral, Qwen, DeepSeek, LLaVA, etc.), embedding models (e5-mistral, gte, mcdse), and reward models (Skywork), with easy extensibility for integrating new models.
* **Active Community**: SGLang is open source and backed by an active community with industry adoption.
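
To give a flavor of the frontend language, here is a minimal sketch based on SGLang’s Python API. The endpoint URL, question strings, and token budgets are illustrative assumptions, not part of this announcement.

```
import sglang as sgl

# An SGLang program: each statement appends to the prompt state `s`,
# and sgl.gen() asks the backend to generate a completion.
@sgl.function
def multi_turn_qa(s, question_1, question_2):
    s += sgl.system("You are a helpful assistant.")
    s += sgl.user(question_1)
    s += sgl.assistant(sgl.gen("answer_1", max_tokens=64))
    s += sgl.user(question_2)
    s += sgl.assistant(sgl.gen("answer_2", max_tokens=64))

# Point the frontend at a running SGLang server (port assumed).
sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))

state = multi_turn_qa.run(
    question_1="What is the capital of France?",
    question_2="List three landmarks there.",
)
print(state["answer_1"])
print(state["answer_2"])
```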

SGLang is best known for its speed: it often significantly outperforms other state-of-the-art frameworks in serving throughput and latency. You can learn more about the underlying techniques from the past release blog posts: [v0.2 blog](https://lmsys.org/blog/2024-07-25-sglang-llama3/), [v0.3 blog](https://lmsys.org/blog/2024-09-04-sglang-v0-3/), [v0.4 blog](https://lmsys.org/blog/2024-12-04-sglang-v0-4/).

SGLang has been widely adopted by leading industry companies and frontier research labs. For example, xAI uses SGLang to serve its flagship model, [Grok 3](https://grok.com/), which is currently the best model according to the Chatbot Arena leaderboard. Microsoft Azure uses SGLang to serve [DeepSeek R1](https://techcommunity.microsoft.com/blog/azurehighperformancecomputingblog/running-deepseek-r1-on-a-single-ndv5-mi300x-vm/4372726) on AMD GPUs, which is currently the best open source model.


## Serving DeepSeek Models

You can easily launch a Docker container to serve a DeepSeek model with the following command:

```
# Pull the latest image
docker pull lmsysorg/sglang:latest

# Launch a server
docker run --gpus all --shm-size 32g -p 30000:30000 -v ~/.cache/huggingface:/root/.cache/huggingface --ipc=host --network=host --privileged lmsysorg/sglang:latest \
    python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-V3 --tp 8 --trust-remote-code --port 30000
```
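
Before sending requests, you can confirm the server is ready. A minimal sketch, assuming the server exposes SGLang’s default health and model-info HTTP endpoints on port 30000; adjust if your deployment differs:

```
import requests

# Returns HTTP 200 once the server is up and able to serve.
print(requests.get("http://localhost:30000/health").status_code)

# Reports which model the server is serving.
print(requests.get("http://localhost:30000/get_model_info").json())
```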

Then you can query the server with the OpenAI-compatible API:

```
import openai

client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="None")

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3",
    messages=[
        {"role": "user", "content": "List 3 countries and their capitals."},
    ],
    temperature=0,
    max_tokens=64,
)

# Print the generated reply.
print(response.choices[0].message.content)
```
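
If you want tokens as they are generated, the same endpoint supports streaming through the standard OpenAI client interface. A minimal sketch reusing the `client` from above (the `stream=True` pattern follows the OpenAI Python client, not an SGLang-specific API):

```
stream = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3",
    messages=[{"role": "user", "content": "List 3 countries and their capitals."}],
    temperature=0,
    max_tokens=64,
    stream=True,  # receive incremental chunks instead of one response
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta is not None:
        print(delta, end="", flush=True)
```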

The server launch command above works for 8xH200. You can find detailed instructions for other hardware (MI300X, H100, A100, H20, L40S) at [https://docs.sglang.ai/references/deepseek.html](https://docs.sglang.ai/references/deepseek.html).

SGLang integrates DeepSeek-specific optimizations, such as MLA throughput optimizations, MLA-optimized kernels, data-parallel attention, multi-token prediction, and DeepGEMM, making it the top choice for serving DeepSeek models by dozens of [companies](https://x.com/lmsysorg/status/1887262321636221412), including AMD, NVIDIA, and many cloud providers. The team is actively working on integrating more optimizations following the 2025 H1 roadmap below.


## Serving Llama Models

Similarly, you can launch the server for a Llama 3.1 text model with:

```
python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct
```

Or a Llama 3.2 multimodal model with:

```
python3 -m sglang.launch_server --model-path meta-llama/Llama-3.2-11B-Vision-Instruct --chat-template=llama_3_vision
```
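
Once the multimodal server is running, you can send images through the same OpenAI-compatible API. A minimal sketch, assuming the default port and the standard `image_url` content format; the image URL is a placeholder:

```
import openai

client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="None")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.2-11B-Vision-Instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image in one sentence."},
                {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}},
            ],
        }
    ],
    max_tokens=64,
)
print(response.choices[0].message.content)
```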


## Roadmap

This year, the SGLang team will continue to push the boundaries of system efficiency. You can find the 2025 H1 roadmap [here](https://github.com/sgl-project/sglang/issues/4042). Its focus areas are:

- Throughput-oriented large-scale deployment similar to the DeepSeek inference system
- Long context optimizations
- Low latency speculative decoding
- Reinforcement learning training framework integration
- Kernel optimizations

## Community

SGLang has been deployed in large-scale production, generating trillions of tokens every day. It has an active community with over three hundred contributors on GitHub. It is supported by the following institutions: AMD, Atlas Cloud, Baseten, Cursor, DataCrunch, Etched, Hyperbolic, iFlytek, Jam & Tea Studios, LinkedIn, LMSYS, Meituan, Nebius, Novita AI, NVIDIA, RunPod, Stanford, UC Berkeley, UCLA, xAI, and 01.AI.


{:style="width:100%;"}



## Conclusion

We’re excited to welcome SGLang to the PyTorch ecosystem. SGLang accelerates the serving of large language and vision language models. It’s widely adopted by industry, powering large-scale online serving of frontier models like Grok and DeepSeek.

We invite you to explore the [SGLang GitHub repo](https://github.com/sgl-project/sglang/tree/main), join the [community on Slack](https://slack.sglang.ai/), and reach out to [[email protected]](mailto:[email protected]) for inquiries or collaboration opportunities. Together, we can make powerful AI models accessible to everyone.