Multi-Node Distributed Inference Support #406
Pull Request Overview
This PR adds comprehensive multi-node distributed inference support for VLLM and SGLang servers, enabling inference workloads to scale across multiple nodes in SLURM and Ray environments. The changes include significant refactoring of server lifecycle management with auto-restart capabilities, distributed coordination infrastructure, and executor enhancements for multi-node execution.
Key Changes:
- Distributed inference infrastructure with Ray cluster management and SLURM support for multi-node VLLM/SGLang deployments
- Server lifecycle refactoring with automatic restart, health monitoring, and improved process management
- Distributed coordination server for master/worker health checks and node synchronization
- Node-aware logging with rank prefixes and distributed environment utilities
Reviewed Changes
Copilot reviewed 18 out of 20 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| tests/pipeline/test_inference.py | Added lifecycle management and auto-restart tests for inference servers |
| tests/executor/test_ray.py | Added comprehensive Ray executor tests for placement groups, task resubmission, and cleanup |
| src/datatrove/utils/logging.py | Added node_rank parameter for distributed logging with [NODE X] prefixes |
| src/datatrove/pipeline/inference/types.py | New file defining InferenceResult, InferenceError, ServerError, and type aliases |
| src/datatrove/pipeline/inference/distributed/utils.py | New utilities for distributed environment detection and node coordination |
| src/datatrove/pipeline/inference/distributed/ray.py | New Ray cluster initialization, worker management, and health monitoring |
| src/datatrove/pipeline/inference/distributed/coordination_server.py | New HTTP coordination server for master/worker health checks |
| src/datatrove/pipeline/inference/servers/base.py | Major refactoring with auto-restart, health monitoring, and improved lifecycle management |
| src/datatrove/pipeline/inference/servers/vllm_server.py | Added multi-node support with Ray integration and distributed health monitoring |
| src/datatrove/pipeline/inference/servers/sglang_server.py | Added multi-node support with distributed parameters |
| src/datatrove/pipeline/inference/servers/endpoint_server.py | Updated to support new server lifecycle methods |
| src/datatrove/pipeline/inference/servers/dummy_server.py | Updated for testing with new lifecycle methods |
| src/datatrove/pipeline/inference/run_inference.py | Integrated distributed coordination and updated for new server API |
| src/datatrove/executor/base.py | Added node_rank parameter and master-only job completion tracking |
| src/datatrove/executor/slurm.py | Added gpus_per_task and nodes_per_task parameters for multi-node support |
| src/datatrove/executor/ray.py | Major refactoring with placement groups, task manager, and multi-node execution |
| pyproject.toml | Updated Ray dependency to ray[default] for full distributed features |
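As a mental model for the master/worker health-check flow handled by the coordination server in the table above, a toy version could look like the following. The endpoints, payloads, and class names are assumptions for illustration only; the PR's `DistributedCoordinationServer` will differ in detail:

```python
import json
import threading
import time
from http.server import BaseHTTPRequestHandler, HTTPServer


class CoordinationHandler(BaseHTTPRequestHandler):
    """Toy coordination server: workers POST heartbeats, the master GETs status.

    Illustrative sketch only -- not the PR's actual protocol.
    """

    heartbeats: dict[str, float] = {}  # node_rank -> last heartbeat time

    def do_POST(self):
        if self.path == "/heartbeat":
            length = int(self.headers["Content-Length"])
            payload = json.loads(self.rfile.read(length))
            self.heartbeats[str(payload["node_rank"])] = time.time()
            self.send_response(200)
            self.end_headers()

    def do_GET(self):
        if self.path == "/status":
            body = json.dumps(self.heartbeats).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(body)

    def log_message(self, *args):  # silence per-request logging
        pass


def start_coordination_server(port: int = 0) -> HTTPServer:
    """Start the toy server on a background thread; port=0 picks a free port."""
    server = HTTPServer(("127.0.0.1", port), CoordinationHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```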
(Kinda related: does this mean we could run distributed inference in HF Jobs if we set up Ray?)
Add Multi-Node Distributed Inference Support
This PR adds support for running inference across multiple nodes in distributed environments (SLURM and Ray).
Key Changes
Distributed Inference
- `DistributedCoordinationServer` for master/worker health checks
- New utilities (`distributed/utils.py`) for node detection and coordination

Server Lifecycle Refactoring
- New lifecycle methods: `start_server_task()`, `monitor_health()`, and `server_cleanup()`

Executor Enhancements
- New `gpus_per_task` and `nodes_per_task` parameters

Logging Improvements
- Node rank prefixes (`[NODE X]`) for multi-node debugging

Dependencies
- `ray[default]` for full distributed features

Breaking Changes
- `rank` parameter in `__init__()`
- `wait_until_ready()` renamed to `_wait_until_ready()` (internal method)

Tests
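Based on the breaking changes above, a downstream server subclass would migrate roughly as follows. Only the method names (`rank` in `__init__()`, `start_server_task()`, `monitor_health()`, `server_cleanup()`, `_wait_until_ready()`) come from this PR's description; the class shape and bodies below are placeholder assumptions:

```python
import asyncio


class MyInferenceServer:
    """Sketch of the post-refactor server surface described in this PR.

    Bodies are placeholders; only the method names are taken from the
    PR description.
    """

    def __init__(self, rank: int):
        # Breaking change: servers now receive their node rank at construction.
        self.rank = rank
        self._ready = asyncio.Event()

    async def start_server_task(self):
        # New lifecycle entry point: launch the server process, then block
        # until it is reachable.
        self._ready.set()  # placeholder for a real startup
        await self._wait_until_ready()

    async def _wait_until_ready(self):
        # Breaking change: wait_until_ready() is now internal.
        await self._ready.wait()

    async def monitor_health(self):
        # New: periodically polled by the lifecycle manager; a False result
        # would trigger the auto-restart path.
        return self._ready.is_set()

    async def server_cleanup(self):
        # New: release the server's resources on shutdown or restart.
        self._ready.clear()
```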