
Conversation

@hynky1999
Contributor

Add Multi-Node Distributed Inference Support

This PR adds support for running inference across multiple nodes in distributed environments (SLURM and Ray).

Key Changes

Distributed Inference

  • Multi-node support for VLLM and SGLang servers using SLURM and Ray
  • New DistributedCoordinationServer for master/worker health checks
  • Distributed environment utilities (distributed/utils.py) for node detection and coordination
  • Ray cluster management for multi-node deployments (through placement groups)
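The node detection side of this can be pictured with a minimal sketch built on SLURM's standard environment variables; the function names below are illustrative, not the actual `distributed/utils.py` API:

```python
# Illustrative sketch of SLURM-based node detection (hypothetical names,
# not datatrove's real utilities). SLURM_NODEID and SLURM_JOB_NUM_NODES
# are the standard SLURM environment variables for rank and node count.
import os


def get_node_rank() -> int:
    """Return this node's rank, defaulting to 0 outside SLURM."""
    return int(os.environ.get("SLURM_NODEID", "0"))


def get_num_nodes() -> int:
    """Return the total node count, defaulting to 1 outside SLURM."""
    return int(os.environ.get("SLURM_JOB_NUM_NODES", "1"))


def is_master_node() -> bool:
    """The master (rank 0) node starts the server; workers join it."""
    return get_node_rank() == 0
```

Outside a SLURM allocation the defaults make everything look like a single-node, single-rank run, so the same code path works locally.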

Server Lifecycle Refactoring

  • Split server management into start_server_task(), monitor_health(), and server_cleanup()
  • Improved error detection and handling in server logs
  • Better process cleanup and resource management
  • Server-specific logging to separate files
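The shape of that split can be sketched as follows; this is a toy model of the start/monitor/cleanup loop, not the actual `servers/base.py` implementation:

```python
# Toy sketch (hypothetical, not datatrove's code) of splitting a server's
# lifecycle into start / health-monitor / cleanup phases so a failed health
# check triggers cleanup and an automatic restart.
import asyncio


class ToyServer:
    def __init__(self, rank: int, fail_first: int = 0, max_restarts: int = 3):
        self.rank = rank
        self.fail_first = fail_first  # simulate this many failed startups
        self.max_restarts = max_restarts
        self.starts = 0
        self.healthy = False

    async def start_server_task(self) -> None:
        # The real code would spawn the inference server process here.
        self.starts += 1
        self.healthy = self.starts > self.fail_first

    async def monitor_health(self) -> bool:
        # The real code would poll a health endpoint and scan server logs.
        return self.healthy

    async def server_cleanup(self) -> None:
        # The real code would terminate the process tree and free resources.
        self.healthy = False

    async def run_with_restarts(self) -> int:
        """Start, check health, restart on failure; return the restart count."""
        restarts = 0
        while True:
            await self.start_server_task()
            if await self.monitor_health():
                return restarts
            await self.server_cleanup()
            restarts += 1
            if restarts > self.max_restarts:
                raise RuntimeError("server failed to become healthy")
```

Separating the three phases keeps the restart policy in one place instead of tangling process spawning, health checks, and teardown together.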

Executor Enhancements

  • The SLURM and Ray executors now support gpus_per_task and nodes_per_task parameters
  • Node rank tracking for distributed logging
  • Job completion tracking only on master node (node 0)
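The master-only completion tracking amounts to a rank guard, sketched here with a hypothetical helper (not the actual executor code): without it, an N-node task would record its completion N times.

```python
# Hypothetical sketch of master-only completion tracking: only node 0
# writes the completion marker; worker nodes are no-ops.

def mark_task_completed(node_rank: int, completed: set, task_id: int) -> bool:
    """Record completion only on the master node (rank 0)."""
    if node_rank != 0:
        return False  # workers never touch the completion set
    completed.add(task_id)
    return True
```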

Logging Improvements

  • Node rank prefixes in logs ([NODE X]) for multi-node debugging
  • Custom format strings for distributed environments
  • Task-specific log files preserved across nodes
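The prefix idea can be shown with the standard-library logger; the actual PR wires a node_rank parameter into datatrove's own logging setup, so the names below are illustrative:

```python
# Sketch of the [NODE X] log prefix using stdlib logging (hypothetical
# helpers; datatrove's real setup lives in utils/logging.py).
import logging


def node_prefix_format(node_rank: int) -> str:
    """Format string that tags every record with this node's rank."""
    return f"[NODE {node_rank}] %(levelname)s: %(message)s"


def make_node_logger(node_rank: int) -> logging.Logger:
    """Logger whose output carries the node rank for multi-node debugging."""
    logger = logging.getLogger(f"datatrove.node{node_rank}")
    handler = logging.StreamHandler()
    handler.setFormatter(logging.Formatter(node_prefix_format(node_rank)))
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger
```

With interleaved output from several nodes landing in shared log storage, the rank prefix is what lets you attribute a line to a machine.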

Dependencies

  • Updated Ray dependency to ray[default] for full distributed features

Breaking Changes

  • Inference servers now require rank parameter in __init__()
  • wait_until_ready() renamed to _wait_until_ready() (internal method)
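A before/after sketch of the migration, with illustrative class names (the real base class lives in servers/base.py): subclasses must now accept and forward rank, and user code should stop calling the readiness wait directly.

```python
# Hypothetical migration sketch for the breaking change: `rank` is now a
# required __init__ parameter, and wait_until_ready() became the internal
# _wait_until_ready() (class/attribute names here are illustrative).

class InferenceServer:
    def __init__(self, model: str, rank: int):
        self.model = model
        self.rank = rank  # node rank, required as of this PR

    async def _wait_until_ready(self) -> None:
        # Formerly the public wait_until_ready(); now only the lifecycle
        # machinery calls it, not user code.
        pass


class MyServer(InferenceServer):
    def __init__(self, model: str, rank: int, port: int = 8000):
        super().__init__(model, rank=rank)  # forward the new required rank
        self.port = port
```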

Tests

  • Added tests for auto-restart and lifecycle management of inference servers
  • Added tests for lifecycle management of the Ray executor

Copilot AI left a comment

Pull Request Overview

This PR adds comprehensive multi-node distributed inference support for VLLM and SGLang servers, enabling inference workloads to scale across multiple nodes in SLURM and Ray environments. The changes include significant refactoring of server lifecycle management with auto-restart capabilities, distributed coordination infrastructure, and executor enhancements for multi-node execution.

Key Changes:

  • Distributed inference infrastructure with Ray cluster management and SLURM support for multi-node VLLM/SGLang deployments
  • Server lifecycle refactoring with automatic restart, health monitoring, and improved process management
  • Distributed coordination server for master/worker health checks and node synchronization
  • Node-aware logging with rank prefixes and distributed environment utilities

Reviewed Changes

Copilot reviewed 18 out of 20 changed files in this pull request and generated 8 comments.

| File | Description |
| --- | --- |
| tests/pipeline/test_inference.py | Added lifecycle management and auto-restart tests for inference servers |
| tests/executor/test_ray.py | Added comprehensive Ray executor tests for placement groups, task resubmission, and cleanup |
| src/datatrove/utils/logging.py | Added node_rank parameter for distributed logging with [NODE X] prefixes |
| src/datatrove/pipeline/inference/types.py | New file defining InferenceResult, InferenceError, ServerError, and type aliases |
| src/datatrove/pipeline/inference/distributed/utils.py | New utilities for distributed environment detection and node coordination |
| src/datatrove/pipeline/inference/distributed/ray.py | New Ray cluster initialization, worker management, and health monitoring |
| src/datatrove/pipeline/inference/distributed/coordination_server.py | New HTTP coordination server for master/worker health checks |
| src/datatrove/pipeline/inference/servers/base.py | Major refactoring with auto-restart, health monitoring, and improved lifecycle management |
| src/datatrove/pipeline/inference/servers/vllm_server.py | Added multi-node support with Ray integration and distributed health monitoring |
| src/datatrove/pipeline/inference/servers/sglang_server.py | Added multi-node support with distributed parameters |
| src/datatrove/pipeline/inference/servers/endpoint_server.py | Updated to support new server lifecycle methods |
| src/datatrove/pipeline/inference/servers/dummy_server.py | Updated for testing with new lifecycle methods |
| src/datatrove/pipeline/inference/run_inference.py | Integrated distributed coordination and updated for new server API |
| src/datatrove/executor/base.py | Added node_rank parameter and master-only job completion tracking |
| src/datatrove/executor/slurm.py | Added gpus_per_task and nodes_per_task parameters for multi-node support |
| src/datatrove/executor/ray.py | Major refactoring with placement groups, task manager, and multi-node execution |
| pyproject.toml | Updated Ray dependency to ray[default] for full distributed features |


@lhoestq
Member

lhoestq commented Nov 25, 2025

(kinda related, does this mean we could run distributed inference in HF Jobs if we set up Ray?)

hynky1999 merged commit 8bbda9c into main Nov 25, 2025
5 checks passed
