
Conversation

@hynky1999
Contributor

Add Multi-Node Distributed Inference Support

This PR adds support for running inference across multiple nodes in distributed environments (SLURM and Ray).

Key Changes

Distributed Inference

  • Multi-node support for VLLM and SGLang servers using SLURM and Ray
  • New DistributedCoordinationServer for master/worker health checks
  • Distributed environment utilities (distributed/utils.py) for node detection and coordination
  • Ray cluster management for multi-node deployments (through placement groups)
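The node detection side of this can be pictured with a minimal sketch built on SLURM's standard environment variables; the function names below are illustrative, not the actual `distributed/utils.py` API:

```python
# Illustrative sketch of SLURM-based node detection (hypothetical names,
# not datatrove's real utilities). SLURM_NODEID and SLURM_JOB_NUM_NODES
# are the standard SLURM environment variables for rank and node count.
import os


def get_node_rank() -> int:
    """Return this node's rank, defaulting to 0 outside SLURM."""
    return int(os.environ.get("SLURM_NODEID", "0"))


def get_num_nodes() -> int:
    """Return the total node count, defaulting to 1 outside SLURM."""
    return int(os.environ.get("SLURM_JOB_NUM_NODES", "1"))


def is_master_node() -> bool:
    """The master (rank 0) node starts the server; workers join it."""
    return get_node_rank() == 0
```

Outside a SLURM allocation the defaults make everything look like a single-node, single-rank run, so the same code path works locally.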

Server Lifecycle Refactoring

  • Split server management into start_server_task(), monitor_health(), and server_cleanup()
  • Improved error detection and handling in server logs
  • Better process cleanup and resource management
  • Server-specific logging to separate files
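The shape of that split can be sketched as follows; this is a toy model of the start/monitor/cleanup loop, not the actual `servers/base.py` implementation:

```python
# Toy sketch (hypothetical, not datatrove's code) of splitting a server's
# lifecycle into start / health-monitor / cleanup phases so a failed health
# check triggers cleanup and an automatic restart.
import asyncio


class ToyServer:
    def __init__(self, rank: int, fail_first: int = 0, max_restarts: int = 3):
        self.rank = rank
        self.fail_first = fail_first  # simulate this many failed startups
        self.max_restarts = max_restarts
        self.starts = 0
        self.healthy = False

    async def start_server_task(self) -> None:
        # The real code would spawn the inference server process here.
        self.starts += 1
        self.healthy = self.starts > self.fail_first

    async def monitor_health(self) -> bool:
        # The real code would poll a health endpoint and scan server logs.
        return self.healthy

    async def server_cleanup(self) -> None:
        # The real code would terminate the process tree and free resources.
        self.healthy = False

    async def run_with_restarts(self) -> int:
        """Start, check health, restart on failure; return the restart count."""
        restarts = 0
        while True:
            await self.start_server_task()
            if await self.monitor_health():
                return restarts
            await self.server_cleanup()
            restarts += 1
            if restarts > self.max_restarts:
                raise RuntimeError("server failed to become healthy")
```

Separating the three phases keeps the restart policy in one place instead of tangling process spawning, health checks, and teardown together.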

Executor Enhancements

  • The SLURM and Ray executors now support gpus_per_task and nodes_per_task parameters
  • Node rank tracking for distributed logging
  • Job completion tracking only on master node (node 0)
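The master-only completion tracking amounts to a rank guard, sketched here with a hypothetical helper (not the actual executor code): without it, an N-node task would record its completion N times.

```python
# Hypothetical sketch of master-only completion tracking: only node 0
# writes the completion marker; worker nodes are no-ops.

def mark_task_completed(node_rank: int, completed: set, task_id: int) -> bool:
    """Record completion only on the master node (rank 0)."""
    if node_rank != 0:
        return False  # workers never touch the completion set
    completed.add(task_id)
    return True
```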

Logging Improvements

  • Node rank prefixes in logs ([NODE X]) for multi-node debugging
  • Custom format strings for distributed environments
  • Task-specific log files preserved across nodes
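The prefix idea can be shown with the standard-library logger; the actual PR wires a node_rank parameter into datatrove's own logging setup, so the names below are illustrative:

```python
# Sketch of the [NODE X] log prefix using stdlib logging (hypothetical
# helpers; datatrove's real setup lives in utils/logging.py).
import logging


def node_prefix_format(node_rank: int) -> str:
    """Format string that tags every record with this node's rank."""
    return f"[NODE {node_rank}] %(levelname)s: %(message)s"


def make_node_logger(node_rank: int) -> logging.Logger:
    """Logger whose output carries the node rank for multi-node debugging."""
    logger = logging.getLogger(f"datatrove.node{node_rank}")
    handler = logging.StreamHandler()
    handler.setFormatter(logging.Formatter(node_prefix_format(node_rank)))
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger
```

With interleaved output from several nodes landing in shared log storage, the rank prefix is what lets you attribute a line to a machine.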

Dependencies

  • Updated Ray dependency to ray[default] for full distributed features

Breaking Changes

  • Inference servers now require rank parameter in __init__()
  • wait_until_ready() renamed to _wait_until_ready() (internal method)
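A before/after sketch of the migration, with illustrative class names (the real base class lives in servers/base.py): subclasses must now accept and forward rank, and user code should stop calling the readiness wait directly.

```python
# Hypothetical migration sketch for the breaking change: `rank` is now a
# required __init__ parameter, and wait_until_ready() became the internal
# _wait_until_ready() (class/attribute names here are illustrative).

class InferenceServer:
    def __init__(self, model: str, rank: int):
        self.model = model
        self.rank = rank  # node rank, required as of this PR

    async def _wait_until_ready(self) -> None:
        # Formerly the public wait_until_ready(); now only the lifecycle
        # machinery calls it, not user code.
        pass


class MyServer(InferenceServer):
    def __init__(self, model: str, rank: int, port: int = 8000):
        super().__init__(model, rank=rank)  # forward the new required rank
        self.port = port
```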

Tests

  • Added tests for auto-restart and lifecycle management of inference servers
  • Added tests for lifecycle management of the Ray executor

Copilot AI left a comment

Pull Request Overview

This PR adds comprehensive multi-node distributed inference support for VLLM and SGLang servers, enabling inference workloads to scale across multiple nodes in SLURM and Ray environments. The changes include significant refactoring of server lifecycle management with auto-restart capabilities, distributed coordination infrastructure, and executor enhancements for multi-node execution.

Key Changes:

  • Distributed inference infrastructure with Ray cluster management and SLURM support for multi-node VLLM/SGLang deployments
  • Server lifecycle refactoring with automatic restart, health monitoring, and improved process management
  • Distributed coordination server for master/worker health checks and node synchronization
  • Node-aware logging with rank prefixes and distributed environment utilities

Reviewed Changes

Copilot reviewed 18 out of 20 changed files in this pull request and generated 8 comments.

| File | Description |
| --- | --- |
| tests/pipeline/test_inference.py | Added lifecycle management and auto-restart tests for inference servers |
| tests/executor/test_ray.py | Added comprehensive Ray executor tests for placement groups, task resubmission, and cleanup |
| src/datatrove/utils/logging.py | Added node_rank parameter for distributed logging with [NODE X] prefixes |
| src/datatrove/pipeline/inference/types.py | New file defining InferenceResult, InferenceError, ServerError, and type aliases |
| src/datatrove/pipeline/inference/distributed/utils.py | New utilities for distributed environment detection and node coordination |
| src/datatrove/pipeline/inference/distributed/ray.py | New Ray cluster initialization, worker management, and health monitoring |
| src/datatrove/pipeline/inference/distributed/coordination_server.py | New HTTP coordination server for master/worker health checks |
| src/datatrove/pipeline/inference/servers/base.py | Major refactoring with auto-restart, health monitoring, and improved lifecycle management |
| src/datatrove/pipeline/inference/servers/vllm_server.py | Added multi-node support with Ray integration and distributed health monitoring |
| src/datatrove/pipeline/inference/servers/sglang_server.py | Added multi-node support with distributed parameters |
| src/datatrove/pipeline/inference/servers/endpoint_server.py | Updated to support new server lifecycle methods |
| src/datatrove/pipeline/inference/servers/dummy_server.py | Updated for testing with new lifecycle methods |
| src/datatrove/pipeline/inference/run_inference.py | Integrated distributed coordination and updated for new server API |
| src/datatrove/executor/base.py | Added node_rank parameter and master-only job completion tracking |
| src/datatrove/executor/slurm.py | Added gpus_per_task and nodes_per_task parameters for multi-node support |
| src/datatrove/executor/ray.py | Major refactoring with placement groups, task manager, and multi-node execution |
| pyproject.toml | Updated Ray dependency to ray[default] for full distributed features |


@lhoestq
Member

lhoestq commented Nov 25, 2025

(kinda related, does this mean we could run distributed inference in HF Jobs if we set up Ray?)

hynky1999 merged commit 8bbda9c into main Nov 25, 2025
5 checks passed
