fix(balance_serve): bind scheduler RPC to loopback to close pre-auth …#2043
AAtomical wants to merge 1 commit into
…pickle RCE
The balance_serve scheduler RPC (sched_rpc.py) binds its ZMQ ROUTER socket
to tcp://*:{sched_port} and deserializes every received frame with
pickle.loads. With no authentication, allowlist, or format validation, any
peer that can reach the port can execute arbitrary code under the server
process identity by sending a crafted pickle payload (GitHub issue kvcache-ai#2042,
Finding 1).
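To illustrate why an unauthenticated pickle.loads sink amounts to code execution, here is a minimal and deliberately benign sketch. The Marker class is hypothetical and not part of sched_rpc.py; it only shows that unpickling calls a callable chosen by the sender, inside the receiving process.

```python
import os
import pickle


class Marker:
    """Hypothetical payload type used only for this illustration."""

    def __reduce__(self):
        # pickle stores (callable, args); the loader calls callable(*args).
        # os.getpid is harmless, but the sender is free to name any callable.
        return (os.getpid, ())


frame = pickle.dumps(Marker())   # the bytes a peer would put on the wire
result = pickle.loads(frame)     # the receiver runs os.getpid() while loading

# The loaded value is the callable's return value, not a Marker instance.
print(result == os.getpid())
```

The receiver never has to define Marker: the serialized frame names only the callable and its arguments, which is why restricting who can reach the socket matters.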
The scheduler RPC is local-only by design: it transports CUDA IPC tensor
handles produced by mp.reductions.reduce_tensor (valid only on the same
host), and SchedulerClient always connects to localhost. Binding to the
loopback interface therefore removes the network attack surface without
changing the wire protocol, eliminating the pre-auth remote RCE.
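The effect of a loopback bind can be sketched with the standard library socket module, used here instead of ZMQ so the example has no dependencies. The helper name bind_listener is hypothetical, and port 0 simply asks the OS for any free port.

```python
import socket


def bind_listener(host):
    """Bind a TCP listener on an OS-chosen port and return the socket."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind((host, 0))
    sock.listen(1)
    return sock


# Bound to the IPv4 loopback address: only same-host peers can connect.
loopback = bind_listener("127.0.0.1")
bound_host, bound_port = loopback.getsockname()

# A local client still connects, so same-host RPC keeps working.
client = socket.create_connection(("127.0.0.1", bound_port), timeout=2)
client.close()
loopback.close()
```

Passing "0.0.0.0" to the same helper would listen on every IPv4 interface, which is the plain-socket counterpart of the tcp://* endpoint this PR replaces.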
Code Review
This pull request improves security by binding the scheduler RPC server to the loopback interface (127.0.0.1) instead of all interfaces, mitigating potential remote code execution risks. The reviewer correctly noted that this change might cause connection issues if the client attempts to connect via 'localhost' (which may resolve to IPv6 '::1') and suggested updating the client to use '127.0.0.1' as well. Additionally, the reviewer identified another instance of this file in a different directory that requires the same security fix.
# host, and SchedulerClient always connects to localhost, so this RPC is
# local-only by design. Bind to the loopback interface instead of all
# interfaces (tcp://*) so the pickle sink is never exposed to the network.
self.frontend.bind(f"tcp://127.0.0.1:{main_args.sched_port}")
Binding the server to 127.0.0.1 restricts it to the IPv4 loopback interface. However, SchedulerClient (on line 169) connects to tcp://localhost:{sched_port}. On systems where localhost resolves to the IPv6 loopback address (::1) first, the client will fail to connect to the server. To ensure reliable local connectivity, please also update SchedulerClient to connect to 127.0.0.1 instead of localhost in this file. Additionally, please note that there is another identical scheduler RPC file at archive/kt-sft/ktransformers/server/balance_serve/sched_rpc.py which still binds to * and should be updated similarly to prevent the same security vulnerability.
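One way to check the resolution concern on a given machine is to ask the OS resolver directly. This is a diagnostic sketch only; the helper name resolve_addresses is hypothetical, and whether a ZMQ client actually uses an IPv6 result also depends on how its socket is configured.

```python
import socket


def resolve_addresses(name, port=0):
    """Return the distinct addresses the OS resolver gives for name, in order.

    Returns an empty list if the name cannot be resolved.
    """
    try:
        infos = socket.getaddrinfo(name, port, type=socket.SOCK_STREAM)
    except socket.gaierror:
        return []
    seen = []
    for _family, _type, _proto, _canon, sockaddr in infos:
        address = sockaddr[0]
        if address not in seen:
            seen.append(address)
    return seen


# May contain "::1", "127.0.0.1", or both, depending on the host's configuration.
addresses = resolve_addresses("localhost")

# A literal IPv4 address involves no name lookup, so the result is unambiguous.
literal = resolve_addresses("127.0.0.1")
print(addresses, literal)
```

Using the literal 127.0.0.1 on both the bind and connect side removes the dependence on resolver ordering altogether.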