Support rollout openai/claude and service wrapper by YanhuiDua · Pull Request #1616 · InternLM/xtuner

YanhuiDua · 2026-03-23T09:17:36Z

No description provided.

YanhuiDua · 2026-03-23T09:24:05Z

@claude review

xtuner/v1/rl/rollout/controller.py

xtuner/v1/rl/rollout/openai_chat.py

xtuner/v1/rl/rollout/api_server.py

claude · 2026-03-23T09:45:46Z

Claude: Summary

This PR adds OpenAI and Claude chat API wrappers around the rollout controller, plus a FastAPI-based API server exposing /generate, /v1/chat/completions, /v1/messages, and management endpoints. Good test coverage with both unit and integration tests.

Issues

Critical

api_server.py:136-141 — /generate endpoint returns JSONResponse on auth failure but declares RolloutState return type, breaking the type contract and causing inconsistent response schemas for clients.

Warning

openai_chat.py:35-36 — Redundant ensure_request_id (public) and _ensure_request_id (private) static methods that both delegate to the same module-level function.
openai_chat.py:155 — _ensure_request_id called twice on the same rollout_state (once in _build_rollout_state, once in chat()).
api_server.py:150 — Raw exception message leaked to client in error response; could expose internal details.
worker.py — Default api_host="0.0.0.0" binds to all interfaces; consider 127.0.0.1 if only local access is needed, or document the rationale.
controller.py — Unbounded port-scanning loop in _start_api_server; consider a max retry count.
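
For the raw-exception warning on api_server.py:150, one common pattern is to log the full exception on the server and return only a generic message plus a correlation id. The sketch below is framework-agnostic; `build_safe_error_payload` and the payload shape are hypothetical and not taken from the PR.

```python
import logging
import uuid
from typing import Optional

logger = logging.getLogger("rollout_api_sketch")


def build_safe_error_payload(exc: Exception, request_id: Optional[str] = None) -> dict:
    """Log the full exception server-side and return a generic body for the client."""
    request_id = request_id or uuid.uuid4().hex
    # The exception message and traceback go to the server log only.
    logger.error("request %s failed", request_id, exc_info=exc)
    # The client sees a fixed message and an id it can quote in a bug report.
    return {
        "error": {
            "type": "internal_error",
            "message": "Internal server error",
            "request_id": request_id,
        }
    }
```

The request id lets an operator match a client report to the logged traceback without exposing internal paths or messages in the response.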

Nit

controller.py:14-17 — Import ordering has a stray blank line breaking the grouping.
api_server.py:114-117 — Missing type hint on logger parameter.
api_server.py:39-40 — Unnecessary _ensure_request_id wrapper; could use ensure_rollout_request_id directly.

Verdict

REQUEST_CHANGES

YanhuiDua · 2026-03-23T12:05:45Z

@claude review

claude · 2026-03-23T12:21:58Z

xtuner/v1/rl/rollout/api_server.py

+    request_id: str | None = None,
+    protocol: str = "openai",
+) -> JSONResponse:
+    if protocol == "claude":
+        payload = {


Claude: Critical: All authentication has been completely removed from every API endpoint (/generate, /shutdown, /pause, /offload, /onload, /v1/chat/completions, /v1/messages, /healthz, /metadata). The api_key parameter was removed from create_rollout_api_app, and all _authorize_http_request calls were deleted.

Combined with the default api_host="0.0.0.0" in RolloutConfig, this means any network-reachable client can invoke generation, pause/continue, and offload/onload, and can shut down the rollout controller, without any credentials.

If auth is intended to be handled at a different layer (e.g., a reverse proxy or middleware), that should be documented. Otherwise, this is a serious security regression.

Additionally, the integration test test_claude_messages (lines 1767-1778 in the test file) still expects a 401 response for unauthorized requests. That test will now fail because there is no auth check.
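
If the check is restored in-process rather than delegated to a proxy, a minimal sketch could look like the following. The function name `is_authorized`, the `Authorization: Bearer ...` header format, and the "no key configured means auth disabled" rule are all assumptions for illustration; the removed `_authorize_http_request` may have worked differently.

```python
import hmac
from typing import Mapping, Optional


def is_authorized(headers: Mapping[str, str], expected_api_key: Optional[str]) -> bool:
    """Return True when the request carries the expected bearer token."""
    if not expected_api_key:
        # Assumption: no configured key means auth is handled elsewhere (e.g. a proxy).
        return True
    # A plain dict is case-sensitive; real frameworks usually normalize header names.
    auth_header = headers.get("Authorization", "")
    scheme, _, token = auth_header.partition(" ")
    if scheme.lower() != "bearer" or not token:
        return False
    # compare_digest avoids leaking the key through timing differences.
    return hmac.compare_digest(token.encode(), expected_api_key.encode())
```

A handler would call this before doing any work and respond with 401 when it returns False, which is the behavior the existing test expects.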

claude · 2026-03-23T12:22:03Z

xtuner/v1/rl/rollout/api_server.py

+        request_id = request.headers.get("X-Request-Id")
+        if isinstance(exc.detail, dict) and "error" in exc.detail:
+            return JSONResponse(status_code=exc.status_code, content=exc.detail)
+        return _build_error_response(


Claude: Critical: The /readyz endpoint was removed in this patch, but:

_wait_until_ready() in the integration test polls /readyz (test file line 126)

test_management_endpoints_via_real_controller asserts on /readyz response (test file line 248)

These tests will get a 404. If /healthz is meant to replace /readyz, the tests need to be updated accordingly. Note that the semantics also changed: /healthz now calls get_ready_status() instead of check_health(), conflating liveness and readiness checks, which health-check conventions typically keep separate.
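
To illustrate the separation, here is a framework-agnostic sketch of the two probes. It assumes `check_health()` returns a bool (not confirmed by the diff) and uses the `get_ready_status() -> tuple[bool, dict[str, Any]]` signature shown in controller.py; the response shapes and function names are hypothetical.

```python
from typing import Any, Callable, Dict, Tuple


def healthz_response(check_health: Callable[[], bool]) -> Tuple[int, Dict[str, Any]]:
    """Liveness: is the process up and responsive at all?"""
    alive = check_health()
    return (200 if alive else 503), {"status": "ok" if alive else "unhealthy"}


def readyz_response(
    get_ready_status: Callable[[], Tuple[bool, Dict[str, Any]]],
) -> Tuple[int, Dict[str, Any]]:
    """Readiness: can the service accept traffic right now?"""
    ready, details = get_ready_status()
    return (200 if ready else 503), {"ready": ready, **details}
```

With this split, a controller that is alive but paused or offloaded would answer 200 on /healthz and 503 on /readyz, so a test that polls /readyz waits for the right condition.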

claude · 2026-03-23T12:22:09Z

xtuner/v1/rl/rollout/lmdeploy.py

            sample_params.return_routed_experts = True if self.enable_return_routed_experts else False
            lmdeploy_sample_params = self._transform_sample_params(sample_params)
-            payload.update(sample_params)
+            payload.update(lmdeploy_sample_params)


Claude: Good bug fix — the previous code was calling payload.update(sample_params) which would serialize the SampleParams Pydantic model instead of the transformed lmdeploy-specific parameters.

claude · 2026-03-23T12:22:12Z

xtuner/v1/rl/rollout/controller.py

+            sock.settimeout(0.2)
+            return sock.connect_ex((host, port)) == 0
+
+    def _start_api_server(self, host: str | None = None, port: int | None = None):


Claude: Warning: This while loop has no upper bound. If many ports are in use, it could spin indefinitely. Consider adding a max retry count (e.g., 100 ports) and raising an error if no free port is found.

(Carried over from previous review — still not addressed.)
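
A bounded version might look like the sketch below. Note that it probes by trying to `bind` rather than using the `connect_ex` check from the PR, and that `find_free_port` and `max_tries` are hypothetical names, not part of the existing code.

```python
import socket


def find_free_port(host: str, start_port: int, max_tries: int = 100) -> int:
    """Return the first bindable port in [start_port, start_port + max_tries)."""
    # Clamp the upper bound so we never try a port number above 65535.
    end_port = min(start_port + max_tries, 65536)
    for port in range(start_port, end_port):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            try:
                sock.bind((host, port))
            except OSError:
                continue  # Port is taken or not bindable; try the next one.
            return port
    raise RuntimeError(
        f"No free port found in range {start_port}-{end_port - 1} on {host}"
    )
```

The probe socket is closed before the server binds, so another process can still take the port in between; the caller should be prepared for the real bind to fail and retry.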

claude · 2026-03-23T12:22:15Z

tests/rl/test_rollout_api_server.py

+        accelerator = RESOURCE_MAP[torch.accelerator.current_accelerator().type]
+        cls.temp_dir = tempfile.TemporaryDirectory()
+        cls.worker_log_dir = os.path.join(cls.temp_dir.name, "work_dirs")
+        cls.resources_cfg = AcceleratorResourcesConfig(


Claude: Nit: Per CLAUDE.md, prefer pathlib.Path over os.path for filesystem operations:

Suggested change

-        cls.resources_cfg = AcceleratorResourcesConfig(
+        cls.worker_log_dir = str(pathlib.Path(cls.temp_dir.name) / "work_dirs")

claude · 2026-03-23T12:22:18Z

tests/rl/test_rollout_api_server.py

+                except Exception:
+                    pass


Claude: Warning: Multiple except Exception: pass blocks in teardown (lines 85, 91-92, 117). Per CLAUDE.md: "Avoid silencing exceptions with pass unless there is a documented reason." Consider at least logging the exception:

                except Exception:
                    import traceback
                    traceback.print_exc()
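
An alternative that keeps teardown going while still surfacing failures is to route every cleanup step through one helper. This is a sketch only; `run_cleanup_steps` and the logger name are hypothetical and do not exist in the test file.

```python
import logging

logger = logging.getLogger("rollout_test_teardown")


def run_cleanup_steps(steps):
    """Run every cleanup callable; log failures instead of silently dropping them."""
    failures = []
    for step in steps:
        try:
            step()
        except Exception as exc:
            # logger.exception records the traceback, so the failure stays visible.
            logger.exception("cleanup step %r failed", getattr(step, "__name__", step))
            failures.append(exc)
    return failures
```

Later steps still run after an earlier one fails, and the returned list lets the teardown decide whether to re-raise once everything has been attempted.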

claude · 2026-03-23T12:22:41Z

xtuner/v1/rl/rollout/controller.py

        }
        return rollout_metadata

+    def get_ready_status(self) -> tuple[bool, dict[str, Any]]:


Claude: Critical: api_server_url was removed from this metadata dict, but:

The integration test setUpClass does cls.base_url = cls.metadata["api_server_url"] (test file line 72). This will raise a KeyError and prevent all tests from running.

test_management_endpoints_via_real_controller asserts metadata.json()["api_server_url"] == self.base_url (test file line 257).

The RolloutWorkerMetadata TypedDict also removed api_server_url — callers may still expect this key.

claude · 2026-03-23T12:23:11Z

Claude: Summary

This PR adds OpenAI and Claude chat API wrappers for the rollout controller, a FastAPI-based API server, and a refactoring pass that moves ensure_rollout_request_id to utils, removes redundant wrappers, and merges error helpers. It also includes a real bug fix in lmdeploy.py (sample_params → lmdeploy_sample_params). Good test coverage with unit and integration tests.

Issues

Critical

Auth completely removed: All authentication was stripped from every API endpoint (/generate, /shutdown, /v1/chat/completions, /v1/messages, etc.) with no replacement. Combined with the default api_host="0.0.0.0", any network-reachable client can shut down, pause, or offload the controller. The integration test still expects 401 for unauthorized /v1/messages requests, which will now fail.
/readyz endpoint removed but tests still use it: _wait_until_ready() polls /readyz, and test_management_endpoints_via_real_controller asserts on the /readyz response. Both requests will return 404. The /healthz endpoint now conflates liveness and readiness semantics.
api_server_url removed from metadata but tests expect it: Test setUpClass does cls.base_url = cls.metadata["api_server_url"], which will raise a KeyError and prevent all integration tests from running.

Warning

Unbounded port-scanning loop in _start_api_server (carried over from prior review, still not addressed).
except Exception: pass in test teardown without documented reason.

Nit

os.path.join used in tests instead of pathlib.Path per CLAUDE.md.

Verdict

REQUEST_CHANGES — The three critical issues (auth removal, /readyz removal, metadata key removal) will cause test failures and a security regression. The lmdeploy.py bug fix is good and should be preserved.

YanhuiDua added 3 commits March 23, 2026 17:18

Add OpenAI chat wrapper for rollout controller

4aff12d

Add Claude chat wrapper for rollout controller

7f6c009

Add rollout API server integration

0297788

YanhuiDua force-pushed the rollout_wrapper branch from 7d7a933 to 3bbec43 · March 23, 2026 09:21