Support rollout openai/claude and service wrapper#1616

Open
YanhuiDua wants to merge 4 commits into InternLM:rl_design from YanhuiDua:rollout_wrapper

Conversation

@YanhuiDua
Collaborator

No description provided.

@YanhuiDua
Collaborator Author

@claude review

@claude

claude bot commented Mar 23, 2026

Claude: ## Summary

This PR adds OpenAI and Claude chat API wrappers around the rollout controller, plus a FastAPI-based API server exposing /generate, /v1/chat/completions, /v1/messages, and management endpoints. Good test coverage with both unit and integration tests.

Issues

Critical

  • api_server.py:136-141: the /generate endpoint returns a JSONResponse on auth failure but declares a RolloutState return type, breaking the type contract and causing inconsistent response schemas for clients.

Warning

  • openai_chat.py:35-36 — Redundant ensure_request_id (public) and _ensure_request_id (private) static methods that both delegate to the same module-level function.
  • openai_chat.py:155: _ensure_request_id is called twice on the same rollout_state (once in _build_rollout_state, once in chat()).
  • api_server.py:150 — Raw exception message leaked to client in error response; could expose internal details.
  • worker.py — Default api_host="0.0.0.0" binds to all interfaces; consider 127.0.0.1 if only local access is needed, or document the rationale.
  • controller.py — Unbounded port-scanning loop in _start_api_server; consider a max retry count.

Nit

Verdict

REQUEST_CHANGES

@YanhuiDua
Collaborator Author

@claude review

Comment on lines +28 to +32
request_id: str | None = None,
protocol: str = "openai",
) -> JSONResponse:
if protocol == "claude":
payload = {

Claude: Critical: All authentication has been completely removed from every API endpoint (/generate, /shutdown, /pause, /offload, /onload, /v1/chat/completions, /v1/messages, /healthz, /metadata). The api_key parameter was removed from create_rollout_api_app, and all _authorize_http_request calls were deleted.

Combined with the default api_host="0.0.0.0" in RolloutConfig, this means any network-reachable client can invoke generation, pause/continue, offload/onload, and shut down the rollout controller without any credentials.

If auth is intended to be handled at a different layer (e.g., a reverse proxy or middleware), that should be documented. Otherwise, this is a serious security regression.

Additionally, the integration test at test_claude_messages (lines 1767-1778 in the test file) still expects a 401 response for unauthorized requests. That test will now fail because there is no auth check.

Comment on lines +62 to +65
request_id = request.headers.get("X-Request-Id")
if isinstance(exc.detail, dict) and "error" in exc.detail:
return JSONResponse(status_code=exc.status_code, content=exc.detail)
return _build_error_response(

Claude: Critical: The /readyz endpoint was removed in this patch, but:

  1. _wait_until_ready() in the integration test polls /readyz (test file line 126)
  2. test_management_endpoints_via_real_controller asserts on /readyz response (test file line 248)

These tests will 404. If /healthz is meant to replace /readyz, the tests need to be updated accordingly. Note that the semantics also changed — /healthz now calls get_ready_status() instead of check_health(), conflating liveness and readiness checks (which are typically separate in health check conventions).
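
To illustrate the liveness/readiness split mentioned above, here is a framework-free sketch. Only the method names `check_health()` and `get_ready_status()` come from this review. The `Controller` protocol, the boolean return type of `check_health()`, and the response bodies are assumptions.

```python
from typing import Any, Protocol


class Controller(Protocol):
    """Assumed interface; only the two method names appear in the review."""

    def check_health(self) -> bool: ...
    def get_ready_status(self) -> tuple[bool, dict[str, Any]]: ...


def healthz(controller: Controller) -> tuple[int, dict[str, Any]]:
    """Liveness: is the process up and responsive?"""
    alive = controller.check_health()
    return (200 if alive else 503), {"status": "ok" if alive else "unhealthy"}


def readyz(controller: Controller) -> tuple[int, dict[str, Any]]:
    """Readiness: can the service accept traffic right now?"""
    ready, details = controller.get_ready_status()
    return (200 if ready else 503), {"ready": ready, **details}
```

Keeping the two handlers separate lets an orchestrator restart a dead process without also restarting one that is merely still loading.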

sample_params.return_routed_experts = True if self.enable_return_routed_experts else False
lmdeploy_sample_params = self._transform_sample_params(sample_params)
payload.update(sample_params)
payload.update(lmdeploy_sample_params)

Claude: Good bug fix — the previous code was calling payload.update(sample_params) which would serialize the SampleParams Pydantic model instead of the transformed lmdeploy-specific parameters.

Comment on lines +346 to +349
sock.settimeout(0.2)
return sock.connect_ex((host, port)) == 0

def _start_api_server(self, host: str | None = None, port: int | None = None):

Claude: Warning: This while loop has no upper bound. If many ports are in use, it could spin indefinitely. Consider adding a max retry count (e.g., 100 ports) and raising an error if no free port is found.

(Carried over from previous review — still not addressed.)
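
A bounded scan along the lines suggested above might look like the following sketch. The function name `find_free_port` and the `max_tries` parameter are hypothetical. It also probes with `bind` rather than the `connect_ex` check shown in the diff, which is a swapped-in technique and not the PR's code.

```python
import socket


def find_free_port(host: str, start_port: int, max_tries: int = 100) -> int:
    """Return the first bindable port at or after start_port, trying at most max_tries ports."""
    end_port = min(start_port + max_tries, 65536)  # stay within the valid port range
    for port in range(start_port, end_port):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            try:
                sock.bind((host, port))
            except OSError:
                continue  # Port in use or not bindable; try the next one.
            return port
    raise RuntimeError(f"No free port found in {start_port}-{end_port - 1} on {host!r}")
```

Any check-then-bind scheme still leaves a small race before the server actually binds, so the caller should be prepared for the final bind to fail.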

accelerator = RESOURCE_MAP[torch.accelerator.current_accelerator().type]
cls.temp_dir = tempfile.TemporaryDirectory()
cls.worker_log_dir = os.path.join(cls.temp_dir.name, "work_dirs")
cls.resources_cfg = AcceleratorResourcesConfig(

Claude: Nit: Per CLAUDE.md, prefer pathlib.Path over os.path for filesystem operations:

Suggested change
- cls.worker_log_dir = os.path.join(cls.temp_dir.name, "work_dirs")
+ cls.worker_log_dir = str(pathlib.Path(cls.temp_dir.name) / "work_dirs")

Comment on lines +84 to +85
except Exception:
pass

Claude: Warning: Multiple except Exception: pass blocks in teardown (lines 85, 91-92, 117). Per CLAUDE.md: "Avoid silencing exceptions with pass unless there is a documented reason." Consider at least logging the exception:

except Exception:
    import traceback
    traceback.print_exc()
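
An alternative sketch that uses the logging module instead of printing, so teardown failures end up in the same place as other test logs. The wrapper name `safe_cleanup` is hypothetical and not part of the PR.

```python
import logging
from collections.abc import Callable

logger = logging.getLogger(__name__)


def safe_cleanup(step_name: str, cleanup_fn: Callable[[], object]) -> bool:
    """Run one teardown step; log any failure instead of swallowing it."""
    try:
        cleanup_fn()
        return True
    except Exception:
        # logger.exception records the message together with the full traceback.
        logger.exception("Teardown step %r failed", step_name)
        return False
```

Each teardown block could then become a single call such as `safe_cleanup("remove temp dir", cls.temp_dir.cleanup)`, so later steps still run after an earlier one fails.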

Comment on lines 121 to +124
}
return rollout_metadata

def get_ready_status(self) -> tuple[bool, dict[str, Any]]:

Claude: Critical: api_server_url was removed from this metadata dict, but:

  1. The integration test setUpClass does cls.base_url = cls.metadata["api_server_url"] (test file line 72). This will raise a KeyError and prevent all tests from running.
  2. test_management_endpoints_via_real_controller asserts metadata.json()["api_server_url"] == self.base_url (test file line 257).
  3. The RolloutWorkerMetadata TypedDict also removed api_server_url — callers may still expect this key.

@claude

claude bot commented Mar 23, 2026

Claude: ## Summary

This PR adds OpenAI and Claude chat API wrappers for the rollout controller, a FastAPI-based API server, and a refactoring pass that moves ensure_rollout_request_id to utils, removes redundant wrappers, and merges error helpers. It also includes a real bug fix in lmdeploy.py (sample_params → lmdeploy_sample_params). Good test coverage with unit and integration tests.

Issues

Critical

  • Auth completely removed: all authentication was stripped from every API endpoint (/generate, /shutdown, /v1/chat/completions, /v1/messages, etc.) with no replacement. Combined with the default api_host="0.0.0.0", any network-reachable client can shut down, pause, or offload the controller. The integration test still expects 401 for unauthorized /v1/messages requests, which will now fail.
  • /readyz endpoint removed but tests still use it: _wait_until_ready() polls /readyz, and test_management_endpoints_via_real_controller asserts on /readyz. These will 404. The /healthz endpoint now conflates liveness and readiness semantics.
  • api_server_url removed from metadata but tests expect it: test setUpClass does cls.base_url = cls.metadata["api_server_url"], which will raise a KeyError and prevent all integration tests from running.

Warning

  • Unbounded port-scanning loop in _start_api_server (carried over from prior review, still not addressed).
  • except Exception: pass in test teardown without documented reason.

Nit

  • os.path.join used in tests instead of pathlib.Path per CLAUDE.md.

Verdict

REQUEST_CHANGES — The three critical issues (auth removal, /readyz removal, metadata key removal) will cause test failures and a security regression. The lmdeploy.py bug fix is good and should be preserved.
