Support rollout openai/claude and service wrapper#1616
Support rollout openai/claude and service wrapper#1616YanhuiDua wants to merge 4 commits intoInternLM:rl_designfrom
Conversation
7d7a933 to
3bbec43
Compare
|
@claude review |
|
Claude: ## Summary This PR adds OpenAI and Claude chat API wrappers around the rollout controller, plus a FastAPI-based API server exposing IssuesCritical
Warning
Nit
VerdictREQUEST_CHANGES |
3bbec43 to
8ce26eb
Compare
|
@claude review |
| request_id: str | None = None, | ||
| protocol: str = "openai", | ||
| ) -> JSONResponse: | ||
| if protocol == "claude": | ||
| payload = { |
There was a problem hiding this comment.
Claude: Critical: All authentication has been completely removed from every API endpoint (/generate, /shutdown, /pause, /offload, /onload, /v1/chat/completions, /v1/messages, /healthz, /metadata). The api_key parameter was removed from create_rollout_api_app, and all _authorize_http_request calls were deleted.
Combined with the default api_host="0.0.0.0" in RolloutConfig, this means any network-reachable client can invoke generation, pause/continue, offload/onload, and shutdown the rollout controller without any credentials.
If auth is intended to be handled at a different layer (e.g., a reverse proxy or middleware), that should be documented. Otherwise, this is a serious security regression.
Additionally, the integration test at test_claude_messages (line 1767-1778 in the test file) still expects a 401 response for unauthorized requests — that test will fail now since there's no auth check.
| request_id = request.headers.get("X-Request-Id") | ||
| if isinstance(exc.detail, dict) and "error" in exc.detail: | ||
| return JSONResponse(status_code=exc.status_code, content=exc.detail) | ||
| return _build_error_response( |
There was a problem hiding this comment.
Claude: Critical: The /readyz endpoint was removed in this patch, but:
_wait_until_ready()in the integration test polls/readyz(test file line 126)test_management_endpoints_via_real_controllerasserts on/readyzresponse (test file line 248)
These tests will 404. If /healthz is meant to replace /readyz, the tests need to be updated accordingly. Note that the semantics also changed — /healthz now calls get_ready_status() instead of check_health(), conflating liveness and readiness checks (which are typically separate in health check conventions).
| sample_params.return_routed_experts = True if self.enable_return_routed_experts else False | ||
| lmdeploy_sample_params = self._transform_sample_params(sample_params) | ||
| payload.update(sample_params) | ||
| payload.update(lmdeploy_sample_params) |
There was a problem hiding this comment.
Claude: Good bug fix — the previous code was calling payload.update(sample_params) which would serialize the SampleParams Pydantic model instead of the transformed lmdeploy-specific parameters.
| sock.settimeout(0.2) | ||
| return sock.connect_ex((host, port)) == 0 | ||
|
|
||
| def _start_api_server(self, host: str | None = None, port: int | None = None): |
There was a problem hiding this comment.
Claude: Warning: This while loop has no upper bound. If many ports are in use, it could spin indefinitely. Consider adding a max retry count (e.g., 100 ports) and raising an error if no free port is found.
(Carried over from previous review — still not addressed.)
tests/rl/test_rollout_api_server.py
Outdated
| accelerator = RESOURCE_MAP[torch.accelerator.current_accelerator().type] | ||
| cls.temp_dir = tempfile.TemporaryDirectory() | ||
| cls.worker_log_dir = os.path.join(cls.temp_dir.name, "work_dirs") | ||
| cls.resources_cfg = AcceleratorResourcesConfig( |
There was a problem hiding this comment.
Claude: Nit: Per CLAUDE.md, prefer pathlib.Path over os.path for filesystem operations:
| cls.resources_cfg = AcceleratorResourcesConfig( | |
| cls.worker_log_dir = str(pathlib.Path(cls.temp_dir.name) / "work_dirs") |
tests/rl/test_rollout_api_server.py
Outdated
| except Exception: | ||
| pass |
There was a problem hiding this comment.
Claude: Warning: Multiple except Exception: pass blocks in teardown (lines 85, 91-92, 117). Per CLAUDE.md: "Avoid silencing exceptions with pass unless there is a documented reason." Consider at least logging the exception:
except Exception:
import traceback
traceback.print_exc()| } | ||
| return rollout_metadata | ||
|
|
||
| def get_ready_status(self) -> tuple[bool, dict[str, Any]]: |
There was a problem hiding this comment.
Claude: Critical: api_server_url was removed from this metadata dict, but:
- The integration test
setUpClassdoescls.base_url = cls.metadata["api_server_url"](test file line 72) — this willKeyErrorand prevent all tests from running. test_management_endpoints_via_real_controllerassertsmetadata.json()["api_server_url"] == self.base_url(test file line 257).- The
RolloutWorkerMetadataTypedDict also removedapi_server_url— callers may still expect this key.
|
Claude: ## Summary This PR adds OpenAI and Claude chat API wrappers for the rollout controller, a FastAPI-based API server, and a refactoring pass that moves IssuesCritical
Warning
Nit
VerdictREQUEST_CHANGES — The three critical issues (auth removal, |
8ce26eb to
0e59d89
Compare
No description provided.