## Description

The `/api/chat` endpoint in `backend/main.py` makes direct LLM API calls with no rate limiting or request throttling. This creates risks of uncontrolled API costs, unhandled `429` errors from LLM providers, and vulnerability to abuse or accidental request loops in multi-user deployments.
## Problem

Currently, there is no mechanism to limit how many requests a user can send to the LLM API within a given time window. The `chat_endpoint` function in `main.py` calls `assistant.handle_chat()` directly, with no throttling.
## Impact
- Unlimited rapid requests hit the LLM API simultaneously
- Google Gemini/Vertex AI rate limits trigger unhandled errors
- No cost control or usage visibility
- No protection against bot abuse or accidental loops
## Steps to Reproduce

Send 50 rapid concurrent requests; all go through with zero throttling:

```bash
for i in $(seq 1 50); do
  curl -X POST http://localhost:8000/api/chat \
    -H "Content-Type: application/json" \
    -d '{"query": "What is a neuron?"}' &
done
```
## Expected Behavior
Requests should be throttled or queued after a configurable limit.
## Actual Behavior
All 50 requests hit the LLM API simultaneously, causing potential rate limit errors or unexpected billing spikes.
## Proposed Solution

Integrate `slowapi`, a FastAPI-compatible rate-limiting library.
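For intuition, the throttling such a limiter applies can be sketched as a fixed-window counter keyed by client address. This is a simplified model of the idea, not slowapi's actual implementation:

```python
import time
from collections import defaultdict

class FixedWindowLimiter:
    """Allow at most `limit` requests per `window` seconds for each key."""

    def __init__(self, limit=10, window=60.0, clock=time.monotonic):
        self.limit = limit
        self.window = window
        self.clock = clock  # injectable for testing
        # key -> (window_start, request_count)
        self.windows = defaultdict(lambda: (0.0, 0))

    def allow(self, key):
        now = self.clock()
        start, count = self.windows[key]
        if now - start >= self.window:   # window expired: start a fresh one
            start, count = now, 0
        if count >= self.limit:          # over the limit: reject (would be a 429)
            self.windows[key] = (start, count)
            return False
        self.windows[key] = (start, count + 1)
        return True
```

A decorator limit of `"10/minute"` corresponds to `FixedWindowLimiter(limit=10, window=60)`; slowapi (via the `limits` library) additionally supports other window strategies and shared storage backends such as Redis for multi-worker deployments.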
Add to `backend/main.py`:

```python
from fastapi import Request
from fastapi.responses import JSONResponse
from slowapi import Limiter
from slowapi.errors import RateLimitExceeded
from slowapi.util import get_remote_address

# Key requests by client IP address
limiter = Limiter(key_func=get_remote_address)
app.state.limiter = limiter

@app.exception_handler(RateLimitExceeded)
async def rate_limit_handler(request: Request, exc: RateLimitExceeded):
    return JSONResponse(
        status_code=429,
        content={"detail": "Too many requests. Please wait and try again."},
    )

@app.post("/api/chat", response_model=ChatResponse, tags=["Chat"])
@limiter.limit("10/minute")  # at most 10 requests per client per minute
async def chat_endpoint(request: Request, msg: ChatMessage):
    ...
```

Note that slowapi requires the decorated endpoint to accept a `request: Request` parameter so it can resolve the client key.
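Inbound throttling aside, the backend's own outbound calls to Gemini/Vertex AI can also guard against provider `429`s with retry and exponential backoff. A minimal sketch; `RateLimitError` and the retried callable are hypothetical stand-ins, not the real Gemini SDK API:

```python
import time

class RateLimitError(Exception):
    """Stand-in for the error a (hypothetical) LLM client raises on a provider 429."""

def call_with_backoff(call, retries=3, base_delay=1.0, sleep=time.sleep):
    """Retry `call()` with exponential backoff on RateLimitError."""
    for attempt in range(retries):
        try:
            return call()
        except RateLimitError:
            if attempt == retries - 1:
                raise  # out of retries: surface the error to the caller
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
```

The `sleep` parameter is injectable so the backoff schedule can be unit-tested without real delays.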
Add to `pyproject.toml` dependencies:
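A dependency entry along these lines (the version pin is illustrative; check PyPI for the current release):

```toml
[project]
dependencies = [
    # ...existing dependencies...
    "slowapi>=0.1.9",
]
```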
Add to `.env.template`:
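For example, a configurable limit (the variable name is a suggestion, not an existing setting):

```
# Requests per client allowed on /api/chat (slowapi limit string)
CHAT_RATE_LIMIT=10/minute
```

The endpoint decorator can then read it with `os.getenv("CHAT_RATE_LIMIT", "10/minute")` instead of hard-coding the limit.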
## Acceptance Criteria

- Rate limiting is applied to the `/api/chat` endpoint in `backend/main.py`
- The limit is configurable via the `.env` file
- Exceeding the limit returns a `429` response with a user-friendly error message
## Environment

- OS: Any (Linux/macOS/Windows)
- Python: 3.12+
- Framework: FastAPI
- LLM Provider: Google Gemini / Vertex AI
- Relevant Files: `backend/main.py`, `pyproject.toml`, `.env.template`