## Description

The `/api/chat` endpoint in `backend/main.py` makes direct LLM API calls with no rate limiting or request throttling. This creates risks of uncontrolled API costs, unhandled `429` errors from LLM providers, and vulnerability to abuse or accidental request loops in multi-user deployments.
## Problem

Currently, there is no mechanism to limit how many requests a user can send to the LLM API within a given time window. The `chat_endpoint` function in `main.py` calls `assistant.handle_chat()` directly, with no throttling.
## Impact
- Unlimited rapid requests hit the LLM API simultaneously
- Google Gemini/Vertex AI rate limits trigger unhandled errors
- No cost control or usage visibility
- No protection against bot abuse or accidental loops
## Steps to Reproduce

Send 50 rapid concurrent requests; all go through with zero throttling:

```bash
for i in $(seq 1 50); do
  curl -X POST http://localhost:8000/api/chat \
    -H "Content-Type: application/json" \
    -d '{"query": "What is a neuron?"}' &
done
```
## Expected Behavior
Requests should be throttled or queued after a configurable limit.
## Actual Behavior
All 50 requests hit the LLM API simultaneously, causing potential rate limit errors or unexpected billing spikes.
## Proposed Solution

Integrate `slowapi`, a FastAPI-compatible rate-limiting library.
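For intuition, the throttling such a limiter applies can be sketched as a fixed-window counter keyed by client address. This is a simplified model of the idea, not slowapi's actual implementation:

```python
import time
from collections import defaultdict

class FixedWindowLimiter:
    """Allow at most `limit` requests per `window` seconds for each key."""

    def __init__(self, limit=10, window=60.0, clock=time.monotonic):
        self.limit = limit
        self.window = window
        self.clock = clock  # injectable for testing
        # key -> (window_start, request_count)
        self.windows = defaultdict(lambda: (0.0, 0))

    def allow(self, key):
        now = self.clock()
        start, count = self.windows[key]
        if now - start >= self.window:   # window expired: start a fresh one
            start, count = now, 0
        if count >= self.limit:          # over the limit: reject (would be a 429)
            self.windows[key] = (start, count)
            return False
        self.windows[key] = (start, count + 1)
        return True
```

A decorator limit of `"10/minute"` corresponds to `FixedWindowLimiter(limit=10, window=60)`; slowapi (via the `limits` library) additionally supports other window strategies and shared storage backends such as Redis for multi-worker deployments.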
Add to `backend/main.py`:

```python
from fastapi import Request
from fastapi.responses import JSONResponse
from slowapi import Limiter
from slowapi.errors import RateLimitExceeded
from slowapi.util import get_remote_address

# Key requests by client IP address
limiter = Limiter(key_func=get_remote_address)
app.state.limiter = limiter

@app.exception_handler(RateLimitExceeded)
async def rate_limit_handler(request: Request, exc: RateLimitExceeded):
    return JSONResponse(
        status_code=429,
        content={"detail": "Too many requests. Please wait and try again."},
    )

@app.post("/api/chat", response_model=ChatResponse, tags=["Chat"])
@limiter.limit("10/minute")  # at most 10 requests per client per minute
async def chat_endpoint(request: Request, msg: ChatMessage):
    ...
```

Note that slowapi requires the decorated endpoint to accept a `request: Request` parameter so it can resolve the client key.
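Inbound throttling aside, the backend's own outbound calls to Gemini/Vertex AI can also guard against provider `429`s with retry and exponential backoff. A minimal sketch; `RateLimitError` and the retried callable are hypothetical stand-ins, not the real Gemini SDK API:

```python
import time

class RateLimitError(Exception):
    """Stand-in for the error a (hypothetical) LLM client raises on a provider 429."""

def call_with_backoff(call, retries=3, base_delay=1.0, sleep=time.sleep):
    """Retry `call()` with exponential backoff on RateLimitError."""
    for attempt in range(retries):
        try:
            return call()
        except RateLimitError:
            if attempt == retries - 1:
                raise  # out of retries: surface the error to the caller
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
```

The `sleep` parameter is injectable so the backoff schedule can be unit-tested without real delays.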
Add to `pyproject.toml` dependencies:
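A dependency entry along these lines (the version pin is illustrative; check PyPI for the current release):

```toml
[project]
dependencies = [
    # ...existing dependencies...
    "slowapi>=0.1.9",
]
```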
Add to `.env.template`:
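For example, a configurable limit (the variable name is a suggestion, not an existing setting):

```
# Requests per client allowed on /api/chat (slowapi limit string)
CHAT_RATE_LIMIT=10/minute
```

The endpoint decorator can then read it with `os.getenv("CHAT_RATE_LIMIT", "10/minute")` instead of hard-coding the limit.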
## Acceptance Criteria

- Rate limiting is applied to the `/api/chat` endpoint in `backend/main.py`
- The limit is configurable via the `.env` file
- Exceeding the limit returns a `429` response with a user-friendly error message
## Environment

- OS: Any (Linux/macOS/Windows)
- Python: 3.12+
- Framework: FastAPI
- LLM Provider: Google Gemini / Vertex AI
- Relevant Files: `backend/main.py`, `pyproject.toml`, `.env.template`