7 changes: 7 additions & 0 deletions .example.env
@@ -33,3 +33,10 @@ SESSION_SECRET_KEY=your_secret_key
WEBHOOK_SECRET=your_webhook_secret
REPO_PATH_LOCALLY=/path/to/sugar-ai
GIT_PATH=/usr/bin/git

# GitHub API authentication (optional)
# Used when fetching Sugar documentation from GitHub
# Provides higher API rate limits (5000 requests/hour vs 60)
# Generate token at: https://github.com/settings/tokens
# Required scopes: none (a token with no scopes has read-only access to public repositories)
GITHUB_TOKEN=your_github_personal_access_token
227 changes: 227 additions & 0 deletions README.md
@@ -499,6 +499,233 @@ Review the terminal output for further details and error messages.

When deploying Sugar-AI in CI/CD pipelines, you'll need to configure environment variables properly. The current CI/CD setup uses GitHub webhooks, so make sure to create a webhook secret and add it to `.env`.

## Keeping RAG Docs Up to Date

Sugar-AI includes a dynamic document fetching system to keep your RAG (Retrieval-Augmented Generation) documentation fresh and current. Instead of manually managing static documentation files, you can automatically fetch and index the latest Sugar documentation from GitHub.

### Overview

The dynamic document fetching system:
- Fetches documentation from Sugar Labs repositories on GitHub
- Converts markdown to clean plain text (strips header markers, HTML tags, etc.)
- Adds metadata about the source URL and fetch timestamp
- Rebuilds the vector store for efficient retrieval
- Supports GitHub API authentication for higher rate limits

### Manual Document Fetching

To fetch and update documentation manually:

#### Basic Usage (No Authentication)

```bash
python scripts/fetch_sugar_docs.py
```

This will fetch all configured documentation sources and save them to the `docs/` directory.

#### With GitHub Authentication

For higher API rate limits, you can provide a GitHub personal access token:

```bash
export GITHUB_TOKEN=your_github_personal_access_token
python scripts/fetch_sugar_docs.py
```

To generate a GitHub token:
1. Go to https://github.com/settings/tokens
2. Click "Generate new token (classic)"
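3. Leave all scopes unchecked: a token with no scopes already has read-only access to public repositories (`public_repo` also grants write access, which fetching does not need)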
4. Copy the token and use it as shown above
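
As a rough illustration of how such a token is usually attached to a request, the sketch below builds HTTP headers from the `GITHUB_TOKEN` environment variable. The helper name `build_github_headers` and the `User-Agent` value are hypothetical; `scripts/fetch_sugar_docs.py` may handle this differently.

```python
import os
from typing import Dict, Optional


def build_github_headers(token: Optional[str] = None) -> Dict[str, str]:
    """Return HTTP headers for a GitHub request (hypothetical helper)."""
    headers = {"User-Agent": "sugar-ai-docs-fetcher"}  # assumed client name
    # Fall back to the environment when no token is passed explicitly.
    token = token or os.getenv("GITHUB_TOKEN")
    if token:
        headers["Authorization"] = f"Bearer {token}"
    return headers
```

When no token is available the function simply omits the `Authorization` header, so the same code path serves both the authenticated and the unauthenticated case.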

#### Expected Output

```
============================================================
SUGAR-AI DOCUMENT FETCH SUMMARY
============================================================
Timestamp: 2026-03-21T10:30:45.123456
Total documents attempted: 3
Successfully fetched: 3
Failed: 0

Fetched documents:
✓ sugar-activity.txt
✓ sugar-activity-tutorial.txt
✓ hello-world-readme.txt

============================================================
Fetched 3 docs successfully
```

#### Handling Errors

The script handles common errors gracefully:
- **404 Not Found**: If a documentation URL no longer exists
- **Network Failures**: Connection timeouts or network errors
- **Authentication Issues**: Invalid or expired GitHub tokens

Failed documents are reported in the output, and remaining documents are still fetched and indexed.
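
The "report the failure and keep going" behaviour can be sketched as follows. This is not the script's actual code: `fetch_with_report` and `fake_fetch` are hypothetical names, the downloader is passed in as a callable so the example runs offline, and the result keys (`success`, `fetched_docs`, `errors`) are borrowed from how `app/routes/api.py` reads the fetch result.

```python
from typing import Callable, Dict, List
from urllib.error import HTTPError, URLError


def fetch_with_report(
    sources: List[Dict[str, str]],
    fetch: Callable[[str], str],
) -> Dict[str, object]:
    """Fetch every source, recording failures instead of stopping (sketch)."""
    fetched: List[str] = []
    errors: List[str] = []
    for source in sources:
        url = source["url"]
        try:
            fetch(url)  # a real script would also convert and save the text
            fetched.append(source["filename"])
        except HTTPError as exc:  # must come before URLError, its parent class
            if exc.code == 404:
                errors.append(f"Document not found (404): {url}")
            elif exc.code in (401, 403):
                errors.append(f"Authentication or rate-limit error ({exc.code}): {url}")
            else:
                errors.append(f"HTTP error {exc.code}: {url}")
        except (URLError, TimeoutError) as exc:
            errors.append(f"Network error for {url}: {exc}")
    # Assumption: "success" means that no source failed.
    return {"success": not errors, "fetched_docs": fetched, "errors": errors}


def fake_fetch(url: str) -> str:
    """Stand-in for a real download so the example runs offline."""
    if url.endswith("missing.md"):
        raise HTTPError(url, 404, "Not Found", None, None)
    return "# Title\n\nBody text"


result = fetch_with_report(
    [
        {"url": "https://example.org/guide.md", "filename": "guide.txt"},
        {"url": "https://example.org/missing.md", "filename": "missing.txt"},
    ],
    fake_fetch,
)
print(result["fetched_docs"], result["errors"])
```

Collecting errors in a list rather than raising on the first one is what lets the remaining documents still be processed.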

### Automated Document Refreshing via API

#### Using the /refresh-docs Endpoint

For automated updates, you can use the `/refresh-docs` endpoint. This requires admin permissions (`can_change_model: true` in your API key configuration).

#### Example Request

```bash
curl -X POST "http://localhost:8000/refresh-docs" \
  -H "X-API-Key: sugarai2024"
```

#### Example Response

```json
{
  "status": "success",
  "docs_refreshed": [
    "sugar-activity.txt",
    "sugar-activity-tutorial.txt",
    "hello-world-readme.txt"
  ],
  "vectorstore_rebuilt": true,
  "timestamp": "2026-03-21T10:45:30.123456",
  "total_docs_count": 6
}
```

#### Using OAuth Authentication

If you're authenticated via OAuth with admin permissions:

```bash
# Using OAuth session (admin with can_change_model: true)
curl -X POST "http://localhost:8000/refresh-docs" \
  -H "Cookie: session=your_session_cookie"
```

#### Error Handling

If fetching fails, the endpoint returns HTTP 500 with an error response and does not rebuild the vector store:

```json
{
  "detail": "Failed to fetch some documents: Document not found (404): https://raw.githubusercontent.com/sugarlabs/sugar-docs/master/src/sugar-activity.md"
}
```
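
For callers that prefer Python over `curl`, a minimal client might look like the sketch below. It assumes the local URL from the examples above; `build_refresh_request` and `call_refresh_docs` are hypothetical names, and the long timeout is a guess at how long a rebuild could take.

```python
import json
import urllib.request
from urllib.error import HTTPError

BASE_URL = "http://localhost:8000"  # assumed local deployment, as in the curl example


def build_refresh_request(api_key: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build the POST request for /refresh-docs without sending it."""
    return urllib.request.Request(
        f"{base_url}/refresh-docs",
        data=b"",  # empty body; the endpoint takes no payload
        headers={"X-API-Key": api_key},
        method="POST",
    )


def call_refresh_docs(api_key: str) -> dict:
    """Send the request and return a dict for both success and error cases."""
    try:
        with urllib.request.urlopen(build_refresh_request(api_key), timeout=300) as resp:
            return json.loads(resp.read().decode("utf-8"))
    except HTTPError as exc:
        body = exc.read().decode("utf-8", errors="replace")
        try:
            # Error responses carry a JSON object with a "detail" field.
            detail = json.loads(body).get("detail", body)
        except (json.JSONDecodeError, AttributeError):
            detail = body  # fall back to the raw text
        return {"status": "error", "code": exc.code, "detail": detail}
```

Calling `call_refresh_docs("your_api_key")` against a running server should return either the success payload or a dict with `status` set to `"error"`.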

### Configuration

The documentation sources are defined in `scripts/fetch_sugar_docs.py`:

```python
DOCS_TO_FETCH = [
    {
        "url": "https://raw.githubusercontent.com/sugarlabs/sugar-docs/master/src/sugar-activity.md",
        "filename": "sugar-activity.txt"
    },
    {
        "url": "https://raw.githubusercontent.com/sugarlabs/sugar-docs/master/src/sugar-activity-tutorial.md",
        "filename": "sugar-activity-tutorial.txt"
    },
    {
        "url": "https://raw.githubusercontent.com/sugarlabs/hello-world/master/README.md",
        "filename": "hello-world-readme.txt"
    }
]
```

To add more documentation sources, edit this list with additional `url` and `filename` pairs.
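
Before adding entries, it can help to sanity-check the list. The sketch below is one possible check, not something the script is known to do: `validate_sources` is a hypothetical helper, and the three rules it applies (HTTPS URLs, `.txt` filenames, unique filenames) are assumptions drawn from the entries shown above.

```python
from typing import Dict, List
from urllib.parse import urlparse


def validate_sources(sources: List[Dict[str, str]]) -> List[str]:
    """Return a list of problems found in a DOCS_TO_FETCH-style list."""
    problems: List[str] = []
    seen = set()
    for index, source in enumerate(sources):
        url = source.get("url", "")
        filename = source.get("filename", "")
        if urlparse(url).scheme != "https":
            problems.append(f"entry {index}: url must use https")
        if not filename.endswith(".txt"):
            problems.append(f"entry {index}: filename should end in .txt")
        if filename in seen:
            # Two entries writing the same file would overwrite each other.
            problems.append(f"entry {index}: duplicate filename {filename!r}")
        seen.add(filename)
    return problems
```

An empty return value means the list passed every check.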

### Scheduling Document Updates

#### Using Cron (Unix/Linux/macOS)

Schedule automatic document updates daily at 2 AM:

```bash
0 2 * * * cd /path/to/sugar-ai && GITHUB_TOKEN=your_token python scripts/fetch_sugar_docs.py
```

#### Using GitHub Actions

Create `.github/workflows/refresh-docs.yml`:

```yaml
name: Refresh Sugar Docs

on:
  schedule:
    # Runs at 2 AM UTC daily
    - cron: '0 2 * * *'
  workflow_dispatch: # Allow manual trigger

# The final step pushes a commit, so the job's token needs write access
permissions:
  contents: write

jobs:
  refresh-docs:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.9'

      - name: Install dependencies
        run: pip install -r requirements.txt

      - name: Fetch and update docs
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        run: python scripts/fetch_sugar_docs.py

      - name: Commit and push updated docs
        run: |
          git config --local user.email "action@github.com"
          git config --local user.name "GitHub Action"
          git add docs/
          # Commit only when the fetch changed something; a bare
          # `git commit` exits non-zero on a clean tree and fails the job
          git diff --cached --quiet || git commit -m "chore: update Sugar documentation"
          git push
```

### Document Format

Each fetched document includes:
1. **Header**: Source URL and timestamp in format `# Fetched from [url] on [timestamp]`
2. **Content**: Converted from markdown to plain text
   - Markdown headers (`#`, `##`, etc.) converted to plain text
   - HTML tags removed
   - Excessive whitespace cleaned up
   - Maintains readability for RAG retrieval
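
The format described above can be approximated with a few regular expressions. The following is a deliberately naive sketch, not the converter used by `scripts/fetch_sugar_docs.py`: both function names are hypothetical, the blank line after the header is an assumption, and the patterns will also strip `#` from comment lines inside code fences and any `<...>` that merely looks like a tag.

```python
import re
from datetime import datetime


def markdown_to_plain_text(markdown: str) -> str:
    """Naive markdown cleanup: tags, header markers, extra whitespace."""
    text = re.sub(r"<[^>]+>", "", markdown)  # drop HTML tags
    text = re.sub(r"^#{1,6}[ \t]*", "", text, flags=re.MULTILINE)  # "## Title" -> "Title"
    text = re.sub(r"[ \t]+", " ", text)  # collapse runs of spaces and tabs
    text = re.sub(r"\n{3,}", "\n\n", text)  # keep at most one blank line in a row
    return text.strip()


def with_fetch_header(text: str, url: str, fetched_at: datetime) -> str:
    """Prepend the provenance line described under "Header"."""
    return f"# Fetched from {url} on {fetched_at.isoformat()}\n\n{text}"
```

A production converter would more likely rely on a markdown parser, since regular expressions cannot tell markup apart from text that only resembles it.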

### Troubleshooting

#### Rate Limiting Issues

Without GitHub authentication, you're limited to 60 requests/hour. With authentication:
- Classic tokens: 5000 requests/hour
- App tokens: Higher limits depending on configuration

If you see rate limit errors, use GitHub authentication as shown above.
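
To see how close you are to the limit, you can inspect the rate-limit headers that the GitHub REST API attaches to its responses (`X-RateLimit-Remaining` and `X-RateLimit-Reset`, the latter in UTC epoch seconds). The sketch below only interprets a headers mapping; `describe_rate_limit` is a hypothetical helper, and it returns `None` when a response carries no such headers.

```python
from datetime import datetime, timezone
from typing import Mapping, Optional


def describe_rate_limit(headers: Mapping[str, str]) -> Optional[str]:
    """Summarise GitHub rate-limit headers, or return None if absent."""
    # A plain dict is case-sensitive; HTTP client header objects usually are not.
    remaining = headers.get("X-RateLimit-Remaining")
    reset = headers.get("X-RateLimit-Reset")
    if remaining is None or reset is None:
        return None
    reset_at = datetime.fromtimestamp(int(reset), tz=timezone.utc)
    if int(remaining) == 0:
        return f"Rate limit exhausted; resets at {reset_at.isoformat()}"
    return f"{remaining} requests remaining until {reset_at.isoformat()}"
```

Logging this summary after each fetch makes it easier to tell a rate-limit failure apart from an invalid token.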

#### Mixed Authentication Methods

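Two kinds of credentials are involved, and they serve different purposes. The `/refresh-docs` endpoint accepts either of the following, in order of preference:
1. **X-API-Key Header**: API key-based access
2. **Admin Cookie**: OAuth session with admin permissions

The `GITHUB_TOKEN` environment variable is separate and optional: it only authenticates the requests made to GitHub. Without it, those requests are unauthenticated and subject to the lower rate limit.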

#### Rebuild Issues

If the vectorstore rebuild fails, check:
1. Document files exist in the `docs/` directory
2. Documents are readable and contain valid text
3. Sufficient system memory for embedding generation
4. The application log for the underlying error: `tail -f sugar_ai.log`
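
The first two checks can be automated with a short pre-flight script. This sketch assumes the fetched files are `.txt` files in `docs/` and reuses the 50-character threshold from the length filter this change adds to `app/ai.py`; `check_docs` is a hypothetical helper and it ignores PDFs.

```python
from pathlib import Path
from typing import Dict, List

MIN_CHARS = 50  # mirrors the length filter added to app/ai.py in this change


def check_docs(docs_dir: str = "docs") -> Dict[str, List[str]]:
    """Sort .txt files into usable, too-short, and unreadable groups."""
    report: Dict[str, List[str]] = {"ok": [], "too_short": [], "unreadable": []}
    for path in sorted(Path(docs_dir).glob("*.txt")):
        try:
            text = path.read_text(encoding="utf-8")
        except (OSError, UnicodeDecodeError):
            report["unreadable"].append(path.name)
            continue
        # Files at or below the threshold would be filtered out before embedding.
        key = "ok" if len(text.strip()) > MIN_CHARS else "too_short"
        report[key].append(path.name)
    return report
```

If every file lands in `too_short` or `unreadable`, the rebuild would have nothing to index, which matches the "No valid documents found after filtering" error raised in `app/ai.py`.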

## Using the Streamlit App

Sugar-AI also provides a Streamlit-based interface for quick interactions and visualizations.
10 changes: 9 additions & 1 deletion app/ai.py
@@ -141,10 +141,18 @@ def setup_vectorstore(self, file_paths: List[str]) -> Optional[FAISS]:
             if file_path.endswith(".pdf"):
                 loader = PyMuPDFLoader(file_path)
             else:
-                loader = TextLoader(file_path)
+                loader = TextLoader(file_path, encoding="utf-8")
             documents = loader.load()
             all_documents.extend(documents)

         # Filter out documents with minimal content (50 characters or fewer after stripping)
         # This removes placeholder, stub, or empty documents that don't add value to RAG
         # Note: the same threshold applies to every loaded document, including individual
         # PDF pages, so sparse pages are dropped as well
         all_documents = [doc for doc in all_documents if len(doc.page_content.strip()) > 50]

         if not all_documents:
             raise ValueError("No valid documents found after filtering. Check that document files contain sufficient content.")

         embeddings = HuggingFaceEmbeddings(
             model_name="sentence-transformers/all-MiniLM-L6-v2"
         )
70 changes: 70 additions & 0 deletions app/routes/api.py
@@ -10,10 +10,16 @@
import json
from datetime import datetime
from typing import Dict, Optional, List
import sys

from app.database import get_db, APIKey
from app.ai import RAGAgent, extract_answer_from_output
from app.config import settings
from app.auth import get_current_user

# Import document fetching module
sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.dirname(__file__))))
from scripts.fetch_sugar_docs import fetch_all_docs

# Pydantic models for chat completions
class ChatMessage(BaseModel):
@@ -326,3 +332,67 @@ async def change_model(
    except Exception as e:
        logger.error(f"Error changing model to {model} by {user_info['name']}: {str(e)}")
        raise HTTPException(status_code=500, detail=f"Error changing model: {str(e)}")

@router.post("/refresh-docs")
async def refresh_docs(
    user_data: tuple = Depends(get_current_user),
    db: Session = Depends(get_db),
    request: Request = None
):
    """
    Refresh documentation by fetching latest docs from GitHub and rebuilding vectorstore.
    Requires admin permission (can_change_model: true).
    """
    # Extract user from tuple (user, authenticated)
    user, authenticated = user_data
    client_ip = request.client.host if request else "unknown"

    # Check authentication
    if not authenticated or not user or not user.can_change_model:
        logger.warning(f"Unauthorized refresh-docs attempt from {client_ip}")
        raise HTTPException(status_code=403, detail="Unauthorized. Admin permission required.")

    logger.info(f"REQUEST - /refresh-docs - User: {user.name} - IP: {client_ip}")

    try:
        timestamp = datetime.now().isoformat()

        # Get GitHub token from environment (optional)
        github_token = os.getenv("GITHUB_TOKEN", None)

        # Fetch all documents
        logger.info("Fetching Sugar documentation from GitHub...")
        fetch_results = fetch_all_docs(github_token=github_token)

        if not fetch_results["success"]:
            error_msg = f"Failed to fetch some documents: {', '.join(fetch_results['errors'])}"
            logger.error(f"Error refreshing docs - User: {user.name} - {error_msg}")
            raise HTTPException(status_code=500, detail=error_msg)

        # Rebuild vectorstore with all docs
        logger.info("Rebuilding vectorstore with fetched documents...")
        doc_paths = settings.DOC_PATHS

        # Add newly fetched docs to the list if they're not already there
        docs_dir = "docs"
        fetched_files = [os.path.join(docs_dir, doc) for doc in fetch_results["fetched_docs"]]
        all_docs = list(set(doc_paths + fetched_files))

        # Rebuild vectorstore
        agent.setup_vectorstore(all_docs)

        logger.info(f"SUCCESS - /refresh-docs - User: {user.name} - Fetched {len(fetch_results['fetched_docs'])} docs")

        return {
            "status": "success",
            "docs_refreshed": fetch_results["fetched_docs"],
            "vectorstore_rebuilt": True,
            "timestamp": timestamp,
            "total_docs_count": len(all_docs)
        }

    except HTTPException:
        raise
    except Exception as e:
        logger.error(f"ERROR - /refresh-docs - User: {user.name} - Error: {str(e)}")
        raise HTTPException(status_code=500, detail=f"Error refreshing documentation: {str(e)}")