sugarlabs · skypank-coder · Mar 22, 2026
diff --git a/.example.env b/.example.env
@@ -33,3 +33,10 @@ SESSION_SECRET_KEY=your_secret_key
 WEBHOOK_SECRET=your_webhook_secret
 REPO_PATH_LOCALLY=/path/to/sugar-ai
 GIT_PATH=/usr/bin/git
+
+# GitHub API authentication (optional)
+# Used when fetching Sugar documentation from GitHub
+# Provides higher API rate limits (5000 requests/hour vs 60)
+# Generate token at: https://github.com/settings/tokens
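+# Required scopes: none for public repositories (a classic token with no scopes has read-only access to public data; public_repo also grants write access)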
+GITHUB_TOKEN=your_github_personal_access_token
diff --git a/README.md b/README.md
@@ -499,6 +499,233 @@ Review the terminal output for further details and error messages.
 
 When deploying Sugar-AI in CI/CD pipelines, you'll need to configure environment variables properly. Current CI/CD uses github webhooks. So make sure to create a webhook secret and add it to the `.env`.
 
+## Keeping RAG Docs Up to Date
+
+Sugar-AI includes a dynamic document fetching system that keeps the documentation used for RAG (Retrieval-Augmented Generation) current. Instead of manually maintaining static documentation files, you can fetch and index the latest Sugar documentation from GitHub automatically.
+
+### Overview
+
+The dynamic document fetching system:
+- Fetches documentation from Sugar Labs repositories on GitHub
+- Converts markdown to clean plain text (removes headers, HTML tags, etc.)
+- Adds metadata about the source URL and fetch timestamp
+- Rebuilds the vector store for efficient retrieval
+- Supports GitHub API authentication for higher rate limits
+
+### Manual Document Fetching
+
+To fetch and update documentation manually:
+
+#### Basic Usage (No Authentication)
+
+```bash
+python scripts/fetch_sugar_docs.py
+```
+
+This will fetch all configured documentation sources and save them to the `docs/` directory.
+
+#### With GitHub Authentication
+
+For higher API rate limits, you can provide a GitHub personal access token:
+
+```bash
+export GITHUB_TOKEN=your_github_personal_access_token
+python scripts/fetch_sugar_docs.py
+```
+
+To generate a GitHub token:
+1. Go to https://github.com/settings/tokens
+2. Click "Generate new token (classic)"
+3. Leave all scopes unchecked: a token with no scopes has read-only access to public repositories (`public_repo` would also grant write access)
+4. Copy the token and use it as shown above
+
+#### Expected Output
+
+```
+============================================================
+SUGAR-AI DOCUMENT FETCH SUMMARY
+============================================================
+Timestamp: 2026-03-21T10:30:45.123456
+Total documents attempted: 3
+Successfully fetched: 3
+Failed: 0
+
+Fetched documents:
+  ✓ sugar-activity.txt
+  ✓ sugar-activity-tutorial.txt
+  ✓ hello-world-readme.txt
+
+============================================================
+Fetched 3 docs successfully
+```
+
+#### Handling Errors
+
+The script handles common errors gracefully:
+- **404 Not Found**: If a documentation URL no longer exists
+- **Network Failures**: Connection timeouts or network errors
+- **Authentication Issues**: Invalid or expired GitHub tokens
+
+Failed documents are reported in the output, and remaining documents are still fetched and indexed.
+
+### Automated Document Refreshing via API
+
+#### Using the /refresh-docs Endpoint
+
+For automated updates, you can use the `/refresh-docs` endpoint. This requires admin permissions (`can_change_model: true` in your API key configuration).
+
+#### Example Request
+
+```bash
+curl -X POST "http://localhost:8000/refresh-docs" \
+  -H "X-API-Key: sugarai2024"
+```
+
+#### Example Response
+
+```json
+{
+  "status": "success",
+  "docs_refreshed": [
+    "sugar-activity.txt",
+    "sugar-activity-tutorial.txt",
+    "hello-world-readme.txt"
+  ],
+  "vectorstore_rebuilt": true,
+  "timestamp": "2026-03-21T10:45:30.123456",
+  "total_docs_count": 6
+}
+```
+
+#### Using OAuth Authentication
+
+If you're authenticated via OAuth with admin permissions:
+
+```bash
+# Using OAuth session (admin with can_change_model: true)
+curl -X POST "http://localhost:8000/refresh-docs" \
+  -H "Cookie: session=your_session_cookie"
+```
+
+#### Error Handling
+
+If the refresh fails, the endpoint returns an error response:
+
+```json
+{
+  "detail": "Failed to fetch some documents: Document not found (404): https://raw.githubusercontent.com/sugarlabs/sugar-docs/master/src/sugar-activity.md"
+}
+```
+
+### Configuration
+
+The documentation sources are defined in `scripts/fetch_sugar_docs.py`:
+
+```python
+DOCS_TO_FETCH = [
+    {
+        "url": "https://raw.githubusercontent.com/sugarlabs/sugar-docs/master/src/sugar-activity.md",
+        "filename": "sugar-activity.txt"
+    },
+    {
+        "url": "https://raw.githubusercontent.com/sugarlabs/sugar-docs/master/src/sugar-activity-tutorial.md",
+        "filename": "sugar-activity-tutorial.txt"
+    },
+    {
+        "url": "https://raw.githubusercontent.com/sugarlabs/hello-world/master/README.md",
+        "filename": "hello-world-readme.txt"
+    }
+]
+```
+
+To add more documentation sources, edit this list with additional `url` and `filename` pairs.
+
+### Scheduling Document Updates
+
+#### Using Cron (Unix/Linux/macOS)
+
+Schedule automatic document updates daily at 2 AM:
+
+```bash
+0 2 * * * cd /path/to/sugar-ai && GITHUB_TOKEN=your_token python scripts/fetch_sugar_docs.py
+```
+
+#### Using GitHub Actions
+
+Create `.github/workflows/refresh-docs.yml`:
+
+```yaml
+name: Refresh Sugar Docs
+
+on:
+  schedule:
+    # Runs at 2 AM UTC daily
+    - cron: '0 2 * * *'
+  workflow_dispatch:  # Allow manual trigger
+
+jobs:
+  refresh-docs:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v3
+
+      - name: Set up Python
+        uses: actions/setup-python@v4
+        with:
+          python-version: '3.9'
+
+      - name: Install dependencies
+        run: pip install -r requirements.txt
+
+      - name: Fetch and update docs
+        env:
+          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
+        run: python scripts/fetch_sugar_docs.py
+
+      - name: Commit and push updated docs
+        run: |
+          git config --local user.email "action@github.com"
+          git config --local user.name "GitHub Action"
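+          git add docs/
+          # Commit and push only when the fetch changed something; a bare `git commit` fails the job when nothing is staged
+          git diff --cached --quiet || (git commit -m "chore: update Sugar documentation" && git push)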
+```
+
+### Document Format
+
+Each fetched document includes:
+1. **Header**: Source URL and timestamp in format `# Fetched from [url] on [timestamp]`
+2. **Content**: Converted from markdown to plain text
+   - Markdown headers (`#`, `##`, etc.) converted to plain text
+   - HTML tags removed
+   - Excessive whitespace cleaned up
+   - Maintains readability for RAG retrieval
+
+### Troubleshooting
+
+#### Rate Limiting Issues
+
+Without GitHub authentication, you're limited to 60 requests/hour. With authentication:
+- Classic tokens: 5000 requests/hour
+- App tokens: Higher limits depending on configuration
+
+If you see rate limit errors, use GitHub authentication as shown above.
+
+#### Mixed Authentication Methods
+
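+The system uses separate credentials for the `/refresh-docs` endpoint and for GitHub. In order of preference:
+1. **X-API-Key Header**: API key-based access to `/refresh-docs`
+2. **Admin Cookie**: OAuth session with admin permissions for `/refresh-docs`
+3. **`GITHUB_TOKEN` environment variable**: optional and used only for requests to GitHub; without it, documents are fetched unauthenticated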
+
+#### Rebuild Issues
+
+If the vectorstore rebuild fails, check:
+1. Document files exist in the `docs/` directory
+2. Documents are readable and contain valid text
+3. Sufficient system memory for embedding generation
+4. The application log for error details: `tail -f sugar_ai.log`
+
 ## Using the Streamlit App
 
 Sugar-AI also provides a Streamlit-based interface for quick interactions and visualizations.
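The "Document Format" section in the README hunk above describes the output of `scripts/fetch_sugar_docs.py`, which is not part of this diff. As a minimal sketch of that conversion, assuming regex-based cleanup, the steps could look like the following. The names `markdown_to_plain_text` and `build_document` are hypothetical and the real script may differ.

```python
import re
from datetime import datetime


def markdown_to_plain_text(markdown: str) -> str:
    """Hypothetical cleaner: strip HTML tags and heading markers, tidy whitespace."""
    text = re.sub(r"<[^>]+>", "", markdown)                        # remove HTML tags
    text = re.sub(r"^#{1,6}[ \t]*", "", text, flags=re.MULTILINE)  # "## Title" -> "Title"
    text = re.sub(r"[ \t]+", " ", text)                            # collapse runs of spaces/tabs
    text = re.sub(r"\n{3,}", "\n\n", text)                         # keep at most one blank line
    return text.strip()


def build_document(url: str, markdown: str, fetched_at: datetime) -> str:
    """Prepend the '# Fetched from [url] on [timestamp]' header to the cleaned body."""
    header = f"# Fetched from {url} on {fetched_at.isoformat()}"
    return f"{header}\n\n{markdown_to_plain_text(markdown)}\n"
```

A regex pass like this also strips a leading `#` from lines inside code samples, so a real implementation may want to treat fenced blocks separately.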

diff --git a/app/ai.py b/app/ai.py
@@ -141,10 +141,18 @@ def setup_vectorstore(self, file_paths: List[str]) -> Optional[FAISS]:
                 if file_path.endswith(".pdf"):
                     loader = PyMuPDFLoader(file_path)
                 else:
-                    loader = TextLoader(file_path)
+                    loader = TextLoader(file_path, encoding="utf-8")
                 documents = loader.load()
                 all_documents.extend(documents)
 
+        # Filter out documents with minimal content (50 characters or fewer after stripping whitespace)
+        # This removes placeholder, stub, or empty documents that don't add value to RAG
+        # Note: the same threshold applies to every loaded document, including sparse PDF pages
+        all_documents = [doc for doc in all_documents if len(doc.page_content.strip()) > 50]
+
+        if not all_documents:
+            raise ValueError("No valid documents found after filtering. Check that document files contain sufficient content.")
+
         embeddings = HuggingFaceEmbeddings(
             model_name="sentence-transformers/all-MiniLM-L6-v2"
         )
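To make the boundary of the content filter in the hunk above concrete, here is a small standalone sketch. `Doc` is a hypothetical stand-in for the loaded document objects (only `page_content` is used), and `filter_short_documents` is an illustrative helper, not a function in `app/ai.py`.

```python
from dataclasses import dataclass
from typing import List

MIN_CONTENT_CHARS = 50  # mirrors the threshold used in the hunk above


@dataclass
class Doc:
    """Stand-in for a loaded document; only page_content matters here."""
    page_content: str


def filter_short_documents(docs: List[Doc], min_chars: int = MIN_CONTENT_CHARS) -> List[Doc]:
    """Keep documents whose stripped content is strictly longer than min_chars."""
    kept = [d for d in docs if len(d.page_content.strip()) > min_chars]
    if not kept:
        # Same failure mode as the hunk: nothing left to embed
        raise ValueError("No valid documents found after filtering.")
    return kept
```

With this comparison, a document of exactly 50 characters is dropped and one of 51 is kept. Whitespace-only content is always dropped because the length is measured after `strip()`.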

diff --git a/app/routes/api.py b/app/routes/api.py
@@ -10,10 +10,16 @@
 import json
 from datetime import datetime
 from typing import Dict, Optional, List
+import sys
 
 from app.database import get_db, APIKey
 from app.ai import RAGAgent, extract_answer_from_output
 from app.config import settings
+from app.auth import get_current_user
+
+# Import document fetching module
+sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.dirname(__file__))))
+from scripts.fetch_sugar_docs import fetch_all_docs
 
 # Pydantic models for chat completions
 class ChatMessage(BaseModel):
@@ -326,3 +332,67 @@ async def change_model(
     except Exception as e:
         logger.error(f"Error changing model to {model} by {user_info['name']}: {str(e)}")
         raise HTTPException(status_code=500, detail=f"Error changing model: {str(e)}")
+
+@router.post("/refresh-docs")
+async def refresh_docs(
+    user_data: tuple = Depends(get_current_user),
+    db: Session = Depends(get_db),
+    request: Request = None
+):
+    """
+    Refresh documentation by fetching latest docs from GitHub and rebuilding vectorstore.
+    Requires admin permission (can_change_model: true).
+    """
+    # Extract user from tuple (user, authenticated)
+    user, authenticated = user_data
+    client_ip = request.client.host if request else "unknown"
+
+    # Check authentication
+    if not authenticated or not user or not user.can_change_model:
+        logger.warning(f"Unauthorized refresh-docs attempt from {client_ip}")
+        raise HTTPException(status_code=403, detail="Unauthorized. Admin permission required.")
+
+    logger.info(f"REQUEST - /refresh-docs - User: {user.name} - IP: {client_ip}")
+
+    try:
+        timestamp = datetime.now().isoformat()
+
+        # Get GitHub token from environment (optional)
+        github_token = os.getenv("GITHUB_TOKEN")
+
+        # Fetch all documents
+        logger.info("Fetching Sugar documentation from GitHub...")
+        fetch_results = fetch_all_docs(github_token=github_token)
+
+        if not fetch_results["success"]:
+            error_msg = f"Failed to fetch some documents: {', '.join(fetch_results['errors'])}"
+            logger.error(f"Error refreshing docs - User: {user.name} - {error_msg}")
+            raise HTTPException(status_code=500, detail=error_msg)
+
+        # Rebuild vectorstore with all docs
+        logger.info("Rebuilding vectorstore with fetched documents...")
+        doc_paths = settings.DOC_PATHS
+
+        # Add newly fetched docs to the list if they're not already there
+        docs_dir = "docs"
+        fetched_files = [os.path.join(docs_dir, doc) for doc in fetch_results["fetched_docs"]]
+        all_docs = list(dict.fromkeys(doc_paths + fetched_files))  # de-duplicate, keep a stable order
+
+        # Rebuild vectorstore
+        agent.setup_vectorstore(all_docs)
+
+        logger.info(f"SUCCESS - /refresh-docs - User: {user.name} - Fetched {len(fetch_results['fetched_docs'])} docs")
+
+        return {
+            "status": "success",
+            "docs_refreshed": fetch_results["fetched_docs"],
+            "vectorstore_rebuilt": True,
+            "timestamp": timestamp,
+            "total_docs_count": len(all_docs)
+        }
+
+    except HTTPException:
+        raise
+    except Exception as e:
+        logger.error(f"ERROR - /refresh-docs - User: {user.name} - Error: {str(e)}")
+        raise HTTPException(status_code=500, detail=f"Error refreshing documentation: {str(e)}")
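The fetch-then-merge step of `refresh_docs` can be isolated as a pure function, which makes it easier to unit test without FastAPI or a vector store. The sketch below assumes `fetch_all_docs` returns a dict with the `success`, `errors`, and `fetched_docs` keys used in the hunk above. `plan_rebuild` is a hypothetical name, and `RuntimeError` stands in for the `HTTPException` raised by the endpoint.

```python
import os
from typing import Dict, List


def plan_rebuild(fetch_results: Dict, configured_paths: List[str], docs_dir: str = "docs") -> List[str]:
    """Return the de-duplicated list of files to index, or raise if any fetch failed."""
    if not fetch_results["success"]:
        raise RuntimeError(
            "Failed to fetch some documents: " + ", ".join(fetch_results["errors"])
        )
    fetched = [os.path.join(docs_dir, name) for name in fetch_results["fetched_docs"]]
    # dict.fromkeys drops duplicates while keeping first-seen order
    return list(dict.fromkeys(list(configured_paths) + fetched))
```

Paths are compared as plain strings here, so `docs/a.txt` and `./docs/a.txt` would count as different files. Normalizing with `os.path.normpath` before merging is one way to tighten that.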