Skip to content

Conversation

@chrisgscott
Copy link

@chrisgscott chrisgscott commented Sep 22, 2025

🐛 Bug Fixes

This PR resolves two critical bugs that prevent R2R from working with modern OpenAI models:

1. Graph Extraction Message Formatting Bug

  • Issue: Graph extraction was failing with error
  • Root Cause: Debug logging code expected message objects but received dictionaries
  • Fix: Added proper handling for both dict and message object formats
  • Impact: Enables graph extraction to work with GPT-5, GPT-4o, and other modern models

2. Audio Transcription Parameter Filtering Bug

  • Issue: Audio transcription failing with error
  • Root Cause: Text chunking parameters were being passed to audio transcription API
  • Fix: Filter out text-specific parameters before calling audio transcription
  • Impact: Enables audio transcription with gpt-4o-mini-transcribe and other new models

3. Enhanced Debug Logging

  • Added comprehensive logging for troubleshooting extraction and parsing issues
  • Logs exact LLM responses and XML parsing steps
  • Helps identify and resolve future compatibility issues

🧪 Testing

  • ✅ Tested with GPT-5-mini for graph extraction
  • ✅ Tested with gpt-4o-mini-transcribe for audio transcription
  • ✅ Verified entity and relationship extraction works properly
  • ✅ Confirmed audio transcription processes successfully

📊 Impact

These fixes are critical for R2R compatibility with:

  • GPT-5 and GPT-5-mini models
  • GPT-4o and GPT-4o-mini variants
  • Modern OpenAI transcription models

Without these fixes, graph extraction and audio transcription fail completely with newer models.

🔍 Files Changed

    • Fixed message formatting and added debug logging
    • Fixed parameter filtering and added debug logging

✅ Checklist

  • Bug fixes tested and verified
  • No breaking changes to existing functionality
  • Enhanced logging for better troubleshooting
  • Compatible with both old and new OpenAI models

Important

Fixes critical bugs in graph extraction and audio transcription, adding enhanced logging for compatibility with modern OpenAI models.

  • Graph Extraction Bug:
    • Fixed message formatting in _extract_graph_search_results_from_chunk_group() in graph_service.py to handle both dict and message object formats.
    • Added detailed debug logging for message types, content, and LLM responses.
  • Audio Transcription Bug:
    • Fixed parameter filtering in ingest() in audio_parser.py to exclude text-specific parameters.
    • Added detailed debug logging for audio transcription process and response handling.
  • Enhanced Debug Logging:
    • Added comprehensive logging in both graph_service.py and audio_parser.py for troubleshooting extraction and parsing issues.
    • Logs include LLM responses, XML parsing steps, and audio transcription details.

This description was created by Ellipsis for 4c5d56d. You can customize this summary. It will automatically update as commits are pushed.

- Fix graph extraction message formatting bug that prevented entity extraction
  * Handle both dict and message object formats in debug logging
  * Resolves 'dict object has no attribute role' error
  * Enables successful extraction with GPT-5, GPT-4o and other modern models

- Fix audio transcription parameter filtering
  * Filter out text chunking parameters from audio API calls
  * Resolves 'Invalid chunking_strategy' error with gpt-4o-mini-transcribe
  * Enables successful audio transcription with new OpenAI models

- Add comprehensive debug logging for troubleshooting extraction issues

These are critical bug fixes that enable R2R to work properly with
modern OpenAI models including GPT-5 and GPT-4o variants.
Copy link
Contributor

@ellipsis-dev ellipsis-dev bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Changes requested ❌

Reviewed everything up to 4c5d56d in 1 minute and 43 seconds. Click for details.
  • Reviewed 164 lines of code in 2 files
  • Skipped 0 files when reviewing.
  • Skipped posting 1 draft comments. View those below.
  • Modify your settings and rules to customize what types of comments Ellipsis leaves. And don't forget to react with 👍 or 👎 to teach Ellipsis.
1. py/core/parsers/media/audio_parser.py:75
  • Draft comment:
    The file opened with open(temp_file_path, 'rb') is not explicitly closed. Consider using a 'with' statement to ensure the file handle is properly closed.
  • Reason this comment was not posted:
    Comment was on unchanged code.

Workflow ID: wflow_vVg3QbWGZXgezhTm

You can customize Ellipsis by changing your verbosity settings, reacting with 👍 or 👎, replying to comments, or adding code review rules.

for attempt in range(retries):
try:
# DEBUG LOGGING: Log the exact prompt being sent
logger.info(f"GRAPH EXTRACTION DEBUG: Sending prompt to LLM (attempt {attempt + 1})")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider using logger.debug instead of logger.info for verbose debug logging (e.g. logging the prompt details) to avoid cluttering production logs.

Suggested change
logger.info(f"GRAPH EXTRACTION DEBUG: Sending prompt to LLM (attempt {attempt + 1})")
logger.debug(f"GRAPH EXTRACTION DEBUG: Sending prompt to LLM (attempt {attempt + 1})")


cleaned_xml = sanitize_xml(response_str)
logger.info(f"GRAPH EXTRACTION DEBUG: Cleaned XML length: {len(cleaned_xml)}")
logger.info(f"GRAPH EXTRACTION DEBUG: Cleaned XML content: '{cleaned_xml}'")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Verbose logging of XML content (e.g. cleaned and wrapped XML) may expose sensitive data; consider using debug level or gating these logs behind a debug flag.

Suggested change
logger.info(f"GRAPH EXTRACTION DEBUG: Cleaned XML content: '{cleaned_xml}'")
logger.debug(f"GRAPH EXTRACTION DEBUG: Cleaned XML content: '{cleaned_xml}'")

- Untrack docker/env/r2r-full.env to prevent API key exposure
- Enhanced .gitignore to block sensitive files
- File remains local for development but won't be committed
✨ Features Added:
• Hierarchical chunking with parent-child linking for better context
• Enhanced spreadsheet processing with narrative + structured data storage
• Tool-augmented orchestration with automatic Text-to-SQL queries
• Citation system with deep links and confidence scoring
• Web search integration with smart fallback and user controls
• Supabase integration with enhanced schema and RLS policies
• MCP server for standardized API access across applications

🤖 AI Model Support:
• GPT-5, O3-mini, Claude-3.7-Sonnet integration
• Advanced query strategies: RAG Fusion, HyDE
• Multi-modal processing: text, images, audio, spreadsheets
• High-quality embeddings (3072 dimensions)

🛠️ Developer Experience:
• One-command setup with ./setup-new-project.sh
• Comprehensive documentation and examples
• Security hardened with proper .gitignore and templates
• Production-ready configuration
• Complete test suite

🏗️ Architecture:
• Bug fixes for graph extraction and audio transcription
• Enhanced metadata providers for better citations
• Supabase-optimized database schema
• MCP integration for frontend applications
• Ellen V2 project documentation and planning

This template now provides enterprise-grade RAG capabilities that surpass
the original Ellen V2 specification, with complete source transparency,
intelligent fallbacks, and standardized API access.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant