fix: resolve critical graph extraction and audio transcription bugs #2261

chrisgscott · 2025-09-22T22:58:17Z

🐛 Bug Fixes

This PR resolves two critical bugs that prevent R2R from working with modern OpenAI models, and adds debug logging to help diagnose similar issues:

1. Graph Extraction Message Formatting Bug

Issue: Graph extraction was failing with a 'dict object has no attribute role' error
Root Cause: Debug logging code expected message objects but received dictionaries
Fix: Added proper handling for both dict and message object formats
Impact: Enables graph extraction to work with GPT-5, GPT-4o, and other modern models
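
A minimal sketch of the kind of handling described above, assuming messages arrive either as plain dicts or as objects exposing `role` and `content` attributes. The `Message` class and the `describe_message` helper are hypothetical names used for illustration; they are not the code in graph_service.py.

```python
from dataclasses import dataclass
from typing import Any, Tuple


@dataclass
class Message:
    """Hypothetical stand-in for a message object with role/content attributes."""
    role: str
    content: str


def describe_message(message: Any) -> Tuple[str, str]:
    """Return (role, content) whether the message is a dict or an object."""
    if isinstance(message, dict):
        # Dict-style messages: use key lookup with safe defaults.
        return message.get("role", "unknown"), message.get("content") or ""
    # Object-style messages: fall back to attribute access.
    return getattr(message, "role", "unknown"), getattr(message, "content", "") or ""
```

Logging code that goes through a helper like this no longer raises when it receives a dict instead of a message object.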

2. Audio Transcription Parameter Filtering Bug

Issue: Audio transcription was failing with an 'Invalid chunking_strategy' error
Root Cause: Text chunking parameters were being passed to audio transcription API
Fix: Filter out text-specific parameters before calling audio transcription
Impact: Enables audio transcription with gpt-4o-mini-transcribe and other new models
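
The filtering step could look roughly like the sketch below. Only `chunking_strategy` is taken from the error message; the other key names, the `TEXT_ONLY_KEYS` constant, and the `filter_transcription_kwargs` helper are assumptions for illustration and may not match what audio_parser.py actually filters.

```python
from typing import Any, Dict

# Keys meant only for text chunking. "chunking_strategy" comes from the
# reported error; the other two names are assumptions for illustration.
TEXT_ONLY_KEYS = frozenset({"chunking_strategy", "chunk_size", "chunk_overlap"})


def filter_transcription_kwargs(kwargs: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of kwargs without text-chunking parameters."""
    return {k: v for k, v in kwargs.items() if k not in TEXT_ONLY_KEYS}
```

Building a new dict rather than deleting keys in place leaves the caller's configuration untouched, so the same settings can still drive text chunking elsewhere.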

3. Enhanced Debug Logging

Added comprehensive logging for troubleshooting extraction and parsing issues
Logs exact LLM responses and XML parsing steps
Helps identify and resolve future compatibility issues

🧪 Testing

✅ Tested with GPT-5-mini for graph extraction
✅ Tested with gpt-4o-mini-transcribe for audio transcription
✅ Verified entity and relationship extraction works properly
✅ Confirmed audio transcription processes successfully

📊 Impact

These fixes are critical for R2R compatibility with:

GPT-5 and GPT-5-mini models
GPT-4o and GPT-4o-mini variants
Modern OpenAI transcription models

Without these fixes, graph extraction and audio transcription fail completely with newer models.

🔍 Files Changed

- py/core/main/services/graph_service.py: Fixed message formatting and added debug logging
- py/core/parsers/media/audio_parser.py: Fixed parameter filtering and added debug logging

✅ Checklist

Bug fixes tested and verified
No breaking changes to existing functionality
Enhanced logging for better troubleshooting
Compatible with both old and new OpenAI models

Important

Fixes critical bugs in graph extraction and audio transcription, adding enhanced logging for compatibility with modern OpenAI models.

Graph Extraction Bug:
- Fixed message formatting in _extract_graph_search_results_from_chunk_group() in graph_service.py to handle both dict and message object formats.
- Added detailed debug logging for message types, content, and LLM responses.
Audio Transcription Bug:
- Fixed parameter filtering in ingest() in audio_parser.py to exclude text-specific parameters.
- Added detailed debug logging for audio transcription process and response handling.
Enhanced Debug Logging:
- Added comprehensive logging in both graph_service.py and audio_parser.py for troubleshooting extraction and parsing issues.
- Logs include LLM responses, XML parsing steps, and audio transcription details.

This description was created by Ellipsis for 4c5d56d. It will automatically update as commits are pushed.

- Fix graph extraction message formatting bug that prevented entity extraction
  * Handle both dict and message object formats in debug logging
  * Resolves 'dict object has no attribute role' error
  * Enables successful extraction with GPT-5, GPT-4o and other modern models
- Fix audio transcription parameter filtering
  * Filter out text chunking parameters from audio API calls
  * Resolves 'Invalid chunking_strategy' error with gpt-4o-mini-transcribe
  * Enables successful audio transcription with new OpenAI models
- Add comprehensive debug logging for troubleshooting extraction issues

These are critical bug fixes that enable R2R to work properly with modern OpenAI models including GPT-5 and GPT-4o variants.

ellipsis-dev

Caution

Changes requested ❌

Reviewed everything up to 4c5d56d in 1 minute and 43 seconds.

Reviewed 164 lines of code in 2 files
Skipped 0 files when reviewing.
Skipped posting 1 draft comment:

1. py/core/parsers/media/audio_parser.py:75

Draft comment:
The file opened with open(temp_file_path, 'rb') is not explicitly closed. Consider using a 'with' statement to ensure the file handle is properly closed.
Reason this comment was not posted:
Comment was on unchanged code.
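
For reference, the pattern the draft comment suggests can be sketched as follows. The `transcribe_file` helper and its `transcribe` callable are hypothetical; the callable stands in for whatever transcription call audio_parser.py makes.

```python
from typing import BinaryIO, Callable


def transcribe_file(temp_file_path: str, transcribe: Callable[[BinaryIO], str]) -> str:
    """Open the audio file, pass the handle to `transcribe`, and always close it."""
    # The with-block closes the handle even if `transcribe` raises.
    with open(temp_file_path, "rb") as audio_file:
        return transcribe(audio_file)
```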

Workflow ID: wflow_vVg3QbWGZXgezhTm

ellipsis-dev · 2025-09-22T23:00:04Z

py/core/main/services/graph_service.py

        for attempt in range(retries):
            try:
+                # DEBUG LOGGING: Log the exact prompt being sent
+                logger.info(f"GRAPH EXTRACTION DEBUG: Sending prompt to LLM (attempt {attempt + 1})")


Consider using logger.debug instead of logger.info for verbose debug logging (e.g. logging the prompt details) to avoid cluttering production logs.

Suggested change

-logger.info(f"GRAPH EXTRACTION DEBUG: Sending prompt to LLM (attempt {attempt + 1})")
+logger.debug(f"GRAPH EXTRACTION DEBUG: Sending prompt to LLM (attempt {attempt + 1})")

ellipsis-dev · 2025-09-22T23:00:05Z

py/core/main/services/graph_service.py

+
        cleaned_xml = sanitize_xml(response_str)
+        logger.info(f"GRAPH EXTRACTION DEBUG: Cleaned XML length: {len(cleaned_xml)}")
+        logger.info(f"GRAPH EXTRACTION DEBUG: Cleaned XML content: '{cleaned_xml}'")


Verbose logging of XML content (e.g. cleaned and wrapped XML) may expose sensitive data; consider using debug level or gating these logs behind a debug flag.

Suggested change

-logger.info(f"GRAPH EXTRACTION DEBUG: Cleaned XML content: '{cleaned_xml}'")
+logger.debug(f"GRAPH EXTRACTION DEBUG: Cleaned XML content: '{cleaned_xml}'")
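
One way to combine both review suggestions (debug level plus gating) is sketched below. The logger name, the `log_cleaned_xml` helper, and the `max_chars` truncation are assumptions for illustration, not part of this PR.

```python
import logging

# Hypothetical logger name for this example.
logger = logging.getLogger("graph_extraction_example")


def log_cleaned_xml(cleaned_xml: str, max_chars: int = 200) -> None:
    """Log XML details at DEBUG level only, with a truncated preview."""
    # %-style arguments defer string formatting until the record is emitted.
    logger.debug("Cleaned XML length: %d", len(cleaned_xml))
    # Skip building the preview entirely unless DEBUG is enabled.
    if logger.isEnabledFor(logging.DEBUG):
        preview = cleaned_xml[:max_chars]
        logger.debug("Cleaned XML preview: %r", preview)
```

With the logger at INFO or above, nothing is emitted, so full LLM responses stay out of production logs unless debugging is switched on deliberately.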

✨ Features Added:
• Hierarchical chunking with parent-child linking for better context
• Enhanced spreadsheet processing with narrative + structured data storage
• Tool-augmented orchestration with automatic Text-to-SQL queries
• Citation system with deep links and confidence scoring
• Web search integration with smart fallback and user controls
• Supabase integration with enhanced schema and RLS policies
• MCP server for standardized API access across applications

🤖 AI Model Support:
• GPT-5, O3-mini, Claude-3.7-Sonnet integration
• Advanced query strategies: RAG Fusion, HyDE
• Multi-modal processing: text, images, audio, spreadsheets
• High-quality embeddings (3072 dimensions)

🛠️ Developer Experience:
• One-command setup with ./setup-new-project.sh
• Comprehensive documentation and examples
• Security hardened with proper .gitignore and templates
• Production-ready configuration
• Complete test suite

🏗️ Architecture:
• Bug fixes for graph extraction and audio transcription
• Enhanced metadata providers for better citations
• Supabase-optimized database schema
• MCP integration for frontend applications
• Ellen V2 project documentation and planning

This template now provides enterprise-grade RAG capabilities that surpass the original Ellen V2 specification, with complete source transparency, intelligent fallbacks, and standardized API access.

ellipsis-dev bot reviewed Sep 22, 2025

chrisgscott added 2 commits September 22, 2025 17:11

Remove sensitive env file from tracking and enhance gitignore

d87231a

- Untrack docker/env/r2r-full.env to prevent API key exposure
- Enhanced .gitignore to block sensitive files
- File remains local for development but won't be committed
