-
Notifications
You must be signed in to change notification settings - Fork 608
fix: resolve critical graph extraction and audio transcription bugs #2261
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
fix: resolve critical graph extraction and audio transcription bugs #2261
Conversation
- Fix graph extraction message formatting bug that prevented entity extraction * Handle both dict and message object formats in debug logging * Resolves 'dict object has no attribute role' error * Enables successful extraction with GPT-5, GPT-4o and other modern models - Fix audio transcription parameter filtering * Filter out text chunking parameters from audio API calls * Resolves 'Invalid chunking_strategy' error with gpt-4o-mini-transcribe * Enables successful audio transcription with new OpenAI models - Add comprehensive debug logging for troubleshooting extraction issues These are critical bug fixes that enable R2R to work properly with modern OpenAI models including GPT-5 and GPT-4o variants.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Caution
Changes requested ❌
Reviewed everything up to 4c5d56d in 1 minute and 43 seconds. Click for details.
- Reviewed
164lines of code in2files - Skipped
0files when reviewing. - Skipped posting
1draft comments. View those below. - Modify your settings and rules to customize what types of comments Ellipsis leaves. And don't forget to react with 👍 or 👎 to teach Ellipsis.
1. py/core/parsers/media/audio_parser.py:75
- Draft comment:
The file opened with open(temp_file_path, 'rb') is not explicitly closed. Consider using a 'with' statement to ensure the file handle is properly closed. - Reason this comment was not posted:
Comment was on unchanged code.
Workflow ID: wflow_vVg3QbWGZXgezhTm
You can customize by changing your verbosity settings, reacting with 👍 or 👎, replying to comments, or adding code review rules.
| for attempt in range(retries): | ||
| try: | ||
| # DEBUG LOGGING: Log the exact prompt being sent | ||
| logger.info(f"GRAPH EXTRACTION DEBUG: Sending prompt to LLM (attempt {attempt + 1})") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider using logger.debug instead of logger.info for verbose debug logging (e.g. logging the prompt details) to avoid cluttering production logs.
| logger.info(f"GRAPH EXTRACTION DEBUG: Sending prompt to LLM (attempt {attempt + 1})") | |
| logger.debug(f"GRAPH EXTRACTION DEBUG: Sending prompt to LLM (attempt {attempt + 1})") |
|
|
||
| cleaned_xml = sanitize_xml(response_str) | ||
| logger.info(f"GRAPH EXTRACTION DEBUG: Cleaned XML length: {len(cleaned_xml)}") | ||
| logger.info(f"GRAPH EXTRACTION DEBUG: Cleaned XML content: '{cleaned_xml}'") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Verbose logging of XML content (e.g. cleaned and wrapped XML) may expose sensitive data; consider using debug level or gating these logs behind a debug flag.
| logger.info(f"GRAPH EXTRACTION DEBUG: Cleaned XML content: '{cleaned_xml}'") | |
| logger.debug(f"GRAPH EXTRACTION DEBUG: Cleaned XML content: '{cleaned_xml}'") |
- Untrack docker/env/r2r-full.env to prevent API key exposure - Enhanced .gitignore to block sensitive files - File remains local for development but won't be committed
✨ Features Added: • Hierarchical chunking with parent-child linking for better context • Enhanced spreadsheet processing with narrative + structured data storage • Tool-augmented orchestration with automatic Text-to-SQL queries • Citation system with deep links and confidence scoring • Web search integration with smart fallback and user controls • Supabase integration with enhanced schema and RLS policies • MCP server for standardized API access across applications 🤖 AI Model Support: • GPT-5, O3-mini, Claude-3.7-Sonnet integration • Advanced query strategies: RAG Fusion, HyDE • Multi-modal processing: text, images, audio, spreadsheets • High-quality embeddings (3072 dimensions) 🛠️ Developer Experience: • One-command setup with ./setup-new-project.sh • Comprehensive documentation and examples • Security hardened with proper .gitignore and templates • Production-ready configuration • Complete test suite 🏗️ Architecture: • Bug fixes for graph extraction and audio transcription • Enhanced metadata providers for better citations • Supabase-optimized database schema • MCP integration for frontend applications • Ellen V2 project documentation and planning This template now provides enterprise-grade RAG capabilities that surpass the original Ellen V2 specification, with complete source transparency, intelligent fallbacks, and standardized API access.
🐛 Bug Fixes
This PR resolves two critical bugs that prevent R2R from working with modern OpenAI models:
1. Graph Extraction Message Formatting Bug
2. Audio Transcription Parameter Filtering Bug
3. Enhanced Debug Logging
🧪 Testing
📊 Impact
These fixes are critical for R2R compatibility with:
Without these fixes, graph extraction and audio transcription fail completely with newer models.
🔍 Files Changed
✅ Checklist
Important
Fixes critical bugs in graph extraction and audio transcription, adding enhanced logging for compatibility with modern OpenAI models.
_extract_graph_search_results_from_chunk_group()ingraph_service.pyto handle both dict and message object formats.ingest()inaudio_parser.pyto exclude text-specific parameters.graph_service.pyandaudio_parser.pyfor troubleshooting extraction and parsing issues.This description was created by
for 4c5d56d. You can customize this summary. It will automatically update as commits are pushed.