Description
Feature Request
Current Behavior: The system returns ALL unique pages from retrieved chunks that were provided to the AI as context, regardless of whether the AI actually referenced those pages in its response.
Desired Behavior: Only show pages that the AI actually references in its response, providing more accurate and relevant page citations.
Problem Description
When a user asks a question, the ensemble retriever might fetch chunks from 5-10 different pages to provide context to the AI. However, the AI might only actually use content from 2-3 pages in its answer.
Example Scenario:
- AI receives chunks from pages: [0, 0, 1, 1, 2, 3, 4, 4]
- AI mentions only pages 1 and 2 in its response
- Current: Shows references [0, 1, 2, 3, 4] (all pages)
- Desired: Shows references [1, 2] (only mentioned pages)
Proposed Solutions
Option 1: AI-Generated References (Recommended)
Modify the system prompt to ask the AI to explicitly list pages it references:
const systemPrompt = `${SYSTEM_PROMPTS[selectedStyle]}
After your answer, include a "References:" section listing only the page numbers
you actually referenced in your response. Format as: "References: Page 1, Page 3, Page 5"`;
Pros:
- Potentially the most accurate - the AI reports which pages it used, provided each chunk in the context is labeled with its page number
- Clear and explicit references
- Can be implemented as simple prompt addition
Cons:
- Increases token usage slightly
- Requires consistent parsing of AI responses
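To make the parsing cost in the cons above concrete, here is a minimal sketch of how the "References:" line could be read back out of a response. The helper name `parseReferencesSection` and the `ParsedAnswer` shape are hypothetical, not existing project code, and the sketch assumes the AI follows the "References: Page 1, Page 3" format requested in the prompt.

```typescript
// Hypothetical helper: name and return shape are assumptions for illustration.
interface ParsedAnswer {
  answer: string;         // response text with the References section removed
  pages: number[] | null; // null when no parsable References section was found
}

function parseReferencesSection(response: string): ParsedAnswer {
  // Find a line such as: References: Page 1, Page 3, Page 5
  const line = /^[ \t]*References:\s*(.*)$/im.exec(response);
  if (!line) return { answer: response, pages: null };

  // Collect every "Page N" on that line, de-duplicated
  const pages = new Set<number>();
  const pageRegex = /Page\s+(\d+)/gi;
  let m: RegExpExecArray | null;
  while ((m = pageRegex.exec(line[1] ?? "")) !== null) {
    pages.add(Number(m[1]));
  }

  // A References line without any page numbers is treated as a parse failure
  if (pages.size === 0) return { answer: response, pages: null };

  return {
    answer: response.slice(0, line.index).trimEnd(),
    pages: Array.from(pages).sort((a, b) => a - b),
  };
}
```

Returning `null` rather than an empty array lets the caller distinguish "nothing could be parsed" from a successful parse, which matters if a fallback to the current behavior is wanted.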
Option 2: Post-Processing Analysis
Parse the AI's response to extract page numbers it mentions:
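function extractPageNumbersFromResponse(response: string): number[] {
  // Look for patterns like "page 2", "Page 3", "pages 1-3", etc.
  const pageRegex = /\bpages?\s*(\d+)(?:\s*[-–]\s*(\d+))?/gi;
  const pages = new Set<number>();
  // Extract and parse page numbers, expanding ranges such as "1-3" into 1, 2, 3
  for (const [, start, end] of response.matchAll(pageRegex)) {
    for (let p = Number(start); p <= Number(end ?? start); p++) pages.add(p);
  }
  return [...pages].sort((a, b) => a - b);
}
Pros: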
- Works with existing AI responses
- No prompt changes needed
Cons:
- AI might reference pages inconsistently
- Complex parsing for edge cases ("the previous page", "the section above")
Option 3: Relevance-Based Filtering
Show only the top N most relevant pages based on similarity scores:
// Copy before sorting so the retrieved documents array is not mutated
const topPages = [...documents]
  .sort((a, b) => (a.metadata?.distance ?? 1) - (b.metadata?.distance ?? 1))
  .slice(0, 3)
  .map(doc => doc.metadata?.page);
Pros:
- Simple to implement
- Consistent behavior
Cons:
- Still might show pages AI didn't use
- Arbitrary cutoff
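One detail worth noting for Option 3: several chunks can come from the same page, so taking the top N chunks does not necessarily yield N distinct pages. The sketch below ranks pages rather than chunks by keeping the smallest distance seen per page. The `RetrievedDoc` shape and the function name `topPagesByDistance` are assumptions for illustration; the real retriever's document type may differ.

```typescript
// Assumed document shape; the project's actual type may differ.
interface RetrievedDoc {
  metadata?: { page?: number; distance?: number };
}

function topPagesByDistance(documents: RetrievedDoc[], limit = 3): number[] {
  // Track the best (smallest) distance observed for each page
  const bestDistance = new Map<number, number>();
  for (const doc of documents) {
    const page = doc.metadata?.page;
    if (page === undefined) continue; // skip chunks without page metadata
    const distance = doc.metadata?.distance ?? 1;
    const previous = bestDistance.get(page);
    if (previous === undefined || distance < previous) {
      bestDistance.set(page, distance);
    }
  }

  // Sort pages by their best distance and keep the closest `limit` pages
  return Array.from(bestDistance.entries())
    .sort((a, b) => a[1] - b[1])
    .slice(0, limit)
    .map(([page]) => page);
}
```

The cutoff is still arbitrary, as listed in the cons, but the result is at least a list of unique pages.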
Recommended Implementation
Start with Option 1 (AI-Generated References) with fallback to current behavior:
- Modify system prompt to request explicit references
- Parse references from AI response
- If parsing fails, fall back to current behavior (all pages)
- Add user preference for reference detail level
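The parse-then-fall-back flow in the list above could look roughly like the following. This is a sketch under assumptions: `resolveReferencePages` is a hypothetical name, the response is assumed to use the "References: Page N" format, and discarding cited pages that were never in the context is an extra safeguard not described in this issue.

```typescript
// Hypothetical helper; names are illustrative, not existing project code.
function resolveReferencePages(response: string, contextPages: number[]): number[] {
  // Current behavior: every unique page that was supplied as context
  const allPages = Array.from(new Set(contextPages)).sort((a, b) => a - b);

  // Try to find the References line requested by the system prompt
  const line = /^[ \t]*References:\s*(.*)$/im.exec(response);
  if (!line) return allPages; // parsing failed: fall back to all pages

  const cited = new Set<number>();
  const pageRegex = /Page\s+(\d+)/gi;
  let m: RegExpExecArray | null;
  while ((m = pageRegex.exec(line[1] ?? "")) !== null) {
    cited.add(Number(m[1]));
  }

  // Assumption: ignore cited pages that were not part of the context
  const valid = allPages.filter((page) => cited.has(page));
  return valid.length > 0 ? valid : allPages;
}
```

With the chunks from the example scenario, `resolveReferencePages("...\nReferences: Page 1, Page 2", [0, 0, 1, 1, 2, 3, 4, 4])` yields `[1, 2]`, while a response without a References line yields `[0, 1, 2, 3, 4]`.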
Implementation Tasks
- Update system prompts to request explicit page references
- Implement reference parsing with multiple format support
- Add fallback logic for failed parsing
- Update frontend to handle new reference format
- Add test cases for various reference formats
- Add user preference setting for reference detail
Test Cases
- AI mentions specific pages: "According to page 2..." → Should show [2]
- AI mentions ranges: "See pages 1-3" → Should show [1, 2, 3]
- AI lists multiple: "Pages 0, 2, and 5 contain relevant info" → Should show [0, 2, 5]
- AI doesn't mention pages: Uses content implicitly → Fall back to current behavior
- AI uses inconsistent format: "the previous section" → Fall back to current behavior
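The cases above can be expressed as a table-driven check. The extractor below is a hypothetical sketch, not the project's implementation: it handles single pages, ranges, and comma/"and" lists, and returns `null` to signal that the caller should fall back to the current behavior.

```typescript
// Hypothetical extractor; null means "fall back to current behavior".
function extractMentionedPages(response: string): number[] | null {
  const pages = new Set<number>();
  // "page(s)" followed by numbers or ranges: "2", "1-3", "0, 2, and 5"
  const mention =
    /\bpages?\s+(\d+(?:\s*[-–]\s*\d+)?(?:\s*(?:,\s*and|,|and|&)\s*\d+(?:\s*[-–]\s*\d+)?)*)/gi;
  let m: RegExpExecArray | null;
  while ((m = mention.exec(response)) !== null) {
    // Walk each number or range inside the matched list
    const item = /(\d+)(?:\s*[-–]\s*(\d+))?/g;
    let part: RegExpExecArray | null;
    while ((part = item.exec(m[1] ?? "")) !== null) {
      const start = Number(part[1]);
      const end = part[2] !== undefined ? Number(part[2]) : start;
      for (let p = start; p <= end; p++) pages.add(p);
    }
  }
  return pages.size > 0 ? Array.from(pages).sort((a, b) => a - b) : null;
}

// The test cases from this issue as data: [response, expected result]
const referenceCases: Array<[string, number[] | null]> = [
  ["According to page 2...", [2]],
  ["See pages 1-3", [1, 2, 3]],
  ["Pages 0, 2, and 5 contain relevant info", [0, 2, 5]],
  ["The answer follows from the text.", null],
  ["As noted in the previous section", null],
];

for (const [response, expected] of referenceCases) {
  const actual = extractMentionedPages(response);
  console.assert(
    JSON.stringify(actual) === JSON.stringify(expected),
    `unexpected result for "${response}"`,
  );
}
```

Keeping the cases as data makes it straightforward to add further formats as they show up in real responses.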
Success Metrics
- Accuracy: % of times shown pages match AI-referenced pages
- User Satisfaction: Feedback on reference relevance
- Performance: No significant increase in response time
- Reliability: Consistent reference extraction across responses
Related Issues
- #89: Integrate DataLab native pagination for OCR processing (completed)
- This issue builds on the accurate page metadata from #89
Priority
Medium: This is a UX improvement. The current behavior can show too many pages, making it unclear which ones are actually relevant. Showing only the pages the AI referenced, instead of every page provided as context, would make the citations more useful and less confusing for users.