Enhance reference pages to show only AI-mentioned pages instead of all retrieved pages #90

@kien-ship-it

Feature Request

Current Behavior: The system returns ALL unique pages from retrieved chunks that were provided to the AI as context, regardless of whether the AI actually referenced those pages in its response.

Desired Behavior: Only show pages that the AI actually references in its response, providing more accurate and relevant page citations.

Problem Description

When a user asks a question, the ensemble retriever might fetch chunks from 5-10 different pages to provide context to the AI. However, the AI might only actually use content from 2-3 pages in its answer.

Example Scenario:

  • AI receives chunks from pages: [0, 0, 1, 1, 2, 3, 4, 4]
  • AI mentions only pages 1 and 2 in its response
  • Current: Shows references [0, 1, 2, 3, 4] (all pages)
  • Desired: Shows references [1, 2] (only mentioned pages)
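The example scenario can be expressed as a small sketch. The function names `currentReferences` and `desiredReferences` are hypothetical and not taken from the codebase; how the list of mentioned pages is obtained is what the options below address.

```typescript
// Current behavior: every unique page that was passed to the AI as context.
function currentReferences(chunkPages: number[]): number[] {
  return Array.from(new Set(chunkPages)).sort((a, b) => a - b);
}

// Desired behavior: keep only the pages the AI actually mentioned.
// `mentionedPages` is assumed to come from one of the proposed solutions.
function desiredReferences(chunkPages: number[], mentionedPages: number[]): number[] {
  const mentioned = new Set(mentionedPages);
  return currentReferences(chunkPages).filter((page) => mentioned.has(page));
}

// Page numbers from the example scenario.
const exampleChunkPages: number[] = [0, 0, 1, 1, 2, 3, 4, 4];
const exampleMentionedPages: number[] = [1, 2];

console.log(currentReferences(exampleChunkPages));
console.log(desiredReferences(exampleChunkPages, exampleMentionedPages));
```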

Proposed Solutions

Option 1: AI-Generated References (Recommended)

Modify the system prompt to ask the AI to explicitly list pages it references:

const systemPrompt = `${SYSTEM_PROMPTS[selectedStyle]}

After your answer, include a "References:" section listing only the page numbers 
you actually referenced in your response. Format as: "References: Page 1, Page 3, Page 5"`;

Pros:

  • Potentially the most accurate - the AI reports which pages it drew on, though self-reported citations are not guaranteed to be correct
  • Clear and explicit references
  • Can be implemented as simple prompt addition

Cons:

  • Increases token usage slightly
  • Requires consistent parsing of AI responses
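As a rough illustration of the parsing that Option 1 requires, the sketch below splits a response into the answer text and the page list. It assumes the AI follows the "References: Page 1, Page 3, Page 5" format from the prompt; `parseReferencesSection` and `ParsedResponse` are hypothetical names, not existing code.

```typescript
interface ParsedResponse {
  answer: string;         // response text with the references line removed
  pages: number[] | null; // null when no usable "References:" line was found
}

function parseReferencesSection(response: string): ParsedResponse {
  // Look for a line that starts with "References:" (case-insensitive).
  const match = response.match(/^\s*References:\s*(.*)$/im);
  if (!match || match.index === undefined) {
    return { answer: response.trim(), pages: null };
  }
  // Collect every number on that line, de-duplicated and sorted.
  const numbers = match[1].match(/\d+/g) ?? [];
  const pages = Array.from(new Set(numbers.map(Number))).sort((a, b) => a - b);
  return {
    answer: response.slice(0, match.index).trim(),
    pages: pages.length > 0 ? pages : null,
  };
}
```

Returning `null` rather than an empty array lets the caller tell "the AI listed nothing parseable" apart from a successful parse and decide on a fallback.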

Option 2: Post-Processing Analysis

Parse the AI's response to extract page numbers it mentions:

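function extractPageNumbersFromResponse(response: string): number[] {
  // Match "page 2", "Page 3", "pages 1-3"; comma-separated lists need extra handling
  const pageRegex = /\bpages?\s*(\d+)(?:\s*-\s*(\d+))?/gi;
  const pages = new Set<number>();
  for (const [, first, last] of response.matchAll(pageRegex)) {
    // Expand a range such as "pages 1-3"; a single page is a range of one
    for (let p = Number(first); p <= Number(last ?? first); p++) pages.add(p);
  }
  return Array.from(pages).sort((a, b) => a - b);
}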

Pros:

  • Works with existing AI responses
  • No prompt changes needed

Cons:

  • AI might reference pages inconsistently
  • Complex parsing for edge cases ("the previous page", "the section above")

Option 3: Relevance-Based Filtering

Show only the top N most relevant pages based on similarity scores:

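const topPages = [...new Set(
  [...documents] // copy first: sort() would otherwise reorder the retrieved list in place
    .sort((a, b) => (a.metadata?.distance ?? 1) - (b.metadata?.distance ?? 1))
    .map(doc => doc.metadata?.page)
    .filter((page): page is number => typeof page === "number")
)].slice(0, 3); // the 3 closest distinct pages, not just the 3 closest chunks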

Pros:

  • Simple to implement
  • Consistent behavior

Cons:

  • Still might show pages AI didn't use
  • Arbitrary cutoff

Recommended Implementation

Start with Option 1 (AI-Generated References) with fallback to current behavior:

  1. Modify system prompt to request explicit references
  2. Parse references from AI response
  3. If parsing fails, fall back to current behavior (all pages)
  4. Add user preference for reference detail level
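Steps 2 and 3 could be combined along the following lines. This is a minimal sketch, not the project's implementation: `selectReferencePages` is a hypothetical helper, and discarding pages that were never supplied as context is an extra assumption that goes beyond what the steps describe.

```typescript
// parsedPages: output of the reference parser (null or empty when parsing failed).
// retrievedPages: page numbers of all chunks passed to the AI as context.
function selectReferencePages(
  parsedPages: number[] | null,
  retrievedPages: number[],
): number[] {
  const allPages = Array.from(new Set(retrievedPages)).sort((a, b) => a - b);

  // Step 3: parsing failed, so keep the current behavior (all pages).
  if (!parsedPages || parsedPages.length === 0) return allPages;

  // Assumption: ignore page numbers that were not part of the context,
  // since the AI cannot legitimately have cited them.
  const retrieved = new Set(allPages);
  const valid = Array.from(new Set(parsedPages))
    .filter((page) => retrieved.has(page))
    .sort((a, b) => a - b);

  // If nothing valid remains, treat it like a parsing failure.
  return valid.length > 0 ? valid : allPages;
}
```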

Implementation Tasks

  • Update system prompts to request explicit page references
  • Implement reference parsing with multiple format support
  • Add fallback logic for failed parsing
  • Update frontend to handle new reference format
  • Add test cases for various reference formats
  • Add user preference setting for reference detail

Test Cases

  • AI mentions specific pages: "According to page 2..." → Should show [2]
  • AI mentions ranges: "See pages 1-3" → Should show [1, 2, 3]
  • AI lists multiple: "Pages 0, 2, and 5 contain relevant info" → Should show [0, 2, 5]
  • AI doesn't mention pages: Uses content implicitly → Fall back to current behavior
  • AI uses inconsistent format: "the previous section" → Fall back to current behavior
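One way an extractor might cover the formats in these test cases is sketched below. `extractMentionedPages` is a hypothetical name, and the pattern is deliberately narrow: it handles single pages, hyphenated ranges, and lists joined by commas or "and". An empty result is the signal for the caller to fall back to current behavior.

```typescript
function extractMentionedPages(response: string): number[] {
  // "page"/"pages" followed by one or more numbers or ranges,
  // separated by ",", "and", or ", and".
  const mention =
    /\bpages?\s+\d+(?:\s*-\s*\d+)?(?:\s*(?:,\s*and|,|and)\s*\d+(?:\s*-\s*\d+)?)*/gi;
  const pages = new Set<number>();

  for (const phrase of response.match(mention) ?? []) {
    // Each item is either a single number ("2") or a range ("1-3").
    for (const item of phrase.match(/\d+(?:\s*-\s*\d+)?/g) ?? []) {
      const [start, end = start] = item.split("-").map(Number);
      for (let p = start; p <= end; p++) pages.add(p);
    }
  }
  return Array.from(pages).sort((a, b) => a - b);
}
```

A number that happens to follow "and" after a page mention (for example a quantity) would be picked up as a page, so the pattern may need tightening against real responses.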

Success Metrics

  • Accuracy: Percentage of responses in which the shown pages match the AI-referenced pages
  • User Satisfaction: Feedback on reference relevance
  • Performance: No significant increase in response time
  • Reliability: Consistent reference extraction across responses
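The accuracy metric could be computed roughly as follows, assuming a set of evaluated responses for which the truly referenced pages are known (for example, labeled by hand). `ReferenceSample` and `referenceAccuracy` are hypothetical names, and exact set equality is only one possible definition of "match".

```typescript
interface ReferenceSample {
  shown: number[];      // pages displayed to the user
  referenced: number[]; // pages the AI actually referenced (assumed to be labeled)
}

// Percentage of samples whose shown pages equal the referenced pages,
// ignoring order and duplicates.
function referenceAccuracy(samples: ReferenceSample[]): number {
  if (samples.length === 0) return 0;
  const key = (pages: number[]): string =>
    Array.from(new Set(pages)).sort((a, b) => a - b).join(",");
  const matches = samples.filter((s) => key(s.shown) === key(s.referenced)).length;
  return (matches / samples.length) * 100;
}
```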

Priority

Medium: This is a UX improvement that will make the reference pages more useful and less confusing for users. The current behavior can show too many pages, making it unclear which ones are actually relevant.


This change would improve the user experience by showing only the page references the AI relied on, rather than every page provided as context.
