@rexjohannes rexjohannes commented Sep 17, 2025

Fixes #2238 (limit inconsistency) and #2222 (invisible documents), as well as the bug where documents disappear on delete but reappear after a delay.

This pull request primarily raises the maximum value of the limit parameter for various API endpoints from 100 to 1000, allowing clients to request larger result sets. It also improves how collection IDs are handled during document ingestion and switches document upserts to a stricter transaction isolation level to make them more reliable. The most important changes are grouped by theme below:

API Parameter Updates

  • Increased the maximum allowed value for the limit parameter from 100 to 1000 across multiple endpoints in documents_router.py, collections_router.py, users_router.py, and chunks_router.py, as well as in the corresponding OpenAPI documentation (llms.txt). Clients can now retrieve up to 1000 objects per request.
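
As a rough illustration of the new bound, here is a stand-alone sketch rather than the routers' actual code. `MAX_LIMIT`, `DEFAULT_LIMIT`, and `validate_limit` are hypothetical names, and both the default of 100 and the lower bound of 1 are assumptions not stated in this PR.

```python
from typing import Optional

MAX_LIMIT = 1000  # upper bound after this PR (previously 100)
DEFAULT_LIMIT = 100  # assumed default, not taken from the PR


def validate_limit(limit: Optional[int] = None) -> int:
    """Return a usable limit, or raise ValueError when it is out of range."""
    if limit is None:
        return DEFAULT_LIMIT
    if not 1 <= limit <= MAX_LIMIT:
        raise ValueError(
            f"limit must be between 1 and {MAX_LIMIT}, got {limit}"
        )
    return limit
```

Rejecting out-of-range values instead of silently clamping them keeps the behaviour visible to clients, which matters when the documented and enforced limits have disagreed before.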

Collection ID Handling in Document Ingestion

  • Updated ingestion workflows and services to consistently use document_info.collection_ids for assigning and propagating collection IDs, so that documents and chunks are correctly associated with collections. This includes changes in ingestion_service.py, documents_router.py, and the orchestration workflows.
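
The idea of treating `document_info.collection_ids` as the single source of truth could look roughly like this. It is a simplified sketch: the `DocumentInfo` and `Chunk` dataclasses and the `build_chunks` helper are hypothetical stand-ins, not the project's real models.

```python
from dataclasses import dataclass, field
from uuid import UUID, uuid4


@dataclass
class DocumentInfo:  # hypothetical, simplified stand-in
    id: UUID
    collection_ids: list[UUID] = field(default_factory=list)


@dataclass
class Chunk:  # hypothetical, simplified stand-in
    document_id: UUID
    text: str
    collection_ids: list[UUID] = field(default_factory=list)


def build_chunks(document_info: DocumentInfo, texts: list[str]) -> list[Chunk]:
    """Create chunks that inherit the document's collection IDs."""
    return [
        Chunk(
            document_id=document_info.id,
            text=text,
            # Copy the list so chunks never share state with the document.
            collection_ids=list(document_info.collection_ids),
        )
        for text in texts
    ]


doc = DocumentInfo(id=uuid4(), collection_ids=[uuid4()])
chunks = build_chunks(doc, ["first chunk", "second chunk"])
```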

Database Reliability Improvements

  • Changed the transaction isolation level to serializable in upsert_documents_overview to reduce race conditions, and added handling for SerializationFailureError to improve retry logic during concurrent document upserts.
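
Under serializable isolation the database may abort one of two conflicting transactions, so the caller has to retry. A minimal sketch of such a retry loop follows. The `SerializationFailure` class is a stand-in for whatever error the database driver raises, and `run_with_retry` is a hypothetical helper, not the code in `upsert_documents_overview`. In real code, `operation` would open a serializable transaction and perform the upsert.

```python
import asyncio


class SerializationFailure(Exception):
    """Stand-in for the driver's serialization-failure error (assumption)."""


async def run_with_retry(operation, max_retries: int = 3, base_delay: float = 0.1):
    """Await `operation()` and retry it when a serialization failure occurs."""
    retries = 0
    while True:
        try:
            return await operation()
        except SerializationFailure:
            retries += 1
            if retries > max_retries:
                raise  # give up and surface the error to the caller
            # Exponential backoff before the next attempt.
            await asyncio.sleep(base_delay * (2**retries))
```

The whole transaction must be re-run on each attempt, not just the failing statement, because the aborted transaction's earlier reads are no longer valid.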

Document Status Update Safeguards

  • Added a check in ingestion_service.py to ensure a document still exists before updating its status, preventing accidental recreation of deleted documents during ingestion.
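
The safeguard can be pictured with a small in-memory sketch. The `documents` dict and `update_status_if_exists` function are hypothetical and only illustrate the check, they are not the code in `ingestion_service.py`.

```python
# Hypothetical in-memory stand-in for the documents table: id -> status.
documents: dict[str, str] = {"doc-1": "parsing"}


def update_status_if_exists(doc_id: str, status: str) -> bool:
    """Update the status only when the document is still present."""
    if doc_id not in documents:
        # The document was deleted while ingestion was running: skip the
        # write so that a status upsert does not bring it back.
        return False
    documents[doc_id] = status
    return True
```

Against a real database, the check and the write would need to happen in the same transaction (or as a plain UPDATE instead of an upsert), otherwise a delete could still slip in between them.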

Minor Query Construction Fix

  • Improved query construction in get_documents_overview to ensure conditions are properly combined.
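
The PR text does not spell out the exact fix, so the following is only one plausible way to combine filter conditions safely. `build_where` is a hypothetical helper, not a function from `get_documents_overview`.

```python
def build_where(conditions: list[str]) -> str:
    """Join SQL conditions with AND, or return '' when there are none."""
    if not conditions:
        return ""
    # Parenthesise each condition so an inner OR cannot change the meaning
    # of the combined expression.
    return "WHERE " + " AND ".join(f"({c})" for c in conditions)
```

For example, `build_where(["owner_id = $1", "$2 = ANY(collection_ids)"])` yields a clause in which both filters must hold, and an empty list yields no WHERE clause at all.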

Important

Increased API limit parameter to 1000, improved collection ID handling, and enhanced database reliability and document status updates.

  • API Parameter Updates
    • Increased limit parameter from 100 to 1000 in documents_router.py, collections_router.py, and users_router.py to allow larger result sets.
  • Collection ID Handling in Document Ingestion
    • Updated ingestion_service.py and documents_router.py to use document_info.collection_ids for consistent collection ID assignment.
  • Database Reliability Improvements
    • Changed transaction isolation to serializable in upsert_documents_overview to reduce race conditions.
    • Added SerializationFailureError handling for retries during document upserts.
  • Document Status Update Safeguards
    • Added checks in ingestion_service.py to ensure document existence before status updates, preventing accidental recreation of deleted documents.

This description was created by Ellipsis for 3f97b08.


@ellipsis-dev ellipsis-dev bot left a comment

Important

Looks good to me! 👍

Reviewed everything up to 3f97b08 in 1 minute and 20 seconds.
  • Reviewed 325 lines of code in 9 files
  • Skipped 0 files when reviewing.
  • Skipped posting 4 draft comments. View those below.
1. py/core/providers/database/documents.py:387
  • Draft comment:
    The _get_ids_from_table query uses 'AND $2 = ANY(collection_ids)' without checking if collection_id is provided. If collection_id is None, this condition may behave unexpectedly. Consider adding explicit logic to handle a missing collection constraint.
  • Reason this comment was not posted:
    Comment was not on a location in the diff, so it can't be submitted as a review comment.
2. py/core/providers/database/documents.py:327
  • Draft comment:
    In upsert_documents_overview, exponential backoff is used (wait_time = 0.1 * (2**retries)). Consider adding jitter to this formula to avoid thundering herd issues when many concurrent updates occur.
  • Reason this comment was not posted:
    Comment was not on a location in the diff, so it can't be submitted as a review comment.
3. py/core/providers/database/documents.py:667
  • Draft comment:
    Parsing 'summary_embedding' by slicing the string (using [1:-1] and splitting by commas) can be brittle. Consider storing embeddings in a structured format (e.g., as JSON) so you can reliably convert them without manual string manipulation.
  • Reason this comment was not posted:
    Comment was not on a location in the diff, so it can't be submitted as a review comment.
4. py/core/providers/database/documents.py:775
  • Draft comment:
    In semantic_document_search, the SQL query dynamically builds parameter placeholders using the length of the params list (e.g., LIMIT ${len(params) + 1}). This approach can be error-prone. Consider using a query builder or a more explicit parameter numbering strategy.
  • Reason this comment was not posted:
    Comment was not on a location in the diff, so it can't be submitted as a review comment.
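
Two of the draft comments above (jitter for the backoff, and structured parsing of `summary_embedding`) can be sketched as follows. Both helpers are hypothetical illustrations, not code from `documents.py`. The jitter variant shown is "full jitter", and the parser assumes the stored string is a bracketed, comma-separated list of numbers, as the draft comment implies.

```python
import json
import random


def backoff_with_jitter(retries: int, base: float = 0.1) -> float:
    """Full jitter: pick a random wait in [0, base * 2**retries]."""
    return random.uniform(0, base * (2**retries))


def parse_embedding(raw: str) -> list[float]:
    """Parse a '[0.1,0.2]'-style string as JSON instead of slicing it."""
    return [float(x) for x in json.loads(raw)]
```

Randomising the wait spreads out retries from many concurrent writers, and going through `json.loads` fails loudly on malformed input instead of producing a silently wrong vector.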

Workflow ID: wflow_ad1NTX5viMYMheWa

@rexjohannes rexjohannes mentioned this pull request Sep 19, 2025
Development

Successfully merging this pull request may close these issues.

List documents does not work according to docs

1 participant