perf(ingest/dataplex): stream list_entries / search_entries instead of eager list() #17730
Merged
…list()
Calling list() on the GCP pager forces the client to fetch every page before processing begins. For large entry groups like @bigquery, with tens of thousands of tables, this causes an apparent hang. Switch all five call sites to iterate the pager directly so that processing starts as soon as the first page arrives and later pages are fetched on demand.
https://claude.ai/code/session_019UExFjBkJnaKzN4BbdPt9o
…mance concern
Glossaries, categories, and terms are small datasets that don't suffer from the eager-evaluation hang. Revert dataplex_glossary.py to its original state; only the entries-specific list_entries / search_entries calls needed the streaming fix.
https://claude.ai/code/session_019UExFjBkJnaKzN4BbdPt9o
Codecov Report
❌ Your patch check has failed because the patch coverage (0.00%) is below the target coverage (75.00%). You can increase the patch coverage or adjust the target coverage.
treff7es
approved these changes
Jun 4, 2026
Why
The Dataplex @bigquery entry group is a virtual group that automatically contains metadata for every BigQuery dataset and table in a project. For large projects this can be tens or hundreds of thousands of entries.

The old code wrapped both list_entries and search_entries in list(...), which forces the GCP client library to fetch every page before the first entry is processed. With a page size of 100, a project with 50,000 tables requires 500 sequential API calls before anything happens, causing an apparent hang.

What changed
dataplex_entries.py only:
- _list_entry_stubs: iterate the list_entries pager directly; the processing loop is now inside the PerfTimer block so it fires page-by-page.
- _process_spanner_entries: same treatment for search_entries; the result loop moved inside the existing try/except so lazy page-fetch errors are still caught.

Glossary/term/category calls are left unchanged; those datasets are small and don't exhibit the problem.
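The difference between the eager and streaming patterns can be illustrated with a small, self-contained sketch. It does not use the real Dataplex client: FakePager and the helper functions are hypothetical stand-ins that only count simulated page fetches, assuming a pager that fetches one page per underlying call and yields entries lazily. The last helper shows why the consuming loop has to sit inside the try/except once fetching becomes lazy.

```python
from typing import Iterator, List, Optional, Tuple


class FakePager:
    """Hypothetical stand-in for a paged API response (not the GCP class).

    Each page costs one simulated API call, counted in pages_fetched.
    """

    def __init__(self, total: int, page_size: int, fail_on_page: int = -1) -> None:
        self.total = total
        self.page_size = page_size
        self.fail_on_page = fail_on_page  # zero-based page index that raises
        self.pages_fetched = 0

    def __iter__(self) -> Iterator[str]:
        for start in range(0, self.total, self.page_size):
            if self.pages_fetched == self.fail_on_page:
                raise RuntimeError("simulated page-fetch failure")
            self.pages_fetched += 1  # one simulated call per page
            end = min(start + self.page_size, self.total)
            for i in range(start, end):
                yield f"entry-{i}"


def pages_before_first_entry_eager(pager: FakePager) -> int:
    """Eager pattern: list() drains every page before any entry is handled."""
    entries = list(pager)
    _first = entries[0]
    return pager.pages_fetched


def pages_before_first_entry_streaming(pager: FakePager) -> int:
    """Streaming pattern: the first entry is available after a single page."""
    for _entry in pager:
        return pager.pages_fetched
    return pager.pages_fetched


def process_streaming(pager: FakePager) -> Tuple[List[str], Optional[str]]:
    """Consume the pager inside try/except so mid-iteration errors are caught."""
    processed: List[str] = []
    error: Optional[str] = None
    try:
        for entry in pager:
            processed.append(entry)
    except RuntimeError as exc:
        error = str(exc)
    return processed, error


if __name__ == "__main__":
    print(pages_before_first_entry_eager(FakePager(50_000, 100)))      # 500
    print(pages_before_first_entry_streaming(FakePager(50_000, 100)))  # 1
    done, err = process_streaming(FakePager(250, 100, fail_on_page=2))
    print(len(done), err)  # 200 simulated page-fetch failure
```

With 50,000 entries and a page size of 100, the eager helper reports 500 page fetches before the first entry is touched, while the streaming helper reports 1. In the failure case, the entries from the pages fetched so far are already processed when the error surfaces inside the loop, which is why a try/except that wraps only the call creating the pager would miss it.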
Generated by Claude Code