perf(ingest/dataplex): stream list_entries / search_entries instead of eager list()#17730

Merged
sgomezvillamor merged 2 commits into master from claude/tender-heisenberg-cywuF
Jun 4, 2026

@sgomezvillamor
Contributor

Why

The Dataplex @bigquery entry group is a virtual group that automatically contains metadata for every BigQuery dataset and table in a project. For large projects this can be tens or hundreds of thousands of entries.

The old code wrapped both list_entries and search_entries in list(...), which forces the GCP client library to fetch every page before the first entry is processed. With a page size of 100, a project with 50,000 tables requires 500 sequential API calls before any entry is processed, which looks like a hang.
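To make the arithmetic above concrete, here is a minimal sketch using a simulated pager. `FakePager` and both helper functions are hypothetical names invented for illustration; this is not the GCP client or the DataHub code, only a stand-in that counts simulated page fetches.

```python
from typing import Iterator, List


class FakePager:
    """Hypothetical stand-in for a paged API result; not the GCP client."""

    def __init__(self, total_items: int, page_size: int) -> None:
        self.total_items = total_items
        self.page_size = page_size
        self.pages_fetched = 0  # counts simulated API calls

    def _fetch_page(self, start: int) -> List[int]:
        # Each call represents one sequential API request.
        self.pages_fetched += 1
        end = min(start + self.page_size, self.total_items)
        return list(range(start, end))

    def __iter__(self) -> Iterator[int]:
        start = 0
        while start < self.total_items:
            # The next page is only requested once the previous one is consumed.
            yield from self._fetch_page(start)
            start += self.page_size


def calls_before_first_item_eager(pager: FakePager) -> int:
    items = list(pager)  # drains every page before any item is handled
    _first = items[0]
    return pager.pages_fetched


def calls_before_first_item_lazy(pager: FakePager) -> int:
    for _first in pager:  # only the first page has been fetched at this point
        return pager.pages_fetched
    return pager.pages_fetched


eager_calls = calls_before_first_item_eager(FakePager(50_000, 100))
lazy_calls = calls_before_first_item_lazy(FakePager(50_000, 100))
print(eager_calls, lazy_calls)
```

Under these assumptions the eager path makes 500 simulated calls before the first item is handled, while the lazy path makes one.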

What changed

dataplex_entries.py only:

  • _list_entry_stubs: iterates the list_entries pager directly; the processing loop now sits inside the PerfTimer block, so processing happens page by page.
  • _process_spanner_entries: same treatment for search_entries; the result loop moved inside the existing try/except so that errors raised by lazy page fetches are still caught.
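The reason the loop has to live inside the try/except can be sketched as follows. With a lazy iterator, a failed page fetch surfaces during iteration rather than at the call that creates the iterator. `PageFetchError`, `flaky_pager`, and `process_all` are hypothetical names for this illustration and do not reflect the actual DataHub or GCP client code.

```python
from typing import Callable, Iterator, List, Tuple


class PageFetchError(Exception):
    """Hypothetical error type standing in for an API failure."""


def flaky_pager() -> Iterator[str]:
    # The first simulated page succeeds.
    yield "entry-1"
    yield "entry-2"
    # The second simulated page fails, in the middle of iteration.
    raise PageFetchError("page 2 unavailable")


def process_all(search: Callable[[], Iterator[str]]) -> Tuple[List[str], List[str]]:
    processed: List[str] = []
    errors: List[str] = []
    try:
        results = search()  # creating the iterator fetches nothing yet
        for entry in results:  # fetch errors are raised here, inside the try
            processed.append(entry)
    except PageFetchError as exc:
        errors.append(str(exc))
    return processed, errors


processed, errors = process_all(flaky_pager)
print(processed, errors)
```

If the `for` loop sat after the `try` block instead, the exception from the second page would escape uncaught, even though the call that created the iterator was wrapped.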

Glossary, term, and category calls are left unchanged because those datasets are small and don't exhibit the problem.


Generated by Claude Code

claude added 2 commits June 3, 2026 09:08
…list()

Calling list() on the GCP pager forces the client to fetch every page
before processing begins. For large entry groups like @bigquery with
tens-of-thousands of tables this causes an apparent hang. Switch all
five call sites to iterate the pager directly so processing starts as
soon as the first page arrives and pages are fetched on demand.

https://claude.ai/code/session_019UExFjBkJnaKzN4BbdPt9o
…mance concern

Glossaries, categories, and terms are small datasets that don't suffer
from the eager-evaluation hang. Revert dataplex_glossary.py to its
original state; only the entries-specific list_entries / search_entries
calls needed the streaming fix.

https://claude.ai/code/session_019UExFjBkJnaKzN4BbdPt9o
@github-actions github-actions Bot added the ingestion label (PR or Issue related to the ingestion of metadata) on Jun 4, 2026
@codecov

codecov Bot commented Jun 4, 2026

Codecov Report

❌ Patch coverage is 0% with 21 lines in your changes missing coverage. Please review.
✅ All tests successful. No failed tests found.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...ahub/ingestion/source/dataplex/dataplex_entries.py | 0.00% | 21 Missing ⚠️ |

❌ Your patch check has failed because the patch coverage (0.00%) is below the target coverage (75.00%). You can increase the patch coverage or adjust the target coverage.

@sgomezvillamor sgomezvillamor merged commit 51dcf91 into master Jun 4, 2026
74 of 75 checks passed
@sgomezvillamor sgomezvillamor deleted the claude/tender-heisenberg-cywuF branch June 4, 2026 14:58
Labels

ingestion PR or Issue related to the ingestion of metadata

3 participants