feat(ingest): Couchbase ingestion source #12345

mminichino · 2025-01-15T00:11:29Z

Checklist

The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
Links to related issues (if applicable)
Tests for the changes have been added/updated (if applicable)
Docs related to the changes have been added/updated (if applicable). If a new feature has been added a Usage Guide has been added for the same.
For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub

codecov · 2025-01-15T00:43:22Z

Codecov Report

Attention: Patch coverage is 82.59304% with 145 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
.../ingestion/source/couchbase/couchbase_profiling.py	77.09%	41 Missing ⚠️
...ngestion/source/couchbase/couchbase_data_reader.py	0.00%	35 Missing ⚠️
...on/src/datahub/ingestion/source/couchbase/retry.py	69.84%	19 Missing ⚠️
...ub/ingestion/source/couchbase/couchbase_connect.py	87.68%	17 Missing ⚠️
...tahub/ingestion/source/couchbase/couchbase_main.py	89.47%	16 Missing ⚠️
.../ingestion/source/couchbase/couchbase_aggregate.py	82.92%	14 Missing ⚠️
.../ingestion/source/couchbase/couchbase_kv_schema.py	97.95%	2 Missing ⚠️
...estion/source/couchbase/couchbase_schema_reader.py	95.00%	1 Missing ⚠️

❌ Your patch status has failed because the patch coverage (82.50%) is below the target coverage (90.00%). You can increase the patch coverage or adjust the target coverage.

Files with missing lines	Coverage Δ
...b-react/src/app/ingest/source/builder/constants.ts	`100.00% <100.00%> (ø)`
...hub/ingestion/source/couchbase/couchbase_common.py	`100.00% <100.00%> (ø)`
...atahub/ingestion/source/couchbase/couchbase_sql.py	`100.00% <100.00%> (ø)`
...estion/source/couchbase/couchbase_schema_reader.py	`95.00% <95.00%> (ø)`
.../ingestion/source/couchbase/couchbase_kv_schema.py	`97.95% <97.95%> (ø)`
.../ingestion/source/couchbase/couchbase_aggregate.py	`82.92% <82.92%> (ø)`
...tahub/ingestion/source/couchbase/couchbase_main.py	`89.47% <89.47%> (ø)`
...ub/ingestion/source/couchbase/couchbase_connect.py	`87.68% <87.68%> (ø)`
...on/src/datahub/ingestion/source/couchbase/retry.py	`69.84% <69.84%> (ø)`
...ngestion/source/couchbase/couchbase_data_reader.py	`0.00% <0.00%> (ø)`
... and 1 more

... and 3 files with indirect coverage changes

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 5309ae0...2cc5ba5.


hsheth2

Left some initial comments on the code

It feels like some features here were added just because they exist for other sources e.g. profiling/classification/domain mapping. Are those strictly necessary for what we're trying to accomplish with this source? Are the implementations sufficiently performant to work at scale?

hsheth2 · 2025-01-23T04:20:39Z

metadata-ingestion/src/datahub/ingestion/source/couchbase/retry.py

can we just use tenacity or another library for this?

I suppose I could. Is this already imported? I have always used this code with Python and Couchbase, so I know it is stable, which is why I used it.
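For context, the kind of helper under discussion can be sketched as a retry decorator with exponential backoff. This is a minimal standard-library illustration, not the PR's `retry.py` and not tenacity's API; the name `retry` and its parameters (`retry_count`, `initial_wait`, `factor`, `allow`) are hypothetical. tenacity provides a comparable `retry` decorator configured with stop and wait strategies, which is what the reviewer is suggesting as a replacement.

```python
import functools
import time


def retry(retry_count=5, initial_wait=0.1, factor=2.0, allow=(Exception,)):
    """Retry the wrapped callable, multiplying the wait by `factor` each time."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            wait = initial_wait
            for attempt in range(1, retry_count + 1):
                try:
                    return func(*args, **kwargs)
                except allow:
                    if attempt == retry_count:
                        raise  # out of attempts: surface the last error
                    time.sleep(wait)
                    wait *= factor

        return wrapper

    return decorator
```

A hand-rolled helper like this is small, but every source that carries its own copy has to test and maintain it, which is the usual argument for a shared library.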

hsheth2 · 2025-01-23T04:21:34Z

metadata-ingestion/src/datahub/ingestion/source/couchbase/couchbase_kv_schema.py

+    samples: List[Any] = []
+
+
+def json_schema(


we already have a number of utilities for doing "schema inference" from objects - e.g.

datahub/metadata-ingestion/src/datahub/ingestion/source/schema_inference/object.py

Line 86 in 0361f24

def construct_schema(

Does this stuff still make sense?
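To make "schema inference from objects" concrete, here is a minimal sketch of the general idea: walk sampled documents and record which types appear at each field path. It is an assumption-laden illustration only; `infer_field_types` is a hypothetical name, and this is neither the PR's `json_schema` nor DataHub's `construct_schema`.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, Set


def infer_field_types(docs: Iterable[Dict[str, Any]]) -> Dict[str, Set[str]]:
    """Map each dotted field path to the set of Python type names seen for it."""
    seen: Dict[str, Set[str]] = defaultdict(set)

    def walk(value: Any, path: str) -> None:
        if isinstance(value, dict):
            if path:
                seen[path].add("dict")
            # Recurse into nested objects, extending the dotted path.
            for key, child in value.items():
                walk(child, f"{path}.{key}" if path else key)
        else:
            seen[path].add(type(value).__name__)

    for doc in docs:
        walk(doc, "")
    return dict(seen)
```

A field that maps to more than one type name (for example `int` and `str`) is the case a real inference utility has to resolve into a single declared type.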

hsheth2 · 2025-01-23T04:23:19Z

datahub-web-react/src/app/ingest/source/builder/constants.ts

@@ -147,6 +150,7 @@ export const PLATFORM_URN_TO_LOGO = {
    [BIGQUERY_URN]: bigqueryLogo,
    [CLICKHOUSE_URN]: clickhouseLogo,
    [COCKROACHDB_URN]: cockroachdbLogo,
+    [COUCHBASE_URN]: couchbaseLogo,


in general, I prefer to have the UI forms get added in a separate PR.

Small PRs are easier to review - and in this case, I like to have the UI changes reviewed by someone who does more frontend dev than I do

I can remove this.

hsheth2 · 2025-01-23T04:24:53Z

metadata-ingestion/tests/integration/couchbase/docker-compose.yml

this docker compose and entrypoint file seem more complex than I would have expected

Usually docker containers come with good defaults, which means our docker setups are very simple

I was required to add these. Couchbase needs several checks to validate that the software stack is ready to use when it is deployed and configured in a container; the provided port check is not sufficient. I can look at removing one check and waiting longer on the second, but some additional checking is needed before the cluster can be used.
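The readiness checking described here amounts to polling a condition until it holds or a deadline passes. A minimal sketch, assuming a caller-supplied `check` callable; `wait_until_ready` and its parameters are hypothetical and are not taken from the PR's entrypoint script:

```python
import time
from typing import Callable


def wait_until_ready(
    check: Callable[[], bool], timeout: float = 60.0, interval: float = 1.0
) -> bool:
    """Poll check() until it returns True or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            if check():
                return True
        except OSError:
            pass  # service not reachable yet; treat as "not ready"
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

An open port only shows that a process is listening, so `check` would need to probe something stronger, such as whether the cluster answers a query.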

hsheth2 · 2025-01-23T04:25:19Z

metadata-ingestion/tests/unit/couchbase/test_couchbase_source.py

looks like duplicate files here

See previous comment on using Couchbase in a container with CI/CD. To exercise connectivity with a cluster, you need the same assets as with the integration test.

hsheth2 · 2025-01-23T04:29:05Z

metadata-ingestion/src/datahub/ingestion/source/couchbase/__init__.py

This is the first source that really uses async

In general, we should almost never be using "bare" asyncio. Instead, the anyio library is preferred. We also should not really have references to the event loop all over the place - that tends to be an anti-pattern. Methods that need an event loop should be async. For CPU-bound operations, we can use the asyncer library to bridge between asyncio tasks and worker threads.

The Couchbase Python SDK is designed to work with either asyncio or Twisted. I can look into using asyncer instead of calling asyncio directly to invoke async methods.
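The structure the reviewer describes (async methods, no loop references passed around, blocking work handed to worker threads) can be sketched as follows. The sketch uses only the standard library so that it stays dependency-free, which is itself a departure from the review's advice; with anyio the analogous calls would be `anyio.run` and `anyio.to_thread.run_sync`, and asyncer offers `asyncify` for the same purpose. All function names here are hypothetical.

```python
import asyncio
import hashlib


def fingerprint(payload: bytes) -> str:
    """Blocking work that should not run on the event loop thread."""
    return hashlib.sha256(payload).hexdigest()


async def fingerprint_all(payloads):
    """Async method: awaits worker threads instead of holding a loop reference."""
    tasks = [asyncio.to_thread(fingerprint, p) for p in payloads]
    return await asyncio.gather(*tasks)


def run_sync(payloads):
    """Single synchronous entry point; the only place an event loop is started."""
    return asyncio.run(fingerprint_all(payloads))
```

Keeping the loop start in one entry point means no other method needs to look up or store the event loop.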

mminichino · 2025-01-23T16:19:50Z

> Left some initial comments on the code
>
> It feels like some features here were added just because they exist for other sources e.g. profiling/classification/domain mapping. Are those strictly necessary for what we're trying to accomplish with this source? Are the implementations sufficiently performant to work at scale?

The implementation was written to work at scale. Performance at scale is one of Couchbase's key characteristics. We should be able to support all the features, and you can use the available settings, such as sample size, to limit the dataset and further accelerate ingestion.

mminichino added 6 commits January 14, 2025 16:50

feat(ingest): Couchbase ingestion source

ed2142e

feat(ingest): Couchbase ingestion source

365f7a1

feat(ingest): Couchbase ingestion source

c39d41b

feat(ingest): Couchbase ingestion source

1690db9

feat(ingest): fix Couchbase source lint problems

7864e3a

feat(ingest): align Couchbase source with master

32f019c

github-actions bot added the ingestion, product, and community-contribution labels Jan 15, 2025

datahub-cyborg bot added the needs-review label Jan 15, 2025

vercel bot had a problem deploying to Preview January 15, 2025 01:05 Failure

feat(ingest): Couchbase source extra error handling

9af84b3

vercel bot deployed to Preview January 15, 2025 20:11 View deployment

feat(ingest): Couchbase source fixed retry logic

cdec644

vercel bot deployed to Preview January 16, 2025 01:09 View deployment

feat(ingest): Couchbase source updated query logic

88f4661

vercel bot deployed to Preview January 16, 2025 05:51 View deployment

feat(ingest): Couchbase source async refactor

b61aad8

vercel bot deployed to Preview January 16, 2025 19:18 View deployment

feat(ingest): Couchbase source test updates

b5458fe

vercel bot deployed to Preview January 16, 2025 22:25 View deployment

feat(ingest): Couchbase source additional test update

60eb467

vercel bot deployed to Preview January 16, 2025 23:56 View deployment

feat(ingest): Couchbase source additional test updates

8591db8

vercel bot deployed to Preview January 17, 2025 01:03 View deployment

mminichino marked this pull request as draft January 22, 2025 15:05

feat(ingest): Couchbase source profiling updates

bb14433

vercel bot deployed to Preview January 22, 2025 18:00 View deployment

mminichino marked this pull request as ready for review January 22, 2025 23:42

mminichino requested a review from hsheth2 January 22, 2025 23:42

mminichino added 3 commits January 22, 2025 18:58

Merge branch 'master' into mm--couchbase-ingest-source

8a07901

Merge branch 'datahub-project:master' into mm--couchbase-ingest-source

f980a86

Merge branch 'mm--couchbase-ingest-source' of github.com:mminichino/d…

d45700e

…atahub into mm--couchbase-ingest-source

vercel bot deployed to Preview January 23, 2025 02:14 View deployment

feat(ingest): Couchbase source lint fix update

2cc5ba5

vercel bot deployed to Preview January 23, 2025 02:35 View deployment

hsheth2 requested changes Jan 23, 2025


datahub-cyborg bot added the pending-submitter-response label and removed the needs-review label Jan 23, 2025

datahub-cyborg bot added the needs-review label and removed the pending-submitter-response label Jan 23, 2025

mminichino marked this pull request as draft January 23, 2025 16:29

hsheth2 removed the needs-review label Feb 12, 2025

mminichino closed this Jun 20, 2025
