-
Notifications
You must be signed in to change notification settings - Fork 3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(ingest): Couchbase ingestion source #12345
base: master
Are you sure you want to change the base?
feat(ingest): Couchbase ingestion source #12345
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left some initial comments on the code
It feels like some features here were added just because they exist for other sources e.g. profiling/classification/domain mapping. Are those strictly necessary for what we're trying to accomplish with this source? Are the implementations sufficiently performant to work at scale?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we just use tenacity
or another library for this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suppose I could. Is this already imported? I have always used this code with Python and Couchbase so I know it is stable which is why I used it.
samples: List[Any] = [] | ||
|
||
|
||
def json_schema( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we already have a number of utilities for doing "schema inference" from objects - e.g.
datahub/metadata-ingestion/src/datahub/ingestion/source/schema_inference/object.py
Line 86 in 0361f24
def construct_schema( |
Does this stuff still make sense?
@@ -147,6 +150,7 @@ export const PLATFORM_URN_TO_LOGO = { | |||
[BIGQUERY_URN]: bigqueryLogo, | |||
[CLICKHOUSE_URN]: clickhouseLogo, | |||
[COCKROACHDB_URN]: cockroachdbLogo, | |||
[COUCHBASE_URN]: couchbaseLogo, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in general, I prefer to have the UI forms get added in a separate PR.
Small PRs are easier to review - and in this case, I like to have the UI changes reviewed by someone who does more frontend dev than I do
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can remove this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this docker compose and entrypoint file seem more complex than I would have expected
Usually docker containers come with good defaults, which means our docker setups are very simple
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was required to add these. Couchbase requires a lot of checkpoints to validate the software stack is ready to use when deployed and configured in a container. The provided port check is not sufficient. I can look at removing one check and waiting longer on the second, but some additional checking is needed before the cluster can be used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks like duplicate files here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See previous comment on using Couchbase in a container with CI/CD. To exercise connectivity with a cluster, you need the same assets as with the integration test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the first source that really uses async
In general, we should almost never be using "bare" asyncio. instead, the anyio
library is preferred. We also should not really have references to the event loop all over the place - that tends to be an anti-pattern. Methods that need an event loop should be async. For cpu-bound operations, we can use the asyncer
library to bridge between asyncio tasks and worker threads.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The CB Python SDK is designed to work with either asyncio or Twisted. I can look into asyncer
as opposed to calling asyncio methods to call async methods.
The implementation was written to work at scale. Performance at scale is one of Couchbase's key characteristics. We should be able to support all the features and you can leverage the available settings such as sample size to limit the dataset to further accelerate the ingestion process. |
Checklist