
Commit bc38d04

Merge branch 'master' into jk-fix-empty-schema-field
2 parents fe375a4 + bd47b11 commit bc38d04

29 files changed: +2970 −307 lines

docs/cli.md

Lines changed: 1 addition & 1 deletion
@@ -735,7 +735,7 @@ Please see our [Integrations page](https://datahubproject.io/integrations) if yo
 | [bigquery](./generated/ingestion/sources/bigquery.md) | `pip install 'acryl-datahub[bigquery]'` | BigQuery source |
 | [datahub-lineage-file](./generated/ingestion/sources/file-based-lineage.md) | _no additional dependencies_ | Lineage File source |
 | [datahub-business-glossary](./generated/ingestion/sources/business-glossary.md) | _no additional dependencies_ | Business Glossary File source |
-| [dbt](./generated/ingestion/sources/dbt.md) | _no additional dependencies_ | dbt source |
+| [dbt](./generated/ingestion/sources/dbt.md) | `pip install 'acryl-datahub[dbt]'` | dbt source |
 | [dremio](./generated/ingestion/sources/dremio.md) | `pip install 'acryl-datahub[dremio]'` | Dremio Source |
 | [druid](./generated/ingestion/sources/druid.md) | `pip install 'acryl-datahub[druid]'` | Druid Source |
 | [feast](./generated/ingestion/sources/feast.md) | `pip install 'acryl-datahub[feast]'` | Feast source (0.26.0) |

docs/managed-datahub/operator-guide/setting-up-remote-ingestion-executor.md

Lines changed: 44 additions & 0 deletions
@@ -125,6 +125,50 @@ The Helm chart [datahub-executor-worker](https://executor-helm.acryl.io/index.ya
    --set image.tag=v0.3.1 \
    acryl datahub-executor-worker
    ```
+9. As of DataHub Cloud `v0.3.8.2`, it is possible to pass secrets to ingestion recipes using Kubernetes Secret CRDs, as shown below. This allows secrets to be updated at runtime without restarting the Remote Executor process.
+   ```
+   # 1. Create a K8s Secret object in the remote executor namespace, e.g.
+   apiVersion: v1
+   kind: Secret
+   metadata:
+     name: datahub-secret-store
+   data:
+     REDSHIFT_PASSWORD: cmVkc2hpZnQtc2VjcmV0Cg==
+     SNOWFLAKE_PASSWORD: c25vd2ZsYWtlLXNlY3JldAo=
+   # 2. Add the secret to your Remote Executor deployment:
+   extraVolumes:
+     - name: datahub-secret-store
+       secret:
+         secretName: datahub-secret-store
+   # 3. Mount it under the /mnt/secrets directory
+   extraVolumeMounts:
+     - mountPath: /mnt/secrets
+       name: datahub-secret-store
+   ```
+   You can then reference the mounted secrets directly in the ingestion recipe:
+   ```yaml
+   source:
+     type: redshift
+     config:
+       host_port: '<redshift host:port>'
+       username: connector_test
+       table_lineage_mode: mixed
+       include_table_lineage: true
+       include_tables: true
+       include_views: true
+       profiling:
+         enabled: true
+         profile_table_level_only: false
+       stateful_ingestion:
+         enabled: true
+       password: '${REDSHIFT_PASSWORD}'
+   ```
+
+   By default the executor looks for files mounted under `/mnt/secrets`; this can be overridden by setting the env var
+   `DATAHUB_EXECUTOR_FILE_SECRET_BASEDIR` to a different location (default: `/mnt/secrets`).
+
+   These files are expected to be under 1 MB in size by default. To increase this limit, set a higher value using
+   `DATAHUB_EXECUTOR_FILE_SECRET_MAXLEN` (default: `1024768`, size in bytes).
 
 ## FAQ
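The `DATAHUB_EXECUTOR_FILE_SECRET_BASEDIR` and `DATAHUB_EXECUTOR_FILE_SECRET_MAXLEN` variables added above are plain environment variables on the executor container, so they can be overridden in the pod spec. A minimal sketch; the bare `env:` placement (rather than a chart-specific values key) and the values themselves are assumptions for illustration, not defaults from the docs:

```yaml
# Hypothetical override on the Remote Executor container spec.
env:
  - name: DATAHUB_EXECUTOR_FILE_SECRET_BASEDIR
    value: /etc/datahub/secrets   # look for mounted secret files here instead of /mnt/secrets
  - name: DATAHUB_EXECUTOR_FILE_SECRET_MAXLEN
    value: "5242880"              # allow secret files up to 5 MiB (value is in bytes)
```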

docs/managed-datahub/release-notes/v_0_3_8.md

Lines changed: 1 addition & 1 deletion
@@ -3,7 +3,7 @@
 
 Release Availability Date
 ---
-21-Jan-2025
+29-Jan-2025
 
 Recommended CLI/SDK
 ---

metadata-ingestion/setup.cfg

Lines changed: 4 additions & 1 deletion
@@ -15,6 +15,9 @@ warn_unused_configs = yes
 disallow_untyped_defs = no
 
 # try to be a bit more strict in certain areas of the codebase
+[mypy-datahub]
+# Only for datahub's __init__.py - allow implicit reexport
+implicit_reexport = yes
 [mypy-datahub.*]
 ignore_missing_imports = no
 implicit_reexport = no
@@ -54,7 +57,7 @@ addopts = --cov=src --cov-report= --cov-config setup.cfg --strict-markers -p no:
 markers =
     slow: marks tests that are slow to run, including all docker-based tests (deselect with '-m not slow')
     integration: marks all integration tests, across all batches (deselect with '-m "not integration"')
-    integration_batch_0: mark tests to run in batch 0 of integration tests. This is done mainly for parallelisation in CI. Batch 0 is the default batch.
+    integration_batch_0: mark tests to run in batch 0 of integration tests. This is done mainly for parallelization in CI. Batch 0 is the default batch.
     integration_batch_1: mark tests to run in batch 1 of integration tests
     integration_batch_2: mark tests to run in batch 2 of integration tests
 testpaths =
Lines changed: 35 additions & 0 deletions
@@ -0,0 +1,35 @@
+from datahub.configuration.common import MetaError
+
+# TODO: Move all other error types to this file.
+
+
+class SdkUsageError(MetaError):
+    pass
+
+
+class AlreadyExistsError(SdkUsageError):
+    pass
+
+
+class ItemNotFoundError(SdkUsageError):
+    pass
+
+
+class MultipleItemsFoundError(SdkUsageError):
+    pass
+
+
+class SchemaFieldKeyError(SdkUsageError, KeyError):
+    pass
+
+
+class IngestionAttributionWarning(Warning):
+    pass
+
+
+class MultipleSubtypesWarning(Warning):
+    pass
+
+
+class ExperimentalWarning(Warning):
+    pass
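All of the new exception types descend from `SdkUsageError` (itself a `MetaError`), so callers can catch a specific failure or the whole family at once. A rough, self-contained sketch of that pattern; the in-memory `_datasets` lookup below is purely illustrative and not part of this commit:

```python
from datahub.errors import ItemNotFoundError, MultipleItemsFoundError, SdkUsageError

# Toy stand-in for a real lookup, used only to demonstrate the exceptions.
_datasets = {"snowflake.analytics.orders": "Orders fact table"}


def find_dataset(name: str) -> str:
    matches = [desc for key, desc in _datasets.items() if key.endswith(name)]
    if not matches:
        raise ItemNotFoundError(f"no dataset matching {name!r}")
    if len(matches) > 1:
        raise MultipleItemsFoundError(f"{name!r} matches multiple datasets")
    return matches[0]


try:
    print(find_dataset("orders"))
except SdkUsageError as e:  # catches both error types raised above
    print(f"lookup failed: {e}")
```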

metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_schema.py

Lines changed: 10 additions & 0 deletions
@@ -6,6 +6,7 @@
 from typing import Callable, Dict, Iterable, List, MutableMapping, Optional
 
 from datahub.ingestion.api.report import SupportsAsObj
+from datahub.ingestion.source.common.subtypes import DatasetSubTypes
 from datahub.ingestion.source.snowflake.constants import SnowflakeObjectDomain
 from datahub.ingestion.source.snowflake.snowflake_connection import SnowflakeConnection
 from datahub.ingestion.source.snowflake.snowflake_query import (
@@ -100,6 +101,9 @@ class SnowflakeTable(BaseTable):
     def is_hybrid(self) -> bool:
         return self.type is not None and self.type == "HYBRID TABLE"
 
+    def get_subtype(self) -> DatasetSubTypes:
+        return DatasetSubTypes.TABLE
+
 
 @dataclass
 class SnowflakeView(BaseView):
@@ -109,6 +113,9 @@ class SnowflakeView(BaseView):
     column_tags: Dict[str, List[SnowflakeTag]] = field(default_factory=dict)
     is_secure: bool = False
 
+    def get_subtype(self) -> DatasetSubTypes:
+        return DatasetSubTypes.VIEW
+
 
 @dataclass
 class SnowflakeSchema:
@@ -154,6 +161,9 @@ class SnowflakeStream:
     column_tags: Dict[str, List[SnowflakeTag]] = field(default_factory=dict)
     last_altered: Optional[datetime] = None
 
+    def get_subtype(self) -> DatasetSubTypes:
+        return DatasetSubTypes.SNOWFLAKE_STREAM
+
 
 class _SnowflakeTagCache:
     def __init__(self) -> None:

metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_schema_gen.py

Lines changed: 11 additions & 14 deletions
@@ -21,7 +21,6 @@
 from datahub.ingestion.source.aws.s3_util import make_s3_urn_for_lineage
 from datahub.ingestion.source.common.subtypes import (
     DatasetContainerSubTypes,
-    DatasetSubTypes,
 )
 from datahub.ingestion.source.snowflake.constants import (
     GENERIC_PERMISSION_ERROR_KEY,
@@ -467,7 +466,13 @@ def _process_schema(
             context=f"{db_name}.{schema_name}",
         )
 
-    def _process_tags(self, snowflake_schema, schema_name, db_name, domain):
+    def _process_tags(
+        self,
+        snowflake_schema: SnowflakeSchema,
+        schema_name: str,
+        db_name: str,
+        domain: str,
+    ) -> None:
         snowflake_schema.tags = self.tag_extractor.get_tags_on_object(
             schema_name=schema_name, db_name=db_name, domain=domain
         )
@@ -837,15 +842,7 @@ def gen_dataset_workunits(
         if dpi_aspect:
             yield dpi_aspect
 
-        subTypes = SubTypes(
-            typeNames=(
-                [DatasetSubTypes.SNOWFLAKE_STREAM]
-                if isinstance(table, SnowflakeStream)
-                else [DatasetSubTypes.VIEW]
-                if isinstance(table, SnowflakeView)
-                else [DatasetSubTypes.TABLE]
-            )
-        )
+        subTypes = SubTypes(typeNames=[table.get_subtype()])
 
         yield MetadataChangeProposalWrapper(
             entityUrn=dataset_urn, aspect=subTypes
@@ -932,9 +929,9 @@ def get_dataset_properties(
                 "OWNER_ROLE_TYPE": table.owner_role_type,
                 "TABLE_NAME": table.table_name,
                 "BASE_TABLES": table.base_tables,
-                "STALE_AFTER": table.stale_after.isoformat()
-                if table.stale_after
-                else None,
+                "STALE_AFTER": (
+                    table.stale_after.isoformat() if table.stale_after else None
+                ),
             }.items()
             if v
         }
Lines changed: 33 additions & 0 deletions
@@ -0,0 +1,33 @@
+import warnings
+
+import datahub.metadata.schema_classes as models
+from datahub.errors import ExperimentalWarning, SdkUsageError
+from datahub.ingestion.graph.config import DatahubClientConfig
+from datahub.metadata.urns import (
+    ChartUrn,
+    ContainerUrn,
+    CorpGroupUrn,
+    CorpUserUrn,
+    DashboardUrn,
+    DataPlatformInstanceUrn,
+    DataPlatformUrn,
+    DatasetUrn,
+    DomainUrn,
+    GlossaryTermUrn,
+    SchemaFieldUrn,
+    TagUrn,
+)
+from datahub.sdk.container import Container
+from datahub.sdk.dataset import Dataset
+from datahub.sdk.main_client import DataHubClient
+
+warnings.warn(
+    "The new datahub SDK (e.g. datahub.sdk.*) is experimental. "
+    "Our typical backwards-compatibility and stability guarantees do not apply to this code. "
+    "When it's promoted to stable, the import path will change "
+    "from `from datahub.sdk import ...` to `from datahub import ...`.",
+    ExperimentalWarning,
+    stacklevel=2,
+)
+del warnings
+del ExperimentalWarning
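Because the warning is emitted at import time, downstream code that has consciously opted into the experimental SDK may want to suppress it. A small sketch using the standard `warnings` machinery; scoping the filter to the import statement is this example's choice, not something prescribed by the commit:

```python
import warnings

from datahub.errors import ExperimentalWarning

# Opt in to the experimental SDK without the import-time ExperimentalWarning.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=ExperimentalWarning)
    from datahub.sdk import DataHubClient, Dataset
```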
Lines changed: 15 additions & 0 deletions
@@ -0,0 +1,15 @@
+from typing import Dict, List, Type
+
+from datahub.sdk._entity import Entity
+from datahub.sdk.container import Container
+from datahub.sdk.dataset import Dataset
+
+# TODO: Is there a better way to declare this?
+ENTITY_CLASSES_LIST: List[Type[Entity]] = [
+    Container,
+    Dataset,
+]
+
+ENTITY_CLASSES: Dict[str, Type[Entity]] = {
+    cls.get_urn_type().ENTITY_TYPE: cls for cls in ENTITY_CLASSES_LIST
+}
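The comprehension keys the registry by each wrapper's URN entity type, so generic code can go from an entity-type string back to the SDK class. A small illustration of the relationship it relies on, assuming (as the comprehension implies) that `Dataset.get_urn_type()` returns `DatasetUrn`:

```python
from datahub.metadata.urns import DatasetUrn
from datahub.sdk.dataset import Dataset

# The registry key for Dataset is DatasetUrn.ENTITY_TYPE, which is what
# ENTITY_CLASSES would map back to the Dataset wrapper class.
assert Dataset.get_urn_type() is DatasetUrn
print(DatasetUrn.ENTITY_TYPE)
```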
Lines changed: 48 additions & 0 deletions
@@ -0,0 +1,48 @@
+from __future__ import annotations
+
+import contextlib
+from typing import Iterator
+
+from datahub.utilities.str_enum import StrEnum
+
+
+class KnownAttribution(StrEnum):
+    INGESTION = "INGESTION"
+    INGESTION_ALTERNATE = "INGESTION_ALTERNATE"
+
+    UI = "UI"
+    SDK = "SDK"
+
+    PROPAGATION = "PROPAGATION"
+
+    def is_ingestion(self) -> bool:
+        return self in (
+            KnownAttribution.INGESTION,
+            KnownAttribution.INGESTION_ALTERNATE,
+        )
+
+
+_default_attribution = KnownAttribution.SDK
+
+
+def get_default_attribution() -> KnownAttribution:
+    return _default_attribution
+
+
+def set_default_attribution(attribution: KnownAttribution) -> None:
+    global _default_attribution
+    _default_attribution = attribution
+
+
+@contextlib.contextmanager
+def change_default_attribution(attribution: KnownAttribution) -> Iterator[None]:
+    old_attribution = get_default_attribution()
+    try:
+        set_default_attribution(attribution)
+        yield
+    finally:
+        set_default_attribution(old_attribution)
+
+
+def is_ingestion_attribution() -> bool:
+    return get_default_attribution().is_ingestion()
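The context manager is the intended way to temporarily switch the process-wide default attribution and guarantee it is restored afterwards. A short usage sketch, assuming the definitions above are in scope (this diff view does not show the module's path, so no import line is given):

```python
# With the definitions above in scope:
assert get_default_attribution() is KnownAttribution.SDK  # module default

with change_default_attribution(KnownAttribution.INGESTION):
    # Inside the block, attribution-aware code sees the ingestion default.
    assert is_ingestion_attribution()

# On exit the previous default is restored, even if the block had raised.
assert not is_ingestion_attribution()
```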
