Describe the bug
When materializing assets in Dagster and publishing their metadata to DataHub with the DataHub Dagster Sensor, the following occurs:
1. A dataset specific to the asset's target platform is created in DataHub, with whichever URN is specified in the Dagster user code.
2. A dataset of type "Dagster Asset" is created in DataHub, which:
   - is set as the upstream dependency of the target platform dataset
   - has the URN urn:li:dataset:(urn:li:dataPlatform:dagster,{env}.{platform}.{db_name}.{schema_name}.{asset_name},PROD)
   - has no external URL
3. A dataJob of type "Dagster Op" is created in DataHub, which:
   - is set as the upstream dependency of the Dagster Asset
   - has the URN urn:li:dataJob:(urn:li:dataFlow:(dagster,{library_name}/__ASSET_JOB,PROD),{library_name}/{env}__{platform}__{db_name}__{schema_name}__{asset_name})
   - has the external URL https://{dagster_host}/locations/{library_name}/jobs/__ASSET_JOB/{env}__{platform}__{db_name}__{schema_name}__{asset_name}
     - following this link leads to a page in the Dagster UI with the error "Pipeline not found" (see screenshots)
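To make the two URN patterns above concrete, here is a minimal sketch that rebuilds them from an asset key path. It only reproduces the patterns observed in this report and is not the sensor's actual code; the function names and the sample values (`my_library`, `prod`, `databricks`, `sales`, `core`, `orders`) are hypothetical.

```python
# Sketch only: reproduces the URN patterns observed in this report.
# Function names and sample values are hypothetical, not DataHub's API.

def dagster_asset_dataset_urn(asset_key_path, fabric="PROD"):
    """URN of the "Dagster Asset" dataset: key parts joined with dots."""
    name = ".".join(asset_key_path)
    return f"urn:li:dataset:(urn:li:dataPlatform:dagster,{name},{fabric})"


def dagster_op_datajob_urn(library_name, asset_key_path, fabric="PROD"):
    """URN of the "Dagster Op" dataJob: key parts joined with double underscores."""
    flow_urn = f"urn:li:dataFlow:(dagster,{library_name}/__ASSET_JOB,{fabric})"
    job_id = f"{library_name}/" + "__".join(asset_key_path)
    return f"urn:li:dataJob:({flow_urn},{job_id})"


# Asset key path in the order {env}, {platform}, {db_name}, {schema_name}, {asset_name}
key = ["prod", "databricks", "sales", "core", "orders"]
print(dagster_asset_dataset_urn(key))
print(dagster_op_datajob_urn("my_library", key))
```

Note that the `{env}` placeholder is the first component of the asset key, which is separate from the `PROD` value at the end of each URN.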
To Reproduce
Steps to reproduce the behavior:
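1. Install Dagster 1.9.6 on any platform
2. Create a user code deployment that uses assets - the example here uses the following asset code:
3. Follow the instructions at https://datahubproject.io/docs/lineage/dagster/#using-datahubs-dagster-sensor to add the DataHub Dagster Sensor to the user code deployment
4. Deploy and materialize the assets in Dagster
5. Wait for the sensor to publish the metadata to DataHub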
Expected behavior
This behavior is unexpected for two reasons.
1. The link to the Dagster UI is broken. A valid link would be: https://{dagster_host}/assets/{env}/{platform}/{db_name}/{schema_name}/{asset_name}
2. There's no obvious value in this case for creating three separate entities - a "Databricks Dataset", a "Dagster Op" and a "Dagster Asset" - for each individual table in the target platform.
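For the first point, the difference between the link that is emitted and the link that would work can be sketched as follows. This is an illustration of the two URL patterns in this report under assumed sample values (`dagster.example.com`, `my_library`, and the key parts are hypothetical), not the sensor's implementation.

```python
# Sketch only: contrasts the emitted external URL with the expected asset URL.
# Host, library name and asset key values are hypothetical.

def emitted_external_url(dagster_host, library_name, asset_key_path):
    """Link currently attached to the "Dagster Op" entity ("Pipeline not found")."""
    job_name = "__".join(asset_key_path)
    return f"https://{dagster_host}/locations/{library_name}/jobs/__ASSET_JOB/{job_name}"


def expected_asset_url(dagster_host, asset_key_path):
    """Link to the asset page in the Dagster UI: key parts become path segments."""
    return f"https://{dagster_host}/assets/" + "/".join(asset_key_path)


key = ["prod", "databricks", "sales", "core", "orders"]
print(emitted_external_url("dagster.example.com", "my_library", key))
print(expected_asset_url("dagster.example.com", key))
```

The expected link depends only on the host and the asset key path, so it does not need the code location name or the `__ASSET_JOB` job at all.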
What we're interested in is helping our stakeholders explore which data depends on which other data. As it currently stands, the lineage graph is confusing to explore. The worst side effect is that each Databricks dataset currently appears to have no downstream assets - even if in reality it does - because that dependency information is stored in the Dagster asset entity.
Ideally, we would have a single entity that represents this asset, and contains links to the table in the Databricks Unity Catalog UI, the asset in the Dagster UI, and the location of the asset's source code (for us, this is GitHub). Alternatively, emitting two metadata entities for each table (Databricks dataset and Dagster asset) would be acceptable, if it is also possible to automate the creation of lineage dependencies between the Databricks datasets somehow.
Screenshots
Desktop (please complete the following information):
N/A
Additional context
DataHub Version: 0.15.0.1
Dagster Version: 1.9.6
Server OS: Kubernetes 1.30.5 (Ubuntu Linux) on Azure Kubernetes Service