[Dagster] Links to Dagster UI are broken when materializing assets #12577

Open
jtv8 opened this issue Feb 7, 2025 · 0 comments
Labels
bug Bug report

Describe the bug
When materializing assets in Dagster and publishing their metadata to DataHub with the DataHub Dagster Sensor, the following occurs:

  • A dataset specific to the asset's target platform is created in DataHub, with whichever URN is specified in the Dagster user code
  • A dataset of type "Dagster Asset" is created in DataHub, which:
    • is set as the upstream dependency of the target platform dataset
    • has the URN
      urn:li:dataset:(urn:li:dataPlatform:dagster,{env}.{platform}.{db_name}.{schema_name}.{asset_name},PROD)
    • has no external URL
  • A dataJob of type "Dagster Op" is created in DataHub, which:
    • is set as the upstream dependency of the Dagster Asset
    • has the URN
      urn:li:dataJob:(urn:li:dataFlow:(dagster,{library_name}/__ASSET_JOB,PROD),{library_name}/{env}__{platform}__{db_name}__{schema_name}__{asset_name})
    • has the external URL https://{dagster_host}/locations/{library_name}/jobs/__ASSET_JOB/{env}__{platform}__{db_name}__{schema_name}__{asset_name}
    • following this link leads to a page in the Dagster UI with the error "Pipeline not found" (see screenshots)
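
To make the naming pattern above concrete, here is a minimal sketch that fills in the URN and URL templates from this report for a given asset key. It is plain string formatting based only on the patterns observed above, not the DataHub Dagster Sensor's actual code; the host `dagster.example.com` and the code location name `demo_library` are hypothetical placeholders.

```python
from typing import NamedTuple, Sequence


class ObservedIds(NamedTuple):
    asset_dataset_urn: str  # the "Dagster Asset" dataset
    op_datajob_urn: str     # the "Dagster Op" entity
    op_external_url: str    # the link that ends in "Pipeline not found"


def observed_ids(
    dagster_host: str, library_name: str, key_path: Sequence[str]
) -> ObservedIds:
    """Fill in the templates from this report (assumed, not DataHub's code)."""
    dotted = ".".join(key_path)       # {env}.{platform}.{db_name}...
    flattened = "__".join(key_path)   # {env}__{platform}__{db_name}...
    return ObservedIds(
        asset_dataset_urn=(
            f"urn:li:dataset:(urn:li:dataPlatform:dagster,{dotted},PROD)"
        ),
        op_datajob_urn=(
            f"urn:li:dataJob:(urn:li:dataFlow:(dagster,{library_name}/__ASSET_JOB,PROD),"
            f"{library_name}/{flattened})"
        ),
        op_external_url=(
            f"https://{dagster_host}/locations/{library_name}"
            f"/jobs/__ASSET_JOB/{flattened}"
        ),
    )


# Hypothetical host and code location; the key path matches the example asset.
ids = observed_ids(
    "dagster.example.com",
    "demo_library",
    ["prod", "databricks", "demo_db", "demo_schema", "flights"],
)
print(ids.op_external_url)
```

The last path segment of the external URL is the flattened asset key, which is not something the `/jobs/` route in the Dagster UI resolves, hence the "Pipeline not found" page.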

To Reproduce
Steps to reproduce the behavior:

  1. Install Dagster 1.9.6 on any platform

  2. Create a user code deployment that uses assets - the example here uses the following asset code:

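    from datetime import date
    
    from dagster import AssetExecutionContext, DailyPartitionsDefinition, asset
    from pyspark.sql import DataFrame
    
    
    @asset(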
        key_prefix=["prod", "databricks", "demo_db", "demo_schema"],
        required_resource_keys={"pyspark_step_launcher", "pyspark"},
        partitions_def=DailyPartitionsDefinition(start_date="2024-01-01"),
    )
    def flights(context: AssetExecutionContext) -> DataFrame:
        partition_date = date.fromisoformat(context.partition_key)
    
        data = [
            ("ATL", "LHR", partition_date, "9:10", "17:30"),
            ("ATL", "ORD", partition_date, "10:20", "12:20"),
            ("LHR", "FRA", partition_date, "11:30", "13:04"),
            ("LHR", "DEN", partition_date, "12:40", "22:20"),
        ]
    
        return context.resources.pyspark.spark_session.createDataFrame(
            data,
            "dep_airport: string, arr_airport: string, date: date, dep_time: string, arr_time: string",
        )
    
  3. Follow the instructions at https://datahubproject.io/docs/lineage/dagster/#using-datahubs-dagster-sensor to add the DataHub Dagster Sensor to the user code deployment

  4. Deploy and materialize the assets in Dagster

  5. Wait for the sensor to publish the metadata to DataHub

Expected behavior
The current behavior is unexpected for two reasons:

  1. The link to the Dagster UI is broken. A valid link would be:
    https://{dagster_host}/assets/{env}/{platform}/{db_name}/{schema_name}/{asset_name}

  2. There's no obvious value in this case for creating three separate entities - a "Databricks Dataset", a "Dagster Op" and a "Dagster Asset" - for each individual table in the target platform.

    What we're interested in is helping our stakeholders explore which data depends on which other data. As it currently stands, the lineage graph is confusing to explore. The worst side effect is that each Databricks dataset currently appears to have no downstream assets - even if in reality it does - because that dependency information is stored in the Dagster asset entity.

    Ideally, we would have a single entity that represents this asset, and contains links to the table in the Databricks Unity Catalog UI, the asset in the Dagster UI, and the location of the asset's source code (for us, this is GitHub). Alternatively, emitting two metadata entities for each table (Databricks dataset and Dagster asset) would be acceptable, if it is also possible to automate the creation of lineage dependencies between the Databricks datasets somehow.
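
For point 1, a minimal sketch of how the expected asset link could be derived from the asset key path. This only illustrates the URL shape proposed above; it is not a patch to the DataHub Dagster plugin, and the host `dagster.example.com` is a hypothetical placeholder.

```python
from typing import Sequence
from urllib.parse import quote


def expected_asset_url(dagster_host: str, key_path: Sequence[str]) -> str:
    """Build https://{dagster_host}/assets/{env}/{platform}/... from an asset key."""
    # Percent-encode each key component separately so a "/" inside a
    # component cannot be mistaken for a path separator.
    path = "/".join(quote(part, safe="") for part in key_path)
    return f"https://{dagster_host}/assets/{path}"


url = expected_asset_url(
    "dagster.example.com",  # hypothetical host
    ["prod", "databricks", "demo_db", "demo_schema", "flights"],
)
print(url)
# → https://dagster.example.com/assets/prod/databricks/demo_db/demo_schema/flights
```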

Screenshots

(Two screenshots; images not included.)

Desktop (please complete the following information):
N/A

Additional context

  • DataHub Version: 0.15.0.1
  • Dagster Version: 1.9.6
  • Server OS: Kubernetes 1.30.5 (Ubuntu Linux) on Azure Kubernetes Service