Feature request: Native lineage support for Apache Doris (with spark-doris-connector) in the acryl-spark-lineage agent #16424

@james604s

Feature Description
Add a native resolver to the acryl-spark-lineage agent to support the official spark-doris-connector. The agent should recognize Spark LogicalPlan nodes (such as DataSourceV2Relation) whose provider is doris and automatically extract the dataset metadata from the doris.table.identifier option.

Why is this needed?

Plug-and-Play Experience: Users currently have to write manual emitters or custom Scala extensions to capture Doris lineage, which is difficult to scale across hundreds of ETL jobs.

Prevent Lineage Breaks: In architectures moving data from Hudi to Doris, the lineage chain often breaks because the agent cannot identify the Doris sink, leading to missing or incorrectly labeled (hdfs/file) entities.

URN Standardization: Automates the generation of consistent URNs (e.g., urn:li:dataset:(urn:li:dataPlatform:doris,...)), ensuring unified data governance and preventing duplicate entities.

Environment Support: Honor the spark.datahub.cluster (or env) configuration to correctly set the environment tag (e.g., PROD, DEV).

Describe alternatives you've considered
Manual DataHub Emitter: Using the Python SDK to manually emit lineage after the Spark action. This is intrusive to the ETL code and hard to maintain across hundreds of jobs.

OpenLineage Custom Extensions: Writing a custom QueryPlanVisitor for OpenLineage. While this works, it requires extra Scala development and maintenance of custom JARs outside the official DataHub release.

Additional context
The spark-doris-connector (official Apache Doris connector) typically uses the following options in Spark:

doris.table.identifier: The fully qualified name of the Doris table.

doris.fenodes: The FE node address.
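To illustrate what the resolver would need to do with these options, here is a minimal Python sketch (the real agent would be written in Java/Scala against the Spark plan). The names `DorisTable` and `resolve_doris_table` are hypothetical, and the sketch assumes the identifier has the form `database.table`:

```python
from typing import Mapping, NamedTuple, Optional


class DorisTable(NamedTuple):
    """Hypothetical holder for what the resolver extracts from a plan node."""
    database: str
    table: str
    fe_nodes: Optional[str]


def resolve_doris_table(provider: str, options: Mapping[str, str]) -> Optional[DorisTable]:
    """Return the Doris table a relation points at, or None if it is not a Doris relation.

    Assumption: doris.table.identifier is of the form "database.table".
    """
    if provider.strip().lower() != "doris":
        return None
    identifier = options.get("doris.table.identifier", "").strip()
    # Split on the first dot only: left side is the database, right side the table.
    database, sep, table = identifier.partition(".")
    if not sep or not database or not table:
        # Missing or malformed identifier: better to emit nothing than a wrong dataset.
        return None
    return DorisTable(database, table, options.get("doris.fenodes"))
```

Returning None for a missing or malformed identifier is a deliberate choice in this sketch: the agent can then fall back to its existing behavior instead of emitting a mislabeled entity.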

Example of expected URN:
urn:li:dataset:(urn:li:dataPlatform:doris,Data%20Governance.schema.table,PROD)
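A rough sketch of how such a URN could be assembled is shown below. The function name `doris_dataset_urn` is hypothetical, and treating the `Data%20Governance` prefix as an optional platform-instance-style prefix is an assumption made for illustration, not something this issue specifies:

```python
from typing import Optional


def doris_dataset_urn(
    table_identifier: str,
    env: str = "PROD",
    platform_instance: Optional[str] = None,
) -> str:
    """Build a DataHub dataset URN string for a Doris table.

    Assumptions: table_identifier is taken from doris.table.identifier as-is,
    and platform_instance (if given) is already URL-encoded by the caller.
    """
    # Prefix the dataset name only when a prefix is configured.
    name = f"{platform_instance}.{table_identifier}" if platform_instance else table_identifier
    # Environment values such as PROD or DEV are conventionally upper-case.
    return f"urn:li:dataset:(urn:li:dataPlatform:doris,{name},{env.upper()})"


print(doris_dataset_urn("schema.table", env="prod", platform_instance="Data%20Governance"))
```

With these inputs the function reproduces the example URN above; without a prefix it yields `urn:li:dataset:(urn:li:dataPlatform:doris,schema.table,PROD)`.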

Adding this support would significantly improve the data governance experience for the growing number of Apache Doris users who utilize Spark for their ETL/ELT pipelines.
