Description
Feature Description
Add a native resolver to the acryl-spark-lineage agent to support the official spark-doris-connector. The agent should recognize Spark LogicalPlan nodes (such as DataSourceV2Relation) whose provider is doris and automatically extract the dataset metadata from the doris.table.identifier option.
Why is this needed?
Plug-and-Play Experience: Users currently have to write manual emitters or custom Scala extensions to capture Doris lineage, which is difficult to scale across hundreds of ETL jobs.
Prevent Lineage Breaks: In architectures moving data from Hudi to Doris, the lineage chain often breaks because the agent cannot identify the Doris sink, leading to missing or incorrectly labeled (hdfs/file) entities.
URN Standardization: Automates the generation of consistent URNs (e.g., urn:li:dataset:(urn:li:dataPlatform:doris,...)), ensuring unified data governance and preventing duplicate entities.
Environment Support: Honor the spark.datahub.cluster (or env) configuration to correctly set the environment tag (e.g., PROD, DEV).
Describe alternatives you've considered
Manual DataHub Emitter: Using the Python SDK to manually emit lineage after the Spark action. This is intrusive to the ETL code and hard to maintain across hundreds of jobs.
OpenLineage Custom Extensions: Writing a custom QueryPlanVisitor for OpenLineage. While this works, it requires extra Scala development and maintenance of custom JARs outside the official DataHub release.
Additional context
The spark-doris-connector (official Apache Doris connector) typically uses the following options in Spark:
doris.table.identifier: The fully qualified name of the Doris table.
doris.fenodes: The FE node address.
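To illustrate the matching step requested above, here is a minimal Python sketch of how a resolver might decide that a relation is a Doris table and pull the database and table out of its options. The function name `extract_doris_table` and the idea of receiving the provider name and options as plain values are assumptions made for illustration; the real agent works on Spark LogicalPlan nodes on the JVM. The sketch also assumes the identifier has the form `database.table`.

```python
from typing import Mapping, Optional, Tuple

DORIS_PROVIDER = "doris"
TABLE_OPTION = "doris.table.identifier"  # option name taken from this issue


def extract_doris_table(
    provider: str, options: Mapping[str, str]
) -> Optional[Tuple[str, str]]:
    """Hypothetical helper: return (database, table) for a Doris relation.

    Returns None when the provider is not doris or when the identifier
    is missing or not in the assumed `database.table` form.
    """
    if provider.strip().lower() != DORIS_PROVIDER:
        return None

    identifier = options.get(TABLE_OPTION, "").strip()
    # Split on the first dot only; everything after it is treated as the table.
    database, sep, table = identifier.partition(".")
    if not sep or not database or not table:
        return None
    return database, table


if __name__ == "__main__":
    opts = {
        "doris.table.identifier": "sales.orders",  # illustrative values
        "doris.fenodes": "fe-host:8030",
    }
    print(extract_doris_table("doris", opts))
    print(extract_doris_table("jdbc", opts))
```

Returning None instead of raising lets a caller fall through to other resolvers when the relation is not a Doris one.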
Example of expected URN:
urn:li:dataset:(urn:li:dataPlatform:doris,Data%20Governance.schema.table,PROD)
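A rough sketch of how such a URN could be assembled is shown below. The helper `build_doris_urn` and its `prefix` parameter are hypothetical: this issue does not say where the `Data%20Governance` segment comes from, so the sketch simply treats it as an optional, percent-encoded prefix. The `env` argument stands in for whatever value the agent would read from its configuration.

```python
from typing import Optional
from urllib.parse import quote


def build_doris_urn(
    table_identifier: str, env: str = "PROD", prefix: Optional[str] = None
) -> str:
    """Hypothetical helper: build a DataHub dataset URN for a Doris table.

    table_identifier: value of doris.table.identifier.
    env: environment tag such as PROD or DEV.
    prefix: assumed optional name prefix, percent-encoded before use.
    """
    name = table_identifier.strip()
    if prefix:
        # quote() encodes spaces as %20, so the URN stays free of raw spaces.
        name = f"{quote(prefix)}.{name}"
    return f"urn:li:dataset:(urn:li:dataPlatform:doris,{name},{env.upper()})"


if __name__ == "__main__":
    print(build_doris_urn("schema.table", env="PROD", prefix="Data Governance"))
```

Deriving the URN from the table identifier and environment alone means every job that writes to the same Doris table produces the same URN, which is what prevents duplicate entities.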
Adding this support would significantly improve the data governance experience for the growing number of Apache Doris users who utilize Spark for their ETL/ELT pipelines.