Integrate spline with pyspark #862

Nandini-2801 · 2025-02-10T07:40:54Z

I have been trying to run this configuration in a Jupyter workbook with an EMR Serverless application attached.

from pyspark.sql import SparkSession

# The builder preamble and getOrCreate() are assumed; the original post showed only the .config(...) chain.
spark = SparkSession.builder \
    .config("spark.jars",
            "s3://bucket/spark-3.5-spline-agent-bundle_2.12-2.2.1.jar") \
    .config("spark.sql.queryExecutionListeners", "za.co.absa.spline.harvester.listener.SplineQueryExecutionListener") \
    .config("spark.spline.lineageDispatcher", "console,file") \
    .config("spark.spline.lineageDispatcher.file.className", "za.co.absa.spline.harvester.dispatcher.FileLineageDispatcher") \
    .config("spark.spline.lineageDispatcher.file.fileName",
            "s3://bucket/spline_workbook/lineage.csv") \
    .getOrCreate()

The script I am trying to run:

import pyspark.sql.functions as sf  # needed for sf.sum below

# Read the employees file and rename one column.
empsDF = spark.read \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .csv(input_file_1)
empsDF1 = empsDF.withColumnRenamed('name', 'Name')
empsDF1.show()

# Read the departments file.
deptsDF = spark.read \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .csv(input_file_2)

# Left-join employees to departments and write the result.
resultsDF = empsDF1.join(deptsDF, empsDF1.dept_id == deptsDF.dept_id1, "left_outer")
resultsDF.write.csv(output_file_1, header=True, mode="overwrite")

# Total salary per manager, written out as a single partition.
xdf = empsDF.groupBy('manager_id')
ydf = xdf.agg(sf.sum('salary').alias('total_salary'))
ydf.show()
ydf.coalesce(1).write.csv(output_file_2, header=True, mode="overwrite")

However, even though the run is successful, the lineage file is not created at the S3 location.
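One way to narrow this down might be to check whether the running session actually carries the Spline-related settings, since `spark.sql.queryExecutionListeners` is a static setting that has to be in place before the session is created. The sketch below is only an illustration: `spline_settings` and `SPLINE_PREFIXES` are hypothetical names, and it assumes the pairs come from `spark.sparkContext.getConf().getAll()` in the notebook.

```python
# Hypothetical helper: keeps only the settings that matter for the Spline agent.
SPLINE_PREFIXES = (
    "spark.spline.",
    "spark.sql.queryExecutionListeners",
    "spark.jars",
)


def spline_settings(conf_pairs):
    """Return the Spline-related (key, value) pairs, sorted by key."""
    return sorted(
        (key, value)
        for key, value in conf_pairs
        if key.startswith(SPLINE_PREFIXES)
    )


# Assumed usage inside the notebook, where `spark` is the active session:
#   for key, value in spline_settings(spark.sparkContext.getConf().getAll()):
#       print(key, "=", value)
```

If the listener or dispatcher keys are missing from that output, the settings were probably not applied to the session that ran the job.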

wajda · 2025-02-10T12:53:00Z

Why do you think it should be created in S3 location? Which location?

Nandini-2801 · 2025-02-11T05:45:00Z

Hi @wajda, I am using the file lineage dispatcher, so I expected the lineage document to be created at the S3 location "s3://bucket/spline_workbook/lineage.csv", which is the value I passed to spark.spline.lineageDispatcher.file.fileName. If that is not how it works, please point me in the right direction. I appreciate your support.

Nandini-2801 · 2025-02-11T06:29:39Z

I am trying to use Spline for AWS EMR Serverless jobs. Can you please guide me through the steps to do so? Thanks.

wajda transferred this issue from AbsaOSS/spline Feb 10, 2025

Nandini-2801 commented Feb 10, 2025 • edited by wajda
