Integrate spline with pyspark #862

Open
Nandini-2801 opened this issue Feb 10, 2025 · 3 comments
Comments

Nandini-2801 commented Feb 10, 2025

I have been trying to run the following configuration in a Jupyter workbook with an EMR Serverless application attached:

from pyspark.sql import SparkSession

# Assumed: the fragment is part of a SparkSession.builder chain.
# Loads the Spline agent bundle, registers its listener, and sets the dispatchers.
spark = SparkSession.builder \
    .config("spark.jars",
            "s3://bucket/spark-3.5-spline-agent-bundle_2.12-2.2.1.jar") \
    .config("spark.sql.queryExecutionListeners", "za.co.absa.spline.harvester.listener.SplineQueryExecutionListener") \
    .config("spark.spline.lineageDispatcher", "console,file") \
    .config("spark.spline.lineageDispatcher.file.className", "za.co.absa.spline.harvester.dispatcher.FileLineageDispatcher") \
    .config("spark.spline.lineageDispatcher.file.fileName",
            "s3://bucket/spline_workbook/lineage.csv") \
    .getOrCreate()

The script I am trying to run:

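from pyspark.sql import functions as sf

# input_file_1, input_file_2, output_file_1 and output_file_2 are paths defined elsewhere.

# Read the employees CSV and rename one column.
empsDF = spark.read \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .csv(input_file_1)
empsDF1 = empsDF.withColumnRenamed("name", "Name")
empsDF1.show()

# Read the departments CSV.
deptsDF = spark.read \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .csv(input_file_2)

# Left-outer join employees to departments and write the result.
resultsDF = empsDF1.join(deptsDF, empsDF1.dept_id == deptsDF.dept_id1, "left_outer")
resultsDF.write.csv(output_file_1, header=True, mode="overwrite")

# Total salary per manager, written from a single partition.
xdf = empsDF.groupBy("manager_id")
ydf = xdf.agg(sf.sum("salary").alias("total_salary"))
ydf.show()
ydf.coalesce(1).write.csv(output_file_2, header=True, mode="overwrite")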

However, even though the run is successful, the lineage file is not created at the S3 location.
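One check that may help narrow this down is whether the settings actually reached the running session. Settings such as spark.jars and spark.sql.queryExecutionListeners are read when a session starts, so if the notebook attaches to an already-running session they may not have been applied. The sketch below is only an illustration: the helper name spline_settings is hypothetical, and it assumes the pairs come from spark.sparkContext.getConf().getAll() on a live session named spark.

```python
def spline_settings(conf_pairs):
    """Return only the Spline-related entries from (key, value) pairs.

    `conf_pairs` is assumed to be a list of tuples such as the one returned by
    spark.sparkContext.getConf().getAll(); the helper name is hypothetical.
    """
    wanted_prefixes = ("spark.spline.",)
    wanted_keys = {"spark.jars", "spark.sql.queryExecutionListeners"}
    return {
        key: value
        for key, value in conf_pairs
        if key in wanted_keys or key.startswith(wanted_prefixes)
    }


# Stand-alone demonstration with made-up pairs (no Spark session needed).
sample = [
    ("spark.app.name", "demo"),
    ("spark.spline.lineageDispatcher", "console,file"),
]
print(spline_settings(sample))

# In the notebook, with a live session named `spark`:
#   pairs = spark.sparkContext.getConf().getAll()
#   for key, value in sorted(spline_settings(pairs).items()):
#       print(key, "=", value)
```

If the printed result is empty or is missing the listener entry, the configuration was probably not picked up by the session, which would be a separate matter from where the dispatcher writes its file.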

wajda transferred this issue from AbsaOSS/spline Feb 10, 2025
Contributor

wajda commented Feb 10, 2025

Why do you expect it to be created in an S3 location? Which location?

@Nandini-2801
Author

Hi @wajda, I am using the file lineage dispatcher, so I expected the lineage document to be created at the S3 location "s3://bucket/spline_workbook/lineage.csv", which I passed as the value of spark.spline.lineageDispatcher.file.fileName. If that is not how it works, please point me in the right direction. I appreciate your support.

@Nandini-2801
Author

I am trying to use Spline for AWS EMR Serverless jobs. Could you please guide me through the steps to do so? Thanks.
