[Bug] OSS access timed out during the transaction commit phase, resulting in the failure of a long-running Spark write job #7459

@MaxLinyun

Description

Search before asking

  • I searched in the issues and found nothing similar.

Paimon version

1.2

Compute Engine

spark 3.3

Minimal reproduce step

Run an INSERT OVERWRITE ... SELECT statement from Spark SQL against a Paimon table.

What doesn't meet your expectations?

diagnostics: User class threw exception: java.lang.RuntimeException: java.lang.RuntimeException: Exception occurs when preparing snapshot #37 by user 04be6b3a-7459-4ebe-bc84-2f6e53f08a07 with hash 9223372036854775807 and kind OVERWRITE. Clean up.
at org.apache.paimon.spark.commands.PaimonSparkWriter.commit(PaimonSparkWriter.scala:331)
at org.apache.paimon.spark.commands.WriteIntoPaimonTable.run(WriteIntoPaimonTable.scala:80)
at org.apache.paimon.spark.commands.PaimonDynamicPartitionOverwriteCommand.run(PaimonDynamicPartitionOverwriteCommand.scala:69)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:84)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:82)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:93)
at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$3(QueryExecution.scala:117)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:111)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:186)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:97)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:803)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65)
at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:117)
at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:93)
at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:584)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:176)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:584)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:30)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:560)
at org.apache.spark.sql.execution.QueryExecution.eagerlyExecuteCommands(QueryExecution.scala:93)
at org.apache.spark.sql.execution.QueryExecution.commandExecuted$lzycompute(QueryExecution.scala:80)
at org.apache.spark.sql.execution.QueryExecution.commandExecuted(QueryExecution.scala:78)
at org.apache.spark.sql.Dataset.&lt;init&gt;(Dataset.scala:220)
at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:100)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:803)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:97)
at org.apache.spark.sql.SparkSession.$anonfun$sql$3(SparkSession.scala:641)
at org.apache.spark.sql.SparkSession.$anonfun$sql$3$adapted(SparkSession.scala:636)
at org.apache.spark.sql.util.SqlRetryHelper$.wrapBodyWithRetry(SqlRetryHelper.scala:121)
at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:636)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:803)
at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:621)
at DataWorksSparkSqlRunner$.$anonfun$main$1(DataWorksSparkSqlRunner.scala:46)
at DataWorksSparkSqlRunner$.$anonfun$main$1$adapted(DataWorksSparkSqlRunner.scala:43)
at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
at DataWorksSparkSqlRunner$.main(DataWorksSparkSqlRunner.scala:43)
at DataWorksSparkSqlRunner.main(DataWorksSparkSqlRunner.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:739)
Caused by: java.lang.RuntimeException: Exception occurs when preparing snapshot #37 by user 04be6b3a-7459-4ebe-bc84-2f6e53f08a07 with hash 9223372036854775807 and kind OVERWRITE. Clean up.
at org.apache.paimon.operation.FileStoreCommitImpl.tryCommitOnce(FileStoreCommitImpl.java:1090)
at org.apache.paimon.operation.FileStoreCommitImpl.tryCommit(FileStoreCommitImpl.java:782)
at org.apache.paimon.operation.FileStoreCommitImpl.tryOverwrite(FileStoreCommitImpl.java:861)
at org.apache.paimon.operation.FileStoreCommitImpl.overwrite(FileStoreCommitImpl.java:500)
at org.apache.paimon.table.sink.TableCommitImpl.commitMultiple(TableCommitImpl.java:236)
at org.apache.paimon.table.sink.TableCommitImpl.commit(TableCommitImpl.java:211)
at org.apache.paimon.table.sink.TableCommitImpl.commit(TableCommitImpl.java:190)
at org.apache.paimon.table.sink.TableCommitImpl.commit(TableCommitImpl.java:159)
at org.apache.paimon.spark.commands.PaimonSparkWriter.commit(PaimonSparkWriter.scala:329)
... 47 more
Caused by: java.lang.RuntimeException: java.io.IOException: Close stream oss://my_bucket/my_warehouse/my_db.db/my_table/manifest/manifest-84a53d0a-0ef4-49a8-b7a8-1b1a7ad40504-0 error: [HostId]: oss-cn-hangzhou-internal.aliyuncs.com [ErrorMessage]: [E1008]Reached timeout=30000ms @100.114.102.201:80 [ErrorCode]: 1008 [RequestId]:
at org.apache.paimon.operation.ManifestFileMerger.merge(ManifestFileMerger.java:95)
at org.apache.paimon.operation.FileStoreCommitImpl.tryCommitOnce(FileStoreCommitImpl.java:995)
... 55 more
Caused by: java.io.IOException: Close stream oss://my_bucket/my_warehouse/my_db.db/my_table/manifest/manifest-84a53d0a-0ef4-49a8-b7a8-1b1a7ad40504-0 error: [HostId]: oss-cn-hangzhou-internal.aliyuncs.com [ErrorMessage]: [E1008]Reached timeout=30000ms @100.114.102.201:80 [ErrorCode]: 1008 [RequestId]:
at com.aliyun.jindodata.common.JindoOutputStream.close(JindoOutputStream.java:148)
at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72)
at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:101)
at org.apache.paimon.fs.hadoop.HadoopFileIO$HadoopPositionOutputStream.close(HadoopFileIO.java:325)
at org.apache.paimon.io.SingleFileWriter.close(SingleFileWriter.java:185)
at org.apache.paimon.io.RollingFileWriter.closeCurrentWriter(RollingFileWriter.java:130)
at org.apache.paimon.io.RollingFileWriter.write(RollingFileWriter.java:84)
at org.apache.paimon.io.FileWriter.write(FileWriter.java:91)
at org.apache.paimon.operation.ManifestFileMerger.tryFullCompaction(ManifestFileMerger.java:260)
at org.apache.paimon.operation.ManifestFileMerger.merge(ManifestFileMerger.java:73)
... 56 more
Caused by: java.io.IOException: [HostId]: oss-cn-hangzhou-internal.aliyuncs.com [ErrorMessage]: [E1008]Reached timeout=30000ms @100.114.102.201:80 [ErrorCode]: 1008 [RequestId]:
at com.aliyun.jindodata.jnative.JindoNativeWriter.finalizeFile(Native Method)
at com.aliyun.jindodata.common.JindoOutputStream.close(JindoOutputStream.java:141)
... 65 more

Anything else?

The INSERT OVERWRITE ... SELECT command ran on Spark for over 2 hours, yet the whole job failed in the final commit phase because a single write to OSS timed out. Losing 2+ hours of work to one transient I/O error is unacceptable. Catching such transient exceptions during the commit process and retrying the failed operation would prevent this failure mode.
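The retry idea could look roughly like the sketch below. This is a minimal, hypothetical illustration, not actual Paimon code: the class and method names (CommitRetry, withRetry) are invented, and a real fix inside FileStoreCommitImpl would also need to ensure the retried operation (e.g. rewriting the manifest file) is idempotent before it is safe to retry.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

/**
 * Hypothetical sketch: retry a commit-phase action when it fails with a
 * transient I/O error (such as the OSS [E1008] close-stream timeout above),
 * using exponential backoff between attempts.
 */
public class CommitRetry {

    public static <T> T withRetry(Callable<T> action, int maxAttempts, long baseBackoffMs)
            throws Exception {
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.call();
            } catch (IOException e) {
                // Treat IOException as transient (e.g. an OSS timeout) and retry.
                last = e;
                if (attempt < maxAttempts) {
                    // Exponential backoff: base, 2x base, 4x base, ...
                    Thread.sleep(baseBackoffMs * (1L << (attempt - 1)));
                }
            }
        }
        // All attempts failed; surface the last transient error.
        throw last;
    }
}
```

In the real code path the retried unit would be the manifest write/merge step that failed here (ManifestFileMerger rewrites merged manifests to new file names, so re-running it should be safe), rather than the whole commit.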

Are you willing to submit a PR?

  • I'm willing to submit a PR!
