
fix the issue reported in issue #247 #349

Open
kendrick-ren wants to merge 2 commits into scylladb:master from kendrick-ren:s3-export-savepoints-error

Conversation

@kendrick-ren
Collaborator

This PR is to fix the issue reported in #247

The root cause is that DynamoDbSavepointsManager only handles HadoopPartition instances, while the migrateFromS3Export path builds its RDD with spark.parallelize(), which produces ParallelCollectionPartition instead.

When the savepoints manager tries to extract the DynamoDBSplit segment from such a partition, it fails.

The PR adds a parameter to the underlying migrate() method indicating which kind of partition is involved, and invokes writers.DynamoDB.writeRDD for S3 ParallelCollectionPartition paths.

The PR does not add savepoint support for the S3 path, since that would require a different mechanism to track the processed files and filter them out on restart. This can be a future enhancement.
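The dispatch described above can be sketched roughly as follows. This is an illustrative model only: the stub Partition types and the chooseWriter name are assumptions standing in for Spark's real partition classes and the migrator's actual API.

```scala
// Hypothetical sketch of the partition-type dispatch described in this PR.
// The types and names below are stand-ins, not the real Spark/migrator API.
object MigrateDispatch {
  sealed trait Partition
  final case class HadoopPartition(split: String) extends Partition
  final case class ParallelCollectionPartition(index: Int) extends Partition

  // DynamoDB-export paths carry HadoopPartitions, from which a split/segment
  // can be extracted for savepoint tracking; S3 export paths do not.
  def chooseWriter(p: Partition): String = p match {
    case _: HadoopPartition             => "savepoint-managed write"
    case _: ParallelCollectionPartition => "plain writeRDD"
  }
}
```

Under this sketch, an S3 export source (ParallelCollectionPartition) is routed to a plain write with no savepoints manager in the loop, which mirrors the fix's intent.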

@kendrick-ren kendrick-ren force-pushed the s3-export-savepoints-error branch 3 times, most recently from d47d40e to 87b4563 Compare April 7, 2026 14:54
Contributor

@dkropachev dkropachev left a comment


The behavior change itself looks reasonable, but the new test does not actually verify it. Please tighten the coverage so the test fails on the pre-PR implementation and passes only when the S3 export path really skips savepoint management.

Comment on lines +71 to +89
// Perform the migration and verify no savepoint files are created (issue #247:
// S3 export sources use ParallelCollectionPartitions, not HadoopPartitions,
// so DynamoDbSavepointsManager must not be instantiated for this path).
val beforeMigration = System.currentTimeMillis()
successfullyPerformMigration(configFile)

val savepointsDir = Paths.get("docker/spark-master")
if (Files.exists(savepointsDir)) {
  Using.resource(Files.list(savepointsDir)) { stream =>
    val newSavepoints = stream
      .iterator()
      .asScala
      .filter(p => Files.isRegularFile(p))
      .filter(_.getFileName.toString.startsWith("savepoint_"))
      .filter(p => Files.getLastModifiedTime(p).toMillis >= beforeMigration)
      .toList
    assert(
      newSavepoints.isEmpty,
      s"S3 export import should not create savepoint files, but found: $newSavepoints"
    )
  }
}
Contributor


This assertion does not actually prove that DynamoDbSavepointsManager was skipped. Savepoints are only written on signal or after intervalSeconds elapses, and close() does not emit a final savepoint. With the current test config (intervalSeconds: 300), the pre-PR implementation would also usually leave no savepoint_* file here, so this test can still pass even if the savepoints manager is instantiated. Please make the test observe the behavior more directly.
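The reviewer's timing argument can be modeled as a small predicate. This is an editorial sketch under stated assumptions: the name wouldWriteSavepoint and the interval semantics are illustrative, mirroring the "only on signal or after intervalSeconds elapses" behavior described above, not the migrator's real code.

```scala
// Illustrative model of the timing argument (hypothetical names):
// a savepoint file appears only if the manager is instantiated AND the
// run lasts at least one savepoint interval.
object SavepointTiming {
  def wouldWriteSavepoint(managerInstantiated: Boolean,
                          runSeconds: Int,
                          intervalSeconds: Int): Boolean =
    managerInstantiated && runSeconds >= intervalSeconds
}
```

With intervalSeconds = 300 and a short test run, even the pre-PR implementation (manager instantiated) writes nothing, so asserting "no savepoint files" cannot distinguish the two implementations; only a much shorter interval, a signal, or a direct observation of whether the manager is created would make the test discriminating.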

Collaborator Author


@dkropachev enhanced the test case, please take a look.

@kendrick-ren kendrick-ren force-pushed the s3-export-savepoints-error branch from 87b4563 to f58e321 Compare April 9, 2026 01:29
@kendrick-ren kendrick-ren requested a review from dkropachev April 9, 2026 01:57
