Align repairTable behavior with master#3561
Conversation
There was a problem hiding this comment.
Code Review
This pull request updates the repairTable and repairCoordinatorTables methods across all storage implementations (Cassandra, Cosmos, DynamoDB, and JDBC) to re-create missing tables, secondary indexes, and metadata instead of throwing an exception. It also introduces logic in ConsensusCommitAdmin to handle coordinator table schema upgrades when group commit is toggled. Feedback was provided for the DynamoDB implementation to ensure that tables are in an ACTIVE state before attempting to enable auto-scaling or continuous backups during a repair operation.
There was a problem hiding this comment.
Pull request overview
Backports the expanded Admin#repairTable semantics from master to branch 3, turning it into a reconciliation operation that can recreate missing tables/containers, repair secondary indexes (and Dynamo GSIs / Cosmos indexing policy), and refresh ScalarDB metadata. It also extends ConsensusCommitAdmin#repairCoordinatorTables to recover missing coordinator namespace/tables and reconcile the coordinator schema with the runtime group-commit configuration.
Changes:
- Updated
repairTableimplementations across JDBC/Cassandra/Cosmos/Dynamo to (re)create missing tables and indexes/GSIs/IndexingPolicy and upsert metadata instead of failing on missing tables. - Extended
ConsensusCommitAdmin#repairCoordinatorTablesto create the coordinator namespace (if missing) and reconcile the coordinator schema (includingchild_idscolumn handling). - Updated/added unit and integration tests to validate the new reconciliation behavior and remove prior “throws on missing table” expectations.
Reviewed changes
Copilot reviewed 19 out of 19 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| core/src/main/java/com/scalar/db/api/Admin.java | Updates repairTable Javadoc to reflect reconciliation semantics. |
| core/src/main/java/com/scalar/db/storage/jdbc/JdbcAdmin.java | Makes repairTable create the table (if missing) and upsert metadata. |
| core/src/main/java/com/scalar/db/storage/cassandra/CassandraAdmin.java | Adds IF NOT EXISTS support for table/index creation and aligns repairTable behavior. |
| core/src/main/java/com/scalar/db/storage/cosmos/CosmosAdmin.java | Makes repairTable recreate missing containers, ensure stored procedure, and refresh IndexingPolicy/metadata. |
| core/src/main/java/com/scalar/db/storage/dynamo/DynamoAdmin.java | Refactors to createTableInternal(...ifNotExists...), adds existence checks, and aligns repairTable to recreate tables and missing GSIs. |
| core/src/main/java/com/scalar/db/transaction/consensuscommit/ConsensusCommitAdmin.java | Extends repairCoordinatorTables to create namespace, reconcile metadata schema choice, and add missing columns. |
| core/src/test/java/com/scalar/db/storage/jdbc/JdbcAdminTest.java | Consolidates repairTable tests into a parameterized verification-based test. |
| core/src/test/java/com/scalar/db/storage/cassandra/CassandraAdminTest.java | Updates unit tests to reflect “create if missing” + “create indexes if missing” repair behavior. |
| core/src/test/java/com/scalar/db/storage/cosmos/CosmosAdminTestBase.java | Updates repair tests to reflect idempotent creation and IndexingPolicy refresh. |
| core/src/test/java/com/scalar/db/storage/dynamo/DynamoAdminTestBase.java | Updates/extends repair tests to cover creating missing tables/metadata tables and revised creation ordering. |
| core/src/test/java/com/scalar/db/transaction/consensuscommit/ConsensusCommitAdminTestBase.java | Adds coordinator schema reconciliation test scenarios (including child_ids column handling). |
| integration-test/src/main/java/com/scalar/db/api/DistributedStorageAdminRepairTableIntegrationTestBase.java | Adds integration scenarios for “do nothing” and “create table when missing but metadata exists”. |
| integration-test/src/main/java/com/scalar/db/api/DistributedTransactionAdminRepairTableIntegrationTestBase.java | Adds integration scenarios for coordinator table recreation and idempotency. |
| core/src/integration-test/java/com/scalar/db/storage/objectstorage/ObjectStorageAdminRepairTableIntegrationTest.java | Removes disabled overrides tied to old “throws on missing table” behavior. |
| core/src/integration-test/java/com/scalar/db/storage/objectstorage/ConsensusCommitAdminRepairTableIntegrationTestWithObjectStorage.java | Removes disabled overrides tied to old coordinator repair behavior. |
| core/src/integration-test/java/com/scalar/db/storage/dynamo/DynamoSchemaLoaderIntegrationTest.java | Adds --no-scaling to make schema-loader repair flow work with DynamoDB Local. |
| core/src/integration-test/java/com/scalar/db/storage/cassandra/CassandraAdminTestUtils.java | Adds constructor to allow sharing a ClusterManager. |
| core/src/integration-test/java/com/scalar/db/storage/cassandra/CassandraAdminRepairTableIntegrationTest.java | Overrides setup to share ClusterManager to reduce flakiness from async schema propagation. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
f340f1c to
3f075f4
Compare
Verifies that calling repairCoordinatorTables with scalar.db.consensus_commit.coordinator.group_commit.enabled toggled from false to true against an existing pre-group-commit Coordinator table issues ALTER TABLE ADD COLUMN to add the child_ids column.
3f075f4 to
5b1df19
Compare
|
/gemini review |
There was a problem hiding this comment.
Code Review
The pull request refactors the repairTable and repairCoordinatorTables functionality across all supported storage backends to allow recreating missing tables, secondary indexes, and metadata. Previously, these methods often required the physical table to exist; now they attempt to create it if missing. Significant changes were made to ConsensusCommitAdmin to support upgrading the coordinator table schema during repair, particularly for group commit support. Review feedback suggests moving the table creation wait logic in DynamoDB to handle tables in a CREATING state, removing the fixed retry limit for DynamoDB index status checks to accommodate long-running GSI operations, and adding error handling when retrieving metadata in repairCoordinatorTables to better handle corrupted states.
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request refactors the repairTable implementation across Cassandra, Cosmos, DynamoDB, and JDBC backends to ensure that missing tables, secondary indexes, and metadata are re-created during the repair process instead of throwing an exception. It also updates ConsensusCommitAdmin to handle coordinator table schema upgrades, specifically for adding the child_ids column when group commit is enabled. Unit and integration tests have been updated to reflect these behavioral changes. Feedback was provided regarding the 30-second timeout for DynamoDB Global Secondary Index activation in rawIndexExists, which may be insufficient for large tables and should potentially be made configurable.
Description
Admin#repairTablesemantics have diverged significantly betweenmasterand branch3. Onmaster, #1125 (2023-10) reworkedrepairTablefrom a metadata-only fixer into a true reconciliation operation that can rebuild the table itself, its secondary indexes, and its metadata in a single call. The change was never backported to3, whererepairTablestill has a much narrower scope — for example, onCassandraAdminit is effectively a no-op, and on the other Admins it only touches the metadata table.This PR backports the wider repair scope to branch
3. Operators who run into a partially-broken schema (table dropped manually, secondary index lost, IndexingPolicy/GSI missing, Coordinator namespace dropped, Coordinator schema lagging the current group commit setting, etc.) can now recover with a singlerepairTable/repairCoordinatorTablescall instead of having to drop and recreate by hand. As a side effect, future backports frommasterthat touch this area land cleanly instead of conflicting with the divergence.Behavior change / improvement
Related issues and/or PRs
#1125 Revise Admin's repair table behavior(commit392ac066e).Changes made
Admin#repairTableJavadoc to matchmaster's wording and dropped the@throws IllegalArgumentExceptionline.Admin#repairTablebody to mirrormaster:master.createIndexInternal(... ifNotExists), added anifNotExistsparameter tocreateTableInternalandcreateSecondaryIndexes, then ported the body.ifNotExistsbranches tocreateContainerandaddStoredProcedure, introduced acreateTableInternalwrapper, then ported the body.createTableInternal(... ifNotExists, options)fromcreateTable, introducedinternalTableExistsandrawIndexExists, then ported the body. Also renamedputTableMetadatatoupsertTableMetadatato matchmaster.master.ConsensusCommitAdmin#repairCoordinatorTables:admin.createNamespace(coordinatorNamespace, /* ifNotExists */ true, options)so thatdropCoordinatorTables(true)followed byrepairCoordinatorTables(...)is a complete recovery path even when the namespace is dropped (masterusesrepairNamespace, which does not exist on branch3; the existing defaultAdmin#createNamespace(name, ifNotExists, options)overload provides equivalent behavior).child_idscolumn, preserve the WITH_GROUP_COMMIT_ENABLED schema regardless of the current config (the runtime config independently decides whether to USE the column); otherwise use the config-dependent schema. AfterrepairTable,addNewColumnToTable(...)for any non-key columns the desired schema requires that the existing Coordinator is missing — handles thegroup_commit.enabled = false → trueupgrade in place by issuingALTER TABLE ... ADD COLUMN child_ids. Mirrors the column-migration logic thatmasterprovides via the (master-only)upgradeCoordinatorTable()path ofAdmin#upgrade().JdbcAdminTestrepairTabletests into a single@ParameterizedTest+verify-based test (6 engines × 2 variants + 2 helpers reduced to one), matchingmaster.repairTable/repairCoordinatorTablestests inCassandraAdminTest,CosmosAdminTestBase,DynamoAdminTestBase, andConsensusCommitAdminTestBase. Added new unit-test scenarios for the Coordinator schema upgrade case (existing withoutchild_ids+ group commit enabled → ALTER ADD; existing withchild_ids+ group commit disabled → preserve schema).DistributedStorageAdminRepairTableIntegrationTestBase,DistributedTransactionAdminRepairTableIntegrationTestBase), addedrepairTable_ForExistingTableAndMetadata_ShouldDoNothing,repairTable_ForNonExistingTableButExistingMetadata_ShouldCreateTable,repairCoordinatorTables_CoordinatorTablesDoNotExist_ShouldCreateCoordinatorTables, andrepairCoordinatorTables_CoordinatorTablesExist_ShouldDoNothing, and removed the IAE-expecting tests they replace.@Override @Disabledoverrides in the ObjectStorage IT subclasses for parent test methods that no longer exist after the IAE-expecting tests were deleted.ClusterManagerbetweenadminandadminTestUtilsto avoid the schema-metadata async-propagation issue that maderepairTablefollowed bytableExistsflaky. Added aCassandraAdminTestUtils(Properties, ClusterManager)constructor and overrodesetUp()inCassandraAdminRepairTableIntegrationTestto use a single sharedClusterManager.--no-scalingtoDynamoSchemaLoaderIntegrationTest#getCommandArgsForTableReparationso that the schema-loader repair flow works against DynamoDB Local, which does not support ApplicationAutoScaling. This matches the equivalent option list inmaster's Dynamo schema-loader IT.Checklist
Additional notes (optional)
Two intentional divergences from
masterremain because the missing pieces on branch3are out of scope for this PR:CosmosAdminkeepsputTableMetadatarather than renaming tomaster'supsertTableMetadata.master's 3-argupsertTableMetadatadepends on a different metadata-container bootstrap helper (createMetadataDatabaseAndTableMetadataContainerIfNotExists+getTableMetadataContainer); aligning that requires porting a broader refactor beyond the minimum scope of this PR. Branch3's 4-argputTableMetadatais kept.ConsensusCommitAdmin#repairCoordinatorTablescallscreateNamespace(..., true, options)rather thanrepairNamespace(...), and embeds the Coordinator schema reconciliation step inline rather than calling a separateAdmin#upgrade()API.repairNamespaceandAdmin#upgrade()exist only onmaster; adding them to branch3is out of scope. Equivalent behavior is achieved here via the existingcreateNamespace(name, ifNotExists, options)overload and an inlined upgrade step.For the Cassandra IT fix (shared
ClusterManager),masteroverridesinitialize(String testName)while this PR overridessetUp(). This is because branch3's IT base class createsadmininsidesetUp()rather than ininitialize(). Refactoring the IT base class to matchmaster's structure (initialize()createsadmin,setUp()only creates the table) is a broader change with non-trivial blast radius across all storage IT subclasses, and is left for a separate, dedicated PR rather than mixed into this behavior alignment.Release notes
Expanded the scope of
Admin#repairTableso that it can now recover from a wider range of partially-broken schema states — it recreates the missing table, recreates missing secondary indexes / GSI / IndexingPolicy, and refreshes the metadata in a single call (scalardb-schema-loader --repair-allinherits the same behavior).ConsensusCommitAdmin#repairCoordinatorTableswas similarly extended to recover both the Coordinator namespace and the Coordinator table when missing, and to add thechild_idscolumn to an existing Coordinator table when group commit has been turned on against a pre-existing one. As part of this contract change,Admin#repairTableno longer throwsIllegalArgumentExceptionfor a missing table.