Align repairTable behavior with master #3561

Open
brfrn169 wants to merge 2 commits into 3 from align-repairtable-master-and-3

brfrn169 (Collaborator) commented May 8, 2026

Description

Admin#repairTable semantics have diverged significantly between master and branch 3. On master, #1125 (2023-10) reworked repairTable from a metadata-only fixer into a true reconciliation operation that can rebuild the table itself, its secondary indexes, and its metadata in a single call. The change was never backported to 3, where repairTable still has a much narrower scope — for example, on CassandraAdmin it is effectively a no-op, and on the other Admins it only touches the metadata table.

This PR backports the wider repair scope to branch 3. Operators who run into a partially-broken schema (table dropped manually, secondary index lost, IndexingPolicy/GSI missing, Coordinator namespace dropped, Coordinator schema lagging the current group commit setting, etc.) can now recover with a single repairTable / repairCoordinatorTables call instead of having to drop and recreate by hand. As a side effect, future backports from master that touch this area land cleanly instead of conflicting with the divergence.
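
The reconciliation idea described above can be sketched against an in-memory stand-in for a storage backend. This is a minimal illustration under stated assumptions, not the ScalarDB implementation: `RepairTableSketch` and its fields are hypothetical names invented for this example, and the real Admins issue storage-specific DDL rather than updating sets.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical in-memory stand-in for a storage backend; not a ScalarDB class. */
final class RepairTableSketch {
  // Physical state: existing tables and secondary indexes (named "table.column").
  final Set<String> tables = new HashSet<>();
  final Set<String> indexes = new HashSet<>();
  // ScalarDB-side metadata: table name -> columns that should be indexed.
  final Map<String, List<String>> metadata = new HashMap<>();

  /** Reconciles table, indexes, and metadata; a missing table is created, not rejected. */
  void repairTable(String table, List<String> indexedColumns) {
    tables.add(table); // no-op if the table already exists
    for (String column : indexedColumns) {
      indexes.add(table + "." + column); // no-op if the index already exists
    }
    metadata.put(table, new ArrayList<>(indexedColumns)); // upsert the metadata entry
  }

  public static void main(String[] args) {
    RepairTableSketch storage = new RepairTableSketch();
    List<String> indexed = Arrays.asList("customer_id");
    storage.repairTable("orders", indexed);

    // Simulate a table and its index being dropped by hand.
    storage.tables.remove("orders");
    storage.indexes.remove("orders.customer_id");

    // A single repair call restores the table, the index, and the metadata.
    storage.repairTable("orders", indexed);
    System.out.println(storage.tables.contains("orders"));
    System.out.println(storage.indexes.contains("orders.customer_id"));
  }
}
```

Because every step is a create-if-missing or an upsert, calling the method again on a healthy table changes nothing, which is the property the new integration tests check with the `..._ShouldDoNothing` scenarios.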

Behavior change / improvement

| Storage | Before: what `repairTable` could fix | After: what `repairTable` can fix |
| --- | --- | --- |
| Cassandra | Effectively nothing (no metadata table; threw `IllegalArgumentException` if the table was missing) | Recreates the missing table; recreates missing secondary indexes |
| Cosmos | Metadata + missing stored procedure on an existing container; `IllegalArgumentException` if the container was missing | Recreates the missing container; refreshes the IndexingPolicy; metadata + stored procedure as before |
| Dynamo | Metadata only on an existing table; `IllegalArgumentException` if the table was missing | Recreates the missing table; recreates missing GSI; metadata as before |
| JDBC | Metadata only on an existing table; threw `TABLE_NOT_FOUND` if the table was missing | Recreates the missing table; metadata as before |
| `ConsensusCommitAdmin#repairCoordinatorTables` | Could only recover the Coordinator table (and threw `IllegalArgumentException` if the table was missing); could not recover the namespace; could not migrate the schema when the group commit setting was toggled | Recreates both the Coordinator namespace and the Coordinator table when missing; also reconciles the Coordinator schema with the runtime `group_commit.enabled` config: it adds the `child_ids` column via `ALTER TABLE ... ADD COLUMN` when toggling from disabled to enabled, and preserves the WITH-`child_ids` schema when an existing Coordinator already has the column (so ScalarDB metadata stays aligned with the physical column set across config toggles) |
| `scalardb-schema-loader --repair-all` | Failed with `IllegalArgumentException` if any schema entry's table no longer existed | Reconciles missing tables along the way (inherits the new `repairTable` semantics) |

Related issues and/or PRs

  • Backport of the master change #1125, "Revise Admin's repair table behavior" (commit 392ac066e).

Changes made

  • Updated the Admin#repairTable Javadoc to match master's wording and dropped the @throws IllegalArgumentException line.
  • Replaced each storage Admin#repairTable body to mirror master:
    • JdbcAdmin: body-only swap; the helpers already match master.
    • CassandraAdmin: introduced createIndexInternal(... ifNotExists), added an ifNotExists parameter to createTableInternal and createSecondaryIndexes, then ported the body.
    • CosmosAdmin: added ifNotExists branches to createContainer and addStoredProcedure, introduced a createTableInternal wrapper, then ported the body.
    • DynamoAdmin: extracted createTableInternal(... ifNotExists, options) from createTable, introduced internalTableExists and rawIndexExists, then ported the body. Also renamed putTableMetadata to upsertTableMetadata to match master.
    • ObjectStorageAdmin: no change — the implementation is byte-identical to master.
  • Extended ConsensusCommitAdmin#repairCoordinatorTables:
    • Prepended admin.createNamespace(coordinatorNamespace, /* ifNotExists */ true, options) so that dropCoordinatorTables(true) followed by repairCoordinatorTables(...) is a complete recovery path even when the namespace is dropped (master uses repairNamespace, which does not exist on branch 3; the existing default Admin#createNamespace(name, ifNotExists, options) overload provides equivalent behavior).
    • Snapshotted the pre-repair Coordinator metadata, then picked the desired schema so that the ScalarDB-side metadata always stays aligned with the physical column set: if the existing Coordinator already has the child_ids column, the WITH_GROUP_COMMIT_ENABLED schema is preserved regardless of the current config (the runtime config independently decides whether to use the column); otherwise the config-dependent schema is used. After repairTable, addNewColumnToTable(...) is called for any non-key column that the desired schema requires and the existing Coordinator lacks. This handles the group_commit.enabled = false → true upgrade in place by issuing ALTER TABLE ... ADD COLUMN child_ids, and mirrors the column-migration logic that master provides via the (master-only) upgradeCoordinatorTable() path of Admin#upgrade().
  • Updated unit and integration tests to reflect the new behavior:
    • Consolidated the SQL-string-list JdbcAdminTest repairTable tests into a single @ParameterizedTest + verify-based test (6 engines × 2 variants + 2 helpers reduced to one), matching master.
    • Updated the repairTable / repairCoordinatorTables tests in CassandraAdminTest, CosmosAdminTestBase, DynamoAdminTestBase, and ConsensusCommitAdminTestBase. Added new unit-test scenarios for the Coordinator schema upgrade case (existing without child_ids + group commit enabled → ALTER ADD; existing with child_ids + group commit disabled → preserve schema).
    • In the integration test bases (DistributedStorageAdminRepairTableIntegrationTestBase, DistributedTransactionAdminRepairTableIntegrationTestBase), added repairTable_ForExistingTableAndMetadata_ShouldDoNothing, repairTable_ForNonExistingTableButExistingMetadata_ShouldCreateTable, repairCoordinatorTables_CoordinatorTablesDoNotExist_ShouldCreateCoordinatorTables, and repairCoordinatorTables_CoordinatorTablesExist_ShouldDoNothing, and removed the IAE-expecting tests they replace.
  • Removed dead @Override @Disabled overrides in the ObjectStorage IT subclasses for parent test methods that no longer exist after the IAE-expecting tests were deleted.
  • In the Cassandra IT, shared the ClusterManager between admin and adminTestUtils to avoid the asynchronous schema-metadata propagation issue that made repairTable followed by tableExists flaky. Added a CassandraAdminTestUtils(Properties, ClusterManager) constructor and overrode setUp() in CassandraAdminRepairTableIntegrationTest to use a single shared ClusterManager.
  • Added --no-scaling to DynamoSchemaLoaderIntegrationTest#getCommandArgsForTableReparation so that the schema-loader repair flow works against DynamoDB Local, which does not support ApplicationAutoScaling. This matches the equivalent option list in master's Dynamo schema-loader IT.
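
The Coordinator schema selection described in the repairCoordinatorTables item above can be sketched as follows. This is an illustration under stated assumptions rather than the code in this PR: `CoordinatorSchemaSketch`, `desiredSchema`, and `columnsToAdd` are hypothetical names, and the column sets are illustrative (only `child_ids` is named in the description).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch of the schema choice in repairCoordinatorTables; not the ScalarDB code. */
final class CoordinatorSchemaSketch {
  // Illustrative column sets; only child_ids is taken from the PR description.
  static final Set<String> WITHOUT_GROUP_COMMIT =
      Collections.unmodifiableSet(
          new LinkedHashSet<>(Arrays.asList("tx_id", "tx_state", "tx_created_at")));
  static final Set<String> WITH_GROUP_COMMIT =
      Collections.unmodifiableSet(
          new LinkedHashSet<>(Arrays.asList("tx_id", "tx_state", "tx_created_at", "child_ids")));

  /** Picks the schema to repair with; existingColumns is null when the table does not exist. */
  static Set<String> desiredSchema(Set<String> existingColumns, boolean groupCommitEnabled) {
    if (existingColumns != null && existingColumns.contains("child_ids")) {
      // The physical column already exists, so the metadata must keep describing it.
      return WITH_GROUP_COMMIT;
    }
    return groupCommitEnabled ? WITH_GROUP_COMMIT : WITHOUT_GROUP_COMMIT;
  }

  /** Columns the desired schema requires that a pre-existing table is missing. */
  static List<String> columnsToAdd(Set<String> existingColumns, Set<String> desired) {
    List<String> missing = new ArrayList<>();
    if (existingColumns == null) {
      return missing; // a freshly created table already has the desired schema
    }
    for (String column : desired) {
      if (!existingColumns.contains(column)) {
        missing.add(column);
      }
    }
    return missing;
  }

  public static void main(String[] args) {
    // Existing table without child_ids, group commit now enabled: one column to add.
    Set<String> desired = desiredSchema(WITHOUT_GROUP_COMMIT, true);
    System.out.println(columnsToAdd(WITHOUT_GROUP_COMMIT, desired));
    // Existing table with child_ids, group commit now disabled: the schema is preserved.
    System.out.println(desiredSchema(WITH_GROUP_COMMIT, false).contains("child_ids"));
  }
}
```

The existing column set has to be captured before the repair runs, because after the repair the metadata already reflects the desired schema and the missing column could no longer be detected.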

Checklist

  • I have commented my code, particularly in hard-to-understand areas.
  • I have updated the documentation to reflect the changes.
  • I have considered whether similar issues could occur in other products, components, or modules if this PR is for bug fixes.
  • Any remaining open issues linked to this PR are documented and up-to-date (Jira, GitHub, etc.).
  • Tests (unit, integration, etc.) have been added for the changes.
  • My changes generate no new warnings.
  • Any dependent changes in other PRs have been merged and published.

Additional notes (optional)

Two intentional divergences from master remain because the missing pieces on branch 3 are out of scope for this PR:

  1. CosmosAdmin keeps putTableMetadata rather than renaming to master's upsertTableMetadata. master's 3-arg upsertTableMetadata depends on a different metadata-container bootstrap helper (createMetadataDatabaseAndTableMetadataContainerIfNotExists + getTableMetadataContainer); aligning that requires porting a broader refactor beyond the minimum scope of this PR. Branch 3's 4-arg putTableMetadata is kept.

  2. ConsensusCommitAdmin#repairCoordinatorTables calls createNamespace(..., true, options) rather than repairNamespace(...), and embeds the Coordinator schema reconciliation step inline rather than calling a separate Admin#upgrade() API. repairNamespace and Admin#upgrade() exist only on master; adding them to branch 3 is out of scope. Equivalent behavior is achieved here via the existing createNamespace(name, ifNotExists, options) overload and an inlined upgrade step.
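
Why a create-if-not-exists namespace call can stand in for a dedicated namespace repair is easiest to see in a small model. The class below is a hypothetical in-memory sketch: `NamespaceRecoverySketch` and its methods are invented for illustration, and the table name `state` is an assumption, not something stated in this PR.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical in-memory model of the namespace-then-table recovery order. */
final class NamespaceRecoverySketch {
  private final Set<String> namespaces = new HashSet<>();
  private final Set<String> tables = new HashSet<>();

  /** With ifNotExists = true this is idempotent; with false it rejects an existing namespace. */
  void createNamespace(String name, boolean ifNotExists) {
    if (!namespaces.add(name) && !ifNotExists) {
      throw new IllegalArgumentException("Namespace already exists: " + name);
    }
  }

  /** Dropping a namespace also removes the tables inside it. */
  void dropNamespace(String name) {
    namespaces.remove(name);
    tables.removeIf(t -> t.startsWith(name + "."));
  }

  /** Repairs the namespace first, then the table ("state" is an illustrative name). */
  void repairCoordinatorTables(String namespace) {
    createNamespace(namespace, true);
    tables.add(namespace + ".state");
  }

  boolean coordinatorTableExists(String namespace) {
    return namespaces.contains(namespace) && tables.contains(namespace + ".state");
  }

  public static void main(String[] args) {
    NamespaceRecoverySketch admin = new NamespaceRecoverySketch();
    admin.repairCoordinatorTables("coordinator");
    admin.dropNamespace("coordinator"); // simulate the namespace being dropped
    admin.repairCoordinatorTables("coordinator"); // recovers namespace and table
    System.out.println(admin.coordinatorTableExists("coordinator"));
  }
}
```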

For the Cassandra IT fix (shared ClusterManager), master overrides initialize(String testName) while this PR overrides setUp(). This is because branch 3's IT base class creates admin inside setUp() rather than in initialize(). Refactoring the IT base class to match master's structure (initialize() creates admin, setUp() only creates the table) is a broader change with non-trivial blast radius across all storage IT subclasses, and is left for a separate, dedicated PR rather than mixed into this behavior alignment.

Release notes

Expanded the scope of Admin#repairTable so that it can now recover from a wider range of partially-broken schema states — it recreates the missing table, recreates missing secondary indexes / GSI / IndexingPolicy, and refreshes the metadata in a single call (scalardb-schema-loader --repair-all inherits the same behavior). ConsensusCommitAdmin#repairCoordinatorTables was similarly extended to recover both the Coordinator namespace and the Coordinator table when missing, and to add the child_ids column to an existing Coordinator table when group commit has been turned on against a pre-existing one. As part of this contract change, Admin#repairTable no longer throws IllegalArgumentException for a missing table.

@brfrn169 brfrn169 self-assigned this May 8, 2026
@brfrn169 brfrn169 requested a review from Copilot May 8, 2026 03:58

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request updates the repairTable and repairCoordinatorTables methods across all storage implementations (Cassandra, Cosmos, DynamoDB, and JDBC) to re-create missing tables, secondary indexes, and metadata instead of throwing an exception. It also introduces logic in ConsensusCommitAdmin to handle coordinator table schema upgrades when group commit is toggled. Feedback was provided for the DynamoDB implementation to ensure that tables are in an ACTIVE state before attempting to enable auto-scaling or continuous backups during a repair operation.

Comment thread core/src/main/java/com/scalar/db/storage/dynamo/DynamoAdmin.java

Copilot AI left a comment

Pull request overview

Backports the expanded Admin#repairTable semantics from master to branch 3, turning it into a reconciliation operation that can recreate missing tables/containers, repair secondary indexes (and Dynamo GSIs / Cosmos indexing policy), and refresh ScalarDB metadata. It also extends ConsensusCommitAdmin#repairCoordinatorTables to recover missing coordinator namespace/tables and reconcile the coordinator schema with the runtime group-commit configuration.

Changes:

  • Updated repairTable implementations across JDBC/Cassandra/Cosmos/Dynamo to (re)create missing tables and indexes/GSIs/IndexingPolicy and upsert metadata instead of failing on missing tables.
  • Extended ConsensusCommitAdmin#repairCoordinatorTables to create the coordinator namespace (if missing) and reconcile the coordinator schema (including child_ids column handling).
  • Updated/added unit and integration tests to validate the new reconciliation behavior and remove prior “throws on missing table” expectations.

Reviewed changes

Copilot reviewed 19 out of 19 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| core/src/main/java/com/scalar/db/api/Admin.java | Updates repairTable Javadoc to reflect reconciliation semantics. |
| core/src/main/java/com/scalar/db/storage/jdbc/JdbcAdmin.java | Makes repairTable create the table (if missing) and upsert metadata. |
| core/src/main/java/com/scalar/db/storage/cassandra/CassandraAdmin.java | Adds IF NOT EXISTS support for table/index creation and aligns repairTable behavior. |
| core/src/main/java/com/scalar/db/storage/cosmos/CosmosAdmin.java | Makes repairTable recreate missing containers, ensure stored procedure, and refresh IndexingPolicy/metadata. |
| core/src/main/java/com/scalar/db/storage/dynamo/DynamoAdmin.java | Refactors to createTableInternal(...ifNotExists...), adds existence checks, and aligns repairTable to recreate tables and missing GSIs. |
| core/src/main/java/com/scalar/db/transaction/consensuscommit/ConsensusCommitAdmin.java | Extends repairCoordinatorTables to create namespace, reconcile metadata schema choice, and add missing columns. |
| core/src/test/java/com/scalar/db/storage/jdbc/JdbcAdminTest.java | Consolidates repairTable tests into a parameterized verification-based test. |
| core/src/test/java/com/scalar/db/storage/cassandra/CassandraAdminTest.java | Updates unit tests to reflect “create if missing” + “create indexes if missing” repair behavior. |
| core/src/test/java/com/scalar/db/storage/cosmos/CosmosAdminTestBase.java | Updates repair tests to reflect idempotent creation and IndexingPolicy refresh. |
| core/src/test/java/com/scalar/db/storage/dynamo/DynamoAdminTestBase.java | Updates/extends repair tests to cover creating missing tables/metadata tables and revised creation ordering. |
| core/src/test/java/com/scalar/db/transaction/consensuscommit/ConsensusCommitAdminTestBase.java | Adds coordinator schema reconciliation test scenarios (including child_ids column handling). |
| integration-test/src/main/java/com/scalar/db/api/DistributedStorageAdminRepairTableIntegrationTestBase.java | Adds integration scenarios for “do nothing” and “create table when missing but metadata exists”. |
| integration-test/src/main/java/com/scalar/db/api/DistributedTransactionAdminRepairTableIntegrationTestBase.java | Adds integration scenarios for coordinator table recreation and idempotency. |
| core/src/integration-test/java/com/scalar/db/storage/objectstorage/ObjectStorageAdminRepairTableIntegrationTest.java | Removes disabled overrides tied to old “throws on missing table” behavior. |
| core/src/integration-test/java/com/scalar/db/storage/objectstorage/ConsensusCommitAdminRepairTableIntegrationTestWithObjectStorage.java | Removes disabled overrides tied to old coordinator repair behavior. |
| core/src/integration-test/java/com/scalar/db/storage/dynamo/DynamoSchemaLoaderIntegrationTest.java | Adds --no-scaling to make schema-loader repair flow work with DynamoDB Local. |
| core/src/integration-test/java/com/scalar/db/storage/cassandra/CassandraAdminTestUtils.java | Adds constructor to allow sharing a ClusterManager. |
| core/src/integration-test/java/com/scalar/db/storage/cassandra/CassandraAdminRepairTableIntegrationTest.java | Overrides setup to share ClusterManager to reduce flakiness from async schema propagation. |

@brfrn169 brfrn169 force-pushed the align-repairtable-master-and-3 branch from f340f1c to 3f075f4 Compare May 8, 2026 04:30
Verifies that calling repairCoordinatorTables with scalar.db.consensus_commit.coordinator.group_commit.enabled toggled from false to true against an existing pre-group-commit Coordinator table issues ALTER TABLE ADD COLUMN to add the child_ids column.
@brfrn169 brfrn169 force-pushed the align-repairtable-master-and-3 branch from 3f075f4 to 5b1df19 Compare May 8, 2026 04:50
@brfrn169 brfrn169 requested a review from Copilot May 8, 2026 04:54

brfrn169 commented May 8, 2026

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

The pull request refactors the repairTable and repairCoordinatorTables functionality across all supported storage backends to allow recreating missing tables, secondary indexes, and metadata. Previously, these methods often required the physical table to exist; now they attempt to create it if missing. Significant changes were made to ConsensusCommitAdmin to support upgrading the coordinator table schema during repair, particularly for group commit support. Review feedback suggests moving the table creation wait logic in DynamoDB to handle tables in a CREATING state, removing the fixed retry limit for DynamoDB index status checks to accommodate long-running GSI operations, and adding error handling when retrieving metadata in repairCoordinatorTables to better handle corrupted states.

Comment thread core/src/main/java/com/scalar/db/storage/dynamo/DynamoAdmin.java
Comment thread core/src/main/java/com/scalar/db/storage/dynamo/DynamoAdmin.java

Copilot AI left a comment

Pull request overview

Copilot reviewed 19 out of 19 changed files in this pull request and generated 3 comments.

@brfrn169 brfrn169 requested a review from Copilot May 8, 2026 05:16

brfrn169 commented May 8, 2026

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request refactors the repairTable implementation across Cassandra, Cosmos, DynamoDB, and JDBC backends to ensure that missing tables, secondary indexes, and metadata are re-created during the repair process instead of throwing an exception. It also updates ConsensusCommitAdmin to handle coordinator table schema upgrades, specifically for adding the child_ids column when group commit is enabled. Unit and integration tests have been updated to reflect these behavioral changes. Feedback was provided regarding the 30-second timeout for DynamoDB Global Secondary Index activation in rawIndexExists, which may be insufficient for large tables and should potentially be made configurable.
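
One way to make such a wait configurable is to poll against a caller-supplied deadline instead of a fixed retry count. The sketch below is a hedged illustration, not the code in DynamoAdmin: `WaitForActiveSketch` and `waitUntilActive` are hypothetical names, and the status is read from an abstract supplier rather than through the AWS SDK.

```java
import java.time.Duration;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

/** Hypothetical polling helper with a caller-supplied timeout; not the DynamoAdmin code. */
final class WaitForActiveSketch {
  /**
   * Polls the status until it is "ACTIVE" or the timeout elapses.
   * Returns true if the resource became active in time, false otherwise.
   */
  static boolean waitUntilActive(
      Supplier<String> status, Duration timeout, Duration pollInterval) {
    long deadline = System.nanoTime() + timeout.toNanos();
    while (true) {
      if ("ACTIVE".equals(status.get())) {
        return true;
      }
      if (System.nanoTime() - deadline >= 0) {
        return false; // timed out; the caller decides whether to fail or retry
      }
      try {
        Thread.sleep(pollInterval.toMillis());
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt(); // preserve the interrupt for the caller
        return false;
      }
    }
  }

  public static void main(String[] args) {
    // Fake status source: reports CREATING twice, then ACTIVE.
    AtomicInteger polls = new AtomicInteger();
    Supplier<String> fakeStatus = () -> polls.incrementAndGet() < 3 ? "CREATING" : "ACTIVE";
    boolean active = waitUntilActive(fakeStatus, Duration.ofSeconds(5), Duration.ofMillis(10));
    System.out.println(active + " after " + polls.get() + " polls");
  }
}
```

Passing the timeout as a parameter lets a caller with large tables choose a longer bound without changing the polling logic.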

Comment thread core/src/main/java/com/scalar/db/storage/dynamo/DynamoAdmin.java

Copilot AI left a comment

Pull request overview

Copilot reviewed 19 out of 19 changed files in this pull request and generated 1 comment.

@brfrn169 brfrn169 requested review from a team, KodaiD, Torch3333, feeblefakie and komamitsu and removed request for a team May 8, 2026 05:30

@Torch3333 Torch3333 left a comment

LGTM, thank you!
