CASSANDRA-20245: Fix problems and race conditions with topology fetching #3842

Open: wants to merge 5 commits into cep-15-accord from CASSANDRA-20245
Conversation

ifesdjeen
Contributor

  • preclude TCM from being behind Accord if a newer epoch is reported via withEpoch/fetchTopologyInternal
  • improve topology discovery during first boot and replay
  • fix races between the config service TCM listener reporting topologies and fetched topologies during

Patch by Alex Petrov, reviewed by Ariel Weisberg for CASSANDRA-20245

ifesdjeen requested a review from aweisberg on January 28, 2025 18:11
@@ -123,18 +116,20 @@ public Response(long epoch, Topology topology)
        long epoch = message.payload.epoch;
        Topology topology = AccordService.instance().topology().maybeGlobalForEpoch(epoch);
        if (topology == null)
-           MessagingService.instance().respond(Response.UNKNOWN, message);
+           MessagingService.instance().respond(Response.unkonwn(epoch), message);

Contributor

I'm confused by this line:

        private static Response unkonwn(long epoch)
        {
            throw new IllegalStateException("Unknown topology: " + epoch);
        }

Response.unkonwn only throws, so this respond does nothing: the exception is raised while evaluating the argument, before respond is ever called.

Contributor

Is this for test code that can catch the actual exception?

Contributor

I couldn't find any tests for this logic. Since this is a verb handler, we should get back an UNKNOWN exception; that could be an NPE or an unknown epoch... it's truly unknown.

Contributor Author

Switched to an explicit throw so that we return UNKNOWN via messaging.

Contributor

I don't feel this is a good idea... UNKNOWN can also mean NPE... it's truly unknown what's going on, and it isn't a good answer to "do you know about this epoch". For example, if the epoch isn't known there isn't a reason to retry on that node... but with an UNKNOWN error we will then retry.

Contributor

created #3842 (comment) so this doesn't get lost...

Contributor

aweisberg left a comment

Noticed a few things that might matter.

@@ -407,7 +407,11 @@ void maybeReportMetadata(ClusterMetadata metadata)
        long epoch = metadata.epoch.getEpoch();
        synchronized (epochs)
        {
-           if (epochs.maxEpoch() == 0)
+           long maxEpoch = epochs.maxEpoch();
+           if (maxEpoch >= epoch)

Contributor

Does Accord knowing about the epoch guarantee that TCM has already loaded it? We don't want to skip the TCM loading step by not (indirectly) calling fetchTopologyInternal or something else that ensures TCM loaded it.

Contributor Author

Good point.

I think this was also an edge case. If I remember correctly: if you're the second node to be brought up, querying other nodes for min epochs will yield, say, epoch 4, which you will instantiate with a fetched topology.

But then, if you race with a TCM callback, you will create an epoch - 1 epoch state, which no one has ever heard about.
I should have fixed it differently, maybe this way:

        // Create a -1 epoch iff we know this epoch may actually exist
        if (metadata.epoch.getEpoch() > minEpoch())
            getOrCreateEpochState(epoch - 1).acknowledged().addCallback(() -> reportMetadata(metadata));

-       for (long epoch = minEpoch; epoch <= metadata.epoch.getEpoch(); epoch++)
-           node.configService().fetchTopologyForEpoch(epoch);
+       // Fetch topologies up to current
+       List<Topology> topologies = fetchTopologies(0, metadata);

Contributor

Maybe make 0 a constant indicating that it is actually supposed to find the minEpoch to fetch?

Contributor Author

ifesdjeen, Jan 29, 2025

Switched to capital L Long.

Contributor

A named constant that is null is still clearer, just because you can describe its semantics.
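
As a rough sketch of that suggestion (the constant name and its use with fetchTopologies below are illustrative assumptions, not part of the patch):

        // Hypothetical: a named null constant documents that the fetch should discover
        // the minimum known epoch itself, rather than starting from a literal 0.
        private static final Long FETCH_FROM_MIN_EPOCH = null;

        // The call site then reads as intent instead of a magic value:
        List<Topology> topologies = fetchTopologies(FETCH_FROM_MIN_EPOCH, metadata);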

        Iterators.cycle(to),
-       RetryPredicate.times(DatabaseDescriptor.getAccord().minEpochSyncRetry.maxAttempts.value),
+       RetryPredicate.ALWAYS_RETRY,

Contributor

Does this mean that fetching the min epoch requires all nodes to be up in order to complete? Looking at how this is accumulated by the caller of fetch, which combines all the futures and can't complete until every future completes, it seems any down node would stop this from working?

Contributor Author

IIRC this is faithful to the original implementation, which I believe might not have been entirely correct. We're collecting responses from all nodes, but we consider only successes here, which might mean we will not discover an early enough epoch.

We planned to address this when implementing epoch GC, where we'll indicate which epochs are retired, along with better success criteria for this.

Contributor

I think that's not faithful to what was here: it had a maximum number of attempts before, so eventually it would complete. Now it retries forever, so it will never stop?

ifesdjeen force-pushed the CASSANDRA-20245 branch 5 times, most recently from f216933 to 70d49a7, on January 29, 2025 19:06
  - preclude TCM from being behind Accord if a newer epoch is reported via withEpoch/fetchTopologyInternal
  - improve topology discovery during first boot and replay
  - fix races between the config service TCM listener reporting topologies and fetched topologies during

Patch by Alex Petrov, reviewed by Ariel Weisberg for CASSANDRA-20245
@@ -213,6 +213,7 @@ public EpochDiskState truncateTopologyUntil(long epoch, EpochDiskState diskState
}
}

// TODO: should not be public

Contributor

Suggested change:
-       // TODO: should not be public
+       //TODO (???): should not be public

Two things:

//TODO vs // TODO; not sure why IntelliJ cares...

The (???) should have a scope?
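
For reference, a scoped form in the style already used elsewhere in this patch (see the // TODO (required) comment further down); the particular scope word here is only an example:

        //TODO (desirable): should not be public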

getOrCreateEpochState(epoch - 1).acknowledged().addCallback(() -> reportMetadata(metadata));

// Create a -1 epoch iif we know this epoch may actually exist
if (metadata.epoch.getEpoch() > minEpoch())

Contributor

Why this change? This worries me. Making sure this protocol was safe took a lot of effort, so adding this extra state into it makes me worry that we will regress...

If TCM is reporting, we won't hit this... so only non-TCM reporting should hit this... we should guard there, IMO.


// In most cases, after fetching log from CMS, we will be caught up to the required epoch.
// This TCM will also notify Accord via reportMetadata, so we do not need to fetch topologies.
// If metadata has reported has skipped one or more eopchs, and is _ahead_ of the requested epoch,

Contributor

Suggested change:
-       // If metadata has reported has skipped one or more eopchs, and is _ahead_ of the requested epoch,
+       // If metadata has reported has skipped one or more epochs, and is _ahead_ of the requested epoch,

throw new RuntimeException(e);
try
{
epochReady(metadata.epoch).get(5, SECONDS);

Contributor

Suggested change:
-       epochReady(metadata.epoch).get(5, SECONDS);
+       epochReady(metadata.epoch).get(waitSeconds, SECONDS);

MessagingService.instance().respond(new Response(epoch, topology), message);
else
throw new IllegalStateException("Unknown topology: " + epoch);

Contributor

Said this in another thread, but given the refactor it's not showing up in the file, so reposting.

I do not think this is a good idea. An UNKNOWN failure is truly unknown... it could be an NPE, it could be that we ran before we were ready... it could be anything... it's unknown... I do not think it's a good idea to conflate that with "I don't know this epoch", because the handling is different...

If an UNKNOWN exception is returned, retrying on the same node makes sense; it could be an ephemeral issue.
If the node doesn't know about the epoch, calling it again isn't the best idea... we should try another node, and if no node knows the epoch we are in a bad state...

Contributor Author

Switched to a specialized UNKNOWN_TOPOLOGY failure response.
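
A hedged sketch of what that distinction might look like in the verb handler; respondWithFailure and the UNKNOWN_TOPOLOGY reason are assumptions about the API shape, not quoted from the patch:

        // Sketch only: answer an unknown epoch with a dedicated failure reason so the caller
        // can try another peer, instead of throwing (which surfaces as a generic UNKNOWN).
        long epoch = message.payload.epoch;
        Topology topology = AccordService.instance().topology().maybeGlobalForEpoch(epoch);
        if (topology != null)
            MessagingService.instance().respond(new Response(epoch, topology), message);
        else
            MessagingService.instance().respondWithFailure(RequestFailureReason.UNKNOWN_TOPOLOGY, message);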

request,
MessagingUtils.tryAliveFirst(SharedContext.Global.instance, peers, Verb.ACCORD_FETCH_TOPOLOGY_REQ.name()),
(attempt, from, failure) -> {
System.out.println("Got " + failure + " from " + from + " while fetching " + request);

Contributor

Please use a logger.
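
For example, with the SLF4J pattern used throughout the codebase (the enclosing class name here is just a placeholder):

        import org.slf4j.Logger;
        import org.slf4j.LoggerFactory;

        private static final Logger logger = LoggerFactory.getLogger(FetchTopology.class);

        // Inside the retry callback, instead of System.out.println:
        (attempt, from, failure) -> logger.warn("Got {} from {} while fetching {}", failure, from, request);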

@@ -204,6 +205,14 @@ public MinEpochRetrySpec()
}
}

public static class FetchRetrySpec extends RetrySpec

Contributor

RetrySpec has the following constructor

public RetrySpec(MaxAttempt maxAttempts, LongMillisecondsBound baseSleepTime, LongMillisecondsBound maxSleepTime)

Nothing stops us from adding

public RetrySpec(MaxAttempt maxAttempts)

or a create method... I think I did things this way mostly because of repair, but nothing is wrong with RetrySpec fetchRetry = RetrySpec.create(100); IMO.

I am 100% fine with this class (assuming the ref test is fixed); I am also fine with creating new methods in RetrySpec to get the same behavior.
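
A minimal sketch of that convenience, building on the three-argument constructor quoted above; the MaxAttempt constructor and the default sleep bounds used here are assumptions:

        // Assumed single-argument factory delegating to the existing constructor:
        public static RetrySpec create(int maxAttempts)
        {
            return new RetrySpec(new MaxAttempt(maxAttempts), DEFAULT_BASE_SLEEP_TIME, DEFAULT_MAX_SLEEP_TIME);
        }

        // ...which would let call sites read:
        RetrySpec fetchRetry = RetrySpec.create(100);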

{
}

// TODO (required): fetch only _missing_ topologies.

Contributor

One additional optimization... now that we're waiting for TCM, TCM might have already informed us... we could check whether Accord has this epoch and avoid this. Maybe add a TODO for that?
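
A hedged sketch of that check, reusing maybeGlobalForEpoch from the verb handler above; whether that accessor is appropriate at this call site is an assumption:

        // Sketch: if TCM's report has already made this epoch visible to Accord, skip the remote fetch.
        if (AccordService.instance().topology().maybeGlobalForEpoch(epoch) != null)
            continue; // or return, depending on the surrounding loop structure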

@dcapwell
Contributor

My only open comment is https://github.com/apache/cassandra/pull/3842/files#r1936388652

If this was an assert I would be +1, but it's an if condition that adds a new control flow, which worries me... once this issue is resolved, I am +1 to this patch.

Contributor

aweisberg left a comment

It looks like even with this change there are issues with Accord getting stuck adopting new epochs. That is what is causing MigrationFromAccordWriteRaceTest to fail.

My only significant feedback is that the retries changed to infinite here https://github.com/apache/cassandra/pull/3842/files#diff-244880e423be6becbb102197e318ef929678df99e2a4a0a9efc4e056730b0d1fR118 but you said you would be changing it again in a later patch. Not sure whether having it be infinite now is better or worse.

@ifesdjeen
Contributor Author

Pushed a new version of the change @dcapwell is referring to, and added a bit more motivation for the change.

belliottsmith force-pushed the cep-15-accord branch 2 times, most recently from fee0e64 to a3a37f3, on February 3, 2025 21:14