CASSANDRA-20245: Fix problems and race conditions with topology fetching #3842

Open: wants to merge 5 commits into cep-15-accord from CASSANDRA-20245
Conversation

ifesdjeen
Contributor

  • preclude TCM from being behind Accord if a newer epoch is reported via withEpoch/fetchTopologyInternal
  • improve topology discovery during first boot and replay
  • fix races between the config service TCM listener reporting topologies and fetched topologies during

Patch by Alex Petrov, reviewed by Ariel Weisberg for CASSANDRA-20245

ifesdjeen requested a review from aweisberg on January 28, 2025 18:11
@@ -123,18 +116,20 @@ public Response(long epoch, Topology topology)
        long epoch = message.payload.epoch;
        Topology topology = AccordService.instance().topology().maybeGlobalForEpoch(epoch);
        if (topology == null)
-           MessagingService.instance().respond(Response.UNKNOWN, message);
+           MessagingService.instance().respond(Response.unkonwn(epoch), message);

Contributor

I'm confused by this line:

        private static Response unkonwn(long epoch)
        {
            throw new IllegalStateException("Unknown topology: " + epoch);
        }

Response.unkonwn only throws, so this respond does nothing: the exception is raised while evaluating the argument, before respond is ever called.

Contributor

Is this for test code that can catch the actual exception?

Contributor

I couldn't find any tests for this logic. Since this is a verb handler, we should get back an UNKNOWN exception; that could be an NPE or an unknown epoch... it's truly unknown.

Contributor Author

Switched to an explicit throw so that we return UNKNOWN via messaging.

Contributor

I don't feel this is a good idea... UNKNOWN can also mean NPE... it's truly unknown what's going on, and it isn't a good answer to "do you know about this epoch". For example, if the epoch isn't known there isn't a reason to retry on that node... but with an UNKNOWN error we will then retry.

Contributor

created #3842 (comment) so this doesn't get lost...

Contributor

aweisberg left a comment

Noticed a few things that might matter.

@@ -407,7 +407,11 @@ void maybeReportMetadata(ClusterMetadata metadata)
        long epoch = metadata.epoch.getEpoch();
        synchronized (epochs)
        {
-           if (epochs.maxEpoch() == 0)
+           long maxEpoch = epochs.maxEpoch();
+           if (maxEpoch >= epoch)

Contributor

Does Accord knowing about the epoch guarantee that TCM has already loaded it? We don't want to skip the TCM loading step by not (indirectly) calling fetchTopologyInternal or something else that ensures TCM loaded it.

Contributor Author

Good point.

I think this was also an edge case. If I remember correctly: if you're the second node to be brought up, querying other nodes for min epochs will yield, say, epoch 4, which you will instantiate with a fetched topology.

But then, if you race with a TCM callback, you will create an epoch - 1 epoch state, which no one has ever heard about.
I should have fixed it differently, maybe this way:

        // Create a -1 epoch iff we know this epoch may actually exist
        if (metadata.epoch.getEpoch() > minEpoch())
            getOrCreateEpochState(epoch - 1).acknowledged().addCallback(() -> reportMetadata(metadata));

-       for (long epoch = minEpoch; epoch <= metadata.epoch.getEpoch(); epoch++)
-           node.configService().fetchTopologyForEpoch(epoch);
+       // Fetch topologies up to current
+       List<Topology> topologies = fetchTopologies(0, metadata);

Contributor

Maybe make 0 a constant indicating that it is actually supposed to find the minEpoch to fetch?

Contributor Author

ifesdjeen, Jan 29, 2025

Switched to capital L Long.

Contributor

A named constant that is null is still clearer, just because you can describe its semantics.
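
As a rough sketch of that suggestion (the constant name and its use with fetchTopologies below are illustrative assumptions, not part of the patch):

        // Hypothetical: a named null constant documents that the fetch should discover
        // the minimum known epoch itself, rather than starting from a literal 0.
        private static final Long FETCH_FROM_MIN_EPOCH = null;

        // The call site then reads as intent instead of a magic value:
        List<Topology> topologies = fetchTopologies(FETCH_FROM_MIN_EPOCH, metadata);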

        Iterators.cycle(to),
-       RetryPredicate.times(DatabaseDescriptor.getAccord().minEpochSyncRetry.maxAttempts.value),
+       RetryPredicate.ALWAYS_RETRY,

Contributor

Does this mean that fetching the min epoch requires all nodes to be up in order to complete? Looking at how this is accumulated by the caller of fetch, which combines all the futures and can't complete until every future completes, it seems any down node would stop this from working?

Contributor Author

IIRC this is faithful to the original implementation, which I believe might not have been entirely correct. We're collecting responses from all nodes, but we consider only successes here, which might mean we will not discover an early enough epoch.

We planned to address this when implementing epoch GC, where we'll indicate which epochs are retired, along with better success criteria for this.

Contributor

I think that's not faithful to what was here: it had a maximum number of attempts before, so eventually it would complete. Now it retries forever, so it will never stop?

ifesdjeen force-pushed the CASSANDRA-20245 branch 5 times, most recently from f216933 to 70d49a7, on January 29, 2025 19:06
  - preclude TCM from being behind Accord if a newer epoch is reported via withEpoch/fetchTopologyInternal
  - improve topology discovery during first boot and replay
  - fix races between the config service TCM listener reporting topologies and fetched topologies during

Patch by Alex Petrov, reviewed by Ariel Weisberg for CASSANDRA-20245
@@ -213,6 +213,7 @@ public EpochDiskState truncateTopologyUntil(long epoch, EpochDiskState diskState
}
}

// TODO: should not be public

Contributor

Suggested change:
-       // TODO: should not be public
+       //TODO (???): should not be public

Two things:

//TODO vs // TODO; not sure why IntelliJ cares...

The (???) should have a scope?
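
For reference, a scoped form in the style already used elsewhere in this patch (see the // TODO (required) comment further down); the particular scope word here is only an example:

        //TODO (desirable): should not be public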

getOrCreateEpochState(epoch - 1).acknowledged().addCallback(() -> reportMetadata(metadata));

// Create a -1 epoch iif we know this epoch may actually exist
if (metadata.epoch.getEpoch() > minEpoch())

Contributor

Why this change? This worries me. Making sure this protocol was safe took a lot of effort, so adding this extra state into it makes me worry that we will regress...

If TCM is reporting, we won't hit this... so only non-TCM reporting should hit this... we should guard there, IMO.


// In most cases, after fetching log from CMS, we will be caught up to the required epoch.
// This TCM will also notify Accord via reportMetadata, so we do not need to fetch topologies.
// If metadata has reported has skipped one or more eopchs, and is _ahead_ of the requested epoch,

Contributor

Suggested change:
-       // If metadata has reported has skipped one or more eopchs, and is _ahead_ of the requested epoch,
+       // If metadata has reported has skipped one or more epochs, and is _ahead_ of the requested epoch,

throw new RuntimeException(e);
try
{
epochReady(metadata.epoch).get(5, SECONDS);

Contributor

Suggested change:
-       epochReady(metadata.epoch).get(5, SECONDS);
+       epochReady(metadata.epoch).get(waitSeconds, SECONDS);

MessagingService.instance().respond(new Response(epoch, topology), message);
else
throw new IllegalStateException("Unknown topology: " + epoch);

Contributor

Said this in another thread, but given the refactor it's not showing up in the file, so reposting.

I do not think this is a good idea. An UNKNOWN failure is truly unknown... it could be an NPE, it could be that we ran before we were ready... it could be anything... it's unknown... I do not think it's a good idea to conflate that with "I don't know this epoch", because the handling is different...

If an UNKNOWN exception is returned, retrying on the same node makes sense; it could be an ephemeral issue.
If the node doesn't know about the epoch, calling it again isn't the best idea... we should try another node, and if no node knows the epoch we are in a bad state...

Contributor Author

Switched to a specialized UNKNOWN_TOPOLOGY failure response.
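
A hedged sketch of what that distinction might look like in the verb handler; respondWithFailure and the UNKNOWN_TOPOLOGY reason are assumptions about the API shape, not quoted from the patch:

        // Sketch only: answer an unknown epoch with a dedicated failure reason so the caller
        // can try another peer, instead of throwing (which surfaces as a generic UNKNOWN).
        long epoch = message.payload.epoch;
        Topology topology = AccordService.instance().topology().maybeGlobalForEpoch(epoch);
        if (topology != null)
            MessagingService.instance().respond(new Response(epoch, topology), message);
        else
            MessagingService.instance().respondWithFailure(RequestFailureReason.UNKNOWN_TOPOLOGY, message);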

request,
MessagingUtils.tryAliveFirst(SharedContext.Global.instance, peers, Verb.ACCORD_FETCH_TOPOLOGY_REQ.name()),
(attempt, from, failure) -> {
System.out.println("Got " + failure + " from " + from + " while fetching " + request);

Contributor

Please use a logger.
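
For example, with the SLF4J pattern used throughout the codebase (the enclosing class name here is just a placeholder):

        import org.slf4j.Logger;
        import org.slf4j.LoggerFactory;

        private static final Logger logger = LoggerFactory.getLogger(FetchTopology.class);

        // Inside the retry callback, instead of System.out.println:
        (attempt, from, failure) -> logger.warn("Got {} from {} while fetching {}", failure, from, request);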

@@ -204,6 +205,14 @@ public MinEpochRetrySpec()
}
}

public static class FetchRetrySpec extends RetrySpec

Contributor

RetrySpec has the following constructor

public RetrySpec(MaxAttempt maxAttempts, LongMillisecondsBound baseSleepTime, LongMillisecondsBound maxSleepTime)

Nothing stops us from adding

public RetrySpec(MaxAttempt maxAttempts)

or a create method... I think I did things this way mostly because of repair, but nothing is wrong with RetrySpec fetchRetry = RetrySpec.create(100); IMO.

I am 100% fine with this class (assuming the ref test is fixed); I am also fine with creating new methods in RetrySpec to get the same behavior.
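
A minimal sketch of that convenience, building on the three-argument constructor quoted above; the MaxAttempt constructor and the default sleep bounds used here are assumptions:

        // Assumed single-argument factory delegating to the existing constructor:
        public static RetrySpec create(int maxAttempts)
        {
            return new RetrySpec(new MaxAttempt(maxAttempts), DEFAULT_BASE_SLEEP_TIME, DEFAULT_MAX_SLEEP_TIME);
        }

        // ...which would let call sites read:
        RetrySpec fetchRetry = RetrySpec.create(100);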

{
}

// TODO (required): fetch only _missing_ topologies.

Contributor

One additional optimization... now that we're waiting for TCM, TCM might have already informed us... we could check whether Accord has this epoch and avoid this. Maybe add a TODO for that?
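
A hedged sketch of that check, reusing maybeGlobalForEpoch from the verb handler above; whether that accessor is appropriate at this call site is an assumption:

        // Sketch: if TCM's report has already made this epoch visible to Accord, skip the remote fetch.
        if (AccordService.instance().topology().maybeGlobalForEpoch(epoch) != null)
            continue; // or return, depending on the surrounding loop structure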

@dcapwell
Contributor

My only open comment is https://github.com/apache/cassandra/pull/3842/files#r1936388652

If this was an assert I would be +1, but it's an if condition that adds a new control flow, which worries me... once this issue is resolved, I am +1 to this patch.

Contributor

aweisberg left a comment

It looks like even with this change there are issues with Accord getting stuck adopting new epochs. That is what is causing MigrationFromAccordWriteRaceTest to fail.

My only significant feedback is that the retries changed to infinite here https://github.com/apache/cassandra/pull/3842/files#diff-244880e423be6becbb102197e318ef929678df99e2a4a0a9efc4e056730b0d1fR118 but you said you would be changing it again in a later patch. Not sure whether having it be infinite now is better or worse.

@ifesdjeen
Contributor Author

Pushed a new version of the change @dcapwell is referring to, and added a bit more motivation for the change.

belliottsmith force-pushed the cep-15-accord branch 2 times, most recently from fee0e64 to a3a37f3, on February 3, 2025 21:14