-
Notifications
You must be signed in to change notification settings - Fork 3.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CASSANDRA-20245: Fix problems and race conditions with topology fetching #3842
base: cep-15-accord
Are you sure you want to change the base?
Conversation
test/distributed/org/apache/cassandra/distributed/test/accord/AccordDropTableTest.java
Outdated
Show resolved
Hide resolved
test/distributed/org/apache/cassandra/distributed/test/accord/AccordDropKeyspaceTest.java
Outdated
Show resolved
Hide resolved
src/java/org/apache/cassandra/service/accord/FetchTopology.java
Outdated
Show resolved
Hide resolved
src/java/org/apache/cassandra/service/accord/FetchTopology.java
Outdated
Show resolved
Hide resolved
@@ -123,18 +116,20 @@ public Response(long epoch, Topology topology) | |||
long epoch = message.payload.epoch; | |||
Topology topology = AccordService.instance().topology().maybeGlobalForEpoch(epoch); | |||
if (topology == null) | |||
MessagingService.instance().respond(Response.UNKNOWN, message); | |||
MessagingService.instance().respond(Response.unkonwn(epoch), message); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
im confused by this line
private static Response unkonwn(long epoch)
{
throw new IllegalStateException("Unknown topology: " + epoch);
}
Response.unkonwn
only throws, so this response
won't do anything as that method won't be called
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this for test code that can catch the actual exception?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i couldn't find any tests for this logic. Since this is a verb handle we should get back an UNKNOWN
exception; that could be a NPE or an unknown epoch... its truly unknown.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
switched to an explicit throw so that we would return UNKNOWN via messaging.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't feel this is a good idea.... UNKNOWN
can also mean NPE... its truly unknown whats going on and isn't a good answer to "do you know about this epoch". For example, if the epoch isn't known there isn't a reason to retry on that node... but with UNKNOWN
error we then will retry
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
created #3842 (comment) so this doesn't get lost...
src/java/org/apache/cassandra/service/accord/FetchTopology.java
Outdated
Show resolved
Hide resolved
src/java/org/apache/cassandra/service/accord/AccordConfigurationService.java
Outdated
Show resolved
Hide resolved
src/java/org/apache/cassandra/service/accord/AccordService.java
Outdated
Show resolved
Hide resolved
src/java/org/apache/cassandra/service/accord/AccordService.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Noticed a few things that might matter.
src/java/org/apache/cassandra/service/accord/AccordConfigurationService.java
Outdated
Show resolved
Hide resolved
src/java/org/apache/cassandra/service/accord/AccordConfigurationService.java
Outdated
Show resolved
Hide resolved
@@ -407,7 +407,11 @@ void maybeReportMetadata(ClusterMetadata metadata) | |||
long epoch = metadata.epoch.getEpoch(); | |||
synchronized (epochs) | |||
{ | |||
if (epochs.maxEpoch() == 0) | |||
long maxEpoch = epochs.maxEpoch(); | |||
if (maxEpoch >= epoch) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does Accord knowing about the epoch guarantee that TCM has already loaded it? We don't want to skip the TCM loading step by not indirectly calling fetchTopologyInternal
or something to ensure TCM loaded it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point.
I think this was also an edge-case. If I remember it correctly: if you're a second node to be brought up, querying other nodes for min epochs will yield epoch 4, which you will instantiate with a fetched topology.
But then if you race with a TCM callback, you will create epoch - 1 epoch, which no-one has ever heard about.
I should have fixed it differently, maybe this way:
// Create a -1 epoch iif we know this epoch may actually exist
if (metadata.epoch.getEpoch() > minEpoch())
getOrCreateEpochState(epoch - 1).acknowledged().addCallback(() -> reportMetadata(metadata));
src/java/org/apache/cassandra/service/accord/AccordService.java
Outdated
Show resolved
Hide resolved
for (long epoch = minEpoch; epoch <= metadata.epoch.getEpoch(); epoch++) | ||
node.configService().fetchTopologyForEpoch(epoch); | ||
// Fetch topologies up to current | ||
List<Topology> topologies = fetchTopologies(0, metadata); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe make 0 a constant indicating that it is actually supposed to find the minEpoch
to fetch?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Switched to capital L Long
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A named constant that is null
is still clearer just because you can describe the semantic of it.
test/distributed/org/apache/cassandra/distributed/test/accord/AccordDropKeyspaceTest.java
Outdated
Show resolved
Hide resolved
test/unit/org/apache/cassandra/service/accord/FetchMinEpochTest.java
Outdated
Show resolved
Hide resolved
Iterators.cycle(to), | ||
RetryPredicate.times(DatabaseDescriptor.getAccord().minEpochSyncRetry.maxAttempts.value), | ||
RetryPredicate.ALWAYS_RETRY, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this mean that fetching the min epoch requires all nodes to be up to complete? Just looking at how this is accumulated by the caller of fetch
which combines all the futures and can't complete until every future completes which means any down node would stop this from working?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
iirc this is faithful to the original implementation, which I believe might have been not entirely correct. We're collecting responses from all nodes, but we consider only successes here, which might mean we will not discover an early enough epoch.
We planned to address this when implementing epoch GC, where we'll indicate which epoch is retired, and a better success criteria for this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that's not faithful to what was here, because it had a maximum number of attempts before so eventually it would complete? Now it retries forever so it will never stop?
src/java/org/apache/cassandra/service/accord/AccordService.java
Outdated
Show resolved
Hide resolved
f216933
to
70d49a7
Compare
- preclude TCM of being behind Accord if newer epoch is reported via withEpoch/fetchTopologyInternal - improve topology discovery during first boot and replay - fix races between config service TCM listener reporting topologies, and fetched topologies during Patch by Alex Petrov, reviewed by Ariel Weisberg for CASSANDRA-20245
70d49a7
to
6be92d2
Compare
@@ -213,6 +213,7 @@ public EpochDiskState truncateTopologyUntil(long epoch, EpochDiskState diskState | |||
} | |||
} | |||
|
|||
// TODO: should not be public |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// TODO: should not be public | |
//TODO (???): should not be public |
2 things
//TODO
vs // TODO
, not sure why intellij cares...
(...)
should have a scope?
getOrCreateEpochState(epoch - 1).acknowledged().addCallback(() -> reportMetadata(metadata)); | ||
|
||
// Create a -1 epoch iif we know this epoch may actually exist | ||
if (metadata.epoch.getEpoch() > minEpoch()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why this change? this worries me. Making sure this protocol was safe took a lot of effort, so adding this extra state into it makes me worry that we will regress...
If TCM is reporting, we won't hit this... so only non-TCM reporting should hit this... we should guard there IMO
|
||
// In most cases, after fetching log from CMS, we will be caught up to the required epoch. | ||
// This TCM will also notify Accord via reportMetadata, so we do not need to fetch topologies. | ||
// If metadata has reported has skipped one or more eopchs, and is _ahead_ of the requested epoch, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// If metadata has reported has skipped one or more eopchs, and is _ahead_ of the requested epoch, | |
// If metadata has reported has skipped one or more epochs, and is _ahead_ of the requested epoch, |
throw new RuntimeException(e); | ||
try | ||
{ | ||
epochReady(metadata.epoch).get(5, SECONDS); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
epochReady(metadata.epoch).get(5, SECONDS); | |
epochReady(metadata.epoch).get(waitSeconds, SECONDS); |
MessagingService.instance().respond(new Response(epoch, topology), message); | ||
else | ||
throw new IllegalStateException("Unknown topology: " + epoch); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
said this in another thread, but given the refactor its not showing up in the file, so reposing.
I do not think this is a good idea. UNKNOWN
failure is truly unknown... it could be a NPE, it could be we ran before we were ready... it could be anything... its unknown... I do not think its a good idea to conflate that with "I don't know this epoch", because the handling is different...
if UNKNOWN
exception is returned, retrying on the same node makes sense, it could be an ephemeral issue.
If the node doesn't know about the epoch, calling them again isn't the best idea... we should try another node, and if no node knows the epoch we are in a bad state...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Switched to specialized UNKNOWN_TOPLOGY failure response.
request, | ||
MessagingUtils.tryAliveFirst(SharedContext.Global.instance, peers, Verb.ACCORD_FETCH_TOPOLOGY_REQ.name()), | ||
(attempt, from, failure) -> { | ||
System.out.println("Got " + failure + " from " + from + " while fetching " + request); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please use a logger
4adde66
to
214a1a7
Compare
214a1a7
to
5bdd1fb
Compare
@@ -204,6 +205,14 @@ public MinEpochRetrySpec() | |||
} | |||
} | |||
|
|||
public static class FetchRetrySpec extends RetrySpec |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
RetrySpec
has the following constructor
public RetrySpec(MaxAttempt maxAttempts, LongMillisecondsBound baseSleepTime, LongMillisecondsBound maxSleepTime)
Nothing stops us from adding
public RetrySpec(MaxAttempt maxAttempts)
or a create
method... I think i did things this way mostly because of repair but nothing is wrong with RetrySpec fetchRetry = RetrySpec.create(100);
IMO
I am 100% fine with this class (assuming the ref test is fixed), I am also fine with creating new methods in RetrySpec
to get the same behavior
{ | ||
} | ||
|
||
// TODO (required): fetch only _missing_ topologies. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
one additional optimization... now that we waiting for TCM, TCM might have already informed us... we could check if accord has this epoch and avoid this. Maybe add a TODO for that?
My only open comment is https://github.com/apache/cassandra/pull/3842/files#r1936388652 If this was an assert i would be +1, but its a if condition that adds a new control flow, which worries me.... once this issue is resolved I am +1 to this patch |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like even with this change there are issues with Accord getting stuck adopting new epochs. That is what is causing MigrationFromAccordWriteRaceTest
to fail.
My only significant feedback is that the retries changed to infinite here https://github.com/apache/cassandra/pull/3842/files#diff-244880e423be6becbb102197e318ef929678df99e2a4a0a9efc4e056730b0d1fR118 but you said you would be changing it again in a later patch. Not sure whether having it be infininite now is better or worse.
Pushed a new version of the change @dcapwell is referring to, and added a bit more motivation for the change. |
fee0e64
to
a3a37f3
Compare
Patch by Alex Petrov, reviewed by Ariel Weisberg for CASSANDRA-20245