Commit b6fcd32

Document retries.
1 parent 0281d65 commit b6fcd32

2 files changed: +199 -3 lines changed

manual/retries/README.md

+197 -2
@@ -1,6 +1,201 @@
11
## Retries
22

3-
*Coming soon... In the meantime, see the javadoc for [RetryPolicy].*
3+
When a query fails, it sometimes makes sense to retry it: the error might be temporary, or the query might work on a
4+
different host, or with different options.
45

6+
The driver uses a configurable set of rules to determine when and how to retry.
57

6-
[RetryPolicy]: http://docs.datastax.com/en/drivers/java/2.1/com/datastax/driver/core/policies/RetryPolicy.html
8+
### Concepts
9+
10+
When the driver executes a statement, it first obtains a [query plan] \(a list of hosts) from the load balancing policy.
11+
Then it picks the first host and sends it the request; this host acts as the **coordinator** for the query: it will
12+
communicate with the rest of the cluster and reply to the client.
13+
14+
If the coordinator can't be reached or replies with an error, there are various things that the driver can do; they are
15+
expressed as [RetryDecision] objects:
16+
17+
* [retry()]: retry the query on the same host. It's possible to retry with a different consistency level than the one
18+
that was originally requested;
19+
* [tryNextHost()]: retry on the next host in the query plan. Again, it can be with a different CL;
20+
* [rethrow()]: rethrow the exception to the user code. This means it will be thrown from the `session.execute` call (or
21+
returned as a failed future if `executeAsync` was used);
22+
* [ignore()]: mark the request as successful, and return an empty result set.
23+
24+
If the driver retries on every host and reaches the end of the query plan, a [NoHostAvailableException] is thrown to the
25+
user code. You can use its [getErrors()] method to find out what went wrong on each host.
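
The flow described above can be sketched as a small loop over the query plan. This is a simplified model and not the driver's actual code: `RetryLoopSketch`, its `Decision` and `Outcome` enums, and the `sendTo` and `decide` callbacks are hypothetical names introduced only for illustration.

```java
import java.util.List;
import java.util.function.IntFunction;
import java.util.function.Predicate;

/** Simplified, hypothetical model of the retry loop; not the driver's classes. */
class RetryLoopSketch {

    /** Mirrors the four kinds of retry decisions listed above. */
    enum Decision { RETRY_SAME_HOST, TRY_NEXT_HOST, RETHROW, IGNORE }

    /** Hypothetical summary of what the caller ends up observing. */
    enum Outcome { SUCCESS, EMPTY_RESULT, RETHROWN, NO_HOST_AVAILABLE }

    /**
     * Walks the query plan in order.
     * sendTo returns true when the host answered successfully.
     * decide maps the number of retries so far to a decision.
     * Every attempted host is appended to attempted, in order.
     */
    static Outcome execute(List<String> queryPlan,
                           Predicate<String> sendTo,
                           IntFunction<Decision> decide,
                           List<String> attempted) {
        int retries = 0;
        int index = 0;
        while (index < queryPlan.size()) {
            String host = queryPlan.get(index);
            attempted.add(host);
            if (sendTo.test(host)) {
                return Outcome.SUCCESS;
            }
            Decision decision = decide.apply(retries);
            retries++;
            switch (decision) {
                case RETRY_SAME_HOST:
                    break;              // keep the same index, loop again
                case TRY_NEXT_HOST:
                    index++;            // move on in the query plan
                    break;
                case RETHROW:
                    return Outcome.RETHROWN;
                case IGNORE:
                    return Outcome.EMPTY_RESULT;
            }
        }
        // Every host in the plan was tried without success.
        return Outcome.NO_HOST_AVAILABLE;
    }
}
```

With a plan of three hosts where only the last one answers, and a rule that always moves to the next host, this loop visits all three hosts and succeeds. If no host answers, it falls off the end of the plan, which corresponds to the `NoHostAvailableException` case.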
26+
27+
28+
### Retry policy
29+
30+
[RetryPolicy] is a pluggable component that determines the retry decisions for various types of errors. It is configured
31+
when initializing the cluster:
32+
33+
```java
Cluster cluster = Cluster.builder()
        .addContactPoint("127.0.0.1")
        .withRetryPolicy(new MyCustomPolicy())
        .build();
```
39+
40+
Once the cluster has been built, you can't change the policy, but you may inspect it at runtime:
41+
42+
```java
RetryPolicy policy = cluster.getConfiguration().getPolicies().getRetryPolicy();
```
45+
46+
If you don't explicitly configure it, you get a [DefaultRetryPolicy].
47+
48+
The policy's methods cover different types of errors:
49+
50+
#### [onUnavailable]
51+
52+
A request reached the coordinator, but there weren't enough live replicas to achieve the requested consistency level.
53+
The coordinator replied with an `UNAVAILABLE` error.
54+
55+
If the policy rethrows the error, the user code will get an [UnavailableException]. You can inspect the exception's
56+
fields to get the number of replicas that were *known* to be alive when the error was triggered, as well as the number
57+
of replicas that were *required* by the requested consistency level.
58+
59+
#### [onReadTimeout]
60+
61+
A read request reached the coordinator, which initially believed that there were enough live replicas to process it.
62+
But, for some reason, one or several replicas were too slow to answer within the predefined timeout
63+
(`read_request_timeout_in_ms` in `cassandra.yaml`), and the coordinator replied to the client with a `READ_TIMEOUT`
64+
error.
65+
66+
This could be because these replicas were temporarily overloaded, or even because
67+
they just failed or were turned off. During reads, Cassandra doesn't request data from every replica, in order to minimize
68+
internal network traffic; instead, some replicas are only asked for a checksum of the data. A read timeout may occur
69+
even if enough replicas responded to fulfill the consistency level, but only checksum responses were received (the
70+
method's `dataRetrieved` parameter allows you to check if you're in that situation).
71+
72+
If the policy rethrows the error, the user code will get a [ReadTimeoutException].
73+
74+
Note: do not confuse this error with a [driver read timeout], which happens when the coordinator didn't reply at all to
75+
the client.
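
One way a policy might use this information is sketched below: retry once on the same host only when enough replicas answered but none of them sent the data. This is a minimal illustration under stated assumptions, not the driver's implementation. The class and enum are hypothetical stand-ins, and the parameter names other than `dataRetrieved` are assumptions modeled on the method's argument list.

```java
/** Hypothetical, reduced version of a read-timeout rule; not the driver's code. */
class ReadTimeoutSketch {

    enum Decision { RETRY_SAME_HOST, RETHROW }

    /**
     * requiredResponses: replicas needed for the consistency level (assumed name).
     * receivedResponses: replicas that answered before the timeout (assumed name).
     * dataRetrieved: whether any answer carried the data rather than a checksum.
     * nbRetry: how many retries were already attempted for this request (assumed name).
     */
    static Decision onReadTimeout(int requiredResponses,
                                  int receivedResponses,
                                  boolean dataRetrieved,
                                  int nbRetry) {
        if (nbRetry > 0) {
            // Retry at most once, to avoid looping on a struggling cluster.
            return Decision.RETHROW;
        }
        // Enough replicas replied, but only with checksums: the data replica
        // was the slow one, so a second attempt has a fair chance to succeed.
        boolean onlyChecksums = receivedResponses >= requiredResponses && !dataRetrieved;
        return onlyChecksums ? Decision.RETRY_SAME_HOST : Decision.RETHROW;
    }
}
```

The single-retry cap is a design choice of this sketch: it keeps a persistent timeout from turning into repeated load on replicas that are already slow.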
76+
77+
#### [onWriteTimeout]
78+
79+
This is similar to `onReadTimeout`, but for write operations. The reason reads and writes are handled separately is
80+
that a read is a non-mutating operation, whereas a write is likely to mutate data. If a write times out at the
81+
coordinator level, there is no way to know whether the mutation was applied or not on the non-answering replica.
82+
Policy implementations are usually more conservative in `onWriteTimeout` in case the write was not idempotent.
83+
84+
If the policy rethrows the error, the user code will get a [WriteTimeoutException].
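
A conservative custom rule could look like the following sketch. It retries batch log writes, and otherwise retries only statements flagged as idempotent. Everything here is a hypothetical stand-in: the `WriteType` enum only borrows constant names, and the `idempotent` boolean replaces whatever flag a real implementation would read from the statement.

```java
/** Hypothetical, reduced version of a write-timeout rule; not the driver's code. */
class WriteTimeoutSketch {

    enum Decision { RETRY_SAME_HOST, RETHROW }

    /** Stand-in enum for the kind of write that timed out. */
    enum WriteType { SIMPLE, BATCH, UNLOGGED_BATCH, COUNTER, BATCH_LOG }

    /**
     * writeType: the kind of write that timed out.
     * idempotent: assumed flag telling whether applying the write twice is harmless.
     * nbRetry: how many retries were already attempted for this request (assumed name).
     */
    static Decision onWriteTimeout(WriteType writeType, boolean idempotent, int nbRetry) {
        if (nbRetry > 0) {
            // Never retry more than once.
            return Decision.RETHROW;
        }
        if (writeType == WriteType.BATCH_LOG) {
            // The timeout happened while writing the batch log, before the
            // batch itself was applied.
            return Decision.RETRY_SAME_HOST;
        }
        // For any other write we cannot tell whether the mutation was applied,
        // so only retry when a second application would be harmless.
        return idempotent ? Decision.RETRY_SAME_HOST : Decision.RETHROW;
    }
}
```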
85+
86+
#### [onRequestError]
87+
88+
This gets called for any other error occurring after the request was sent.
89+
90+
Note: this method is specified by [ExtendedRetryPolicy], a child interface of `RetryPolicy`. This is for backward
91+
compatibility: `onRequestError` did not exist in previous 2.1 versions, so it wasn't added directly on `RetryPolicy` to
92+
avoid breaking legacy implementations. All built-in policies implement `ExtendedRetryPolicy`, and we recommend that you
93+
do too. If you pass a policy that only implements `RetryPolicy`, the driver will use a default `onRequestError`
94+
implementation that always retries on the next host with the same consistency level (this is consistent with what
95+
previous 2.1 versions did). In 3.0+, the two interfaces have been merged back together.
96+
97+
The method receives the exception as a parameter, so that implementations can refine their decision based on what
98+
happened. The possible exceptions are:
99+
100+
* [ServerError]: thrown by the coordinator when an unexpected error occurs. This is generally a Cassandra bug;
101+
* [OperationTimedOutException]: thrown by the client when it didn't hear back from the coordinator within the
102+
[driver read timeout];
103+
* [ConnectionException]: thrown by the client for any network issue while or after the request was written;
104+
* [OverloadedException]: thrown by the coordinator when replicas are down and the number of hinted handoffs gets too
105+
high; the coordinator temporarily refuses writes for these replicas (see [hinted handoffs] in the Cassandra
106+
documentation).
107+
108+
In all these cases, there is no absolute certainty that the request was not applied by a host, so implementations must
109+
consider the risk of retrying non-idempotent statements.
110+
111+
### Hard-coded rules
112+
113+
There are a few cases where retrying is always the right thing to do. These are not covered by `RetryPolicy`, but
114+
hard-coded in the driver:
115+
116+
* **any error before a network write was attempted**: to send a query, the driver selects a host, borrows a connection
117+
from the host's [connection pool], and then writes the message to the connection.
118+
Errors can occur before the write was even attempted, for example if the connection pool is saturated, or if the
119+
host went down right after we borrowed. In those cases, it is always safe to retry since the request wasn't sent, so
120+
the driver will transparently move to the next host in the query plan.
121+
122+
* **re-preparing a statement**: when the driver executes a prepared statement, it may find out that the coordinator
123+
doesn't know about it, and need to re-prepare it on the fly (this is described in detail [here][prepared]). The query
124+
is then retried on the same host.
125+
126+
* **trying to communicate with a host that is bootstrapping**: this is a rare edge case, as in practice the driver
127+
should never try to communicate with a bootstrapping host (the only way is if it was specified as a contact point).
128+
Anyway, it is again safe to assume that the query was not executed at all, so the driver moves to the next host.
129+
130+
Similarly, some errors have no chance of being solved by a retry. They will always be rethrown directly to the user.
131+
These include:
132+
133+
* [QueryValidationException] and any of its subclasses ([InvalidQueryException], [InvalidConfigurationInQueryException],
134+
[UnauthorizedException], [SyntaxError], [AlreadyExistsException]);
135+
* [TruncateException].
136+
137+
138+
### Retries and idempotence
139+
140+
Retry policies should avoid retrying queries that are not [idempotent][idempotent queries] when it's not clear whether the query was applied or not.
141+
142+
It's always safe to retry in `onReadTimeout`, since by definition we know that the query is a read, which doesn't mutate
143+
any data. Similarly, `onUnavailable` is safe: the coordinator is telling us that it didn't find enough replicas, so we
144+
know that it didn't try to apply the query.
145+
146+
`onWriteTimeout` is not safe. The default retry policy is very conservative (it only retries batch log writes) so it
147+
will never cause any issue. Custom implementations should check the statements' idempotent flag.
148+
149+
`onRequestError` isn't safe either.
150+
151+
**For historical reasons, the built-in retry policy implementations do not check the idempotent flag in
152+
`onRequestError`**. This is based on the fact that:
153+
154+
* previous versions of the driver (which handled these errors internally, before `onRequestError` existed) didn't check the flag either;
155+
* the majority of queries in an application should be idempotent;
156+
* statements start out as non-idempotent by default.
157+
158+
If `onRequestError` enforced idempotence strictly, this would cause a lot of queries that were legitimately retried with
159+
previous driver versions to not be retried anymore. This would be a big behavioral change, so we decided to prefer
160+
consistency with previous versions.
161+
162+
If you prefer strict handling of the idempotent flag, you can wrap your retry policy into an
163+
[IdempotenceAwareRetryPolicy]. Make sure to set the flag correctly on all statements.
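
The wrapping idea can be illustrated with a small decorator. This sketch does not reproduce the real `IdempotenceAwareRetryPolicy` class. The `Policy` interface is a hypothetical reduction to the two callbacks that are not inherently safe, and the `idempotent` boolean stands in for the statement's flag.

```java
/** Hypothetical decorator showing strict handling of the idempotent flag. */
class IdempotenceSketch {

    enum Decision { RETRY_SAME_HOST, TRY_NEXT_HOST, RETHROW, IGNORE }

    /** Reduced, hypothetical policy interface: only the two unsafe callbacks. */
    interface Policy {
        Decision onWriteTimeout(boolean idempotent, int nbRetry);
        Decision onRequestError(boolean idempotent, Exception error, int nbRetry);
    }

    /** A lenient policy that always moves to the next host, ignoring the flag. */
    static final Policy ALWAYS_NEXT_HOST = new Policy() {
        public Decision onWriteTimeout(boolean idempotent, int nbRetry) {
            return Decision.TRY_NEXT_HOST;
        }
        public Decision onRequestError(boolean idempotent, Exception error, int nbRetry) {
            return Decision.TRY_NEXT_HOST;
        }
    };

    /**
     * Wraps a policy: statements not flagged idempotent are always rethrown,
     * everything else is delegated to the wrapped policy unchanged.
     */
    static Policy idempotenceAware(final Policy delegate) {
        return new Policy() {
            public Decision onWriteTimeout(boolean idempotent, int nbRetry) {
                return idempotent
                        ? delegate.onWriteTimeout(true, nbRetry)
                        : Decision.RETHROW;
            }
            public Decision onRequestError(boolean idempotent, Exception error, int nbRetry) {
                return idempotent
                        ? delegate.onRequestError(true, error, nbRetry)
                        : Decision.RETHROW;
            }
        };
    }
}
```

With this shape, `idempotenceAware(ALWAYS_NEXT_HOST)` rethrows any error for a statement that is not flagged idempotent, while the unwrapped policy would have moved to the next host regardless of the flag.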
164+
165+
166+
[RetryDecision]: http://docs.datastax.com/en/drivers/java/2.1/com/datastax/driver/core/policies/RetryPolicy.RetryDecision.html
167+
[retry()]: http://docs.datastax.com/en/drivers/java/2.1/com/datastax/driver/core/policies/RetryPolicy.RetryDecision.html#retry-com.datastax.driver.core.ConsistencyLevel-
168+
[tryNextHost()]: http://docs.datastax.com/en/drivers/java/2.1/com/datastax/driver/core/policies/RetryPolicy.RetryDecision.html#tryNextHost-com.datastax.driver.core.ConsistencyLevel-
169+
[rethrow()]: http://docs.datastax.com/en/drivers/java/2.1/com/datastax/driver/core/policies/RetryPolicy.RetryDecision.html#rethrow--
170+
[ignore()]: http://docs.datastax.com/en/drivers/java/2.1/com/datastax/driver/core/policies/RetryPolicy.RetryDecision.html#ignore--
171+
[NoHostAvailableException]: http://docs.datastax.com/en/drivers/java/2.1/com/datastax/driver/core/exceptions/NoHostAvailableException.html
172+
[getErrors()]: http://docs.datastax.com/en/drivers/java/2.1/com/datastax/driver/core/exceptions/NoHostAvailableException.html#getErrors--
173+
[RetryPolicy]: http://docs.datastax.com/en/drivers/java/2.1/com/datastax/driver/core/policies/RetryPolicy.html
174+
[DefaultRetryPolicy]: http://docs.datastax.com/en/drivers/java/2.1/com/datastax/driver/core/policies/DefaultRetryPolicy.html
175+
[onReadTimeout]: http://docs.datastax.com/en/drivers/java/2.1/com/datastax/driver/core/policies/DefaultRetryPolicy.html#onReadTimeout-com.datastax.driver.core.Statement-com.datastax.driver.core.ConsistencyLevel-int-int-boolean-int-
176+
[onWriteTimeout]: http://docs.datastax.com/en/drivers/java/2.1/com/datastax/driver/core/policies/DefaultRetryPolicy.html#onWriteTimeout-com.datastax.driver.core.Statement-com.datastax.driver.core.ConsistencyLevel-com.datastax.driver.core.WriteType-int-int-int-
177+
[onUnavailable]: http://docs.datastax.com/en/drivers/java/2.1/com/datastax/driver/core/policies/DefaultRetryPolicy.html#onUnavailable-com.datastax.driver.core.Statement-com.datastax.driver.core.ConsistencyLevel-int-int-int-
178+
[onRequestError]: http://docs.datastax.com/en/drivers/java/2.1/com/datastax/driver/core/policies/ExtendedRetryPolicy.html#onRequestError-com.datastax.driver.core.Statement-com.datastax.driver.core.ConsistencyLevel-java.lang.Exception-int-
179+
[ExtendedRetryPolicy]: http://docs.datastax.com/en/drivers/java/2.1/com/datastax/driver/core/policies/ExtendedRetryPolicy.html
180+
[UnavailableException]: http://docs.datastax.com/en/drivers/java/2.1/com/datastax/driver/core/exceptions/UnavailableException.html
181+
[ReadTimeoutException]: http://docs.datastax.com/en/drivers/java/2.1/com/datastax/driver/core/exceptions/ReadTimeoutException.html
182+
[WriteTimeoutException]: http://docs.datastax.com/en/drivers/java/2.1/com/datastax/driver/core/exceptions/WriteTimeoutException.html
183+
[OverloadedException]: http://docs.datastax.com/en/drivers/java/2.1/com/datastax/driver/core/exceptions/OverloadedException.html
184+
[ServerError]: http://docs.datastax.com/en/drivers/java/2.1/com/datastax/driver/core/exceptions/ServerError.html
185+
[OperationTimedOutException]: http://docs.datastax.com/en/drivers/java/2.1/com/datastax/driver/core/OperationTimedOutException.html
186+
[ConnectionException]: http://docs.datastax.com/en/drivers/java/2.1/com/datastax/driver/core/ConnectionException.html
187+
[QueryValidationException]: http://docs.datastax.com/en/drivers/java/2.1/com/datastax/driver/core/exceptions/QueryValidationException.html
188+
[InvalidQueryException]: http://docs.datastax.com/en/drivers/java/2.1/com/datastax/driver/core/exceptions/InvalidQueryException.html
189+
[InvalidConfigurationInQueryException]: http://docs.datastax.com/en/drivers/java/2.1/com/datastax/driver/core/exceptions/InvalidConfigurationInQueryException.html
190+
[UnauthorizedException]: http://docs.datastax.com/en/drivers/java/2.1/com/datastax/driver/core/exceptions/UnauthorizedException.html
191+
[SyntaxError]: http://docs.datastax.com/en/drivers/java/2.1/com/datastax/driver/core/exceptions/SyntaxError.html
192+
[AlreadyExistsException]: http://docs.datastax.com/en/drivers/java/2.1/com/datastax/driver/core/exceptions/AlreadyExistsException.html
193+
[TruncateException]: http://docs.datastax.com/en/drivers/java/2.1/com/datastax/driver/core/exceptions/TruncateException.html
194+
[IdempotenceAwareRetryPolicy]: http://docs.datastax.com/en/drivers/java/2.1/com/datastax/driver/core/policies/IdempotenceAwareRetryPolicy.html
195+
196+
[query plan]: ../load_balancing/#query-plan
197+
[connection pool]: ../pooling/
198+
[prepared]: ../statements/prepared/#preparing-on-multiple-nodes
199+
[driver read timeout]: ../socket_options/#driver-read-timeout
200+
[hinted handoffs]: https://docs.datastax.com/en/cassandra/2.1/cassandra/dml/dml_about_hh_c.html?scroll=concept_ds_ifg_jqx_zj__performance
201+
[idempotent queries]: ../idempotence/

manual/socket_options/README.md

+2 -1
@@ -93,7 +93,8 @@ multiple nodes in parallel. This way you won't have to wait for the full timeout
9393
#### Driver read timeout vs. server read timeout
9494

9595
Unfortunately, the term "read timeout" clashes with another concept that is not directly related: a Cassandra node may
96-
reply with a `Read_timeout` error when it didn't hear back from enough replicas during a read query.
96+
reply with a [READ_TIMEOUT](../retries/#on-read-timeout) error when it didn't hear back from enough replicas during a
97+
read query.
9798

9899
To clarify:
99100
