|
1 | 1 | ## Retries
|
2 | 2 |
|
3 |
| -*Coming soon... In the meantime, see the javadoc for [RetryPolicy].* |
| 3 | +When a query fails, it sometimes makes sense to retry it: the error might be temporary, or the query might work on a |
| 4 | +different host, or with different options. |
4 | 5 |
|
| 6 | +The driver uses a configurable set of rules to determine when and how to retry. |
5 | 7 |
|
6 |
| -[RetryPolicy]: http://docs.datastax.com/en/drivers/java/2.1/com/datastax/driver/core/policies/RetryPolicy.html |
| 8 | +### Concepts |
| 9 | + |
| 10 | +When the driver executes a statement, it first obtains a [query plan] \(a list of hosts) from the load balancing policy. |
| 11 | +Then it picks the first host and sends it the request; this host acts as the **coordinator** for the query: it |
| 12 | +communicates with the rest of the cluster and replies to the client. |
| 13 | + |
| 14 | +If the coordinator can't be reached or replies with an error, there are various things that the driver can do; they are |
| 15 | +expressed as [RetryDecision] objects: |
| 16 | + |
| 17 | +* [retry()]: retry the query on the same host. It's possible to retry with a different consistency level than the one |
| 18 | + that was originally requested; |
| 19 | +* [tryNextHost()]: retry on the next host in the query plan. Again, it can be with a different CL; |
| 20 | +* [rethrow()]: rethrow the exception to the user code. This means it will be thrown from the `session.execute` call (or |
| 21 | + returned as a failed future if `executeAsync` was used); |
| 22 | +* [ignore()]: mark the request as successful, and return an empty result set. |
| 23 | + |
| 24 | +If the driver retries on every host and reaches the end of the query plan, a [NoHostAvailableException] is thrown to the |
| 25 | +user code. You can use its [getErrors()] method to find out what went wrong on each host. |
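
The flow described above can be sketched with a small stand-alone model. This is only an illustration: `SketchDecision`, `QueryPlanWalk` and the string outcomes are hypothetical names that do not exist in the driver, and hosts are plain strings rather than driver `Host` objects.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiFunction;

// Simplified stand-ins for illustration; these are not the driver's classes.
enum SketchDecision { RETRY, TRY_NEXT_HOST, RETHROW, IGNORE }

final class QueryPlanWalk {

    /**
     * Walks a query plan and applies the policy's decision after each error.
     *
     * @param plan    hosts in the order returned by the load balancing policy
     * @param failing hosts that answer every request with an error
     * @param policy  maps (failed host, number of retries so far) to a decision
     * @return a short description of what the caller would observe
     */
    static String execute(List<String> plan, List<String> failing,
                          BiFunction<String, Integer, SketchDecision> policy) {
        List<String> errors = new ArrayList<>();
        int retries = 0;
        int index = 0;
        while (index < plan.size()) {
            String host = plan.get(index);
            if (!failing.contains(host)) {
                return "success on " + host;
            }
            errors.add(host);
            switch (policy.apply(host, retries)) {
                case RETRY:
                    retries++;          // same host, index unchanged
                    break;
                case TRY_NEXT_HOST:
                    retries++;
                    index++;            // move down the query plan
                    break;
                case RETHROW:
                    return "error from " + host + " rethrown to the caller";
                case IGNORE:
                    return "empty result set";
            }
        }
        // End of the query plan: the real driver throws NoHostAvailableException here.
        return "no host available, errors on " + errors;
    }
}
```

In this model a policy that always answers `RETRY` for a permanently failing host would loop forever, which is why the retry count is passed to the policy: a real implementation would normally use it to cap the number of attempts.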
| 26 | + |
| 27 | + |
| 28 | +### Retry policy |
| 29 | + |
| 30 | +[RetryPolicy] is a pluggable component that determines the retry decisions for various types of errors. It is configured |
| 31 | +when initializing the cluster: |
| 32 | + |
| 33 | +```java |
| 34 | +Cluster cluster = Cluster.builder() |
| 35 | + .addContactPoint("127.0.0.1") |
| 36 | + .withRetryPolicy(new MyCustomPolicy()) |
| 37 | + .build(); |
| 38 | +``` |
| 39 | + |
| 40 | +Once the cluster has been built, you can't change the policy, but you may inspect it at runtime: |
| 41 | + |
| 42 | +```java |
| 43 | +RetryPolicy policy = cluster.getConfiguration().getPolicies().getRetryPolicy(); |
| 44 | +``` |
| 45 | + |
| 46 | +If you don't explicitly configure it, you get a [DefaultRetryPolicy]. |
| 47 | + |
| 48 | +The policy's methods cover different types of errors: |
| 49 | + |
| 50 | +#### [onUnavailable] |
| 51 | + |
| 52 | +A request reached the coordinator, but there weren't enough live replicas to achieve the requested consistency level. |
| 53 | +The coordinator replied with an `UNAVAILABLE` error. |
| 54 | + |
| 55 | +If the policy rethrows the error, the user code will get an [UnavailableException]. You can inspect the exception's |
| 56 | +fields to get the number of replicas that were *known* to be alive when the error was triggered, as well as the number |
| 57 | +of replicas that were *required* by the requested consistency level. |
| 58 | + |
| 59 | +#### [onReadTimeout] |
| 60 | + |
| 61 | +A read request reached the coordinator, which initially believed that there were enough live replicas to process it. |
| 62 | +But, for some reason, one or several replicas were too slow to answer within the predefined timeout |
| 63 | +(`read_request_timeout_in_ms` in `cassandra.yaml`), and the coordinator replied to the client with a `READ_TIMEOUT` |
| 64 | +error. |
| 65 | + |
| 66 | +This could be because these replicas were temporarily overloaded, or because they failed or were shut down. |
| 67 | +To minimize internal network traffic, Cassandra doesn't request data from every replica during reads; |
| 68 | +instead, some replicas are only asked for a checksum of the data. A read timeout may occur |
| 69 | +even if enough replicas responded to fulfill the consistency level, but only checksum responses were received (the |
| 70 | +method's `dataRetrieved` parameter allows you to check if you're in that situation). |
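
As a rough illustration of how the `dataRetrieved` flag can drive a decision, here is a stand-alone sketch. It does not use the driver's types: `ReadDecision` and `ReadTimeoutSketch` are hypothetical, and the parameter names other than `dataRetrieved` are assumptions chosen to match the counts described above. The rule shown (retry once on the same host when enough replicas answered but none sent data) is one plausible choice, not a statement of what any built-in policy does.

```java
// Simplified stand-in for illustration; not the driver's RetryDecision.
enum ReadDecision { RETRY_SAME_HOST, RETHROW }

final class ReadTimeoutSketch {

    /**
     * @param requiredResponses replicas needed by the consistency level (assumed name)
     * @param receivedResponses replicas that answered before the timeout (assumed name)
     * @param dataRetrieved     whether at least one answer carried the actual data
     * @param nbRetry           number of retries already attempted (assumed name)
     */
    static ReadDecision onReadTimeout(int requiredResponses, int receivedResponses,
                                      boolean dataRetrieved, int nbRetry) {
        if (nbRetry != 0) {
            return ReadDecision.RETHROW;            // never retry more than once
        }
        boolean enoughReplicas = receivedResponses >= requiredResponses;
        // Enough replicas answered, but only with checksums: a second attempt
        // has a reasonable chance of getting the data.
        return (enoughReplicas && !dataRetrieved)
                ? ReadDecision.RETRY_SAME_HOST
                : ReadDecision.RETHROW;
    }
}
```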
| 71 | + |
| 72 | +If the policy rethrows the error, the user code will get a [ReadTimeoutException]. |
| 73 | + |
| 74 | +Note: do not confuse this error with a [driver read timeout], which happens when the coordinator didn't reply at all to |
| 75 | +the client. |
| 76 | + |
| 77 | +#### [onWriteTimeout] |
| 78 | + |
| 79 | +This is similar to `onReadTimeout`, but for write operations. Reads and writes are handled separately because a read |
| 80 | +never mutates data, whereas a write usually does. If a write times out at the |
| 81 | +coordinator level, there is no way to know whether the mutation was applied or not on the non-answering replicas. |
| 82 | +Policy implementations are usually more conservative in `onWriteTimeout` in case the write was not idempotent. |
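
A conservative write-timeout rule might look like the following stand-alone sketch. The enums and the `WriteTimeoutSketch` class are hypothetical stand-ins, not driver classes, and the `idempotent` boolean is assumed to be read from the statement by the caller. The rule itself (retry once for batch log writes or idempotent statements, otherwise rethrow) is an illustrative choice.

```java
// Simplified stand-ins for illustration; not the driver's WriteType or RetryDecision.
enum SketchWriteType { SIMPLE, BATCH, UNLOGGED_BATCH, COUNTER, BATCH_LOG }
enum WriteDecision { RETRY_SAME_HOST, RETHROW }

final class WriteTimeoutSketch {

    /**
     * @param writeType  kind of write that timed out
     * @param idempotent whether applying the statement twice is harmless
     * @param nbRetry    number of retries already attempted (assumed name)
     */
    static WriteDecision onWriteTimeout(SketchWriteType writeType, boolean idempotent, int nbRetry) {
        if (nbRetry != 0) {
            return WriteDecision.RETHROW;           // at most one retry
        }
        if (writeType == SketchWriteType.BATCH_LOG) {
            return WriteDecision.RETRY_SAME_HOST;   // batch log writes are treated as retryable here
        }
        // For every other write, only retry when a duplicate application is harmless.
        return idempotent ? WriteDecision.RETRY_SAME_HOST : WriteDecision.RETHROW;
    }
}
```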
| 83 | + |
| 84 | +If the policy rethrows the error, the user code will get a [WriteTimeoutException]. |
| 85 | + |
| 86 | +#### [onRequestError] |
| 87 | + |
| 88 | +This gets called for any other error occurring after the request was sent. |
| 89 | + |
| 90 | +Note: this method is specified by [ExtendedRetryPolicy], a child interface of `RetryPolicy`. This is for backward |
| 91 | +compatibility: `onRequestError` did not exist in previous 2.1 versions, so it wasn't added directly on `RetryPolicy` to |
| 92 | +avoid breaking legacy implementations. All built-in policies implement `ExtendedRetryPolicy`, and we recommend that you |
| 93 | +do too. If you pass a policy that only implements `RetryPolicy`, the driver will use a default `onRequestError` |
| 94 | +implementation that always retries on the next host with the same consistency level (this is consistent with what |
| 95 | +previous 2.1 versions did). In 3.0+, the two interfaces have been merged back together. |
| 96 | + |
| 97 | +The method receives the exception as a parameter, so that implementations can refine their decision based on what |
| 98 | +happened. The possible exceptions are: |
| 99 | + |
| 100 | +* [ServerError]: thrown by the coordinator when an unexpected error occurs. This is generally a Cassandra bug; |
| 101 | +* [OperationTimedOutException]: thrown by the client when it didn't hear back from the coordinator within the |
| 102 | + [driver read timeout]; |
| 103 | +* [ConnectionException]: thrown by the client for any network issue while or after the request was written; |
| 104 | +* [OverloadedException]: thrown by the coordinator when replicas are down and the number of hinted handoffs gets too |
| 105 | + high; the coordinator temporarily refuses writes for these replicas (see [hinted handoffs] in the Cassandra |
| 106 | + documentation). |
| 107 | + |
| 108 | +In all these cases, there is no absolute certainty that the request was not applied by a host, so implementations must |
| 109 | +consider the risk of retrying non-idempotent statements. |
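
One way to picture a decision that depends on the kind of error is the stand-alone sketch below. `ErrorKind`, `RequestErrorDecision` and `RequestErrorSketch` are hypothetical and do not use the driver's exception classes; the `maxRetries` cap and the choice to rethrow server errors (on the assumption that another host would likely hit the same bug) are illustrative design decisions, not driver behavior.

```java
// Simplified stand-ins for illustration; each constant mirrors one of the
// exception types listed above, but these are not driver classes.
enum ErrorKind { SERVER_ERROR, OPERATION_TIMED_OUT, CONNECTION, OVERLOADED }
enum RequestErrorDecision { TRY_NEXT_HOST, RETHROW }

final class RequestErrorSketch {

    /**
     * @param kind       what went wrong after the request was sent
     * @param idempotent whether applying the statement twice is harmless
     * @param nbRetry    number of retries already attempted (assumed name)
     * @param maxRetries hypothetical cap on the number of attempts
     */
    static RequestErrorDecision onRequestError(ErrorKind kind, boolean idempotent,
                                               int nbRetry, int maxRetries) {
        if (!idempotent) {
            // The request may have been applied; a retry could apply it twice.
            return RequestErrorDecision.RETHROW;
        }
        if (kind == ErrorKind.SERVER_ERROR) {
            // Assumption: an unexpected server-side error would likely repeat elsewhere.
            return RequestErrorDecision.RETHROW;
        }
        return nbRetry < maxRetries
                ? RequestErrorDecision.TRY_NEXT_HOST
                : RequestErrorDecision.RETHROW;
    }
}
```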
| 110 | + |
| 111 | +### Hard-coded rules |
| 112 | + |
| 113 | +There are a few cases where retrying is always the right thing to do. These are not covered by `RetryPolicy`, but |
| 114 | +hard-coded in the driver: |
| 115 | + |
| 116 | +* **any error before a network write was attempted**: to send a query, the driver selects a host, borrows a connection |
| 117 | + from the host's [connection pool], and then writes the message to the connection. |
| 118 | + Errors can occur before the write was even attempted, for example if the connection pool is saturated, or if the |
| 119 | + host went down right after we borrowed. In those cases, it is always safe to retry since the request wasn't sent, so |
| 120 | + the driver will transparently move to the next host in the query plan. |
| 121 | + |
| 122 | +* **re-preparing a statement**: when the driver executes a prepared statement, it may find out that the coordinator |
| 123 | + doesn't know about it, and need to re-prepare it on the fly (this is described in detail [here][prepared]). The query |
| 124 | + is then retried on the same host. |
| 125 | + |
| 126 | +* **trying to communicate with a host that is bootstrapping**: this is a rare edge case, as in practice the driver |
| 127 | + should never try to communicate with a bootstrapping host (the only way is if it was specified as a contact point). |
| 128 | + Anyway, it is again safe to assume that the query was not executed at all, so the driver moves to the next host. |
| 129 | + |
| 130 | +Similarly, some errors have no chance of being solved by a retry. They will always be rethrown directly to the user. |
| 131 | +These include: |
| 132 | + |
| 133 | +* [QueryValidationException] and any of its subclasses ([InvalidQueryException], [InvalidConfigurationInQueryException], |
| 134 | + [UnauthorizedException], [SyntaxError], [AlreadyExistsException]); |
| 135 | +* [TruncateException]. |
| 136 | + |
| 137 | + |
| 138 | +### Retries and idempotence |
| 139 | + |
| 140 | +Retry policies should avoid retrying non-idempotent queries (see [idempotent queries]) when it's not clear whether the query was applied or not. |
| 141 | + |
| 142 | +It's always safe to retry in `onReadTimeout`, since by definition we know that the query is a read, which doesn't mutate |
| 143 | +any data. Similarly, `onUnavailable` is safe: the coordinator is telling us that it didn't find enough replicas, so we |
| 144 | +know that it didn't try to apply the query. |
| 145 | + |
| 146 | +`onWriteTimeout` is not safe. The default retry policy is very conservative (it only retries batch log writes), so it |
| 147 | +will not cause any issue. Custom implementations should check the statement's idempotent flag. |
| 148 | + |
| 149 | +`onRequestError` isn't safe either. |
| 150 | + |
| 151 | +**For historical reasons, the built-in retry policy implementations do not check the idempotent flag in |
| 152 | +`onRequestError`**. This is based on the fact that: |
| 153 | + |
| 154 | +* previous versions of the driver (which handled these errors internally, before `onRequestError` existed) didn't check the flag either; |
| 155 | +* the majority of queries in an application should be idempotent; |
| 156 | +* statements start out as non-idempotent by default. |
| 157 | + |
| 158 | +If `onRequestError` enforced idempotence strictly, this would cause a lot of queries that were legitimately retried with |
| 159 | +previous driver versions to not be retried anymore. This would be a big behavioral change, so we decided to prefer |
| 160 | +consistency with previous versions. |
| 161 | + |
| 162 | +If you prefer strict handling of the idempotent flag, you can wrap your retry policy in an |
| 163 | +[IdempotenceAwareRetryPolicy]. Make sure to set the flag correctly on all statements. |
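
The wrapping idea can be sketched as a small decorator. This is a stand-alone model, not the driver's `IdempotenceAwareRetryPolicy`: `SketchPolicy`, `Outcome` and `IdempotenceGuard` are hypothetical names, and the interface is reduced to the two methods that are unsafe for non-idempotent statements.

```java
// Simplified stand-ins for illustration; not the driver's interfaces.
enum Outcome { RETRY, TRY_NEXT_HOST, RETHROW }

interface SketchPolicy {
    Outcome onWriteTimeout(boolean idempotent, int nbRetry);
    Outcome onRequestError(boolean idempotent, int nbRetry);
}

/** Delegates to a child policy, but rethrows whenever the statement is not idempotent. */
final class IdempotenceGuard implements SketchPolicy {
    private final SketchPolicy child;

    IdempotenceGuard(SketchPolicy child) {
        this.child = child;
    }

    @Override
    public Outcome onWriteTimeout(boolean idempotent, int nbRetry) {
        return idempotent ? child.onWriteTimeout(idempotent, nbRetry) : Outcome.RETHROW;
    }

    @Override
    public Outcome onRequestError(boolean idempotent, int nbRetry) {
        return idempotent ? child.onRequestError(idempotent, nbRetry) : Outcome.RETHROW;
    }
}
```

The benefit of a decorator is that the child policy stays unaware of idempotence: the same child can be used with or without the guard, depending on how strictly the application sets the flag.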
| 164 | + |
| 165 | + |
| 166 | +[RetryDecision]: http://docs.datastax.com/en/drivers/java/2.1/com/datastax/driver/core/policies/RetryPolicy.RetryDecision.html |
| 167 | +[retry()]: http://docs.datastax.com/en/drivers/java/2.1/com/datastax/driver/core/policies/RetryPolicy.RetryDecision.html#retry-com.datastax.driver.core.ConsistencyLevel- |
| 168 | +[tryNextHost()]: http://docs.datastax.com/en/drivers/java/2.1/com/datastax/driver/core/policies/RetryPolicy.RetryDecision.html#tryNextHost-com.datastax.driver.core.ConsistencyLevel- |
| 169 | +[rethrow()]: http://docs.datastax.com/en/drivers/java/2.1/com/datastax/driver/core/policies/RetryPolicy.RetryDecision.html#rethrow-- |
| 170 | +[ignore()]: http://docs.datastax.com/en/drivers/java/2.1/com/datastax/driver/core/policies/RetryPolicy.RetryDecision.html#ignore-- |
| 171 | +[NoHostAvailableException]: http://docs.datastax.com/en/drivers/java/2.1/com/datastax/driver/core/exceptions/NoHostAvailableException.html |
| 172 | +[getErrors()]: http://docs.datastax.com/en/drivers/java/2.1/com/datastax/driver/core/exceptions/NoHostAvailableException.html#getErrors-- |
| 173 | +[RetryPolicy]: http://docs.datastax.com/en/drivers/java/2.1/com/datastax/driver/core/policies/RetryPolicy.html |
| 174 | +[DefaultRetryPolicy]: http://docs.datastax.com/en/drivers/java/2.1/com/datastax/driver/core/policies/DefaultRetryPolicy.html |
| 175 | +[onReadTimeout]: http://docs.datastax.com/en/drivers/java/2.1/com/datastax/driver/core/policies/DefaultRetryPolicy.html#onReadTimeout-com.datastax.driver.core.Statement-com.datastax.driver.core.ConsistencyLevel-int-int-boolean-int- |
| 176 | +[onWriteTimeout]: http://docs.datastax.com/en/drivers/java/2.1/com/datastax/driver/core/policies/DefaultRetryPolicy.html#onWriteTimeout-com.datastax.driver.core.Statement-com.datastax.driver.core.ConsistencyLevel-com.datastax.driver.core.WriteType-int-int-int- |
| 177 | +[onUnavailable]: http://docs.datastax.com/en/drivers/java/2.1/com/datastax/driver/core/policies/DefaultRetryPolicy.html#onUnavailable-com.datastax.driver.core.Statement-com.datastax.driver.core.ConsistencyLevel-int-int-int- |
| 178 | +[onRequestError]: http://docs.datastax.com/en/drivers/java/2.1/com/datastax/driver/core/policies/ExtendedRetryPolicy.html#onRequestError-com.datastax.driver.core.Statement-com.datastax.driver.core.ConsistencyLevel-java.lang.Exception-int- |
| 179 | +[ExtendedRetryPolicy]: http://docs.datastax.com/en/drivers/java/2.1/com/datastax/driver/core/policies/ExtendedRetryPolicy.html |
| 180 | +[UnavailableException]: http://docs.datastax.com/en/drivers/java/2.1/com/datastax/driver/core/exceptions/UnavailableException.html |
| 181 | +[ReadTimeoutException]: http://docs.datastax.com/en/drivers/java/2.1/com/datastax/driver/core/exceptions/ReadTimeoutException.html |
| 182 | +[WriteTimeoutException]: http://docs.datastax.com/en/drivers/java/2.1/com/datastax/driver/core/exceptions/WriteTimeoutException.html |
| 183 | +[OverloadedException]: http://docs.datastax.com/en/drivers/java/2.1/com/datastax/driver/core/exceptions/OverloadedException.html |
| 184 | +[ServerError]: http://docs.datastax.com/en/drivers/java/2.1/com/datastax/driver/core/exceptions/ServerError.html |
| 185 | +[OperationTimedOutException]: http://docs.datastax.com/en/drivers/java/2.1/com/datastax/driver/core/OperationTimedOutException.html |
| 186 | +[ConnectionException]: http://docs.datastax.com/en/drivers/java/2.1/com/datastax/driver/core/ConnectionException.html |
| 187 | +[QueryValidationException]: http://docs.datastax.com/en/drivers/java/2.1/com/datastax/driver/core/exceptions/QueryValidationException.html |
| 188 | +[InvalidQueryException]: http://docs.datastax.com/en/drivers/java/2.1/com/datastax/driver/core/exceptions/InvalidQueryException.html |
| 189 | +[InvalidConfigurationInQueryException]: http://docs.datastax.com/en/drivers/java/2.1/com/datastax/driver/core/exceptions/InvalidConfigurationInQueryException.html |
| 190 | +[UnauthorizedException]: http://docs.datastax.com/en/drivers/java/2.1/com/datastax/driver/core/exceptions/UnauthorizedException.html |
| 191 | +[SyntaxError]: http://docs.datastax.com/en/drivers/java/2.1/com/datastax/driver/core/exceptions/SyntaxError.html |
| 192 | +[AlreadyExistsException]: http://docs.datastax.com/en/drivers/java/2.1/com/datastax/driver/core/exceptions/AlreadyExistsException.html |
| 193 | +[TruncateException]: http://docs.datastax.com/en/drivers/java/2.1/com/datastax/driver/core/exceptions/TruncateException.html |
| 194 | +[IdempotenceAwareRetryPolicy]: http://docs.datastax.com/en/drivers/java/2.1/com/datastax/driver/core/policies/IdempotenceAwareRetryPolicy.html |
| 195 | + |
| 196 | +[query plan]: ../load_balancing/#query-plan |
| 197 | +[connection pool]: ../pooling/ |
| 198 | +[prepared]: ../statements/prepared/#preparing-on-multiple-nodes |
| 199 | +[driver read timeout]: ../socket_options/#driver-read-timeout |
| 200 | +[hinted handoffs]: https://docs.datastax.com/en/cassandra/2.1/cassandra/dml/dml_about_hh_c.html?scroll=concept_ds_ifg_jqx_zj__performance |
| 201 | +[idempotent queries]: ../idempotence/ |