Configure dogpile.cache to handle memcached pod failures

lmiccini · openshift-merge-bot[bot] · commit e5a56f90aa85 · 2025-01-07T12:23:07.000Z
Whenever one of the memcached pods disappears, either because of a rolling restart during a minor update or as the result of a failure, the APIs can take a long time to detect that the pod went away and keep trying to reconnect. In a quick round of tests we saw downtimes of up to ~150s.

With the TLS backend (dogpile.cache.pymemcache), enabling the retry client (enable_retry_client) and limiting the number of retries makes the behavior much more acceptable.

Similarly, when TLS is not in use (dogpile.cache.memcached), we set a lower value for memcache_dead_retry so that the client reconnects much faster to a new pod that has the same DNS name but a different IP.

Jira: https://issues.redhat.com/browse/OSPRH-11935
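For illustration, this is roughly what the rendered cache section could look like in each branch after this change. It is a sketch, not output captured from a deployment: the `[cache]` section name is inferred from the option names, and the server hostnames, port, and `inet:` address format are hypothetical placeholders for whatever `.MemcachedServers` and `.MemcachedServersWithInet` expand to.

```ini
# TLS enabled (.MemcachedTLS is true): pymemcache backend with a bounded retry client
[cache]
enabled = True
backend = dogpile.cache.pymemcache
# hypothetical server list
memcache_servers=memcached-0.memcached.example.svc:11211
enable_retry_client = true
retry_attempts = 2
retry_delay = 0
tls_enabled=true

# TLS disabled (.MemcachedTLS is false): memcached backend with a shorter dead-server retry
[cache]
enabled = True
backend = dogpile.cache.memcached
# hypothetical server list; the exact inet format is an assumption
memcache_servers=inet:[memcached-0.memcached.example.svc]:11211
memcache_dead_retry = 10
tls_enabled=false
```

Only one of the two variants appears in a given rendered file; they are shown together here to contrast the options each branch adds.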
diff --git a/templates/nova.conf b/templates/nova.conf
@@ -173,9 +173,13 @@ enabled = True
 {{if .MemcachedTLS}}
 backend = dogpile.cache.pymemcache
 memcache_servers={{ .MemcachedServers }}
+enable_retry_client = true
+retry_attempts = 2
+retry_delay = 0
 {{else}}
 backend = dogpile.cache.memcached
 memcache_servers={{ .MemcachedServersWithInet }}
+memcache_dead_retry = 10
 {{end}}
 tls_enabled={{ .MemcachedTLS }}
 {{else}}