We are running Cortex in a dedicated EKS cluster.
More than 70 other clusters send their metrics to this Cortex instance.
Each cluster’s Grafana is configured to query Cortex for data visualization.
For the past couple of months, we have been observing gaps in Grafana panels for time ranges longer than 6 hours. This has only been observed for our biggest tenant (around 27.7 million series).
There are no missing metrics — all data is successfully received by Cortex.
The issue appears to be related to query results caching.
Restarting the Memcached frontend resolves the problem temporarily: after the restart, the gaps disappear.
Memcached frontend config:
query_range:
  cache_results: true
  results_cache:
    cache:
      memcached_client:
        host: cortex-infra-memcached-frontend.cortex-infra.svc.cluster.local
        timeout: 3s
        max_idle_conns: 200
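
One way to test the caching hypothesis is to run the same range query twice, once normally and once bypassing the results cache, and compare where the gaps fall. The sketch below covers only the comparison step: it scans the `values` array of a Prometheus-style `query_range` response for holes larger than the query step. The `find_gaps` helper and the sample data are hypothetical and for illustration only. It also assumes the query-frontend honors a `Cache-Control: no-store` request header for the uncached run, which should be verified against the deployed Cortex version.

```python
from typing import List, Sequence, Tuple


def find_gaps(values: Sequence[Sequence], step: float) -> List[Tuple[float, float]]:
    """Return (last_seen_ts, next_seen_ts) pairs where consecutive samples
    are further apart than the query step.

    `values` is the list of [timestamp, "value"] pairs from one series of a
    Prometheus-style query_range response; `step` is the query step in seconds.
    """
    timestamps = sorted(float(ts) for ts, _ in values)
    gaps = []
    for prev, cur in zip(timestamps, timestamps[1:]):
        # Allow some slack so small timestamp jitter is not reported as a gap.
        if cur - prev > step * 1.5:
            gaps.append((prev, cur))
    return gaps


# Made-up sample data: the cached response is missing two points,
# the uncached response for the same range is complete.
cached = [[0, "1"], [60, "1"], [240, "1"], [300, "1"]]
uncached = [[0, "1"], [60, "1"], [120, "1"], [180, "1"], [240, "1"], [300, "1"]]

print(find_gaps(cached, 60))    # [(60.0, 240.0)]
print(find_gaps(uncached, 60))  # []
```

If the gaps show up only in the cached response, that points at the results cache rather than at ingestion or storage.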