alloy-metrics on tamarin / staging is failing to send metrics to mimir #3856

Open
QuantumEnigmaa opened this issue Jan 30, 2025 · 2 comments

QuantumEnigmaa commented Jan 30, 2025

The pod fails to send samples, with the following error message:

server returned HTTP status 400 Bad Request: send data to ingesters: failed pushing to ingester mimir-ingester-2: user=anonymous: the sample has been rejected because another sample with a more recent timestamp has already been ingested and this sample is beyond the out-of-order time window of 5m (err-mimir-sample-timestamp-too-old).

One easy workaround would be to increase the out_of_order_time_window field in tamarin's mimir config (currently set to 5m), but the root cause here might be that alloy needs more shards.
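To illustrate why a larger window would silence the error, here is a minimal sketch of the acceptance check. It assumes the behaviour described in Mimir's documentation for out_of_order_time_window: a sample is accepted when its timestamp falls within the window counted back from the newest timestamp already in the tenant's TSDB. The function name and the example timestamps are hypothetical, and the real ingester does more than this.

```python
from datetime import datetime, timedelta, timezone


def within_ooo_window(sample_ts: datetime, tsdb_max_ts: datetime, window: timedelta) -> bool:
    """Simplified model: accept a sample unless it is older than tsdb_max_ts - window."""
    return sample_ts >= tsdb_max_ts - window


# Hypothetical values, for illustration only.
latest = datetime(2025, 1, 30, 12, 0, 0, tzinfo=timezone.utc)
late_sample = latest - timedelta(minutes=20)

print(within_ooo_window(late_sample, latest, timedelta(minutes=5)))   # False
print(within_ooo_window(late_sample, latest, timedelta(minutes=30)))  # True
```

A wider window only hides the delay; it does not explain why the samples arrive late in the first place.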

We had multiple such alerts last week: tamarin / production / alloy-metrics - MetricForwardingErrors, on all tamarin clusters (testing, staging, production).

We don't know why these happen; we should investigate and fix the cause.

QuantumEnigmaa added the team/atlas label Jan 30, 2025
github-project-automation bot moved this to Inbox 📥 in Roadmap Jan 30, 2025
@QuentinBisson

I have the same issue on testing as well, but I would love to know why the samples are 20 minutes behind:

ts=2025-01-30T12:46:44.244494161Z caller=grpc_logging.go:76 level=warn method=/cortex.Ingester/Push duration=21.90549ms msg=gRPC err="user=anonymous: the sample has been rejected because another sample with a more recent timestamp has already been ingested and this sample is beyond the out-of-order time window of 5m (err-mimir-sample-timestamp-too-old). The affected sample has timestamp 2025-01-30T10:24:46.333Z and is from series statistics_request_count{appid="kube", cluster_id="staging", cluster_type="workload_cluster", container="mtail", customer="panamax", endpoint="mtail", installation="tamarin", instance="10.244.15.145:3903", job="totp-staging/mtail", namespace="totp-staging", organization="panamax", pid="24135", pipeline="stable", pod="tdsrest-6dbd9f77b-44btg", provider="cloud-director", region="onprem", request="TOTPProbe", service_priority="highest"} (sampled 1/10)"
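To quantify how far behind a rejected sample is, the two timestamps in a log line like the one above can be compared directly. The sketch below is only an illustration: the helper names are hypothetical, the regular expressions assume the exact wording of this log line, and the line is abbreviated (series labels omitted).

```python
import re
from datetime import datetime, timedelta, timezone

TS_RE = re.compile(r"(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})(?:\.(\d+))?Z")


def parse_utc(ts: str) -> datetime:
    """Parse an RFC 3339 UTC timestamp, truncating fractions to microseconds."""
    m = TS_RE.fullmatch(ts)
    if m is None:
        raise ValueError(f"unexpected timestamp format: {ts!r}")
    base = datetime.strptime(m.group(1), "%Y-%m-%dT%H:%M:%S").replace(tzinfo=timezone.utc)
    frac = (m.group(2) or "")[:6].ljust(6, "0")
    return base + timedelta(microseconds=int(frac))


def sample_lag(log_line: str) -> timedelta:
    """Return (time the rejection was logged) - (timestamp of the rejected sample)."""
    logged = re.search(r"\bts=(\S+)", log_line)
    sample = re.search(r"affected sample has timestamp (\S+)", log_line)
    if logged is None or sample is None:
        raise ValueError("log line does not contain both timestamps")
    return parse_utc(logged.group(1)) - parse_utc(sample.group(1))


# Abbreviated copy of the log line above.
line = (
    'ts=2025-01-30T12:46:44.244494161Z caller=grpc_logging.go:76 level=warn '
    'msg=gRPC err="... The affected sample has timestamp 2025-01-30T10:24:46.333Z '
    'and is from series statistics_request_count{...} (sampled 1/10)"'
)
print(sample_lag(line))  # 2:21:57.911494
```

For this particular line the gap works out to roughly 2 hours 22 minutes, well beyond both the 5m window and the 20 minutes mentioned above.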

@QuentinBisson

Restarting the pod fixes it for a while, but ...

Rotfuks moved this from Inbox 📥 to Up Next ➡️ in Roadmap Feb 6, 2025