
Standardize Karmada metrics label for member clusters to avoid conflicts with Prometheus external_labels #6781

@jabellard

Description

What would you like to be added:

Standardize the label used to denote the Karmada member cluster across all exported metrics, and avoid using the generic label key cluster. I propose introducing a single, consistent label key — member_cluster — and migrating existing metrics to use it (deprecating current usages like cluster and cluster_name).

Concretely:

  • For resource-sync counters:
    • create_resource_to_cluster{..., cluster="<member>", ...} → create_resource_to_cluster{..., member_cluster="<member>", ...}
    • update_resource_to_cluster{..., cluster="<member>", ...} → ...{member_cluster="<member>", ...}
    • delete_resource_from_cluster{..., cluster="<member>", ...} → ...{member_cluster="<member>", ...}
  • For cluster status/summary metrics:
    • *_cluster_*{cluster_name="<member>"} → ...{member_cluster="<member>"}

Suggested rollout:

  1. Add member_cluster in the next release while continuing to populate cluster/cluster_name.
  2. Document member_cluster as the canonical label; mark cluster/cluster_name as deprecated.
  3. After at least one stable release, drop the deprecated keys.

Include a short “migration” section in docs (example PromQL and relabel rules) and note the change in the release notes.

Why is this needed:

  1. Avoid collisions with Prometheus external_labels
    external_labels are commonly used to tag the Prometheus server (e.g., cluster, region, replica) for HA, federation, and remote storage systems like Thanos/Cortex. Using cluster inside Karmada metrics to mean “member cluster” collides with the equally common cluster external label that means “Prometheus/monitoring cluster,” making series ambiguous and forcing brittle relabeling.
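    For illustration, a typical Prometheus server might be tagged like this (values are hypothetical):

    global:
      external_labels:
        cluster: monitoring-eu-1   # identifies the Prometheus/monitoring cluster
        replica: prometheus-0      # distinguishes HA replicas

    Any Karmada series that already carries cluster="member1" then mixes two meanings of the same key as soon as these labels are attached at remote-write, federation, or alerting time.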

  2. Improved portability with long-term storage and HA stacks
    Projects and vendors routinely rely on external labels for sharding/dedup (e.g., HA pairs use cluster/replica-style labels). When Karmada metrics also use cluster, remote stores and query layers have to disambiguate two different “cluster” concepts. Standardizing on member_cluster eliminates that ambiguity and reduces operator error during ingestion and querying.

  3. Alignment with Prometheus naming guidance
    Prometheus best practices encourage clear, descriptive, and consistent label naming. A domain-specific label (member_cluster) communicates intent better than a generic cluster, and it matches the semantic role Karmada plays (a control plane over member clusters). Consistency also helps users build reusable dashboards and recording rules.

  4. Fewer surprises in Alertmanager and federation
    Alert labels inherit external labels; mismatched or colliding label keys break grouping/dedup unless users add custom alert relabeling. Using an unambiguous member_cluster key removes this footgun and keeps alert routing predictable across environments.
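    As a sketch, a route that groups on the proposed key stays predictable no matter what external_labels contribute (only the relevant Alertmanager fragment is shown):

    route:
      group_by: ['alertname', 'member_cluster']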

  5. Karmada’s metrics are already inconsistent
    Within Karmada’s own code, some metrics use cluster (e.g., create_resource_to_cluster, update_resource_to_cluster, delete_resource_from_cluster), while others use cluster_name (e.g., cluster_ready_state, cluster_node_number, cluster_sync_status_duration_seconds). Unifying on member_cluster across both groups improves discoverability and reduces cognitive load for users writing PromQL/alerts.
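    For example, the same "per member cluster" breakdown currently has to be written two different ways depending on the metric:

    # resource-sync counters use "cluster"
    sum by (cluster) (rate(create_resource_to_cluster[5m]))

    # cluster status metrics use "cluster_name"
    sum by (cluster_name) (cluster_node_number)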


🛠 Proposed Migration Guide

To transition from the current inconsistent label usage (cluster, cluster_name) to the standardized member_cluster, I recommend the following phased approach:

1. Normalize at Ingestion with Relabeling

Operators should use metric_relabel_configs in their Prometheus scrape jobs to immediately normalize label keys, so all metrics are stored under the new canonical label:

metric_relabel_configs:
  # Copy the value of cluster_name into member_cluster.
  - action: labelmap
    regex: ^cluster_name$
    replacement: member_cluster
  # Copy the value of cluster into member_cluster.
  - action: labelmap
    regex: ^cluster$
    replacement: member_cluster
  # labelmap only copies labels, so drop the old keys afterwards
  # to ensure only member_cluster is stored.
  - action: labeldrop
    regex: ^(cluster|cluster_name)$

🔎 What this does:

  • If a metric has cluster_name="foo", it will be stored as member_cluster="foo".
  • If a metric has cluster="bar", it will also be stored as member_cluster="bar".
  • After this transformation, only member_cluster will exist in Prometheus’s TSDB.
  • This avoids having duplicate series that differ only by label key.

⚠️ Important caveat:
Once relabeling is enabled, newly ingested samples carry only member_cluster, so existing queries that use cluster or cluster_name stop matching them. Historical samples ingested before the change still carry the old keys, so range queries that span the cutover may need to handle both (see the transitional query in the next section). Going forward, queries must be migrated to member_cluster.


2. Update Queries and Dashboards

All PromQL queries, recording rules, and dashboards should be updated to reference member_cluster instead of cluster or cluster_name.

Example:

# Before
sum by (cluster) (rate(create_resource_to_cluster[5m]))

# After
sum by (member_cluster) (rate(create_resource_to_cluster[5m]))
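If dashboards must keep working across the cutover (or across Prometheus instances that enable relabeling at different times), a stop-gap query can copy the old key into the new one at query time. This is only a transitional sketch, not something to keep long term:

# Works whether the series carry the old "cluster" key or the new "member_cluster" key
sum by (member_cluster) (
  label_replace(rate(create_resource_to_cluster[5m]), "member_cluster", "$1", "cluster", "(.+)")
)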

3. Staged Rollout

  • Start by enabling relabeling on a test Prometheus instance to confirm that metrics look as expected.
  • Update your dashboards/alerts in development environments first.
  • Roll out to production once queries are validated.

This staged process avoids surprises in live environments.
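A couple of ad-hoc PromQL checks on the test instance can confirm that the relabeling took effect (metric names are taken from the examples above):

# Should return one result per member cluster, keyed by the new label
count by (member_cluster) ({__name__=~"create_resource_to_cluster|cluster_ready_state"})

# Should return nothing once relabeling is active for the Karmada scrape jobs
count({__name__=~"cluster_.*", cluster_name!=""})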


4. Project-Side Deprecation Plan

On Karmada’s side, I suggest:

  1. Add member_cluster labels to all metrics immediately.
  2. Keep cluster/cluster_name for one full release cycle, marked as deprecated.
  3. Announce in release notes and documentation that operators should switch queries.
  4. After one stable release, drop the old labels entirely.

This ensures alignment between project maintainers and operators, with plenty of notice.
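For illustration, during the deprecation window a single exported series could carry both keys with the same value, so existing queries keep working while operators migrate (label sets and values below are made up; other labels are omitted):

# illustrative /metrics output during the dual-label window
create_resource_to_cluster{cluster="member1",member_cluster="member1"} 17
cluster_ready_state{cluster_name="member1",member_cluster="member1"} 1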


5. Documentation and Communication

  • The official Karmada metrics documentation should clearly distinguish between:
    • member_cluster = the Karmada member cluster (what we’re standardizing).
    • Prometheus external_labels.cluster = the Prometheus server or monitoring cluster.
  • Provide worked examples for Thanos, Cortex, or other federated setups to show why separating these concepts avoids collisions and alerting confusion.
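    For example, once the rename is in place a global query through Thanos or Cortex can use both labels side by side without ambiguity (the external-label value is hypothetical):

    # "cluster" = Prometheus/monitoring cluster (external label); "member_cluster" = Karmada member cluster
    sum by (member_cluster) (rate(create_resource_to_cluster{cluster="monitoring-eu-1"}[5m]))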

In summary:

  • Relabeling enforces consistency right away.
  • Queries must be updated to member_cluster.
  • Project maintainers should support a deprecation window.
  • Documentation and communication are key to avoiding confusion.

This small, backward-compatible naming change will make Karmada’s metrics easier to run in real-world Prometheus/Thanos/Cortex setups, reduce relabeling boilerplate, and clarify queries and alerts for operators.
