What would you like to be added:
Standardize the label used to denote the Karmada member cluster across all exported metrics, and avoid using the generic label key cluster. I propose introducing a single, consistent label key, member_cluster, and migrating existing metrics to use it (deprecating the current usages cluster and cluster_name).
Concretely:
- For resource-sync counters:
  create_resource_to_cluster{..., cluster="<member>", ...} → create_resource_to_cluster{..., member_cluster="<member>", ...}
  update_resource_to_cluster{..., cluster="<member>", ...} → update_resource_to_cluster{..., member_cluster="<member>", ...}
  delete_resource_from_cluster{..., cluster="<member>", ...} → delete_resource_from_cluster{..., member_cluster="<member>", ...}
- For cluster status/summary metrics:
  *_cluster_*{cluster_name="<member>"} → *_cluster_*{member_cluster="<member>"}
Suggested rollout:
- Add member_cluster in the next release while continuing to populate cluster/cluster_name.
- Document member_cluster as the canonical label; mark cluster/cluster_name as deprecated.
- After at least one stable release, drop the deprecated keys.
Include a short “migration” section in docs (example PromQL and relabel rules) and note the change in the release notes.
Why is this needed:
- Avoid collisions with Prometheus external_labels
  external_labels are commonly used to tag the Prometheus server (e.g., cluster, region, replica) for HA, federation, and remote storage systems like Thanos/Cortex. Using cluster inside Karmada metrics to mean "member cluster" collides with the equally common cluster external label that means "Prometheus/monitoring cluster," making series ambiguous and forcing brittle relabeling (a minimal configuration sketch of this collision follows this list).
- Improved portability with long-term storage and HA stacks
  Projects and vendors routinely rely on external labels for sharding/dedup (e.g., HA pairs use cluster/replica-style labels). When Karmada metrics also use cluster, remote stores and query layers have to disambiguate two different "cluster" concepts. Standardizing on member_cluster eliminates that ambiguity and reduces operator error during ingestion and querying.
- Alignment with Prometheus naming guidance
  Prometheus best practices encourage clear, descriptive, and consistent label naming. A domain-specific label (member_cluster) communicates intent better than a generic cluster, and it matches the semantic role Karmada plays (a control plane over member clusters). Consistency also helps users build reusable dashboards and recording rules.
- Fewer surprises in Alertmanager and federation
  Alert labels inherit external labels; mismatched or colliding label keys break grouping/dedup unless users add custom alert relabeling. Using an unambiguous member_cluster key removes this footgun and keeps alert routing predictable across environments.
- Karmada's metrics are already inconsistent
  Within Karmada's own code, some metrics use cluster (e.g., create_resource_to_cluster, update_resource_to_cluster, delete_resource_from_cluster), while others use cluster_name (e.g., cluster_ready_state, cluster_node_number, cluster_sync_status_duration_seconds). Unifying on member_cluster across both groups improves discoverability and reduces cognitive load for users writing PromQL/alerts.
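To make the first point concrete, here is a minimal, hypothetical Prometheus configuration sketch (the cluster names, job name, and target are invented for illustration) showing how the same label key ends up carrying two meanings once external_labels come into play:

global:
  external_labels:
    cluster: monitoring-eu1      # identifies this Prometheus/monitoring cluster for HA, federation, Thanos/Cortex
    replica: prom-0

scrape_configs:
  - job_name: karmada-controller-manager    # hypothetical job name and target
    static_configs:
      - targets: ["karmada-controller-manager:8080"]
    # A scraped sample such as
    #   create_resource_to_cluster{cluster="member-a", ...}
    # already carries "cluster" meaning "Karmada member cluster", which clashes
    # with the external label above meaning "monitoring cluster"; downstream
    # stores and query layers must then disambiguate or relabel. With
    #   create_resource_to_cluster{member_cluster="member-a", ...}
    # the two concepts never overlap.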
🛠 Proposed Migration Guide
To transition from the current inconsistent label usage (cluster, cluster_name) to the standardized member_cluster, I recommend the following phased approach:
1. Normalize at Ingestion with Relabeling
Operators should use metric_relabel_configs in their Prometheus scrape jobs to immediately normalize label keys, so all metrics are stored under the new canonical label:
metric_relabel_configs:
  - action: labelmap
    regex: ^cluster_name$
    replacement: member_cluster
  - action: labelmap
    regex: ^cluster$
    replacement: member_cluster
  # labelmap only copies values, so drop the old keys afterwards
  - action: labeldrop
    regex: ^(cluster|cluster_name)$
🔎 What this does:
- If a metric has cluster_name="foo", it will be stored as member_cluster="foo".
- If a metric has cluster="bar", it will also be stored as member_cluster="bar".
- labelmap only copies the value to the new key; the final labeldrop removes cluster/cluster_name, so after this transformation only member_cluster exists in Prometheus's TSDB.
- This avoids having duplicate series that differ only by label key.
Existing queries that use cluster or cluster_name will not work once relabeling is enabled, because those keys no longer exist in stored metrics. Queries must be migrated to use member_cluster.
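One caveat worth documenting: the metric_relabel_configs above should be attached only to the scrape jobs for Karmada components; applied globally, they would also rewrite the cluster label of unrelated exporters. A minimal sketch, assuming a hypothetical job name and target:

scrape_configs:
  - job_name: karmada               # hypothetical job scraping Karmada control-plane components
    static_configs:
      - targets: ["karmada-controller-manager:8080"]
    metric_relabel_configs:
      - action: labelmap
        regex: ^cluster_name$
        replacement: member_cluster
      - action: labelmap
        regex: ^cluster$
        replacement: member_cluster
      - action: labeldrop           # remove the deprecated keys after copying
        regex: ^(cluster|cluster_name)$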
2. Update Queries and Dashboards
All PromQL queries, recording rules, and dashboards should be updated to reference member_cluster instead of cluster or cluster_name.
Example:
# Before
sum by (cluster) (rate(create_resource_to_cluster[5m]))
# After
sum by (member_cluster) (rate(create_resource_to_cluster[5m]))
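Recording rules and alerts follow the same pattern. Below is a sketch of a Prometheus rules file; the group and rule names are placeholders, and the alert assumes cluster_ready_state reports 1 for a ready member cluster:

groups:
  - name: karmada-member-cluster.rules            # placeholder group name
    rules:
      - record: member_cluster:create_resource_to_cluster:rate5m
        expr: sum by (member_cluster) (rate(create_resource_to_cluster[5m]))
      - alert: KarmadaMemberClusterNotReady       # illustrative only
        expr: cluster_ready_state{member_cluster!=""} == 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Member cluster {{ $labels.member_cluster }} is not ready"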
3. Staged Rollout
- Start by enabling relabeling on a test Prometheus instance to confirm that metrics look as expected.
- Update your dashboards/alerts in development environments first.
- Roll out to production once queries are validated.
This staged process avoids surprises in live environments.
4. Project-Side Deprecation Plan
On Karmada’s side, I suggest:
- Add member_cluster labels to all metrics immediately.
- Keep cluster/cluster_name for one full release cycle, marked as deprecated.
- Announce in release notes and documentation that operators should switch queries.
- After one stable release, drop the old labels entirely.
This ensures alignment between project maintainers and operators, with plenty of notice.
5. Documentation and Communication
- The official Karmada metrics documentation should clearly distinguish between:
  - member_cluster = the Karmada member cluster (what we're standardizing).
  - Prometheus external_labels.cluster = the Prometheus server or monitoring cluster.
- Provide worked examples for Thanos, Cortex, or other federated setups to show why separating these concepts avoids collisions and alerting confusion.
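As a starting point for such a worked example, here is a hedged sketch (endpoint and label values are invented) of a remote-write setup where the two concepts stay cleanly separated:

# prometheus.yml on the monitoring side, scraping the Karmada control plane
global:
  external_labels:
    cluster: monitoring-eu1     # the Prometheus/monitoring cluster (sharding/dedup key)
    replica: prom-0             # HA replica label, deduplicated by Thanos/Cortex

remote_write:
  - url: https://cortex.example.com/api/v1/push   # hypothetical Cortex/Thanos receive endpoint

# Downstream, a query like
#   sum by (cluster, member_cluster) (rate(create_resource_to_cluster[5m]))
# groups unambiguously: cluster = which monitoring stack ingested the series,
# member_cluster = which Karmada member cluster the resource was synced to.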
✅ In summary:
- Relabeling enforces consistency right away.
- Queries must be updated to member_cluster.
- Project maintainers should support a deprecation window.
- Documentation and communication are key to avoiding confusion.
This small, backward-compatible naming change will make Karmada’s metrics easier to run in real-world Prometheus/Thanos/Cortex setups, reduce relabeling boilerplate, and clarify queries and alerts for operators.