
Standardize Karmada metrics label for member clusters to avoid conflicts with Prometheus external_labels #6781

@jabellard

Description

What would you like to be added:

Standardize the label used to denote the Karmada member cluster across all exported metrics, and avoid using the generic label key cluster. I propose introducing a single, consistent label key — member_cluster — and migrating existing metrics to use it (deprecating current usages like cluster and cluster_name).

Concretely:

  • For resource-sync counters:
    • create_resource_to_cluster{..., cluster="<member>", ...} → create_resource_to_cluster{..., member_cluster="<member>", ...}
    • update_resource_to_cluster{..., cluster="<member>", ...} → ...{member_cluster="<member>", ...}
    • delete_resource_from_cluster{..., cluster="<member>", ...} → ...{member_cluster="<member>", ...}
  • For cluster status/summary metrics:
    • *_cluster_*{cluster_name="<member>"} → ...{member_cluster="<member>"}

Suggested rollout:

  1. Add member_cluster in the next release while continuing to populate cluster/cluster_name.
  2. Document member_cluster as the canonical label; mark cluster/cluster_name as deprecated.
  3. After at least one stable release, drop the deprecated keys.

Include a short “migration” section in docs (example PromQL and relabel rules) and note the change in the release notes.

Why is this needed:

  1. Avoid collisions with Prometheus external_labels
    external_labels are commonly used to tag the Prometheus server (e.g., cluster, region, replica) for HA, federation, and remote storage systems like Thanos/Cortex. Using cluster inside Karmada metrics to mean “member cluster” collides with the equally common cluster external label that means “Prometheus/monitoring cluster,” making series ambiguous and forcing brittle relabeling.
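    For illustration, a typical Prometheus server might be tagged like this (values are hypothetical):

    global:
      external_labels:
        cluster: monitoring-eu-1   # identifies the Prometheus/monitoring cluster
        replica: prometheus-0      # distinguishes HA replicas

    Any Karmada series that already carries cluster="member1" then mixes two meanings of the same key as soon as these labels are attached at remote-write, federation, or alerting time.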

  2. Improved portability with long-term storage and HA stacks
    Projects and vendors routinely rely on external labels for sharding/dedup (e.g., HA pairs use cluster/replica-style labels). When Karmada metrics also use cluster, remote stores and query layers have to disambiguate two different “cluster” concepts. Standardizing on member_cluster eliminates that ambiguity and reduces operator error during ingestion and querying.

  3. Alignment with Prometheus naming guidance
    Prometheus best practices encourage clear, descriptive, and consistent label naming. A domain-specific label (member_cluster) communicates intent better than a generic cluster, and it matches the semantic role Karmada plays (a control plane over member clusters). Consistency also helps users build reusable dashboards and recording rules.

  4. Fewer surprises in Alertmanager and federation
    Alert labels inherit external labels; mismatched or colliding label keys break grouping/dedup unless users add custom alert relabeling. Using an unambiguous member_cluster key removes this footgun and keeps alert routing predictable across environments.
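    As a sketch, a route that groups on the proposed key stays predictable no matter what external_labels contribute (only the relevant Alertmanager fragment is shown):

    route:
      group_by: ['alertname', 'member_cluster']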

  5. Karmada’s metrics are already inconsistent
    Within Karmada’s own code, some metrics use cluster (e.g., create_resource_to_cluster, update_resource_to_cluster, delete_resource_from_cluster), while others use cluster_name (e.g., cluster_ready_state, cluster_node_number, cluster_sync_status_duration_seconds). Unifying on member_cluster across both groups improves discoverability and reduces cognitive load for users writing PromQL/alerts.
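    For example, the same "per member cluster" breakdown currently has to be written two different ways depending on the metric:

    # resource-sync counters use "cluster"
    sum by (cluster) (rate(create_resource_to_cluster[5m]))

    # cluster status metrics use "cluster_name"
    sum by (cluster_name) (cluster_node_number)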


🛠 Proposed Migration Guide

To transition from the current inconsistent label usage (cluster, cluster_name) to the standardized member_cluster, I recommend the following phased approach:

1. Normalize at Ingestion with Relabeling

Operators should use metric_relabel_configs in their Prometheus scrape jobs to immediately normalize label keys, so all metrics are stored under the new canonical label:

metric_relabel_configs:
  # Copy the value of cluster_name into member_cluster.
  - action: labelmap
    regex: ^cluster_name$
    replacement: member_cluster
  # Copy the value of cluster into member_cluster.
  - action: labelmap
    regex: ^cluster$
    replacement: member_cluster
  # labelmap only copies labels, so drop the old keys afterwards
  # to ensure only member_cluster is stored.
  - action: labeldrop
    regex: ^(cluster|cluster_name)$

🔎 What this does:

  • If a metric has cluster_name="foo", it will be stored as member_cluster="foo".
  • If a metric has cluster="bar", it will also be stored as member_cluster="bar".
  • After this transformation, only member_cluster will exist in Prometheus’s TSDB.
  • This avoids having duplicate series that differ only by label key.

⚠️ Important caveat:
Once relabeling is enabled, newly ingested samples carry only member_cluster, so existing queries that use cluster or cluster_name stop matching them. Historical samples ingested before the change still carry the old keys, so range queries that span the cutover may need to handle both (see the transitional query in the next section). Going forward, queries must be migrated to member_cluster.


2. Update Queries and Dashboards

All PromQL queries, recording rules, and dashboards should be updated to reference member_cluster instead of cluster or cluster_name.

Example:

# Before
sum by (cluster) (rate(create_resource_to_cluster[5m]))

# After
sum by (member_cluster) (rate(create_resource_to_cluster[5m]))
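If dashboards must keep working across the cutover (or across Prometheus instances that enable relabeling at different times), a stop-gap query can copy the old key into the new one at query time. This is only a transitional sketch, not something to keep long term:

# Works whether the series carry the old "cluster" key or the new "member_cluster" key
sum by (member_cluster) (
  label_replace(rate(create_resource_to_cluster[5m]), "member_cluster", "$1", "cluster", "(.+)")
)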

3. Staged Rollout

  • Start by enabling relabeling on a test Prometheus instance to confirm that metrics look as expected.
  • Update your dashboards/alerts in development environments first.
  • Roll out to production once queries are validated.

This staged process avoids surprises in live environments.
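A couple of ad-hoc PromQL checks on the test instance can confirm that the relabeling took effect (metric names are taken from the examples above):

# Should return one result per member cluster, keyed by the new label
count by (member_cluster) ({__name__=~"create_resource_to_cluster|cluster_ready_state"})

# Should return nothing once relabeling is active for the Karmada scrape jobs
count({__name__=~"cluster_.*", cluster_name!=""})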


4. Project-Side Deprecation Plan

On Karmada’s side, I suggest:

  1. Add member_cluster labels to all metrics immediately.
  2. Keep cluster/cluster_name for one full release cycle, marked as deprecated.
  3. Announce in release notes and documentation that operators should switch queries.
  4. After one stable release, drop the old labels entirely.

This ensures alignment between project maintainers and operators, with plenty of notice.
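For illustration, during the deprecation window a single exported series could carry both keys with the same value, so existing queries keep working while operators migrate (label sets and values below are made up; other labels are omitted):

# illustrative /metrics output during the dual-label window
create_resource_to_cluster{cluster="member1",member_cluster="member1"} 17
cluster_ready_state{cluster_name="member1",member_cluster="member1"} 1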


5. Documentation and Communication

  • The official Karmada metrics documentation should clearly distinguish between:
    • member_cluster = the Karmada member cluster (what we’re standardizing).
    • Prometheus external_labels.cluster = the Prometheus server or monitoring cluster.
  • Provide worked examples for Thanos, Cortex, or other federated setups to show why separating these concepts avoids collisions and alerting confusion.
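    For example, once the rename is in place a global query through Thanos or Cortex can use both labels side by side without ambiguity (the external-label value is hypothetical):

    # "cluster" = Prometheus/monitoring cluster (external label); "member_cluster" = Karmada member cluster
    sum by (member_cluster) (rate(create_resource_to_cluster{cluster="monitoring-eu-1"}[5m]))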

In summary:

  • Relabeling enforces consistency right away.
  • Queries must be updated to member_cluster.
  • Project maintainers should support a deprecation window.
  • Documentation and communication are key to avoiding confusion.

This small, backward-compatible naming change will make Karmada’s metrics easier to run in real-world Prometheus/Thanos/Cortex setups, reduce relabeling boilerplate, and clarify queries and alerts for operators.
