[improvement](fe) Add virtual compute group switch metric #63036

Merged
luwei16 merged 1 commit into apache:master from luwei16:codex/virtual-cluster-switch-metrics
May 9, 2026

@luwei16 luwei16 commented May 6, 2026

What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: Add an FE cloud metric that records virtual compute group active-standby switch events. The metric key uses virtual/src/dst compute group ids so a compute group rename updates the exposed labels without leaving stale old-name series.

Metric example

Prometheus output example:

# HELP doris_fe_virtual_compute_group_switch_total virtual compute group active standby switch count
# TYPE doris_fe_virtual_compute_group_switch_total counter
doris_fe_virtual_compute_group_switch_total{virtual_compute_group_id="id1",virtual_compute_group_name="v_group_1",src_compute_group_id="id2",src_compute_group_name="p_group_1",dst_compute_group_id="id3",dst_compute_group_name="p_group_2"} 1

The metric value is the accumulated switch count for the labeled virtual compute group switch path.
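As a rough illustration of how the six labels map onto the exposed sample line, here is a minimal sketch that renders one counter sample in the Prometheus text format. The class `SwitchSampleFormatter` and its methods are hypothetical helpers written for this example and are not part of the Doris code base; only the metric name and label names come from the output above.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper, not a Doris class: renders one counter sample line
// in the Prometheus text exposition format.
final class SwitchSampleFormatter {
    private SwitchSampleFormatter() {
    }

    // Label values must escape backslash, double quote, and newline.
    static String escape(String value) {
        return value.replace("\\", "\\\\").replace("\"", "\\\"").replace("\n", "\\n");
    }

    // Builds the label set in the same order as the example output.
    static Map<String, String> switchLabels(String virtualId, String virtualName,
                                            String srcId, String srcName,
                                            String dstId, String dstName) {
        Map<String, String> labels = new LinkedHashMap<>();
        labels.put("virtual_compute_group_id", virtualId);
        labels.put("virtual_compute_group_name", virtualName);
        labels.put("src_compute_group_id", srcId);
        labels.put("src_compute_group_name", srcName);
        labels.put("dst_compute_group_id", dstId);
        labels.put("dst_compute_group_name", dstName);
        return labels;
    }

    // Produces: name{k1="v1",k2="v2",...} value
    static String format(String metricName, Map<String, String> labels, long value) {
        StringJoiner joined = new StringJoiner(",", "{", "}");
        for (Map.Entry<String, String> entry : labels.entrySet()) {
            joined.add(entry.getKey() + "=\"" + escape(entry.getValue()) + "\"");
        }
        return metricName + joined + " " + value;
    }
}
```

Calling `format` with the ids and names from the example reproduces the sample line shown above, which makes it clear that each distinct (virtual, src, dst) combination is its own series with its own count.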

Release note

Add FE metric doris_fe_virtual_compute_group_switch_total for virtual compute group active-standby switches.

Check List (For Author)

  • Test:
    • Unit Test: bash run-fe-ut.sh --run org.apache.doris.cloud.system.CloudSystemInfoServiceTest
    • Unit Test: bash run-fe-ut.sh --run org.apache.doris.metric.MetricsTest
    • Manual test: git diff --check
    • FE checkstyle: bash -lc "export DORIS_HOME=$PWD && source env.sh && cd fe && ${MVN_CMD} -pl fe-core -DskipTests checkstyle:check"
  • Behavior changed: Yes. Add a new FE metric for virtual compute group active-standby switches.
  • Does this need documentation: No

@hello-stephen
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@luwei16
Contributor Author

luwei16 commented May 6, 2026

run buildall

@luwei16 luwei16 force-pushed the codex/virtual-cluster-switch-metrics branch from 57b917c to d0cef2d May 6, 2026 16:34
@luwei16
Contributor Author

luwei16 commented May 6, 2026

run buildall

protected static AutoMappedMetric<LongCounterMetric> CLUSTER_CLOUD_GLOBAL_BALANCE_NUM;
protected static AutoMappedMetric<LongCounterMetric> CLUSTER_CLOUD_SMOOTH_UPGRADE_BALANCE_NUM;
protected static AutoMappedMetric<LongCounterMetric> CLUSTER_CLOUD_WARM_UP_CACHE_BALANCE_NUM;
protected static AutoMappedMetric<LongCounterMetric> VIRTUAL_CLUSTER_SWITCH_COUNTER;
Contributor

cluster -> compute_group

Contributor Author

Done. Renamed the new metric/API/labels from virtual cluster terminology to virtual compute group terminology. The exposed metric is now doris_fe_virtual_compute_group_switch_total with *_compute_group_* labels.

List<MetricLabel> labels = new ArrayList<>();
counter.increase(1L);
labels.add(new MetricLabel("virtual_cluster_id", virtualClusterId));
labels.add(new MetricLabel("virtual_cluster_name", virtualClusterName));
Contributor

What happens when a compute group is renamed?
The existing metrics with the old names seem to never disappear.

Contributor Author

Fixed. The internal AutoMappedMetric key now uses virtual/src/dst compute group ids instead of names. When the same ids are reported with updated names, FE removes the old registered label series before setting the new labels, so renamed compute groups do not leave stale old-name metrics. Added MetricsTest.testVirtualComputeGroupSwitchMetricRename to cover this case.
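The id-keyed approach described in this reply can be sketched roughly as follows. This is a simplified stand-in and not the actual `AutoMappedMetric` code: the class `SwitchCounterRegistry` and its methods are hypothetical, and the choice to carry the accumulated count over a rename is an assumption of the sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the FE metric registry; not a Doris class.
final class SwitchCounterRegistry {
    // One series per switch path. Names are display labels only.
    private static final class Series {
        String virtualName;
        String srcName;
        String dstName;
        long count;
    }

    // Key is the (virtual, src, dst) id triple, so a rename never creates a new entry.
    private final Map<List<String>, Series> seriesByIds = new LinkedHashMap<>();

    // Records one switch and returns the accumulated count for that path.
    synchronized long recordSwitch(String virtualId, String virtualName,
                                   String srcId, String srcName,
                                   String dstId, String dstName) {
        List<String> key = Arrays.asList(virtualId, srcId, dstId);
        Series series = seriesByIds.computeIfAbsent(key, k -> new Series());
        // Overwriting the names replaces the old label set in place,
        // so no series with the old names is left behind.
        series.virtualName = virtualName;
        series.srcName = srcName;
        series.dstName = dstName;
        series.count++;
        return series.count;
    }

    // Returns one "virtualName/srcName/dstName count" entry per exposed series.
    synchronized List<String> exposedSeries() {
        List<String> out = new ArrayList<>();
        for (Series s : seriesByIds.values()) {
            out.add(s.virtualName + "/" + s.srcName + "/" + s.dstName + " " + s.count);
        }
        return out;
    }
}
```

Recording a switch for the same ids first as `v_group_1` and then as a renamed group leaves a single exposed series carrying the new name. Had the key been built from names, the second call would have created a second series and the first would have lingered.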

@luwei16 luwei16 force-pushed the codex/virtual-cluster-switch-metrics branch from d0cef2d to 034d322 May 7, 2026 02:57
@github-actions github-actions Bot added the "approved" label (Indicates a PR has been approved by one committer.) May 7, 2026
@github-actions
Contributor

github-actions Bot commented May 7, 2026

PR approved by at least one committer and no changes requested.

@github-actions
Contributor

github-actions Bot commented May 7, 2026

PR approved by anyone and no changes requested.

@gavinchou gavinchou changed the title from "[improvement](fe) Add virtual cluster switch metric" to "[improvement](fe) Add virtual compute group switch metric" May 7, 2026
@luwei16
Contributor Author

luwei16 commented May 7, 2026

run buildall

@luwei16 luwei16 merged commit 4178178 into apache:master May 9, 2026
33 of 35 checks passed