Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Feature Request] Allow custom metric buckets #777

Open
TheCodeWrangler opened this issue Feb 26, 2025 · 4 comments
Open

[Feature Request] Allow custom metric buckets #777

TheCodeWrangler opened this issue Feb 26, 2025 · 4 comments
Labels
enhancement New feature or request

Comments

@TheCodeWrangler
Copy link
Contributor

Currently the default binning for activity metrics are top out at 60 seconds. This limits my observability into activities which might take a long time or even take more than 60 seconds on P99 metrics.

I would like to have a method of specifying a desired binning strategy on activity metrics.

@TheCodeWrangler TheCodeWrangler added the enhancement New feature or request label Feb 26, 2025
@cretz cretz changed the title [Feature Request] Allow custom prometheus binning on Activities [Feature Request] Allow custom metric buckets Feb 26, 2025
@cretz
Copy link
Member

cretz commented Feb 26, 2025

Yes, now that temporalio/sdk-core#844 is available, we can do this here too. I have changed the title.

@TheCodeWrangler
Copy link
Contributor Author

TheCodeWrangler commented Feb 27, 2025

Any estimate on a timeline? Is this something I could help speed along with a PR?

As best I can tell (though I am not very familiar with rust) the sdk-core binding accepts histogram_bucket_overrides

So possibly just python updates in runtime.py

@dataclass(frozen=True)
class PrometheusConfig:
    """Configuration for Prometheus metrics endpoint."""

    bind_address: str
    counters_total_suffix: bool = False
    unit_suffix: bool = False
    durations_as_seconds: bool = False
    histogram_bucket_overrides: dict[str, list[float]] | None = None # New property

    def _to_bridge_config(self) -> temporalio.bridge.runtime.PrometheusConfig:
        return temporalio.bridge.runtime.PrometheusConfig(
            bind_address=self.bind_address,
            counters_total_suffix=self.counters_total_suffix,
            unit_suffix=self.unit_suffix,
            durations_as_seconds=self.durations_as_seconds,
            histogram_bucket_overrides=self.histogram_bucket_overrides or {}, # New arg
        )

I am pretty naive on what might be involved here though.

@cretz
Copy link
Member

cretz commented Feb 27, 2025

There is no estimate on timeline, but it is on the near-term backlog since it was recently implemented on our Core side. Yes what you have there is mostly correct and we definitely welcome PRs if you want to tackle it (will want a test to confirm it works for built-in and user-defined histograms too, which is usually just as simple as making an HTTP call to the Prometheus endpoint and checking contents).

@TheCodeWrangler
Copy link
Contributor Author

TheCodeWrangler commented Mar 3, 2025

I took a stab in a PR.
#781

This was branched off of 1.10.0 tag because I was not able to get uv build to succeed on the main branch, but was able to get the poetry build working on the tag. This limited me from being able to compile and test cahnges on the main.

Was able to get a test to work and confirm that custom histogram binning worked for a custom metric BUT was not seeing it have an effect on the workflow end-to-end latency. This has me a bit nervoous that the update to workflow or activity binning (which is all i actually care about at the moment) will require a change to the core-sdk.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants