Riak stats #11

martinsumner · 2025-02-04T12:57:47Z

Associated with #10.

Riak produces stats, available via riak admin status, or via HTTP calls to the web API.

The stats are sent to exometer_core, which stores and aggregates them. They are sent via the riak_kv_stat module, and the values are then fetched via the riak_kv_status module, which embellishes the exometer stats with further information.

By default a riak_kv_stat_worker is used as a proxy for sending the stats. This (by default) is a resource controlled by sidejob - this means that there is a worker for every scheduler, and a (non-configurable) rate-limit of 10K updates per node per second shared across the workers. However, the rate-limit is bypassed for async updates (mainly histogram changes) - only direct updates (mainly counter updates) are covered by the rate-limit.
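To make the partial nature of the rate-limit concrete, here is a minimal toy model of a proxy where direct updates count against a per-second budget and async updates bypass it. It is a sketch of the behaviour described above, not of sidejob's actual implementation: the class name `StatProxy`, its methods, and the one-second fixed window are all assumptions made for illustration, and what really happens to an over-limit update in Riak is not covered here (the sketch simply counts it as rejected).

```python
import time


class StatProxy:
    """Toy model of a partially rate-limited stats proxy (hypothetical, not sidejob)."""

    def __init__(self, limit_per_second=10_000, clock=time.monotonic):
        self.limit = limit_per_second   # shared budget for direct updates
        self.clock = clock              # injectable clock, to make the model testable
        self.window_start = clock()
        self.used = 0                   # direct updates accepted in the current window
        self.applied = 0                # updates that reached the stats store
        self.rejected = 0               # direct updates refused by the limit

    def _refresh_window(self):
        # Start a new one-second window once the current one has elapsed.
        now = self.clock()
        if now - self.window_start >= 1.0:
            self.window_start = now
            self.used = 0

    def direct_update(self):
        """Direct update (mainly counters): subject to the rate-limit."""
        self._refresh_window()
        if self.used >= self.limit:
            self.rejected += 1
            return False
        self.used += 1
        self.applied += 1
        return True

    def async_update(self):
        """Async update (mainly histograms): bypasses the rate-limit entirely."""
        self.applied += 1
        return True
```

In this model the histogram traffic is never throttled, so the limit only bounds part of the stats load.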

For some operations, a single user request can generate a lot of stats activity.

A user GET will lead to:

An increment to a per-node counter of GETs;
An update to a rolling histogram of GET timings;
An update to a rolling histogram of sibling counts;
An update to a rolling histogram of object sizes;
Some per-bucket stats, if configured (off by default);
An increment to a counter of per-node vnode_head requests ( x 3);
An update to a rolling histogram of per-node vnode_head timings ( x 3);
An increment to a counter of per-index vnode_head requests ( x 3);
An update to a rolling histogram of per-index vnode_head timings ( x 3);
An increment to a counter of per-node vnode_get requests ( x 1);
An update to a rolling histogram of per-node vnode_get timings ( x 1);
An increment to a counter of per-index vnode_get requests ( x 1);
An update to a rolling histogram of per-index vnode_get timings ( x 1);
Updates to read_repair stats if required.

There are at least 9 counter updates, and 11 histogram updates required for each and every GET.
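The totals can be checked by tallying the list above. The sketch below encodes each mandatory item with its multiplier as given in the list (per-bucket and read_repair stats are left out because they are conditional); the names `GET_STAT_UPDATES`, `tally`, and `updates_per_second` are illustrative only.

```python
# (description, kind, multiplier) for each mandatory stat update on a user GET.
GET_STAT_UPDATES = [
    ("per-node GET counter", "counter", 1),
    ("GET timings", "histogram", 1),
    ("sibling counts", "histogram", 1),
    ("object sizes", "histogram", 1),
    ("per-node vnode_head requests", "counter", 3),
    ("per-node vnode_head timings", "histogram", 3),
    ("per-index vnode_head requests", "counter", 3),
    ("per-index vnode_head timings", "histogram", 3),
    ("per-node vnode_get requests", "counter", 1),
    ("per-node vnode_get timings", "histogram", 1),
    ("per-index vnode_get requests", "counter", 1),
    ("per-index vnode_get timings", "histogram", 1),
]


def tally(updates):
    """Sum the multipliers for each kind of update."""
    totals = {"counter": 0, "histogram": 0}
    for _description, kind, multiplier in updates:
        totals[kind] += multiplier
    return totals


def updates_per_second(gets_per_second, updates=GET_STAT_UPDATES):
    """Scale the per-GET tally by a GET rate."""
    return {kind: n * gets_per_second for kind, n in tally(updates).items()}
```

`tally(GET_STAT_UPDATES)` gives 9 counter updates and 11 histogram updates. At a hypothetical cluster-wide rate of 1,000 GETs per second, that implies at least 9,000 counter updates and 11,000 histogram updates per second across the cluster.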

It is hard to accurately assess the cost of all this. Recent PRs have reduced the cost of fetching the results; however, nothing has been done about the cost of maintaining the results.

Some profiling exercises indicate that the CPU cost associated with stats may be between 10% and 20%.

There are some questions:

What is a "reasonable" overhead for gathering of stats?
Is the use of sidejob either necessary or helpful in this case, given that the application of the rate-limit is partial and there is an unnecessary double-casting of most requests?
Should it be possible to rationalise the stats so that fewer are produced?
Is there overuse of histogram rather than the more efficient alternative in exometer of uniform? (Or perhaps there should be an alternative to uniform which only uses the bounded set for percentiles, but still records mean/max accurately.)
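As an illustration of the last question, the following is a minimal sketch of a metric that keeps a bounded sample for percentiles while tracking count, mean and max exactly. It uses reservoir sampling (Algorithm R), which is one common way to hold a bounded uniform sample; whether this matches how exometer's uniform works internally is not verified here. The class name `BoundedSampleStats` and the default size of 1024 are arbitrary choices for the example.

```python
import random


class BoundedSampleStats:
    """Exact count/mean/max plus a fixed-size reservoir used only for percentiles."""

    def __init__(self, size=1024, seed=None):
        self.size = size          # maximum number of retained samples (arbitrary default)
        self.sample = []          # the bounded reservoir
        self.count = 0            # exact number of observations
        self.total = 0.0          # exact running sum, for the mean
        self.maximum = None       # exact maximum observed value
        self._rng = random.Random(seed)

    def update(self, value):
        # Exact aggregates are updated on every observation.
        self.count += 1
        self.total += value
        if self.maximum is None or value > self.maximum:
            self.maximum = value
        # Reservoir sampling (Algorithm R): every observation ends up in the
        # sample with equal probability size/count.
        if len(self.sample) < self.size:
            self.sample.append(value)
        else:
            slot = self._rng.randrange(self.count)
            if slot < self.size:
                self.sample[slot] = value

    def mean(self):
        return self.total / self.count if self.count else None

    def percentile(self, p):
        """Approximate percentile (0-100), estimated from the reservoir only."""
        if not self.sample:
            return None
        ordered = sorted(self.sample)
        index = min(len(ordered) - 1, int(p / 100 * len(ordered)))
        return ordered[index]
```

Memory stays bounded by `size`, and each update is constant-time once the reservoir is full. Percentiles are approximate, but the mean and max do not suffer from sampling error.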


martinsumner · 2025-02-04T13:06:05Z

Note that the use of sidejob for stats is configurable using the undocumented direct_stats option:

https://github.com/OpenRiak/riak_kv/blob/fe9d3ab121acb0694ddd6cb23e4e0d8f837194eb/src/riak_kv_app.erl#L67-L72
