Riak stats #11

Open
martinsumner opened this issue Feb 4, 2025 · 1 comment

martinsumner commented Feb 4, 2025

Associated with #10.

Riak produces stats, available via riak admin status, or via HTTP calls to the web API.

The stats are sent to exometer_core, which stores and aggregates them. They are sent via the riak_kv_stat module, and the values are then fetched via the riak_kv_status module, which embellishes the exometer stats with further information.

By default a riak_kv_stat_worker is used as a proxy for sending the stats. This is (by default) a resource controlled by sidejob, which means that there is a worker for every scheduler, and an (unconfigurable) rate-limit of 10K updates per node per second shared across the workers. However, the rate-limit is bypassed for async updates (mainly histogram changes) - only direct updates (mainly counter updates) are covered by the rate-limit.
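To make the effect of a partial rate-limit concrete, here is a toy model in Python. It is not sidejob or riak_kv code: the class and function names are hypothetical, the "window" is a simplification of a one-second period, and dropping a rejected direct update is an assumption made for illustration only.

```python
class StatsProxy:
    """Toy model of a partially rate-limited stats proxy (not sidejob itself)."""

    def __init__(self, limit_per_window):
        self.limit_per_window = limit_per_window
        self.direct_in_window = 0
        self.accepted = 0
        self.dropped = 0

    def new_window(self):
        # Reset the budget at the start of each (assumed one-second) window.
        self.direct_in_window = 0

    def direct_update(self):
        # Direct updates (mainly counters) count against the limit.
        if self.direct_in_window >= self.limit_per_window:
            self.dropped += 1  # assumption: a rejected update is simply dropped
            return False
        self.direct_in_window += 1
        self.accepted += 1
        return True

    def async_update(self):
        # Async updates (mainly histograms) bypass the limit entirely.
        self.accepted += 1
        return True


def simulate(requests, limit, direct_per_request, async_per_request):
    """Push one window's worth of requests through the proxy."""
    proxy = StatsProxy(limit)
    for _ in range(requests):
        for _ in range(direct_per_request):
            proxy.direct_update()
        for _ in range(async_per_request):
            proxy.async_update()
    return proxy.accepted, proxy.dropped
```

In this model the async traffic is never shed, so the limit caps only part of the work the proxy does, however high the request rate goes.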

For some operations, a single user request can generate a lot of stats activity.

A user GET will lead to:

  • An increment to a per-node counter of GETs;
  • An update to a rolling histogram of GET timings;
  • An update to a rolling histogram of sibling counts;
  • An update to a rolling histogram of object sizes;
  • Some per-bucket stats, if configured (off by default);
  • An increment to a counter of per-node vnode_head requests ( x 3);
  • An update to a rolling histogram of per-node vnode_head timings ( x 3);
  • An increment to a counter of per-index vnode_head requests ( x 3);
  • An update to a rolling histogram of per-index vnode_head timings ( x 3);
  • An increment to a counter of per-node vnode_get requests ( x 1);
  • An update to a rolling histogram of per-node vnode_get timings ( x 1);
  • An increment to a counter of per-index vnode_get requests ( x 1);
  • An update to a rolling histogram of per-index vnode_get timings ( x 1);
  • Updates to read_repair stats if required.

There are at least 9 counter updates, and 11 histogram updates required for each and every GET.
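The totals can be checked by tallying the list above. The sketch below is only arithmetic over that list; it leaves out the per-bucket and read_repair stats because they are conditional, and the GET rate used in the usage note is a hypothetical figure, not a measurement.

```python
# (description, kind, multiplier) for each unconditional update in the list above.
GET_STAT_UPDATES = [
    ("per-node GET count", "counter", 1),
    ("GET timings", "histogram", 1),
    ("sibling counts", "histogram", 1),
    ("object sizes", "histogram", 1),
    ("per-node vnode_head requests", "counter", 3),
    ("per-node vnode_head timings", "histogram", 3),
    ("per-index vnode_head requests", "counter", 3),
    ("per-index vnode_head timings", "histogram", 3),
    ("per-node vnode_get requests", "counter", 1),
    ("per-node vnode_get timings", "histogram", 1),
    ("per-index vnode_get requests", "counter", 1),
    ("per-index vnode_get timings", "histogram", 1),
]


def tally(updates):
    """Sum the multipliers per kind of update."""
    totals = {"counter": 0, "histogram": 0}
    for _description, kind, multiplier in updates:
        totals[kind] += multiplier
    return totals


def updates_per_second(gets_per_second, updates=GET_STAT_UPDATES):
    """Total stat updates generated per second at a given GET rate."""
    return gets_per_second * sum(tally(updates).values())
```

For example, at a hypothetical 1,000 GETs per second, `updates_per_second(1000)` gives 20,000 stat updates per second before any per-bucket or read_repair stats are counted.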

It is hard to accurately assess what the cost of all this is. Recent PRs have reduced the cost of fetching the results; however, nothing has been done about the cost of maintaining the results.

Some profiling exercises indicate that the CPU cost associated with stats may be between 10% and 20%.

There are some questions:

  • What is a "reasonable" overhead for gathering of stats?
  • Is the use of sidejob either necessary or helpful in this case, given that the rate-limit is only partially applied and there is an unnecessary double-casting of most requests?
  • Should it be possible to rationalise the stats so that fewer are produced?
  • Is histogram overused in place of the more efficient exometer alternative, uniform? (Or perhaps there should be an alternative to uniform which uses the bounded set only for percentiles, but still records mean/max accurately.)
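
The alternative floated in the last question could look roughly like the Python sketch below. It is not an exometer probe: the class name is hypothetical, it uses standard reservoir sampling (Algorithm R) as the bounded set, and it ignores the rolling time window that the real stats use. Percentiles are estimated from the bounded sample, while count, mean and max are tracked exactly from every update.

```python
import random


class BoundedSampleStat:
    """Bounded reservoir for percentiles, exact running count/mean/max."""

    def __init__(self, size, seed=None):
        self.size = size          # maximum number of samples retained
        self.samples = []
        self.count = 0
        self.total = 0
        self.max = None
        self.rng = random.Random(seed)

    def update(self, value):
        # Exact aggregates: every value contributes.
        self.count += 1
        self.total += value
        if self.max is None or value > self.max:
            self.max = value
        # Reservoir sampling (Algorithm R): once full, each new value
        # replaces a random slot with probability size / count.
        if len(self.samples) < self.size:
            self.samples.append(value)
        else:
            slot = self.rng.randrange(self.count)
            if slot < self.size:
                self.samples[slot] = value

    def mean(self):
        return self.total / self.count if self.count else None

    def percentile(self, p):
        # Approximate: computed from the bounded sample only.
        if not self.samples:
            return None
        ordered = sorted(self.samples)
        index = min(len(ordered) - 1, int(p / 100 * len(ordered)))
        return ordered[index]
```

Memory stays fixed at `size` values regardless of update volume, and each update costs a few arithmetic operations plus at most one random draw; the trade-off is that percentiles become estimates.
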
@martinsumner

Note that the use of sidejob for stats is configurable using the undocumented direct_stats option:

https://github.com/OpenRiak/riak_kv/blob/fe9d3ab121acb0694ddd6cb23e4e0d8f837194eb/src/riak_kv_app.erl#L67-L72
