Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kvs-watch: only fetch new data for appends #6444

Merged

Conversation

chu11
Copy link
Member

@chu11 chu11 commented Nov 15, 2024

Problem: The FLUX_KVS_WATCH_APPEND flag is implemented inefficiently in kvs-watch. Everytime new data is appended, the entire contents of the watched key are retrieved and only new data calculated via an offset is sent to the watcher. This significantly slows down performance of large appended data (such as stdio).

Solution: Instead of retrieving the entire contents of the watched key, fetch the tree object for the key from the KVS. With access to the tree object's blobref array, we need only access the new appended blobrefs from the content store. This significantly cuts down on the amount of data transfered during a kvs-watch.

Fixes #6414


side notes:

  • so protocol wise
    • before it was
      • first RPC get all KVS data, return it all, note how long the data is
      • later RPCs, get all KVS data, only send new data past offset of prior request
    • now it is
      • RPCs now get KVS treeobj
        • keep track of how many references are in the array
        • get each blobref data from the content store
        • return that data to the user

So RPC wise the pros are:

  • no longer iterating / building larger RPCs than is necessary

the cons are

  • adding a level of RPC indirection (getting treeobj)
  • possibly sending more RPCs back to caller (first RPC does not "combine" data into a single return)

For large data, this is a significant performance improvement for use cases like this:

time flux run flux lptest 2000000 80

wallclock time is generally being cut down from 58 seconds to 24 seconds. It's a big win. (FWIW, this was on corona. On my laptop it was 15s before 11.7s afterwards, not as much as a win there, but still that's 22%)

For everything else (flux job attach JOBID for a finished job, job throughput), it appears that performance is a wash. That the wins balance out the losses. This is of course with limited testing in single node instances.

Edit: Also as an aside, there is more potential for support of #6292 with this implementation

Copy link
Member

@garlick garlick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is really a nice improvement IMHO!

Before I go further into a review, could you fix the following

  • Don't break short function parameter lists to multiple lines
  • Don't log errors for user-induced problems. As a general rule, if sending an error response to the user, don't log it also
  • Make an effort to make log messages more helpful that "function name: strerror".

Comment on lines 648 to 649
* Ordering note: KVS lookups can be returned out of order due to
* what is presently cached. KVS lookup futures are added to the
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: instead of

due to what is presently cached

how about

because they are processed asynchronously

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lemme tweak. Even though it's asynchronous, was trying to explain scenario KVS could return things out of order.

Comment on lines 155 to 157
will only contain the additional appended data. This flag will
only work with appends with the FLUX_KVS_APPEND. Overwritten data
will result in an error.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the only way to append is with FLUX_KVS_APPEND, I think the first sentence added here could be dropped.

The second one might be slightly clearer if it read

If the value is overwritten, the lookup fails with EINVAL

(or whatever errno it fails wiith)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aside: if we've detected that a watched key has been overwritten and we're watching it with this flag, a human readable error response like "value of key watched with WATCH_APPEND flag was overwritten" might be helpful.

w->prev_end_index = treeobj_get_count (val) - 1;
}
else {
flux_log (h, LOG_ERR, "%s: treeobj not val or valref", __FUNCTION__);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Drop log message since error is being returned to the user and this is an expected situation as far as this service goes.

Comment on lines 251 to 255
static void handle_load_response (flux_future_t *f,
struct watcher *w)
{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Style: don't break short parameter lists to multiple lines

@chu11 chu11 force-pushed the issue6414_kvs_watch_optimization branch from f214e6c to c07a22d Compare November 18, 2024 22:47
@chu11
Copy link
Member Author

chu11 commented Nov 18, 2024

re-pushed fixing up a lot of the minor things listed above. I went with the general rule of making more descriptive errors when an error would be half-likely, but kept "function name: strerror" log messages for things that should extremely unlikely.

Copy link
Member

@garlick garlick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not going to get through this before meeting so here's what I have so far!

@@ -21,6 +21,7 @@
#include "src/common/libkvs/treeobj.h"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

commit message: s/everytime/every time/

Comment on lines 268 to 278
if (flux_respond_pack (h, w->request, "{ s:o }", "val", val) < 0) {
flux_log_error (h, "%s: flux_respond_pack", __FUNCTION__);
json_decref (val);
goto error;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After the goto error, another error response is attempted. I think we've settled previously that this is not a good idea. Just log the (unlikely) error.

In new code I've been trying to log messages like "failed to respond to TOPIC" in these cases.

Comment on lines 331 to 333
saved_errno = errno;
flux_future_destroy (f);
errno = saved_errno;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

flux_future_destroy already preserves errno

Comment on lines 315 to 321
if (!(f = content_load_byblobref (h, ref, 0))) {
flux_log_error (h, "%s: content_load_byblobref", __FUNCTION__);
goto error;
}
if (flux_future_then (f, -1., load_continuation, w) < 0) {
flux_log_error (h, "%s: flux_future_then", __FUNCTION__);
goto error;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: combine error logs and make it more readable, e.g.

if (!(f = load()) || then() < 0) {
    flux_log_error ("failed to send TOPIC request")
}

@chu11
Copy link
Member Author

chu11 commented Nov 21, 2024

re-pushed with fixes per comments above. Also added a few new cleanups. I could probably split those out at this point in time.

@chu11 chu11 force-pushed the issue6414_kvs_watch_optimization branch from b23189b to 4029afa Compare November 21, 2024 23:25
@chu11 chu11 mentioned this pull request Nov 21, 2024
@chu11 chu11 force-pushed the issue6414_kvs_watch_optimization branch 2 times, most recently from 2f8aec6 to 91c30de Compare November 22, 2024 20:19
@chu11 chu11 force-pushed the issue6414_kvs_watch_optimization branch from 91c30de to 76ee0c0 Compare December 2, 2024 20:19
Copy link
Member

@garlick garlick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems good! I made a bunch of comments about error messages and some about readability, but I'll tentatively approve and leave up to you how you want to proceed. It's probably best to get this merged early in our release cycle so we have plenty of time to notice any issues.

Great work!

Comment on lines 257 to 287
if (content_load_get (f, &data, &size) < 0) {
flux_log_error (h, "failed to load content data");
goto error_respond;
}

if (!w->mute) {
json_t *val = treeobj_create_val (data, size);
if (!val) {
flux_log_error (h, "%s: treeobj_create_val", __FUNCTION__);
goto error_respond;
}
if (flux_respond_pack (h, w->request, "{ s:o }", "val", val) < 0) {
flux_log_error (h,
"%s: failed to respond to kvs-watch.lookup",
__FUNCTION__);
json_decref (val);
goto finished;
}
w->responded = true;
}

return;
error_respond:
if (!w->mute) {
if (flux_respond_error (h, w->request, errno, NULL) < 0)
flux_log_error (h, "%s: flux_respond_error", __FUNCTION__);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the two cases here where an error is being returned to the user, instead of logging detailed errors, maybe it would be better to return the detailed errors to the user?

Comment on lines 311 to 316
static flux_future_t *load_ref (flux_t *h,
struct watcher *w,
const char *ref)
{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Style: don't break short parameter list


if (!(f = content_load_byblobref (h, ref, 0))
|| flux_future_then (f, -1., load_continuation, w) < 0) {
flux_log_error (h, "failed to send content.load request");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An error is being returned to the user so no need to log this.

flux_future_t *f;
const char *ref = treeobj_get_blobref (val, i);
if (!ref) {
flux_log_error (h, "%s: treeobj_get_blobref", __FUNCTION__);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Error is being returned to the user, don't log.

Comment on lines 391 to 401
if (load_range (h,
w,
w->prev_start_index,
w->prev_end_index,
val) < 0)
goto error_respond;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Set errmsg to something here in place of the logging above. Maybe "error sending request for content blobs [start:finish]". (It might be helpful to use flux_error_t instead of the const char **errmsg so that errprintf can be used)

if (new_offset < w->append_offset) {
free (new_data);
else {
(*errmsg) = "value of key watched with WATCH_APPEND overwritten";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

truncated?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm, I think overwritten is correct here. Something that should be a valref is now not a valref, indicating the key was overwritten with something new.

Comment on lines 442 to 648
if (flux_rpc_get_unpack (f, "{ s:o s:i }",
if (flux_rpc_get_unpack (f, "{ s:o s:s s:i }",
"val", &val,
"rootref", &root_ref,
"rootseq", &root_seq) < 0) {
/* It is worth mentioning ENOTSUP error conditions here.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fix formatting since you touched it :-)

int root_seq;
json_t *val;
const char *errmsg = NULL;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it would be best to make this a flux_error_t error and just make sure it is filled in for all possible errors that are going to be sent to the user?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ahh, there's a rebase error. Due to the changes from 0047178, the errmsg actually does nothing. Lemme update in general.

Comment on lines 760 to 783
if (flux_msg_pack (msg, "{s:s s:s s:i}",
"key", w->key,
"namespace", ns,
"flags", w->flags) < 0)
"flags", flags) < 0)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fix formatting

Comment on lines 490 to 503
/* N.B. It may not be obvious why we have to check w->index_set if we
* have not yet responded. It is possible we have received a setroot
* response and an updated valref before loads from the content store
* have returned to the caller.
*/
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These comments exceed 80 cols.

@chu11 chu11 force-pushed the issue6414_kvs_watch_optimization branch 2 times, most recently from 9b34723 to ca56ea7 Compare December 4, 2024 22:03
@chu11
Copy link
Member Author

chu11 commented Dec 4, 2024

re-pushed with fixes based on comments above

there are two new commits, one to deal with parameters on different lines and one to add a treeobj_type_name() convenience function. So probably warrants another look @garlick

Problem: small amounts of eventlog data may not be flushed to
the terminal when they are received.  This can introduce complication
in tests.

Solution: call fflush() after events are processed with eventlog_entry_dumpf().
@chu11 chu11 force-pushed the issue6414_kvs_watch_optimization branch from ca56ea7 to 8a21e50 Compare December 5, 2024 18:25
return "dir";
else if (treeobj_is_dirref (obj))
return "dirref";
return NULL;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: return "unknown" instead of null if the main purpose is logging.

chu11 added 6 commits December 5, 2024 11:04
Problem: It would be convenient to know the treeobj type name for
debugging.  No treeobj type name conversion function exists.

Add a new treeobj_type_name() function to libkvs treeobj API.
Problem: kvs-watch dos not conform to modern project coding style
per RFC7.

Break long parameter lists into one per line.
Problem: The FLUX_KVS_WATCH_APPEND flag is implemented inefficiently
in kvs-watch.  Every time new data is appended, the entire contents
of the watched key are retrieved and only new data calculated via
an offset is sent to the watcher.  This significantly slows down performance
of large appended data (such as stdio).

Solution: Instead of retrieving the entire contents of the watched key,
fetch the tree object for the key from the KVS.  With access to the tree
object's blobref array, we need only access the new appended blobrefs from
the content store.  This significantly cuts down on the amount of data
transfered during a kvs-watch.

Fixes flux-framework#6414
Problem: The FLUX_KVS_WATCH_APPEND description is a little out
of date given recent updates.

Tweak the description given recent updates.
Problem: With the recent behavior change to the kvs watch append
several tests no longer work.

Update kvs watch append tests in t1007-kvs-lookup-watch.t.

- the new kvs watch append sends each append data separately, leading
  to a slight change in output for one test.A

- the error output in one test changed

- all tests that previously "fake append" by overwriting a value all
  fail now in the same way.  Remove all tests except for one.
Problem: Additional coverage of the new kvs watch append
behavior is needed.

Add new tests in t1007-kvs-lookup-watch.t.
@chu11 chu11 force-pushed the issue6414_kvs_watch_optimization branch from 8a21e50 to acfad8d Compare December 5, 2024 19:04
@chu11
Copy link
Member Author

chu11 commented Dec 5, 2024

@garlick thanks, fixed that minor thing, will set MWP

@mergify mergify bot merged commit 0cfb291 into flux-framework:master Dec 5, 2024
33 of 34 checks passed
Copy link

codecov bot commented Dec 5, 2024

Codecov Report

Attention: Patch coverage is 77.24551% with 38 lines in your changes missing coverage. Please review.

Project coverage is 83.60%. Comparing base (72ae1f0) to head (acfad8d).
Report is 8 commits behind head on master.

Files with missing lines Patch % Lines
src/modules/kvs-watch/kvs-watch.c 75.94% 38 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6444      +/-   ##
==========================================
- Coverage   83.61%   83.60%   -0.01%     
==========================================
  Files         524      524              
  Lines       87493    87611     +118     
==========================================
+ Hits        73154    73248      +94     
- Misses      14339    14363      +24     
Files with missing lines Coverage Δ
src/cmd/job/eventlog.c 89.41% <100.00%> (+0.11%) ⬆️
src/common/libkvs/treeobj.c 84.40% <100.00%> (+0.59%) ⬆️
src/modules/kvs-watch/kvs-watch.c 78.27% <75.94%> (+0.12%) ⬆️

... and 8 files with indirect coverage changes

@chu11 chu11 deleted the issue6414_kvs_watch_optimization branch December 5, 2024 20:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

kvs-watch: avoid re-fetching the same data repeatedly
2 participants