kvs-watch: avoid re-fetching the same data repeatedly #6414
Comments
Adding an extra level of indirection: prototype pushed to https://github.com/chu11/flux-core/tree/issue6414_kvs_watch_optimization. This relatively simple prototype passed all tests except a handful. The first 5 failures are weird append corner cases that probably just need new handling (or perhaps are no longer relevant). The last test has some extra output, probably a corner case I missed. (Update: yup, a corner case where I get 2 setroot events in succession and send the same starting index twice ... in a future world where I get the data from the content store, this corner case would be easier to handle.) Anyways, FWIW, a simple lptest run showed promising results. |
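As an illustration of that corner case, here is a tiny hypothetical sketch (not the prototype's actual code) of per-watcher bookkeeping that keeps a duplicate setroot event from re-sending the same starting index:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-watcher state; not actual kvs-watch code. */
struct watcher {
    size_t sent_blobrefs;   /* number of blobrefs already sent */
};

/* Given the number of blobrefs currently in the key's valref treeobj,
 * return true (and advance the bookkeeping) only if there is new data,
 * so back-to-back setroot events for the same content send nothing twice. */
static bool watcher_advance (struct watcher *w, size_t current_blobrefs)
{
    if (current_blobrefs <= w->sent_blobrefs)
        return false;                       /* nothing new: skip */
    w->sent_blobrefs = current_blobrefs;
    return true;
}

int main (void)
{
    struct watcher w = { .sent_blobrefs = 0 };
    (void)watcher_advance (&w, 3);  /* first setroot: 3 new blobrefs */
    (void)watcher_advance (&w, 3);  /* duplicate setroot: skipped */
    return 0;
}
```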
Nice result! Any improvement on the job throughput test? |
The doom plugin raises an exception on the job when one rank exits early. Did the rank 0 broker of your batch job crash? |
I was a bit perplexed by the error as well. I hadn't exited the test instance shell, so one presumes the broker didn't crash. I didn't look into it too much, but next time I should.
Doing basic throughput tests with simulated execution, there is no obvious change (10000 jobs, and I'm on corona which may have noise from other users; I'm also hitting some likely prototype corner-case limitations, so it's hard to test bigger than this). It's not too surprising IMO. The main eventlog and exec eventlogs are small and (very likely) all cached in the KVS, so the minor lookup cost savings probably don't amount to much (especially when offset by the additional treeobj we are sending back with all responses in this prototype). Looking through the code last night, the big savings are that we're no longer iterating through large blobref arrays and building up responses (and in my lptest tests above, some of those responses would be quite large). For long-running jobs, older entries may not be cached anymore, so we can avoid those extra calls to the content store. To me, those are the real big wins. And if this solution is to be the start of a solution for #6292, there's the obvious win there. Lemme look into implementing a prototype that is more "legit", mapping more closely to what we initially discussed, b/c the duplicate-data corner case is hard to solve with my current hacks. |
Pushed a more "refined" prototype to the branch https://github.com/chu11/flux-core/tree/issue6414_kvs_watch_optimization. This time, instead of being so hacky, it maps more closely to what we initially discussed. Doing a throughput test with 25000 simulated jobs, if there is a performance difference it appears to be in the noise and not apparent. With job eventlogs being so small, any minor gains from not iterating over and getting all the references are probably wiped out by the extra RPC exchange for the treeobj. I wouldn't be surprised if it was a tad slower overall if I ran this more and really got a good average. To me, the performance gain from this approach is mostly for keys with large appended data, such as stdio. Perhaps that argues for a new flag or option that could purposefully handle this performance situation, which we might apply only to stdin/stdout/stderr. It could possibly be required to be specified by users? Given the potential goal for #6292, perhaps it should be option 2?? |
Just some notes on failing tests. Note that if we go with the no-re-fetch behavior only on stdin/stdout/stderr, I think all these tests go back to working and stay the same. |
I am off today and haven't had time to look at this, but a quick reaction is that this is an optimization specific to watching, so it feels like it should be implemented as much as is feasible in the watch-specific code. Since watching a key whose value is being repeatedly overwritten (the original watch design point) is a different use case than watching one that is being appended to, a client flag to select one or the other seems justified. A little code duplication in the name of clarity is OK sometimes IMHO. |
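For reference, the two client-side use cases look like this today; "job.stdout" is a placeholder key, and this is just a sketch using the existing libkvs calls (flux_kvs_lookup, flux_kvs_lookup_get, flux_future_reset), not anything from the prototype:

```c
/* watch_append.c - client-side view of an append watch.
 * Build against flux-core; "job.stdout" is a placeholder key. */
#include <flux/core.h>
#include <errno.h>
#include <stdio.h>

static void watch_cb (flux_future_t *f, void *arg)
{
    const char *data;

    if (flux_kvs_lookup_get (f, &data) < 0) {
        if (errno != ENODATA)
            fprintf (stderr, "lookup: %s\n", flux_strerror (errno));
        flux_future_destroy (f);
        return;
    }
    /* With FLUX_KVS_WATCH_APPEND each response carries only newly
     * appended data; with plain FLUX_KVS_WATCH it is the full value. */
    printf ("%s", data);
    flux_future_reset (f);      /* re-arm for the next response */
}

int main (void)
{
    flux_t *h;
    flux_future_t *f;

    if (!(h = flux_open (NULL, 0)))
        return 1;
    if (!(f = flux_kvs_lookup (h, NULL,
                               FLUX_KVS_WATCH | FLUX_KVS_WATCH_APPEND,
                               "job.stdout"))
        || flux_future_then (f, -1., watch_cb, NULL) < 0)
        return 1;
    flux_reactor_run (flux_get_reactor (h), 0);
    flux_close (h);
    return 0;
}
```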
Ok, I will look into going with an alternate module; libkvs can route there given the flags. There are so many legacy corner cases (FULL, UNIQ, WAITCREATE, etc.) that it may be easier to go down a new-module path. |
Maybe a new RPC in the existing module instead? |
It occurs to me that if we get data from the content module directly, we bypass the access checks the kvs module would normally perform. I guess this is ok, b/c the treeobj lookup done via the KVS should be sufficient, since that lookup wouldn't work if access would be denied. |
Yep that sounds right. |
Anyone have a good idea for what a new flag for this option should be called? I'm currently programming it as "APPEND2". The best ideas I've thought of are "APPEND_REFS" or "REF_APPEND", i.e. return data from the appended refs. But those sound like you're returning the references, not the data. "APPEND_REF_DATA"? |
This is a flag to tell the watcher to return data from the appended refs? Hmm, wouldn't that be implied already by the FLUX_KVS_WATCH_APPEND flag? |
Are you suggesting we should change the current FLUX_KVS_WATCH_APPEND to something else? I suppose the current flag is less "APPEND" and more like "DIFF"? To be clear, my current plan/prototype was to introduce a new flag alongside the existing one. |
I thought the purpose of the flag was just to differentiate between the old use case for watch where the key was being repeatedly overwritten (and refetching it each time is desired) vs append, where there is always a new blobref being added to the treeobj, and fetching the new treeobj and any new blobs, but not the old ones, is desired? Am I confused/forgetting? p.s. I actually forgot we had a flag already when we were discussing it earlier |
The present case is watching a key whose value is repeatedly overwritten (putting a new value to the same key over and over), and the new case is watching a key that is appended to. The main corner case is that the new behavior only works with valrefs, not vals. So repeatedly overwriting a key would work in the present case, but would fail in the new one b/c the code would say "hey, this isn't a valref". Hmmm, thinking about this more ... I suppose it's possible we could handle both cases under one flag? Although I think that would start to get tricky. |
I think if you append to a val it is converted to a valref, with the existing value becoming a single blob, correct? Seems like if you're keeping a blob index anyway, this just means you don't have to fetch the first one? E.g. return it as index 0, increment the counter, and start the next read from the second blobref in what will be a treeobj. (And if not, it's an error?) Edit: oh, on the a= case, that seems like it may just be a contrived case for testing that we could do away with? |
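A sketch of the handling suggested here, assuming RFC 11 treeobj JSON ({"ver":1,"type":"val"|"valref","data":...}) and using jansson directly; initial_seen_count() is a hypothetical helper, not flux-core or kvs-watch API:

```c
#include <jansson.h>
#include <string.h>

/* Hypothetical helper: given a key's treeobj JSON when the watch starts,
 * return how many blobs the watcher will have consumed once the initial
 * value has been returned, or -1 if the object can't be watched for
 * appends at all. */
static int initial_seen_count (const char *treeobj_json)
{
    json_t *o = json_loads (treeobj_json, 0, NULL);
    const char *type;
    int count = -1;

    if (!o)
        return -1;
    type = json_string_value (json_object_get (o, "type"));
    if (type && !strcmp (type, "valref"))
        count = (int)json_array_size (json_object_get (o, "data"));
    else if (type && !strcmp (type, "val"))
        count = 1;  /* once appended to, the val becomes a valref whose
                     * first blob holds this value: treat it as index 0 */
    /* any other type (dir, symlink, ...): error for an append watch */
    json_decref (o);
    return count;
}

int main (void)
{
    /* toy valref with two blobrefs (fake hashes) */
    const char *valref =
        "{\"ver\":1,\"type\":\"valref\",\"data\":[\"sha1-aaa\",\"sha1-bbb\"]}";
    return initial_seen_count (valref) == 2 ? 0 : 1;
}
```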
Yes, but in the example I have above, it's not an append, we're just overwriting the value over and over again. The equivalent of the above with appends would be a series of appends to the same key. Admittedly it's a bit of a weird corner case, but I guess my point is there are these unique cases with the old behavior that are eliminated with the new behavior. I tend to be more conservative on these kinds of decisions, so it makes me want to preserve the old behavior and add a flag for the new behavior. |
Well, the flag has APPEND in the name, so it feels like fair game to me. E.g. just throw an error if the key changes and the next object is not a treeobj. |
@garlick I'm going back and forth on this, wondering your thoughts. Think of the use case of watching a key that already has many appended entries (e.g. the eventlog of a job that has already finished): do we send all of the existing data in one big initial response, or do we send N responses, one per appended blobref? My initial thought is we should do the former, but AFAICT there seems to be little performance impact: the increased cost of sending N RPCs is at least partially offset by the high cost of building the large initial response. Granted, my testing is with a single-node instance and there are no TBON hops involved; I would bet the single RPC response might be faster in the general sense. But as I thought about it more, sending "N" responses feels more appropriate. It feels like the expected behavior of a "WATCH_APPEND", as we're getting an RPC response for each append that occurred, i.e. it doesn't matter if the job is finished or not, the WATCH_APPEND sends the same data in the same pattern to you. I dunno, I keep going back and forth on this. |
I'd probably start with the simpler N responses and if we're not happy with the message framing overhead when it's like one event per message, we can always look at consolidating it later. |
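A toy sketch of the "N responses" framing, with respond_with_blob() standing in for loading one blobref from the content store and sending it as one streaming response; none of this is actual kvs-watch code:

```c
#include <stdio.h>
#include <stddef.h>

/* Placeholder standing in for "load this blobref from the content store
 * and send it to the watcher as one streaming response". */
static int respond_with_blob (const char *blobref)
{
    printf ("response: %s\n", blobref);
    return 0;
}

/* One response per newly appended blobref (the simpler framing discussed
 * above), rather than one consolidated response for all new data. */
static int send_new_entries (const char **blobrefs, size_t count, size_t *seen)
{
    for (size_t i = *seen; i < count; i++) {
        if (respond_with_blob (blobrefs[i]) < 0)
            return -1;
        *seen = i + 1;
    }
    return 0;
}

int main (void)
{
    const char *refs[] = { "sha1-aaa", "sha1-bbb", "sha1-ccc" }; /* fake */
    size_t seen = 1;    /* pretend the first blob was already sent */
    return send_new_entries (refs, 3, &seen);
}
```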
Problem: The FLUX_KVS_WATCH_APPEND flag is implemented inefficiently in kvs-watch. Every time new data is appended, the entire contents of the watched key are retrieved and only the new data, calculated via an offset, is sent to the watcher. This significantly slows down performance for large appended data (such as stdio).

Solution: Instead of retrieving the entire contents of the watched key, fetch the tree object for the key from the KVS. With access to the tree object's blobref array, we need only fetch the newly appended blobrefs from the content store. This significantly cuts down on the amount of data transferred during a kvs-watch.

Fixes flux-framework#6414
Problem: flux_kvs_lookup(FLUX_KVS_WATCH | FLUX_KVS_WATCH_APPEND) is implemented inefficiently. Although it returns only new data that has been appended to a watched key, the kvs-watch module internally fetches the entire value each time the key is mentioned in a kvs.setroot event, before returning only the data after the watcher's current offset.

This could almost trivially be made more efficient by asking the KVS for the RFC 11 valref object, which contains an array of blobrefs, and then fetching only the new blobrefs from the content store.

(From brainstorming discussion with @chu11)
The above description probably needs to be checked against the code before we get too excited about implementing this. However, given that we watch eventlogs a lot in Flux, this could be a high value improvement.
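To make the proposed flow concrete, here is a minimal standalone sketch (not the eventual kvs-watch implementation): fetch the RFC 11 treeobj for a key via the kvs module (which is also where access is enforced), parse its blobref array with jansson, and print the blobrefs that are new relative to a "seen" counter; actually loading those blobs from the content store is only noted in a comment. The key name "mykey" and the seen counter are placeholders.

```c
/* treeobj_watch_sketch.c - illustrative only; build against flux-core
 * and jansson.  "mykey" is a placeholder key. */
#include <flux/core.h>
#include <jansson.h>
#include <stdio.h>

int main (void)
{
    flux_t *h;
    flux_future_t *f;
    const char *s;
    json_t *o, *data;
    size_t seen = 0;    /* blobrefs this watcher has already handled */
    size_t i, n;

    if (!(h = flux_open (NULL, 0)))
        return 1;

    /* Ask the kvs module for the treeobj instead of the full value.
     * If the caller lacks access to the key, this lookup fails, which
     * is why no separate access check is needed before content loads. */
    if (!(f = flux_kvs_lookup (h, NULL, FLUX_KVS_TREEOBJ, "mykey"))
        || flux_kvs_lookup_get_treeobj (f, &s) < 0) {
        fprintf (stderr, "treeobj lookup failed\n");
        return 1;
    }
    if (!(o = json_loads (s, 0, NULL))
        || !(data = json_object_get (o, "data"))
        || !json_is_array (data)) {
        fprintf (stderr, "mykey is not a valref\n");
        return 1;
    }

    n = json_array_size (data);
    for (i = seen; i < n; i++) {
        /* Only these blobrefs would be loaded from the content store;
         * blobrefs before 'seen' are never re-fetched or re-sent. */
        printf ("new blobref: %s\n",
                json_string_value (json_array_get (data, i)));
    }
    /* A real watcher would record n as its new 'seen' count and repeat
     * this on the next kvs.setroot event that mentions the key. */

    json_decref (o);
    flux_future_destroy (f);
    flux_close (h);
    return 0;
}
```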