kvs-watch: avoid re-fetching the same data repeatedly #6414
Comments
Adding an extra level of indirection: prototype pushed to https://github.com/chu11/flux-core/tree/issue6414_kvs_watch_optimization. This relatively simple prototype passed all tests except a handful. The first 5 failures are weird append corner cases that probably just need new handling (or perhaps are no longer relevant). The last test has some extra output, probably a corner case I missed. (Update: yup, a corner case where I get 2 setroot events in succession and send the same starting index twice ... in a future world where I get the data from the content store, this corner case would be easier to handle.) Anyways, FWIW, a simple lptest run showed promising results. |
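As an illustration of that corner case, here is a tiny hypothetical sketch (not the prototype's actual code) of per-watcher bookkeeping that keeps a duplicate setroot event from re-sending the same starting index:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-watcher state; not actual kvs-watch code. */
struct watcher {
    size_t sent_blobrefs;   /* number of blobrefs already sent */
};

/* Given the number of blobrefs currently in the key's valref treeobj,
 * return true (and advance the bookkeeping) only if there is new data,
 * so back-to-back setroot events for the same content send nothing twice. */
static bool watcher_advance (struct watcher *w, size_t current_blobrefs)
{
    if (current_blobrefs <= w->sent_blobrefs)
        return false;                       /* nothing new: skip */
    w->sent_blobrefs = current_blobrefs;
    return true;
}

int main (void)
{
    struct watcher w = { .sent_blobrefs = 0 };
    (void)watcher_advance (&w, 3);  /* first setroot: 3 new blobrefs */
    (void)watcher_advance (&w, 3);  /* duplicate setroot: skipped */
    return 0;
}
```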
Nice result! Any improvement on the job throughput test? |
The doom plugin raises an exception on the job when one rank exits early. Did the rank 0 broker of your batch job crash? |
I was a bit perplexed by the error as well. I hadn't exited the test instance shell, so one presumes the broker didn't crash. I didn't look into it too much, but next time I should.
Doing basic throughput tests with simulated execution, there is no obvious change (10000 jobs, and I'm on corona which may have noise from other users; I'm also hitting some likely prototype corner-case limitations, so it's hard to test bigger than this). It's not too surprising IMO. The main eventlog and exec eventlogs are small and (very likely) all cached in the KVS, so the minor lookup cost savings probably don't amount to much (especially when offset by the additional treeobj we are sending back with all responses in this prototype). Looking through the code last night, the big savings are that we're no longer iterating through large blobref arrays and building up responses (and in my lptest tests above, some of those responses would be quite large). For long-running jobs, older entries may not be cached anymore, so we can avoid those extra calls to the content store. To me, those are the real big wins. And if this solution is to be the start of a solution for #6292, there's the obvious win there. Lemme look into implementing a prototype that is more "legit", mapping more closely to what we initially discussed, b/c the duplicate-data corner case is hard to solve with my current hacks. |
Pushed a more "refined" prototype to the branch https://github.com/chu11/flux-core/tree/issue6414_kvs_watch_optimization. This time, instead of being so hacky, it maps more closely to what we initially discussed. Doing a throughput test with 25000 simulated jobs, if there is a performance difference it appears to be in the noise and not apparent. With job eventlogs being so small, any minor gains from not iterating over and getting all the references are probably wiped out by the extra RPC exchange for the treeobj. I wouldn't be surprised if it was a tad slower overall if I ran this more and really got a good average. To me, the performance gain from this approach is mostly for keys with large appended data, such as stdio. Perhaps that argues for a new flag or option that could purposefully handle this performance situation, which we might apply only to stdin/stdout/stderr. It could possibly be required to be specified by users? Given the potential goal for #6292, perhaps it should be option 2?? |
Just some notes on failing tests. Note that if we go with the no-re-fetch behavior only on stdin/stdout/stderr, I think all these tests go back to working and stay the same. |
I am off today and haven't had time to look at this, but a quick reaction is that this is an optimization specific to watching, so it feels like it should be implemented as much as is feasible in the watch-specific code. Since watching a key whose value is being repeatedly overwritten (the original watch design point) is a different use case than watching one that is being appended to, a client flag to select one or the other seems justified. A little code duplication in the name of clarity is OK sometimes IMHO. |
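For reference, the two client-side use cases look like this today; "job.stdout" is a placeholder key, and this is just a sketch using the existing libkvs calls (flux_kvs_lookup, flux_kvs_lookup_get, flux_future_reset), not anything from the prototype:

```c
/* watch_append.c - client-side view of an append watch.
 * Build against flux-core; "job.stdout" is a placeholder key. */
#include <flux/core.h>
#include <errno.h>
#include <stdio.h>

static void watch_cb (flux_future_t *f, void *arg)
{
    const char *data;

    if (flux_kvs_lookup_get (f, &data) < 0) {
        if (errno != ENODATA)
            fprintf (stderr, "lookup: %s\n", flux_strerror (errno));
        flux_future_destroy (f);
        return;
    }
    /* With FLUX_KVS_WATCH_APPEND each response carries only newly
     * appended data; with plain FLUX_KVS_WATCH it is the full value. */
    printf ("%s", data);
    flux_future_reset (f);      /* re-arm for the next response */
}

int main (void)
{
    flux_t *h;
    flux_future_t *f;

    if (!(h = flux_open (NULL, 0)))
        return 1;
    if (!(f = flux_kvs_lookup (h, NULL,
                               FLUX_KVS_WATCH | FLUX_KVS_WATCH_APPEND,
                               "job.stdout"))
        || flux_future_then (f, -1., watch_cb, NULL) < 0)
        return 1;
    flux_reactor_run (flux_get_reactor (h), 0);
    flux_close (h);
    return 0;
}
```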
Ok, I will look into going with an alternate module; libkvs can route there given the flags. There are so many legacy corner cases (FULL, UNIQ, WAITCREATE, etc.) that it may be easier to go down a new-module path. |
Maybe a new RPC in the existing module instead? |
It occurs to me that if we get data from the content module directly, we bypass the access checks the kvs module would normally perform. I guess this is ok, b/c the treeobj lookup done via the KVS should be sufficient, since that lookup wouldn't work if access would be denied. |
Yep that sounds right. |
Anyone have a good idea for what a new flag for this option should be called? I'm currently programming it as "APPEND2". The best ideas I've thought of are "APPEND_REFS" or "REF_APPEND", i.e. return data from the appended refs. But those sound like you're returning the references, not the data. "APPEND_REF_DATA"? |
This is a flag to tell the watcher to return data from the appended refs? Hmm, wouldn't that be implied already by the FLUX_KVS_WATCH_APPEND flag? |
Are you suggesting we should change the current FLUX_KVS_WATCH_APPEND to something else? I suppose the current flag is less "APPEND" and more like "DIFF"? To be clear, my current plan/prototype was to introduce a new flag alongside the existing one. |
I thought the purpose of the flag was just to differentiate between the old use case for watch where the key was being repeatedly overwritten (and refetching it each time is desired) vs append, where there is always a new blobref being added to the treeobj, and fetching the new treeobj and any new blobs, but not the old ones, is desired? Am I confused/forgetting? p.s. I actually forgot we had a flag already when we were discussing it earlier |
The present case is watching a key whose value is repeatedly overwritten (putting a new value to the same key over and over), and the new case is watching a key that is appended to. The main corner case is that the new behavior only works with valrefs, not vals. So repeatedly overwriting a key would work in the present case, but would fail in the new one b/c the code would say "hey, this isn't a valref". Hmmm, thinking about this more ... I suppose it's possible we could handle both cases under one flag? Although I think that would start to get tricky. |
I think if you append to a val it is converted to a valref, with the existing value becoming a single blob, correct? Seems like if you're keeping a blob index anyway, this just means you don't have to fetch the first one? E.g. return it as index 0, increment the counter, and start the next read from the second blobref in what will be a treeobj. (And if not, it's an error?) Edit: oh, on the a= case, that seems like it may just be a contrived case for testing that we could do away with? |
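A sketch of the handling suggested here, assuming RFC 11 treeobj JSON ({"ver":1,"type":"val"|"valref","data":...}) and using jansson directly; initial_seen_count() is a hypothetical helper, not flux-core or kvs-watch API:

```c
#include <jansson.h>
#include <string.h>

/* Hypothetical helper: given a key's treeobj JSON when the watch starts,
 * return how many blobs the watcher will have consumed once the initial
 * value has been returned, or -1 if the object can't be watched for
 * appends at all. */
static int initial_seen_count (const char *treeobj_json)
{
    json_t *o = json_loads (treeobj_json, 0, NULL);
    const char *type;
    int count = -1;

    if (!o)
        return -1;
    type = json_string_value (json_object_get (o, "type"));
    if (type && !strcmp (type, "valref"))
        count = (int)json_array_size (json_object_get (o, "data"));
    else if (type && !strcmp (type, "val"))
        count = 1;  /* once appended to, the val becomes a valref whose
                     * first blob holds this value: treat it as index 0 */
    /* any other type (dir, symlink, ...): error for an append watch */
    json_decref (o);
    return count;
}

int main (void)
{
    /* toy valref with two blobrefs (fake hashes) */
    const char *valref =
        "{\"ver\":1,\"type\":\"valref\",\"data\":[\"sha1-aaa\",\"sha1-bbb\"]}";
    return initial_seen_count (valref) == 2 ? 0 : 1;
}
```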
Yes, but in the example I have above, it's not an append, we're just overwriting the value over and over again. The equivalent of the above with appends would be a series of appends to the same key. Admittedly it's a bit of a weird corner case, but I guess my point is there are these unique cases with the old behavior that are eliminated with the new behavior. I tend to be more conservative on these kinds of decisions, so it makes me want to preserve the old behavior and add a flag for the new behavior. |
Well, the flag has APPEND in the name, so it feels like fair game to me. E.g. just throw an error if the key changes and the next object is not a treeobj. |
@garlick I'm going back and forth on this, wondering your thoughts. Think of the use case of watching a key that already has many appended entries (e.g. the eventlog of a job that has already finished): do we send all of the existing data in one big initial response, or do we send N responses, one per appended blobref? My initial thought is we should do the former, but AFAICT there seems to be little performance impact: the increased cost of sending N RPCs is at least partially offset by the high cost of building the large initial response. Granted, my testing is with a single-node instance and there are no TBON hops involved; I would bet the single RPC response might be faster in the general sense. But as I thought about it more, sending "N" responses feels more appropriate. It feels like the expected behavior of a "WATCH_APPEND", as we're getting an RPC response for each append that occurred, i.e. it doesn't matter if the job is finished or not, the WATCH_APPEND sends the same data in the same pattern to you. I dunno, I keep going back and forth on this. |
I'd probably start with the simpler N responses and if we're not happy with the message framing overhead when it's like one event per message, we can always look at consolidating it later. |
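A toy sketch of the "N responses" framing, with respond_with_blob() standing in for loading one blobref from the content store and sending it as one streaming response; none of this is actual kvs-watch code:

```c
#include <stdio.h>
#include <stddef.h>

/* Placeholder standing in for "load this blobref from the content store
 * and send it to the watcher as one streaming response". */
static int respond_with_blob (const char *blobref)
{
    printf ("response: %s\n", blobref);
    return 0;
}

/* One response per newly appended blobref (the simpler framing discussed
 * above), rather than one consolidated response for all new data. */
static int send_new_entries (const char **blobrefs, size_t count, size_t *seen)
{
    for (size_t i = *seen; i < count; i++) {
        if (respond_with_blob (blobrefs[i]) < 0)
            return -1;
        *seen = i + 1;
    }
    return 0;
}

int main (void)
{
    const char *refs[] = { "sha1-aaa", "sha1-bbb", "sha1-ccc" }; /* fake */
    size_t seen = 1;    /* pretend the first blob was already sent */
    return send_new_entries (refs, 3, &seen);
}
```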
Problem: The FLUX_KVS_WATCH_APPEND flag is implemented inefficiently in kvs-watch. Every time new data is appended, the entire contents of the watched key are retrieved and only the new data, calculated via an offset, is sent to the watcher. This significantly slows down performance for large appended data (such as stdio).

Solution: Instead of retrieving the entire contents of the watched key, fetch the tree object for the key from the KVS. With access to the tree object's blobref array, we need only fetch the newly appended blobrefs from the content store. This significantly cuts down on the amount of data transferred during a kvs-watch.

Fixes flux-framework#6414
Problem: flux_kvs_lookup(FLUX_KVS_WATCH | FLUX_KVS_WATCH_APPEND) is implemented inefficiently. Although it returns only new data that has been appended to a watched key, the kvs-watch module internally fetches the entire value each time the key is mentioned in a kvs.setroot event, before returning only the data after the watcher's current offset.

This could almost trivially be made more efficient by asking the KVS for the RFC 11 valref object, which contains an array of blobrefs, and then fetching only the new blobrefs from the content store.

(From brainstorming discussion with @chu11)
The above description probably needs to be checked against the code before we get too excited about implementing this. However, given that we watch eventlogs a lot in Flux, this could be a high value improvement.
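To make the proposed flow concrete, here is a minimal standalone sketch (not the eventual kvs-watch implementation): fetch the RFC 11 treeobj for a key via the kvs module (which is also where access is enforced), parse its blobref array with jansson, and print the blobrefs that are new relative to a "seen" counter; actually loading those blobs from the content store is only noted in a comment. The key name "mykey" and the seen counter are placeholders.

```c
/* treeobj_watch_sketch.c - illustrative only; build against flux-core
 * and jansson.  "mykey" is a placeholder key. */
#include <flux/core.h>
#include <jansson.h>
#include <stdio.h>

int main (void)
{
    flux_t *h;
    flux_future_t *f;
    const char *s;
    json_t *o, *data;
    size_t seen = 0;    /* blobrefs this watcher has already handled */
    size_t i, n;

    if (!(h = flux_open (NULL, 0)))
        return 1;

    /* Ask the kvs module for the treeobj instead of the full value.
     * If the caller lacks access to the key, this lookup fails, which
     * is why no separate access check is needed before content loads. */
    if (!(f = flux_kvs_lookup (h, NULL, FLUX_KVS_TREEOBJ, "mykey"))
        || flux_kvs_lookup_get_treeobj (f, &s) < 0) {
        fprintf (stderr, "treeobj lookup failed\n");
        return 1;
    }
    if (!(o = json_loads (s, 0, NULL))
        || !(data = json_object_get (o, "data"))
        || !json_is_array (data)) {
        fprintf (stderr, "mykey is not a valref\n");
        return 1;
    }

    n = json_array_size (data);
    for (i = seen; i < n; i++) {
        /* Only these blobrefs would be loaded from the content store;
         * blobrefs before 'seen' are never re-fetched or re-sent. */
        printf ("new blobref: %s\n",
                json_string_value (json_array_get (data, i)));
    }
    /* A real watcher would record n as its new 'seen' count and repeat
     * this on the next kvs.setroot event that mentions the key. */

    json_decref (o);
    flux_future_destroy (f);
    flux_close (h);
    return 0;
}
```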