Scaling issue with tensor_db #1593
Closed · +52 −26
Commits (11)
52327b5 initial (porteratzo)
b1971aa Merge branch 'securefederatedai:develop' into scalling (porteratzo)
e2abab0 optimize cache tensor (porteratzo)
f6699d4 Merge branch 'scalling' of https://github.com/porteratzo/openfl into … (porteratzo)
e6d6fa9 merge (porteratzo)
962b266 final fixes (porteratzo)
66161a2 Merge branch 'develop' into scalling (payalcha)
de15b97 Merge branch 'develop' into scalling (porteratzo)
b42a86c Merge branch 'develop' of https://github.com/securefederatedai/openfl… (porteratzo)
4098788 Merge branch 'develop' into scalling (porteratzo)
d0b6781 Merge branch 'develop' into scalling (porteratzo)
As I understand, this snippet removes collaborator metrics on the aggregator side (which are merely a couple of floats for each collaborator). I am not sure if this is related to scaling issues, or memory usage in general. Do you have any pre/post analysis in support of this proposal?
Well, as evidence, I profiled get_cached_tensor with and without filtering these rows and measured an ~80% speed-up.
As an example I have this calculation

So the number of collaborator tensor rows grows a lot with more collaborators; that row count stays constant through the experiment. And although the metric rows increase relatively little per round, with many reported metrics, many collaborators, and many rounds they can grow to be a significant part of the tensor_db.
Most calls to get_tensor_from_cache only care about the aggregator tensor, so removing all the collaborator model rows and metric rows yields a large improvement in query time.
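The effect described above can be sketched with a toy, pandas-backed tensor store. The column names and `build_db` helper here are illustrative assumptions modeled loosely on how a tensor_db keyed by origin might look, not the actual OpenFL TensorDB implementation:

```python
import numpy as np
import pandas as pd

def build_db(num_collaborators=30, num_tensors=50, num_rounds=10):
    """Build a toy tensor table: one aggregator row plus one row per
    collaborator, for each tensor in each round (illustrative only)."""
    rows = []
    for rnd in range(num_rounds):
        for t in range(num_tensors):
            rows.append((f"tensor_{t}", "aggregator", rnd, np.zeros(4)))
            for c in range(num_collaborators):
                rows.append((f"tensor_{t}", f"col_{c}", rnd, np.zeros(4)))
    return pd.DataFrame(rows, columns=["tensor_name", "origin", "round", "nparray"])

db = build_db()

# Collaborator rows dominate the table: with 30 collaborators, roughly
# 30 of every 31 rows are collaborator-origin rows.
aggregator_only = db[db["origin"] == "aggregator"]

# A typical lookup only needs the aggregator tensor, so it can scan the
# much smaller filtered table instead of the full one.
hit = aggregator_only[
    (aggregator_only["tensor_name"] == "tensor_0") & (aggregator_only["round"] == 9)
]
print(len(db), len(aggregator_only), len(hit))  # → 15500 500 1
```

With these (assumed) proportions, dropping collaborator rows before the lookup shrinks the scanned table by a factor of ~31, which is consistent with the query-time improvement reported above.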
Thanks, two experiment suggestions:
If either of these two results shows a significant improvement, we can prioritize merging this PR.
The changes are most impactful with more collaborators and also depend on model size, so the following experiment was run on a custom-modified MNIST workspace with 30 collaborators, using resnet50, for 10 rounds.
Without changes: real 39m51.381s
With changes: real 34m12.750s
That is around a 14% reduction in time to completion.
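For reference, the reduction can be computed directly from the wall-clock times reported above:

```python
# Wall-clock times from the experiment above, converted to seconds.
without_changes = 39 * 60 + 51.381   # real 39m51.381s
with_changes = 34 * 60 + 12.750      # real 34m12.750s

reduction = (without_changes - with_changes) / without_changes
print(f"{reduction:.1%}")  # → 14.2%
```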