fix: _get_folder_size fn #471
src/litdata/streaming/reader.py
Outdated
size += os.stat(os.path.join(dirpath, filename)).st_size
if filename.endswith(_SUPPORTED_EXTENSIONS):
    with contextlib.suppress(FileNotFoundError):
        size += os.stat(os.path.join(dirpath, filename)).st_size
The idea is to avoid os.stat, which is expensive, and to avoid walking the folder.
Instead, we check only the cache folder, not its parents, list the files, and use the config to estimate the size.
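A minimal sketch of the approach suggested in this comment, under stated assumptions: the function name `estimate_folder_size` and the plain `filename_to_size` dict are hypothetical stand-ins for whatever the config actually exposes, not litdata's API. It lists only the cache folder, without recursing, and sums sizes already recorded in the mapping instead of calling os.stat on each file.

```python
import os


def estimate_folder_size(cache_dir: str, filename_to_size: dict[str, int]) -> int:
    """Estimate the size of cache_dir from sizes recorded in a mapping.

    `filename_to_size` is a hypothetical mapping from chunk filename to its
    size in bytes, standing in for the information held by the config.
    """
    size = 0
    # os.listdir does not recurse, so subdirectories and parent
    # directories are never traversed.
    for filename in os.listdir(cache_dir):
        # Entries not present in the mapping (lock files, index files,
        # subdirectories, ...) contribute nothing to the estimate.
        size += filename_to_size.get(filename, 0)
    return size
```

Because the sizes come from the mapping, the result is an estimate: it only counts files that both exist in the folder and are known to the mapping, and it never touches file metadata on disk.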
> Instead, we check only the cache folder, not its parents, list the file and use the config to estimate the size
Yes, while debugging to understand the sizes, I could see it was also counting other directories (e.g. remote_dir) in the tests.
Sorry, that was just to open the PR; I was working on using the config dict.
@deependujha, It seems the failing tests are correct because, previously, the size calculation included the To address this, I think we could either reduce the
What does this PR do?
This PR addresses comment: #468 (comment)