Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: _get_folder_size fn #471

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

deependujha
Copy link
Collaborator

Before submitting
  • Was this discussed/agreed via a Github issue? (no need for typos and docs improvements)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure to update the docs?
  • Did you write any new necessary tests?

What does this PR do?

This PR addresses comment: #468 (comment)

PR review

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues there's a high chance it will not be merged.

Did you have fun?

Make sure you had fun coding 🙃

size += os.stat(os.path.join(dirpath, filename)).st_size
if filename.endswith(_SUPPORTED_EXTENSIONS):
with contextlib.suppress(FileNotFoundError):
size += os.stat(os.path.join(dirpath, filename)).st_size
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea is not use os.stat which is expensive and not walk the folder.

Instead, we check only the cache folder, not its parents, list the file and use the config to estimate the size

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead, we check only the cache folder, not its parents, list the file and use the config to estimate the size

yes, while debugging to understand the sizes, I could see it was considering the other dirs(for eg: remote_dir) in the tests

Copy link
Collaborator

@tchaton tchaton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note

@deependujha
Copy link
Collaborator Author

sorry that was just to make the pr, I was working on using config dict.

@bhimrazy
Copy link
Collaborator

bhimrazy commented Feb 19, 2025

@deependujha, It seems the failing tests are correct because, previously, the size calculation included the remote_dir as well, whereas now it only considers the cache_dir size.
Since the max_cache_size remains unchanged, this results in more chunks being stored in the cache directory.

To address this, I think we could either reduce the max_cache_size to around 1300(or less) or update the number of files we check.

Incase we donot consider the index.json file then it would be just 48-50 bytes. The size of index.json file itself seems to be around 1317 bytes

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants