Fix 200% untar progress issue by avoiding a second read of file #17708
base: develop2
Conversation
Looks great! Seems like performance-wise, both methods will end up calling extract_one in the tarfile for each member, so there would be no extra overhead there, we'll just need to address my comment
Please share real timings of unzipping some large files, with and without this change.
UPDATE: Sorry I didn't see the OP clearly, now it is clear, thanks!
Then @AbrilRBS, regarding your comment:
Looks great! Seems like performance-wise, both methods will end up calling extract_one in the tarfile for each member, so there would be no extra overhead there, we'll just need to address my comment
That part is not fully clear to me; what do you mean?
if pattern:
    members = list(filter(lambda m: fnmatch(m.name, pattern),
                          tarredgzippedFile.getmembers()))
Doesn't this case read the file a third time, potentially causing 300%?
If getmembers() does file read IO, that seems to be the case?
Nope, getmembers has a caching mechanism.
The maximum number of reads tarfile can perform on a tar.gz is two!
See code:
def getmembers(self):
"""Return the members of the archive as a list of TarInfo objects. The
list has the same order as the members in the archive.
"""
self._check()
if not self._loaded: # if we want to obtain a list of
self._load() # all members, we first have to
# scan the whole archive.
return self.members
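The caching is easy to observe: once `_load()` has run, `getmembers()` returns the same cached list on every call, with no further scan. A small sketch against an in-memory archive (built only for illustration):

```python
import io
import tarfile

# Build a one-member in-memory tar.gz for the demonstration.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w:gz") as tf:
    info = tarfile.TarInfo("x.txt")
    info.size = 0
    tf.addfile(info, io.BytesIO(b""))

buf.seek(0)
with tarfile.open(fileobj=buf, mode="r:gz") as tf:
    first = tf.getmembers()   # triggers the full archive scan (_load)
    second = tf.getmembers()  # served from the cache; no rescan
    print(first is second)    # True: the very same cached list object
```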
I did not check the code, but if I understand correctly, the uncompress status is updated every second. I just tried this branch and it indeed works up to 99% (Qt is huge); then each step is repeated, but that's fine IMO:
Yes... I think I can add a final log which explicitly marks the 100% just for the TOC
That is a very good point, the progress printer is managed by a
@perseoGI thank you for clarifying the feature. It looks good to me now.
Changelog: Fix: Improve untar performance
Docs: omit
Calling TarFile.getmembers() forces a complete IO read of the tar file. The same happens the first time TarFile.extractall(members=members) is called.
This PR fixes the reported issue of the untar progress exceeding 100%, caused by calling seek(0) twice: once for the first read querying all files, and a second time for the actual uncompression.
Details:
Perform the decompression in a single IO loop by iterating over the compressed files without actually reading them, filtering and uncompressing them one at a time when needed.
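The single-pass approach described above can be sketched as follows. This is an illustrative outline, not Conan's actual implementation: the function name `untar_single_pass` and the progress handling are hypothetical. The key point is that iterating a `TarFile` object streams members in order without a prior `getmembers()` scan, so the archive is only read once:

```python
import io
import tarfile
import tempfile
from fnmatch import fnmatch

def untar_single_pass(fileobj, dest, pattern=None):
    """Illustrative sketch: extract in one pass, filtering on the fly."""
    extracted = []
    with tarfile.open(fileobj=fileobj, mode="r:gz") as tf:
        for member in tf:  # streams members; no pre-scan of the archive
            if pattern and not fnmatch(member.name, pattern):
                continue
            tf.extract(member, path=dest)
            extracted.append(member.name)  # a progress update would go here
    return extracted

# Demo: build a tiny in-memory archive and extract only *.txt members.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w:gz") as tf:
    for name in ("a.txt", "b.log"):
        info = tarfile.TarInfo(name)
        info.size = 5
        tf.addfile(info, io.BytesIO(b"hello"))
buf.seek(0)
with tempfile.TemporaryDirectory() as dest:
    names = untar_single_pass(buf, dest, pattern="*.txt")
    print(names)  # ['a.txt']
```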
Benchmarks:
develop2 untar code
New changes
The results show roughly a 25% improvement in uncompression time.
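For anyone who wants to reproduce timings locally, here is a minimal, illustrative harness (not the benchmark used for the numbers above) comparing the two-pass approach (getmembers() followed by extractall) with single-pass iteration over the same in-memory archive. Absolute numbers will vary by archive size and machine:

```python
import io
import tarfile
import tempfile
import time

def make_archive(n=200):
    """Build a small in-memory tar.gz with n 1 KiB files."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tf:
        for i in range(n):
            data = b"x" * 1024
            info = tarfile.TarInfo(f"f{i}.txt")
            info.size = len(data)
            tf.addfile(info, io.BytesIO(data))
    return buf.getvalue()

raw = make_archive()

def two_pass(dest):
    with tarfile.open(fileobj=io.BytesIO(raw), mode="r:gz") as tf:
        members = tf.getmembers()                   # first full read
        tf.extractall(path=dest, members=members)   # second read

def single_pass(dest):
    with tarfile.open(fileobj=io.BytesIO(raw), mode="r:gz") as tf:
        for m in tf:                                # one read only
            tf.extract(m, path=dest)

for fn in (two_pass, single_pass):
    with tempfile.TemporaryDirectory() as d:
        t0 = time.perf_counter()
        fn(d)
        print(fn.__name__, round(time.perf_counter() - t0, 4))
```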
Closes #17697
develop branch, documenting this one.