Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

For unseekable files -- build full index by reading the entire file instead #106

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

epicfaace
Copy link
Contributor

@epicfaace epicfaace commented Oct 13, 2022

This is because it appears that when I try to build an index by reading an unseekable file through the entire way (as opposed to build_full_index()), it gives me a ZRAN_READ error. Making this PR to see if this shows up in the tests as well.

@epicfaace
Copy link
Contributor Author

@pauldmccarthy do you know why this is failing? Ideally, even if I'm using an unseekable fileobj, I should be able to read through it once (but just once, since it's unseekable!) and construct the index.

@pauldmccarthy
Copy link
Owner

pauldmccarthy commented Jan 30, 2023

Hi @epicfaace, apologies for the delay; I've just had a quick poke through the code, and I'm afraid I've come to the conclusion that this would be impossible to support with the current code design. This is because each call to the zran_read function (called by IndexedGzipFile.read) performs a full cycle of:

  1. Initialising the underlying zlib structure for inflation from the beginning of the stream, or from the nearest seek point preceding the current location
  2. Reading/inflating data
  3. Resetting the state of the underlying zlib structure

So a subsequent call to zran_read is unable to simply resume from where the previous call finished. It must be able to seek backwards through the stream to the nearest seek point and start reading from there.

The zran_build_index function (called via IndexedGzipFile.build_full_index) is able to create an index from an unseekable stream because it performs a single pass through the data, i.e.:

  1. Initialising zlib for inflation from the beginning of the stream
  2. Reading/inflating the entire file
  3. Resetting zlib state

I think it would be possible to adjust the zran_read function so that the full index could be built from a single read, (i.e. f.read() rather than while f.read(1024): pass) because then the zran_read flow would match that of zran_expand_index. However, this won't help you if you are also hoping to simultaneously stream the data while building the index.

So I'm afraid that this would not be possible without a major refactor/redesign of the contents of the zran.c module - the zran_read function would have to be redesigned so that it could perform a partial read without resetting zlib state.

I really don't have time at the moment for this, but may do in the future (as I am increasingly of the opinion that that it could be dramatically simplified).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants