For unseekable files -- build full index by reading the entire file instead #106

epicfaace · 2022-10-13T18:52:46Z

I'm making this change because, when I try to build an index by reading an unseekable file all the way through (as opposed to calling build_full_index()), I get a ZRAN_READ error. I'm opening this PR to see whether the error shows up in the tests as well.


epicfaace · 2022-10-13T19:08:20Z

@pauldmccarthy do you know why this is failing? Ideally, even if I'm using an unseekable fileobj, I should be able to read through it once (but just once, since it's unseekable!) and construct the index.

pauldmccarthy · 2023-01-30T13:48:35Z

Hi @epicfaace, apologies for the delay; I've just had a quick poke through the code, and I'm afraid I've come to the conclusion that this would be impossible to support with the current code design. This is because each call to the zran_read function (called by IndexedGzipFile.read) performs a full cycle of:

1. Initialising the underlying zlib structure for inflation from the beginning of the stream, or from the nearest seek point preceding the current location
2. Reading/inflating data
3. Resetting the state of the underlying zlib structure

So a subsequent call to zran_read is unable to simply resume from where the previous call finished. It must be able to seek backwards through the stream to the nearest seek point and start reading from there.
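To make the consequence of that cycle concrete, here is a toy model in Python using only the standard library. It is not the zran.c code: `StatelessReader` and `ForwardOnly` are hypothetical names invented for this sketch, and it assumes there are no seek points, so every rewind goes back to byte 0.

```python
import gzip
import io
import zlib


class ForwardOnly(io.BufferedIOBase):
    """Hypothetical stand-in for a pipe or socket: readable, not seekable."""

    def __init__(self, data):
        super().__init__()
        self._inner = io.BytesIO(data)

    def readable(self):
        return True

    def read(self, size=-1):
        return self._inner.read(size)

    def seekable(self):
        return False

    def seek(self, offset, whence=io.SEEK_SET):
        raise io.UnsupportedOperation("stream is not seekable")


class StatelessReader:
    """Toy reader whose read() does a full initialise/inflate/reset cycle."""

    def __init__(self, fileobj):
        self.fileobj = fileobj
        self.compressed_pos = 0   # bytes already pulled from the compressed stream
        self.offset = 0           # current position in the uncompressed data

    def read(self, length):
        # Step 1: start from a known point. Once anything has been consumed,
        # that means rewinding the compressed stream.
        if self.compressed_pos != 0:
            self.fileobj.seek(0)
            self.compressed_pos = 0
        inflater = zlib.decompressobj(16 + zlib.MAX_WBITS)  # gzip framing

        # Step 2: inflate until the requested range is covered.
        out = bytearray()
        end = self.offset + length
        while len(out) < end:
            chunk = self.fileobj.read(4096)
            if not chunk:
                break
            self.compressed_pos += len(chunk)
            out += inflater.decompress(chunk)

        # Step 3: the inflater is discarded on return, so no state survives.
        data = bytes(out[self.offset:end])
        self.offset += len(data)
        return data


payload = bytes(range(256)) * 64
compressed = gzip.compress(payload)

ok = StatelessReader(io.BytesIO(compressed))
first, second = ok.read(1024), ok.read(1024)   # second call rewinds and re-inflates

stuck = StatelessReader(ForwardOnly(compressed))
stuck_first = stuck.read(1024)                 # still at byte 0, no rewind needed
try:
    stuck.read(1024)
    second_read_failed = False
except io.UnsupportedOperation:
    second_read_failed = True                  # the rewind is impossible here
```

On the seekable `BytesIO` both reads succeed; on the forward-only stream the first read works and the second one fails at the rewind, which mirrors the behaviour described above.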

The zran_build_index function (called via IndexedGzipFile.build_full_index) is able to create an index from an unseekable stream because it performs a single pass through the data, i.e.:

1. Initialising zlib for inflation from the beginning of the stream
2. Reading/inflating the entire file
3. Resetting zlib state
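The single-pass flow can be sketched with Python's standard `zlib` module. This is only an illustration of why no seek is required: `single_pass_offsets` and `ForwardOnly` are hypothetical names, and the checkpoints recorded here are plain offset pairs at chunk boundaries, not the resumable seek points a real index would store (those need additional state that this sketch leaves out).

```python
import gzip
import io
import random
import zlib


class ForwardOnly(io.BufferedIOBase):
    """Hypothetical non-seekable stream used only for this sketch."""

    def __init__(self, data):
        super().__init__()
        self._inner = io.BytesIO(data)

    def readable(self):
        return True

    def read(self, size=-1):
        return self._inner.read(size)

    def seekable(self):
        return False

    def seek(self, offset, whence=io.SEEK_SET):
        raise io.UnsupportedOperation("stream is not seekable")


def single_pass_offsets(fileobj, spacing=4096, chunk_size=1024):
    """Inflate a gzip stream once, front to back, noting offsets on the way.

    Returns (total_uncompressed_size, checkpoints), where each checkpoint is
    a (compressed_offset, uncompressed_offset) pair.
    """
    inflater = zlib.decompressobj(16 + zlib.MAX_WBITS)  # initialise once
    compressed_pos = 0
    uncompressed_pos = 0
    checkpoints = [(0, 0)]
    while True:
        chunk = fileobj.read(chunk_size)                # only ever reads forward
        if not chunk:
            break
        compressed_pos += len(chunk)
        uncompressed_pos += len(inflater.decompress(chunk))
        if uncompressed_pos - checkpoints[-1][1] >= spacing:
            checkpoints.append((compressed_pos, uncompressed_pos))
    uncompressed_pos += len(inflater.flush())           # state released after this
    return uncompressed_pos, checkpoints


rng = random.Random(0)
payload = bytes(rng.getrandbits(8) for _ in range(20000))
total, checkpoints = single_pass_offsets(ForwardOnly(gzip.compress(payload)))
```

Because the inflater is created once and fed sequentially until the end of the stream, the function never calls `seek()`, so it works on the forward-only stream.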

I think it would be possible to adjust the zran_read function so that the full index could be built from a single read (i.e. f.read() rather than while f.read(1024): pass), because then the zran_read flow would match that of zran_expand_index. However, this won't help you if you are also hoping to simultaneously stream the data while building the index.

So I'm afraid that this would not be possible without a major refactor/redesign of the contents of the zran.c module - the zran_read function would have to be redesigned so that it could perform a partial read without resetting zlib state.
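For illustration only, this is roughly what "a partial read without resetting zlib state" looks like when expressed with Python's standard `zlib` module. It is not a proposal for zran.c, and it builds no index; `ResumableInflater` and `ForwardOnly` are hypothetical names. The point is that the decompressor object lives across calls, so each `read()` resumes where the previous one stopped and the compressed stream is never rewound.

```python
import gzip
import io
import zlib


class ForwardOnly(io.BufferedIOBase):
    """Hypothetical non-seekable stream used only for this sketch."""

    def __init__(self, data):
        super().__init__()
        self._inner = io.BytesIO(data)

    def readable(self):
        return True

    def read(self, size=-1):
        return self._inner.read(size)

    def seekable(self):
        return False

    def seek(self, offset, whence=io.SEEK_SET):
        raise io.UnsupportedOperation("stream is not seekable")


class ResumableInflater:
    """Keeps one zlib decompressor alive between read() calls."""

    def __init__(self, fileobj, chunk_size=1024):
        self._fileobj = fileobj
        self._chunk_size = chunk_size
        self._inflater = zlib.decompressobj(16 + zlib.MAX_WBITS)  # created once

    def read(self, length):
        out = bytearray()
        while len(out) < length and not self._inflater.eof:
            # Input left over from the previous call must be fed back first.
            data = self._inflater.unconsumed_tail
            if not data:
                data = self._fileobj.read(self._chunk_size)
            # max_length caps the output so the rest stays inside zlib's state.
            produced = self._inflater.decompress(data, length - len(out))
            out += produced
            if not data and not produced:
                break  # input exhausted and nothing pending (truncated stream)
        return bytes(out)


payload = bytes(range(256)) * 100
reader = ResumableInflater(ForwardOnly(gzip.compress(payload)))

pieces = []
while True:
    piece = reader.read(1000)
    if not piece:
        break
    pieces.append(piece)
```

Reading in 1000-byte pieces reassembles the original data from a forward-only stream. A redesigned zran_read would additionally have to record seek points as it goes, which this sketch does not attempt.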

I really don't have time for this at the moment, but may do in the future (as I am increasingly of the opinion that it could be dramatically simplified).

epicfaace added 3 commits October 13, 2022 14:51

For unseekable files -- build full index by reading the entire file i…

d42a0c9

…nstead

Update ctest_indexed_gzip.pyx

6eca36a

Update ctest_indexed_gzip.pyx

43c4988

wwwjn mentioned this pull request Oct 19, 2022

Bypass server upload large file is slow because of generating index file codalab/codalab-worksheets#4201

Open

epicfaace mentioned this pull request Oct 19, 2022

For unseekable files, can't build full index by reading the entire file #107

Open

pauldmccarthy closed this Jan 30, 2023

pauldmccarthy reopened this Jan 30, 2023

pauldmccarthy mentioned this pull request Jan 30, 2023

Using seek_points() to obtain valid decompression ranges. #114

Open

pauldmccarthy mentioned this pull request Mar 29, 2023

ENH: Create index on reads, not on seeks #74

Open

pauldmccarthy changed the base branch from master to main August 29, 2023 16:06
