Skip to content

Add test for file-like read() method sending fewer bytes than requested #644

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Apr 22, 2025

Conversation

NicolasHug
Copy link
Member

@NicolasHug NicolasHug commented Apr 16, 2025

We have a while loop logic in AVIOFileLikeContext::read that handles cases where the read() method of the file-like object sends fewer bytes than requested:

while (totalNumRead < buf_size) {

Until now, this logic was untested, so this PR adds a test for this.


Side note

I am genuinely surprised that this actually works. I tried hard to build a test that would fail, but I couldn't. I guess it's a good thing. The bad thing is: I don't understand how this actually works. It looks like FFmpeg is being really smart and re-requests the bytes it didn't originally get, and there's the internal buffer logic going on as well. I played a bit with this snippet, trying to take the diff between a "good" reader and a "bad" reader.

The bad reader ends up reading fewer bytes in total, and it's impressive to see that things still work even if read() returns 0 bytes sometimes. And yet, results appear to be correct. So, I'm still puzzled, but I'll accept that this works until proven otherwise.

import torch
from torchcodec.decoders import VideoDecoder
from torchcodec.samplers import clips_at_random_indices
path = "/home/nicolashug/dev/torchcodec/test/resources/nasa_13013.mp4"

class FileLike:
    def __init__(self, file, read_less: bool):
        self._file = file
        self.total_bytes_read = 0
        self.read_less = read_less

    def read(self, size: int) -> bytes:
        if size == 10000:
            print(f"Requested {size}, returning nothing!")
            return b""

        new_size = size - 10000 if self.read_less else size

        self.total_bytes_read += new_size
        print(f"Requested {size}, returning {new_size}", flush=True)
        return self._file.read(new_size)

    def seek(self, offset: int, whence: int) -> bytes:
        print(f"Seeking to {offset = }", flush=True)
        return self._file.seek(offset, whence)


file_like = FileLike(open(path, mode="rb", buffering=0), read_less=False)
decoder_file_like = VideoDecoder(file_like)
torch.manual_seed(0)
clips_file_like = clips_at_random_indices(decoder=decoder_file_like, num_clips=100)
print(f"{file_like.total_bytes_read = }")

decoder = VideoDecoder(path)
torch.manual_seed(0)
clips = clips_at_random_indices(decoder=decoder, num_clips=100)

torch.testing.assert_close(clips_file_like.data, clips.data)

Logs with read_less=True:

Requested 65536, returning 55536
Requested 10000, returning nothing!
Seeking to offset = -1
Seeking to offset = 55536
Seeking to offset = 654812
Requested 65536, returning 55536
Requested 47367, returning 37367
Seeking to offset = -1
Seeking to offset = 672981
Seeking to offset = 48
Requested 65536, returning 55536
Requested 10000, returning nothing!
Seeking to offset = -1
Seeking to offset = 55584
Seeking to offset = -1
Seeking to offset = 55584
Seeking to offset = -1
Seeking to offset = 55584
Requested 65536, returning 55536
Requested 10000, returning nothing!
Requested 65536, returning 55536
Requested 10000, returning nothing!
Requested 65536, returning 55536
Requested 10000, returning nothing!
Requested 65536, returning 55536
Requested 10000, returning nothing!
Requested 65536, returning 55536
Requested 10000, returning nothing!
Requested 65536, returning 55536
Requested 10000, returning nothing!
Requested 65536, returning 55536
Requested 10000, returning nothing!
Requested 65536, returning 55536
Requested 10000, returning nothing!
Requested 65536, returning 55536
Requested 10000, returning nothing!
Requested 65536, returning 55536
Requested 10000, returning nothing!
Requested 65536, returning 55536
Requested 10000, returning nothing!
Seeking to offset = 7279
Requested 65536, returning 55536
Requested 10000, returning nothing!
Requested 65536, returning 55536
Requested 10000, returning nothing!
Requested 65536, returning 55536
Requested 10000, returning nothing!
Requested 65536, returning 55536
Requested 10000, returning nothing!
Requested 65536, returning 55536
Requested 10000, returning nothing!
Requested 65536, returning 55536
Requested 10000, returning nothing!
Requested 65536, returning 55536
Requested 10000, returning nothing!
Requested 65536, returning 55536
Requested 10000, returning nothing!
Requested 65536, returning 55536
Requested 10000, returning nothing!
Requested 65536, returning 55536
Requested 10000, returning nothing!
Requested 65536, returning 55536
Requested 10000, returning nothing!
Requested 65536, returning 55536
Requested 10730, returning 730
file_like.total_bytes_read = 1482033

logs with read_less=False

Requested 65536, returning 65536
Seeking to offset = -1
Seeking to offset = 65536
Seeking to offset = 654812
Requested 65536, returning 65536
Requested 47367, returning 47367
Seeking to offset = -1
Seeking to offset = 672981
Seeking to offset = 48
Requested 65536, returning 65536
Seeking to offset = -1
Seeking to offset = 65584
Seeking to offset = -1
Seeking to offset = 65584
Seeking to offset = -1
Seeking to offset = 65584
Requested 65536, returning 65536
Requested 65536, returning 65536
Requested 65536, returning 65536
Requested 65536, returning 65536
Requested 65536, returning 65536
Requested 65536, returning 65536
Requested 65536, returning 65536
Requested 65536, returning 65536
Requested 65536, returning 65536
Seeking to offset = 7279
Requested 65536, returning 65536
Requested 65536, returning 65536
Requested 65536, returning 65536
Requested 65536, returning 65536
Requested 65536, returning 65536
Requested 65536, returning 65536
Requested 65536, returning 65536
Requested 65536, returning 65536
Requested 65536, returning 65536
Requested 65536, returning 65536
file_like.total_bytes_read = 1489159


@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Meta Open Source bot. label Apr 16, 2025
@scotts
Copy link
Contributor

scotts commented Apr 21, 2025

I can roughly make sense of this, because there are more total read requests when we reduce (or eliminate!) the bytes we read. Basically all low-level reading operations can return less data than requested; see, for example, the man page for C's standard library read: https://man7.org/linux/man-pages/man2/read.2.html. Some of the reasons why are explained in that man page. As a result anyone calling these operations needs to be robust to receiving less data than requested, and it looks like FFmpeg is doing the right thing.

@NicolasHug NicolasHug merged commit 8cf1daf into pytorch:main Apr 22, 2025
58 of 62 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Meta Open Source bot.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants