Skip to content

Split large range requests to avoid origin limits on overly large range requests on e.g. deep-coverage BAM#5576

Open
cmdcolin wants to merge 1 commit into
mainfrom
fix_requests
Open

Split large range requests to avoid origin limits on overly large range requests on e.g. deep-coverage BAM#5576
cmdcolin wants to merge 1 commit into
mainfrom
fix_requests

Conversation

@cmdcolin
Copy link
Copy Markdown
Collaborator

A contiguous run of uncached chunks was coalesced into one HTTP range request. Over deep-coverage BAM loci this produced ~100MB single requests that some origins (e.g. Google Cloud Storage) reject with HTTP 500. Cap each request at MAX_FETCH_CHUNKS (8MB); the same bytes are fetched, now in parallel via the existing concurrency limiter.

Notably, this is not because of 'overmerging chunks' in bam-js, it is because there are actually just very large 'chunks' in the bam file itself

claude code reasoning when asked whether this should be added in bam-js itself (which does not have the 'transport layer' [the fetching], it just says what chunks to fetch and sometimes merges chunks to produce less requests, but this is not related to over merging)

 I checked, and the answer is no, it shouldn't move into the chunk-merging code, but
  the reasoning is worth laying out because it confirms the fix is in the right place.

  What the merge code actually does

  The merging logic in bam-js (optimizeChunks in src/util.ts — what you're calling canMergeBlocks)
  already has a size cap, and has since 2019 (v1.0.29):

  if (
    chunkMinBlock - lastMaxBlock < 65000 &&      // adjacent chunks, small gap
    chunkMaxBlock - lastMinBlock < 5000000       // combined span stays under 5MB
  ) { /* merge */ }

  So this code only ever combines small adjacent chunks, and it refuses to combine them past 5 MB. It
  is structurally incapable of producing a 100 MB request. The published 7.2.4 that current JBrowse
  ships has this cap.

  So where does Mike's 104 MB request come from?

  Logically it cannot be a merge result — every merge step checks the span stays under 5 MB, so any
  merged chunk is < 5 MB by construction. That leaves only one possibility: it's a single, un-merged
  chunk that is natively ~104 MB.

  That's exactly what deep targeted sequencing produces. The index (BAI) groups reads into bins; the
  smallest bin covers ~16 kb. If 5M reads pile into one 16 kb window, they're stored back-to-back and
  the index records that as one chunk spanning ~100 MB. optimizeChunks receives it already that big —
  there's nothing to merge, so the merge cap never even looks at it. It hands that one fat chunk
  straight through, and the code reads its whole byte span in a single request.

  Why splitting belongs at the transport layer, not in the merge code

  To fix it inside the merge code you'd have to add splitting logic — chop one oversized chunk into
  several smaller reads and then reassemble the bytes before decompressing. But that's not a "which
  chunks go together" decision (what the merge code is for); it's a "how do I physically fetch this
  byte range" decision. Reassembling pieces of a byte range is precisely what the download/cache layer
  already does.

  That's the clean separation the fix respects:
  - bam-js decides what bytes are needed (the logical chunks) — unchanged.
  - The download layer decides how to fetch them (in pieces if large) — this is where my change lives.

  bam-js asks for "bytes 523M–628M" exactly as before; the download layer quietly fetches that in 32
  MB pieces and hands back the complete result. bam-js decompresses it identically, completely unaware
  anything was split. No new reassembly logic in bam-js, no risk of splitting a compressed block in
  the wrong spot.

  Bonus: it's more general where it is

  A bam-js-only fix would only help BAM. The same "one giant native chunk" situation exists for CRAM,
  tabix-indexed VCF/GFF/BED, and BigWig — they all funnel through this same download layer, so they
  all get the protection for free.

  Bottom line

  The merge code is the right place for merging decisions and already caps itself sensibly. Mike's
  oversized request isn't a merging problem — it's a single naturally-huge chunk, and breaking a large
  byte fetch into smaller ones is a transport job. The fix is correctly placed. The only thing I'd
  add is that it's worth confirming Mike upgrades past whatever shipped in 3.2.0, but since this is a
  single-chunk issue rather than a merge issue, the download-layer cap is what actually closes it.


…M loci

A contiguous run of uncached chunks was coalesced into one HTTP range
request. Over deep-coverage BAM loci this produced ~100MB single requests
that some origins (e.g. Google Cloud Storage) reject with HTTP 500. Cap
each request at MAX_FETCH_CHUNKS (8MB); the same bytes are fetched, now
in parallel via the existing concurrency limiter.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant