GetObject response body hangs forever when the connection drops mid-transfer (no body-read timeout; ChecksumStream swallows the abort) #8098

@strongpauly

Describe the bug

GetObjectCommand can hang forever while reading the response body when the underlying TCP connection drops or stalls mid-transfer, with no error and no timeout to break the deadlock.

The failure is invisible to every safety mechanism the SDK offers:

  • client.send() resolves successfully as soon as the response headers arrive (HTTP 200). From the SDK's point of view the operation has already succeeded, so the failure happens entirely in the body-streaming phase that follows.
  • The returned Body — a ChecksumStream when the object carries a checksum, which is the default for newly-uploaded objects — never emits end or error when the underlying socket is destroyed or goes silent mid-body. The teardown is swallowed by the wrapper, so the consumer waits forever.
  • This affects every way of reading the body: the SDK's own Body.transformToString(), Node's stream/consumers (text()/json()), and manual data/error/end collectors all hang identically.
  • Built-in retries don't help. Because send() already succeeded, the retry strategy is complete by the time the body stalls — with maxAttempts: 5 the server receives exactly one request and the read still hangs.
  • requestTimeout doesn't help either — it only bounds the request up to the response headers, not the body transfer.

The net effect is a silent, unrecoverable deadlock: a process that has received a partial object body sits idle indefinitely (we observed 7+ hours, 0% CPU, no open socket, no exception) with no way to detect or recover without killing it. Any caller streaming object bodies over a connection that can be dropped mid-flight (NAT idle eviction, load-balancer reset, transient network blip, half-open peer) is exposed.

Regression Issue

  • Select this option if this issue appears to be a regression.

SDK version number

@aws-sdk/client-s3 3.1038.0, @smithy/node-http-handler 4.7.8

Which JavaScript Runtime is this issue in?

Node.js

Details of the browser/Node.js/ReactNative version

v24.12.0

Reproduction Steps

The trigger is a TCP connection that stalls or drops after the response headers are received but before the body finishes. The script below reproduces it deterministically with a local mock endpoint; no AWS credentials or real bucket are required.

npm install @aws-sdk/client-s3
node repro.mjs

// repro.mjs
import http from 'node:http';
import { S3Client, GetObjectCommand } from '@aws-sdk/client-s3';

// Mock S3 endpoint: returns 200 + a checksum header (so the SDK wraps Body in a
// validating ChecksumStream), sends 44 KB of a 500 KB-declared body, then drops
// the socket mid-transfer — i.e. a connection killed in flight (NAT eviction,
// LB idle reset, transient network blip).
const server = http.createServer((req, res) => {
    res.writeHead(200, {
        'Content-Length': '500000',
        'x-amz-checksum-crc32': 'AAAAAA==',
        'x-amz-checksum-type': 'FULL_OBJECT'
    });
    res.write('{"data":[' + 'x'.repeat(44000));
    setTimeout(() => req.socket.destroy(), 300); // connection drops mid-body
});
await new Promise(r => server.listen(0, '127.0.0.1', r));
const { port } = server.address();

const s3 = new S3Client({
    region: 'us-east-1',
    endpoint: `http://127.0.0.1:${port}`,
    forcePathStyle: true,
    credentials: { accessKeyId: 'x', secretAccessKey: 'y' },
    maxAttempts: 5 // retries do not help — see below
});

let outcome = '>>> HANG (never settled) <<<';
const work = (async () => {
    const { Body } = await s3.send(new GetObjectCommand({ Bucket: 'b', Key: 'k', ChecksumMode: 'ENABLED' }));
    console.log('send() resolved; Body =', Body.constructor.name);
    await Body.transformToString(); // <-- never settles
    outcome = 'resolved';
})().catch(e => (outcome = 'rejected: ' + (e?.name || e?.message)));

await Promise.race([work, new Promise(r => setTimeout(r, 5000))]);
console.log('outcome after 5s:', outcome);
server.close();
process.exit(0);

Output:

send() resolved; Body = ChecksumStream
outcome after 5s: >>> HANG (never settled) <<<

Variations that all reproduce the same hang:

  • Instead of calling req.socket.destroy(), never finish the body (a silent stall: the server stops sending, with no FIN/RST). Hangs identically.
  • Consume the body with node:stream/consumers text()/json(), or with a manual data/error/end collector, instead of transformToString(). All hang.
  • Remove the checksum headers (raw IncomingMessage body): the destroy() case then surfaces as an error and rejects, but the silent stall case still hangs (no body-read timeout).
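For the silent-stall variation, the mock endpoint can be swapped for something like the sketch below. `createStallingServer` is a hypothetical name used only for illustration; the headers and payload are copied from the repro script, and the only change is that nothing ever ends or destroys the connection.

```javascript
import http from 'node:http';

// Hypothetical helper for illustration: same mock response as repro.mjs, but
// the server simply stops sending instead of destroying the socket.
function createStallingServer() {
    return http.createServer((req, res) => {
        res.writeHead(200, {
            'Content-Length': '500000',
            'x-amz-checksum-crc32': 'AAAAAA==',
            'x-amz-checksum-type': 'FULL_OBJECT'
        });
        res.write('{"data":[' + 'x'.repeat(44000));
        // Intentionally no res.end() and no socket.destroy(): the connection
        // stays open and silent, so the client sees neither FIN nor RST.
    });
}
```

Pointing the S3Client's endpoint at a server created this way, in place of the `http.createServer(...)` call in the script, should exercise the stall case.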

Observed Behavior

  • client.send() resolves successfully as soon as the response headers arrive (HTTP 200). From the SDK's perspective the operation has already succeeded.
  • The returned Body (a ChecksumStream when the object has a checksum, which is the default for new objects) then never emits end or error when the underlying connection is destroyed/stalls mid-body.
  • The body-consuming promise (transformToString(), node:stream/consumers, or any data/end collector) is therefore orphaned and hangs forever.
  • Built-in retries do not fire: with maxAttempts: 5 the mock server receives exactly one request — because send() already succeeded, the retry strategy is done; the failure is in the body phase, which it doesn't cover.
  • requestTimeout on NodeHttpHandler does not help either — it only bounds the request up to the response headers, not the body transfer.

In our deployed system this manifested as a job that consumed a partial S3 object then hung indefinitely (7+ hours, 0% CPU) with no error, no socket, and no way to recover without killing the process.

Expected Behavior

A GetObject body read should not be able to hang forever on a dropped/stalled connection. Concretely, at least one of:

  1. A configurable response/body read (socket inactivity) timeout that applies to the whole operation including streaming the body — so a stalled body eventually rejects.
  2. The body stream (including the ChecksumStream wrapper) should propagate the underlying socket teardown as an error/premature-close on the stream the consumer is reading, so transformToString() / for await / data+end collectors reject instead of hanging.

Either would let callers catch the failure and retry, instead of silently deadlocking.
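Option 2 can be illustrated with plain Node streams. The sketch below is not ChecksumStream's implementation, and whether its internals resemble this is an assumption; `wrapPropagatingTeardown` is a hypothetical stand-in for any body wrapper. It relies on `stream.finished()` reporting `ERR_STREAM_PREMATURE_CLOSE` when the source closes before emitting `end`.

```javascript
import { PassThrough, finished } from 'node:stream';

// Hypothetical stand-in for a body wrapper; not the actual ChecksumStream code.
function wrapPropagatingTeardown(source) {
    const out = new PassThrough();

    // pipe() forwards data and a normal end, but on its own it does not
    // forward errors or a close-before-end from `source` to `out`.
    source.pipe(out);

    // finished() calls back once `source` is done. If `source` errored, or
    // closed before emitting 'end' (ERR_STREAM_PREMATURE_CLOSE), `err` is set
    // and is surfaced on the stream the consumer is actually reading.
    finished(source, (err) => {
        if (err) out.destroy(err);
    });

    return out;
}
```

With forwarding like this in place, a consumer iterating `out` (or calling a helper such as `text()` on it) would reject when the source is torn down, rather than waiting for an `end` that never comes.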

Possible Solution

  • Attach a socket inactivity timeout (socket.setTimeout) that remains armed through the body-streaming phase, not just until response headers, and destroy + error the body stream when it fires (configurable via the existing requestTimeout, or a new bodyTimeout/socketTimeout option).
  • Ensure @smithy/util-stream's ChecksumStream (and any other body wrappers) forward error/aborted/close-before-end from their source IncomingMessage to consumers, so a teardown is never swallowed.
  • Until then, document clearly that consumers must impose their own body-read timeout, since maxAttempts/requestTimeout do not cover this.
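The first suggestion, an inactivity timer that stays armed while the body streams, can be sketched with plain node:http. This is not the SDK's handler code, and it does not show how @smithy/node-http-handler would wire it in; `getWithBodyIdleTimeout` is a hypothetical helper that only demonstrates the mechanism (`res.setTimeout()` sets an idle timeout on the underlying socket).

```javascript
import http from 'node:http';

// Hypothetical helper for illustration; not part of the SDK.
// Resolves with the full body, or rejects if the socket is idle for `idleMs`
// at any point while the body is being received.
function getWithBodyIdleTimeout(url, idleMs) {
    return new Promise((resolve, reject) => {
        // agent: false avoids returning a socket with a leftover timeout to a pool.
        const req = http.get(url, { agent: false }, (res) => {
            const chunks = [];
            // Armed after the headers arrive. The socket timer measures
            // inactivity, so it restarts whenever bytes flow.
            res.setTimeout(idleMs, () => {
                res.destroy(new Error(`socket idle for ${idleMs} ms during body transfer`));
            });
            res.on('data', (chunk) => chunks.push(chunk));
            res.on('end', () => resolve(Buffer.concat(chunks)));
            res.on('error', reject);
        });
        req.on('error', reject);
    });
}
```

Because the timer is an inactivity timer rather than a total deadline, a large but steadily flowing body is not cut off; only a stalled one is.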

For reference, the workaround we shipped is a per-read idle watchdog that destroy()s the body stream if no bytes arrive within a timeout (resetting on each chunk), which makes the standard consumer reject; we then retry the whole GetObject. It works, but every SDK user reading object bodies needs this and most won't know to.
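A minimal sketch of such an idle watchdog follows. It is not the exact code we shipped: the names `withIdleWatchdog`, `readTextWithIdleTimeout`, and `BodyIdleTimeoutError` are hypothetical, and this version routes the body through a Transform via `pipeline()` so that the consumer reads a stream that is guaranteed to error, while `pipeline()` also destroys the original body.

```javascript
import { Transform, pipeline } from 'node:stream';
import { text } from 'node:stream/consumers';

// Hypothetical helper, not an SDK API. Returns a stream that mirrors `body`
// but is destroyed with an error if no chunk passes through for `idleMs` ms.
function withIdleWatchdog(body, idleMs) {
    let timer;
    const arm = () => {
        clearTimeout(timer);
        timer = setTimeout(() => {
            const err = new Error(`No body bytes received for ${idleMs} ms`);
            err.name = 'BodyIdleTimeoutError'; // hypothetical error name
            watchdog.destroy(err);
        }, idleMs);
    };
    const watchdog = new Transform({
        transform(chunk, _encoding, callback) {
            arm(); // a chunk arrived: restart the idle countdown
            callback(null, chunk);
        }
    });
    arm(); // also covers the case where the first byte never arrives
    // pipeline() destroys `body` when `watchdog` is destroyed, and clears the
    // timer once the transfer finishes or fails.
    pipeline(body, watchdog, () => clearTimeout(timer));
    return watchdog;
}

// Reads the whole body as text, rejecting if the transfer stalls.
function readTextWithIdleTimeout(body, idleMs) {
    return text(withIdleWatchdog(body, idleMs));
}
```

In the repro script, `await readTextWithIdleTimeout(Body, 1000)` would take the place of `Body.transformToString()`, and the caller can catch the rejection and reissue the whole GetObject. One caveat of this sketch: the timer measures chunks passing through the Transform, so a consumer that stops reading for longer than `idleMs` can trip it as well.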


Labels

bug, needs-triage
