|
19 | 19 |
|
20 | 20 | # HTTP GET Arrow Data: Compression Examples
|
21 | 21 |
|
22 |
| -This directory contains examples of HTTP servers/clients that transmit/receive data in the Arrow IPC streaming format and use compression (in various ways) to reduce the size of the transmitted data. |
| 22 | +This directory contains examples of HTTP servers/clients that transmit/receive |
| 23 | +data in the Arrow IPC streaming format and use compression (in various ways) to |
| 24 | +reduce the size of the transmitted data. |
| 25 | + |
| 26 | +Since we re-use the [Arrow IPC format][ipc] for transferring Arrow data over |
| 27 | +HTTP and both Arrow IPC and HTTP standards support compression on their own, |
| 28 | +there are at least two approaches to this problem: |
| 29 | + |
| 30 | +1. Compressed HTTP responses carrying Arrow IPC streams with uncompressed |
| 31 | + array buffers. |
| 32 | +2. Uncompressed HTTP responses carrying Arrow IPC streams with compressed |
| 33 | + array buffers. |
| 34 | + |
| 35 | +Applying both IPC buffer and HTTP compression to the same data is not |
| 36 | +recommended. The extra CPU overhead of decompressing the data twice is |
| 37 | +not worth any possible gains that double compression might bring. If |
| 38 | +compression ratios are unambiguously more important than reducing CPU |
| 39 | +overhead, then a different compression algorithm that optimizes for that can |
| 40 | +be chosen. |
| 41 | + |
| 42 | +This table shows the support for different compression algorithms in HTTP and |
| 43 | +Arrow IPC: |
| 44 | + |
| 45 | +| Codec | Identifier | HTTP Support | IPC Support | |
| 46 | +|----------- | ----------- | ------------- | ------------ | |
| 47 | +| GZip | `gzip` | X | | |
| 48 | +| DEFLATE | `deflate` | X | | |
| 49 | +| Brotli | `br` | X[^2] | | |
| 50 | +| Zstandard | `zstd` | X[^2] | X[^3] | |
| 51 | +| LZ4 | `lz4` | | X[^3] | |
| 52 | + |
| 53 | +Since not all Arrow IPC implementations support compression, HTTP compression |
| 54 | +based on accepted formats negotiated with the client is a great way to increase |
| 55 | +the chances of efficient data transfer. |
| 56 | + |
| 57 | +Servers may check the `Accept-Encoding` header of the client and choose the |
| 58 | +compression format in this order of preference: `zstd`, `br`, `gzip`, |
| 59 | +`identity` (no compression). If the client does not specify a preference, the |
| 60 | +only constraint on the server is the availability of the compression algorithm |
| 61 | +in the server environment. |
| 62 | + |
| 63 | +## Arrow IPC Compression |
| 64 | + |
| 65 | +When IPC buffer compression is preferred and servers can't assume all clients |
| 66 | +support it[^4], clients may be asked to explicitly list the supported compression |
| 67 | +algorithms in the request headers. The `Accept` header can be used for this |
| 68 | +since `Accept-Encoding` (and `Content-Encoding`) is used to control compression |
| 69 | +of the entire HTTP response stream and instruct HTTP clients (like browsers) to |
| 70 | +decompress the response before giving data to the application or saving the |
| 71 | +data. |
| 72 | + |
| 73 | + Accept: application/vnd.apache.arrow.stream; codecs="zstd, lz4" |
| 74 | + |
| 75 | +This is similar to clients requesting video streams by specifying the |
| 76 | +container format and the codecs they support |
| 77 | +(e.g. `Accept: video/webm; codecs="vp8, vorbis"`). |
| 78 | + |
| 79 | +The server is allowed to choose any of the listed codecs, or not compress the |
| 80 | +IPC buffers at all. Uncompressed IPC buffers should always be acceptable to
| 81 | +clients. |
| 82 | + |
| 83 | +If a server adopts this approach and a client does not specify any codecs in |
| 84 | +the `Accept` header, the server can fall back to checking the `Accept-Encoding`
| 85 | +header to pick a compression algorithm for the entire HTTP response stream. |
| 86 | + |
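The codec negotiation described above can be sketched in a few lines of Python. This is a minimal illustration, not the parser used by the examples in this directory: the function names (`accepted_ipc_codecs`, `choose_ipc_codec`) and the server preference order are assumptions made for the sketch, and escaped quotes inside parameter values are not handled.

```python
ARROW_STREAM = "application/vnd.apache.arrow.stream"


def _split_outside_quotes(text, sep):
    """Split `text` on `sep`, ignoring separators inside double quotes."""
    parts, current, in_quotes = [], [], False
    for ch in text:
        if ch == '"':
            in_quotes = not in_quotes
        if ch == sep and not in_quotes:
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    parts.append("".join(current))
    return parts


def accepted_ipc_codecs(accept_header):
    """Return the IPC codecs the client listed for the Arrow stream type."""
    for media_range in _split_outside_quotes(accept_header, ","):
        media_type, *params = _split_outside_quotes(media_range, ";")
        if media_type.strip().lower() != ARROW_STREAM:
            continue
        for param in params:
            name, _, value = param.partition("=")
            if name.strip().lower() == "codecs":
                value = value.strip().strip('"')
                return [c.strip().lower() for c in value.split(",") if c.strip()]
    return []


def choose_ipc_codec(accept_header, available=("zstd", "lz4")):
    """Pick the first server-supported codec the client listed.

    `available` is a hypothetical server preference order. Returns None when
    no listed codec is available, meaning: send uncompressed IPC buffers (or
    fall back to `Accept-Encoding` negotiation).
    """
    listed = accepted_ipc_codecs(accept_header)
    for codec in available:
        if codec in listed:
            return codec
    return None
```

A server using this sketch would call `choose_ipc_codec(request_headers.get("Accept", ""))` and configure its IPC writer with the result, leaving the buffers uncompressed when the result is `None`.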
| 87 | +To make debugging easier, servers may include the chosen compression codec(s)
| 88 | +in the `Content-Type` header of the response (quotes are optional): |
| 89 | + |
| 90 | + Content-Type: application/vnd.apache.arrow.stream; codecs=zstd |
| 91 | + |
| 92 | +This is not necessary for correct decompression because the payload already |
| 93 | +contains information that tells the IPC reader how to decompress the buffers, |
| 94 | +but it can help developers understand what is going on. |
| 95 | + |
| 96 | +When programmatically checking whether the `Content-Type` header contains a specific
| 97 | +format, it is important to use a parser that can handle parameters or look
| 98 | +only at the media type part of the header. This is not specific to the
| 99 | +Arrow IPC format, but a general rule for all media types. For example,
| 100 | +`application/json; charset=utf-8` should match `application/json`. |
| 101 | + |
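As a rough illustration of that rule, the following Python sketch compares only the media type part of a `Content-Type` value and ignores any parameters. The helper names are hypothetical and chosen for this example only.

```python
ARROW_STREAM = "application/vnd.apache.arrow.stream"


def media_type_of(content_type):
    """Return the lower-cased media type, dropping parameters such as codecs."""
    return content_type.split(";", 1)[0].strip().lower()


def is_arrow_stream(content_type):
    """True when the response carries an Arrow IPC stream."""
    return media_type_of(content_type) == ARROW_STREAM
```

A plain string comparison against the full header value would reject a response whose `Content-Type` carries a `codecs` parameter, which is why the parameters are stripped first.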
| 102 | +When considering use of IPC buffer compression, check the IPC format section of
| 103 | +the Arrow Implementation Status page[^5] to see whether the Arrow
| 104 | +implementations you are targeting support it.
| 105 | + |
| 106 | +## HTTP/1.1 Response Compression |
| 107 | + |
| 108 | +HTTP/1.1 offers an elaborate way for clients to specify their preferred |
| 109 | +content encoding (read: compression algorithm) using the `Accept-Encoding`
| 110 | +header.[^1] |
| 111 | + |
| 112 | +At least the Python server (in [`python/`](./python)) implements a fully |
| 113 | +compliant parser for the `Accept-Encoding` header. Application servers may |
| 114 | +choose to implement a simpler check of the `Accept-Encoding` header or assume |
| 115 | +that the client accepts the chosen compression scheme when talking to that |
| 116 | +server. |
| 117 | + |
| 118 | +Here is an example of a header that a client may send and what it means: |
| 119 | + |
| 120 | + Accept-Encoding: zstd;q=1.0, gzip;q=0.5, br;q=0.8, identity;q=0 |
| 121 | + |
| 122 | +This header says that the client prefers that the server compress the |
| 123 | +response with `zstd`, but if that is not possible, then `br` (Brotli) and `gzip`
| 124 | +are acceptable (in that order because 0.8 is greater than 0.5). The client |
| 125 | +does not want the response to be uncompressed. This is communicated by |
| 126 | +`"identity"` being listed with `q=0`. |
| 127 | + |
| 128 | +To tell the server the client only accepts `zstd` responses and nothing |
| 129 | +else, not even uncompressed responses, the client would send: |
| 130 | + |
| 131 | + Accept-Encoding: zstd, *;q=0 |
| 132 | + |
| 133 | +RFC 2616[^1] specifies the rules for how a server should interpret the |
| 134 | +`Accept-Encoding` header: |
| 135 | + |
| 136 | + A server tests whether a content-coding is acceptable, according to |
| 137 | + an Accept-Encoding field, using these rules: |
| 138 | + |
| 139 | + 1. If the content-coding is one of the content-codings listed in |
| 140 | + the Accept-Encoding field, then it is acceptable, unless it is |
| 141 | + accompanied by a qvalue of 0. (As defined in section 3.9, a |
| 142 | + qvalue of 0 means "not acceptable.") |
| 143 | + |
| 144 | + 2. The special "*" symbol in an Accept-Encoding field matches any |
| 145 | + available content-coding not explicitly listed in the header |
| 146 | + field. |
| 147 | + |
| 148 | + 3. If multiple content-codings are acceptable, then the acceptable |
| 149 | + content-coding with the highest non-zero qvalue is preferred. |
| 150 | + |
| 151 | + 4. The "identity" content-coding is always acceptable, unless |
| 152 | + specifically refused because the Accept-Encoding field includes |
| 153 | + "identity;q=0", or because the field includes "*;q=0" and does |
| 154 | + not explicitly include the "identity" content-coding. If the |
| 155 | + Accept-Encoding field-value is empty, then only the "identity" |
| 156 | + encoding is acceptable. |
| 157 | + |
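The four rules quoted above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the parser from the Python server in this directory: the function names, the default server preference order, and the `IMPLICIT_IDENTITY_Q` constant (used to rank an unlisted `identity` below every coding the client did list) are all choices made for this sketch.

```python
# Rank an implicitly acceptable "identity" below any listed coding; the
# smallest non-zero qvalue a client can express is 0.001.
IMPLICIT_IDENTITY_Q = 0.0001


def parse_accept_encoding(header):
    """Parse an Accept-Encoding value into a {coding: qvalue} dict."""
    prefs = {}
    for item in header.split(","):
        item = item.strip()
        if not item:
            continue
        coding, _, params = item.partition(";")
        q = 1.0  # a coding listed without a qvalue defaults to q=1
        for param in params.split(";"):
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value.strip())
                except ValueError:
                    q = 0.0  # treat a malformed qvalue as "not acceptable"
        prefs[coding.strip().lower()] = q
    return prefs


def choose_coding(header, available=("zstd", "br", "gzip")):
    """Return the content-coding to use, or None if nothing is acceptable.

    `header` is the raw Accept-Encoding value, or None when the header is
    absent. `available` lists the codings the server can produce, in server
    preference order. A None result corresponds to "406 Not Acceptable".
    """
    if header is None:
        # No preference stated: only server-side availability matters.
        return available[0] if available else "identity"
    prefs = parse_accept_encoding(header)
    wildcard = prefs.get("*")

    def qvalue(coding):
        if coding in prefs:  # rule 1: explicitly listed
            return prefs[coding]
        if wildcard is not None:  # rule 2: "*" covers unlisted codings
            return wildcard
        # rule 4: identity stays acceptable unless refused
        return IMPLICIT_IDENTITY_Q if coding == "identity" else 0.0

    ranked = [
        (qvalue(coding), -position, coding)
        for position, coding in enumerate((*available, "identity"))
    ]
    acceptable = [entry for entry in ranked if entry[0] > 0]
    # rule 3: highest non-zero qvalue wins; ties go to server preference
    return max(acceptable)[2] if acceptable else None
```

Ties between codings with equal qvalues are broken by the server's own preference order, which matches the common case of a client sending `gzip, deflate, zstd, br` without qvalues.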
| 158 | +If you're targeting web browsers, check the compatibility table of compression
| 159 | +algorithms on MDN Web Docs[^2].
| 160 | + |
| 161 | +Another important rule is that if the server compresses the response, it |
| 162 | +must include a `Content-Encoding` header in the response. |
| 163 | + |
| 164 | + If the content-coding of an entity is not "identity", then the |
| 165 | + response MUST include a Content-Encoding entity-header (section |
| 166 | + 14.11) that lists the non-identity content-coding(s) used. |
| 167 | + |
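A minimal sketch of that requirement, using only `gzip` from the Python standard library: the `encode_body` helper and its return shape are assumptions made for this illustration, and a real server would add the other codings it supports.

```python
import gzip


def encode_body(body, coding):
    """Return (headers, payload) for a response body and a chosen coding."""
    headers = {"Content-Type": "application/vnd.apache.arrow.stream"}
    if coding == "identity":
        # Uncompressed: no Content-Encoding header is needed.
        return headers, body
    if coding == "gzip":
        # Compressed: the response must say which coding was applied.
        headers["Content-Encoding"] = "gzip"
        return headers, gzip.compress(body)
    raise ValueError(f"unsupported content-coding: {coding}")
```

The payload here is compressed in one call for brevity; a server streaming a large Arrow IPC response would more likely compress it incrementally.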
| 168 | +Since not all servers implement the full `Accept-Encoding` header parsing logic, |
| 169 | +clients tend to stick to simple header values like `Accept-Encoding: identity` |
| 170 | +when no compression is desired, and `Accept-Encoding: gzip, deflate, zstd, br` |
| 171 | +when the client supports different compression formats and is indifferent to |
| 172 | +which one the server chooses. Clients should expect uncompressed responses as |
| 173 | +well in these cases. The only way to force a "406 Not Acceptable" response when
| 174 | +no compression is available is to include `identity;q=0` or `*;q=0` at
| 175 | +the end of the `Accept-Encoding` header. But that relies on the server
| 176 | +implementing the full `Accept-Encoding` handling logic. |
| 177 | + |
| 178 | + |
| 179 | +[^1]: [Fielding, R. et al. (1999). HTTP/1.1. RFC 2616, Section 14.3 Accept-Encoding.](https://www.rfc-editor.org/rfc/rfc2616#section-14.3) |
| 180 | +[^2]: [MDN Web Docs: Content-Encoding](https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Content-Encoding#browser_compatibility)
| 181 | +[^3]: [Arrow Columnar Format: Compression](https://arrow.apache.org/docs/format/Columnar.html#compression) |
| 182 | +[^4]: Web applications using the JavaScript Arrow implementation don't have |
| 183 | + access to the compression APIs to decompress `zstd` and `lz4` IPC buffers. |
| 184 | +[^5]: [Arrow Implementation Status: IPC Format](https://arrow.apache.org/docs/status.html#ipc-format) |
| 185 | + |
| 186 | +[ipc]: https://arrow.apache.org/docs/format/Columnar.html#serialization-and-interprocess-communication-ipc |