Commit 8cd6cce

http: Compressed response example in Python (#35)
* http: Compressed response example in Python
* complete the chunked response loop
* more strict list of available compressors
* simplify config
* better names
* turns out I can use for..in in this loop as well
* fix indent
* don't pick gzip as default when it's not in AVAILABLE_CODINGS
* suggest default filename
* fix brotli file extension
* expand README with note about simpler Accept-Encoding headers
* Add client.py
* reduce buffering and reduce latency
* expedite the yielding of the first buffer
* expand README
* remove test code
* add an option to use dictionary-encoded string column
* readme: add note about IPC compression codec negotiation
* remove BUFFER_ENTIRE_RESPONSE option
* write a parser based on a tokenizer
* make parser generic to Accept and Accept-Encoding
* support IPC buffer compression based on Accept header
* return codec in header
* extend client.py cases
* Update paragraph about double-compression
* Fix typo in README
* Add note about meaning and interpretation of Content-Type
* fix typo
* Apply suggestions from code review
* README.md: Break long lines
* Move make_requests.sh to curl/client.sh
* Add README files to sub directories
* Improve python/server/README.md
* Improve python/client/README.md
* Improve python/client/README.md
1 parent e006c3c commit 8cd6cce

8 files changed: +1016 -2 lines changed
http/get_compressed/README.md

+165-1
@@ -19,4 +19,168 @@

# HTTP GET Arrow Data: Compression Examples

This directory contains examples of HTTP servers/clients that transmit/receive
data in the Arrow IPC streaming format and use compression (in various ways) to
reduce the size of the transmitted data.

Since we re-use the [Arrow IPC format][ipc] for transferring Arrow data over
HTTP and both Arrow IPC and HTTP standards support compression on their own,
there are at least two approaches to this problem:

1. Compressed HTTP responses carrying Arrow IPC streams with uncompressed
   array buffers.
2. Uncompressed HTTP responses carrying Arrow IPC streams with compressed
   array buffers.

Applying both IPC buffer and HTTP compression to the same data is not
recommended. The extra CPU overhead of compressing and decompressing the data
twice is not worth any possible gains that double compression might bring. If
compression ratio is unambiguously more important than reducing CPU
overhead, choose a single compression algorithm that optimizes for ratio
instead.

This table shows the support for different compression algorithms in HTTP and
Arrow IPC:

| Codec     | Identifier | HTTP Support | IPC Support |
|-----------|------------|--------------|-------------|
| GZip      | `gzip`     | X            |             |
| DEFLATE   | `deflate`  | X            |             |
| Brotli    | `br`       | X[^2]        |             |
| Zstandard | `zstd`     | X[^2]        | X[^3]       |
| LZ4       | `lz4`      |              | X[^3]       |

Since not all Arrow IPC implementations support compression, HTTP compression
based on accepted formats negotiated with the client is a great way to increase
the chances of efficient data transfer.

Servers may check the `Accept-Encoding` header of the client and choose the
compression format in this order of preference: `zstd`, `br`, `gzip`,
`identity` (no compression). If the client does not specify a preference, the
only constraint on the server is the availability of the compression algorithm
in the server environment.
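
The order of preference above can be sketched in a few lines of Python. This is a minimal illustration, not the parser used by the example server: the names `PREFERRED_CODINGS` and `pick_coding` are hypothetical, and the sketch deliberately ignores q-values, so it only suits simple header values such as `gzip, deflate, br`.

```python
# Server-side order of preference described above (hypothetical constant name).
PREFERRED_CODINGS = ["zstd", "br", "gzip", "identity"]


def pick_coding(accept_encoding, available):
    """Pick a content-coding for a simple, q-value-free Accept-Encoding value.

    `accept_encoding` is the raw header value, or None when the header is
    absent. `available` is the set of codings the server can produce.
    """
    # "identity" (no compression) never needs a compressor library.
    candidates = [c for c in PREFERRED_CODINGS if c == "identity" or c in available]
    if accept_encoding is None:
        # No stated preference: only server-side availability matters.
        return candidates[0]
    accepted = {
        item.split(";", 1)[0].strip().lower()
        for item in accept_encoding.split(",")
        if item.strip()
    }
    for coding in candidates:
        if coding in accepted or "*" in accepted:
            return coding
    # Assumption of this sketch: fall back to an uncompressed response.
    return "identity"


print(pick_coding("gzip, deflate, br", {"gzip", "br"}))
```

Because q-values are ignored, a header such as `gzip;q=0` would be misread as accepting `gzip`; a server that must honor such headers needs the fuller rules quoted later in this README.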

## Arrow IPC Compression

When IPC buffer compression is preferred and servers can't assume all clients
support it[^4], clients may be asked to explicitly list the supported compression
algorithms in the request headers. The `Accept` header can be used for this
since `Accept-Encoding` (and `Content-Encoding`) is used to control compression
of the entire HTTP response stream and instruct HTTP clients (like browsers) to
decompress the response before giving data to the application or saving the
data.

    Accept: application/vnd.apache.arrow.stream; codecs="zstd, lz4"

This is similar to clients requesting video streams by specifying the
container format and the codecs they support
(e.g. `Accept: video/webm; codecs="vp8, vorbis"`).

The server is allowed to choose any of the listed codecs, or not compress the
IPC buffers at all. Uncompressed IPC buffers should always be acceptable to
clients.
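
As a rough illustration of this negotiation, the sketch below extracts the `codecs` parameter from an `Accept` header and picks the first listed codec that the server can produce. The function names are hypothetical and the parsing is simplified (for example, backslash escapes inside quoted strings are not handled); the example server in this directory uses its own tokenizer-based parser.

```python
ARROW_STREAM = "application/vnd.apache.arrow.stream"


def split_outside_quotes(text, sep):
    """Split `text` on `sep`, ignoring separators inside double quotes."""
    parts, current, in_quotes = [], [], False
    for ch in text:
        if ch == '"':
            in_quotes = not in_quotes
        if ch == sep and not in_quotes:
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    parts.append("".join(current))
    return parts


def requested_ipc_codecs(accept):
    """Return the codecs listed for the Arrow stream media type, in order."""
    codecs = []
    for media_range in split_outside_quotes(accept, ","):
        media_type, *params = split_outside_quotes(media_range, ";")
        if media_type.strip().lower() != ARROW_STREAM:
            continue
        for param in params:
            name, _, value = param.partition("=")
            if name.strip().lower() == "codecs":
                listed = value.strip().strip('"').split(",")
                codecs += [c.strip().lower() for c in listed if c.strip()]
    return codecs


def pick_ipc_codec(accept, available=("zstd", "lz4")):
    """Return a codec both sides support, or None for uncompressed buffers."""
    for codec in requested_ipc_codecs(accept or ""):
        if codec in available:
            return codec
    return None


header = 'application/vnd.apache.arrow.stream; codecs="zstd, lz4"'
print(pick_ipc_codec(header))
```

Returning `None` models the rule that uncompressed IPC buffers are always an acceptable answer.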

If a server adopts this approach and a client does not specify any codecs in
the `Accept` header, the server can fall back to checking the `Accept-Encoding`
header to pick a compression algorithm for the entire HTTP response stream.

To make debugging easier, servers may include the chosen compression codec(s)
in the `Content-Type` header of the response (quotes are optional):

    Content-Type: application/vnd.apache.arrow.stream; codecs=zstd

This is not necessary for correct decompression because the payload already
contains information that tells the IPC reader how to decompress the buffers,
but it can help developers understand what is going on.

When programmatically checking if the `Content-Type` header contains a specific
format, it is important to use a parser that can handle parameters or look
only at the media type part of the header. This is not specific to the
Arrow IPC format, but a general rule for all media types. For example,
`application/json; charset=utf-8` should match `application/json`.
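
A minimal sketch of the "look only at the media type" approach, using hypothetical helper names, could be:

```python
def media_type_of(content_type):
    """Return the media type of a Content-Type value, without parameters."""
    return content_type.split(";", 1)[0].strip().lower()


def is_arrow_stream(content_type):
    """True when the response carries an Arrow IPC stream, codecs or not."""
    return media_type_of(content_type) == "application/vnd.apache.arrow.stream"


print(is_arrow_stream("application/vnd.apache.arrow.stream; codecs=zstd"))
print(media_type_of("application/json; charset=utf-8"))
```

Comparing the whole header value with `==` would wrongly reject the first example because of the `codecs` parameter.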

When considering use of IPC buffer compression, check the IPC format section of
the Arrow Implementation Status page[^5] to see whether the Arrow
implementations you are targeting support it.

## HTTP/1.1 Response Compression

HTTP/1.1 offers an elaborate way for clients to specify their preferred
content encoding (read: compression algorithm) using the `Accept-Encoding`
header.[^1]

At least the Python server (in [`python/`](./python)) implements a fully
compliant parser for the `Accept-Encoding` header. Application servers may
choose to implement a simpler check of the `Accept-Encoding` header or assume
that the client accepts the chosen compression scheme when talking to that
server.

Here is an example of a header that a client may send and what it means:

    Accept-Encoding: zstd;q=1.0, gzip;q=0.5, br;q=0.8, identity;q=0

This header says that the client prefers that the server compress the
response with `zstd`, but if that is not possible, then `br` (Brotli) and
`gzip` are acceptable (in that order because 0.8 is greater than 0.5). The
client does not want the response to be uncompressed. This is communicated by
`identity` being listed with `q=0`.

To tell the server the client only accepts `zstd` responses and nothing
else, not even uncompressed responses, the client would send:

    Accept-Encoding: zstd, *;q=0
133+
RFC 2616[^1] specifies the rules for how a server should interpret the
134+
`Accept-Encoding` header:
135+
136+
A server tests whether a content-coding is acceptable, according to
137+
an Accept-Encoding field, using these rules:
138+
139+
1. If the content-coding is one of the content-codings listed in
140+
the Accept-Encoding field, then it is acceptable, unless it is
141+
accompanied by a qvalue of 0. (As defined in section 3.9, a
142+
qvalue of 0 means "not acceptable.")
143+
144+
2. The special "*" symbol in an Accept-Encoding field matches any
145+
available content-coding not explicitly listed in the header
146+
field.
147+
148+
3. If multiple content-codings are acceptable, then the acceptable
149+
content-coding with the highest non-zero qvalue is preferred.
150+
151+
4. The "identity" content-coding is always acceptable, unless
152+
specifically refused because the Accept-Encoding field includes
153+
"identity;q=0", or because the field includes "*;q=0" and does
154+
not explicitly include the "identity" content-coding. If the
155+
Accept-Encoding field-value is empty, then only the "identity"
156+
encoding is acceptable.
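
The four rules can be sketched as follows. This is an illustrative reading of the rules, not the parser shipped with the Python server: the function names are hypothetical, malformed q-values are treated as 0, ties are broken by the server's own preference order, and an unlisted `identity` is treated as a last-resort fallback (one reasonable interpretation of rule 4).

```python
def parse_accept_encoding(value):
    """Return {coding: qvalue} for an Accept-Encoding field value."""
    qvalues = {}
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue
        coding, *params = [part.strip() for part in item.split(";")]
        q = 1.0  # a coding listed without a qvalue defaults to 1
        for param in params:
            name, _, raw = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(raw)
                except ValueError:
                    q = 0.0  # assumption: malformed qvalue means "not acceptable"
        qvalues[coding.lower()] = q
    return qvalues


def choose_coding(accept_encoding, server_preference=("zstd", "br", "gzip")):
    """Return the chosen coding, or None when nothing is acceptable (406)."""
    if accept_encoding is None:
        # Header absent: the client states no preference.
        return server_preference[0] if server_preference else "identity"
    qvalues = parse_accept_encoding(accept_encoding)
    wildcard = qvalues.get("*")  # rule 2

    best, best_q = None, 0.0
    for coding in server_preference:
        # Rule 1: listed codings use their own qvalue; others use "*" if present.
        q = qvalues.get(coding, wildcard if wildcard is not None else 0.0)
        if q > best_q:  # rule 3; ties keep the server's earlier preference
            best, best_q = coding, q

    identity_q = qvalues.get("identity", wildcard)
    if identity_q is not None and identity_q > best_q:
        return "identity"
    if best is not None:
        return best
    # Rule 4: identity stays acceptable unless refused by "identity;q=0"
    # or by "*;q=0" without an explicit "identity" entry.
    return "identity" if identity_q is None or identity_q > 0 else None


print(choose_coding("zstd;q=1.0, gzip;q=0.5, br;q=0.8, identity;q=0"))
```

A `None` result is where a server would answer "406 Not Acceptable", for example when the client sends `zstd, *;q=0` and the server has no `zstd` support.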

If you're targeting web browsers, check the compatibility table of compression
algorithms on MDN Web Docs[^2].

Another important rule is that if the server compresses the response, it
must include a `Content-Encoding` header in the response.

> If the content-coding of an entity is not "identity", then the
> response MUST include a Content-Encoding entity-header (section
> 14.11) that lists the non-identity content-coding(s) used.
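
To make the `Content-Encoding` rule concrete, here is a small sketch that compresses a fully buffered body and builds the matching response headers. It is limited to codings available in the Python standard library (`gzip` and `deflate`); `encode_body` is a hypothetical helper, and `zstd` or `br` would require third-party packages that this sketch does not assume.

```python
import gzip
import zlib


def encode_body(body, coding):
    """Return (headers, payload) for a response body under the chosen coding."""
    headers = {"Content-Type": "application/vnd.apache.arrow.stream"}
    if coding == "gzip":
        payload = gzip.compress(body)
    elif coding == "deflate":
        # The HTTP "deflate" coding is the zlib container format.
        payload = zlib.compress(body)
    elif coding == "identity":
        payload = body
    else:
        raise ValueError(f"coding not available in this sketch: {coding}")
    if coding != "identity":
        # Required whenever the content-coding is not "identity".
        headers["Content-Encoding"] = coding
    headers["Content-Length"] = str(len(payload))
    return headers, payload


headers, payload = encode_body(b"arrow bytes " * 100, "gzip")
print(headers)
```

Uncompressed responses simply omit the `Content-Encoding` header.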

Since not all servers implement the full `Accept-Encoding` header parsing logic,
clients tend to stick to simple header values like `Accept-Encoding: identity`
when no compression is desired, and `Accept-Encoding: gzip, deflate, zstd, br`
when the client supports different compression formats and is indifferent to
which one the server chooses. Clients should expect uncompressed responses as
well in these cases. The only way to force a "406 Not Acceptable" response when
no compression is available is to send `identity;q=0` or `*;q=0` at the end of
the `Accept-Encoding` header. But that relies on the server implementing the
full `Accept-Encoding` handling logic.


[^1]: [Fielding, R. et al. (1999). HTTP/1.1. RFC 2616, Section 14.3 Accept-Encoding.](https://www.rfc-editor.org/rfc/rfc2616#section-14.3)
[^2]: [MDN Web Docs: Content-Encoding (browser compatibility)](https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Content-Encoding#browser_compatibility)
[^3]: [Arrow Columnar Format: Compression](https://arrow.apache.org/docs/format/Columnar.html#compression)
[^4]: Web applications using the JavaScript Arrow implementation don't have
    access to the compression APIs to decompress `zstd` and `lz4` IPC buffers.
[^5]: [Arrow Implementation Status: IPC Format](https://arrow.apache.org/docs/status.html#ipc-format)

[ipc]: https://arrow.apache.org/docs/format/Columnar.html#serialization-and-interprocess-communication-ipc
+80
@@ -0,0 +1,80 @@
<!---
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->

# HTTP GET Arrow Data: Compressed Arrow Data Examples

This directory contains a simple `curl` script that issues multiple HTTP GET
requests to the server implemented in the parent directory, negotiating
different compression algorithms for the Arrow IPC stream data and piping the
output to different files with extensions that indicate the compression
algorithm used.

To run this example, first start one of the server examples in the parent
directory, then run the `client.sh` script.

You can check all the sizes with a simple command:

```bash
$ du -sh out* | sort -gr
816M out.arrows
804M out_from_chunked.arrows
418M out_from_chunked.arrows+lz4
405M out.arrows+lz4
257M out.arrows.gz
256M out_from_chunked.arrows.gz
229M out_from_chunked.arrows+zstd
229M out.arrows+zstd
220M out.arrows.zstd
219M out_from_chunked.arrows.zstd
39M out_from_chunked.arrows.br
38M out.arrows.br
```
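
To compare these numbers more easily, the short sketch below turns the sizes of the non-chunked files listed above into compression ratios relative to the uncompressed `out.arrows`. The sizes are copied from the listing (in MiB as reported by `du -h`, so the ratios are approximate); the variable names are only for illustration.

```python
# Sizes in MiB, copied from the `du` listing above (non-chunked files only).
sizes_mb = {
    "out.arrows": 816,
    "out.arrows+lz4": 405,
    "out.arrows.gz": 257,
    "out.arrows+zstd": 229,
    "out.arrows.zstd": 220,
    "out.arrows.br": 38,
}

baseline = sizes_mb["out.arrows"]
ratios = {name: baseline / size for name, size in sizes_mb.items()}

for name, ratio in ratios.items():
    print(f"{name:<18} {sizes_mb[name]:>4}M  {ratio:5.1f}x")
```

For this particular dataset, LZ4 IPC compression roughly halves the payload, while Brotli HTTP compression shrinks it by a factor of about 21.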

> [!WARNING]
> Better compression is not the only relevant metric as it might come with a
> trade-off in terms of CPU usage. The best compression algorithm for your use
> case will depend on your specific requirements.

## Meaning of the file extensions

Responses to HTTP/1.0 requests are not chunked; they get buffered in memory
at the server before being sent to the client. If compressed, they end up
slightly smaller than the results of chunked responses, but the extra delay
before the first byte is not worth it in most cases.

- `out.arrows` (Uncompressed)
- `out.arrows.gz` (Gzip HTTP compression)
- `out.arrows.zstd` (Zstandard HTTP compression)
- `out.arrows.br` (Brotli HTTP compression)

- `out.arrows+zstd` (Zstandard IPC compression)
- `out.arrows+lz4` (LZ4 IPC compression)

Responses to HTTP/1.1 requests are sent with `Transfer-Encoding: chunked`,
so the data goes out in smaller chunks that are written to the socket as soon
as they are ready. This is useful for large responses that take a long time to
generate, at the cost of a small overhead caused by the independent compression
of each chunk.
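
One way to compress a response incrementally, so that each chunk can be written to the socket as soon as it is ready, is sketched below with the standard-library `zlib` module. This is an assumption-laden illustration rather than the example server's code: it keeps a single gzip stream and issues a sync flush after every input chunk, which is a related but not identical technique to compressing each chunk independently. The function name `gzip_chunks` is hypothetical.

```python
import gzip
import zlib


def gzip_chunks(chunks):
    """Yield gzip-compressed pieces as soon as each input chunk is compressed."""
    # wbits=31 selects the gzip container (16 + 15-bit window).
    compressor = zlib.compressobj(wbits=31)
    for chunk in chunks:
        # Z_SYNC_FLUSH forces out all pending output so the client can decode
        # everything sent so far, at the cost of a few extra bytes per chunk.
        data = compressor.compress(chunk) + compressor.flush(zlib.Z_SYNC_FLUSH)
        if data:
            yield data
    # Final flush writes the last block and the gzip trailer.
    yield compressor.flush()


original = [b"a" * 1000, b"b" * 1000]
pieces = list(gzip_chunks(original))
assert gzip.decompress(b"".join(pieces)) == b"".join(original)
```

The per-chunk flush is what trades a slightly larger output for a shorter time to first byte.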

- `out_from_chunked.arrows` (Uncompressed)
- `out_from_chunked.arrows.gz` (Gzip HTTP compression)
- `out_from_chunked.arrows.zstd` (Zstandard HTTP compression)
- `out_from_chunked.arrows.br` (Brotli HTTP compression)

- `out_from_chunked.arrows+lz4` (LZ4 IPC compression)
- `out_from_chunked.arrows+zstd` (Zstandard IPC compression)
+46
@@ -0,0 +1,46 @@
#!/bin/sh

# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.

CURL="curl --verbose"
URI="http://localhost:8008"
OUT_HTTP1=out.arrows
OUT_CHUNKED=out_from_chunked.arrows

# HTTP/1.0 means that response is not chunked and not compressed...
$CURL --http1.0 -o $OUT_HTTP1 $URI
# ...but it may be compressed with an explicitly set Accept-Encoding
# header
$CURL --http1.0 -H "Accept-Encoding: gzip, *;q=0" -o $OUT_HTTP1.gz $URI
$CURL --http1.0 -H "Accept-Encoding: zstd, *;q=0" -o $OUT_HTTP1.zstd $URI
$CURL --http1.0 -H "Accept-Encoding: br, *;q=0" -o $OUT_HTTP1.br $URI
# ...or with IPC buffer compression if the Accept header specifies codecs.
$CURL --http1.0 -H "Accept: application/vnd.apache.arrow.stream; codecs=\"zstd, lz4\"" -o $OUT_HTTP1+zstd $URI
$CURL --http1.0 -H "Accept: application/vnd.apache.arrow.stream; codecs=lz4" -o $OUT_HTTP1+lz4 $URI

# HTTP/1.1 means compression is on by default...
# ...but it can be refused with the Accept-Encoding: identity header.
$CURL -H "Accept-Encoding: identity" -o $OUT_CHUNKED $URI
# ...with gzip if no Accept-Encoding header is set.
$CURL -o $OUT_CHUNKED.gz $URI
# ...or with the compression algorithm specified in the Accept-Encoding.
$CURL -H "Accept-Encoding: zstd, *;q=0" -o $OUT_CHUNKED.zstd $URI
$CURL -H "Accept-Encoding: br, *;q=0" -o $OUT_CHUNKED.br $URI
# ...or with IPC buffer compression if the Accept header specifies codecs.
$CURL -H "Accept: application/vnd.apache.arrow.stream; codecs=\"zstd, lz4\"" -o $OUT_CHUNKED+zstd $URI
$CURL -H "Accept: application/vnd.apache.arrow.stream; codecs=lz4" -o $OUT_CHUNKED+lz4 $URI
@@ -0,0 +1,32 @@
<!---
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->

# HTTP GET Arrow Data: Compressed Arrow Data Examples

This directory contains an HTTP client implemented in Python that issues multiple
requests to one of the server examples implemented in the parent directory,
negotiating different compression algorithms for the Arrow IPC stream data.

To run this example, first start one of the compressed server examples in the
parent directory, then:

```sh
pip install pyarrow
python client.py
```
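
For orientation, a client of this kind might look roughly like the sketch below. It is not the contents of `client.py`: the helper names are hypothetical, the server address is assumed to be the `http://localhost:8008` used by the `curl` script, only `gzip` HTTP compression is decoded (the standard library has no Brotli support), and `pyarrow` is imported lazily so the request-building part runs without it.

```python
import gzip
import urllib.request

URI = "http://localhost:8008"  # assumed server address


def build_request(uri=URI, accept_encoding=None, ipc_codecs=None):
    """Build a GET request that negotiates HTTP or IPC buffer compression."""
    headers = {}
    if accept_encoding is not None:
        headers["Accept-Encoding"] = accept_encoding
    if ipc_codecs:
        codecs = ", ".join(ipc_codecs)
        headers["Accept"] = f'application/vnd.apache.arrow.stream; codecs="{codecs}"'
    return urllib.request.Request(uri, headers=headers)


def fetch_table(request):
    """Send the request and read the Arrow IPC stream into a pyarrow Table."""
    import pyarrow as pa  # third-party dependency, imported only when needed

    with urllib.request.urlopen(request) as response:
        source = response
        # urllib does not undo Content-Encoding on its own.
        if response.headers.get("Content-Encoding") == "gzip":
            source = gzip.GzipFile(fileobj=response)
        # IPC buffer compression (zstd/lz4) is handled by the IPC reader itself.
        return pa.ipc.open_stream(source).read_all()


request = build_request(accept_encoding="gzip, *;q=0")
print(request.header_items())
```

Calling `fetch_table(request)` requires one of the example servers to be running.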
