Commit 5fb6547

Serving and consuming an HTTP multipart/mixed response in Python (#33)
* http/python: Rewrite README section about chunking
* get_multipart/python: Add server.py and simple_client.py
* get_multipart/python: Explain what urlsafe characters are
* get_multipart/python: Add two new READMEs
* get_multipart/python: Move module-level docs to README
* fixup! get_multipart/python: Add two new READMEs
* Add a general boundary generation algorithm recommendation
* Always specify policy
* Use the right md syntax for footnotes
* Change note to warning
* Fix positioning of footnote links
* fixup! Fix positioning of footnote links
1 parent c752b5c commit 5fb6547

6 files changed, +625 -2 lines changed


http/get_multipart/README.md

+41-1
@@ -19,4 +19,44 @@

# HTTP GET Arrow Data: Multipart Examples

This directory contains examples of HTTP servers/clients that send/receive a multipart response (`Content-Type: multipart/mixed`) containing JSON data (`Content-Type: application/json`), Arrow IPC stream data (`Content-Type: application/vnd.apache.arrow.stream`), and (optionally) plain text data (`Content-Type: text/plain`).

## Picking a Boundary

The `multipart/mixed` response format uses a boundary string to separate the
parts. This string **must not appear in the content of any part** according
to RFC 1341.[^1]

We **do not recommend** checking for the boundary string in the content of the
parts, as that would prevent streaming them, which in turn would increase the
server's memory usage and waste CPU time.

### Recommended Algorithm

For every `multipart/mixed` response produced by the server:
1. Using a CSPRNG,[^2] generate a byte string with enough entropy to make the
   probability of collision[^3] negligible (at least 160 bits = 20 bytes).[^4]
2. Encode the byte string in a way that is safe to use in HTTP headers. We
   recommend using the `base64url` encoding described in RFC 4648.[^5]
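
To get a feel for why 160 bits is comfortably enough, the birthday bound referenced above can be evaluated numerically. This is only an illustrative sketch: the function name `collision_probability` and the workload of 10^12 boundaries are assumptions made for the example, and boundaries are treated as uniformly random bit strings. It uses the standard approximation p ≈ 1 − exp(−n(n−1)/2^(k+1)) for n strings of k bits.

```python
import math


def collision_probability(num_boundaries: int, entropy_bits: int) -> float:
    """Approximate birthday-bound probability that at least two of
    `num_boundaries` uniformly random `entropy_bits`-bit strings are equal."""
    pairs = num_boundaries * (num_boundaries - 1) / 2
    # -expm1(-x) computes 1 - exp(-x) without losing precision for tiny x.
    return -math.expm1(-pairs / 2.0**entropy_bits)


# Hypothetical workload: 10**12 boundaries generated over a server's lifetime.
for bits in (32, 64, 160):
    print(f"{bits:3d} bits: {collision_probability(10**12, bits):.3e}")
```

With short boundaries the estimate saturates at 1 for this workload, while at 160 bits it stays many orders of magnitude below any practical concern.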

`base64url` encoding is a variant of `base64` encoding that uses `-` and `_`
instead of `+` and `/` respectively. In this recommendation, the padding
characters (`=`) are also omitted.

This algorithm can be implemented in Python using the `secrets.token_urlsafe()`
function.

If you generate a boundary string with a generous 224 bits of entropy
(i.e. 28 bytes), the base64url encoding will produce a 38-character
string, which is well below the limit defined by RFC 1341 (70 characters).

    >>> import secrets
    >>> boundary = secrets.token_urlsafe(28)
    >>> len(boundary)
    38
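
For readers who want to see the two steps of the algorithm spelled out, here is a minimal sketch that performs them by hand instead of calling `secrets.token_urlsafe()`. The helper name `generate_boundary` and the 20-byte minimum check are assumptions made for this illustration, not part of the examples in this directory.

```python
import base64
import secrets


def generate_boundary(num_bytes: int = 28) -> str:
    """Return a fresh random boundary encoded as unpadded base64url."""
    if num_bytes < 20:
        raise ValueError("use at least 20 bytes (160 bits) of entropy")
    raw = secrets.token_bytes(num_bytes)  # step 1: bytes from a CSPRNG
    encoded = base64.urlsafe_b64encode(raw)  # step 2: base64url (RFC 4648)
    return encoded.rstrip(b"=").decode("ascii")  # drop the "=" padding


boundary = generate_boundary()
# The boundary is then advertised in the response's Content-Type header.
content_type = f'multipart/mixed; boundary="{boundary}"'
print(content_type)
```

A new boundary should be generated for every response, so a function like this would be called once per request rather than once at server start-up.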


[^1]: [RFC 1341 - Section 7.2 The Multipart Content-Type](https://www.w3.org/Protocols/rfc1341/7_2_Multipart.html)
[^2]: [Cryptographically Secure Pseudo-Random Number Generator](https://en.wikipedia.org/wiki/Cryptographically_secure_pseudorandom_number_generator)
[^3]: [Birthday Problem](https://en.wikipedia.org/wiki/Birthday_problem)
[^4]: [Hash Collision Probabilities](https://preshing.com/20110504/hash-collision-probabilities/)
[^5]: [RFC 4648 - Section 5 Base 64 Encoding with URL and Filename Safe Alphabet](https://tools.ietf.org/html/rfc4648#section-5)
+52
@@ -0,0 +1,52 @@
<!---
  Licensed to the Apache Software Foundation (ASF) under one
  or more contributor license agreements.  See the NOTICE file
  distributed with this work for additional information
  regarding copyright ownership.  The ASF licenses this file
  to you under the Apache License, Version 2.0 (the
  "License"); you may not use this file except in compliance
  with the License.  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing,
  software distributed under the License is distributed on an
  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
  KIND, either express or implied.  See the License for the
  specific language governing permissions and limitations
  under the License.
-->

# HTTP GET Arrow Data in multipart/mixed: Python Client Example

This directory contains an example of a Python HTTP client that receives a
`multipart/mixed` response from the server. The client:
1. Sends an HTTP GET request to a server.
2. Receives an HTTP 200 response from the server, with the response body
   containing a `multipart/mixed` response.
3. Parses the `multipart/mixed` response using the `email` module.[^1]
4. Extracts the JSON part, parses it, and prints a preview of the JSON data.
5. Extracts the Arrow stream part, reads the Arrow stream, and sums the
   total number of records in the entire Arrow stream.
6. Extracts the plain text part and prints it as is.
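
The parsing step can be tried without a running server. The sketch below feeds a small hand-written body to the `email` module's feed parser; the boundary value, the sample payloads, and the helper name `parse_parts` are made up for illustration, and the Arrow part is left out so that only the standard library is needed.

```python
import email.parser
import json
from email import policy

BOUNDARY = "example-boundary"  # hypothetical; a real server sends a random one

# Hand-written stand-in for the bytes a server would send as the response body.
BODY = (
    f"--{BOUNDARY}\r\n"
    "Content-Type: application/json\r\n"
    "\r\n"
    '[{"id": 1}, {"id": 2}]\r\n'
    f"--{BOUNDARY}\r\n"
    "Content-Type: text/plain\r\n"
    "\r\n"
    "Hello!\r\n"
    f"--{BOUNDARY}--\r\n"
).encode("utf-8")


def parse_parts(body: bytes, boundary: str) -> list:
    """Parse a multipart/mixed body into a list of message parts."""
    parser = email.parser.BytesFeedParser(policy=policy.default)
    # The HTTP headers are not part of the body, so feed a synthetic MIME
    # header first to tell the parser which boundary to look for.
    header = (
        "MIME-Version: 1.0\r\n"
        f'Content-Type: multipart/mixed; boundary="{boundary}"\r\n'
        "\r\n"
    )
    parser.feed(header.encode("utf-8"))
    parser.feed(body)
    message = parser.close()
    if not message.is_multipart():
        raise ValueError("body is not a multipart message")
    return list(message.iter_parts())


json_part, text_part = parse_parts(BODY, BOUNDARY)
print(json_part.get_content_type(), json.loads(json_part.get_content()))
print(text_part.get_content_type(), text_part.get_content().strip())
```

In a real client the body would be fed to the parser in chunks as it arrives from the socket, rather than in a single `feed()` call.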

To run this example, first start one of the server examples in the parent
directory, then:

```sh
pip install pyarrow
python simple_client.py
```

> [!WARNING]
> This `simple_client.py` parses the multipart response using the multipart
> message parser from the Python `email` module. This module puts the entire
> message in memory and seems to spend a lot of time looking for part delimiters
> and encoding/decoding the parts.
>
> On my machine, `multipart/mixed` parsing accounts for 85% of the total
> execution time. Once the ~1GB Arrow stream message is fully in memory,
> parsing it takes only 0.06% of the total execution time.

[^1]: The `multipart/mixed` standard, used by HTTP, is derived from the MIME
    standard used in email.
@@ -0,0 +1,144 @@
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements.  See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership.  The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License.  You may obtain a copy of the License at
#
#   http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied.  See the License for the
# specific language governing permissions and limitations
# under the License.

from email import policy
import email.parser  # imported explicitly: BytesFeedParser lives in this submodule
import json
import pyarrow as pa
import sys
import time
import urllib.request

JSON_FORMAT = "application/json"
TEXT_FORMAT = "text/plain"
ARROW_STREAM_FORMAT = "application/vnd.apache.arrow.stream"

start_time = time.time()
response_parsing_time = 0  # time to parse the multipart message
arrow_stream_parsing_time = 0  # time to parse the Arrow stream


def parse_multipart_message(response, boundary, buffer_size=8192):
    """
    Parse a multipart/mixed HTTP response into a list of Message objects.

    Returns
    -------
    list of email.message.Message containing the parts of the multipart message.
    """
    global response_parsing_time
    buffer_size = max(buffer_size, 8192)
    buffer = bytearray(buffer_size)

    # The HTTP headers are not part of the response body, so feed the parser
    # a synthetic MIME header that carries the boundary.
    header = f'MIME-Version: 1.0\r\nContent-Type: multipart/mixed; boundary="{boundary}"\r\n\r\n'
    feedparser = email.parser.BytesFeedParser(policy=policy.default)
    feedparser.feed(header.encode("utf-8"))
    # Only the time spent inside the parser is measured, not the network reads.
    while bytes_read := response.readinto(buffer):
        start_time = time.time()
        feedparser.feed(buffer[0:bytes_read])
        response_parsing_time += time.time() - start_time
    start_time = time.time()
    message = feedparser.close()
    response_parsing_time += time.time() - start_time
    assert message.is_multipart()
    return message.get_payload()


def process_json_part(message):
    assert message.get_content_type() == JSON_FORMAT
    # Use the `message` argument rather than relying on a global `part`.
    payload = message.get_payload()
    print(f"-- {len(payload)} bytes of JSON data:")
    try:
        PREVIEW_SIZE = 5
        data = json.loads(payload)
        print("[")
        for i in range(min(PREVIEW_SIZE, len(data))):
            print(f" {data[i]}")
        if len(data) > PREVIEW_SIZE:
            print(f" ...+{len(data) - PREVIEW_SIZE} entries...")
        print("]")
    except json.JSONDecodeError as e:
        print(f"Error parsing JSON data: {e}\n", file=sys.stderr)
        # `data` is unbound when decoding fails, so return explicitly.
        return None
    return data


def process_arrow_stream_message(message):
    global arrow_stream_parsing_time
    assert message.get_content_type() == ARROW_STREAM_FORMAT
    # Use the `message` argument rather than relying on a global `part`.
    payload = message.get_payload(decode=True)
    print(f"-- {len(payload)} bytes of Arrow data:")
    batches = []
    num_batches = 0
    num_records = 0
    start_time = time.time()
    with pa.ipc.open_stream(payload) as reader:
        schema = reader.schema
        print(f"Schema: \n{schema}\n")
        try:
            while True:
                batch = reader.read_next_batch()
                batches.append(batch)
                num_batches += 1
                num_records += batch.num_rows
        except StopIteration:
            pass
    arrow_stream_parsing_time = time.time() - start_time
    print(f"Parsed {num_records} records in {num_batches} batch(es)")
    # The caller assigns the result, so return the batches that were read.
    return batches


def process_text_part(message):
    assert message.get_content_type() == TEXT_FORMAT
    # Use the `message` argument rather than relying on a global `part`.
    payload = message.get_payload()
    print("-- Text Message:")
    print(payload, end="")
    print("-- End of Text Message --")


response = urllib.request.urlopen("http://localhost:8008?include_footnotes")

content_type = response.headers.get_content_type()
if content_type != "multipart/mixed":
    raise ValueError(f"Expected multipart/mixed Content-Type, got {content_type}")
boundary = response.headers.get_boundary()
if boundary is None or len(boundary) == 0:
    raise ValueError("No multipart boundary found in Content-Type header")

parts = parse_multipart_message(response, boundary, buffer_size=64 * 1024)
batches = None
for part in parts:
    content_type = part.get_content_type()
    if content_type == JSON_FORMAT:
        process_json_part(part)
    elif content_type == ARROW_STREAM_FORMAT:
        batches = process_arrow_stream_message(part)
    elif content_type == TEXT_FORMAT:
        process_text_part(part)

end_time = time.time()
execution_time = end_time - start_time

# Report each parsing phase as a share of the total execution time.
rel_response_parsing_time = response_parsing_time / execution_time
rel_arrow_stream_parsing_time = arrow_stream_parsing_time / execution_time
print(f"{execution_time:.3f} seconds elapsed")
print(
    f"{response_parsing_time:.3f} seconds "
    f"({rel_response_parsing_time * 100:.2f}%) "
    "parsing multipart/mixed response"
)
print(
    f"{arrow_stream_parsing_time:.3f} seconds "
    f"({rel_arrow_stream_parsing_time * 100:.2f}%) "
    "parsing Arrow stream"
)
+49
@@ -0,0 +1,49 @@
<!---
  Licensed to the Apache Software Foundation (ASF) under one
  or more contributor license agreements.  See the NOTICE file
  distributed with this work for additional information
  regarding copyright ownership.  The ASF licenses this file
  to you under the Apache License, Version 2.0 (the
  "License"); you may not use this file except in compliance
  with the License.  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing,
  software distributed under the License is distributed on an
  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
  KIND, either express or implied.  See the License for the
  specific language governing permissions and limitations
  under the License.
-->

# HTTP GET Arrow Data in multipart/mixed: Python Server Example

This directory contains an example of a Python HTTP server that sends a
`multipart/mixed` response to clients. The server:
1. Creates a list of record batches and populates it with synthesized data.
2. Listens for HTTP GET requests from clients.
3. Upon receiving a request, builds and sends an HTTP 200 `multipart/mixed`
   response containing:
   - A JSON part with metadata about the Arrow stream.
   - An Arrow stream part with the Arrow IPC stream of record batches.
   - A plain text part with a message containing timing information. This part
     is optional (included if `?include_footnotes` is present in the URL).
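
To make the shape of such a response concrete, here is a rough sketch of how the three parts could be laid out on the wire. It is not taken from `server.py`: the generator name `multipart_chunks`, the `num_batches` metadata key, and the placeholder payloads are assumptions for illustration, and the Arrow part is represented by dummy bytes so that the sketch needs only the standard library.

```python
import json
import secrets


def multipart_chunks(parts, boundary):
    """Yield the bytes of a multipart/mixed body, one piece at a time.

    `parts` is an iterable of (content_type, payload_bytes) pairs.
    """
    for content_type, payload in parts:
        # Each part starts with a delimiter line followed by its own headers.
        yield f"--{boundary}\r\nContent-Type: {content_type}\r\n\r\n".encode("ascii")
        yield payload
        yield b"\r\n"
    # The closing delimiter has two extra dashes after the boundary.
    yield f"--{boundary}--\r\n".encode("ascii")


boundary = secrets.token_urlsafe(28)
parts = [
    # Hypothetical metadata; the real server decides what goes in this part.
    ("application/json", json.dumps({"num_batches": 2}).encode("utf-8")),
    # Placeholder bytes; a real server would write an Arrow IPC stream here.
    ("application/vnd.apache.arrow.stream", b"<arrow ipc stream bytes>"),
    ("text/plain", b"placeholder timing message\r\n"),
]
body = b"".join(multipart_chunks(parts, boundary))
print(body.decode("ascii"))
```

Because the pieces are produced by a generator, a server can write each one to the socket as soon as it is available instead of assembling the whole body in memory first.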

To run this example:

```sh
pip install pyarrow
python server.py
```

> [!NOTE]
> This example uses Python's built-in
> [`http.server`](https://docs.python.org/3/library/http.server.html) module.
> This allows us to implement [chunked transfer
> encoding](https://en.wikipedia.org/wiki/Chunked_transfer_encoding) manually.
> Other servers may implement chunked transfer encoding automatically at the
> cost of an undesirable new layer of buffering. Arrow IPC streams already offer
> a natural way of chunking large amounts of tabular data. It's not a general
> requirement, but in this example each chunk corresponds to one Arrow record
> batch.
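
For reference, the framing that "manual" chunked transfer encoding involves is small. The following is a minimal sketch of the HTTP/1.1 chunk format, not the code used by `server.py`; the names `encode_chunk`, `LAST_CHUNK`, and `chunked_body` are hypothetical. A server using it would also need to send a `Transfer-Encoding: chunked` response header.

```python
def encode_chunk(data: bytes) -> bytes:
    """Frame `data` as one chunk of an HTTP/1.1 chunked response body."""
    # Chunk size in hexadecimal, CRLF, the data itself, CRLF.
    return f"{len(data):X}\r\n".encode("ascii") + data + b"\r\n"


# A zero-length chunk followed by an empty trailer section ends the body.
LAST_CHUNK = b"0\r\n\r\n"


def chunked_body(pieces):
    """Yield a chunked-encoded body from an iterable of byte strings."""
    for piece in pieces:
        if piece:  # an empty chunk would terminate the body prematurely
            yield encode_chunk(piece)
    yield LAST_CHUNK


wire = b"".join(chunked_body([b"hello, ", b"", b"world" * 4]))
print(wire)
```

In the setup described in the note, each serialized record batch would be one `piece`, so chunk boundaries line up with batch boundaries.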
