Added ability to use LF, not only CRLF delimiter for response Headers and Body #115
Conversation
Benchmark results

With fix:

Without fix (latest master 522b004):

If performance is an issue, maybe look for "\r" or "\n" first. If you see "\r" first, use "\r\n" as the delimiter; otherwise use "\n".
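That suggestion can be sketched as a tiny helper (illustrative only; `detect_line_delimiter` is a hypothetical name, not part of h11):

```python
def detect_line_delimiter(buf: bytes) -> bytes:
    """Pick a line delimiter based on the first line terminator seen.

    If the first b"\\n" is preceded by b"\\r", assume CRLF endings;
    otherwise assume bare LF. Defaults to b"\\n" when no terminator
    has arrived yet. (Hypothetical helper, not h11's actual code.)
    """
    idx = buf.find(b"\n")
    if idx > 0 and buf[idx - 1] == 0x0D:  # byte before "\n" is "\r"
        return b"\r\n"
    return b"\n"
```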
@bdraco, yes,
After some performance investigation I changed the algorithm. For now the changeset looks fine.
- reword comment
- reworked maybe_extract_until_next…imiter_regex) and slightly updated docs
Hello @pgjones, I ran the benchmark again against the changes.

With changes:

Without changes:
Noticed another potential issue, I sadly missed last time.
h11/_receivebuffer.py
Outdated
-        if self._data[self._start : self._start + 2] == b"\r\n":
-            self._start += 2
+        start_chunk = self._data[self._start : self._start + 2]
+        if start_chunk in [b"\r\n", b"\n"]:
I think start_chunk can only equal b"\n" if self._data == b"\n" (given the + 2 in line 129). Maybe this needs to be:
if self._data[self._start : self._start + 2] == b"\r\n":
self._start += 2
return []
elif self._data[self._start : self._start + 1] == b"\n":
self._start += 1
return []
?
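For illustration, the suggested logic as a standalone sketch (`skip_leading_blank_line`, `data`, and `start` are stand-ins for the method's `self._data`/`self._start` bookkeeping, not h11's API):

```python
def skip_leading_blank_line(data: bytes, start: int) -> int:
    """Consume one leading blank line, whether CRLF- or LF-terminated.

    Returns the new start offset (unchanged if no blank line is present).
    """
    if data[start : start + 2] == b"\r\n":
        return start + 2
    if data[start : start + 1] == b"\n":
        return start + 1
    return start
```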
You are right, I missed that; for example, self._data might start with b"\nla-la-la-la". Incorporated your changes.
- changed blank_line_delimiter_regex - changed maybe_extract_lines start processing
- added pytest param names to test_receivebuffer_for_invalid_delimiter
Force-pushed from aa87022 to 489de4d
You can use the address and credentials I've given to you; firmware on that device is 5.20.5
@wonderiuy I meant getting his changes to my local filesystem :)
It was really simple to clone this branch with GitHub Desktop and try it out. Works well on both 5.51 and 5.75 as well as on newer firmwares. Good job @cdeler!
I can also confirm this is working locally ;) Looking forward to the PR being approved.
Last week Axis published a new firmware version (5.51.7.2) for my camera (M1034-W). Below you will find the release notes. I don't know if it is related to the issue in this thread (LF vs CRLF), but the first correction (C01) in the new firmware is about "Corrected a newline character", so I thought it would be wise to mention it here.

Corrections in 5.51.7.2 since 5.51.7.1:
- 5.51.7.2:C01
- 5.51.7.2:C02
- 5.51.7.2:C03
- 5.51.7.2:C04
Thanks! I don't know, but regardless there are other firmwares that won't get this update, so the fix is still needed.
Can also confirm that this PR solves the problems with older versions of Axis cameras in Homeassistant. |
Hey guys! Any progress on getting this merged? |
h11/_receivebuffer.py
Outdated
# Only search in buffer space that we've not already looked at.
partial_buffer = self._data[self._multiple_lines_search :]
match = blank_line_regex.search(partial_buffer)
Instead of copying the buffer, use the pos argument to search:
match = blank_line_regex.search(self._data, self._multiple_lines_search)
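A runnable illustration of the `pos`-argument idea (the buffer contents here are made up; `blank_line_regex` mirrors the blank-line pattern discussed in this PR):

```python
import re

# Compiled bytes patterns can search bytearrays; passing a start offset
# scans from that position without slicing (copying) the buffer first.
blank_line_regex = re.compile(b"\n\r?\n")

data = bytearray(b"HTTP/1.1 200 OK\r\nHost: example.com\r\n\r\nbody")
search_start = 17  # pretend the first line was already scanned

match = blank_line_regex.search(data, search_start)
```

Here `match.end()` points just past the blank line, i.e. at the start of the body.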
partial_buffer = self._data[search_start_index:]
partial_idx = partial_buffer.find(b"\r\n")
if partial_idx == -1:
    self._next_line_search = len(self._data)
In maybe_extract_next_line, we store the raw buffer length in self._next_line_search and then do the subtraction when we use it. In maybe_extract_lines, we store the "pre-subtracted" value, so we can use it directly. This inconsistency is kind of confusing :-). We should switch one of them so they match.
As soon as I rewrote both methods using the _extract(...) method, this issue was resolved (both methods now work with offsets).
h11/_receivebuffer.py
Outdated
self._data[:count] = b""
self._next_line_search = 0
self._multiple_lines_search = 0
There's a lot of copy/pastes of this code for extracting an initial slice and then doing internal bookkeeping, which is error prone. Let's factor it out into a method, like:
def _extract(self, count):
    out = self._data[:count]
    del self._data[:count]
    self._next_line_search = 0
    self._multiple_lines_search = 0
    return out
And then in all the other methods, just do:
return self._extract(whatever_length_value_we_ended_up_with)
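Put together, the factored-out helper might look like this minimal sketch (an assumed shape for illustration, not h11's exact class):

```python
class ReceiveBufferSketch:
    """Minimal sketch of a receive buffer with shared extraction bookkeeping."""

    def __init__(self) -> None:
        self._data = bytearray()
        self._next_line_search = 0
        self._multiple_lines_search = 0

    def __iadd__(self, byteslike):
        self._data += byteslike
        return self

    def _extract(self, count: int) -> bytearray:
        # Slice off the first `count` bytes and reset the search offsets,
        # so every caller shares one copy of this bookkeeping.
        out = self._data[:count]
        del self._data[:count]
        self._next_line_search = 0
        self._multiple_lines_search = 0
        return out
```

Each `maybe_extract_*` method then ends with `return self._extract(length)` instead of repeating the slicing and offset resets.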
I introduced the _extract method and replaced all the copy/pastes with it. It also helped to resolve the above comment (#115 (comment)).
    b"Content-type: text/plain",
    b"Connection: close",
]
assert bytes(b) == b"Some body"
Can you also add a few similar tests to test_readers_unusual in test_io.py? These tests are good for the basic ReceiveBuffer functionality, but the test harness there sets up a more "end-to-end" scenario that runs our full HTTP parsing pipeline, so it would give us more confidence that everything is wired up correctly.
I added similar tests to test_readers_unusual.
h11/_receivebuffer.py
Outdated
# Truncate the buffer and return it.
idx = self._multiple_lines_search + match.span(0)[-1]
out = self._data[:idx]
lines = [line.rstrip(b"\r") for line in out.split(b"\n")]
Instead of calling rstrip here (which always has to copy the whole buffer, since these are bytearrays), I think we could leave the trailing \r in, and then in _abnf.py change header_field, request_line, and status_line to match a trailing optional \r, e.g.:
header_field = (
r"(?P<field_name>{field_name})"
r":"
r"{OWS}"
r"(?P<field_value>{field_value})"
r"{OWS}\r?".format(**globals()) # <-- notice added \r? here
)
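To make the idea concrete, here is a heavily simplified stand-in for `header_field` with the optional trailing `\r?` (the real ABNF in `_abnf.py` is stricter; this only illustrates the proposal):

```python
import re

# Simplified stand-in for h11's header_field pattern; the real ABNF is
# stricter about token and field-value syntax. The point here is only
# the trailing optional "\r?".
header_field = re.compile(
    rb"(?P<field_name>[!#$%&'*+\-.^_`|~0-9A-Za-z]+)"
    rb":[ \t]*"
    rb"(?P<field_value>[^\r\n]*?)"
    rb"[ \t]*\r?$"
)
```

The same line then matches whether or not the trailing `\r` was stripped before parsing.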
I tried to do that. I added \r? to these regexes (header_field, request_line, and status_line). Then I set

lines = out.split(b"\n")

but it broke one of the test_readers_unusual test cases in test_io.py. The test source:
def test_readers_unusual():
...
# obsolete line folding
tr(
READERS[CLIENT, IDLE],
b"HEAD /foo HTTP/1.1\r\n"
b"Host: example.com\r\n"
b"Some: multi-line\r\n"
b" header\r\n"
b"\tnonsense\r\n"
b" \t \t\tI guess\r\n"
b"Connection: close\r\n"
b"More-nonsense: in the\r\n"
b" last header \r\n\r\n",
Request(
method="HEAD",
target="/foo",
headers=[
("Host", "example.com"),
("Some", "multi-line header nonsense I guess"),
("Connection", "close"),
("More-nonsense", "in the last header"),
],
),
)
The header

b"Some: multi-line\r\n"
b" header\r\n"
b"\tnonsense\r\n"
b" \t \t\tI guess\r\n"

turns into

b"Some: multi-line\r header\r\tnonsense\r \t \t\tI guess\r\n"

and I cannot figure out how to carefully cut the \r out of such a line.
After some discussion on Gitter, it has been decided not to change the regexes, but to rewrite .rstrip(...) as

for line in lines:
    if line.endswith(b"\r"):
        del line[-1]

to avoid extra memory allocations.
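A runnable version of that approach (the sample data is made up; note that `bytearray.split` returns `bytearray` pieces, so the in-place `del` works without allocating new line objects):

```python
# Split on b"\n" and trim a trailing b"\r" in place, avoiding the
# per-line copies that bytearray.rstrip() would make.
out = bytearray(b"HTTP/1.1 200 OK\r\nHost: example.com\nDone\r\n")
lines = out.split(b"\n")          # bytearray.split yields bytearrays
if lines and lines[-1] == b"":    # drop the empty piece after the final "\n"
    lines.pop()
for line in lines:
    if line.endswith(b"\r"):
        del line[-1]              # in-place delete, no new allocation
```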
Oh also, forgot to say: could you also add a news entry for the next release notes? See
1. added new tests to test_io.py 2. introduced ReceiveBuffer::_extract 3. added a newsfragment
Force-pushed from f1c4157 to f688615
Replaced lines.rstrip(...) with `del line[-1]` to avoid extra allocations
@cdeler I tweaked your newsfragment to use the correct quoting: in ReST, code literals require double backticks. Super annoying if you're used to markdown, but what can ya do. @pgjones Note that this PR doesn't quite drop support for py2, but it does change the buffer handling to be O(n**2) on py2, and I'm wondering if we should flag that in the release notes or anything. Or are you planning to drop py2 for real in the next release anyway?
Let's merge this, and drop Py2 for the next release.
1. it uses b"\n\r?\n" as the blank line delimiter regex
2. it splits lines using the b"\r?\n" regex, so that it's tolerant of mixed line endings
3. for chunked encoding it rewinds the buffer until b"\r\n"

The changes are based on this comment: #115 (comment)
using these test results #115 (comment)
after @tomchristie's proposal from #115 (comment)
1. added new tests to test_io.py 2. introduced ReceiveBuffer::_extract 3. added a newsfragment
This is like a Christmas gift, a big thank you to everyone involved
Hello,
I want to submit a PR which closes #7.

Why the changes are required

According to this comment in the issue, there are problems with some old servers which do not fully conform to the HTTP/1.1 RFC. The original issue in httpx (encode/httpx#1378) describes a situation where some embedded-system developers have to deal with a non-standard server.

What has been done?

I tried to reimplement the function which extracts headers for a response, using a regex.

What hasn't been done yet

Performance testing, fuzzing.

(updates) How maybe_extract_lines works for now:
1. it searches the self._data buffer for the "\n\r?\n" pattern, which gives data
2. it searches data using the "\r?\n" regex, which gives delimiter
3. it splits via data.split(delimiter) (it is much faster than regex.split(data))

With fix:

Without fix (522b004):
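The three steps above can be sketched as a standalone function (simplified for illustration: no offset bookkeeping or buffer consumption, and it assumes the header block uses one consistent line ending, unlike the final mixed-endings handling):

```python
import re

blank_line_delimiter_regex = re.compile(b"\n\r?\n")
line_delimiter_regex = re.compile(b"\r?\n")

def maybe_extract_lines_sketch(buf: bytes):
    # 1. Search the buffer for the blank line that ends the header block.
    match = blank_line_delimiter_regex.search(buf)
    if match is None:
        return None  # headers not complete yet
    data = buf[: match.start() + 1]  # header block incl. its final "\n"
    # 2. The first line ending tells us which delimiter this peer uses.
    delimiter = line_delimiter_regex.search(data).group(0)
    # 3. A plain split on that delimiter is much faster than regex-splitting.
    return [line for line in data.split(delimiter) if line]
```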