Skip to content

Fails to decode a header that requests can handle #57

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
miracle2k opened this issue Mar 20, 2018 · 16 comments
Closed

Fails to decode a header that requests can handle #57

miracle2k opened this issue Mar 20, 2018 · 16 comments

Comments

@miracle2k
Copy link

I am trying to read this url

https://www.bitstamp.net/api/v2/trading-pairs-info/

It fails (see theelous3/asks#60) with an exception:

....
  File "/Users/michael/.pyenv/versions/3.6.3/lib/python3.6/site-packages/h11/_readers.py", line 85, in maybe_read_from_SEND_RESPONSE_server
    return class_(headers=list(_decode_header_lines(lines[1:])),
  File "/Users/michael/.pyenv/versions/3.6.3/lib/python3.6/site-packages/h11/_readers.py", line 55, in _decode_header_lines
    matches = validate(header_field_re, line)
  File "/Users/michael/.pyenv/versions/3.6.3/lib/python3.6/site-packages/h11/_util.py", line 96, in validate
    raise LocalProtocolError(msg)

Looking closer, this is the header that seems to cause trouble:

bytearray(b'Set-Cookie: ___utmvafIumyLc=kUd\x01UpAt; path=/; Max-Age=900')

My guess is it's the \x.

@njsmith
Copy link
Member

njsmith commented Mar 20, 2018

Yep, RFC 7230 says unambiguously that you can't have characters below \x20 in header values (except for \t), so that site is violating the HTTP spec.

Now the question is... what to do about it. If they're violating the spec like that and getting away with it, then probably others are as well, so we might need to relax h11 to match. I'm not keen on allowing just any character inside headers though; that seems like it will inevitably create security issues. I guess we have to go figure out what characters browsers and curl allow...

@njsmith
Copy link
Member

njsmith commented Mar 20, 2018

Things I've learned so far:

  • The HTTP/2 RFC says: "Any request or response that contains a character not permitted in a header field value MUST be treated as malformed", because they're worried about security attacks where different steps in the chain process these things differently. They particularly call out CR, LF, and NUL as likely to be dangerous. Unfortunately the HTTP/1.1 RFC doesn't get as specific about this point.

  • Golang: has a ValidHeaderFieldValue check that seems to correctly implement RFC 7230's rules, and if we look at where it's called, then it looks like they enforce it in their HTTP/1 server code (i.e., reject requests that contain invalid headers), but not in their client code. Also their HTTP/2 code checks this at several places, but I'm not sure how those map onto request/response paths.

  • curl: appears to do no validation at all; I wrote a silly little server that sends back a header with the value \x00, and it accepted this without complaint.

  • Firefox: same as curl, didn't complain at all about a NUL in the header and reported it as NUL in the network inspector, so I don't think it's sanitizing it or anything.

  • We currently have a tiny bug in our header field validation rule that allows a bit more through than the spec says it should (Bug in vchar_or_obs_text regex: shouldn't accept DEL #58)

Ugh. I definitely do not want to allow NUL through by default, even if that's what other clients do. Also, on input I guess \r and \n might not be so dangerous after #7 is fixed, but we definitely should not allow them in outgoing headers, because that's led to a number of real high-profile attacks ("response splitting", see #50).

Some options:

  • Tell users to suck it up and deal with h11 following the specs: this seems unlikely to work in practice, esp. as we look at moving larger libraries over to h11. Especially in cases like in this issue, where someone is just trying to deal with some remote server they don't control.

  • relax our validation rule to reject \x00, \r, and \n only. Hope that all the other weird control characters are not too dangerous in practice.

  • Keep the correct rule, but provide some way to enable "sloppy mode", like an envvar. Theory: people should be made aware and feel a bit of pain when specs are not followed, but still have a way to get their work done if necessary. (And an envvar is a way of setting a flag that you can use even if h11 is buried way down inside your stack.) Downside: modes are bad, and "disable security" modes are especially bad, and "disable security everywhere" modes are especially especially bad. OTOH, relaxing the checks unconditionally might be worse.

  • Maybe do something stricter if we're a server than if we're a client, like golang does? idk

@Lukasa Would appreciate your thoughts here if you have time... e.g. does h2 validate header values?

@kennethreitz @haikuginger @sigmavirus24 You might also have thoughts here, since this might end up affecting urllib3/requests.

@njsmith
Copy link
Member

njsmith commented Mar 20, 2018

BTW, note that I just moved this repo into the python-hyper org, so github email links might be wonky, I'm not sure. The correct URL for this discussion is: #57

@njsmith
Copy link
Member

njsmith commented Mar 20, 2018

For future reference, the program I used to check how curl and firefox handle NUL in header values:

import trio

async def silly_server(stream):
    data = bytearray()
    while True:
        data += await stream.receive_some(4096)
        if b"\r\n\r\n" in data:
            break
    await stream.send_all(
        b"HTTP/1.1 200 OK\r\nTest: \x00\r\nContent-Length: 2\r\n\r\nXX"
    )

async def main():
    await trio.serve_tcp(silly_server, 8888)

trio.run(main)

@sigmavirus24
Copy link

My main opinion is that requests should never have started using environment variables to control things in the stack, especially around security sensitive configuration.

An alternative would be to return NotValidatedHeader objects and return a ValidatedHeader from UnverifiedHeader.verify or raise an exception. This puts the control in the hands of the library user. If they want, they can trample forwards, otherwise, they can re-raise the exception however they want.

@Lukasa
Copy link
Member

Lukasa commented Mar 21, 2018

h2 does indeed validate header fields, but its validation is quite cautious. In its case it does allow all the characters you’re mentioning, which could well be considered a bug.

@njsmith
Copy link
Member

njsmith commented Mar 22, 2018

@sigmavirus24 I guess a similar option would be to have some sort of configuration you pass when setting up the h11.Connection object to choose the degree of header validation. (This would also be easier to implement+faster, since currently we do the header value validation as a side-effect of the regex's we use to pick apart the headers in the first place.) But if the person who has the problem is using, like, someapprequestsurllib3h11, then this kind of config doesn't help unless either urllib3 sets it to the permissive mode by default, or else everyone in the stack provides their own config option that they then pass through to the next layer down. That's the motivation for thinking about envvars -- they're gross, but they do provide a way for a user to circumvent that stack.

That said... I guess we always want strict validation on outgoing headers, and that probably we always want strict validation as servers parsing incoming requests (because on the server side, you can give a proper error message, plus not-always-but-usually the client will be stuck working around whatever you do rather than vice-versa). And if this is breaking in real life and no extant clients actually enforce it, then I guess urllib3 and similar will probably want to disable it always anyway. So one heuristic would be: when parsing the headers from an incoming response, and only in this situation, then outlaw \0, \r and \n but let everything else through; otherwise (outgoing headers and incoming requests) apply full RFC validation.

@sigmavirus24
Copy link

I guess we always want strict validation on outgoing headers

Not if the mitmproxy folks start using this library ;)

That's the motivation for thinking about envvars -- they're gross, but they do provide a way for a user to circumvent that stack.

I understand that they're an easy way for end users to do a thing that maybe fixes it. But that always ends up being a hammer to them that they then use to bash every problem with, unnecessarily. I regret every envvar that Requests supports, even the one that controls certificate files, and that one is arguably the most/only useful one. No amount of documentation will ever fix a user's preconceived notions about what "wins" and precedence gets tricky with enough different sources of configuration.

@kennethreitz
Copy link

@sigmavirus24 should we remove env vars for 3.0?

@sigmavirus24
Copy link

@kennethreitz I'm not sure the fury would be worth making our lives slightly easier. I think we've already merged changes to make the decision/precedence process considerably more consistent. It's just that I think there shouldn't ever be another one added :)

@njsmith
Copy link
Member

njsmith commented Apr 2, 2018

On further thought, I realized my suggestion above wouldn't actually handle the case that started this, because there the offending byte is inside a cookie, which means that the client has to be able to send it back to the server :-/. So I guess our options are:

  • Loosen the header-value regex to accept any character except \0, \r, \n

  • Be spec-compliant about what we send and receive in server mode, but use the loosened rule above in client mode. This would be somewhat annoying to implement, though, because of how h11 is structured (the same h11.Request or h11.Response objects are used by both clients and servers, they validate headers, and they don't know which role we're playing – if we were only changing the rules for parsing from the wire, that would be easier, because that's done by the connection object itself).

So I guess I'm now leaning towards the first option.

miracle2k added a commit to miracle2k/h11 that referenced this issue Apr 15, 2018
@miracle2k
Copy link
Author

I am noticing this issue on a surprising number of sites, so I googled. It would seem that __utmv is actually set by Google Analytics!

@njsmith
Copy link
Member

njsmith commented May 5, 2018

That's, umm. Wow.

njsmith added a commit to njsmith/h11 that referenced this issue May 6, 2018
The RFC says we should reject any header value that contains control
characters. But apparently in the real world, you have to both accept
and produce these sometimes (e.g. Google Analytics cookies use them).

As a compromise, we now accept most control characters, but continue
to disallow NUL (\x00) and all whitespace (\t\n\r\f\v and space),
except that space and tab are allowed inside header values when
surrounded by non-whitespace characters.

Closes: python-hypergh-57, python-hypergh-58
njsmith added a commit to njsmith/h11 that referenced this issue May 6, 2018
The RFC says we should reject any header value that contains control
characters. But apparently in the real world, you have to both accept
and produce these sometimes (e.g. Google Analytics cookies use them).

As a compromise, we now accept most control characters, but continue
to disallow NUL (\x00) and all whitespace (\t\n\r\f\v and space),
except that space and tab are allowed inside header values when
surrounded by non-whitespace characters.

Closes: python-hypergh-57, python-hypergh-58
@njsmith
Copy link
Member

njsmith commented May 6, 2018

Ok, #68 implements the hack I suggested above, except that it makes the list of forbidden characters be NUL + all whitespace (including the weird ones like vertical-tab). Hopefully that's loose enough to work in the real world while still being strict enough to provide some defense-in-depth.

Review would appreciated if anyone has time.

@njsmith
Copy link
Member

njsmith commented Oct 30, 2018

Note for future reference: the WHATWG fetch spec defines an HTTP header value to be:

  • Has no leading or trailing HTTP whitespace bytes.
  • Contains no 0x00, 0x0A or 0x0D bytes.

So they're slightly looser than h11's new rules: h11 and fetch both disallow NUL, \n, and \r, and h11, but not fetch, disallows \f and \v.

@royfielding
Copy link

Note that curl allows all sorts of things specifically because it is a tool used for pentesting and verification. It would be nice if it had some sort of validation mode that highlighted spec errors.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants