Description
I've been talking with @mhils a bit about this on IRC, and there's some discussion in https://github.com/njsmith/h11/issues/31#issuecomment-309774081, but I kind of want to see mitmproxy's requirements all written down in one place. Let's use this issue for that.
@mhils says:
The main pain points for us are usually along these lines:
- Capturing/sending data exactly as-is (we want to mirror the client accurately, e.g. header capitalization)
- Tolerance for slightly misbehaving clients (of course, we reject absurdly wrong data)
I think headers are the only blocking issue for us right now. h11's request/response validation is considerably stricter than what we do right now, but I'm slightly optimistic that we could at least patch this effortlessly.
Here are some places off the top of my head where h11 currently will not do byte-for-byte pass-through:
- Header capitalization, as noted
- h11 understands HTTP/1.0, but it never emits it
- But h11 doesn't understand the non-standard keep-alive extensions sometimes used with HTTP/1.0. Which is fine when it's implementing one side of a connection because it can make sure that it never gets used. But if you're trying to eavesdrop on connections between an arbitrary client/server, it's possible that they'll decide to use it and h11 will get confused.
- Also, h11 will unconditionally insert or rewrite headers in some cases (in particular `Transfer-Encoding` and `Connection`)
- h11 will eventually tolerate non-compliant line-endings (Support for servers with broken line endings #7), but won't emit them
- Leading and trailing whitespace in header values is always discarded
- We understand headers split over multiple continuation lines, but don't emit them
- We tolerate missing reason phrases (Tolerate missing reason phrases #32), but don't emit them
- Chunked encoding boundaries: Currently we do expose where the boundaries fall because @Lukasa needed this for urllib3 (Proposal: make it possible to observe chunk delimiters. #19), but I think urllib3 has come to their senses and decided they don't need this after all, so I was kinda hoping we could get rid of it again :-). Also, the way we expose them is sub-optimal if your goal is to regenerate them, because we don't expose the length; you just have to buffer the whole chunk and then re-emit it all at once, which is unnecessary unbounded memory overhead.
- Speaking of chunked encoding, we also discard chunk extension metadata and provide no way to emit it.
There are probably a few other issues like this that I'm not thinking of right now.
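To make the pass-through problem concrete, here's a minimal stdlib-only sketch of a parser that normalizes in the ways listed above: it lowercases header names, strips whitespace around values, and folds continuation lines. This is not h11's actual code; `parse_headers` and `serialize_headers` are hypothetical helpers made up for the illustration. The point is just that once the input has been normalized, the original bytes can't be regenerated.

```python
def parse_headers(raw):
    """Hypothetical normalizing parser (NOT h11's implementation).

    Lowercases names, strips whitespace around values, and folds
    continuation lines into the previous header's value.
    """
    headers = []
    for line in raw.split(b"\r\n"):
        if not line:
            continue
        if line[:1] in (b" ", b"\t") and headers:
            # Continuation line: append it to the previous value.
            name, value = headers[-1]
            headers[-1] = (name, value + b" " + line.strip())
            continue
        name, _, value = line.partition(b":")
        headers.append((name.lower(), value.strip()))
    return headers


def serialize_headers(headers):
    """Emit headers in a single canonical form."""
    return b"".join(name + b": " + value + b"\r\n" for name, value in headers)


original = (
    b"Host:   example.com  \r\n"
    b"X-Custom-Header: first\r\n"
    b"  second\r\n"
)
parsed = parse_headers(original)
roundtrip = serialize_headers(parsed)

# The semantic content survives, but capitalization, whitespace, and the
# continuation-line layout are gone, so the bytes differ.
print(roundtrip == original)
```

Supporting byte-for-byte pass-through would mean carrying the raw form alongside the normalized one for every item on the list, which is where the complexity comes from.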
Also, as noted, we're pretty strict about forbidding a lot of things that a pentester might reasonably want to emit. (In fact, "would a pentester find this useful" is one of the criteria that we use to decide what to forbid, because generally you don't want your software to be accidentally turned into a pentester :-).) Some kinds of validation could be disabled without too much work (see e.g. #33), but other kinds are more difficult. For example, h11 absolutely won't tolerate incoming data with whitespace around header names (`Host : example.com`) or illegal characters in header values – both of which are classic sources of security bugs – and this is enforced implicitly by the regexes used to parse the header name and value in the first place, rather than being some extra code that could be switched on or off.
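As a rough illustration of why this is hard to switch off: when the grammar lives in the regex itself, there's no separate "check" step to skip, and a line that violates the grammar simply fails to parse. The sketch below assumes a simplified pattern loosely following RFC 7230. It is not h11's real regex, and `HEADER_LINE` and `parse_header_line` are names invented for the example.

```python
import re

# Simplified token / field-value grammar, loosely following RFC 7230.
# NOT h11's actual pattern; purely illustrative.
TOKEN = rb"[-!#$%&'*+.^_`|~0-9a-zA-Z]+"
# Visible ASCII, space, tab, and obs-text (0x80-0xff); other control
# characters such as NUL are excluded.
FIELD_VALUE = rb"[\x21-\x7e\x80-\xff \t]*"
HEADER_LINE = re.compile(
    rb"(?P<name>" + TOKEN + rb"):[ \t]*(?P<value>" + FIELD_VALUE + rb")"
)


def parse_header_line(line):
    """Parse one header line (without CRLF); raise ValueError if malformed."""
    match = HEADER_LINE.fullmatch(line)
    if match is None:
        # No dedicated validation code ran here: the regex just didn't match.
        raise ValueError("malformed header line: %r" % (line,))
    return match["name"], match["value"].rstrip(b" \t")


print(parse_header_line(b"Host: example.com"))

for bad in (b"Host : example.com", b"X-Evil: a\x00b"):
    try:
        parse_header_line(bad)
    except ValueError as exc:
        print("rejected:", exc)
```

Making this tolerant would mean loosening the pattern, and with it what counts as a header name or value everywhere downstream, rather than flipping a flag.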
I guess if you want to monkeypatch, that's your business, but if you go that route then I feel like I should give you the standard warning: just as a matter of practicality, I don't think we can guarantee that your monkeypatches will keep working in future releases, or that the hooks you're trying to monkeypatch will even still exist. API guarantees only apply to APIs :-).
So... if you need h11 to be able to handle all of the pass-through cases listed above, then realistically I don't see how to implement that without the codebase turning into something monstrous and full of bugs. From discussions so far I'm guessing you don't actually care about all of these, but currently I don't understand the criteria that determine which ones you care about and which ones you don't, so I can't guess which ones are important. (In particular, it seems like HTTP/1.0 stuff might matter in practice?) Can you elaborate?