Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Are user initiated aborts a network failure? #108

Open
igrigorik opened this issue Feb 15, 2019 · 25 comments
Open

Are user initiated aborts a network failure? #108

igrigorik opened this issue Feb 15, 2019 · 25 comments

Comments

@igrigorik
Copy link
Member

The user clicks on a blue link, which kicks off the navigation request. After some time delta, the user's patience runs out and they..

  1. Click cancel and navigation is aborted (rendering of initial page hasn't been updated)
  2. Hit the back key and navigation is aborted (they're looking at white screen and give up)

If the origin being navigated to has an associated NEL policy, would the origin see a report for either or both of these cases? Per our current definition in the spec, assuming the UA did not get a 2XX or 3XX response and user aborts the navigation.. I believe the answer is "yes", but want to sanity check both from spec and implementation perspective.

I guess the meta question here is: how do we treat user initiated aborts?

@chlily1
Copy link

chlily1 commented Feb 15, 2019

I think the answer right now is "no" on the implementation side.

My understanding of how this currently works in Chromium is that a "successful" report is generated (but most likely not sent, depending on the success sampling rate) if valid headers have been received and the network transaction is in an idle state [1]. If not, then the network transaction is silently destroyed without generating a NEL report.

At the level at which we're generating NEL reports, I don't think there's a way to tell why exactly the transaction was cancelled, but we may be able to add code to cover more cases. (I may be misunderstanding how URLRequests/HttpNetworkTransactions work. @MattMenke2 is the expert here -- please correct me if I'm wrong.)

[1] https://cs.chromium.org/chromium/src/net/http/http_network_transaction.cc?l=131

@igrigorik
Copy link
Member Author

Thanks for the pointer and confirming the behavior Lily.

@dcreager curious, have these scenarios come up in previous discussions?

An example use case I'm thinking of here is user clicking on a link that's taking way too long to load (reasons unknown) and manually aborting the navigation — e.g. DNS resolved, TCP handshake completed but the server is non responsive and user runs out of patience, at which point they initiate a manual abort. In this case, from a site owners perspective.. I would argue that's a failed navigation and unless NEL emits a report, they're blind to it.

@MattMenke2
Copy link

MattMenke2 commented Feb 19, 2019

It looks to me like we may or may not generate a success report on cancellation, depending on state of the HttpNetworkTransaction.

If there's a pending body read, we don't generate a report. If there's no pending body read, we do generate a successful report. I assume there's generally a pending body read when a request is cancelled, unless we're actually stalled on something (auth challenge, client cert request, body IPC buffer is full, pending redirect not yet followed, etc).

@igrigorik
Copy link
Member Author

@MattMenke2 what are the prerequisites for transitioning to "pending body read" state? Does this imply we received the response headers?

The types of failures I would be interested in here are...

  • user cancelling during DNS resolve or TCP and TLS handshakes
  • request headers sent but server is slow or unresponsive when user cancels
  • response is streaming but not complete when user cancels

@MattMenke2
Copy link

It's only the case where we're streaming the body where we have a pending body read, so at that point we've read the final headers (after all redirects). Cancelling at earlier times will also not result in reporting error information, since we have a check in a destructor for having valid headers before putting together a report, if we haven't put one together already.

@MattMenke2
Copy link

Another fun case: Redirects. When we see a redirect, we follow the redirect, without trying to read the response body. If we're using H2/QUIC, we just cancel the stream we received the redirect response on. If it's HTTP/1.x, we try and drain the response body out of band, to try and reuse the socket. If draining the response body takes too long, we time out the connection. We can also get a network error while draining the body. In either case, we've already reported the HTTP headers, and followed the redirect. How should that fit into error reporting? Is it a success or a failure?

@igrigorik
Copy link
Member Author

Matt, thanks that's really helpful. The behavior you outlined addresses most of the cases I highlighted..

  • Slow DNS, TCP or TLS handshake -> user abort -> failure report is sent (no body read)
  • Unresponsive server (no http headers received) -> user abort -> failure report is sent

It's possible that a server flushes headers and then trickles out the response (some CDN's do and promote this as an optimization feature), but generally in my experience it's (unfortunately) not common for servers to flush headers well ahead of the response... Well, except we do use that strategy for Search and other Google products.

Thinking out loud, would it be crazy to emit a success report IFF response body finished?

Re, redirects: that's a great question. For same origin redirect chain I could see a case for either each redirect being treated as separate navigation, or as a single 'transaction' — i.e. if any fails we emit an error report. For cross-origin redirects, it seems that we ought to treat each origin separately. E.g. if user navigates to foo.com which then redirects them to bar.com, both foo.com and bar.com have different NEL policies and we should respect them.

@MattMenke2
Copy link

Not sure we want to introduce a concept of origin at that layer of the stack. Another thing to think about is proxies. If we have an error looking up the domain name of the proxy to use, establishing a connection to the proxy, or a proxy returns a funky result to a CONNECT request (auth challenge, or something other than the response indicating it has established a tunnel), should we report that?

If there are multiple proxies in the proxy list, and the first fails, but the next in the list succeeds, should we report the initial failure? What if preconnects fail - there's no request in that case. What if we reject a pushed HTTP stream because we don't like it? (We reject pushed auth challenges, for instance). What if the server requests client auth? Is that an error? HTTP auth we seem to include in the error set, so I guess so?

There's such a thing as "expected" errors whenever a socket is reused. If we reuse a socket (Or use a preconnected but stale socket), we sometimes get a connection close even because the server timed out the socket while we were keeping it around for reuse. In this case, we automatically resend the request. Should we log the original request, which might be an expected error, as an error? We do this both on close and on reset, actually - neither of these may actually reflect an error on the part of the server, and if the retry succeeds, we never surface the error page to the user.

Another case: What if we see a redirect, and the request is aborted rather than following the redirect. In particuar, this can happen on CORS failures (Though the layer we're logging at doesn't know about CORS failures, and we can't really teach it about them).

Another issue: We normally isolate CORS and uncredentialed / non-CORS requests. Is there any reason to separate out network errors by this classification? This is mostly to protect the site the request is being made to from third party sites making the request, and I don't really see any attacks that combining the two sorts of requests could surface, but I'm not really an expert in that space.

@MattMenke2
Copy link

And actually answering your question:

"Thinking out loud, would it be crazy to emit a success report IFF response body finished?"

I think this is the full list of caveats to that (most of these already mentioned):

  • HEAD and 304 responses have no bodies. I think we do try and read bodies for HEAD requests through the normal paths, and 304 responses go through the response body drainer, like redirects.
  • We don't always read the bodies of auth challenges and redirects (We only read them if the user declines to provide auth credentials). I guess we consider these failures, either way.
  • Some consumers just cancel requests after they see the headers rather than waiting for the body.

@chlily1
Copy link

chlily1 commented Feb 22, 2019

Just landed https://crrev.com/c/1480195, so the behavior now is to generate an "abandoned" report if the request is cancelled without having previously generated either an error or success report.

Unfortunately these will all show up as "application" phase, regardless of what phase the network transaction was in when it was cancelled.

@igrigorik
Copy link
Member Author

Just landed https://crrev.com/c/1480195, so the behavior now is to generate an "abandoned" report if the request is cancelled without having previously generated either an error or success report.

@chlily1 nice work! I think that resolves all the scenarios and edge cases @MattMenke2 highlighted above. However, it looks like it got reverted yesterday (doh), are you planning to take another run at it?

@chlily1
Copy link

chlily1 commented Feb 27, 2019

are you planning to take another run at it?

Yes, once I fix a bug that it caused.

@igrigorik
Copy link
Member Author

@chlily1 any updates on this one? Looks like the original CL is still reverted, or is there a new one I should follow?

@chlily1
Copy link

chlily1 commented Apr 8, 2019

Woops, sorry for forgetting to update this thread. It was relanded as https://crrev.com/c/1491905.

@igrigorik
Copy link
Member Author

\o/ ... awesome work, thanks Lily! Based on the branch point, we should see this rollout in M74 stable?

On a related note, do we have any WPT tests to verify this behavior? /cc @yoavweiss

@chlily1
Copy link

chlily1 commented Apr 9, 2019

Yes, this is in M74.

We don't currently have any wpt tests for this.

@yoavweiss
Copy link
Contributor

Would be good to add WPTs, and I think WebDriver can help us simulate user initiated aborts.
Let's keep this open until WPTs are added.

@aaronpeters
Copy link

Just landed https://crrev.com/c/1480195, so the behavior now is to generate an "abandoned" report if the request is cancelled without having previously generated either an error or success report

This is great because it brings the implementation closer to the spec.

Question though: what about cancelled network requests that are not user initiated aborts? For example, some JS code adds an element to the DOM, sets its src and the browser initiates fetch of the resource. Next, JS code removes this element from the DOM and the browser cancels the resource fetch that is still in flight.
Also comes to mind: xhr.abort() and Abortable fetch

Also, @chlily1 , I'm confused by the following I read at https://crrev.com/c/1480195 :

Successes are reported after receiving valid headers (if the response code is 4xx or 5xx, or if the body will not be read, or on a redirect), or after completely reading the body. Any other outcome is reported as an error.

Does this mean that the browser will send a success report after receiving valid headers with response code 4xx or 5xx? If so: should a resource fetch with response code 4xx or 5xx not always be network error of type http.error ?

@chlily1
Copy link

chlily1 commented May 29, 2019

Sorry for unclear wording. It is a success in the sense that the error code reported is net::OK. That is converted later into the type http.error after looking at the status code.

@igrigorik
Copy link
Member Author

Question though: what about cancelled network requests that are not user initiated aborts? For example, some JS code adds an element to the DOM, sets its src and the browser initiates fetch of the resource. Next, JS code removes this element from the DOM and the browser cancels the resource fetch that is still in flight.
Also comes to mind: xhr.abort() and Abortable fetch

Application initiated aborts are out of scope for NEL — they're not errors.

@aaronpeters
Copy link

Got it, although I believe one can argue that user initiated aborts are also not errors.
A user navigates to a big fat page, lots of assets need to be fetched by the browser and the user is impatient and clicks the X to stop page loading. At that time, some network requests were still in flight.
The browser will send NEL report on those cancelled requests, but are these 'network errors'? There was nothing wrong on the network, not even a slow responding server (e.g. the user abort happened very shortly after the fetch started).

This said, the NEL report will clearly show what caused the 'error' (it was a user initiated abort) so the user of the NEL report cannot be confused about what happened, and the report will include elapsed_time so this is helpful in understanding if slowness was in play or not.

Just my 2 cents here.
The spec states user initiated aborts are in scope, and so that's clear.
Also, I do see real value in having insight into user initiated aborts. If an origin gets many of those, that is a strong signal users are not happy.

@igrigorik
Copy link
Member Author

We shouldn't fire errors for subresource fetches aborted by the application or as a result of user action — if that's not clear in the spec, we should clarify it. Both of these cases are WAI and can be tracked by the application itself.

We make one distinction for navigations+user abort because the page cannot, by definition, observe such aborted navigations and remains blind to potential issues.

@MattMenke2
Copy link

We don't have APIs to tell the network service the reason a request was cancelled. If we want to distinguish cancellation reasons, I don't think we want to tell the URLRequest itself the reason for cancellation, but rather have a separate API to explicitly tell NEL about the abort.

Also, the renderer doesn't know the reason for the cancellation of a partially received main frame body, so I'm not sure we can really distinguish between reasons for cancelling a partially rendered body (user abort vs web-initiated, like document.location in the middle of a page load - I assume we only cancel the body load when the new navigation commits, though I could be wrong).

@aaronpeters
Copy link

We shouldn't fire errors for subresource fetches aborted by the application or as a result of user action — if that's not clear in the spec, we should clarify it.

More clarity in the spec is welcome !

@aaronpeters
Copy link

Slow DNS, TCP or TLS handshake -> user abort -> failure report is sent (no body read)

Is it possible for the NEL report to make clear during which phase the abort happened?

E.g. my server is overloaded and can't handle new TCP connections.
Browsers are very patient during the TCP phase and will wait tens of seconds for a server's response. Users don't wait tens of seconds and abort the navigation.
It would be very helpful to receive tcp.abandoned reports instead of abandoned reports.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants