-
Notifications
You must be signed in to change notification settings - Fork 306
Show poll-failure details #868
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
f712ca5
to
08f8815
Compare
A question to be answered: how do we determine if an error is transient? Currently we handle 3 classes of errors in
I think it is usually safe to assume that 1 is recoverable, while we are less certain about that for 2 and 3. An idea is to define a number of retries for the client to start displaying error messages. And then in a follow-up PR the client can try to do something different (which is #186). |
For non-transient errors, we will probably show a modal instead. |
d3a14ee
to
54436ee
Compare
If there's a (Those fall into the |
Other than that point, the description above all looks right. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Quick comments on the high-level behavior.)
lib/model/store.dart
Outdated
if (accumlatedTransientFailureCount > transientFailureCountNotifyThreshold) { | ||
reportErrorToUserInDialog('Failed to reach server. Will retry: $e'); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's leave this case as a toast. It's a server 5xx or a network exception, so it's unlikely there's anything the client should be doing differently about it — no need to get in the user's way and potentially annoy them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then let's start showing the toasts once it's been a few seconds since we started retrying. The simple way to do that is to make the count threshold about 5, rather than 10 — at the 6th failure, it's been an expected 1.55s since the first failure, and we'll be backing off for an expected 1.6s before the next retry.
837a6d0
to
b64c99c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! Comments below.
Then one other comment is, for all the new poll-failure feedback we're giving the user, let's identify which server it's about. This is particularly relevant when the issue is a network problem, but applies whenever the feedback can be significantly delayed. The user might have given up waiting for this account, like a minute or more ago, and is doing other things—maybe even in a different account—when the feedback appears.
lib/log.dart
Outdated
@@ -30,3 +33,45 @@ bool debugLog(String message) { | |||
}()); | |||
return true; | |||
} | |||
|
|||
/// Display an error message. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about "Display an error message in a [SnackBar]." I like how the summary line on reportErrorToUserInDialog
's dartdoc says how the error is displayed.
lib/log.dart
Outdated
/// This shows a [SnackBar] containing the message after app startup, | ||
/// otherwise logs it to the console. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we be more specific about "after app startup"? I can imagine confused readers wondering how the snackbar could be shown at any time other than after the app starts up.
lib/log.dart
Outdated
/// This shows a [SnackBar] containing the message after app startup, | ||
/// otherwise logs it to the console. | ||
/// | ||
/// See also: [ZulipApp._reportErrorToUserBriefly] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a private method, and also kind of feels like an implementation detail. Is this the kind of information that callers need, to understand if reportErrorToUserBriefly
is right for them, and what its expectations/guarantees are like? (Similarly with the dartdoc on reportErrorToUserInDialog
.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The primary goal of including this line is to have a back link to where the setters get used, since it is not obvious. I think it would be more appropriate to exclude that bit from the dart doct, though, considering the point that it's some implementation detail.
lib/log.dart
Outdated
/// otherwise logs it to the console. | ||
/// | ||
/// See also: [ZulipApp._reportErrorToUserBriefly] | ||
void Function(String message) get reportErrorToUserBriefly => (message) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about reportErrorToUserInSnackbar
, to be more informative, and more parallel with reportErrorToUserInDialog
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah that sounds like a good choice.
lib/log.dart
Outdated
|
||
String? get debugLastReportedError => _debugLastReportedError; | ||
String? _debugLastReportedError; | ||
String? takeLastReportedError() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be debugTakeLastReportedError
?
test/widgets/app_test.dart
Outdated
// After app startup, reportErrorToUserInDialog displays a dialog. | ||
reportErrorToUserInDialog('test error message'); | ||
await tester.pump(); | ||
check(finder.evaluate()).single; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could maybe be strengthened with checkErrorDialog
; just a thought.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could be a bit repetitive, but it should be good to also check the title.
test/model/store_test.dart
Outdated
test('retries on Server5xxException', () { | ||
checkRetry(() => connection.prepare(httpStatus: 500, body: 'splat')); | ||
}); | ||
group('retries', () { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
store test [nfc]: Make a group for retry tests.
Let's leave off the [nfc]
in the commit message: #870 (comment)
lib/model/store.dart
Outdated
@@ -801,6 +801,7 @@ class UpdateMachine { | |||
assert(debugLog('Error polling event queue for $store: $e\n' | |||
'Backing off and retrying even though may be hopeless…')); | |||
// TODO tell user on non-transient error in polling | |||
reportErrorToUserInDialog('Error loading server data. Will retry: $e'); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
store: Reoprt non-transient polling errors.
commit-message nit: spelling
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also this TODO can be removed:
// TODO tell user on non-transient error in polling
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, can this be a translated string? If that's not straightforward, let's include a TODO(i18n)
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm. And for non-transient errors, we expect that retrying won't help, right? It seems like we should have a TODO to not retry. I'm also wondering if retrying and showing a dialog, for a task that's likely to be hopeless, will spam the user with the same dialog repetitively.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm. And for non-transient errors, we expect that retrying won't help, right? It seems like we should have a TODO to not retry. I'm also wondering if retrying and showing a dialog, for a task that's likely to be hopeless, will spam the user with the same dialog repetitively.
Yes, that's #186. Added a line mentioning that.
96cd1ec
to
a434078
Compare
The PR has been updated. Thanks for the review! I'm not entirely sure about the quality of the translation strings. These messages will appear more technical at the current stage for the ease of debugging. However, polishing user facing errors might be a post-launch task. |
beac99b
to
9cd37ea
Compare
9cd37ea
to
63293f6
Compare
Updated the PR to ignore |
63293f6
to
90370a7
Compare
Thanks for the update! Installing on my phone now. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I've observed this on my phone for a while; and it's been pretty quiet, which is good.
After seeing it in action, I think we can leave out the last commit, with the "Reconnecting to {serverUrl}…" message. It feels redundant with the loading indicator at the top; and the fact that the toast disappears on its own schedule while the loading indicator can be still going seems incongruous, and likely to be confusing.
Otherwise this generally all looks good; just a few comments below.
(I also feel like the UpdateMachine.poll
method has gotten kind of unwieldy; maybe I'll take a stab at reorganizing that after merging this PR. It's already kind of unwieldy in main.)
test/model/store_test.dart
Outdated
if (i != numFailedRequests - 1) { | ||
// Skip polling backoff unless this is the final iteration. | ||
// This allows the next `updateMachine.debugAdvanceLoop` call to | ||
// trigger the next request without wait. | ||
async.flushTimers(); | ||
} | ||
|
||
if (!expectNotifyError) { | ||
// No matter how many retries there have been, no request should | ||
// result in an error message. | ||
check(takeLastReportedError()).isNull(); | ||
continue; | ||
} | ||
if (i < numFailedRequests - 1) { | ||
// The error message should not appear until the `updateMachine` | ||
// has retried the given number of times. | ||
check(takeLastReportedError()).isNull(); | ||
continue; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The logic in this test (or set of tests) gets fairly tangly to follow.
I think the main thing is that it'd be helpful to test two aspects separately: there's the existing test cases that just check that we retry, and let's have separate new test cases that just check the error-reporting.
Then I think both sets of tests can be a lot simpler than this — most of the logic is only relevant for one aspect or the other. It's OK that some other fragments of code will be duplicated between them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Separately and more tactically, I think it'd help to pull the last iteration here outside of the loop — have the loop only concern itself with the first N-1 requests. That would avoid these multiple conditionals about whether it's the last iteration.
lib/model/store.dart
Outdated
if (e.cause is! SocketException) { | ||
// Heuristic check to only report interesting errors to the user. | ||
// This currently ignores [SocketException], which typically | ||
// occurs when the client goes back from sleep. | ||
// TODO: Investigate if there are other cases of [SocketException] | ||
// that might turn out to be interesting. | ||
// | ||
// See also: | ||
// * [NetworkException.cause], which is the underlying exception. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if (e.cause is! SocketException) { | |
// Heuristic check to only report interesting errors to the user. | |
// This currently ignores [SocketException], which typically | |
// occurs when the client goes back from sleep. | |
// TODO: Investigate if there are other cases of [SocketException] | |
// that might turn out to be interesting. | |
// | |
// See also: | |
// * [NetworkException.cause], which is the underlying exception. | |
if (e.cause is! SocketException) { | |
// Heuristic check to only report interesting errors to the user. | |
// A [SocketException] is common when the app returns from sleep. |
I think the TODO isn't actionable.
The "See also" isn't needed — .cause
is just above, and the reader can follow that to its definition using an IDE.
Then while at it I tightened the sentence about SocketException. No need to say what the code does, as the code here speaks for itself; only to explain why we chose to make the code do that.
5f75ee2
to
6c58f92
Compare
Thanks for the feedback! Dropped the last commit and previous changes relevant to the tests. A new group of tests is now responsible for testing the error reporting feature. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the revision! Just small comments below, all in the tests.
And as a note to myself, there's a followup I mentioned at #868 (review) I might do after merge.
test/model/store_test.dart
Outdated
@@ -652,6 +655,90 @@ void main() { | |||
test('retries on MalformedServerResponseException', () { | |||
checkRetry(() => connection.prepare(httpStatus: 200, body: 'nonsense')); | |||
}); | |||
|
|||
group('reportErrorToUserBriefly', () { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit:
group('reportErrorToUserBriefly', () { | |
group('error reporting', () { |
Otherwise it sounds like these are tests of reportErrorToUserBriefly
.
test/model/store_test.dart
Outdated
late void Function() prepareError; | ||
|
||
void pollAndFail(FakeAsync async) { | ||
prepareError(); | ||
updateMachine.debugAdvanceLoop(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
late void Function() prepareError; | |
void pollAndFail(FakeAsync async) { | |
prepareError(); | |
updateMachine.debugAdvanceLoop(); | |
void pollAndFail(FakeAsync async) { | |
updateMachine.debugAdvanceLoop(); |
Instead the callers can say connection.prepare(…)
(i.e., the body of what they would set prepareError
to) just before each call to pollAndFail
. That isn't really any longer — it makes a very slight duplication of the content of those connection.prepare
calls — and it makes the control flow substantially more straightforward.
test/model/store_test.dart
Outdated
// This would skip the pending polling backoff. | ||
async.flushTimers(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// This would skip the pending polling backoff. | |
async.flushTimers(); | |
// This skips the pending polling backoff. | |
async.flushTimers(); |
I.e. this does skip that backoff, right?
Or am I missing a subtlety where it doesn't actually do so, but only would do so if something were different?
Thanks! Ready for review now. |
We delay reporting transient errors until we have retried a certain number of times, because they might recover on their own. Signed-off-by: Zixuan James Li <[email protected]>
Usually when a client goes back from sleep or lose network connection, there will be a bunch of errors when polling event queue. This known error will always be a `NetworkException`, so we make a separate case for it. Previously, we may report these errors, and it ended up being spammy. This filters out the more interesting one to report instead. Fixes: zulip#555 Signed-off-by: Zixuan James Li <[email protected]>
Looks good — merging. |
Ready for review.
This fixes #555.