Skip to content

Show poll-failure details #868

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Oct 24, 2024
Merged

Show poll-failure details #868

merged 2 commits into from
Oct 24, 2024

Conversation

PIG208
Copy link
Member

@PIG208 PIG208 commented Aug 6, 2024

Ready for review.

This fixes #555.

@PIG208 PIG208 changed the title WIP Show polling failure details WIP Show poll-failure details Aug 6, 2024
@PIG208 PIG208 force-pushed the report-error branch 2 times, most recently from f712ca5 to 08f8815 Compare August 8, 2024 03:51
@PIG208
Copy link
Member Author

PIG208 commented Aug 8, 2024

A question to be answered: how do we determine if an error is transient?

Currently we handle 3 classes of errors in UpdateMachine:

  1. ZulipApiException: The event queue is lost. This is commonly caused by inactivity.
  2. Server5xxException, NetworkException: A server-side bug, or hiccups due to unstable Internet connection.
  3. default: Could be anything that happens in the client code during the getEvents call.

I think it is usually safe to assume that 1 is recoverable, while we are less certain about that for 2 and 3. An idea is to define a number of retries for the client to start displaying error messages. And then in a follow-up PR the client can try to do something different (which is #186).

@PIG208
Copy link
Member Author

PIG208 commented Aug 8, 2024

For non-transient errors, we will probably show a modal instead.

@PIG208 PIG208 force-pushed the report-error branch 2 times, most recently from d3a14ee to 54436ee Compare August 8, 2024 23:35
@gnprice
Copy link
Member

gnprice commented Aug 12, 2024

If there's a ZulipApiException other than the event queue being gone, then that's likely non-transient — it's either a bug in the client, or some kind of unexpected incompatibility between server and client. A retry is likely to just hit the same error again.

(Those fall into the default case.)

@gnprice
Copy link
Member

gnprice commented Aug 12, 2024

Other than that point, the description above all looks right.

Copy link
Member

@gnprice gnprice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Quick comments on the high-level behavior.)

Comment on lines 795 to 796
if (accumlatedTransientFailureCount > transientFailureCountNotifyThreshold) {
reportErrorToUserInDialog('Failed to reach server. Will retry: $e');
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's leave this case as a toast. It's a server 5xx or a network exception, so it's unlikely there's anything the client should be doing differently about it — no need to get in the user's way and potentially annoy them.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then let's start showing the toasts once it's been a few seconds since we started retrying. The simple way to do that is to make the count threshold about 5, rather than 10 — at the 6th failure, it's been an expected 1.55s since the first failure, and we'll be backing off for an expected 1.6s before the next retry.

@PIG208 PIG208 force-pushed the report-error branch 9 times, most recently from 837a6d0 to b64c99c Compare August 12, 2024 21:29
@PIG208 PIG208 marked this pull request as ready for review August 12, 2024 21:29
@PIG208 PIG208 added the maintainer review PR ready for review by Zulip maintainers label Aug 12, 2024
@PIG208 PIG208 requested a review from chrisbobbe August 12, 2024 21:30
@PIG208 PIG208 changed the title WIP Show poll-failure details Show poll-failure details Aug 12, 2024
Copy link
Collaborator

@chrisbobbe chrisbobbe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Comments below.

Then one other comment is, for all the new poll-failure feedback we're giving the user, let's identify which server it's about. This is particularly relevant when the issue is a network problem, but applies whenever the feedback can be significantly delayed. The user might have given up waiting for this account, like a minute or more ago, and is doing other things—maybe even in a different account—when the feedback appears.

lib/log.dart Outdated
@@ -30,3 +33,45 @@ bool debugLog(String message) {
}());
return true;
}

/// Display an error message.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about "Display an error message in a [SnackBar]." I like how the summary line on reportErrorToUserInDialog's dartdoc says how the error is displayed.

lib/log.dart Outdated
/// This shows a [SnackBar] containing the message after app startup,
/// otherwise logs it to the console.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we be more specific about "after app startup"? I can imagine confused readers wondering how the snackbar could be shown at any time other than after the app starts up.

lib/log.dart Outdated
/// This shows a [SnackBar] containing the message after app startup,
/// otherwise logs it to the console.
///
/// See also: [ZulipApp._reportErrorToUserBriefly]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a private method, and also kind of feels like an implementation detail. Is this the kind of information that callers need, to understand if reportErrorToUserBriefly is right for them, and what its expectations/guarantees are like? (Similarly with the dartdoc on reportErrorToUserInDialog.)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The primary goal of including this line is to have a back link to where the setters get used, since it is not obvious. I think it would be more appropriate to exclude that bit from the dart doct, though, considering the point that it's some implementation detail.

lib/log.dart Outdated
/// otherwise logs it to the console.
///
/// See also: [ZulipApp._reportErrorToUserBriefly]
void Function(String message) get reportErrorToUserBriefly => (message) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about reportErrorToUserInSnackbar, to be more informative, and more parallel with reportErrorToUserInDialog?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah that sounds like a good choice.

lib/log.dart Outdated

String? get debugLastReportedError => _debugLastReportedError;
String? _debugLastReportedError;
String? takeLastReportedError() {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be debugTakeLastReportedError?

// After app startup, reportErrorToUserInDialog displays a dialog.
reportErrorToUserInDialog('test error message');
await tester.pump();
check(finder.evaluate()).single;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could maybe be strengthened with checkErrorDialog; just a thought.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could be a bit repetitive, but it should be good to also check the title.

test('retries on Server5xxException', () {
checkRetry(() => connection.prepare(httpStatus: 500, body: 'splat'));
});
group('retries', () {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

store test [nfc]: Make a group for retry tests.

Let's leave off the [nfc] in the commit message: #870 (comment)

@@ -801,6 +801,7 @@ class UpdateMachine {
assert(debugLog('Error polling event queue for $store: $e\n'
'Backing off and retrying even though may be hopeless…'));
// TODO tell user on non-transient error in polling
reportErrorToUserInDialog('Error loading server data. Will retry: $e');
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

store: Reoprt non-transient polling errors.

commit-message nit: spelling

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also this TODO can be removed:

            // TODO tell user on non-transient error in polling

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, can this be a translated string? If that's not straightforward, let's include a TODO(i18n).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm. And for non-transient errors, we expect that retrying won't help, right? It seems like we should have a TODO to not retry. I'm also wondering if retrying and showing a dialog, for a task that's likely to be hopeless, will spam the user with the same dialog repetitively.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm. And for non-transient errors, we expect that retrying won't help, right? It seems like we should have a TODO to not retry. I'm also wondering if retrying and showing a dialog, for a task that's likely to be hopeless, will spam the user with the same dialog repetitively.

Yes, that's #186. Added a line mentioning that.

@PIG208 PIG208 force-pushed the report-error branch 2 times, most recently from 96cd1ec to a434078 Compare August 16, 2024 04:16
@PIG208
Copy link
Member Author

PIG208 commented Aug 16, 2024

The PR has been updated. Thanks for the review!

I'm not entirely sure about the quality of the translation strings. These messages will appear more technical at the current stage for the ease of debugging. However, polishing user facing errors might be a post-launch task.

@PIG208 PIG208 force-pushed the report-error branch 4 times, most recently from beac99b to 9cd37ea Compare September 12, 2024 20:54
@PIG208 PIG208 removed the integration review Added by maintainers when PR may be ready for integration label Sep 13, 2024
@PIG208 PIG208 marked this pull request as draft September 24, 2024 01:06
@PIG208 PIG208 marked this pull request as ready for review September 27, 2024 23:01
@PIG208 PIG208 added the integration review Added by maintainers when PR may be ready for integration label Sep 27, 2024
@PIG208 PIG208 requested a review from gnprice September 27, 2024 23:01
@PIG208
Copy link
Member Author

PIG208 commented Sep 28, 2024

Updated the PR to ignore NetworkException where the underlying cause is a SocketException, added some more documentation and cleaned up some diffs to make them easier to read. Feel free to install a build of this PR to test it out!

@gnprice
Copy link
Member

gnprice commented Oct 4, 2024

Thanks for the update! Installing on my phone now.

Copy link
Member

@gnprice gnprice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I've observed this on my phone for a while; and it's been pretty quiet, which is good.

After seeing it in action, I think we can leave out the last commit, with the "Reconnecting to {serverUrl}…" message. It feels redundant with the loading indicator at the top; and the fact that the toast disappears on its own schedule while the loading indicator can be still going seems incongruous, and likely to be confusing.

Otherwise this generally all looks good; just a few comments below.

(I also feel like the UpdateMachine.poll method has gotten kind of unwieldy; maybe I'll take a stab at reorganizing that after merging this PR. It's already kind of unwieldy in main.)

Comment on lines 517 to 535
if (i != numFailedRequests - 1) {
// Skip polling backoff unless this is the final iteration.
// This allows the next `updateMachine.debugAdvanceLoop` call to
// trigger the next request without wait.
async.flushTimers();
}

if (!expectNotifyError) {
// No matter how many retries there have been, no request should
// result in an error message.
check(takeLastReportedError()).isNull();
continue;
}
if (i < numFailedRequests - 1) {
// The error message should not appear until the `updateMachine`
// has retried the given number of times.
check(takeLastReportedError()).isNull();
continue;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logic in this test (or set of tests) gets fairly tangly to follow.

I think the main thing is that it'd be helpful to test two aspects separately: there's the existing test cases that just check that we retry, and let's have separate new test cases that just check the error-reporting.

Then I think both sets of tests can be a lot simpler than this — most of the logic is only relevant for one aspect or the other. It's OK that some other fragments of code will be duplicated between them.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Separately and more tactically, I think it'd help to pull the last iteration here outside of the loop — have the loop only concern itself with the first N-1 requests. That would avoid these multiple conditionals about whether it's the last iteration.

Comment on lines 857 to 865
if (e.cause is! SocketException) {
// Heuristic check to only report interesting errors to the user.
// This currently ignores [SocketException], which typically
// occurs when the client goes back from sleep.
// TODO: Investigate if there are other cases of [SocketException]
// that might turn out to be interesting.
//
// See also:
// * [NetworkException.cause], which is the underlying exception.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if (e.cause is! SocketException) {
// Heuristic check to only report interesting errors to the user.
// This currently ignores [SocketException], which typically
// occurs when the client goes back from sleep.
// TODO: Investigate if there are other cases of [SocketException]
// that might turn out to be interesting.
//
// See also:
// * [NetworkException.cause], which is the underlying exception.
if (e.cause is! SocketException) {
// Heuristic check to only report interesting errors to the user.
// A [SocketException] is common when the app returns from sleep.

I think the TODO isn't actionable.

The "See also" isn't needed — .cause is just above, and the reader can follow that to its definition using an IDE.

Then while at it I tightened the sentence about SocketException. No need to say what the code does, as the code here speaks for itself; only to explain why we chose to make the code do that.

@PIG208 PIG208 force-pushed the report-error branch 2 times, most recently from 5f75ee2 to 6c58f92 Compare October 21, 2024 20:38
@PIG208
Copy link
Member Author

PIG208 commented Oct 21, 2024

Thanks for the feedback! Dropped the last commit and previous changes relevant to the tests. A new group of tests is now responsible for testing the error reporting feature.

@PIG208 PIG208 requested a review from gnprice October 21, 2024 20:44
Copy link
Member

@gnprice gnprice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the revision! Just small comments below, all in the tests.

And as a note to myself, there's a followup I mentioned at #868 (review) I might do after merge.

@@ -652,6 +655,90 @@ void main() {
test('retries on MalformedServerResponseException', () {
checkRetry(() => connection.prepare(httpStatus: 200, body: 'nonsense'));
});

group('reportErrorToUserBriefly', () {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit:

Suggested change
group('reportErrorToUserBriefly', () {
group('error reporting', () {

Otherwise it sounds like these are tests of reportErrorToUserBriefly.

Comment on lines 685 to 686
late void Function() prepareError;

void pollAndFail(FakeAsync async) {
prepareError();
updateMachine.debugAdvanceLoop();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
late void Function() prepareError;
void pollAndFail(FakeAsync async) {
prepareError();
updateMachine.debugAdvanceLoop();
void pollAndFail(FakeAsync async) {
updateMachine.debugAdvanceLoop();

Instead the callers can say connection.prepare(…) (i.e., the body of what they would set prepareError to) just before each call to pollAndFail. That isn't really any longer — it makes a very slight duplication of the content of those connection.prepare calls — and it makes the control flow substantially more straightforward.

Comment on lines 714 to 712
// This would skip the pending polling backoff.
async.flushTimers();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// This would skip the pending polling backoff.
async.flushTimers();
// This skips the pending polling backoff.
async.flushTimers();

I.e. this does skip that backoff, right?

Or am I missing a subtlety where it doesn't actually do so, but only would do so if something were different?

@PIG208
Copy link
Member Author

PIG208 commented Oct 23, 2024

Thanks! Ready for review now.

We delay reporting transient errors until we have retried a certain
number of times, because they might recover on their own.

Signed-off-by: Zixuan James Li <[email protected]>
Usually when a client goes back from sleep or lose network connection,
there will be a bunch of errors when polling event queue.  This known
error will always be a `NetworkException`, so we make a separate case
for it.

Previously, we may report these errors, and it ended up being spammy.
This filters out the more interesting one to report instead.

Fixes: zulip#555

Signed-off-by: Zixuan James Li <[email protected]>
@gnprice gnprice merged commit 33c365a into zulip:main Oct 24, 2024
@gnprice
Copy link
Member

gnprice commented Oct 24, 2024

Looks good — merging.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
integration review Added by maintainers when PR may be ready for integration
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Show detailed poll-failure feedback, in beta
3 participants