
Gather feedback on possible reasons for the add-on failing to trigger a payload send event #26

Open
motin opened this issue May 2, 2018 · 19 comments


motin commented May 2, 2018

Background:
We are seeing considerably fewer pings ending up in our study's parquet table than the Normandy recipe's DAU estimates would suggest. We are now exploring possible sources of this discrepancy.

This issue is created to gather feedback on possible reasons for the add-on failing to trigger a payload send event.

After Normandy installs the add-on, the startup method in bootstrap.js is the first code to run.

Possible reasons gathered so far:
a. The add-on throws fatal errors in the lines before the eligibility check, where pioneer-utils is initialized (errors of a kind that surfaced during neither development nor QA)
b. The add-on's double-checking of Normandy's eligibility targets is not in sync with the Normandy recipe used: (normandy.addons['[email protected]'].isActive && normandy.version >= 59.0)
c. Normandy installs do not set the installation reason to install or upgrade (https://github.com/motin/esper-pioneer-shield-study/blob/1.0.4/addon/bootstrap.js#L62; see the sketch below)
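For reference, a minimal sketch of the gate that reason (c) refers to, using the standard bootstrapped-extension reason codes; the exact code at the linked line may differ:

// Reason codes the add-on manager passes to startup()
// (standard bootstrapped-extension constants).
const APP_STARTUP = 1;
const ADDON_INSTALL = 5;
const ADDON_UPGRADE = 7;

function startup(addonData, reason) {
  // If Normandy ever starts the add-on with a different reason
  // (e.g. APP_STARTUP after a browser restart), this gate would
  // silently skip the study without sending any ping.
  if (reason !== ADDON_INSTALL && reason !== ADDON_UPGRADE) {
    return;
  }
  // ...proceed with the eligibility check and study start.
}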

@pdehaan, @rhelmer, @rehandalal: can you spot any other potential reasons, stemming from the add-on implementation, pioneer-utils usage, etc.?

mlopatka commented May 2, 2018

FYI: @jtg567 @sunahsuh

@rehandalal

There's nothing that immediately stands out.

mlopatka commented May 2, 2018

@gregglind @raymak In a discussion with @jtg567, we thought that since we are not getting any endStudy pings at all, either the add-on is hard-failing or targeting is not happening correctly. Does either of you have any insight into what may be going wrong with the Normandy/targeting portion of this?

mlopatka commented May 2, 2018

Adding @dzeber to the party!

pdehaan commented May 2, 2018

I can't explain any whys, but it may be worth adding a few try..catch blocks around any async+await blocks, which may help us figure out where this is happening (a sketch follows the list below).

Looking at $ git grep "async" addon/, I only see 7 potential candidates:

  1. addon/bootstrap.js: async startup(addonData, reason) {...
  2. addon/lib/Pioneer.jsm: async startup(addonData) {...
  3. addon/lib/StudyTelemetryCollector.jsm: async start() {...
  4. addon/lib/StudyTelemetryCollector.jsm: async telemetry(schemaName, schemaVersion, payload) {...
  5. addon/lib/StudyTelemetryCollector.jsm: async collectAndSendTelemetry() {...
  6. addon/lib/StudyTelemetryCollector.jsm: static async collectPlacesDbBasedAttributes() {...
  7. addon/lib/StudyTelemetryCollector.jsm: static async queryPlacesDbTelemetry() {...
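For illustration, a minimal sketch of that wrapping applied to candidate 1; the constructor arguments and the sendErrorPing helper are assumptions, not actual pioneer-utils API:

// addon/bootstrap.js (sketch): make failures in startup() visible.
async function startup(addonData, reason) {
  try {
    await Pioneer.startup(addonData);
    await new StudyTelemetryCollector(Pioneer.utils).start();
  } catch (err) {
    // Hypothetical helper: report the error so failures in the wild
    // show up in telemetry instead of being swallowed.
    await sendErrorPing({ message: err.message, stack: err.stack });
  }
}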

rhelmer commented May 2, 2018

Hm, one thing I noticed is that https://github.com/motin/esper-pioneer-shield-study/blob/1.0.4/addon/bootstrap.js#L57 is async but there's no await... since this is doing an early return in startup() could it be that the extension gets torn down before Pioneer.utils.endStudy has a chance to run?
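In sketch form, with the eligibility flag and reason string as placeholders (the actual code at the linked line may differ):

// Early-return path in startup(), sketch:
if (!eligible) {
  // Without this await, startup() returns immediately and the
  // extension sandbox may be torn down before the ping is sent.
  await Pioneer.utils.endStudy("ineligible");
  return;
}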

rhelmer commented May 2, 2018

FWIW I took a look at how extension tear-down works: each extension runs in its own Cu.Sandbox, and when the extension is shut down, it is removed from the list of active extensions and the sandbox is eventually GC'd.

Honestly not sure if this confirms my theory or not :) Should be pretty easy to test locally, although if GC is involved then it may not be consistent and should be tested over multiple runs.

motin commented May 3, 2018

@rhelmer Good catch! Issue reported against pioneer-utils. This could definitely affect the sending of endStudy pings.

motin commented May 3, 2018

Chatted with Rehan, who also believes that (b) is definitely not the issue and (c) seems highly unlikely. The only way to understand (a) better is to do as @pdehaan suggests: more try/catch. Sending event pings for caught errors will give us more insight into errors thrown in the wild (i.e. not detected during QA/development).

motin commented May 3, 2018

Action plan for a re-launched esper add-on (a sketch of the expiry/end logic follows the list):

  • Send an install event directly upon study add-on installation so that we can confirm how many installed add-ons are able to send pioneer event telemetry (EVENTS.INSTALL) (Depends on Add EVENTS.INSTALLED pioneer-utils#41)
  • Expire the study if no payload was submitted within 30 min (EVENTS.EXPIRED)
  • End the study after the ordinary payload is submitted (EVENTS.ENDED_POSITIVE)
  • Await endStudy() so that endStudy events get sent more reliably (depends on Async/await for endStudy and submitEventPing pioneer-utils#40)
  • Add more try/catch blocks and let caught errors end the study (EVENTS.ENDED_NEGATIVE)
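A minimal sketch of the expiry and positive-end items above, assuming setTimeout from Timer.jsm and that endStudy() accepts an event name; the exact pioneer-utils calls may differ:

const { setTimeout, clearTimeout } =
  Cu.import("resource://gre/modules/Timer.jsm", {});

// Expire the study if no payload has gone out within 30 minutes.
const EXPIRY_MS = 30 * 60 * 1000;
const expiryTimer = setTimeout(() => {
  Pioneer.utils.endStudy(EVENTS.EXPIRED);
}, EXPIRY_MS);

// Called once the ordinary payload has been submitted successfully.
async function onPayloadSubmitted() {
  clearTimeout(expiryTimer);
  await Pioneer.utils.endStudy(EVENTS.ENDED_POSITIVE);
}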

motin commented May 3, 2018

@mlopatka Looks good enough?

mlopatka commented May 3, 2018

@motin R+ on the action plan. Let's aim to relaunch (again) in 2 weeks (assuming another pass through QA).

dzeber commented May 3, 2018

I've been approaching from the opposite direction, looking into Telemetry data for Pioneers who did and didn't install ESPER, and I've got a lead on the Normandy targeting issue.

It looks like ESPER is only getting installed when a profile is on a major release ("59.0") but not when on a dot release ("59.0.3").

  • Pioneers who were exclusively on dot releases during the launch week did not get ESPER. This accounts for the vast majority of Pioneer profiles, who were on a combination of 59.0.2 and 59.0.3.
  • ESPER is only installed on profiles which had at least one subsession on a major version (e.g. 59.0) during the week (even if they updated to a dot release later on).

Evidence is here, scrolling down to cells 49, 52 and 56.

Another issue is that a majority of ESPER enrollees are on prerelease (~90%), and it got installed while they were on version 60.0 (cells 64, 73).

  • The majority of Pioneer profiles (~98%) are on release.
  • However, ESPER appears to have been mainly shipped to the 2% that are on prerelease. This may just be a result of prerelease being more likely to be on a major version number (60.0) than release, and not a separate issue.
  • Nonetheless, this is not supposed to happen given the eligibility double-checking.

dzeber commented May 3, 2018

Also, it looks like the number of ESPER pings received in the Pioneer pipeline is very close to the number of profiles reporting the ESPER add-on installed in standard Telemetry.

This suggests that there is not a significant problem with the add-on sending pings beyond the usual failures. That said, I think the additional instrumentation and validation recommended by @motin is still a great idea.

mlopatka commented May 4, 2018

@jtg567, based on the findings listed above:

It looks like ESPER is only getting installed when a profile is on a major release ("59.0") but not when on a dot release ("59.0.3").

Is there any way to tell if the delivery of ESPER exclusively to major releases and not dot releases could be a Normandy targeting issue?

jtg567 commented May 4, 2018

@mythmon

mythmon commented May 4, 2018

I checked the filter expression for the study. It is this:

(
    normandy.addons['[email protected]'].isActive &&
    normandy.version >= 59.0
)

At first glance, it looks fine. But there is a subtle and very important problem. normandy.version is a string, and it is being compared to a float. This can sometimes be OK. Filter expressions use a lot of JS semantics, including comparison semantics. So in a filter expression "59.0" >= 59.0 will try to coerce the types together, which results in 59.0 >= 59.0, which matches as desired.

Unfortunately these semantics only go so far. On a dot release, normandy.version would have a value like "59.0.2". When comparing "59.0.2" >= 59.0, the left side will again be converted to a number to make the comparison. Unfortunately, in JS Number("59.0.2") is NaN. All comparisons with NaN result in false, so the filter fails to match.

The fix is to change the filter expression to use normandy.version >= "59.0" (note that the right-hand value is now a string). Since both sides are now strings, they will be compared lexicographically, and the match will behave as expected.
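The behavior is easy to confirm in a JS console:

"59.0"   >= 59.0;   // true:  "59.0" coerces to the number 59
"59.0.2" >= 59.0;   // false: Number("59.0.2") is NaN, and NaN comparisons are always false
"59.0.2" >= "59.0"; // true:  both strings, compared lexicographically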

jtg567 commented May 4, 2018

I revised the recipe on Normandy. Not sure why I count more clients with the ESPER add-on than @dzeber, though; I'm not yet sure the recipe was the only issue.

dzeber commented May 8, 2018

@jtg567 FWIW it looks like you're counting subsessions (main_summary rows) rather than client IDs.
