Reduce CI and Local Flakiness #2496

Merged: 59 commits merged into master from alex/ci/flakiness on Feb 15, 2024
Conversation

@PROFeNoM (Contributor) commented Jan 30, 2024

Description

TL;DR: Flakiness is not meant to disappear with this PR (and it won't). Randomized tests are still flaky and were already the greatest source of flakiness. The build pipeline should improve significantly, while build_packages will improve to a lesser extent because of residual, unresolved flakiness in the randomized tests. Introducing a retry mechanism when sending traces considerably reduces local flakiness.

Note: Not all numbers can be determined yet, since some CI failures were caused by unintentional memory leaks, syntax errors, and the like. Even including these outliers, the pipeline success rate is above 80% (compared to ~65% on master). The PR will reduce flakiness, but not eliminate it.

Let's have it run on the master branch to get some numbers for next week and identify actionable issues.

Sources of Improvements

  • The introduction of DD_TRACE_AGENT_RETRIES (default: 0) reduced flakiness locally and, to a lesser extent, on the CI.

    • When sending traces using the background sender, curl_easy_perform would sometimes time out
    • Re-initializing the writer's cURL handle and retrying the send effectively reduces the number of traces that never reach the trace agent, a major source of local flakiness (a minimal sketch of the retry idea follows at the end of this list)
    • Note that simply re-executing curl_easy_perform doesn't fix the issue
    • Three retries are performed in the tests, and DD_TRACE_SHUTDOWN_TIMEOUT is greatly increased so that it doesn't conflict with DD_TRACE_BGS_TIMEOUT_VAL
    • Note that trying to use the sidecar didn't work.
  • integration-test_integrations_phpredis5-8.[0|1|2|3]-nts

    • State: The success rate of these individual jobs was 91-95%
    • Results: On this branch, it is now 100%
    • Fix: Determine master nodes on the fly
      • For an unknown reason, the initial master nodes would sometimes be inaccessible/down.
  • randomized_tests...

    • State: The success rate of these individual jobs was 88-93%
    • Fix: Increase the resource class for some jobs; reduce concurrency; change the test sequence to start/shut down base containers only once; change the nginx conf files to handle more/bigger requests; exclude the profiler on PHP 7.0
      • Some randomized tests weren't using the xlarge resource class; switching to it reduces the relative load on the tested instance and, therefore, the number of missing data points
      • RAM would get overloaded
      • Race condition during base docker shutdown
        • For instance, in this job, the base container is shut down because no randomized test appears to be running... but the other one was just about to start.
      • Sometimes, randomized tests would fail for apparently no reason (see this job). All tests exited with code 0, yet the job failed with error code 2. It was a race condition
      • Sometimes, the job would fail on "Waiting for elasticsearch". This probably happens because of the race condition explained above (see error)
      • Bad Gateways, most likely caused by too many requests or too large a header/body.
      • The profiler would be loaded on PHP 7.0, where it doesn't exist:
PHP Warning:  PHP Startup: Unable to load dynamic library '/opt/php/7.0/lib/php/extensions/no-debug-non-zts-20151012/datadog-profiling.so' - /opt/php/7.0/lib/php/extensions/no-debug-non-zts-20151012/datadog-profiling.so: cannot open shared object file: No such file or directory in Unknown on line 0
  • integration-test_integrations-X.X
    • Remove the flaky redirection endpoint (google.com → www.google.com) from the testMultiExec test case in the Guzzle test suite.


  • Common failure: context deadline exceeded
    • Each time this failure happened (e.g., in such jobs), RAM and CPU usage was maxed out.
    • Increasing the resource class, reducing concurrency (randomized tests only), and preventing the startup of unused docker containers should help with this issue. Typically, we don't need an SQL Server docker container when running the PHPRedis tests.
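
For reference, the retry idea from the first bullet boils down to tearing down and re-creating the handle between attempts instead of reusing it (simply re-running the perform call on the same handle wasn't enough). The real change lives in the C extension's background sender; the snippet below is only a minimal PHP sketch of the pattern, and the sendTraces() helper, agent URL, timeout, and retry count are illustrative assumptions, not the actual implementation.

<?php
// Minimal sketch of the retry-with-reinit idea (illustration only; the real
// change is in the C extension's background sender). The endpoint, timeout,
// and retry count below are hypothetical.
function sendTraces(string $payload, int $maxRetries = 3): bool
{
    $url = 'http://localhost:8126/v0.4/traces'; // hypothetical agent endpoint

    for ($attempt = 0; $attempt <= $maxRetries; $attempt++) {
        // Re-create a fresh cURL handle on every attempt; re-running curl_exec()
        // on the same handle did not fix the timeouts.
        $ch = curl_init($url);
        curl_setopt_array($ch, [
            CURLOPT_POST           => true,
            CURLOPT_POSTFIELDS     => $payload,
            CURLOPT_HTTPHEADER     => ['Content-Type: application/msgpack'],
            CURLOPT_TIMEOUT_MS     => 500,
            CURLOPT_RETURNTRANSFER => true,
        ]);

        $ok = curl_exec($ch) !== false && curl_errno($ch) === 0;
        curl_close($ch);

        if ($ok) {
            return true; // the trace reached the agent
        }
    }
    return false; // give up after the configured number of retries
}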

Unresolved

  • randomized_tests...
    • These jobs are still flaky
    • Core dumps not related to dd-trace-php.
      • This happens mostly in 7.0. To reduce flakiness, we could remove these... but I don't think that makes sense
    • The MySQL container would sometimes be inaccessible (despite passing the “wait-for” step). Either the “wait-for” step is wrong, or something else is making it inaccessible. It appeared only once.
[24-Jan-2024 10:44:55 UTC] Handling Error: 2 - mysqli_connect(): php_network_getaddresses: getaddrinfo for mysql failed: Name or service not known
[24-Jan-2024 10:44:55 UTC] Unexpected Exception: mysqli_sql_exception: php_network_getaddresses: getaddrinfo for MySQL failed: Name or service not known in /var/www/html/src/RandomizedTests/Snippets.php:141
Stack trace:
#0 /var/www/html/src/RandomizedTests/Snippets.php(141): mysqli_connect('mysql', 'test', Object(SensitiveParameterValue), 'test', 3306)
#1 /var/www/html/src/RandomizedTests/Snippets.php(60): RandomizedTests\Snippets->mysqliVariant1()
#2 /var/www/html/src/RandomizedTests/RandomExecutionPath.php(176): RandomizedTests\Snippets->runSomeIntegrations()
#3 /var/www/html/src/RandomizedTests/RandomExecutionPath.php(168): RandomizedTests\RandomExecutionPath->runSomeIntegrations()
#4 /var/www/html/src/RandomizedTests/RandomExecutionPath.php(151): RandomizedTests\RandomExecutionPath->maybeRunSomeIntegrations()
#5 /var/www/html/src/RandomizedTests/RandomExecutionPath.php(113): RandomizedTests\RandomExecutionPath->doSomethingUntraced2()
#6 /var/www/html/src/RandomizedTests/RandomExecutionPath.php(69): RandomizedTests\RandomExecutionPath->doSomethingUntraced1()
#7 /var/www/html/src/RandomizedTests/RandomExecutionPath.php(55): RandomizedTests\RandomExecutionPath->doSomethingUntraced()
#8 /var/www/html/public/long_running_script.php(58): RandomizedTests\RandomExecutionPath->randomPath()
#9 /var/www/html/public/long_running_script.php(101): Randomized\LongRunning\ProcessingStage1->process(Object(Randomized\LongRunning\Message))
#10 /var/www/html/public/long_running_script.php(152): Randomized\LongRunning\processMessage(Object(Randomized\LongRunning\Message), Array)
#11 {main}
  • Web tests (or any tests using composer)
    • Some tests fail when performing composer update/install (e.g., this job failing while extracting a package, or various connection failures).


Reviewer checklist

  • Test coverage seems ok.
  • Appropriate labels assigned.

@PROFeNoM PROFeNoM added the ci label Jan 30, 2024
@PROFeNoM PROFeNoM self-assigned this Jan 30, 2024
@codecov-commenter commented Jan 31, 2024

Codecov Report

Merging #2496 (2254c3d) into master (8ed91bf) will decrease coverage by 2.03%.
Report is 1 commit behind head on master.
The diff coverage is 89.74%.

Additional details and impacted files


@@             Coverage Diff              @@
##             master    #2496      +/-   ##
============================================
- Coverage     78.65%   76.63%   -2.03%     
  Complexity      267      267              
============================================
  Files           110      136      +26     
  Lines         13376    17549    +4173     
  Branches          0     1020    +1020     
============================================
+ Hits          10521    13448    +2927     
- Misses         2855     3568     +713     
- Partials          0      533     +533     
Flag                  Coverage Δ
appsec-extension      70.05% <ø> (?)
tracer-extension      78.62% <89.74%> (+0.02%) ⬆️
tracer-integrations   79.49% <ø> (ø)

Flags with carried forward coverage won't be shown.

Files                 Coverage Δ
ext/configuration.h   100.00% <ø> (ø)
ext/ddtrace.h         62.50% <ø> (ø)
ext/coms.c            80.13% <93.33%> (+0.29%) ⬆️
ext/ddtrace.c         72.85% <77.77%> (+0.01%) ⬆️

... and 26 files with indirect coverage changes


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 8ed91bf...2254c3d.

@pr-commenter bot commented Jan 31, 2024

Benchmarks

Benchmark execution time: 2024-02-14 15:21:08

Comparing candidate commit 2254c3d in PR branch alex/ci/flakiness with baseline commit 8ed91bf in branch master.

Found 2 performance improvements and 0 performance regressions. Performance is unchanged for 37 metrics; 51 metrics are unstable.

scenario:PHPRedisBench/benchRedisBaseline

  • 🟩 execution_time [-2.921ms; -2.825ms] or [-97.133%; -93.917%]

scenario:PHPRedisBench/benchRedisOverhead

  • 🟩 execution_time [-3.014ms; -2.870ms] or [-93.278%; -88.824%]

@DataDog DataDog deleted a comment from github-actions bot Jan 31, 2024
@PROFeNoM PROFeNoM marked this pull request as ready for review January 31, 2024 09:46
@PROFeNoM PROFeNoM requested a review from a team as a code owner January 31, 2024 09:46
@PROFeNoM PROFeNoM marked this pull request as draft February 1, 2024 08:17
@PROFeNoM PROFeNoM marked this pull request as ready for review February 1, 2024 14:15
@bwoebi (Collaborator) commented Feb 7, 2024

Regarding web tests:

  • Simply retrying composer should probably be unproblematic, I guess?

Regarding randomized:

  • Core dumps from PHP 7.0: is this a particular SAPI? extension? i.e. something we can specifically exclude for PHP 7.0?
  • Can we specifically catch mysqli_sql_exception around the mysqli constructor/mysqli_connect, to just skip this particular path in that run? (a possible sketch follows below)
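
A possible shape for that second suggestion, as a sketch only (the wrapper function and the skip behavior are illustrative; the connection parameters mirror the stack trace above):

<?php
// Sketch: catch mysqli_sql_exception around the connect call and skip this
// particular path for the run instead of failing the whole randomized test.
// The function name and credentials are illustrative.
mysqli_report(MYSQLI_REPORT_ERROR | MYSQLI_REPORT_STRICT); // make mysqli throw exceptions

function runMysqliSnippet(): void
{
    try {
        $mysqli = mysqli_connect('mysql', 'test', 'test', 'test', 3306);
    } catch (mysqli_sql_exception $e) {
        // e.g. "php_network_getaddresses: getaddrinfo for mysql failed"
        error_log('Skipping mysqli snippet for this run: ' . $e->getMessage());
        return; // skip this path in that run, as suggested above
    }

    // ... run the actual mysqli queries here ...
    mysqli_close($mysqli);
}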

@PROFeNoM PROFeNoM marked this pull request as draft February 8, 2024 08:10
@PROFeNoM (Contributor, Author) commented Feb 8, 2024

Converted back to draft while I try to understand the flakiness of the randomized tests

@DataDog deleted 12 comments from github-actions bot between Feb 8 and Feb 9, 2024
@bwoebi bwoebi force-pushed the alex/ci/flakiness branch 2 times, most recently from 431aa28 to 67c1c89 Compare February 13, 2024 12:22
PROFeNoM and others added 22 commits February 14, 2024 13:10
Notes: I think avoiding commands that do I/O should prevent, or at the very least reduce, any variability associated with disk reads/writes. What's more, we're not using AOF, so a bgRewriteAOF doesn't make sense. We fundamentally don't care about the commands executed, only about the overhead associated with traceMethodNoArgs and traceMethodAsCommand.
@PROFeNoM PROFeNoM marked this pull request as ready for review February 14, 2024 12:13
@bwoebi (Collaborator) left a comment

Thanks for all the work and investigation around the individual CI runs :-)

@bwoebi bwoebi merged commit 2df1f16 into master Feb 15, 2024
580 checks passed
@bwoebi bwoebi deleted the alex/ci/flakiness branch February 15, 2024 16:20
@github-actions github-actions bot added this to the 0.98.0 milestone Feb 15, 2024