Optimise CI - reduce flakiness, unify workflows #4516

Closed
@0x009922

I run into this time and again: you open a PR, wait 10, 20, or 30 minutes for the checks, and then see the same workflows fail on the same tests. In the meantime I get distracted by something else and forget to restart the checks promptly. Sometimes it takes 5-6 restarts to get them to pass. As a result, the moment when all checks are green can be delayed by days, and in my case that delays development.

From my observation, the following workflows are flaky:

  • I2::Dev::Tests > with_coverage, integration, unstable

And these particular tests are flaky:

  • integration::extra_functional::offline_peers::genesis_block_is_committed_with_some_offline_peers
  • integration::extra_functional::unstable_network::soft_fork
  • And maybe some others; I haven't collected much data

Are these tests worth it?
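
To collect more data on which tests are actually flaky, one option is to look at the history of CI runs. Below is a minimal sketch, assuming the history is available as `(commit, test_name, passed)` records (a hypothetical format, not something our CI exports today). It counts a test as flaky on a commit when it has both passed and failed on that same commit:

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# One record per test execution: (commit hash, test name, passed?).
# This record shape is an assumption made for illustration.
Record = Tuple[str, str, bool]


def find_flaky(records: Iterable[Record]) -> Dict[str, int]:
    """Return, per test, the number of commits on which it both passed and failed."""
    outcomes = defaultdict(set)
    for commit, test, passed in records:
        outcomes[(commit, test)].add(passed)

    flaky_commits: Dict[str, int] = defaultdict(int)
    for (_commit, test), seen in outcomes.items():
        # Same code, different outcomes: the test result is non-deterministic.
        if seen == {True, False}:
            flaky_commits[test] += 1
    return dict(flaky_commits)
```

A test that fails on every run of a commit is not reported, since that looks like a real failure rather than flakiness.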

My other concern is that I don't see the rationale behind having so many workflows:

  1. I2::Dev::Static
    1. smart contracts
    2. workspace
  2. I2::Tests::UI
    1. test with all features
    2. test with no default features
  3. I2::Dev::Tests
    1. consistency
    2. with_coverage
    3. integration
    4. unstable
    5. client-cli-tests

(there are some others too)

These workflows all run Cargo and compile more or less the same code. Yes, the feature presets vary, but Cargo handles that for us: it can reuse compilation artifacts at a fine granularity depending on the context (apart from cases with different rustc flags, I suppose).

So I think it is worth trying to combine all these workflows into a single one, built so that it reports as much useful information as possible in a single run. I wonder how much more or less efficient it would be.

Another useful consequence would be faster feedback on early errors. For example, suppose a change in a PR breaks the build so that Iroha cannot even compile. Currently, all 8+ workflows run and fail on the same error. With a unified CI, there would be less repeated work.
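
To illustrate the idea, here is a rough sketch of a unified pipeline with a compile gate. The stage names and the exact Cargo invocations are assumptions chosen for illustration, not the current workflow definitions. If the gate stage fails, the remaining stages are skipped; otherwise every stage runs, so a single run reports as much as possible:

```python
import subprocess
from dataclasses import dataclass
from typing import Callable, Dict, Sequence


@dataclass
class Stage:
    name: str
    command: Sequence[str]
    gate: bool = False  # if a gate stage fails, all later stages are skipped


def run_command(command: Sequence[str]) -> int:
    """Run a command and return its exit code."""
    return subprocess.run(list(command)).returncode


def run_pipeline(
    stages: Sequence[Stage],
    execute: Callable[[Sequence[str]], int] = run_command,
) -> Dict[str, str]:
    """Run stages in order and return a status per stage."""
    results: Dict[str, str] = {}
    blocked = False
    for stage in stages:
        if blocked:
            results[stage.name] = "skipped"
            continue
        ok = execute(stage.command) == 0
        results[stage.name] = "passed" if ok else "failed"
        # A non-gate failure does not stop the run, so later stages still report.
        if stage.gate and not ok:
            blocked = True
    return results


# Hypothetical stage list; the real split of checks would need discussion.
STAGES = [
    Stage("build", ["cargo", "check", "--workspace", "--all-targets"], gate=True),
    Stage("lint", ["cargo", "clippy", "--workspace", "--all-targets"]),
    Stage("unit", ["cargo", "test", "--workspace", "--lib"]),
    Stage("integration", ["cargo", "test", "--workspace", "--tests"]),
]
```

Calling `run_pipeline(STAGES)` in a checkout would run the stages one after another in the same target directory, so later stages can reuse what earlier ones compiled. The `execute` parameter exists only so the logic can be tried without invoking Cargo.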

Proposals

  • Prioritise zero tolerance for flaky tests on the development side.
  • If flaky tests can't be fixed easily, consider moving them from PR checks to after-merge checks.
  • Create a single unified workflow and research its performance impact.
  • Explore ways to use a sane scripting language for CI instead of Shell. That may be a matter for a separate issue.
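
The second proposal could look roughly like the sketch below, written in Python as one possible scripting language. It assumes test results are available as a mapping from test name to pass/fail, and the `QUARANTINED` set and `pr_verdict` helper are hypothetical names. Failures of quarantined tests are reported but do not block the PR; they would be left to an after-merge check:

```python
from typing import Dict, List, Mapping, Set, Union

# Tests named above as flaky; treating them as quarantined is only an example.
QUARANTINED: Set[str] = {
    "integration::extra_functional::offline_peers::"
    "genesis_block_is_committed_with_some_offline_peers",
    "integration::extra_functional::unstable_network::soft_fork",
}


def pr_verdict(
    results: Mapping[str, bool],
    quarantined: Set[str] = QUARANTINED,
) -> Dict[str, Union[bool, List[str]]]:
    """Split failures into PR-blocking ones and ones deferred to after-merge checks."""
    failed = [name for name, ok in results.items() if not ok]
    blocking = sorted(name for name in failed if name not in quarantined)
    deferred = sorted(name for name in failed if name in quarantined)
    return {
        "pass": not blocking,
        "blocking_failures": blocking,
        "deferred_failures": deferred,
    }
```

Keeping the quarantine list explicit in one place makes it visible how many tests are excluded from PR checks, which fits the zero-tolerance goal: the list should shrink over time, not grow.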

Labels

  • CI
  • iroha2-dev (The re-implementation of a BFT hyperledger in RUST)
  • question (Further information is requested)
