
Polkadot SDK release automated QA #12054

@sandreim

Description


Problem

We are currently using Westend as a test environment for Polkadot SDK release candidates. In reality, however, little testing happens, and what does is fully manual. Our current release cycle mentions a QA period that covers runtime upgrades for system parachains, release candidates, and collator and node functionality.

At the same time, Westend is also used as an ad-hoc test environment for multiple projects.

Note: we do have automated testing in PR CI, but it runs at limited scale and for short durations, has no actual end-to-end coverage, and does not consider real-world scenarios (a mix of node versions) or production runtimes.

Key challenges:

  • No real testing, or at best shallow and fully manual checks ("Is the chain working?"). There is no rigorous QA in place, and RCs are pushed manually.
  • No application testing: Zero applications run on Westend during the QA cycle, making it impossible to detect regressions or new bugs introduced by infrastructure changes that affect apps.
  • No long-duration testing: Automated testing over extended periods is missing. Memory leaks, unbounded storage growth, performance degradations, or any issue that takes a while to surface are never discovered in testing.
  • Lack of matrix testing: Testing always uses the latest node version, unlike Polkadot which runs a mix of versions.
  • No negative scenario testing.
  • No backward compatibility guarantee: Testing is not conducted against production runtimes, meaning backward compatibility is not verified.
  • Fragility and instability: The environment is highly fragile, frequently experiencing high block times and finality stalls, often breaking the testing process.
  • Conflicting needs: The QA cycle can be disrupted by competing temporary needs from various development teams, or teams have to take turns at using Westend for testing.

The most critical issues requiring immediate resolution are: the lack of an automated testing pipeline, contention over the shared test environment, and the absence of robust, long-duration testing against production runtimes under close-to-real-world conditions.

Proposed solution

Create a separate test environment by forking Westend using zombie-bite. This testing will not be part of CI; instead it will be triggered by new SDK release candidates and by fellowship releases of new production runtimes. The environment scale needs to be adjustable. Multiple environments (aka ephemeral testnets) can be spawned at the same time, and multiple tests can run on the same environment at the same time.
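As an illustration only, the mapping from a release trigger to an ephemeral testnet could be sketched as below. All names (`TestnetSpec`, `plan_environment`, the version strings and scale presets) are hypothetical, not the actual pipeline or its API.

```python
from dataclasses import dataclass

# Hypothetical sketch: map a release event to an ephemeral,
# zombie-bite-forked testnet spec. Names and numbers are illustrative.

@dataclass(frozen=True)
class TestnetSpec:
    trigger: str          # "sdk-rc" or "fellowship-runtime"
    version: str          # release candidate / runtime version under test
    validators: int       # adjustable environment scale
    duration_days: int    # long-duration runs (days, up to a week)

def plan_environment(trigger: str, version: str, scale: str = "small") -> TestnetSpec:
    """Plan one ephemeral environment for a given release trigger."""
    validators = {"small": 10, "medium": 50, "large": 200}[scale]
    return TestnetSpec(trigger=trigger, version=version,
                       validators=validators, duration_days=7)

# Several environments can be spawned at the same time:
envs = [
    plan_environment("sdk-rc", "polkadot-v1.0.0-rc1", "medium"),
    plan_environment("fellowship-runtime", "runtimes-v1.0.0", "small"),
]
```

The key design point is that each trigger produces its own isolated environment, which removes the contention over a single shared Westend.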

The test suite can live in the Polkadot SDK repo or a different repository. Tests are built on top of the Zombienet SDK.


The environment (aka testnet) must mimic the real world:

  • matrix testing, i.e. multiple node versions (the release under test plus old/unsupported validators)
  • simulated latencies & faults (packet drops, low connectivity)
  • malicious actors
  • simulated users (will generate load during the tests - buy coretime, deploy/call contracts)
  • all nodes are run with sufficient resources and debug level
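A minimal sketch of the matrix-testing idea above: assign a mix of node versions across validators so the network resembles production rather than a fleet of identical nodes. The version strings and the weighting are assumptions for illustration.

```python
import itertools

# Illustrative only: version strings and the 3:1:1:1 weighting are assumptions.
release_under_test = "v1.0.0-rc1"
older_versions = ["v0.9.9", "v0.9.8"]   # older, still-supported releases
unsupported = ["v0.9.5"]                # deliberately out of support

def version_mix(n_validators: int) -> list[str]:
    """Round-robin versions so most nodes run the RC, some run old ones."""
    pool = [release_under_test] * 3 + older_versions + unsupported
    cycle = itertools.cycle(pool)
    return [next(cycle) for _ in range(n_validators)]

mix = version_mix(12)
```

In a real run, each entry in `mix` would become the node image/binary version passed to the environment spawner for that validator.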

Coverage:

  • basic testing of block production, block confidence and finality
  • RC consensus protocols (in particular availability and disputes)
  • system parachain testing (example: coretime works, bulletin storage works, basic contract testing)
  • e2e application testing

Duration: The goal is to run these tests continuously over a longer period of time (days, up to a week).

Monitoring & reporting:

  • The usual stack: Grafana/Loki
  • We'll use the Zombienet SDK to assert on metrics and determine test pass/fail
  • High level testing results are aggregated into a single ticket (per run)
  • Individual failures are logged as separate tickets and linked to the run they occurred in. Each ticket provides links to artifacts, logs and metrics that indicate the failure.
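The reporting flow above could be sketched as follows. The data shapes (`passed`, `logs_url`, the ticket dictionaries) are assumptions for illustration, not the actual Zombienet SDK reporting format.

```python
# Sketch: aggregate per-test results into one summary "ticket" per run,
# with individual failures broken out and linked back to the run.

def aggregate(run_id: str, results: list[dict]) -> dict:
    failures = [r for r in results if not r["passed"]]
    summary = {
        "run": run_id,
        "total": len(results),
        "failed": len(failures),
        "status": "pass" if not failures else "fail",
    }
    failure_tickets = [
        {"run": run_id, "test": r["name"], "artifacts": r.get("logs_url")}
        for r in failures
    ]
    return {"summary": summary, "failure_tickets": failure_tickets}

report = aggregate("run-42", [
    {"name": "block_production", "passed": True},
    {"name": "finality", "passed": False,
     "logs_url": "https://example.org/logs/finality"},
])
```

Keeping one summary per run plus one ticket per failure keeps the high-level signal readable while preserving a direct path from each failure to its artifacts.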

MVP

The first milestone is an MVP pipeline focused on testing block production, block confidence and finality for SDK releases only.

  • zombie-bite change to allow forking Westend
  • support longer runs: at least 1 day
  • Create initial test suite (adapt existing CI zombienet testing)
  • Zombienet SDK support for test results reporting
  • Integrate with SDK release
  • One SDK RC consistently passes all tests (demonstrating the pipeline is robust and not flaky)
