Some benchmarks are very noisy #332

Open
sjakobi opened this issue Nov 29, 2021 · 8 comments

@sjakobi
Member

sjakobi commented Nov 29, 2021

Here's a sequence of benchmark runs on the same code (bd165b0) using tasty-bench's --fail-faster and --fail-slower flags to highlight differing results:

$ cabal bench --benchmark-options "--stdev=1 --timeout=10 --csv=bench-0.csv"
<snip>
$ cabal bench --benchmark-options "--stdev=1 --timeout=10 --csv=bench-1.csv --baseline=bench-0.csv --fail-if-slower=5 --fail-if-faster=5 --hide-successes"
All
  Map
    insert
      String:           FAIL (0.95s)
        1.63 ms ±  25 μs,  5% faster than baseline
        Use -p '/All.Map.insert.String/' to rerun this test only.
      ByteStringString: FAIL (0.86s)
        1.45 ms ±  23 μs,  5% faster than baseline
        Use -p '/All.Map.insert.ByteStringString/' to rerun this test only.
    fromList
      ByteString:       FAIL (0.86s)
        1.46 ms ±  28 μs,  5% faster than baseline
        Use -p '/All.Map.fromList.ByteString/' to rerun this test only.
  hashmap/Map
    delete-miss
      ByteString:       FAIL (0.78s)
        650  μs ±  11 μs, 10% faster than baseline
        Use -p '/hashmap\/Map.delete-miss.ByteString/' to rerun this test only.
  IntMap
    lookup-miss:        FAIL (1.54s)
      338  μs ± 2.7 μs,  5% slower than baseline
      Use -p '/IntMap.lookup-miss/' to rerun this test only.
    delete-miss:        FAIL (1.34s)
      582  μs ± 6.1 μs,  9% faster than baseline
      Use -p '/IntMap.delete-miss/' to rerun this test only.
  HashMap
    lookup-miss
      ByteString:       FAIL (1.09s)
        112  μs ± 1.6 μs,  5% slower than baseline
        Use -p '/HashMap.lookup-miss.ByteString/' to rerun this test only.
      Int:              FAIL (1.01s)
        102  μs ± 1.6 μs,  9% slower than baseline
        Use -p '/lookup-miss.Int/' to rerun this test only.
    insert
      ByteString:       FAIL (1.21s)
        523  μs ± 6.3 μs, 19% faster than baseline
        Use -p '/HashMap.insert.ByteString/' to rerun this test only.
      Int:              FAIL (1.11s)
        469  μs ± 5.6 μs, 13% faster than baseline
        Use -p '/insert.Int/' to rerun this test only.
    insert-dup
      Int:              FAIL (0.96s)
        398  μs ± 5.7 μs, 14% faster than baseline
        Use -p '/insert-dup.Int/' to rerun this test only.
    delete
      String:           FAIL (0.90s)
        754  μs ±  11 μs, 12% faster than baseline
        Use -p '/HashMap.delete.String/' to rerun this test only.
    delete-miss
      String:           FAIL (0.97s)
        205  μs ± 3.0 μs,  5% faster than baseline
        Use -p '/HashMap.delete-miss.String/' to rerun this test only.
      ByteString:       FAIL (0.77s)
        149  μs ± 2.6 μs,  7% slower than baseline
        Use -p '/HashMap.delete-miss.ByteString/' to rerun this test only.
      Int:              FAIL (1.33s)
        289  μs ± 2.8 μs,  5% slower than baseline
        Use -p '/delete-miss.Int/' to rerun this test only.
    alterInsert
      ByteString:       FAIL (1.31s)
        580  μs ± 7.2 μs, 18% faster than baseline
        Use -p '/alterInsert.ByteString/' to rerun this test only.
      Int:              FAIL (1.19s)
        505  μs ± 5.9 μs, 21% faster than baseline
        Use -p '/alterInsert.Int/' to rerun this test only.
    alterFInsert
      String:           FAIL (4.86s)
        570  μs ± 1.5 μs, 15% faster than baseline
        Use -p '/alterFInsert.String/' to rerun this test only.
      ByteString:       FAIL (1.21s)
        518  μs ± 5.8 μs, 20% faster than baseline
        Use -p '/alterFInsert.ByteString/' to rerun this test only.
      Int:              FAIL (1.10s)
        465  μs ± 7.9 μs, 22% faster than baseline
        Use -p '/alterFInsert.Int/' to rerun this test only.
    alterFInsert-dup
      Int:              FAIL (0.94s)
        387  μs ± 5.8 μs, 15% faster than baseline
        Use -p '/alterFInsert-dup.Int/' to rerun this test only.
    alterFDelete-miss
      String:           FAIL (0.96s)
        203  μs ± 2.9 μs,  5% faster than baseline
        Use -p '/alterFDelete-miss.String/' to rerun this test only.
      ByteString:       FAIL (0.76s)
        148  μs ± 2.8 μs,  6% slower than baseline
        Use -p '/alterFDelete-miss.ByteString/' to rerun this test only.
    fromListWith
      long
        String:         FAIL (0.92s)
          387  μs ± 7.0 μs,  6% faster than baseline
          Use -p '/fromListWith.long.String/' to rerun this test only.

24 out of 118 tests failed (184.35s)

$ cabal bench --benchmark-options "--stdev=1 --timeout=10 --csv=bench-2.csv --baseline=bench-1.csv --fail-if-slower=5 --fail-if-faster=5 --hide-successes"
All
  hashmap/Map
    delete
      ByteString:       FAIL (0.82s)
        677  μs ±  11 μs,  8% faster than baseline
        Use -p '/hashmap\/Map.delete.ByteString/' to rerun this test only.
  IntMap
    delete:             FAIL (1.16s)
      495  μs ± 5.5 μs,  9% faster than baseline
      Use -p '$0=="All.IntMap.delete"' to rerun this test only.
  HashMap
    delete
      ByteString:       FAIL (1.38s)
        610  μs ± 5.5 μs, 15% faster than baseline
        Use -p '/HashMap.delete.ByteString/' to rerun this test only.
      Int:              FAIL (1.06s)
        444  μs ± 5.7 μs, 12% faster than baseline
        Use -p '/delete.Int/' to rerun this test only.
    delete-miss
      Int:              FAIL (2.30s)
        260  μs ± 1.4 μs, 10% faster than baseline
        Use -p '/delete-miss.Int/' to rerun this test only.
    alterInsert-dup
      Int:              FAIL (1.02s)
        429  μs ± 5.3 μs, 13% faster than baseline
        Use -p '/alterInsert-dup.Int/' to rerun this test only.
    alterDelete
      String:           FAIL (0.91s)
        760  μs ±  11 μs, 11% faster than baseline
        Use -p '/alterDelete.String/' to rerun this test only.
      ByteString:       FAIL (0.79s)
        637  μs ±  12 μs, 13% faster than baseline
        Use -p '/alterDelete.ByteString/' to rerun this test only.
      Int:              FAIL (1.08s)
        453  μs ± 6.0 μs, 11% faster than baseline
        Use -p '/alterDelete.Int/' to rerun this test only.
    alterFDelete
      String:           FAIL (0.88s)
        745  μs ±  11 μs, 12% faster than baseline
        Use -p '/alterFDelete.String/' to rerun this test only.
      ByteString:       FAIL (1.40s)
        609  μs ± 5.4 μs, 15% faster than baseline
        Use -p '/alterFDelete.ByteString/' to rerun this test only.
      Int:              FAIL (1.04s)
        446  μs ± 6.1 μs, 10% faster than baseline
        Use -p '/alterFDelete.Int/' to rerun this test only.
    alterDelete-miss
      Int:              FAIL (1.25s)
        269  μs ± 2.8 μs,  8% faster than baseline
        Use -p '/alterDelete-miss.Int/' to rerun this test only.
    alterFDelete-miss
      Int:              FAIL (1.23s)
        260  μs ± 3.0 μs,  7% faster than baseline
        Use -p '/alterFDelete-miss.Int/' to rerun this test only.

14 out of 118 tests failed (137.69s)

Benchmarks for containers and hashmap were included by uncommenting this line:

-- cpp-options: -DBENCH_containers_Map -DBENCH_containers_IntMap -DBENCH_hashmap_Map

I did try to make my machine fairly quiet for these runs. I don't know why these benchmarks are still so noisy, but I note that most of them are on the slower end of our benchmark suite.

It also seems noteworthy how rarely the containers and hashmap benchmarks show up among the failures; the gap appears larger than their smaller share of the suite would explain.

Maybe implementing #293 would help?!

@doyougnu
Contributor

doyougnu commented Dec 22, 2021

I started having a go at #17 and got distracted by this. I don't have conclusive evidence, but I suspect that a lot of this noise is due to stalled backend cycles in the CPU. This is observable with perf:

Here I run just the lookup benchmarks:

[nix-shell:~/programming/haskell/unordered-containers]$ perf stat -r 5 ./dist-newstyle/build/x86_64-linux/ghc-8.10.7/unordered-containers-0.2.16.0/b/benchmarks/build/benchmarks/benchmarks -m prefix "HashMap/lookup"
benchmarked HashMap/lookup/String
time                 1.585 ms   (1.579 ms .. 1.596 ms)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 1.608 ms   (1.604 ms .. 1.612 ms)
std dev              10.69 μs   (8.154 μs .. 15.22 μs)

...

benchmarked HashMap/lookup-miss/Int
time                 268.4 μs   (262.2 μs .. 275.5 μs)
                     0.998 R²   (0.994 R² .. 1.000 R²)
mean                 261.8 μs   (260.6 μs .. 262.9 μs)
std dev              1.956 μs   (1.459 μs .. 2.744 μs)


 Performance counter stats for './dist-newstyle/build/x86_64-linux/ghc-8.10.7/unordered-containers-0.2.16.0/b/benchmarks/build/benchmarks/benchmarks -m prefix HashMap/lookup' (5 runs):

         32,708.84 msec task-clock:u              #    1.000 CPUs utilized            ( +-  0.27% )
                 0      context-switches:u        #    0.000 K/sec
                 0      cpu-migrations:u          #    0.000 K/sec
             8,390      page-faults:u             #    0.257 K/sec                    ( +-  0.01% )
   129,934,644,209      cycles:u                  #    3.972 GHz                      ( +-  0.26% )
     3,223,693,319      stalled-cycles-frontend:u #    2.48% frontend cycles idle     ( +-  4.37% )
    42,343,991,912      stalled-cycles-backend:u  #   32.59% backend cycles idle      ( +-  0.35% )
   207,466,788,278      instructions:u            #    1.60  insn per cycle
                                                  #    0.20  stalled cycles per insn  ( +-  0.58% )
    36,443,482,339      branches:u                # 1114.178 M/sec                    ( +-  0.57% )
       347,372,412      branch-misses:u           #    0.95% of all branches          ( +-  0.44% )

           32.7236 +- 0.0878 seconds time elapsed  ( +-  0.27% )

32% is quite a lot of stalled cycles. Typically this means that my CPU (an AMD Ryzen 2700X) spends a lot of time waiting for memory to be loaded into its caches.

So I wonder whether some prefetchValue3#s from GHC.Exts would reduce the noise and improve performance. For example, I think that in lookupInArrayCont we could try to prefetch several array indices at a time. Deciding on the right number will require some investigation, but if it works we'd be saving billions of stalled CPU cycles.
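
Something along these lines might work. This is only a minimal sketch, not unordered-containers' actual code: it uses Data.Primitive.SmallArray (from the primitive package) in place of the package-internal Array type, a hypothetical Leaf stand-in for the real leaf type, and IO just to get a state token for the primop.

{-# LANGUAGE BangPatterns #-}
{-# LANGUAGE MagicHash #-}
{-# LANGUAGE UnboxedTuples #-}

module PrefetchSketch (lookupWithPrefetch) where

import Data.Primitive.SmallArray (SmallArray, indexSmallArray, sizeofSmallArray)
import GHC.Exts (prefetchValue3#)
import GHC.IO (IO (IO))

-- Hypothetical stand-in for the HAMT leaf type.
data Leaf k v = Leaf !k v

-- Ask the CPU to start pulling each leaf closure into cache.
prefetchLeaves :: SmallArray (Leaf k v) -> IO ()
prefetchLeaves ary = go 0
  where
    n = sizeofSmallArray ary
    go i
      | i >= n    = pure ()
      | otherwise = do
          -- Force the read so we prefetch the stored leaf, not an indexing thunk.
          let !x = indexSmallArray ary i
          -- prefetchValue3# :: a -> State# s -> State# s
          IO (\s -> (# prefetchValue3# x s, () #))
          go (i + 1)

-- Issue the prefetches, then do the usual linear scan while the
-- requested cache lines are (hopefully) arriving.
lookupWithPrefetch :: Eq k => k -> SmallArray (Leaf k v) -> IO (Maybe v)
lookupWithPrefetch k ary = do
  prefetchLeaves ary
  return $! go 0
  where
    n = sizeofSmallArray ary
    go i
      | i >= n    = Nothing
      | Leaf k' v <- indexSmallArray ary i, k' == k = Just v
      | otherwise = go (i + 1)

Whether prefetching the whole collision array (rather than a few slots ahead) actually pays off would need measuring; the backend-stall percentage reported by perf before and after should tell.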

The same signal exists for inserts:

[nix-shell:~/programming/haskell/unordered-containers]$ perf stat -r 5 ./dist-newstyle/build/x86_64-linux/ghc-8.10.7/unordered-containers-0.2.16.0/b/benchmarks/build/benchmarks/benchmarks -m prefix "HashMap/insert"
benchmarked HashMap/insert/String
time                 4.627 ms   (4.435 ms .. 4.808 ms)
                     0.994 R²   (0.989 R² .. 0.998 R²)
mean                 4.388 ms   (4.351 ms .. 4.457 ms)
std dev              131.1 μs   (88.37 μs .. 210.5 μs)

...

benchmarked HashMap/insert-dup/Int
time                 1.691 ms   (1.670 ms .. 1.710 ms)
                     0.998 R²   (0.992 R² .. 1.000 R²)
mean                 1.687 ms   (1.679 ms .. 1.714 ms)
std dev              35.49 μs   (11.40 μs .. 70.98 μs)


 Performance counter stats for './dist-newstyle/build/x86_64-linux/ghc-8.10.7/unordered-containers-0.2.16.0/b/benchmarks/build/benchmarks/benchmarks -m prefix HashMap/insert' (5 runs):

         31,887.49 msec task-clock:u              #    0.999 CPUs utilized            ( +-  0.12% )
                 0      context-switches:u        #    0.000 K/sec
                 0      cpu-migrations:u          #    0.000 K/sec
            11,974      page-faults:u             #    0.376 K/sec                    ( +-  0.00% )
   126,683,235,387      cycles:u                  #    3.973 GHz                      ( +-  0.12% )
     3,373,656,912      stalled-cycles-frontend:u #    2.66% frontend cycles idle     ( +-  5.53% )
    40,100,948,950      stalled-cycles-backend:u  #   31.65% backend cycles idle      ( +-  0.30% )
   195,121,617,598      instructions:u            #    1.54  insn per cycle
                                                  #    0.21  stalled cycles per insn  ( +-  0.38% )
    34,366,487,498      branches:u                # 1077.742 M/sec                    ( +-  0.37% )
       353,643,833      branch-misses:u           #    1.03% of all branches          ( +-  0.24% )

           31.9046 +- 0.0349 seconds time elapsed  ( +-  0.11% )

@treeowl
Collaborator

treeowl commented Dec 22, 2021

@sjakobi yes, for accessibility purposes, please use text rather than images of text. Images should only be used if they add substantial value (e.g., diagrams or graphs).

@sjakobi
Member Author

sjakobi commented Dec 23, 2021

@treeowl Alright, I've updated the issue description. Luckily, tasty-bench makes it pretty easy to get text-based output for the differences between benchmark runs.

@treeowl
Collaborator

treeowl commented Dec 23, 2021

Thanks!

@sjakobi
Member Author

sjakobi commented Dec 23, 2021

I have updated the issue description again to include benchmark results from containers and hashmap.

@sjakobi
Member Author

sjakobi commented Dec 23, 2021

@doyougnu That's a very interesting lead! An underlying array-related problem would explain why the benchmarks for containers and hashmap (which is based on IntMap) are less noisy than those for u-c.

It would be great if you could take a stab at fixing this! I've got to admit that this is the first time I've heard of prefetchValue3#, so this is quite a bit outside my usual comfort zone.

@sjakobi
Member Author

sjakobi commented Dec 23, 2021

For reference:
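These primops live in GHC.Prim and are re-exported from GHC.Exts; their signatures are roughly as follows (the numeric suffix is the temporal-locality hint, with 3 being the strongest request to keep the data in cache):

prefetchValue0# .. prefetchValue3#                        :: a -> State# s -> State# s
prefetchByteArray0# .. prefetchByteArray3#                :: ByteArray# -> Int# -> State# s -> State# s
prefetchMutableByteArray0# .. prefetchMutableByteArray3#  :: MutableByteArray# s -> Int# -> State# s -> State# s
prefetchAddr0# .. prefetchAddr3#                          :: Addr# -> Int# -> State# s -> State# s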


EDIT: I've requested better haddocks for these operations in https://gitlab.haskell.org/ghc/ghc/-/issues/20856.

@doyougnu
Contributor

@sjakobi I've got an MR up for 10809 now: https://gitlab.haskell.org/ghc/ghc/-/merge_requests/7245
