Some benchmarks are very noisy #332

Open
sjakobi opened this issue Nov 29, 2021 · 8 comments

@sjakobi
Member

sjakobi commented Nov 29, 2021

Here's a sequence of benchmark runs on the same code (bd165b0) using tasty-bench's --fail-faster and --fail-slower flags to highlight differing results:

$ cabal bench --benchmark-options "--stdev=1 --timeout=10 --csv=bench-0.csv"
<snip>
$ cabal bench --benchmark-options "--stdev=1 --timeout=10 --csv=bench-1.csv --baseline=bench-0.csv --fail-if-slower=5 --fail-if-faster=5 --hide-successes"
All
  Map
    insert
      String:           FAIL (0.95s)
        1.63 ms ±  25 μs,  5% faster than baseline
        Use -p '/All.Map.insert.String/' to rerun this test only.
      ByteStringString: FAIL (0.86s)
        1.45 ms ±  23 μs,  5% faster than baseline
        Use -p '/All.Map.insert.ByteStringString/' to rerun this test only.
    fromList
      ByteString:       FAIL (0.86s)
        1.46 ms ±  28 μs,  5% faster than baseline
        Use -p '/All.Map.fromList.ByteString/' to rerun this test only.
  hashmap/Map
    delete-miss
      ByteString:       FAIL (0.78s)
        650  μs ±  11 μs, 10% faster than baseline
        Use -p '/hashmap\/Map.delete-miss.ByteString/' to rerun this test only.
  IntMap
    lookup-miss:        FAIL (1.54s)
      338  μs ± 2.7 μs,  5% slower than baseline
      Use -p '/IntMap.lookup-miss/' to rerun this test only.
    delete-miss:        FAIL (1.34s)
      582  μs ± 6.1 μs,  9% faster than baseline
      Use -p '/IntMap.delete-miss/' to rerun this test only.
  HashMap
    lookup-miss
      ByteString:       FAIL (1.09s)
        112  μs ± 1.6 μs,  5% slower than baseline
        Use -p '/HashMap.lookup-miss.ByteString/' to rerun this test only.
      Int:              FAIL (1.01s)
        102  μs ± 1.6 μs,  9% slower than baseline
        Use -p '/lookup-miss.Int/' to rerun this test only.
    insert
      ByteString:       FAIL (1.21s)
        523  μs ± 6.3 μs, 19% faster than baseline
        Use -p '/HashMap.insert.ByteString/' to rerun this test only.
      Int:              FAIL (1.11s)
        469  μs ± 5.6 μs, 13% faster than baseline
        Use -p '/insert.Int/' to rerun this test only.
    insert-dup
      Int:              FAIL (0.96s)
        398  μs ± 5.7 μs, 14% faster than baseline
        Use -p '/insert-dup.Int/' to rerun this test only.
    delete
      String:           FAIL (0.90s)
        754  μs ±  11 μs, 12% faster than baseline
        Use -p '/HashMap.delete.String/' to rerun this test only.
    delete-miss
      String:           FAIL (0.97s)
        205  μs ± 3.0 μs,  5% faster than baseline
        Use -p '/HashMap.delete-miss.String/' to rerun this test only.
      ByteString:       FAIL (0.77s)
        149  μs ± 2.6 μs,  7% slower than baseline
        Use -p '/HashMap.delete-miss.ByteString/' to rerun this test only.
      Int:              FAIL (1.33s)
        289  μs ± 2.8 μs,  5% slower than baseline
        Use -p '/delete-miss.Int/' to rerun this test only.
    alterInsert
      ByteString:       FAIL (1.31s)
        580  μs ± 7.2 μs, 18% faster than baseline
        Use -p '/alterInsert.ByteString/' to rerun this test only.
      Int:              FAIL (1.19s)
        505  μs ± 5.9 μs, 21% faster than baseline
        Use -p '/alterInsert.Int/' to rerun this test only.
    alterFInsert
      String:           FAIL (4.86s)
        570  μs ± 1.5 μs, 15% faster than baseline
        Use -p '/alterFInsert.String/' to rerun this test only.
      ByteString:       FAIL (1.21s)
        518  μs ± 5.8 μs, 20% faster than baseline
        Use -p '/alterFInsert.ByteString/' to rerun this test only.
      Int:              FAIL (1.10s)
        465  μs ± 7.9 μs, 22% faster than baseline
        Use -p '/alterFInsert.Int/' to rerun this test only.
    alterFInsert-dup
      Int:              FAIL (0.94s)
        387  μs ± 5.8 μs, 15% faster than baseline
        Use -p '/alterFInsert-dup.Int/' to rerun this test only.
    alterFDelete-miss
      String:           FAIL (0.96s)
        203  μs ± 2.9 μs,  5% faster than baseline
        Use -p '/alterFDelete-miss.String/' to rerun this test only.
      ByteString:       FAIL (0.76s)
        148  μs ± 2.8 μs,  6% slower than baseline
        Use -p '/alterFDelete-miss.ByteString/' to rerun this test only.
    fromListWith
      long
        String:         FAIL (0.92s)
          387  μs ± 7.0 μs,  6% faster than baseline
          Use -p '/fromListWith.long.String/' to rerun this test only.

24 out of 118 tests failed (184.35s)

$ cabal bench --benchmark-options "--stdev=1 --timeout=10 --csv=bench-2.csv --baseline=bench-1.csv --fail-if-slower=5 --fail-if-faster=5 --hide-successes"
All
  hashmap/Map
    delete
      ByteString:       FAIL (0.82s)
        677  μs ±  11 μs,  8% faster than baseline
        Use -p '/hashmap\/Map.delete.ByteString/' to rerun this test only.
  IntMap
    delete:             FAIL (1.16s)
      495  μs ± 5.5 μs,  9% faster than baseline
      Use -p '$0=="All.IntMap.delete"' to rerun this test only.
  HashMap
    delete
      ByteString:       FAIL (1.38s)
        610  μs ± 5.5 μs, 15% faster than baseline
        Use -p '/HashMap.delete.ByteString/' to rerun this test only.
      Int:              FAIL (1.06s)
        444  μs ± 5.7 μs, 12% faster than baseline
        Use -p '/delete.Int/' to rerun this test only.
    delete-miss
      Int:              FAIL (2.30s)
        260  μs ± 1.4 μs, 10% faster than baseline
        Use -p '/delete-miss.Int/' to rerun this test only.
    alterInsert-dup
      Int:              FAIL (1.02s)
        429  μs ± 5.3 μs, 13% faster than baseline
        Use -p '/alterInsert-dup.Int/' to rerun this test only.
    alterDelete
      String:           FAIL (0.91s)
        760  μs ±  11 μs, 11% faster than baseline
        Use -p '/alterDelete.String/' to rerun this test only.
      ByteString:       FAIL (0.79s)
        637  μs ±  12 μs, 13% faster than baseline
        Use -p '/alterDelete.ByteString/' to rerun this test only.
      Int:              FAIL (1.08s)
        453  μs ± 6.0 μs, 11% faster than baseline
        Use -p '/alterDelete.Int/' to rerun this test only.
    alterFDelete
      String:           FAIL (0.88s)
        745  μs ±  11 μs, 12% faster than baseline
        Use -p '/alterFDelete.String/' to rerun this test only.
      ByteString:       FAIL (1.40s)
        609  μs ± 5.4 μs, 15% faster than baseline
        Use -p '/alterFDelete.ByteString/' to rerun this test only.
      Int:              FAIL (1.04s)
        446  μs ± 6.1 μs, 10% faster than baseline
        Use -p '/alterFDelete.Int/' to rerun this test only.
    alterDelete-miss
      Int:              FAIL (1.25s)
        269  μs ± 2.8 μs,  8% faster than baseline
        Use -p '/alterDelete-miss.Int/' to rerun this test only.
    alterFDelete-miss
      Int:              FAIL (1.23s)
        260  μs ± 3.0 μs,  7% faster than baseline
        Use -p '/alterFDelete-miss.Int/' to rerun this test only.

14 out of 118 tests failed (137.69s)

Benchmarks for containers and hashmap were included by uncommenting this line:

-- cpp-options: -DBENCH_containers_Map -DBENCH_containers_IntMap -DBENCH_hashmap_Map

I did try to make my machine fairly quiet for these runs. I don't know why these benchmarks are still so noisy, but I note that most of them are on the slower end of our benchmark suite.

It also seems noteworthy how rarely the containers and hashmap benchmarks show up among the failures; the gap appears larger than their smaller share of the suite would explain.

Maybe implementing #293 would help?!

@doyougnu
Contributor

doyougnu commented Dec 22, 2021

I started having a go at #17 and got distracted by this. I don't have conclusive evidence, but I suspect that a lot of this noise is due to stalled backend cycles in the CPU. This is observable with perf:

Here I run just the lookup benchmarks:

[nix-shell:~/programming/haskell/unordered-containers]$ perf stat -r 5 ./dist-newstyle/build/x86_64-linux/ghc-8.10.7/unordered-containers-0.2.16.0/b/benchmarks/build/benchmarks/benchmarks -m prefix "HashMap/lookup"
benchmarked HashMap/lookup/String
time                 1.585 ms   (1.579 ms .. 1.596 ms)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 1.608 ms   (1.604 ms .. 1.612 ms)
std dev              10.69 μs   (8.154 μs .. 15.22 μs)

...

benchmarked HashMap/lookup-miss/Int
time                 268.4 μs   (262.2 μs .. 275.5 μs)
                     0.998 R²   (0.994 R² .. 1.000 R²)
mean                 261.8 μs   (260.6 μs .. 262.9 μs)
std dev              1.956 μs   (1.459 μs .. 2.744 μs)


 Performance counter stats for './dist-newstyle/build/x86_64-linux/ghc-8.10.7/unordered-containers-0.2.16.0/b/benchmarks/build/benchmarks/benchmarks -m prefix HashMap/lookup' (5 runs):

         32,708.84 msec task-clock:u              #    1.000 CPUs utilized            ( +-  0.27% )
                 0      context-switches:u        #    0.000 K/sec
                 0      cpu-migrations:u          #    0.000 K/sec
             8,390      page-faults:u             #    0.257 K/sec                    ( +-  0.01% )
   129,934,644,209      cycles:u                  #    3.972 GHz                      ( +-  0.26% )
     3,223,693,319      stalled-cycles-frontend:u #    2.48% frontend cycles idle     ( +-  4.37% )
    42,343,991,912      stalled-cycles-backend:u  #   32.59% backend cycles idle      ( +-  0.35% )
   207,466,788,278      instructions:u            #    1.60  insn per cycle
                                                  #    0.20  stalled cycles per insn  ( +-  0.58% )
    36,443,482,339      branches:u                # 1114.178 M/sec                    ( +-  0.57% )
       347,372,412      branch-misses:u           #    0.95% of all branches          ( +-  0.44% )

           32.7236 +- 0.0878 seconds time elapsed  ( +-  0.27% )

32% is quite a lot of stalled cycles. Typically this means that my CPU (an AMD Ryzen 2700X) spends a lot of time waiting for memory to be loaded into its caches.

So I wonder whether some prefetchValue3#s from GHC.Exts would reduce the noise and improve performance. For example, I think that in lookupInArrayCont we could try to prefetch several array indices at a time. Deciding on the right number will require some investigation, but if it works we'd be saving billions of stalled CPU cycles.
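
Something along these lines might work. This is only a minimal sketch, not unordered-containers' actual code: it uses Data.Primitive.SmallArray (from the primitive package) in place of the package-internal Array type, a hypothetical Leaf stand-in for the real leaf type, and IO just to get a state token for the primop.

{-# LANGUAGE BangPatterns #-}
{-# LANGUAGE MagicHash #-}
{-# LANGUAGE UnboxedTuples #-}

module PrefetchSketch (lookupWithPrefetch) where

import Data.Primitive.SmallArray (SmallArray, indexSmallArray, sizeofSmallArray)
import GHC.Exts (prefetchValue3#)
import GHC.IO (IO (IO))

-- Hypothetical stand-in for the HAMT leaf type.
data Leaf k v = Leaf !k v

-- Ask the CPU to start pulling each leaf closure into cache.
prefetchLeaves :: SmallArray (Leaf k v) -> IO ()
prefetchLeaves ary = go 0
  where
    n = sizeofSmallArray ary
    go i
      | i >= n    = pure ()
      | otherwise = do
          -- Force the read so we prefetch the stored leaf, not an indexing thunk.
          let !x = indexSmallArray ary i
          -- prefetchValue3# :: a -> State# s -> State# s
          IO (\s -> (# prefetchValue3# x s, () #))
          go (i + 1)

-- Issue the prefetches, then do the usual linear scan while the
-- requested cache lines are (hopefully) arriving.
lookupWithPrefetch :: Eq k => k -> SmallArray (Leaf k v) -> IO (Maybe v)
lookupWithPrefetch k ary = do
  prefetchLeaves ary
  return $! go 0
  where
    n = sizeofSmallArray ary
    go i
      | i >= n    = Nothing
      | Leaf k' v <- indexSmallArray ary i, k' == k = Just v
      | otherwise = go (i + 1)

Whether prefetching the whole collision array (rather than a few slots ahead) actually pays off would need measuring; the backend-stall percentage reported by perf before and after should tell.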

The same signal exists for inserts:

[nix-shell:~/programming/haskell/unordered-containers]$ perf stat -r 5 ./dist-newstyle/build/x86_64-linux/ghc-8.10.7/unordered-containers-0.2.16.0/b/benchmarks/build/benchmarks/benchmarks -m prefix "HashMap/insert"
benchmarked HashMap/insert/String
time                 4.627 ms   (4.435 ms .. 4.808 ms)
                     0.994 R²   (0.989 R² .. 0.998 R²)
mean                 4.388 ms   (4.351 ms .. 4.457 ms)
std dev              131.1 μs   (88.37 μs .. 210.5 μs)

...

benchmarked HashMap/insert-dup/Int
time                 1.691 ms   (1.670 ms .. 1.710 ms)
                     0.998 R²   (0.992 R² .. 1.000 R²)
mean                 1.687 ms   (1.679 ms .. 1.714 ms)
std dev              35.49 μs   (11.40 μs .. 70.98 μs)


 Performance counter stats for './dist-newstyle/build/x86_64-linux/ghc-8.10.7/unordered-containers-0.2.16.0/b/benchmarks/build/benchmarks/benchmarks -m prefix HashMap/insert' (5 runs):

         31,887.49 msec task-clock:u              #    0.999 CPUs utilized            ( +-  0.12% )
                 0      context-switches:u        #    0.000 K/sec
                 0      cpu-migrations:u          #    0.000 K/sec
            11,974      page-faults:u             #    0.376 K/sec                    ( +-  0.00% )
   126,683,235,387      cycles:u                  #    3.973 GHz                      ( +-  0.12% )
     3,373,656,912      stalled-cycles-frontend:u #    2.66% frontend cycles idle     ( +-  5.53% )
    40,100,948,950      stalled-cycles-backend:u  #   31.65% backend cycles idle      ( +-  0.30% )
   195,121,617,598      instructions:u            #    1.54  insn per cycle
                                                  #    0.21  stalled cycles per insn  ( +-  0.38% )
    34,366,487,498      branches:u                # 1077.742 M/sec                    ( +-  0.37% )
       353,643,833      branch-misses:u           #    1.03% of all branches          ( +-  0.24% )

           31.9046 +- 0.0349 seconds time elapsed  ( +-  0.11% )

@treeowl
Collaborator

treeowl commented Dec 22, 2021

@sjakobi yes, for accessibility purposes, please use text rather than images of text. Images should only be used if they add substantial value (e.g., diagrams or graphs).

@sjakobi
Member Author

sjakobi commented Dec 23, 2021

@treeowl Alright, I've updated the issue description. Luckily, tasty-bench makes it pretty easy to get text-based output for the differences between benchmark runs.

@treeowl
Collaborator

treeowl commented Dec 23, 2021

Thanks!

@sjakobi
Member Author

sjakobi commented Dec 23, 2021

I have updated the issue description again to include benchmark results from containers and hashmap.

@sjakobi
Member Author

sjakobi commented Dec 23, 2021

@doyougnu That's a very interesting lead! An underlying array-related problem would explain why the benchmarks for containers and hashmap (which is based on IntMap) are less noisy than those for u-c.

It would be great if you could take a stab at fixing this! I've got to admit that this is the first time I've heard of prefetchValue3#, so this is quite a bit outside my usual comfort zone.

@sjakobi
Member Author

sjakobi commented Dec 23, 2021

For reference:
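These primops live in GHC.Prim and are re-exported from GHC.Exts; their signatures are roughly as follows (the numeric suffix is the temporal-locality hint, with 3 being the strongest request to keep the data in cache):

prefetchValue0# .. prefetchValue3#                        :: a -> State# s -> State# s
prefetchByteArray0# .. prefetchByteArray3#                :: ByteArray# -> Int# -> State# s -> State# s
prefetchMutableByteArray0# .. prefetchMutableByteArray3#  :: MutableByteArray# s -> Int# -> State# s -> State# s
prefetchAddr0# .. prefetchAddr3#                          :: Addr# -> Int# -> State# s -> State# s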


EDIT: I've requested better haddocks for these operations in https://gitlab.haskell.org/ghc/ghc/-/issues/20856.

@doyougnu
Contributor

@sjakobi I've got an MR up for 10809 now: https://gitlab.haskell.org/ghc/ghc/-/merge_requests/7245
