Simplified implementations of Array.foreach and Array.map improve performance by 2–3.5× #23649
-
That's certainly interesting. The source code of the standard library (also the one used in Scala 3, for now) is over at https://github.com/scala/scala (Scala 2), and there's some existing infra for writing benchmarks (https://github.com/scala/scala/blob/2.13.x/test/benchmarks/README.md). It seems to me the old code is trying to prevent primitive boxing / unboxing; I'd be interested to check in detail (in bytecode) whether that isn't actually helpful. It's possible that the situation is different on Scala 2, because Scala 3 doesn't have function specialization (right?).
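For context, here is a rough sketch of the kind of type-dispatching loop the old code uses. This is an editor's illustration of the assumed shape, not the actual `ArrayOps` source: each branch performs a direct primitive array load instead of a generic access, although the element is still boxed when passed to a non-specialized function, as the decompiled bytecode later in this thread shows.

```scala
import scala.reflect.ClassTag

// Simplified sketch (not the actual ArrayOps.map source): dispatch on the
// runtime class of the array so each branch reads elements with a direct
// primitive array load. The value is still boxed when handed to a
// non-specialized Function1.
def mapByDispatch[A, B: ClassTag](xs: Array[A])(f: A => B): Array[B] = {
  val len = xs.length
  val ys  = new Array[B](len)
  var i   = 0
  (xs: Any) match {
    case ints: Array[Int] =>
      while (i < len) { ys(i) = f(ints(i).asInstanceOf[A]); i += 1 }
    case doubles: Array[Double] =>
      while (i < len) { ys(i) = f(doubles(i).asInstanceOf[A]); i += 1 }
    // ... branches for the remaining primitive element types ...
    case refs: Array[_] =>
      while (i < len) { ys(i) = f(refs(i).asInstanceOf[A]); i += 1 }
  }
  ys
}
```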
-
@lrytz I didn't notice any significant differences in the bytecode between Scala 2 and Scala 3 for the same code. I think the main difference lies in the implementation of `map`: the old version produces fairly large bytecode with type-based branching, and according to the JIT logs, the compiler occasionally fails to inline this method.

Thank you for the benchmark links; it was interesting to go through them. It seems to me that the existing benchmarks, for example, use rather simple lambdas that the JIT will most likely optimize aggressively. In real-world scenarios, where the functions passed to `map` are more complex, the compiler's behavior could be different.
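To illustrate the point, a JMH-style sketch (hypothetical benchmark names, not taken from the linked repository) that contrasts a trivial lambda with a heavier one could look like this:

```scala
package bench

import org.openjdk.jmh.annotations._
import org.openjdk.jmh.infra.Blackhole

// Hypothetical benchmark: compares a trivial lambda (easy for the JIT to
// inline and optimize away) with a heavier one closer to real-world use.
@State(Scope.Thread)
class ArrayMapLambdaWeight {
  @Param(Array("10000"))
  var size: Int = 0

  var xs: Array[Int] = Array.empty[Int]

  @Setup(Level.Trial)
  def setup(): Unit = xs = Array.tabulate(size)(identity)

  @Benchmark
  def mapTrivial(bh: Blackhole): Unit =
    bh.consume(xs.map(_ + 1))

  @Benchmark
  def mapHeavy(bh: Blackhole): Unit =
    bh.consume(xs.map(x => Integer.reverse(x * 31) ^ (x >>> 3)))
}
```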
-
Here's a more detailed analysis with proof.

Bytecode in benchmarks:

```java
// method -> specialized lambda (identical for oldMap/newMap)
bh.consume(ArrayOpsCompat.ArrayOpsExt$.MODULE$.newMap$extension(
    ArrayOpsCompat$.MODULE$.ArrayOpsExt(this.array()),
    (JFunction1.mcII.sp)(x) -> this.heavy(x),
    scala.reflect.ClassTag$.MODULE$.Int()
));

// regular lambda -> (Object -> Object)
bh.consume(ArrayOpsCompat.ArrayOpsExt$.MODULE$.newMap$extension(
    ArrayOpsCompat$.MODULE$.ArrayOpsExt(this.array()),
    (Object x) -> this.genericHeavy(x),
    scala.reflect.ClassTag$.MODULE$.Object()
));
```

Bytecode inside the method implementations:

```java
// oldMap
if ($this instanceof int[]) {
    for (int[] arr = (int[]) $this; i < len; ++i) {
        ys[i] = f.apply(BoxesRunTime.boxToInteger(arr[i]));
    }
} else if ($this instanceof double[]) {
    // branch for double[]
}
// ... other primitive array branches ...

// newMap
for (int i = 0; i < len; i++) {
    ys[i] = f.apply(xs[i]);
}
```

JIT log summary. Analysis of the JIT logs shows that `newMap$extension` and the specialized lambda calls (`apply$mcII$sp`) are inlined, while `oldMap$extension` itself is not:

```xml
<!-- context (C2): call to benchmarks.ArrayOpsCompat$ArrayOpsExt$.oldMap$extension -->
<method id='1455' holder='1411' name='oldMap$extension' return='1222' arguments='1222 1453 1454' flags='17' bytes='599' compile_id='781' compiler='c2' level='4' iicount='696'/>
<call method='1455' instr='invokevirtual'/>
<inline_fail reason='inlining prohibited by policy'/>
<!-- context (C1): call to benchmarks.ArrayOpsCompat$ArrayOpsExt$.oldMap$extension -->
<method id='1444' holder='1398' name='oldMap$extension' return='1222' arguments='1222 1442 1443' flags='17' bytes='599' compile_id='767' compiler='c1' level='3' iicount='406'/>
<call method='1444' instr='invokevirtual'/>
<inline_fail reason='callee is too large'/>
<!-- context: call to benchmarks.ArrayOpsCompat$ArrayOpsExt$.newMap$extension -->
<method id='1421' holder='1409' name='newMap$extension' return='1222' arguments='1222 1419 1420' flags='17' bytes='63' compile_id='775' compiler='c2' level='4' iicount='636'/>
<call method='1421' count='113050' prof_factor='1.000000' inline='1'/>
<inline_success reason='inline (hot)'/>
<!-- context: specialized lambda inlining -->
<method id='1507' holder='1424' name='apply$mcII$sp' return='1215' arguments='1215' flags='1' bytes='9' compile_id='753' compiler='c1' level='3' iicount='28090'/>
<call method='1507' count='27579' prof_factor='1.000000' inline='1'/>
<inline_success reason='inline (hot)'/>
```
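As a side note, the difference between the two benchmark call sites quoted above comes down to whether the lambda gets a specialized primitive function interface. A small sketch (hypothetical `SpecializationNote` object, for illustration only):

```scala
object SpecializationNote {
  def heavy(x: Int): Int = x * 31 + 7      // placeholder body
  def genericHeavy(x: AnyRef): AnyRef = x  // placeholder body

  // An Int => Int lambda is compiled against the specialized
  // scala.runtime.java8.JFunction1$mcII$sp interface, so the call site
  // invokes apply$mcII$sp(int): int and no boxing occurs.
  val specialized: Int => Int = x => heavy(x)

  // An AnyRef => AnyRef lambda goes through the erased Function1 interface,
  // so the call site invokes apply(Object): Object.
  val generic: AnyRef => AnyRef = x => genericHeavy(x)
}
```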
-
Hi!
I've been benchmarking the performance of `Array.map` and `Array.foreach` in Scala and found that their implementations can be made significantly faster (`map` by up to 3.5× and `foreach` by up to 2×) by simplifying the loop structure in cases where the function operates on statically known primitive types.

The current implementation relies on a `match` over array types, which introduces overhead and makes performance less predictable when transformations are chained. In contrast, a plain loop scales consistently and has significantly lower per-element latency.

Here's roughly how the current `map` method can be simplified; see the sketch below. You can find the full analysis with plots and measurements at the link at the end of this post.
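A minimal sketch of the simplified loops (illustrative only; the names `newMap`/`newForeach` follow the benchmark code quoted earlier in this thread, and the exact signatures of the final change may differ):

```scala
import scala.reflect.ClassTag

// Sketch of the simplified implementations: a plain counted while-loop
// with no dispatch on the runtime array class, corresponding to the
// plain loop seen in the decompiled newMap above.
def newMap[A, B: ClassTag](xs: Array[A])(f: A => B): Array[B] = {
  val len = xs.length
  val ys  = new Array[B](len)
  var i   = 0
  while (i < len) {
    ys(i) = f(xs(i))
    i += 1
  }
  ys
}

def newForeach[A, U](xs: Array[A])(f: A => U): Unit = {
  val len = xs.length
  var i   = 0
  while (i < len) {
    f(xs(i))
    i += 1
  }
}
```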
Would this kind of improvement be something you'd like to see in the standard library? I'd be happy to contribute a PR.
📊 https://github.com/2Pit/scala-benchmarks/blob/main/review/array_map_foreach/README.md