Make intersections much faster #406

Merged

oberblastmeister
Contributor

Resolves #225. I also incorporated techniques from #291 and #362. All tests pass. I need to do some benchmarking and clean up the code, probably reformat it also.

@oberblastmeister
Contributor Author

oberblastmeister commented Apr 9, 2022

Benchmarks show that this is 2x to 3x faster than the previous implementation! (It is also faster than union now.)

treeowl
treeowl previously requested changes Apr 9, 2022
Collaborator

@treeowl treeowl left a comment

Just a few concerns, most importantly about inlining.

@sjakobi
Member

sjakobi commented Apr 9, 2022 via email

@treeowl
Collaborator

treeowl commented Apr 9, 2022

Is stylish Haskell happening on GitHub or something? If so, I guess @sjakobi wants that. Don't ask me; I don't know tools.

The potential performance concern with an unboxed tuple result is that when the passed function doesn't inline, we end up with an extra function call per element. I don't know if that makes a big enough difference to worry about; it'd be worth experimenting. I imagine it would save a lot of code duplication if it turns out okay. Specifically, it would avoid source duplication between lazy and strict versions, and object code duplication among the variants.

@oberblastmeister
Contributor Author

How is that different than with a function that does not return an unboxed tuple? Shouldn't it still have to make an extra function call for each element if the function doesn't inline?

@treeowl
Collaborator

treeowl commented Apr 9, 2022

Oh, I was thinking of not inlining the version that produces an unboxed tuple. If we do inline that into the other variants, then yeah, everything should be basically the same but with less source code. But it's worth finding out how much we pay for not inlining it, because object code isn't free either.

@oberblastmeister
Contributor Author

oberblastmeister commented Apr 9, 2022

What if we just inline the ones with the unboxed tuples? Is there any disadvantage? We already inline intersectionWith, so inlining intersectionWith# shouldn't create extra code, should it? Later we can experiment and see whether we can avoid inlining the unboxed version.
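To make the trade-off in this exchange concrete, here is a minimal sketch of the pattern being discussed: a single worker whose function argument returns an unboxed 1-tuple, with the lazy and strict variants written as thin wrappers around it. It works on plain lists rather than the real HashMap, and the names `zipWithU`, `lazyZipWith` and `strictZipWith` are made up for illustration; they are not the library's API.

```haskell
{-# LANGUAGE BangPatterns  #-}
{-# LANGUAGE UnboxedTuples #-}

-- Hypothetical worker: the combining function returns an unboxed 1-tuple,
-- so the caller decides whether the element is already evaluated when the
-- worker stores it.
zipWithU :: (a -> b -> (# c #)) -> [a] -> [b] -> [c]
zipWithU f = go
  where
    go (x : xs) (y : ys) = case f x y of
      (# z #) -> z : go xs ys
    go _ _ = []
{-# INLINE zipWithU #-}

-- Lazy variant: the element is stored as an unevaluated thunk.
lazyZipWith :: (a -> b -> c) -> [a] -> [b] -> [c]
lazyZipWith f = zipWithU (\x y -> (# f x y #))

-- Strict variant: the element is forced before it is stored.
strictZipWith :: (a -> b -> c) -> [a] -> [b] -> [c]
strictZipWith f = zipWithU (\x y -> let !z = f x y in (# z #))
```

With the worker marked INLINE, both wrappers should compile to roughly what hand-written lazy and strict loops would. If the worker is not inlined, the object code is shared, at the cost of an unknown function call per element.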

oberblastmeister and others added 3 commits April 9, 2022 16:43
This one inlines the unboxed form into everything else, hopefully.
@treeowl
Collaborator

treeowl commented Apr 9, 2022

I opened a PR against your branch to do that.

@treeowl
Collaborator

treeowl commented Apr 9, 2022

There are CI failures on older GHC, because shrinking wasn't available yet. We'll need a fall-back.

@treeowl
Collaborator

treeowl commented Apr 9, 2022

The fallback should probably just define the shrinking operation manually in the Array module. It won't be too efficient, but whatever.

@oberblastmeister
Contributor Author

Should I just use something like copy to implement the fallback?

@treeowl
Collaborator

treeowl commented Apr 9, 2022

I dunno. I haven't actually looked at how you're using shrink. Really, you can do whatever you think is reasonable, but I'd prefer to keep the CPP for it in Array if possible.

@treeowl
Collaborator

treeowl commented Apr 9, 2022

Oh, I just looked. You'll want to use cloneSmallMutableArray#

@oberblastmeister
Contributor Author

Something like

#if __GLASGOW_HASKELL__ >= 810
shrink = ...
#else
shrink = ...
#endif

(never used cpp before)

@treeowl
Collaborator

treeowl commented Apr 9, 2022

Yup! It's one of the world's worst macro systems, but it's a lot faster than Template Haskell. Sigh
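For readers following along, here is a hedged sketch of what such a CPP-guarded fallback could look like. It assumes that `shrinkSmallMutableArray#` is available from GHC 8.10 onwards and falls back to `cloneSmallMutableArray#` on older compilers, as suggested above. Note that `__GLASGOW_HASKELL__` is an integer of the form major * 100 + minor, so the check is written as `>= 810`. The `MArray` wrapper, `new`, `lengthM` and `shrunkLength` are stand-ins invented for this example, not the library's definitions.

```haskell
{-# LANGUAGE CPP           #-}
{-# LANGUAGE MagicHash     #-}
{-# LANGUAGE UnboxedTuples #-}

import Control.Monad.ST (runST)
import GHC.Exts (Int (I#), SmallMutableArray#)
import qualified GHC.Exts as Exts
import GHC.ST (ST (ST))

-- Hypothetical wrapper standing in for the library's mutable array type.
data MArray s a = MArray (SmallMutableArray# s a)

-- Keep only the first n elements. Callers should use the returned array and
-- forget the argument, because one branch mutates in place and the other
-- copies.
shrink :: MArray s a -> Int -> ST s (MArray s a)
#if __GLASGOW_HASKELL__ >= 810
-- GHC 8.10 and later: shrink in place with the primop.
shrink mary@(MArray m#) (I# n#) = ST $ \s ->
  case Exts.shrinkSmallMutableArray# m# n# s of
    s' -> (# s', mary #)
#else
-- Older GHC: no shrinking primop, so copy the prefix into a fresh array.
shrink (MArray m#) (I# n#) = ST $ \s ->
  case Exts.cloneSmallMutableArray# m# 0# n# s of
    (# s', new# #) -> (# s', MArray new# #)
#endif

-- The helpers below exist only so that the sketch can be tried out.
new :: Int -> a -> ST s (MArray s a)
new (I# n#) x = ST $ \s ->
  case Exts.newSmallArray# n# x s of
    (# s', m# #) -> (# s', MArray m# #)

lengthM :: MArray s a -> ST s Int
#if __GLASGOW_HASKELL__ >= 810
lengthM (MArray m#) = ST $ \s ->
  case Exts.getSizeofSmallMutableArray# m# s of
    (# s', n# #) -> (# s', I# n# #)
#else
lengthM (MArray m#) = pure (I# (Exts.sizeofSmallMutableArray# m#))
#endif

-- Allocate four elements, shrink to two, and report the new length.
shrunkLength :: Int
shrunkLength = runST (new 4 'x' >>= \m -> shrink m 2 >>= lengthM)
```

Keeping the `#if` inside one small function means the rest of the code can call `shrink` without knowing which compiler it is built with.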

@sjakobi
Member

sjakobi commented Apr 12, 2022

Could you please rebase, so the changes from #407 are removed from the diff?

Member

@sjakobi sjakobi left a comment

I haven't reviewed intersectionCollisions yet. Could you add some documentation on searchSwap first?

Comment on lines 1842 to 1859
    -- iterate over nonzero bits of b1 .&. b2
    let go !i !i1 !i2 !b !bFinal
          | b == 0 = pure (i, bFinal)
          | testBit $ b1 .&. b2 = do
              x1 <- A.indexM ary1 i1
              x2 <- A.indexM ary2 i2
              case f x1 x2 of
                Empty -> go i (i1 + 1) (i2 + 1) b' (bFinal .&. complement m)
                _ -> do
                  A.write mary i $! f x1 x2
                  go (i + 1) (i1 + 1) (i2 + 1) b' bFinal
          | testBit b1 = go i (i1 + 1) i2 b' bFinal
          | otherwise = go i i1 (i2 + 1) b' bFinal
          where
            m = 1 `unsafeShiftL` countTrailingZeros b
            testBit x = x .&. m /= 0
            b' = b .&. complement m
    (maryLen, bFinal) <- go 0 0 0 bCombined bIntersect
Member

The comment at the top seems incorrect: Currently the loop actually iterates over b1 .|. b2. It would be nice to change this though. In that case the i1 and i2 indices could be computed with sparseIndex.

Member

In that case the i1 and i2 indices could be computed with sparseIndex.

For comparison, here is a version of unionArrayBy that uses sparseIndex to compute all the indices:

unionArrayBy f !b1 !b2 !ary1 !ary2 = A.run $ do
    let b' = b1 .|. b2
    mary <- A.new_ (popCount b')
    -- iterate over nonzero bits of b1 .|. b2
    let go !b
          | b == 0 = return ()
          | otherwise = do
              let ba = b1 .&. b2
                  c = countTrailingZeros b
                  m = bit c
                  i = sparseIndex b' m
                  i1 = sparseIndex b1 m
                  i2 = sparseIndex b2 m
              t <- if | testBit ba c -> do
                          x1 <- A.indexM ary1 i1
                          x2 <- A.indexM ary2 i2
                          return $! f x1 x2
                      | testBit b1 c -> A.indexM ary1 i1
                      | otherwise -> A.indexM ary2 i2
              A.write mary i t
              go (clearBit b c)
    go b'
    -- A.run expects the action to yield the mutable array
    return mary

I expect that keeping i as a loop argument will be more efficient than recomputing it on each iteration though.
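As a small illustration of the idea in this comment, the sketch below iterates only over the bits set in `b1 .&. b2` and derives both array positions with a popcount-based `sparseIndex`. The definition of `sparseIndex` here is written from memory of how such bitmap-indexed nodes usually work and should be treated as an assumption rather than a copy of the library's code. `Bitmap` and `sharedIndices` are hypothetical names.

```haskell
import Data.Bits (complement, countTrailingZeros, popCount, unsafeShiftL, (.&.))
import Data.Word (Word32)

type Bitmap = Word32

-- Position, inside the compressed array described by bitmap b, of the entry
-- for the single-bit mask m: the number of set bits below m.
sparseIndex :: Bitmap -> Bitmap -> Int
sparseIndex b m = popCount (b .&. (m - 1))

-- For every bit present in both bitmaps, the indices into the first and the
-- second compressed array, in ascending bit order.
sharedIndices :: Bitmap -> Bitmap -> [(Int, Int)]
sharedIndices b1 b2 = go (b1 .&. b2)
  where
    go 0 = []
    go b = (sparseIndex b1 m, sparseIndex b2 m) : go (b .&. complement m)
      where
        -- lowest set bit that has not been visited yet
        m = 1 `unsafeShiftL` countTrailingZeros b
```

For instance, with `b1 = 11` (bits 0, 1, 3) and `b2 = 14` (bits 1, 2, 3), `sharedIndices 11 14` evaluates to `[(1,0),(2,2)]`. The loop visits only the two shared bits, but pays for two popcounts on each visit, which is the cost the benchmarks further down in this thread measure.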

Contributor Author

Using sparseIndex makes benchmarks slower

Member

Can you show me the diff?

Using sparseIndex makes benchmarks slower

By how much? I also think the benchmark data might be a bit weird.

Contributor Author

How do I show you the diff?

Member

Ideally both.

Contributor Author

Sparse index:

All
  HashMap
    intersection
      Int:        OK (0.34s)
        52.1 μs ± 3.3 μs
      ByteString: OK (0.26s)
        62.6 μs ± 6.1 μs

Contributor Author

Without sparse index:

All
  HashMap
    intersection
      Int:        OK (0.86s)
        42.1 μs ± 1.5 μs
      ByteString: OK (0.33s)
        47.7 μs ± 2.8 μs

Member

Alright, thanks! I think we might get different results with data where there's less overlap between the two maps. But that can be investigated at a different time.

Member

Follow-up issue in #416.


intersectionCollisions :: Eq k => (k -> v1 -> v2 -> (# v3 #)) -> A.Array (Leaf k v1) -> A.Array (Leaf k v2) -> ST s (Int, A.MArray s (Leaf k v3))
intersectionCollisions f ary1 ary2 = do
  mary2 <- A.thaw ary2 0 $ A.length ary2
Member

I wonder whether we actually need to allocate two arrays for this. The alternative would be to perform the search-and-swap operations on the output array itself.

It might be a bit tricky though – maybe leave it for a follow-up PR, so this one doesn't get too huge.

Contributor Author

I think the issue with this is that the element type could change. For example, suppose we have two arrays with numbers as keys, where the two arrays have different value types:
1 2 3 4
3 4 2 1
Let's thaw the first array and mutate it to
(f 3 3) 2 1 4
The result of f 3 3 could have a different type than the remaining elements 2 1 4.
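The search-and-swap approach and the typing concern can be sketched on a toy model. The version below is an illustration under stated assumptions, not the PR's `intersectionCollisions`: it uses association lists and an `STArray` from the `array` package in place of the library's arrays, it assumes keys are distinct within each input (as they are within one collision node), and the names `intersectWithKey` and `searchSwap` are chosen for this example only. The second input is copied into a mutable array that keeps its own element type `(k, b)`, while results of type `(k, c)` are collected separately.

```haskell
import Control.Monad.ST (ST, runST)
import Data.Array.ST (STArray, newListArray, readArray, writeArray)

-- Toy intersection of two association lists with distinct keys.
intersectWithKey :: Eq k => (k -> a -> b -> c) -> [(k, a)] -> [(k, b)] -> [(k, c)]
intersectWithKey f xs ys = runST $ do
  let n = length ys
  -- mutable working copy of the second input; its element type never changes
  mary <- newListArray (0, n - 1) ys
  go mary n 0 xs
  where
    go _ _ _ [] = pure []
    go mary n j ((k, a) : rest)
      | j >= n = pure [] -- every entry of the second input has been matched
      | otherwise = do
          found <- searchSwap mary n j k
          case found of
            Just b -> ((k, f k a b) :) <$> go mary n (j + 1) rest
            Nothing -> go mary n j rest

-- Look for key k in positions j .. n-1. On a hit, swap the entry into
-- position j, so that the still-unmatched entries occupy j+1 .. n-1.
searchSwap :: Eq k => STArray s Int (k, b) -> Int -> Int -> k -> ST s (Maybe b)
searchSwap mary n j k = loop j
  where
    loop i
      | i >= n = pure Nothing
      | otherwise = do
          e@(k', b) <- readArray mary i
          if k' == k
            then do
              atJ <- readArray mary j
              writeArray mary i atJ
              writeArray mary j e
              pure (Just b)
            else loop (i + 1)
```

Each successful match shrinks the region that later searches have to scan. Writing `f k a b` back into the working copy would only typecheck when `c` and `b` coincide, which is the restriction described in the comment above.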

Member

Ah, yes, good point. Unsafe coercions might work for this, but I'd prefer not trying this in this PR.

Collaborator

This is only an issue for intersectionWithKey and such; intersection itself has no type issue.

Collaborator

There's another thing about intersection that's special: we can reuse the leaves.

Member

Yes, it would be better if intersection had custom code for handling collisions. Maybe this can be achieved by changing intersectionWithKey# to something similar to filterMapAux.

I'd slightly prefer if we'd leave this for a follow-up PR though.

Member

I have recorded these ideas in #415.

@oberblastmeister
Contributor Author

What should I do about inlining? I understand the need to eliminate the closures, but the functions are truly massive, intersection has 1,700 terms, while unionWithKey has 2,200! Wouldn't it be bad to mark these as {-# INLINE #-}? We could add a comment saying to explicitly inline if you want to remove the closure.

@oberblastmeister
Contributor Author

Also, if we pass the function around through recursion, then we wouldn't be able to implement intersection in terms of intersectionWithKey, right?

@sjakobi
Member

sjakobi commented Apr 12, 2022

What should I do about inlining? I understand the need to eliminate the closures, but the functions are truly massive, intersection has 1,700 terms, while unionWithKey has 2,200! Wouldn't it be bad to mark these as {-# INLINE #-}?

I think we should stick with INLINABLE until we're convinced that INLINE is better somehow. INLINABLE is better for compile times for example.

We could add a comment saying to explicitly inline if you want to remove the closure.

I think the docs of these functions are the wrong place to teach people about GHC.Exts.inline. Maybe it should be mentioned in https://github.com/input-output-hk/hs-opt-handbook.github.io?!
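For context, here is a minimal sketch of the call-site inlining that GHC.Exts.inline offers, which is what makes INLINABLE a workable default: the definition only exposes its unfolding, and a caller who wants the closure eliminated asks for inlining explicitly. `mapPairs` and `sums` are hypothetical stand-ins for a large higher-order library function and a user's specialised wrapper.

```haskell
import GHC.Exts (inline)

-- Hypothetical higher-order function. INLINABLE keeps its unfolding
-- available without asking GHC to inline it at every call site.
mapPairs :: (a -> b -> c) -> [(a, b)] -> [c]
mapPairs f = go
  where
    go [] = []
    go ((x, y) : rest) = f x y : go rest
{-# INLINABLE mapPairs #-}

-- A caller that wants the function argument inlined at this call site only.
sums :: [(Int, Int)] -> [Int]
sums = inline mapPairs (+)
```

Whether this pays off depends on the size of the unfolding and on how hot the call site is, so it is best decided per call site with benchmarks in hand.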

@sjakobi sjakobi mentioned this pull request Apr 12, 2022
@oberblastmeister
Contributor Author

@sjakobi Is there anything else that I need to do?

Member

@sjakobi sjakobi left a comment

Can you fix the merge conflict, @oberblastmeister?

I'll merge afterwards.

import Data.Hashable (Hashable)
import Data.Hashable.Lifted (Hashable1, Hashable2)
import Data.HashMap.Internal.List (isPermutationBy, unorderedCompare)
Member

FWIW, the changed sorting of imports is probably due to haskell/stylish-haskell#385, which was recently released.

@sjakobi sjakobi dismissed treeowl’s stale review April 15, 2022 20:09

The concerns seem to have been addressed.

@sjakobi sjakobi merged commit b73381e into haskell-unordered-containers:master Apr 15, 2022
@sjakobi
Member

sjakobi commented Apr 15, 2022

Thank you, @oberblastmeister! :)
