Faster megular expressions #265
Conversation
x/meg/meg.go
Outdated
generation int
code       int
capture    captureFunc
next       int
Use int64; int is implementation dependent. The bitwise calculations won't work on 32-bit machines.
I'm not relying on int64 bitwise calculations here. Maybe you're thinking of the visited bitset, which is explicitly made of 64-bit words. That's used to flip a bit when we've visited a MatchState in the array, keyed by its index, and that works whether the index is int32 or int64.
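For illustration, a minimal sketch of that bitset indexing (the helper is hypothetical, not code from this PR; the backing slice is a []uint64 with one bit per MatchState, as in the snippet further down):

// markVisited is an illustrative helper showing how the visited bitset works:
// the state's index only selects a word and a bit within that word, so the
// index itself can safely be a plain int on 32-bit or 64-bit platforms.
func markVisited(visited []uint64, stateIndex int) (alreadySeen bool) {
	word := stateIndex / 64
	bit := uint64(1) << (stateIndex % 64)
	alreadySeen = visited[word]&bit != 0
	visited[word] |= bit
	return alreadySeen
}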
Changes LGTM!
stack = append(stack, task{splitIdx, t.cap})
stack = append(stack, task{s.next, t.cap})
IIRC, this was using recursion before and an explicit stack now.
How much better is the stack compared to recursion here?
Recursive
goos: darwin
goarch: arm64
pkg: github.com/multiformats/go-multiaddr/x/meg
cpu: Apple M1 Max
BenchmarkIsWebTransportMultiaddrPrealloc-10 2778994 429.7 ns/op 160 B/op 9 allocs/op
BenchmarkIsWebTransportMultiaddrNoCapturePrealloc-10 5164788 231.8 ns/op 0 B/op 0 allocs/op
BenchmarkIsWebTransportMultiaddrNoCapture-10 2945781 414.0 ns/op 472 B/op 2 allocs/op
BenchmarkIsWebTransportMultiaddr-10 1296012 919.5 ns/op 920 B/op 25 allocs/op
Non-recursive
BenchmarkIsWebTransportMultiaddrPrealloc-10 3170400 380.5 ns/op 160 B/op 9 allocs/op
BenchmarkIsWebTransportMultiaddrNoCapturePrealloc-10 6977448 171.9 ns/op 0 B/op 0 allocs/op
BenchmarkIsWebTransportMultiaddrNoCapture-10 3290984 353.9 ns/op 472 B/op 2 allocs/op
BenchmarkIsWebTransportMultiaddr-10 1374780 863.3 ns/op 920 B/op 25 allocs/op
This is the recursive implementation:
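// appendStateRecursive (comment added here for readability; not part of the
// quoted snippet): starting from stateIndex, it follows both branches of any
// split state (codeOrKind < done) and records every other reachable state,
// together with its capture, in arr; visitedBitSet prevents a state from
// being expanded twice.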
func appendStateRecursive(arr statesAndCaptures, states []MatchState, stateIndex int, c *capture, visitedBitSet []uint64) statesAndCaptures {
if stateIndex >= len(states) {
return arr
}
s := states[stateIndex]
if visitedBitSet[stateIndex/64]&(1<<(stateIndex%64)) != 0 {
return arr
}
visitedBitSet[stateIndex/64] |= 1 << (stateIndex % 64)
if s.codeOrKind < done {
arr = appendStateRecursive(arr, states, s.next, c, visitedBitSet)
arr = appendStateRecursive(arr, states, restoreSplitIdx(s.codeOrKind), c, visitedBitSet)
} else {
arr.states = append(arr.states, stateIndex)
arr.captures = append(arr.captures, c)
}
return arr
}
It's a bit faster, but if you prefer the simplicity of the recursive approach I'm happy to change it back.
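For comparison, here is a rough sketch of the explicit-stack variant being discussed, reconstructed from the two stack lines quoted above; it is an approximation, not the PR's exact code (the PR's version also threads the capture through a task struct):

// appendStateIterative is an illustrative sketch: it walks the same states as
// appendStateRecursive, but with an explicit stack of state indexes instead
// of the call stack.
func appendStateIterative(arr statesAndCaptures, states []MatchState, start int, c *capture, visitedBitSet []uint64) statesAndCaptures {
	stack := []int{start}
	for len(stack) > 0 {
		stateIndex := stack[len(stack)-1]
		stack = stack[:len(stack)-1]
		if stateIndex >= len(states) {
			continue
		}
		if visitedBitSet[stateIndex/64]&(1<<(stateIndex%64)) != 0 {
			continue // already expanded this state
		}
		visitedBitSet[stateIndex/64] |= 1 << (stateIndex % 64)
		s := states[stateIndex]
		if s.codeOrKind < done {
			// Split state: push both branches. The split branch goes on
			// first so that s.next is popped (and explored) first,
			// mirroring the recursive order.
			stack = append(stack, restoreSplitIdx(s.codeOrKind), s.next)
		} else {
			arr.states = append(arr.states, stateIndex)
			arr.captures = append(arr.captures, c)
		}
	}
	return arr
}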
I do prefer the simplicity of the recursive approach.
It would be interesting to have a benchmark that converts a multiaddr to a string and then runs a stdlib regexp on it.
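For reference, a hypothetical sketch of what such a benchmark could look like; the pattern, the address, the package placement, and the benchmark name are all assumptions, not code from this PR:

package multiaddr_test

import (
	"regexp"
	"testing"

	ma "github.com/multiformats/go-multiaddr"
)

// Rough approximation of a WebTransport multiaddr; illustrative only.
var webtransportRE = regexp.MustCompile(
	`^/ip4/[^/]+/udp/[0-9]+/quic-v1/webtransport(/certhash/[^/]+)*$`)

// BenchmarkIsWebTransportMultiaddrStdlibRegexp renders the multiaddr to a
// string and matches it with a stdlib regexp, the comparison suggested above.
func BenchmarkIsWebTransportMultiaddrStdlibRegexp(b *testing.B) {
	addr := ma.StringCast("/ip4/1.2.3.4/udp/1234/quic-v1/webtransport")
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		if !webtransportRE.MatchString(addr.String()) {
			b.Fatal("expected a match")
		}
	}
}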
Okay this is much slower:
* much cheaper copies of captures
* Add a benchmark
* allocate to a slice. Use indexes as handles
* cleanup
* Add nocapture loop benchmark. It's really fast. No surprise
* cleanup
* nits
This PR still needs some cleanup, but I'm sharing it early to get some feedback.
This is overall 4x faster than the original PR for the normal case (no preallocation).
The main changes are:
* Protocol type. (I feel like this should have been optimized by the compiler, though.)
* MatchStates and Index handles (a simplified sketch follows below): By allocating to a single slice rather than the heap, we can prepare the memory upfront and reduce our overall memory pressure. We also make the MatchState much smaller (24 bytes, down from 48 bytes). The smaller size and having the states in contiguous memory make it more cache friendly.

On my machine, with these changes, using Megular expressions normally is about as fast as the ForEach style in v0.14.
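As a simplified sketch of the index handles idea (field names here are illustrative; the real MatchState also carries generation and capture fields, as in the review diff above):

// matchState is a simplified stand-in for the PR's MatchState: states live in
// one contiguous slice and refer to each other by slice index rather than by
// pointer, so building a pattern stays a single allocation and states stay small.
type matchState struct {
	codeOrKind int32 // protocol code to match, or a special kind such as a split
	next       int32 // index handle of the next state within the states slice
}

type matcher struct {
	states []matchState // all states for a pattern, allocated up front
}

// addState appends a state and returns its index, which callers keep as a
// cheap handle instead of a pointer into the heap.
func (m *matcher) addState(codeOrKind, next int32) int32 {
	m.states = append(m.states, matchState{codeOrKind: codeOrKind, next: next})
	return int32(len(m.states) - 1)
}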
Benchmark Results

Assume this system unless otherwise noted (the darwin/arm64 Apple M1 Max setup from the benchmark output above):
go-multiaddr @ v0.14 using .ForEach (https://github.com/multiformats/go-multiaddr/blob/marco/v0.14.0-bench/bench_test.go):

The original megular expressions PR:
All but the Index handles change (the first two points from above):
With the index handles change: