
Faster megular expressions #265

Merged (7 commits), Feb 20, 2025
Conversation

@MarcoPolo (Contributor) commented Feb 11, 2025

This PR still needs some cleanup, but I'm sharing it early to get feedback.

This is overall 4x faster than the original PR for the normal case (no preallocation).

The main changes are:

  • A faster Code() method that avoids an unnecessary copy for the Protocol type. (I feel like the compiler should have optimized this away, though.)
  • Capture state is now a linked list rather than a map. This lets us "fork" states cheaply with structural sharing.
  • MatchStates are allocated in a slice, and indices are used as handles. This is a big change that introduces a fair amount of complexity; I'll expand on it below.
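To illustrate the linked-list capture idea, here is a minimal sketch (the type and function names are illustrative, not the PR's actual code): forking a state only copies a head pointer, and both forks share the existing tail.

```go
package main

import "fmt"

// captureNode is an immutable node in a singly linked list of captured
// values. Forking a match state copies just the head pointer; both forks
// share the tail, so a fork is O(1) instead of copying a whole map.
type captureNode struct {
	value string
	prev  *captureNode
}

// push returns a new head without mutating the existing list.
func push(head *captureNode, v string) *captureNode {
	return &captureNode{value: v, prev: head}
}

// collect walks from the head back to the root, newest value first.
func collect(head *captureNode) []string {
	var out []string
	for n := head; n != nil; n = n.prev {
		out = append(out, n.value)
	}
	return out
}

func main() {
	base := push(push(nil, "ip4"), "udp")
	forkA := push(base, "quic")         // shares base's nodes
	forkB := push(base, "webtransport") // also shares base's nodes
	fmt.Println(collect(forkA)) // [quic udp ip4]
	fmt.Println(collect(forkB)) // [webtransport udp ip4]
}
```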

MatchStates and Index handles:

By allocating into a single slice rather than individually on the heap, we can prepare the memory upfront and reduce overall memory pressure. We also make MatchState much smaller (24 bytes, down from 48). The smaller size, plus keeping the states in contiguous memory, makes the whole thing more cache friendly.
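A simplified sketch of the index-handle pattern described above (the arena and field names here are illustrative, not the PR's actual types): states live in one contiguous slice, and callers hold small integer handles instead of pointers.

```go
package main

import "fmt"

// matchState is kept small (a couple of int32s) so many states fit per
// cache line and the backing slice stays compact.
type matchState struct {
	codeOrKind int32
	next       int32 // index of the next state in the arena; -1 for none
}

// arena allocates states into one contiguous slice. Callers hold int32
// indices ("handles") instead of pointers, so there is a single backing
// allocation and no per-state heap object.
type arena struct {
	states []matchState
}

// alloc appends a new state and returns its index as a handle.
func (a *arena) alloc(code, next int32) int32 {
	a.states = append(a.states, matchState{codeOrKind: code, next: next})
	return int32(len(a.states) - 1)
}

// get resolves a handle back to the state it names.
func (a *arena) get(h int32) *matchState { return &a.states[h] }

func main() {
	var a arena
	end := a.alloc(0, -1)
	start := a.alloc(42, end)
	fmt.Println(start, a.get(start).next) // 1 0
}
```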

On my machine, with these changes, using Megular expressions normally is about as fast as the ForEach style in v0.14.

Benchmark Results

Assume this system unless otherwise noted:

goos: darwin
goarch: arm64
pkg: github.com/multiformats/go-multiaddr
cpu: Apple M1 Max

go-multiaddr @ v0.14 using .ForEach (https://github.com/multiformats/go-multiaddr/blob/marco/v0.14.0-bench/bench_test.go):

BenchmarkIsWebTransportMultiaddrForEach-10          1499752               783.0 ns/op          1912 B/op         13 allocs/op

The original megular expressions PR:

BenchmarkIsWebTransportMultiaddr-10       422496              2503 ns/op            2888 B/op         74 allocs/op

All but the Index handles change (the first two points from above):

BenchmarkIsWebTransportMultiaddrPrealloc-10                      2924878               414.3 ns/op           160 B/op          9 allocs/op
BenchmarkIsWebTransportMultiaddrNoCapturePrealloc-10             5895022               202.8 ns/op             0 B/op          0 allocs/op
BenchmarkIsWebTransportMultiaddrNoCapture-10                     1276366               937.1 ns/op          1144 B/op         28 allocs/op
BenchmarkIsWebTransportMultiaddr-10                               716146              1659 ns/op            1656 B/op         59 allocs/op
BenchmarkIsWebTransportMultiaddrLoop-10                          4707390               252.7 ns/op           136 B/op         12 allocs/op

With the index handles change:

BenchmarkIsWebTransportMultiaddrPrealloc-10                      3011650               382.4 ns/op           160 B/op          9 allocs/op
BenchmarkIsWebTransportMultiaddrNoCapturePrealloc-10             6963790               171.7 ns/op             0 B/op          0 allocs/op
BenchmarkIsWebTransportMultiaddrNoCapture-10                     3364348               357.9 ns/op           472 B/op          2 allocs/op
BenchmarkIsWebTransportMultiaddr-10                              1377448               883.8 ns/op           920 B/op         25 allocs/op
BenchmarkIsWebTransportMultiaddrLoop-10                          4797229               250.4 ns/op           136 B/op         12 allocs/op

x/meg/meg.go Outdated
generation int
code int
capture captureFunc
next int
Member:

Use int64; int is implementation-dependent. The bitwise calculations won't work on 32-bit machines.

MarcoPolo (Contributor, Author):

I'm not relying on int64 bitwise calculations here. You may be thinking of the visited bitset, which is explicitly a []uint64. It's used to flip a bit when we've visited a MatchState in the array, keyed by its index, and that works whether the index is int32 or int64.
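For context, the bitset indexing being described can be sketched like this (helper names are illustrative; the real code inlines the expressions): the state index picks a word via i/64 and a bit within it via i%64, independent of the index's integer width.

```go
package main

import "fmt"

// markVisited sets the bit for state index i in a []uint64 bitset.
// Only the word index i/64 and the bit offset i%64 matter, so the
// width of the index type (int32 vs int64) is irrelevant.
func markVisited(bits []uint64, i int) {
	bits[i/64] |= 1 << (i % 64)
}

// isVisited reports whether the bit for state index i is set.
func isVisited(bits []uint64, i int) bool {
	return bits[i/64]&(1<<(i%64)) != 0
}

func main() {
	bits := make([]uint64, 2) // room for 128 state indices
	markVisited(bits, 70)
	fmt.Println(isVisited(bits, 70), isVisited(bits, 69)) // true false
}
```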

@sukunrt (Member) commented Feb 13, 2025

Changes LGTM!
Feel free to merge this to the main PR.

Comment on lines +177 to +178
stack = append(stack, task{splitIdx, t.cap})
stack = append(stack, task{s.next, t.cap})
Member:

IIRC, this used recursion before and now uses an explicit stack.
How much better is the stack compared to recursion here?

MarcoPolo (Contributor, Author):

Recursive

goos: darwin
goarch: arm64
pkg: github.com/multiformats/go-multiaddr/x/meg
cpu: Apple M1 Max
BenchmarkIsWebTransportMultiaddrPrealloc-10                      2778994               429.7 ns/op           160 B/op          9 allocs/op
BenchmarkIsWebTransportMultiaddrNoCapturePrealloc-10             5164788               231.8 ns/op             0 B/op          0 allocs/op
BenchmarkIsWebTransportMultiaddrNoCapture-10                     2945781               414.0 ns/op           472 B/op          2 allocs/op
BenchmarkIsWebTransportMultiaddr-10                              1296012               919.5 ns/op           920 B/op         25 allocs/op

Non-recursive

BenchmarkIsWebTransportMultiaddrPrealloc-10                      3170400               380.5 ns/op           160 B/op          9 allocs/op
BenchmarkIsWebTransportMultiaddrNoCapturePrealloc-10             6977448               171.9 ns/op             0 B/op          0 allocs/op
BenchmarkIsWebTransportMultiaddrNoCapture-10                     3290984               353.9 ns/op           472 B/op          2 allocs/op
BenchmarkIsWebTransportMultiaddr-10                              1374780               863.3 ns/op           920 B/op         25 allocs/op

This is the recursive implementation:

func appendStateRecursive(arr statesAndCaptures, states []MatchState, stateIndex int, c *capture, visitedBitSet []uint64) statesAndCaptures {
	if stateIndex >= len(states) {
		return arr
	}
	s := states[stateIndex]

	if visitedBitSet[stateIndex/64]&(1<<(stateIndex%64)) != 0 {
		return arr
	}
	visitedBitSet[stateIndex/64] |= 1 << (stateIndex % 64)

	if s.codeOrKind < done {
		arr = appendStateRecursive(arr, states, s.next, c, visitedBitSet)
		arr = appendStateRecursive(arr, states, restoreSplitIdx(s.codeOrKind), c, visitedBitSet)
	} else {
		arr.states = append(arr.states, stateIndex)
		arr.captures = append(arr.captures, c)
	}
	return arr
}

It's a bit faster, but if you prefer the simplicity of the recursive approach I'm happy to change it back.
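For contrast, the explicit-stack traversal can be sketched in a simplified, self-contained form (the state encoding here is illustrative, not the PR's actual codeOrKind scheme): split states push both branches onto a stack, terminal states get collected, and the same bitset guards against revisits.

```go
package main

import "fmt"

// state is a simplified stand-in for MatchState: next/split >= 0 means
// "split" (follow both branches); otherwise the state is terminal.
type state struct {
	next, split int // -1 when absent
}

// appendStates mirrors appendStateRecursive with an explicit stack:
// split states push both successors, terminal states are collected.
func appendStates(states []state, start int) []int {
	var out []int
	visited := make([]uint64, (len(states)+63)/64)
	stack := []int{start}
	for len(stack) > 0 {
		i := stack[len(stack)-1]
		stack = stack[:len(stack)-1]
		if i < 0 || i >= len(states) {
			continue
		}
		if visited[i/64]&(1<<(i%64)) != 0 {
			continue
		}
		visited[i/64] |= 1 << (i % 64)
		s := states[i]
		if s.next >= 0 || s.split >= 0 {
			// Push split first so next pops first, mirroring the
			// order of the recursive calls.
			stack = append(stack, s.split, s.next)
		} else {
			out = append(out, i)
		}
	}
	return out
}

func main() {
	states := []state{
		{next: 1, split: 2},   // a split state
		{next: -1, split: -1}, // terminal
		{next: -1, split: -1}, // terminal
	}
	fmt.Println(appendStates(states, 0)) // [1 2]
}
```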

Member:

I do prefer the simplicity of the recursive approach

@sukunrt (Member) commented Feb 13, 2025

It would be interesting to have a benchmark that converts a multiaddr to a string and then runs a stdlib regexp on it.

@sukunrt (Member) commented Feb 13, 2025

Okay, this is much slower:

func BenchmarkIsWebRTCDirectMultiaddrString(b *testing.B) {
	addr := multiaddr.StringCast("/ip4/1.2.3.4/udp/1234/webrtc-direct/")
	// Compile the regexp before resetting the timer so compilation
	// isn't included in the measurement.
	reg := regexp.MustCompile(`^/ip4/(?P<ip>.+)/udp/(?P<port>.+)/webrtc-direct(/certhash/[^/]+)*$`)

	b.ResetTimer()
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		addrS := addr.String()
		res := reg.FindStringSubmatch(addrS)
		if len(res) != 4 {
			b.Fatal("unexpected result", addrS, len(res))
		}
	}
}

@MarcoPolo MarcoPolo marked this pull request as ready for review February 20, 2025 01:20
@MarcoPolo MarcoPolo merged commit ff4bf42 into marco/match-and-capture Feb 20, 2025
MarcoPolo added a commit that referenced this pull request Feb 26, 2025
* much cheaper copies of captures

* Add a benchmark

* allocate to a slice. Use indexes as handles

* cleanup

* Add nocapture loop benchmark

It's really fast. No surprise

* cleanup

* nits
@p-shahi p-shahi mentioned this pull request Feb 26, 2025