
Conversation

hugoferreira commented Nov 24, 2025

Problem & Rationale

Parsing defers field assignments until the winning branch is known. Each Defer call allocates a fresh contextFieldSet so those captured values survive branch backtracking. Benchmarks showed parseContext.Defer/Branch accounting for nearly half of total allocations (pprof: Branch ~25%, Defer ~19%) even on tiny inputs. These structs are short‑lived, small, and have fixed shape, so recycling them avoids steady heap pressure and reduces GC work without touching parser semantics.
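For reference, those profile numbers can be reproduced with the standard go test / pprof tooling along these lines (exact binary and profile paths depend on where the commands are run):

$ (cd _examples && go test ./thrift -bench . -benchmem -run ^$ -memprofile mem.prof)
$ go tool pprof -alloc_space -top _examples/thrift.test _examples/mem.prof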

Fix

This change adds a sync.Pool of contextFieldSet objects. Defer now grabs a zeroed struct from the pool, fills it, and Apply returns each struct to the pool after invoking setField. No other behaviour changes: branches still copy apply slices, and errors propagate the same way.
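For illustration, the pattern looks roughly like this; the names and signatures below are simplified stand-ins rather than participle's actual internals:

  package main

  import (
      "fmt"
      "sync"
  )

  // Illustrative stand-in for contextFieldSet: a small, fixed-shape record
  // describing one deferred field assignment.
  type fieldSet struct {
      field string
      value any
  }

  var fieldSetPool = sync.Pool{
      New: func() any { return new(fieldSet) },
  }

  // Defer-style capture: take a zeroed record from the pool and fill it.
  func capture(pending *[]*fieldSet, field string, value any) {
      fs := fieldSetPool.Get().(*fieldSet)
      fs.field, fs.value = field, value
      *pending = append(*pending, fs)
  }

  // Apply-style commit: consume each record, then zero it and return it to
  // the pool so no references linger across reuses.
  func apply(pending []*fieldSet, set func(field string, value any) error) error {
      for _, fs := range pending {
          if err := set(fs.field, fs.value); err != nil {
              return err
          }
          *fs = fieldSet{}
          fieldSetPool.Put(fs)
      }
      return nil
  }

  func main() {
      var pending []*fieldSet
      capture(&pending, "Name", "thrift")
      _ = apply(pending, func(field string, value any) error {
          fmt.Println(field, "=", value)
          return nil
      })
  }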

Benchmark

$ (cd _examples && go test ./thrift -bench . -benchmem -run ^$)
goos: darwin
goarch: arm64
pkg: github.com/alecthomas/participle/v2/_examples/thrift
cpu: Apple M4 Max
BenchmarkParticipleThrift-16             	    9830	    119054 ns/op	         7.434 MiB/s	  140573 B/op	    1902 allocs/op
BenchmarkParticipleThriftGenerated-16    	   15954	     74994 ns/op	        12.07 MiB/s	  135850 B/op	    1666 allocs/op
BenchmarkGoThriftParser-16               	    7281	    162447 ns/op	         5.506 MiB/s	  146584 B/op	    2620 allocs/op
PASS
ok  	github.com/alecthomas/participle/v2/_examples/thrift	4.591s

Both participle variants improved roughly 4–7% in wall time and shed ~31 KiB and ~150 allocations per parse (compared to the pre-change baselines of 127 µs / 172 KB / 2053 allocs and 78 µs / 167 KB / 1817 allocs).

Extending the Technique

Assuming the pooling technique is sound, the next target is clear: even after pooling contextFieldSet, profiling the Thrift benchmark still showed parseContext.Branch dominating allocations. Every speculative branch clones an entire parseContext, and failed branches keep their deferred captures alive until GC. go tool pprof -alloc_space attributed ~25% of bytes to Branch and ~19% to Defer, so eliminating those short-lived context copies promised another allocation drop.

Extending the Fix

  • Introduced a sync.Pool for parseContext instances (context.go:37-118) plus small helpers: discardDeferred zeros and returns any unused capture records, and recycle hands the whole context back to the pool. Accept now recycles the accepted branch automatically (see the sketch after this list).
  • Each node that creates speculative branches (group repetitions, disjunctions, lookahead, negation in nodes.go:263-512) now explicitly calls branch.recycle(false) when a branch fails, ensuring both the context and any deferred captures are released immediately.
  • No parser semantics changed: Stop, Accept, and error tracking all behave exactly as before; the change only swaps raw allocations for pooled scratch structs.
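
A sketch of the branch-recycling shape just described, reusing the fieldSet and fieldSetPool from the earlier sketch; every name here is hypothetical, and the real parseContext carries more state:

  // Hypothetical shape of a pooled speculative-parsing context; the real
  // parseContext also carries cursor and error-tracking state.
  type parseContext struct {
      apply     []*fieldSet // deferred captures (see the earlier sketch)
      inherited int         // captures copied from the parent; never recycled here
  }

  var contextPool = sync.Pool{
      New: func() any { return new(parseContext) },
  }

  // Branch clones the current context out of the pool for speculative parsing.
  func (p *parseContext) Branch() *parseContext {
      b := contextPool.Get().(*parseContext)
      b.apply = append(b.apply[:0], p.apply...)
      b.inherited = len(p.apply)
      return b
  }

  // recycle releases a branch. A failed branch also discards the captures it
  // created itself; inherited ones still belong to the parent.
  func (p *parseContext) recycle(accepted bool) {
      if !accepted {
          for _, fs := range p.apply[p.inherited:] {
              *fs = fieldSet{}
              fieldSetPool.Put(fs)
          }
      }
      p.apply = p.apply[:0]
      p.inherited = 0
      contextPool.Put(p)
  }

  // Accept adopts a successful branch's captures, then hands the branch itself
  // back to the pool without touching the captures (ownership moves to p).
  func (p *parseContext) Accept(branch *parseContext) {
      p.apply = append(p.apply[:0], branch.apply...)
      branch.apply = branch.apply[:0]
      branch.recycle(true)
  }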

Benchmark (second run)

With both optimisations:

BenchmarkParticipleThrift-16             	    9172	    127150 ns/op	         6.936 MiB/s	   98042 B/op	    1638 allocs/op
BenchmarkParticipleThriftGenerated-16    	   15225	     78240 ns/op	         11.51 MiB/s	   93656 B/op	    1402 allocs/op
BenchmarkGoThriftParser-16               	    6562	    183980 ns/op	         4.963 MiB/s	  146585 B/op	    2620 allocs/op
PASS
ok  	github.com/alecthomas/participle/v2/_examples/thrift	4.683s

Compared to the prior (already pooled captures) run at ~119 µs/op with 140 kB / 1902 allocs, branch pooling keeps wall time roughly level (127 µs here) while cutting about another 30% of heap use (98 kB, 1638 allocs) for the runtime-built parser; the generated parser sees a similar improvement (from 136 kB / 1666 allocs down to 94 kB / 1402 allocs). go-thrift remains the same, so participle now wins clearly on allocation footprint while staying close to its earlier speed.

hugoferreira (Author) commented Nov 24, 2025

Would love to get your input on whether it makes sense to keep hunting for these low-hanging-fruit performance improvements. I don't have the time to dig into the deep stuff, but pprof does provide some immediate clues. I'm using your library in a different project where code is compiled at runtime, so any speed-up I can get has a direct impact on my side.

Using the exact same technique (building a tokenSlicePool), I can get to:

  BenchmarkParticipleThrift-16         124391 ns/op   6.897 MiB/s   58082 B/op   1630 allocs/op
  BenchmarkParticipleThriftGenerated-16 75488 ns/op   12.00 MiB/s   54155 B/op   1394 allocs/op
  BenchmarkGoThriftParser-16           186693 ns/op   4.766 MiB/s  146584 B/op   2620 allocs/op

Compared to the last run in this PR (~98 KB/op for runtime-built, ~94 KB/op for generated), pooling the lexer buffer would trim another ~40 KB/op, though only around 8 allocs/op. That would need to come in a separate PR though, as it needs deeper analysis from you.
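
For reference, the token-slice pooling would look roughly like this; tokenSlicePool and both helpers are hypothetical names, not existing participle API:

  package pool

  import (
      "sync"

      "github.com/alecthomas/participle/v2/lexer"
  )

  // Hypothetical pool for the token buffer filled during lexing.
  var tokenSlicePool = sync.Pool{
      New: func() any {
          s := make([]lexer.Token, 0, 128) // initial capacity is a guess
          return &s
      },
  }

  // getTokenSlice borrows a reusable buffer; putTokenSlice returns it with its
  // contents dropped but its capacity kept, so repeated parses stop re-growing it.
  func getTokenSlice() *[]lexer.Token { return tokenSlicePool.Get().(*[]lexer.Token) }

  func putTokenSlice(s *[]lexer.Token) {
      *s = (*s)[:0]
      tokenSlicePool.Put(s)
  }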
