
Epic: Scaling event reporting follow-ups #1220

Open
1 of 3 tasks
sharnoff opened this issue Jan 27, 2025 · 0 comments
Assignees
Labels
c/autoscaling/autoscaler-agent Component: autoscaling: autoscaler-agent

Comments


sharnoff commented Jan 27, 2025

ref https://neondb.slack.com/archives/C03TN5G758R/p1737753872393229?thread_ts=1737751496.835499

Basically, there are some follow-up changes to #1107 we should make to improve the efficiency of our scaling event reporting.

Tasks

  1. Omrigan
  2. sharnoff
  3. petuhovskiy
sharnoff added the c/autoscaling/autoscaler-agent (Component: autoscaling: autoscaler-agent) label Jan 27, 2025
sharnoff self-assigned this Jan 27, 2025
sharnoff added a commit that referenced this issue Jan 27, 2025
Follow-up to #1107, ref #1220.

Basically, this change adds support to 'pkg/reporting' (which the
autoscaler-agent uses for billing and scaling events) to cleanly send
all remaining events on shutdown.

This has three pieces:

1. The event sender threads now send any remaining events and exit when their context expires;
2. reporting.NewEventSink() now takes a parent context (or .Finish() can
   be called instead to explicitly trigger a clean exit); and
3. reporting.NewEventSink() now spawns the senders with a taskgroup.Group
   passed in by the caller, so that shutdown of the entire
   autoscaler-agent will wait for sending to complete.
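
For illustration, here is a minimal Go sketch of that shutdown pattern. The names (eventSender, startSenders, flush) are hypothetical, and a plain sync.WaitGroup stands in for the taskgroup.Group mentioned above; the real pkg/reporting API differs in detail.

```go
// Hypothetical sketch of clean shutdown for event senders; not the actual
// pkg/reporting code. A sync.WaitGroup stands in for taskgroup.Group here.
package reporting

import (
	"context"
	"sync"
)

type eventSender[E any] struct {
	queue <-chan E
}

// run drains the queue until ctx expires, then sends whatever is left so that
// no events are dropped on shutdown.
func (s eventSender[E]) run(ctx context.Context, flush func([]E)) {
	var pending []E
	for {
		select {
		case ev := <-s.queue:
			pending = append(pending, ev)
		case <-ctx.Done():
			flush(pending) // context expired: send remaining events and exit
			return
		}
	}
}

// startSenders spawns the sender goroutines under a group owned by the
// caller, so that shutting down the whole agent waits for sending to finish.
func startSenders[E any](ctx context.Context, wg *sync.WaitGroup, senders []eventSender[E], flush func([]E)) {
	for _, s := range senders {
		wg.Add(1)
		go func(s eventSender[E]) {
			defer wg.Done()
			s.run(ctx, flush)
		}(s)
	}
}
```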
sharnoff added a commit that referenced this issue Jan 28, 2025
Follow-up to #1107, ref #1220.

In short, this change:

1. Moves reporting.EventSink's event sender thread management into a new
   Run() method.
2. Calls (EventSink).Run() from (billing.MetricsCollector).Run(),
   appropriately canceling when done generating events, and waiting for
   it to finish before returning.
3. Adds (scalingevents.Reporter).Run() that just calls the underlying
   (EventSink).Run(), and calls *that* in the entrypoint's taskgroup.

Or in other words, exactly what was suggested in this comment:
#1221 (comment)
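
As a rough sketch of that Run()-based wiring, with simplified, illustrative types and signatures rather than the actual autoscaler-agent API:

```go
// Illustrative sketch only: simplified stand-ins for EventSink and
// MetricsCollector, showing how Run() ownership could be wired together.
package billing

import "context"

type EventSink struct{ /* ... */ }

// Run owns the event sender threads: it blocks until ctx is canceled and only
// returns once all remaining events have been sent.
func (s *EventSink) Run(ctx context.Context) error {
	<-ctx.Done()
	// ... flush remaining events ...
	return nil
}

type MetricsCollector struct{ sink *EventSink }

// Run generates events until ctx is done, then cancels the sink and waits for
// it to finish sending before returning.
func (c *MetricsCollector) Run(ctx context.Context) error {
	sinkCtx, cancelSink := context.WithCancel(context.Background())
	defer cancelSink()

	sinkDone := make(chan error, 1)
	go func() { sinkDone <- c.sink.Run(sinkCtx) }()

	// ... generate events ...
	<-ctx.Done()

	cancelSink()      // done generating events
	return <-sinkDone // wait for the sink to drain
}
```

The entrypoint-level equivalent would run (scalingevents.Reporter).Run() in its taskgroup in the same way.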
sharnoff added a commit that referenced this issue Jan 29, 2025
Part of #1220.

In cases where (a) we expect batches to be very large, and (b) we expect
that some autoscaler-agent instances may end up with many batches of
events between reporting periods, we might see excessive memory usage
from those events.

So instead of pushing events exactly once every push period, we should push
*at least* once every push period, and also push as soon as the next batch
fills up, if that happens before the push period is reached.
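
A minimal sketch of that schedule, assuming a hypothetical runPusher helper and a batchFull signal channel (illustrative names, not the actual implementation):

```go
// Hypothetical pusher loop: push at least once per push period, and also as
// soon as the current batch fills, whichever comes first.
package reporting

import (
	"context"
	"time"
)

func runPusher(ctx context.Context, pushPeriod time.Duration, batchFull <-chan struct{}, push func()) {
	ticker := time.NewTicker(pushPeriod)
	defer ticker.Stop()

	for {
		select {
		case <-ticker.C:
			push() // at least once every push period
		case <-batchFull:
			push() // don't let full batches pile up in memory
			ticker.Reset(pushPeriod)
		case <-ctx.Done():
			push() // final flush on shutdown
			return
		}
	}
}
```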
sharnoff added a commit that referenced this issue Jan 29, 2025
Part of #1220.

For scaling events, the size of a gzip-compressed batch is *much* smaller
than that of the uncompressed version, so it's actually worth it to compress
the JSON-encoded contents of the batch as we go, instead of buffering the
events as a []ScalingEvent in memory.

**Implementation**

This change primarily adds the reporting.BatchBuilder[E] type (and some
implementations), which abstracts over the process of gradually building
the serialized bytes from a batch of events.

Two implementations of BatchBuilder are provided in pkg/reporting - one
for large JSON arrays, and one for the JSON lines format.

To allow continuous gzip compression, both of these JSON batch builders
are constructed with a new 'IOBuffer' type, which is just an io.Writer
with a method to collect the bytes at the end. There are implementations
of IOBuffer for normal bytes and for a gzip-compressed buffer.

All that together means that batches of billing and scaling events
should now be gzip-compressed (where possible) as events are added to
the batches, dramatically reducing the memory footprint.
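
To make that concrete, here is a hedged Go sketch of the IOBuffer idea plus a JSON-lines batch builder. The type and method names are illustrative and don't match the real pkg/reporting API exactly.

```go
// Rough sketch of incremental, gzip-compressed batch building; names and
// signatures are illustrative, not the actual pkg/reporting types.
package reporting

import (
	"bytes"
	"compress/gzip"
	"encoding/json"
)

// IOBuffer is an io.Writer plus a way to collect the final bytes.
type IOBuffer interface {
	Write(p []byte) (int, error)
	Collect() ([]byte, error)
}

// gzipBuffer compresses everything written to it as it goes.
type gzipBuffer struct {
	buf bytes.Buffer
	gz  *gzip.Writer
}

func newGzipBuffer() *gzipBuffer {
	b := &gzipBuffer{}
	b.gz = gzip.NewWriter(&b.buf)
	return b
}

func (b *gzipBuffer) Write(p []byte) (int, error) { return b.gz.Write(p) }

func (b *gzipBuffer) Collect() ([]byte, error) {
	if err := b.gz.Close(); err != nil {
		return nil, err
	}
	return b.buf.Bytes(), nil
}

// jsonLinesBuilder serializes events in the JSON lines format, writing (and
// therefore compressing) each event as it is added rather than buffering the
// whole batch as a slice.
type jsonLinesBuilder[E any] struct {
	out IOBuffer
}

func (b *jsonLinesBuilder[E]) Add(event E) error {
	line, err := json.Marshal(event)
	if err != nil {
		return err
	}
	_, err = b.out.Write(append(line, '\n'))
	return err
}

func (b *jsonLinesBuilder[E]) Finish() ([]byte, error) {
	return b.out.Collect()
}
```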
sharnoff added a commit that referenced this issue Feb 10, 2025
Follow-up to #1107, ref #1220.

In short, this change:

1. Moves reporting.EventSink's event sender thread management into a new
   Run() method.
2. Calls (EventSink).Run() from (billing.MetricsCollector).Run(),
   appropriately canceling when done generating events, and waiting for
   it to finish before returning.
3. Adds (scalingevents.Reporter).Run() that just calls the underlying
   (EventSink).Run(), and calls *that* in the entrypoint's taskgroup.

Or in other words, exactly what was suggested in this comment:
#1221 (comment)
sharnoff added a commit that referenced this issue Feb 13, 2025
Part of #1220.

In cases where (a) we expect batches to be very large, and (b) we expect
that some autoscaler-agent instances may end up with many batches of
events between reporting periods, we might see excessive memory usage
from those events.

So instead of pushing events exactly once every push period, we should push
*at least* once every push period, and also push as soon as the next batch
fills up, if that happens before the push period is reached.
sharnoff added a commit that referenced this issue Feb 13, 2025
Part of #1220.

For scaling events, the size of a gzip-compressed batch is *much* smaller
than that of the uncompressed version, so it's actually worth it to compress
the JSON-encoded contents of the batch as we go, instead of buffering the
events as a []ScalingEvent in memory.

**Implementation**

This change primarily adds the reporting.BatchBuilder[E] type (and some
implementations), which abstracts over the process of gradually building
the serialized bytes from a batch of events.

Two implementations of BatchBuilder are provided in pkg/reporting - one
for large JSON arrays, and one for the JSON lines format.

To allow continuous gzip compression, both of these JSON batch builders
are constructed with a new 'IOBuffer' type, which is just an io.Writer
with a method to collect the bytes at the end. There are implementations
of IOBuffer for normal bytes and for a gzip-compressed buffer.

All that together means that batches of billing and scaling events
should now be gzip-compressed (where possible) as events are added to
the batches, dramatically reducing the memory footprint.