
Epic: Scaling event reporting follow-ups #1220

Open
1 of 3 tasks
sharnoff opened this issue Jan 27, 2025 · 0 comments
Assignees
Labels
c/autoscaling/autoscaler-agent Component: autoscaling: autoscaler-agent

Comments


sharnoff commented Jan 27, 2025

ref https://neondb.slack.com/archives/C03TN5G758R/p1737753872393229?thread_ts=1737751496.835499

Basically, there are some follow-up changes to #1107 we should make to improve the efficiency of our scaling event reporting.

Tasks

  1. Omrigan
  2. sharnoff
  3. petuhovskiy
sharnoff added the c/autoscaling/autoscaler-agent (Component: autoscaling: autoscaler-agent) label Jan 27, 2025
sharnoff self-assigned this Jan 27, 2025
sharnoff added a commit that referenced this issue Jan 27, 2025
Follow-up to #1107, ref #1220.

Basically, this change adds support to 'pkg/reporting' (which the
autoscaler-agent uses for billing and scaling events) to cleanly send
all remaining events on shutdown.

This has three pieces:

1. The event sender threads now send any remaining events and exit when their context expires;
2. reporting.NewEventSink() now takes a parent context (or .Finish() can
   be called instead to explicitly trigger a clean exit); and
3. reporting.NewEventSink() now spawns the senders with a taskgroup.Group
   passed in by the caller, so that shutdown of the entire
   autoscaler-agent will wait for sending to complete.
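
For illustration, here is a minimal Go sketch of that shutdown pattern. The names (eventSender, startSenders, flush) are hypothetical, and a plain sync.WaitGroup stands in for the taskgroup.Group mentioned above; the real pkg/reporting API differs in detail.

```go
// Hypothetical sketch of clean shutdown for event senders; not the actual
// pkg/reporting code. A sync.WaitGroup stands in for taskgroup.Group here.
package reporting

import (
	"context"
	"sync"
)

type eventSender[E any] struct {
	queue <-chan E
}

// run drains the queue until ctx expires, then sends whatever is left so that
// no events are dropped on shutdown.
func (s eventSender[E]) run(ctx context.Context, flush func([]E)) {
	var pending []E
	for {
		select {
		case ev := <-s.queue:
			pending = append(pending, ev)
		case <-ctx.Done():
			flush(pending) // context expired: send remaining events and exit
			return
		}
	}
}

// startSenders spawns the sender goroutines under a group owned by the
// caller, so that shutting down the whole agent waits for sending to finish.
func startSenders[E any](ctx context.Context, wg *sync.WaitGroup, senders []eventSender[E], flush func([]E)) {
	for _, s := range senders {
		wg.Add(1)
		go func(s eventSender[E]) {
			defer wg.Done()
			s.run(ctx, flush)
		}(s)
	}
}
```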
sharnoff added a commit that referenced this issue Jan 28, 2025
Follow-up to #1107, ref #1220.

In short, this change:

1. Moves reporting.EventSink's event sender thread management into a new
   Run() method.
2. Calls (EventSink).Run() from (billing.MetricsCollector).Run(),
   appropriately canceling when done generating events, and waiting for
   it to finish before returning.
3. Adds (scalingevents.Reporter).Run() that just calls the underlying
   (EventSink).Run(), and calls *that* in the entrypoint's taskgroup.

Or in other words, exactly what was suggested in this comment:
#1221 (comment)
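
As a rough sketch of that Run()-based wiring, with simplified, illustrative types and signatures rather than the actual autoscaler-agent API:

```go
// Illustrative sketch only: simplified stand-ins for EventSink and
// MetricsCollector, showing how Run() ownership could be wired together.
package billing

import "context"

type EventSink struct{ /* ... */ }

// Run owns the event sender threads: it blocks until ctx is canceled and only
// returns once all remaining events have been sent.
func (s *EventSink) Run(ctx context.Context) error {
	<-ctx.Done()
	// ... flush remaining events ...
	return nil
}

type MetricsCollector struct{ sink *EventSink }

// Run generates events until ctx is done, then cancels the sink and waits for
// it to finish sending before returning.
func (c *MetricsCollector) Run(ctx context.Context) error {
	sinkCtx, cancelSink := context.WithCancel(context.Background())
	defer cancelSink()

	sinkDone := make(chan error, 1)
	go func() { sinkDone <- c.sink.Run(sinkCtx) }()

	// ... generate events ...
	<-ctx.Done()

	cancelSink()      // done generating events
	return <-sinkDone // wait for the sink to drain
}
```

The entrypoint-level equivalent would run (scalingevents.Reporter).Run() in its taskgroup in the same way.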
sharnoff added a commit that referenced this issue Jan 29, 2025
Part of #1220.

In cases where (a) we expect batches to be very large, and (b) we expect
that some autoscaler-agent instances may end up with many batches of
events between reporting periods, we might see excessive memory usage
from those events.

So instead of pushing events exactly once every push period, we should push
*at least* once every push period, and also push as soon as the next batch
fills up, if that happens before the push period is reached.
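
A minimal sketch of that schedule, assuming a hypothetical runPusher helper and a batchFull signal channel (illustrative names, not the actual implementation):

```go
// Hypothetical pusher loop: push at least once per push period, and also as
// soon as the current batch fills, whichever comes first.
package reporting

import (
	"context"
	"time"
)

func runPusher(ctx context.Context, pushPeriod time.Duration, batchFull <-chan struct{}, push func()) {
	ticker := time.NewTicker(pushPeriod)
	defer ticker.Stop()

	for {
		select {
		case <-ticker.C:
			push() // at least once every push period
		case <-batchFull:
			push() // don't let full batches pile up in memory
			ticker.Reset(pushPeriod)
		case <-ctx.Done():
			push() // final flush on shutdown
			return
		}
	}
}
```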
sharnoff added a commit that referenced this issue Jan 29, 2025
Part of #1220.

For scaling events, the size of a gzip-compressed batch is *much* smaller
than that of the uncompressed version, so it's actually worth it to compress
the JSON-encoded contents of the batch as we go, instead of buffering the
events as a []ScalingEvent in memory.

**Implementation**

This change primarily adds the reporting.BatchBuilder[E] type (and some
implementations), which abstracts over the process of gradually building
the serialized bytes from a batch of events.

Two implementations of BatchBuilder are provided in pkg/reporting - one
for large JSON arrays, and one for the JSON lines format.

To allow continuous gzip compression, both of these JSON batch builders
are constructed with a new 'IOBuffer' type, which is just an io.Writer
with a method to collect the bytes at the end. There are implementations
of IOBuffer for normal bytes and for a gzip-compressed buffer.

All that together means that batches of billing and scaling events
should now be gzip-compressed (where possible) as events are added to
the batches, dramatically reducing the memory footprint.
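
To make that concrete, here is a hedged Go sketch of the IOBuffer idea plus a JSON-lines batch builder. The type and method names are illustrative and don't match the real pkg/reporting API exactly.

```go
// Rough sketch of incremental, gzip-compressed batch building; names and
// signatures are illustrative, not the actual pkg/reporting types.
package reporting

import (
	"bytes"
	"compress/gzip"
	"encoding/json"
)

// IOBuffer is an io.Writer plus a way to collect the final bytes.
type IOBuffer interface {
	Write(p []byte) (int, error)
	Collect() ([]byte, error)
}

// gzipBuffer compresses everything written to it as it goes.
type gzipBuffer struct {
	buf bytes.Buffer
	gz  *gzip.Writer
}

func newGzipBuffer() *gzipBuffer {
	b := &gzipBuffer{}
	b.gz = gzip.NewWriter(&b.buf)
	return b
}

func (b *gzipBuffer) Write(p []byte) (int, error) { return b.gz.Write(p) }

func (b *gzipBuffer) Collect() ([]byte, error) {
	if err := b.gz.Close(); err != nil {
		return nil, err
	}
	return b.buf.Bytes(), nil
}

// jsonLinesBuilder serializes events in the JSON lines format, writing (and
// therefore compressing) each event as it is added rather than buffering the
// whole batch as a slice.
type jsonLinesBuilder[E any] struct {
	out IOBuffer
}

func (b *jsonLinesBuilder[E]) Add(event E) error {
	line, err := json.Marshal(event)
	if err != nil {
		return err
	}
	_, err = b.out.Write(append(line, '\n'))
	return err
}

func (b *jsonLinesBuilder[E]) Finish() ([]byte, error) {
	return b.out.Collect()
}
```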
sharnoff added a commit that referenced this issue Feb 10, 2025
Follow-up to #1107, ref #1220.

In short, this change:

1. Moves reporting.EventSink's event sender thread management into a new
   Run() method.
2. Calls (EventSink).Run() from (billing.MetricsCollector).Run(),
   appropriately canceling when done generating events, and waiting for
   it to finish before returning.
3. Adds (scalingevents.Reporter).Run() that just calls the underlying
   (EventSink).Run(), and calls *that* in the entrypoint's taskgroup.

Or in other words, exactly what was suggested in this comment:
#1221 (comment)
sharnoff added a commit that referenced this issue Feb 13, 2025
Part of #1220.

In cases where (a) we expect batches to be very large, and (b) we expect
that some autoscaler-agent instances may end up with many batches of
events between reporting periods, we might see excessive memory usage
from those events.

So instead of pushing events exactly once every push period, we should push
*at least* once every push period, and also push as soon as the next batch
fills up, if that happens before the push period is reached.
sharnoff added a commit that referenced this issue Feb 13, 2025
Part of #1220.

For scaling events, the size of a gzip-compressed batch is *much* smaller
than that of the uncompressed version, so it's actually worth it to compress
the JSON-encoded contents of the batch as we go, instead of buffering the
events as a []ScalingEvent in memory.

**Implementation**

This change primarily adds the reporting.BatchBuilder[E] type (and some
implementations), which abstracts over the process of gradually building
the serialized bytes from a batch of events.

Two implementations of BatchBuilder are provided in pkg/reporting - one
for large JSON arrays, and one for the JSON lines format.

To allow continuous gzip compression, both of these JSON batch builders
are constructed with a new 'IOBuffer' type, which is just an io.Writer
with a method to collect the bytes at the end. There are implementations
of IOBuffer for normal bytes and for a gzip-compressed buffer.

All that together means that batches of billing and scaling events
should now be gzip-compressed (where possible) as events are added to
the batches, dramatically reducing the memory footprint.