Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: truncate resource journal at a configurable size #6631

Closed
wants to merge 11 commits into from

Conversation

grondo
Copy link
Contributor

@grondo grondo commented Feb 13, 2025

This PR adds support for truncation of the in-memory resource eventlog, including addition of the truncate event proposed for RFC 44 which holds a summary of information lost by the truncation.

The current limit here is an arbitrarily chosen 100,000 entries. The implementation actually allows the eventlog to grow to journal-max + 1% before truncating, at which point enough entries are dropped to bring the size back to the limit. This is to avoid truncating on every event appended to the log once it has reached the limit.

The journal limit can be altered via the resource.journal-max configuration key. A live update immediately truncates the event if it exceeds the max + 1% water mark.

This is a WIP because maintaining the truncated "state" ended a bit more complex than I originally thought, though it is perhaps 95% complete here. However, if there's no actual need for a resource journal consumer to reconstruct state at this point, it might be worth dropping the truncate event context which would make this PR trivial.

Items that are still TODO if we keep the current approach:

  • add a method to the truncate_info object to allow R to be pushed into the context on a resource-define event.
  • add a handler to the truncate class for the resource-update event
  • better error handling and error messages (there are none now)
  • further testing

If we decide to drop the truncate event context which summarizes lost state, most of the above TODO items could be dropped. The truncate event would just notify the consumer that the log has been truncated and record the timestamp of the most recently dropped event.

Maybe I'll work up a PR that does this 2nd part only, then make all the context keys in the RFC 44 proposal set to OPTIONAL. We could always tack the extra truncate event context handling on in another PR (I should have thought of this before, sorry!)

I'll still leave this here for consideration.

An example:

$ flux resource eventlog -H
[Feb12 19:53] restart ranks="0-7" online="" nodelist="corona[212,212,212,212,212,212,212,212]"
[  +1.242239] online idset="0"
[  +2.321909] online idset="1"
[  +3.363048] online idset="2"
[  +4.437279] online idset="3"
[  +5.504496] online idset="4"
[  +6.550387] online idset="5"
[  +7.554111] resource-define method="dynamic-discovery"
[  +7.610144] online idset="6"
[  +8.653062] online idset="7"
$ echo resource.journal-max=4 | flux config load
$ flux resource eventlog -H
[Feb12 19:53] truncate ranks="0-7" online="0-5" nodelist="corona[212,212,212,212,212,212,212,212]" torpid="" drain={}
[  +1.003724] resource-define method="dynamic-discovery"
[  +1.059757] online idset="6"
[  +2.102675] online idset="7"
$ flux resource drain 0
$ flux resource eventlog -H
[Feb12 19:53] truncate ranks="0-7" online="0-5" nodelist="corona[212,212,212,212,212,212,212,212]" torpid="" drain={} discovery-method="dynamic-discovery"
[  +0.056033] online idset="6"
[  +1.098951] online idset="7"
[Feb12 20:01] drain idset="0" nodelist="corona212" overwrite=0
$ flux resource drain 1
$ flux resource eventlog -H
[Feb12 19:53] truncate ranks="0-7" online="0-6" nodelist="corona[212,212,212,212,212,212,212,212]" torpid="" drain={} discovery-method="dynamic-discovery"
[  +1.042918] online idset="7"
[Feb12 20:01] drain idset="0" nodelist="corona212" overwrite=0
[ +18.420785] drain idset="1" nodelist="corona212" overwrite=0
$ echo resource.journal-max=1 | flux config load
$ flux resource eventlog -H
[Feb12 20:02] truncate ranks="0-7" online="0-7" nodelist="corona[212,212,212,212,212,212,212,212]" torpid="" drain={"1":{"timestamp":1739419321.7013323,"reason":""},"0":{"timestamp":1739419303.2805471,"reason":""}} discovery-method="dynamic-discovery"

Problem: It would be useful to create a struct drainset from a JSON
serialized version of the structure.

Add drainset_from_json() which can decode the result of
drainset_to_json().
Problem: There are no unit tests for the drainset_from_json()
function.

Add them to the drainset unit test.
Problem: The drainset class cannot be used to capture state for
drain events since it does not support the overwrite options nor
undrain.

Add drainset_drain_ex(), which takes an overwrite parameter and handles
it as described in RFC 44.

Add drainset_undrain(), which undrains a single rank.
Problem: The drainset unit tests do not exercise drainset_drain_ex()
and drainset_undrain().

Update unit tests.
Problem: RFC 44 stipulates that a truncate event shall appear as the
first event in the resource eventlog when it has been truncated, and
that the context of this event shall contain a summary of truncated
information, but the resource eventlog has no support for this event.

Add a truncate class to the resource module which can accumulate
required information from dropped events, providing the base
functionality for a truncate event.
Problem: There are no unit tests for the resource module truncate
class.

Add a set of simple unit tests to test basic functionality.
Problem: The resource journal will need to truncated by dropping
entries from the start, but a jansson json array is not conducive
to this operation.

Swtich to a zlistx_t for the reosurce journal. This data structure
allows the removal and addition of items to the front of the list
(without moving the memory for the rest of the items on the list)
Problem: The in-memory resource journal grows without bound and there
is no way to limit its size.

In preparation for truncating the resource journal, add a new config
key in the [resource] table, `journal_max`, which can be used to set
a limit on the journal size. Default this limit to 100000 entries.

Modify the reslog constructor to accept this journal_max limit, but
don't do anything with it just yet.
Problem: The resource journal limit cannot be configured at runtime
because configuration updates are ignored by the resource module.

Capture any new resource.journal_max entry on config reload and send
it to the reslog implementation.
Problem: The resource journal can grow without bound.

Truncate the in-memory resource eventlog if configured and replace
the most recent event with an RFC 44 truncate event.
Problem: There's no testing of resource journal truncation in the
testsuite.

Add a set of very basic tests to t2355-resource-journal.t that ensure
basic eventlog truncation works.

#include "truncate.h"
#include "drainset.h"
//#include "inventory.h"

Check notice

Code scanning / CodeQL

Commented-out code Note

This comment appears to contain commented-out code.
json_t *context)
{
/* Add method and R to context */
// TODO: json_t *R = json_incref (inventory_get (reslog->ctx->inventory));

Check notice

Code scanning / CodeQL

Commented-out code Note

This comment appears to contain commented-out code.
Copy link

codecov bot commented Feb 13, 2025

Codecov Report

Attention: Patch coverage is 85.11450% with 39 lines in your changes missing coverage. Please review.

Project coverage is 83.85%. Comparing base (27f4390) to head (9bd3535).
Report is 27 commits behind head on master.

Files with missing lines Patch % Lines
src/modules/resource/truncate.c 87.91% 18 Missing ⚠️
src/modules/resource/drainset.c 80.35% 11 Missing ⚠️
src/modules/resource/reslog.c 80.39% 10 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6631      +/-   ##
==========================================
- Coverage   83.86%   83.85%   -0.02%     
==========================================
  Files         534      535       +1     
  Lines       88358    88608     +250     
==========================================
+ Hits        74100    74298     +198     
- Misses      14258    14310      +52     
Files with missing lines Coverage Δ
src/modules/resource/resource.c 86.43% <100.00%> (-0.24%) ⬇️
src/modules/resource/reslog.c 73.30% <80.39%> (+0.92%) ⬆️
src/modules/resource/drainset.c 79.51% <80.35%> (+0.23%) ⬆️
src/modules/resource/truncate.c 87.91% <87.91%> (ø)

... and 6 files with indirect coverage changes

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant