Draft: Add design doc for deduplication layer in syslog health monitor#1119
Draft: Add design doc for deduplication layer in syslog health monitor#1119XRFXLP wants to merge 5 commits intoNVIDIA:mainfrom
Conversation
Signed-off-by: Ajay Mishra <ajmishra@nvidia.com>
|
Important Review skippedDraft detected. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
📝 WalkthroughWalkthroughAdds ADR-033: a design for syslog health monitor event deduplication using normalized messages, per-check seen sets, suppression of duplicate unhealthy events, selective clearing on GPU recovery, full clearing on boot ID change, and optional persisted dedup state. Changes
Sequence Diagram(s)sequenceDiagram
participant Source as Event Source
participant Monitor as Syslog Monitor
participant Tracker as Dedup Tracker
participant Sender as Event Sender (gRPC)
Source->>Monitor: Emit raw syslog event
Monitor->>Tracker: Normalize message & ask "isDuplicate?"
alt Unhealthy event
Tracker-->>Monitor: duplicate? (yes/no)
alt yes
Monitor->>Monitor: suppress event (increment suppressed counter)
else no
Tracker->>Tracker: mark seen
Monitor->>Sender: send event
end
else Healthy event
Monitor->>Tracker: clear entries (PCI-specific or full on boot)
Monitor->>Sender: send healthy event
end
Estimated code review effort🎯 1 (Trivial) | ⏱️ ~2 minutes Poem
🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (2)
docs/designs/033-syslog-health-monitor-event-deduplication.md (2)
9-13: Add a language tag to the fenced code block.The fence starting at Line 9 is untyped, which triggers markdownlint
MD040.Suggested doc fix
-``` +```text Poll 1: [ 1108.858286] NVRM: Xid (PCI:0000:b3:00.0): 79, pid=1234, name=nv-hostengine → event sent Poll 2: [ 1843.308145] NVRM: Xid (PCI:0000:b3:00.0): 79, pid=1234, name=nv-hostengine → duplicate event sent Poll 3: [ 2501.556012] NVRM: Xid (PCI:0000:b3:00.0): 79, pid=1234, name=nv-hostengine → duplicate event sent</details> <details> <summary>🤖 Prompt for AI Agents</summary>Verify each finding against the current code and only fix it if needed.
In
@docs/designs/033-syslog-health-monitor-event-deduplication.mdaround lines 9
- 13, The fenced code block containing the three "Poll" lines is missing a
language tag (causing MD040); update the opening triple-backtick fence to
include a language identifier such as "text" (i.e., changetotext) so
the block is typed, e.g., the block starting with "Poll 1: [ 1108.858286] NVRM:
Xid ..." should use ```text as the fence.</details> --- `309-309`: **Tighten wording for readability.** “a small number of unique message strings” is vague; “a few unique message strings” reads cleaner in ADR prose. <details> <summary>Suggested wording tweak</summary> ```diff -- The state file grows by the size of the seen set. Between a fault and its remediation, this is typically a small number of unique message strings (order of 1-10), so the impact is negligible. +- The state file grows by the size of the seen set. Between a fault and its remediation, this is typically a few unique message strings (roughly 1-10), so the impact is negligible. ``` </details> <details> <summary>🤖 Prompt for AI Agents</summary> ``` Verify each finding against the current code and only fix it if needed. In `@docs/designs/033-syslog-health-monitor-event-deduplication.md` at line 309, Replace the phrase "a small number of unique message strings" with the clearer wording "a few unique message strings" in the sentence describing state file growth so the line reads: "The state file grows by the size of the seen set. Between a fault and its remediation, this is typically a few unique message strings (order of 1-10), so the impact is negligible." Ensure only the phrase is changed and punctuation remains consistent. ``` </details> </blockquote></details> </blockquote></details> <details> <summary>🤖 Prompt for all review comments with AI agents</summary>Verify each finding against the current code and only fix it if needed.
Inline comments:
In@docs/designs/033-syslog-health-monitor-event-deduplication.md:
- Line 27: Remove the trailing spaces inside inline code spans used for the
kernel timestamp example (e.g., change"[12345.678901] "to
"[12345.678901]") and do the same for the other inline span later in the doc
(the second occurrence around the timestamp example), ensuring no internal
trailing spaces remain in any inline backtick-enclosed tokens to satisfy
markdownlint MD038.
Nitpick comments:
In@docs/designs/033-syslog-health-monitor-event-deduplication.md:
- Around line 9-13: The fenced code block containing the three "Poll" lines is
missing a language tag (causing MD040); update the opening triple-backtick fence
to include a language identifier such as "text" (i.e., changetotext) so
the block is typed, e.g., the block starting with "Poll 1: [ 1108.858286] NVRM:
Xid ..." should use ```text as the fence.- Line 309: Replace the phrase "a small number of unique message strings" with
the clearer wording "a few unique message strings" in the sentence describing
state file growth so the line reads: "The state file grows by the size of the
seen set. Between a fault and its remediation, this is typically a few unique
message strings (order of 1-10), so the impact is negligible." Ensure only the
phrase is changed and punctuation remains consistent.</details> <details> <summary>🪄 Autofix (Beta)</summary> Fix all unresolved CodeRabbit comments on this PR: - [ ] <!-- {"checkboxId": "4b0d0e0a-96d7-4f10-b296-3a18ea78f0b9"} --> Push a commit to this branch (recommended) - [ ] <!-- {"checkboxId": "ff5b1114-7d8c-49e6-8ac1-43f82af23a33"} --> Create a new PR with the fixes </details> --- <details> <summary>ℹ️ Review info</summary> <details> <summary>⚙️ Run configuration</summary> **Configuration used**: Path: .coderabbit.yaml **Review profile**: CHILL **Plan**: Pro **Run ID**: `97ac7690-8047-4b26-8893-9a6dacf8c5e1` </details> <details> <summary>📥 Commits</summary> Reviewing files that changed from the base of the PR and between efe1e1754922ecd02cd11fed437e6863f605a447 and ab0671f0342ec3fa5f9293adeb723dad62f9d35d. </details> <details> <summary>📒 Files selected for processing (1)</summary> * `docs/designs/033-syslog-health-monitor-event-deduplication.md` </details> </details> <!-- This is an auto-generated comment by CodeRabbit for review status -->
Signed-off-by: Ajay Mishra <ajmishra@nvidia.com>
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/designs/033-syslog-health-monitor-event-deduplication.md`:
- Around line 9-13: The fenced code block showing the Poll 1/2/3 syslog examples
is missing a language tag and triggers markdownlint MD040; update the opening
fence to include a language (for example change ``` to ```text) so the block
containing the lines beginning with "Poll 1:", "Poll 2:", "Poll 3:" is
explicitly marked as text.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: d5b234d6-3fa3-466b-8327-4953167f19c9
📒 Files selected for processing (1)
docs/designs/033-syslog-health-monitor-event-deduplication.md
Signed-off-by: Ajay Mishra <ajmishra@nvidia.com>
There was a problem hiding this comment.
♻️ Duplicate comments (2)
docs/designs/033-syslog-health-monitor-event-deduplication.md (2)
9-13:⚠️ Potential issue | 🟡 MinorAdd a language tag to the fenced example block.
Line 9 opens a fenced block without a language identifier, which triggers markdownlint MD040. Use an explicit language like
text.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/designs/033-syslog-health-monitor-event-deduplication.md` around lines 9 - 13, The fenced example block using triple backticks (``` ) on Poll 1/2/3 should include a language identifier to satisfy markdownlint MD040; update the opening fence to include a language tag (e.g., change the opening ``` to ```text) so the example block is explicitly marked as plain text.
27-27:⚠️ Potential issue | 🟡 MinorRemove trailing spaces inside inline code spans.
Line 27 and Line 372 include inline code tokens with internal trailing spaces (MD038). Remove the trailing spaces from the backtick-wrapped examples.
Also applies to: 372-372
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/designs/033-syslog-health-monitor-event-deduplication.md` at line 27, Remove the trailing spaces inside the inline code spans that trigger MD038: change the backtick-wrapped examples from "`[12345.678901] `" (and the similar one at the second occurrence) to "`[12345.678901]`" so there is no internal trailing space inside the code span; update both occurrences referenced in the diff (the inline code at line showing the kernel timestamp example and its duplicate).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@docs/designs/033-syslog-health-monitor-event-deduplication.md`:
- Around line 9-13: The fenced example block using triple backticks (``` ) on
Poll 1/2/3 should include a language identifier to satisfy markdownlint MD040;
update the opening fence to include a language tag (e.g., change the opening ```
to ```text) so the example block is explicitly marked as plain text.
- Line 27: Remove the trailing spaces inside the inline code spans that trigger
MD038: change the backtick-wrapped examples from "`[12345.678901] `" (and the
similar one at the second occurrence) to "`[12345.678901]`" so there is no
internal trailing space inside the code span; update both occurrences referenced
in the diff (the inline code at line showing the kernel timestamp example and
its duplicate).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 7a271bd6-2669-4246-a26b-a48513105328
📒 Files selected for processing (1)
docs/designs/033-syslog-health-monitor-event-deduplication.md
|
|
||
| ### What counts as "the same message" | ||
|
|
||
| The dedup key is the **exact message string** with the kernel timestamp prefix stripped. Two messages that differ in any field — PCI address, XID code, pid, channel, process name — are treated as distinct and are **not** deduplicated. Only truly identical repeated error lines are suppressed. |
There was a problem hiding this comment.
how does this play with the health event analyzer rules where it checks for multiple XID 13s/31s? Can we document if there will be any impact to those rules?
There was a problem hiding this comment.
We will have impact on the rules that assume repetition of XIDs with same impact entities. So it will be RepatedXIDonSameGPU and RepeatedXIDonSameGPCAndTPC, will add this into a doc.
|
|
||
| The dedup key is the **exact message string** with the kernel timestamp prefix stripped. Two messages that differ in any field — PCI address, XID code, pid, channel, process name — are treated as distinct and are **not** deduplicated. Only truly identical repeated error lines are suppressed. | ||
|
|
||
| ### What clears the dedup |
There was a problem hiding this comment.
Can this be a leaky bucket kind of approach so we will still continue to see independent bursts if they occur at two different times before remediation?
There was a problem hiding this comment.
Why do we need a leaky bucket? If the original event has already triggered breakfix pipeline then why do we need to follow up with the same error?
There was a problem hiding this comment.
There is also the use case outside of breakfix like in the case of the exporter where we want to publish the health events for analytics purposes outside of the cluster. Now if we supress in the vast majority of the cases, I'm worried we'd end up losing data on bursts and potentially might delay our response to a bug
There was a problem hiding this comment.
In theory, we can monitor the suppression metric to get an idea of the burst that we're getting, as prioritizing analytics over resilience of breakfix system just for getting an idea of size of the burst doesn't make sense. Adding any dedup window duration doesn't make sense because we've seen the burst of size of 10+ hours.
Also are we assuming that large burst indicates an error of higher severity here?
There was a problem hiding this comment.
In theory, we can monitor the suppression metric
Sure, but that won't have all the information as we have with the health events.
as prioritizing analytics over resilience of breakfix system just for getting an idea of size of the burst doesn't make sense
I'm not saying prioritize analytics over the breakfix system, I'm saying we provide all the parts of the system with as much data as they can handle. I'm thinking about this in terms of throughput. What is the maximum possible throughput that the system can handle which we can saturate.
Adding any dedup window duration doesn't make sense because we've seen the burst of size of 10+ hours.
I think the idea here is that we'd want to see them as multiple bursts
Also are we assuming that large burst indicates an error of higher severity here?
to a certain extent, yes.
There was a problem hiding this comment.
I guess it would be better to optimize the system throughout first (if we can). One very obvious I can think of is using event ID rather than full event ID in node drainer queue that would decrease the memory usage as a function of number of events consequently increasing the max throughput. Checking that possibility, meanwhile keeping this MR as draft for now as dedup window straightaway looks like arbitrary.
There was a problem hiding this comment.
I agree, let's find opportunities for optimization first. One that @KaivalyaMDabhadkar is already working on is the cold start and there could be other optimizations that we can do as well.
There was a problem hiding this comment.
+1 to prioritising system stability over analytics. If we need a record of all health events regardless of duplication, we should think of an alternate mechanism such as exporting all the events via logs to Kratos and it does not need to go through MongoDB either.
There was a problem hiding this comment.
I think it would also make sense to reduce the frequency of re-emissions rather than complete deduplication. Between the spectrum of complete deduplication (proposed approach) and no deduplication ( current implementation) we have a case where we do re-emit some duplicates but not as frequently, but just enough to give us an idea of different "bursts" in time as @lalitadithya has suggested. This will also provide the HEA with more information regarding the frequency of the XIDs compared to the proposed approach where the HEA loses all information regarding duplicates in time and their frequency. A simple way to accomplish this would be to re-emit the same error after an exponentially increasing number of suppressed occurrences, for eg. emit the 1st, then suppress until the 10th occurence, then suppress until the 20th, 40th, 80th and so on (or until an upper limit which we set is reached). This means that our total duplicates emitted would only grow logarithmically with the total frequency of ocurrences, which is still a huge amount of deduplication and which is enough for the HEA to detect repetition patterns and for rules like RepeatedXIDonSameGPU to fire. Thoughts about this?
| - **Monitor-level dedup, not handler-level**: placing dedup in `handleSingleLine` avoids modifying the `Handler` interface or each handler implementation. The `Handler.ProcessLine` contract stays unchanged — it returns events, and the monitor decides whether to send them. | ||
| - **Exact message matching (minus timestamp)**: this is the simplest correct key. Any semantic difference in the message (different PCI, different pid, different XID code) is a different error and should not be suppressed. | ||
| - **State file persistence**: the monitor is a long-running daemon, so in-memory tracking covers cross-poll dedup. Persisting to the state file additionally covers pod restarts without needing to re-report errors that downstream components have already processed. | ||
| - **No TTL/expiry**: dedup is scoped to "until remediation". Healthy events and reboots are the remediation signals. A TTL would introduce a tuning parameter with no universally correct value and risk either premature re-reporting or unbounded suppression. |
There was a problem hiding this comment.
I would say that having without a TTL we have unbounded suppression since we don't get feedback into the different bursts of XIDs that can happen prior to remediation
|
|
||
| The dedup key is the **exact message string** with the kernel timestamp prefix stripped. Two messages that differ in any field — PCI address, XID code, pid, channel, process name — are treated as distinct and are **not** deduplicated. Only truly identical repeated error lines are suppressed. | ||
|
|
||
| ### What clears the dedup |
There was a problem hiding this comment.
also what if the XIDs are in patterns XID 1 -> XID-2 -> XID-1 -> XID-2 -> XID-1, should we dedup or no?
There was a problem hiding this comment.
As per the current design, we will only have the first XID 1 -> XID-2, rest of the chain would be de-duplicated.
There was a problem hiding this comment.
does order matter here? I mean are you using a pattern match or a search in unordered set for deduplication?
| Version int `json:"version"` | ||
| BootID string `json:"boot_id"` | ||
| CheckLastCursors map[string]string `json:"check_last_cursors"` | ||
| SeenMessages map[string][]string `json:"seen_messages,omitempty"` |
There was a problem hiding this comment.
ignore the comment. can't get rid of it.
|
|
||
| ### 5. Boot ID change handling | ||
|
|
||
| In [`handleBootIDChange`](health-monitors/syslog-health-monitor/pkg/syslog-monitor/syslogmonitor.go), clear all dedup trackers alongside cursors: |
There was a problem hiding this comment.
does GPU reset also resets the seen messages?
There was a problem hiding this comment.
yes, but only XIDs for that specific GPU.
Merging this branch will decrease overall coverage
Coverage by fileChanged files (no unit tests)
Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code. Changed unit test files
|
Summary
Markdown preview: Syslog Health Monitor — Event Deduplication Until Remediation
Type of Change
Component(s) Affected
Testing
Checklist
Summary by CodeRabbit