Draft: Add design doc for deduplication layer in syslog health monitor#1119

Draft
XRFXLP wants to merge 5 commits into NVIDIA:main from XRFXLP:deduplication-design-doc

Conversation

Member

@XRFXLP XRFXLP commented Apr 7, 2026

Summary

Markdown preview: Syslog Health Monitor — Event Deduplication Until Remediation

Type of Change

  • 🐛 Bug fix
  • ✨ New feature
  • 💥 Breaking change
  • 📚 Documentation
  • 🔧 Refactoring
  • 🔨 Build/CI

Component(s) Affected

  • Core Services
  • Documentation/CI
  • Fault Management
  • Health Monitors
  • Janitor
  • Other: ____________

Testing

  • Tests pass locally
  • Manual testing completed
  • No breaking changes (or documented)

Checklist

  • Self-review completed
  • Documentation updated (if needed)
  • Ready for review

Summary by CodeRabbit

  • Documentation
    • Added an architecture decision record describing a proposed syslog health monitor event deduplication approach: message normalization to ignore kernel timestamp prefixes, per-check tracking and suppression of duplicate unhealthy events, selective clearing when a GPU recovers and full clearing on reboot, persistence of dedup state across restarts, Prometheus metrics for suppressed events, and related test expectations.

Signed-off-by: Ajay Mishra <ajmishra@nvidia.com>
Contributor

coderabbitai Bot commented Apr 7, 2026

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: e5bfd2e0-a1af-4b25-93c8-3372248aa56f

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Walkthrough

Adds ADR-033: a design for syslog health monitor event deduplication using normalized messages, per-check seen sets, suppression of duplicate unhealthy events, selective clearing on GPU recovery, full clearing on boot ID change, and optional persisted dedup state.

Changes

| Cohort / File(s) | Summary |
|---|---|
| **Syslog Deduplication ADR** `docs/designs/033-syslog-health-monitor-event-deduplication.md` | New ADR detailing normalization (strip leading kernel timestamps), per-check dedup tracker API (normalize, isDuplicate, mark, clear, snapshot/restore), placement in monitor event path, observability (suppressed counter labels: check, node, code), selective PCI-address clearing on GPU recovery, full clearing on boot ID change, and persistence of seen sets in the existing syslog monitor state file. |

Sequence Diagram(s)

sequenceDiagram
    participant Source as Event Source
    participant Monitor as Syslog Monitor
    participant Tracker as Dedup Tracker
    participant Sender as Event Sender (gRPC)

    Source->>Monitor: Emit raw syslog event
    Monitor->>Tracker: Normalize message & ask "isDuplicate?"
    alt Unhealthy event
        Tracker-->>Monitor: duplicate? (yes/no)
        alt yes
            Monitor->>Monitor: suppress event (increment suppressed counter)
        else no
            Tracker->>Tracker: mark seen
            Monitor->>Sender: send event
        end
    else Healthy event
        Monitor->>Tracker: clear entries (PCI-specific or full on boot)
        Monitor->>Sender: send healthy event
    end

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~2 minutes

Poem

Hopping through syslogs, I sniff and prune,
I strip the timestamps, hum a tidy tune,
Repeaters I tuck gently out of sight,
Fresh signals skip and bounce with light—
A rabbit's hop keeps logs polite 🐇✨

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Title check | ✅ Passed | The title clearly summarizes the main change: adding a design document for a deduplication layer in the syslog health monitor, which matches the primary content of the pull request. |


Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (2)
docs/designs/033-syslog-health-monitor-event-deduplication.md (2)

9-13: Add a language tag to the fenced code block.

The fence starting at Line 9 is untyped, which triggers markdownlint MD040.

Suggested doc fix
-```
+```text
 Poll 1:  [ 1108.858286] NVRM: Xid (PCI:0000:b3:00.0): 79, pid=1234, name=nv-hostengine  →  event sent
 Poll 2:  [ 1843.308145] NVRM: Xid (PCI:0000:b3:00.0): 79, pid=1234, name=nv-hostengine  →  duplicate event sent
 Poll 3:  [ 2501.556012] NVRM: Xid (PCI:0000:b3:00.0): 79, pid=1234, name=nv-hostengine  →  duplicate event sent
</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

Verify each finding against the current code and only fix it if needed.

In @docs/designs/033-syslog-health-monitor-event-deduplication.md around lines 9 - 13,
The fenced code block containing the three "Poll" lines is missing a
language tag (causing MD040); update the opening triple-backtick fence to
include a language identifier such as "text" (i.e., change ``` to ```text) so
the block is typed, e.g., the block starting with "Poll 1: [ 1108.858286] NVRM:
Xid ..." should use ```text as the fence.

</details>

---

`309-309`: **Tighten wording for readability.**

“a small number of unique message strings” is vague; “a few unique message strings” reads cleaner in ADR prose.


<details>
<summary>Suggested wording tweak</summary>

```diff
-- The state file grows by the size of the seen set. Between a fault and its remediation, this is typically a small number of unique message strings (order of 1-10), so the impact is negligible.
+- The state file grows by the size of the seen set. Between a fault and its remediation, this is typically a few unique message strings (roughly 1-10), so the impact is negligible.
```
</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against the current code and only fix it if needed.

In `@docs/designs/033-syslog-health-monitor-event-deduplication.md` at line 309,
Replace the phrase "a small number of unique message strings" with the clearer
wording "a few unique message strings" in the sentence describing state file
growth so the line reads: "The state file grows by the size of the seen set.
Between a fault and its remediation, this is typically a few unique message
strings (order of 1-10), so the impact is negligible." Ensure only the phrase is
changed and punctuation remains consistent.
```

</details>

</blockquote></details>

</blockquote></details>

<details>
<summary>🤖 Prompt for all review comments with AI agents</summary>

Verify each finding against the current code and only fix it if needed.

Inline comments:
In @docs/designs/033-syslog-health-monitor-event-deduplication.md:

  • Line 27: Remove the trailing spaces inside inline code spans used for the
    kernel timestamp example (e.g., change "[12345.678901] " to
    "[12345.678901]") and do the same for the other inline span later in the doc
    (the second occurrence around the timestamp example), ensuring no internal
    trailing spaces remain in any inline backtick-enclosed tokens to satisfy
    markdownlint MD038.

Nitpick comments:
In @docs/designs/033-syslog-health-monitor-event-deduplication.md:

  • Around line 9-13: The fenced code block containing the three "Poll" lines is
    missing a language tag (causing MD040); update the opening triple-backtick fence
    to include a language identifier such as "text" (i.e., change ``` to ```text) so
    the block is typed, e.g., the block starting with "Poll 1: [ 1108.858286] NVRM:
    Xid ..." should use ```text as the fence.
  • Line 309: Replace the phrase "a small number of unique message strings" with
    the clearer wording "a few unique message strings" in the sentence describing
    state file growth so the line reads: "The state file grows by the size of the
    seen set. Between a fault and its remediation, this is typically a few unique
    message strings (order of 1-10), so the impact is negligible." Ensure only the
    phrase is changed and punctuation remains consistent.

</details>

---

<details>
<summary>ℹ️ Review info</summary>

<details>
<summary>⚙️ Run configuration</summary>

**Configuration used**: Path: .coderabbit.yaml

**Review profile**: CHILL

**Plan**: Pro

**Run ID**: `97ac7690-8047-4b26-8893-9a6dacf8c5e1`

</details>

<details>
<summary>📥 Commits</summary>

Reviewing files that changed from the base of the PR and between efe1e1754922ecd02cd11fed437e6863f605a447 and ab0671f0342ec3fa5f9293adeb723dad62f9d35d.

</details>

<details>
<summary>📒 Files selected for processing (1)</summary>

* `docs/designs/033-syslog-health-monitor-event-deduplication.md`

</details>

</details>

<!-- This is an auto-generated comment by CodeRabbit for review status -->

Comment thread docs/designs/033-syslog-health-monitor-event-deduplication.md
Signed-off-by: Ajay Mishra <ajmishra@nvidia.com>
Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/designs/033-syslog-health-monitor-event-deduplication.md`:
- Around line 9-13: The fenced code block showing the Poll 1/2/3 syslog examples
is missing a language tag and triggers markdownlint MD040; update the opening
fence to include a language (for example change ``` to ```text) so the block
containing the lines beginning with "Poll 1:", "Poll 2:", "Poll 3:" is
explicitly marked as text.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: d5b234d6-3fa3-466b-8327-4953167f19c9

📥 Commits

Reviewing files that changed from the base of the PR and between ab0671f and 933b678.

📒 Files selected for processing (1)
  • docs/designs/033-syslog-health-monitor-event-deduplication.md

Comment thread docs/designs/033-syslog-health-monitor-event-deduplication.md
Signed-off-by: Ajay Mishra <ajmishra@nvidia.com>
Contributor

@coderabbitai coderabbitai Bot left a comment

♻️ Duplicate comments (2)
docs/designs/033-syslog-health-monitor-event-deduplication.md (2)

9-13: ⚠️ Potential issue | 🟡 Minor

Add a language tag to the fenced example block.

Line 9 opens a fenced block without a language identifier, which triggers markdownlint MD040. Use an explicit language like text.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/designs/033-syslog-health-monitor-event-deduplication.md` around lines 9
- 13, The fenced example block using triple backticks (``` ) on Poll 1/2/3
should include a language identifier to satisfy markdownlint MD040; update the
opening fence to include a language tag (e.g., change the opening ``` to
```text) so the example block is explicitly marked as plain text.

27-27: ⚠️ Potential issue | 🟡 Minor

Remove trailing spaces inside inline code spans.

Line 27 and Line 372 include inline code tokens with internal trailing spaces (MD038). Remove the trailing spaces from the backtick-wrapped examples.

Also applies to: 372-372

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/designs/033-syslog-health-monitor-event-deduplication.md` at line 27,
Remove the trailing spaces inside the inline code spans that trigger MD038:
change the backtick-wrapped examples from "`[12345.678901] `" (and the similar
one at the second occurrence) to "`[12345.678901]`" so there is no internal
trailing space inside the code span; update both occurrences referenced in the
diff (the inline code at line showing the kernel timestamp example and its
duplicate).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@docs/designs/033-syslog-health-monitor-event-deduplication.md`:
- Around line 9-13: The fenced example block using triple backticks (``` ) on
Poll 1/2/3 should include a language identifier to satisfy markdownlint MD040;
update the opening fence to include a language tag (e.g., change the opening ```
to ```text) so the example block is explicitly marked as plain text.
- Line 27: Remove the trailing spaces inside the inline code spans that trigger
MD038: change the backtick-wrapped examples from "`[12345.678901] `" (and the
similar one at the second occurrence) to "`[12345.678901]`" so there is no
internal trailing space inside the code span; update both occurrences referenced
in the diff (the inline code at line showing the kernel timestamp example and
its duplicate).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 7a271bd6-2669-4246-a26b-a48513105328

📥 Commits

Reviewing files that changed from the base of the PR and between 933b678 and 5f9c3e8.

📒 Files selected for processing (1)
  • docs/designs/033-syslog-health-monitor-event-deduplication.md

@XRFXLP XRFXLP self-assigned this Apr 7, 2026
@XRFXLP XRFXLP linked an issue Apr 7, 2026 that may be closed by this pull request
1 task

### What counts as "the same message"

The dedup key is the **exact message string** with the kernel timestamp prefix stripped. Two messages that differ in any field — PCI address, XID code, pid, channel, process name — are treated as distinct and are **not** deduplicated. Only truly identical repeated error lines are suppressed.
Collaborator

how does this play with the health event analyzer rules where it checks for multiple XID 13s/31s? Can we document if there will be any impact to those rules?

Member Author

This will affect the rules that assume repetition of XIDs on the same impacted entities, namely RepeatedXIDonSameGPU and RepeatedXIDonSameGPCAndTPC. I will add this to the doc.


The dedup key is the **exact message string** with the kernel timestamp prefix stripped. Two messages that differ in any field — PCI address, XID code, pid, channel, process name — are treated as distinct and are **not** deduplicated. Only truly identical repeated error lines are suppressed.
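As a rough illustration of the key described above, the Go sketch below strips a leading kernel timestamp and compares the results. The helper name `dedupKey` and the regular expression are assumptions made for this example; the ADR may normalize differently. The sample lines reuse the Xid 79 message quoted in the review above, with one variant whose pid is changed for contrast.

```go
package main

import (
	"fmt"
	"regexp"
)

// kernelTimestampPrefix matches a leading dmesg-style timestamp such as
// "[ 1108.858286] ", including optional padding inside the brackets.
var kernelTimestampPrefix = regexp.MustCompile(`^\[\s*\d+\.\d+\]\s*`)

// dedupKey is a hypothetical helper: it returns the line with only the
// timestamp prefix removed, so every other field stays part of the key.
func dedupKey(line string) string {
	return kernelTimestampPrefix.ReplaceAllString(line, "")
}

func main() {
	a := dedupKey("[ 1108.858286] NVRM: Xid (PCI:0000:b3:00.0): 79, pid=1234, name=nv-hostengine")
	b := dedupKey("[ 1843.308145] NVRM: Xid (PCI:0000:b3:00.0): 79, pid=1234, name=nv-hostengine")
	c := dedupKey("[ 1843.308145] NVRM: Xid (PCI:0000:b3:00.0): 79, pid=5678, name=nv-hostengine")

	// Same message with a different timestamp: the keys match.
	fmt.Println(a == b)
	// The pid differs, so the keys differ and nothing is deduplicated.
	fmt.Println(a == c)
}
```

Because only the prefix is removed, a change in PCI address, XID code, pid, or process name yields a different key, which matches the "exact message string" rule.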

### What clears the dedup
Collaborator

Can this be a leaky bucket kind of approach so we will still continue to see independent bursts if they occur at two different times before remediation?

Member Author

Why do we need a leaky bucket? If the original event has already triggered the breakfix pipeline, why do we need to follow up with the same error?

Collaborator

There is also a use case outside of breakfix, such as the exporter, where we want to publish the health events for analytics purposes outside of the cluster. If we suppress in the vast majority of cases, I'm worried we'd end up losing data on bursts and might delay our response to a bug.

Member Author

@XRFXLP XRFXLP Apr 7, 2026

In theory, we can monitor the suppression metric to get an idea of the burst that we're getting, as prioritizing analytics over resilience of breakfix system just for getting an idea of size of the burst doesn't make sense. Adding any dedup window duration doesn't make sense because we've seen the burst of size of 10+ hours.

Also are we assuming that large burst indicates an error of higher severity here?

Collaborator

> In theory, we can monitor the suppression metric

Sure, but that won't have all the information as we have with the health events.

> as prioritizing analytics over resilience of breakfix system just for getting an idea of size of the burst doesn't make sense

I'm not saying prioritize analytics over the breakfix system, I'm saying we provide all the parts of the system with as much data as they can handle. I'm thinking about this in terms of throughput. What is the maximum possible throughput that the system can handle which we can saturate.

> Adding any dedup window duration doesn't make sense because we've seen the burst of size of 10+ hours.

I think the idea here is that we'd want to see them as multiple bursts

> Also are we assuming that large burst indicates an error of higher severity here?

to a certain extent, yes.

Member Author

I guess it would be better to optimize the system throughput first (if we can). One very obvious option I can think of is using event ID rather than full event ID in the node drainer queue, which would decrease memory usage as a function of the number of events and consequently increase the max throughput. I'm checking that possibility; meanwhile, I'm keeping this MR as draft for now, since a dedup window looks arbitrary at first glance.

Collaborator

I agree, let's find opportunities for optimization first. One that @KaivalyaMDabhadkar is already working on is the cold start and there could be other optimizations that we can do as well.

Contributor

+1 to prioritising system stability over analytics. If we need a record of all health events regardless of duplication, we should think of an alternate mechanism such as exporting all the events via logs to Kratos and it does not need to go through MongoDB either.

Contributor

@KaivalyaMDabhadkar KaivalyaMDabhadkar Apr 11, 2026

I think it would also make sense to reduce the frequency of re-emissions rather than deduplicate completely. On the spectrum between complete deduplication (the proposed approach) and no deduplication (the current implementation), there is a middle case where we do re-emit some duplicates, not as frequently, but just enough to give us an idea of the different "bursts" in time, as @lalitadithya has suggested. This would also give the HEA more information about the frequency of the XIDs than the proposed approach, where the HEA loses all information about duplicates in time and their frequency. A simple way to accomplish this would be to re-emit the same error after an exponentially increasing number of suppressed occurrences, e.g., emit the 1st, then suppress until the 10th occurrence, then suppress until the 20th, 40th, 80th and so on (or until an upper limit which we set is reached). This means the total number of duplicates emitted would grow only logarithmically with the total number of occurrences, which is still a huge amount of deduplication and is enough for the HEA to detect repetition patterns and for rules like RepeatedXIDonSameGPU to fire. Thoughts about this?
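The schedule suggested in the comment above (emit the 1st, then the 10th, 20th, 40th, 80th occurrence) could look roughly like the Go sketch below. It is only an illustration: `backoffEmitter`, its fields, and the reading of "upper limit" as "stop re-emitting once the next threshold exceeds the limit" are assumptions, not part of the ADR.

```go
package main

import "fmt"

// backoffEmitter tracks occurrences of a single dedup key (hypothetical type).
type backoffEmitter struct {
	count    int // occurrences observed so far
	nextEmit int // occurrence number at which the next duplicate is re-emitted
	limit    int // assumed upper limit: thresholds above it are never emitted
}

func newBackoffEmitter(first, limit int) *backoffEmitter {
	return &backoffEmitter{nextEmit: first, limit: limit}
}

// observe records one occurrence and reports whether it should be emitted.
func (b *backoffEmitter) observe() bool {
	b.count++
	if b.count == 1 {
		return true // the first occurrence is always emitted
	}
	if b.count != b.nextEmit || b.nextEmit > b.limit {
		return false // suppressed
	}
	b.nextEmit *= 2 // double the threshold: 10, 20, 40, 80, ...
	return true
}

func main() {
	b := newBackoffEmitter(10, 100)
	var emitted []int
	for i := 1; i <= 100; i++ {
		if b.observe() {
			emitted = append(emitted, i)
		}
	}
	fmt.Println(emitted) // prints [1 10 20 40 80]
}
```

With these values, 100 identical occurrences produce 5 emitted events, which is the logarithmic growth the comment describes.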

- **Monitor-level dedup, not handler-level**: placing dedup in `handleSingleLine` avoids modifying the `Handler` interface or each handler implementation. The `Handler.ProcessLine` contract stays unchanged — it returns events, and the monitor decides whether to send them.
- **Exact message matching (minus timestamp)**: this is the simplest correct key. Any semantic difference in the message (different PCI, different pid, different XID code) is a different error and should not be suppressed.
- **State file persistence**: the monitor is a long-running daemon, so in-memory tracking covers cross-poll dedup. Persisting to the state file additionally covers pod restarts without needing to re-report errors that downstream components have already processed.
- **No TTL/expiry**: dedup is scoped to "until remediation". Healthy events and reboots are the remediation signals. A TTL would introduce a tuning parameter with no universally correct value and risk either premature re-reporting or unbounded suppression.
Collaborator

I would say that without a TTL we have unbounded suppression, since we get no feedback about the different bursts of XIDs that can happen prior to remediation.


The dedup key is the **exact message string** with the kernel timestamp prefix stripped. Two messages that differ in any field — PCI address, XID code, pid, channel, process name — are treated as distinct and are **not** deduplicated. Only truly identical repeated error lines are suppressed.

### What clears the dedup
Collaborator

Also, what if the XIDs arrive in a pattern like XID-1 -> XID-2 -> XID-1 -> XID-2 -> XID-1? Should we dedup or not?

Member Author

As per the current design, only the first XID-1 -> XID-2 would be sent; the rest of the chain would be deduplicated.

Contributor

Does order matter here? I mean, are you using a pattern match or a lookup in an unordered set for deduplication?
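For the alternating pattern discussed in this thread, the Go sketch below shows what a plain unordered-set lookup would forward. It assumes set membership, as the walkthrough's "per-check seen sets" wording suggests; whether the final design matches this is not confirmed here, and `filterFirstSeen` is a hypothetical helper.

```go
package main

import "fmt"

// filterFirstSeen returns the keys that would be forwarded if duplicates were
// dropped purely by membership in an unordered set (no sequence matching).
func filterFirstSeen(keys []string) []string {
	seen := make(map[string]struct{})
	var sent []string
	for _, k := range keys {
		if _, dup := seen[k]; dup {
			continue // already seen, regardless of what came in between
		}
		seen[k] = struct{}{}
		sent = append(sent, k)
	}
	return sent
}

func main() {
	pattern := []string{"XID-1", "XID-2", "XID-1", "XID-2", "XID-1"}
	fmt.Println(filterFirstSeen(pattern)) // prints [XID-1 XID-2]
}
```

Under this assumption order does not matter: each distinct key is sent once and interleaving has no effect.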

@XRFXLP XRFXLP changed the title from "docs: Add design doc for deduplication layer in syslog health monitor" to "Draft: Add design doc for deduplication layer in syslog health monitor" Apr 7, 2026
@XRFXLP XRFXLP marked this pull request as draft April 7, 2026 08:13
Version int `json:"version"`
BootID string `json:"boot_id"`
CheckLastCursors map[string]string `json:"check_last_cursors"`
SeenMessages map[string][]string `json:"seen_messages,omitempty"`
Contributor

@neerajnv neerajnv Apr 7, 2026

ignore the comment. can't get rid of it.


### 5. Boot ID change handling

In [`handleBootIDChange`](health-monitors/syslog-health-monitor/pkg/syslog-monitor/syslogmonitor.go), clear all dedup trackers alongside cursors:
Contributor

Does a GPU reset also reset the seen messages?

Member Author

yes, but only XIDs for that specific GPU.
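A small Go sketch of the two clearing modes mentioned in this thread: per-GPU clearing on reset or recovery, and full clearing on a boot ID change. Matching entries by a `PCI:<address>` substring is an assumption made for the example, as are the `seenSet` type, its method names, and the second PCI address; the ADR may track PCI addresses differently.

```go
package main

import (
	"fmt"
	"strings"
)

// seenSet is a hypothetical set of normalized messages for one check.
type seenSet map[string]struct{}

// clearForGPU removes entries that mention the given PCI address and returns
// how many were removed. Substring matching is an assumption of this sketch.
func (s seenSet) clearForGPU(pci string) int {
	removed := 0
	for key := range s {
		if strings.Contains(key, "PCI:"+pci) {
			delete(s, key) // deleting during range is safe in Go
			removed++
		}
	}
	return removed
}

// clearAll drops every entry, as on a boot ID change.
func (s seenSet) clearAll() {
	for key := range s {
		delete(s, key)
	}
}

func main() {
	s := seenSet{
		"NVRM: Xid (PCI:0000:b3:00.0): 79, pid=1234, name=nv-hostengine": {},
		"NVRM: Xid (PCI:0000:17:00.0): 79, pid=1234, name=nv-hostengine": {},
	}
	// Only the entry for the recovered GPU is removed.
	fmt.Println(s.clearForGPU("0000:b3:00.0"), len(s)) // prints 1 1
	// A reboot clears everything.
	s.clearAll()
	fmt.Println(len(s)) // prints 0
}
```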


github-actions Bot commented Apr 9, 2026

Merging this branch will decrease overall coverage

Impacted Packages Coverage Δ 🤖
github.com/nvidia/nvsentinel/commons/pkg/tracing 4.78% (ø)
github.com/nvidia/nvsentinel/fault-quarantine 0.00% (ø)
github.com/nvidia/nvsentinel/fault-quarantine/pkg/breaker 30.06% (ø)
github.com/nvidia/nvsentinel/fault-quarantine/pkg/eventwatcher 2.96% (ø)
github.com/nvidia/nvsentinel/fault-quarantine/pkg/informer 34.50% (-0.06%) 👎
github.com/nvidia/nvsentinel/fault-quarantine/pkg/initializer 0.00% (ø)
github.com/nvidia/nvsentinel/fault-quarantine/pkg/reconciler 23.49% (ø)
github.com/nvidia/nvsentinel/fault-remediation/pkg/reconciler 21.47% (ø)
github.com/nvidia/nvsentinel/node-drainer 0.00% (ø)
github.com/nvidia/nvsentinel/node-drainer/pkg/queue 46.49% (ø)
github.com/nvidia/nvsentinel/node-drainer/pkg/reconciler 37.79% (ø)
github.com/nvidia/nvsentinel/platform-connectors 0.00% (ø)
github.com/nvidia/nvsentinel/platform-connectors/pkg/connectors/grpcsink 70.18% (ø)
github.com/nvidia/nvsentinel/platform-connectors/pkg/connectors/kubernetes 84.64% (ø)
github.com/nvidia/nvsentinel/platform-connectors/pkg/connectors/store 75.00% (ø)
github.com/nvidia/nvsentinel/platform-connectors/pkg/pipeline 38.46% (ø)
github.com/nvidia/nvsentinel/platform-connectors/pkg/ringbuffer 82.05% (ø)
github.com/nvidia/nvsentinel/platform-connectors/pkg/server 77.78% (ø)
github.com/nvidia/nvsentinel/platform-connectors/pkg/transformers/metadata 70.42% (ø)
github.com/nvidia/nvsentinel/platform-connectors/pkg/transformers/overrides 71.67% (ø)
github.com/nvidia/nvsentinel/preflight/pkg/config 29.71% (ø)
github.com/nvidia/nvsentinel/preflight/pkg/controller 15.77% (ø)
github.com/nvidia/nvsentinel/preflight/pkg/gang/coordinator 37.16% (ø)
github.com/nvidia/nvsentinel/preflight/pkg/gang/types 0.00% (ø)
github.com/nvidia/nvsentinel/preflight/pkg/webhook 28.35% (ø)
github.com/nvidia/nvsentinel/store-client/pkg/client 5.72% (-0.01%) 👎
github.com/nvidia/nvsentinel/store-client/pkg/datastore 3.15% (ø)
github.com/nvidia/nvsentinel/store-client/pkg/datastore/providers/mongodb 6.03% (ø)
github.com/nvidia/nvsentinel/store-client/pkg/datastore/providers/postgresql 4.61% (ø)

Coverage by file

Changed files (no unit tests)

Changed File Coverage Δ Total Covered Missed 🤖
github.com/nvidia/nvsentinel/commons/pkg/tracing/span_attributes.go 0.00% (ø) 329 0 329
github.com/nvidia/nvsentinel/commons/pkg/tracing/tracing.go 6.48% (ø) 926 60 866
github.com/nvidia/nvsentinel/fault-quarantine/main.go 0.00% (ø) 272 0 272
github.com/nvidia/nvsentinel/fault-quarantine/pkg/breaker/breaker.go 30.06% (ø) 835 251 584
github.com/nvidia/nvsentinel/fault-quarantine/pkg/eventwatcher/event_watcher.go 2.96% (ø) 1282 38 1244
github.com/nvidia/nvsentinel/fault-quarantine/pkg/informer/k8s_client.go 35.52% (ø) 1067 379 688
github.com/nvidia/nvsentinel/fault-quarantine/pkg/informer/node_informer.go 33.01% (-0.14%) 727 240 (-1) 487 (+1) 👎
github.com/nvidia/nvsentinel/fault-quarantine/pkg/initializer/init.go 0.00% (ø) 280 0 280
github.com/nvidia/nvsentinel/fault-quarantine/pkg/reconciler/reconciler.go 23.49% (ø) 3380 794 2586
github.com/nvidia/nvsentinel/node-drainer/main.go 0.00% (ø) 475 0 475
github.com/nvidia/nvsentinel/node-drainer/pkg/queue/queue.go 66.07% (ø) 56 37 19
github.com/nvidia/nvsentinel/node-drainer/pkg/queue/types.go 0.00% (ø) 0 0 0
github.com/nvidia/nvsentinel/node-drainer/pkg/queue/worker.go 37.98% (ø) 129 49 80
github.com/nvidia/nvsentinel/node-drainer/pkg/reconciler/reconciler.go 37.79% (ø) 1085 410 675
github.com/nvidia/nvsentinel/platform-connectors/main.go 0.00% (ø) 200 0 200
github.com/nvidia/nvsentinel/platform-connectors/pkg/connectors/grpcsink/grpc_sink_connector.go 70.18% (ø) 57 40 17
github.com/nvidia/nvsentinel/platform-connectors/pkg/connectors/grpcsink/metrics.go 0.00% (ø) 0 0 0
github.com/nvidia/nvsentinel/platform-connectors/pkg/connectors/kubernetes/k8s_connector.go 2.94% (ø) 34 1 33
github.com/nvidia/nvsentinel/platform-connectors/pkg/connectors/kubernetes/process_node_events.go 93.57% (ø) 311 291 20
github.com/nvidia/nvsentinel/platform-connectors/pkg/connectors/store/store_connector.go 75.00% (ø) 80 60 20
github.com/nvidia/nvsentinel/platform-connectors/pkg/pipeline/factory.go 0.00% (ø) 16 0 16
github.com/nvidia/nvsentinel/platform-connectors/pkg/pipeline/pipeline.go 100.00% (ø) 10 10 0
github.com/nvidia/nvsentinel/platform-connectors/pkg/ringbuffer/ring_buffer.go 78.12% (ø) 32 25 7
github.com/nvidia/nvsentinel/platform-connectors/pkg/server/platform_connector_server.go 77.78% (ø) 18 14 4
github.com/nvidia/nvsentinel/platform-connectors/pkg/transformers/metadata/transformer.go 97.83% (ø) 46 45 1
github.com/nvidia/nvsentinel/platform-connectors/pkg/transformers/overrides/cel.go 74.29% (ø) 35 26 9
github.com/nvidia/nvsentinel/platform-connectors/pkg/transformers/overrides/transformer.go 78.26% (ø) 46 36 10
github.com/nvidia/nvsentinel/preflight/pkg/config/config.go 29.71% (ø) 313 93 220
github.com/nvidia/nvsentinel/preflight/pkg/controller/gang_controller.go 15.77% (ø) 539 85 454
github.com/nvidia/nvsentinel/preflight/pkg/gang/coordinator/coordinator.go 37.16% (ø) 662 246 416
github.com/nvidia/nvsentinel/preflight/pkg/gang/types/types.go 0.00% (ø) 0 0 0
github.com/nvidia/nvsentinel/preflight/pkg/webhook/handler.go 24.12% (ø) 311 75 236
github.com/nvidia/nvsentinel/preflight/pkg/webhook/injector.go 29.26% (ø) 1456 426 1030
github.com/nvidia/nvsentinel/store-client/pkg/client/convenience.go 3.66% (ø) 655 24 631
github.com/nvidia/nvsentinel/store-client/pkg/client/postgresql_changestream.go 3.42% (ø) 2278 78 2200
github.com/nvidia/nvsentinel/store-client/pkg/datastore/interfaces.go 0.00% (ø) 0 0 0
github.com/nvidia/nvsentinel/store-client/pkg/datastore/providers/mongodb/health_store.go 8.20% (ø) 1342 110 1232
github.com/nvidia/nvsentinel/store-client/pkg/datastore/providers/postgresql/health_events.go 0.03% (ø) 3139 1 3138

Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.

Changed unit test files

  • github.com/nvidia/nvsentinel/commons/pkg/tracing/tracing_test.go
  • github.com/nvidia/nvsentinel/fault-quarantine/pkg/reconciler/reconciler_e2e_test.go
  • github.com/nvidia/nvsentinel/fault-remediation/pkg/reconciler/reconciler_e2e_test.go
  • github.com/nvidia/nvsentinel/node-drainer/pkg/queue/queue_test.go
  • github.com/nvidia/nvsentinel/node-drainer/pkg/reconciler/reconciler_integration_test.go
  • github.com/nvidia/nvsentinel/platform-connectors/pkg/connectors/grpcsink/grpc_sink_connector_test.go
  • github.com/nvidia/nvsentinel/platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go
  • github.com/nvidia/nvsentinel/platform-connectors/pkg/connectors/store/store_connector_test.go
  • github.com/nvidia/nvsentinel/platform-connectors/pkg/ringbuffer/ring_buffer_test.go
  • github.com/nvidia/nvsentinel/platform-connectors/pkg/transformers/overrides/cel_test.go
  • github.com/nvidia/nvsentinel/preflight/pkg/config/config_test.go
  • github.com/nvidia/nvsentinel/preflight/pkg/controller/gang_controller_test.go
  • github.com/nvidia/nvsentinel/preflight/pkg/gang/coordinator/coordinator_test.go
  • github.com/nvidia/nvsentinel/preflight/pkg/webhook/handler_test.go
  • github.com/nvidia/nvsentinel/preflight/pkg/webhook/injector_test.go
  • github.com/nvidia/nvsentinel/store-client/pkg/datastore/behavioral_contract_test.go

Development

Successfully merging this pull request may close these issues.

[Feature]: Deduplicate XID/SXID in syslog health monitor

4 participants