
Commit 4b77fcd

fix: Add IbPortDown alert for machines with down IB ports (#519)
## Description

When the IB Fabric Monitor detects ports not in Active state, it now sets a PreventAllocations health alert on the affected machine. This prevents Carbide from attempting to allocate instances on machines with degraded IB connectivity, avoiding SRE alerts.

Port-down alerting uses a precedence model:

1. If a machine has a SKU assigned, alert on any down port not in the SKU's inactive_devices list (hardware-level truth).
2. If no SKU but an active instance exists, alert on down ports in the instance's IB config.
3. If neither SKU nor instance, no alert is generated.

- Add HealthProbeId::ib_port_down() and HealthProbeAlert::ib_port_down()
- Detect ports not in Active state during IB fabric monitoring
- Set/clear IbPortDown health alert via health report overrides
- Update existing test to expect health alert blocking

## Type of Change
<!-- Check one that best describes this PR -->
- [ ] **Add** - New feature or capability
- [ ] **Change** - Changes in existing functionality
- [x] **Fix** - Bug fixes
- [ ] **Remove** - Removed features or deprecated functionality
- [ ] **Internal** - Internal changes (refactoring, tests, docs, etc.)

## Related Issues (Optional)
<!-- If applicable, provide GitHub Issue. -->

## Breaking Changes
- [ ] This PR contains breaking changes

<!-- If checked above, describe the breaking changes and migration steps -->

## Testing
<!-- How was this tested? Check all that apply -->
- [x] Unit tests added/updated
- [x] Integration tests added/updated
- [x] Manual testing performed
- [ ] No testing required (docs, internal refactor, etc.)

## Additional Notes
<!-- Any additional context, deployment notes, or reviewer guidance -->

Signed-off-by: Hamid Asayesh <162524665+hasayesh@users.noreply.github.com>
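
For reviewers, the precedence model in the description can be sketched roughly as follows. This is a minimal illustration and not the code in this commit: `PortStatus`, `MachineView`, `ports_to_alert`, and all field names are hypothetical, and the real monitor presumably works with richer Carbide types. It only shows how the three cases select which down ports raise the alert.

```rust
use std::collections::HashSet;

/// Hypothetical, simplified view of one IB port as seen by the fabric monitor.
#[derive(Debug, Clone, PartialEq)]
struct PortStatus {
    device: String,
    /// True only when the port is in Active state.
    active: bool,
}

/// Hypothetical per-machine inputs to the port-down decision.
struct MachineView {
    ports: Vec<PortStatus>,
    /// Some(..) when a SKU is assigned; holds the SKU's inactive_devices list.
    sku_inactive_devices: Option<HashSet<String>>,
    /// Some(..) when an active instance exists; holds devices in its IB config.
    instance_ib_devices: Option<HashSet<String>>,
}

/// Returns the devices whose down ports should trigger the alert.
/// An empty result means no alert (or that an existing alert can be cleared).
fn ports_to_alert(machine: &MachineView) -> Vec<String> {
    let down = machine.ports.iter().filter(|p| !p.active);
    match (&machine.sku_inactive_devices, &machine.instance_ib_devices) {
        // 1. SKU assigned: every down port counts unless the SKU lists it as inactive.
        (Some(inactive), _) => down
            .filter(|p| !inactive.contains(&p.device))
            .map(|p| p.device.clone())
            .collect(),
        // 2. No SKU, active instance: only down ports used by the instance count.
        (None, Some(used)) => down
            .filter(|p| used.contains(&p.device))
            .map(|p| p.device.clone())
            .collect(),
        // 3. Neither SKU nor instance: nothing to alert on.
        (None, None) => Vec::new(),
    }
}
```

In this sketch the SKU case wins whenever a SKU is present, even if an instance also exists, which matches the "hardware-level truth" wording in the first rule.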
1 parent 29d12fc commit 4b77fcd

4 files changed

Lines changed: 786 additions & 15 deletions
