Commit 4b77fcd
fix: Add IbPortDown alert for machines with down IB ports (#519)
## Description
When the IB Fabric Monitor detects ports not in Active state, it now
sets a PreventAllocations health alert on the affected machine. This
prevents Carbide from attempting to allocate instances on machines with
degraded IB connectivity, avoiding SRE alerts.
Port-down alerting uses a precedence model:
1. If a machine has a SKU assigned, alert on any down port not in the
   SKU's inactive_devices list (hardware-level truth).
2. If no SKU is assigned but an active instance exists, alert on down
   ports in the instance's IB config.
3. If the machine has neither a SKU nor an active instance, no alert is
   generated.
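The precedence rules above can be sketched roughly as follows. This is a minimal illustration, not the code from this commit: `MachineView`, its fields, and `ports_to_alert` are hypothetical names, and the sketch assumes that a down port can be matched against the SKU's inactive_devices list and the instance's IB config by a plain device-name string.

```rust
use std::collections::HashSet;

/// Hypothetical, simplified view of one machine (illustration only).
struct MachineView<'a> {
    /// Ports the monitor observed in a non-Active state, keyed by device name.
    down_ports: Vec<&'a str>,
    /// `Some` when a SKU is assigned; holds that SKU's inactive_devices entries.
    sku_inactive_devices: Option<HashSet<&'a str>>,
    /// `Some` when an active instance exists; holds ports named in its IB config.
    instance_ib_ports: Option<HashSet<&'a str>>,
}

/// Returns the down ports that should raise the alert, in input order.
fn ports_to_alert<'a>(m: &MachineView<'a>) -> Vec<&'a str> {
    match (&m.sku_inactive_devices, &m.instance_ib_ports) {
        // Rule 1: a SKU wins, whether or not an instance exists.
        // Alert on every down port the SKU does not list as inactive.
        (Some(inactive), _) => m
            .down_ports
            .iter()
            .copied()
            .filter(|p| !inactive.contains(p))
            .collect(),
        // Rule 2: no SKU, but an active instance.
        // Alert only on down ports the instance is configured to use.
        (None, Some(configured)) => m
            .down_ports
            .iter()
            .copied()
            .filter(|p| configured.contains(p))
            .collect(),
        // Rule 3: neither SKU nor instance, so nothing to alert on.
        (None, None) => Vec::new(),
    }
}
```

Under these assumptions, a non-empty result would correspond to setting the IbPortDown alert and an empty result to clearing it. Matching on the SKU first makes the precedence explicit: an instance's IB config is consulted only when no SKU is assigned.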
- Add HealthProbeId::ib_port_down() and HealthProbeAlert::ib_port_down()
- Detect ports not in Active state during IB fabric monitoring
- Set/clear IbPortDown health alert via health report overrides
- Update existing test to expect health alert blocking
## Type of Change
<!-- Check one that best describes this PR -->
- [ ] **Add** - New feature or capability
- [ ] **Change** - Changes in existing functionality
- [x] **Fix** - Bug fixes
- [ ] **Remove** - Removed features or deprecated functionality
- [ ] **Internal** - Internal changes (refactoring, tests, docs, etc.)
## Related Issues (Optional)
<!-- If applicable, provide GitHub Issue. -->
## Breaking Changes
- [ ] This PR contains breaking changes
<!-- If checked above, describe the breaking changes and migration steps
-->
## Testing
<!-- How was this tested? Check all that apply -->
- [x] Unit tests added/updated
- [x] Integration tests added/updated
- [x] Manual testing performed
- [ ] No testing required (docs, internal refactor, etc.)
## Additional Notes
<!-- Any additional context, deployment notes, or reviewer guidance -->
Signed-off-by: Hamid Asayesh <162524665+hasayesh@users.noreply.github.com>
1 parent 29d12fc commit 4b77fcd
4 files changed
Lines changed: 786 additions & 15 deletions
File tree
- crates
  - api/src
    - ib_fabric_monitor
    - tests
  - health-report/src