Skip to content

OSAC-1029: Unified Networking API for VMaaS, CaaS, and BMaaS#50

Merged
danmanor merged 1 commit into
mainfrom
unified-networking-design
Jun 17, 2026
Merged

OSAC-1029: Unified Networking API for VMaaS, CaaS, and BMaaS#50
danmanor merged 1 commit into
mainfrom
unified-networking-design

Conversation

@danmanor

@danmanor danmanor commented Jun 3, 2026

Copy link
Copy Markdown
Contributor

Summary

Introduces a redesigned networking architecture that generalizes the OSAC Networking API across all three service types: VMaaS (ComputeInstance), CaaS (Cluster), and BMaaS (BaremetalInstance).

The core premise is that VMs are part of the fabric. OSAC runs VMs on OpenShift using KubeVirt (VMs encapsulated in pods, networking managed by OVN-Kubernetes). A K8s manager bridges the OVN overlay to the physical fabric, making VMs first-class participants alongside bare-metal servers and cluster nodes. Once on the fabric, all resource types are treated uniformly.

This enhancement replaces the original networking enhancement which listed CaaS and BMaaS as non-goals.

Key Design Decisions

  • NetworkClass simplified to two fields: fabricManager (handles all physical networking) and optional k8sManager (bridges VMs to the fabric)
  • Fabric manager is a single product (e.g., Netris, Neutron) that handles isolation, ACLs, IP allocation, DNAT, SNAT — no per-action driver composition
  • K8s manager is a single product (e.g., CUDN with LocalNet, OVN EVPN, VRF-lite) that makes VMs part of the fabric — optional for BM-only regions
  • ExternalIP (renamed from PublicIP) — "external to the VirtualNetwork", not necessarily internet-routable; supports air-gapped environments
  • ExternalIPAttachment = inbound DNAT only; NATGateway = outbound SNAT only — clear separation
  • ExternalIPAttachment extended with cluster and baremetal_instance targets + target_endpoint enum for CaaS (API/INGRESS)
  • Infrastructure-agnostic subnets — same subnet hosts VMs, BM servers, and cluster nodes; no scope field
  • Fabric-only security — SecurityGroups enforced by fabric for all resource types; no separate K8s-level ACL
  • Per-interface BM attachment — tenant discovers interfaces from template, maps each to a subnet
  • Uniform dispatcher — no VM-vs-BM distinction for ExternalIP or NATGateway operations
  • Managers self-register via ConfigMaps; adding a new manager requires no API or operator changes
  • CaaS prerequisites (ExternalIPs before cluster) are template-dependent, not an API mandate

Related

@coderabbitai

coderabbitai Bot commented Jun 3, 2026

Copy link
Copy Markdown

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

Walkthrough

Adds a PRD and complementary enhancement design defining a provider-only unified networking model (NetworkClass/NetworkBackend/ServiceNetworkClass), API changes (VirtualNetwork.scope, network_attachments), expanded PublicIPAttachment/NATGateway semantics, end-to-end workflows and ordering rules. Risk: Medium — impacts tenant/provider networking semantics and may affect availability and security.

Changes

Unified Networking Enhancement

Layer / File(s) Summary
PRD metadata and introduction
enhancements/unified-networking-prd/README.md
PRD front-matter and document header metadata.
PRD terminology & problem statement
enhancements/unified-networking-prd/README.md
Core terminology, networking resource model, and problem statement noting original API targeted VMaaS only.
PRD gaps, stories, goals
enhancements/unified-networking-prd/README.md
Lists design gaps across VMaaS/CaaS/BMaaS and tenant/provider user stories, goals, and non-goals.
PRD requirements & acceptance criteria
enhancements/unified-networking-prd/README.md
Requirements R1–R9 with acceptance criteria addressing unified resources, scope-driven provider selection, composable drivers, DNAT/SNAT separation, CaaS activation timing, and forward compatibility.
Design metadata and introduction
enhancements/unified-networking/README.md
Enhancement design front-matter and header metadata.
Architecture and core concepts
enhancements/unified-networking/README.md
Defines NetworkClass, driver-group/action mapping, capability registration, and operator-driven driver discovery (ConfigMaps).
Scope semantics and dispatcher behavior
enhancements/unified-networking/README.md
VirtualNetwork.scope (vm/bm) semantics and dispatcher/operator composition for two-step VM ingress/egress vs one-step BM behavior.
Resource hierarchy and PublicIPPool
enhancements/unified-networking/README.md
Resource hierarchy and PublicIPPool meaning (external-to-VN, provider-managed allocation across scopes).
API/model extensions
enhancements/unified-networking/README.md
Adds VirtualNetworkSpec.scope, ClusterSpec/BaremetalInstanceSpec.network_attachments, NetworkAttachment semantics, PublicIPAttachment / target_endpoint, and VN-scoped NATGateway outbound-only.
VMaaS end-to-end workflow
enhancements/unified-networking/README.md
Scope validation, provider setup, VN/subnet/SG creation, VM placement, inbound PublicIPAttachment (DNAT), outbound NATGateway (SNAT).
CaaS workflow and ordering rules
enhancements/unified-networking/README.md
CaaS cluster flow, scope-driven backend resolution, and prerequisite ordering requiring PublicIPAttachments before cluster creation (hosted control plane hairpin NAT).
BMaaS workflow
enhancements/unified-networking/README.md
Baremetal provisioning steps, attachments, and inbound/outbound exposure behavior without VM-specific bridging.
Implementation notes and constraints
enhancements/unified-networking/README.md
NATGateway scoping, multi-NIC support limits, single hosting-cluster assumption, v1 cross-VN isolation out-of-scope, DNS external integration, and infrastructure needs.
Risks, alternatives, questions, governance
enhancements/unified-networking/README.md
Risks/mitigations, alternatives not implemented, resolved questions, and placeholder test/graduation/upgrade/support sections.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Suggested reviewers

  • mhrivnak
  • larsks
  • trewest
  • avishayt

Poem

🔗 One spec to name them, one scope to bind,
Provider drivers map actions that we find.
DNAT in, SNAT out — clear lanes defined,
Attachments ordered so clusters can align.
Unified networks sketched, with risks outlined.

🚥 Pre-merge checks | ✅ 5 | ❌ 6

❌ Failed checks (6 inconclusive)

Check name Status Explanation Resolution
No-Hardcoded-Secrets ❓ Inconclusive Repository clone failed, so this custom check could not run with code access. Retry the review run. If this persists, inspect pre-merge custom-check logs for infrastructure or agent runtime failures.
No-Weak-Crypto ❓ Inconclusive Repository clone failed, so this custom check could not run with code access. Retry the review run. If this persists, inspect pre-merge custom-check logs for infrastructure or agent runtime failures.
No-Injection-Vectors ❓ Inconclusive Repository clone failed, so this custom check could not run with code access. Retry the review run. If this persists, inspect pre-merge custom-check logs for infrastructure or agent runtime failures.
Container-Privileges ❓ Inconclusive Repository clone failed, so this custom check could not run with code access. Retry the review run. If this persists, inspect pre-merge custom-check logs for infrastructure or agent runtime failures.
No-Sensitive-Data-In-Logs ❓ Inconclusive Repository clone failed, so this custom check could not run with code access. Retry the review run. If this persists, inspect pre-merge custom-check logs for infrastructure or agent runtime failures.
Ai-Attribution ❓ Inconclusive Repository clone failed, so this custom check could not run with code access. Retry the review run. If this persists, inspect pre-merge custom-check logs for infrastructure or agent runtime failures.
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The PR title clearly and specifically identifies the main change: introducing a unified networking API covering VMaaS, CaaS, and BMaaS, referenced by the ticket ID OSAC-1029. It accurately summarizes the core enhancement without vagueness or misleading claims.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch unified-networking-design

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@enhancements/unified-networking/README.md`:
- Around line 240-252: The fenced code blocks in the README containing the
topology diagram (lines showing "NetworkClass (per region, provider-only)",
"VirtualNetwork ...", "PublicIPPool ...") and the other step-flow blocks
referenced (also at 467-477) are missing language identifiers; update those
fences to include a language tag (use "text" for the hierarchy/diagram blocks
and "text" or "bash" for any step/command flows) so markdown linters (MD040)
pass, i.e., locate the triple-backtick fences around the diagram and step flows
and replace ``` with ```text or ```bash as appropriate.
- Around line 351-356: The PublicIPAttachment schema leaves target_endpoint
optional which allows ambiguous behavior for target=cluster; update the
PublicIPAttachment validation to require target_endpoint whenever target ==
"cluster" and restrict its allowed values to the closed enum {"api", "ingress"}
so callers must explicitly choose the cluster endpoint; locate and change the
validation logic for PublicIPAttachment (the target and target_endpoint fields)
and enforce the conditional required rule and enum constraint at the
schema/validation layer.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: osac-project/coderabbit/.coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 274546ac-ac4e-4942-bab1-125f1365a466

📥 Commits

Reviewing files that changed from the base of the PR and between cba4bd4 and 6f84859.

📒 Files selected for processing (1)
  • enhancements/unified-networking/README.md

Comment thread enhancements/unified-networking/README.md
Comment thread enhancements/unified-networking/README.md Outdated

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (6)
enhancements/unified-networking/README.md (6)

428-430: 🧹 Nitpick | 🔵 Trivial

NATGateway VN-wide scoping may be overly broad for compliance scenarios.

Line 429 states "All subnets in the VN use the gateway for outbound traffic." While this simplifies v1, it may create compliance or security posture challenges:

Operational concern:

  • Impact: In multi-tier architectures (e.g., web/app/db subnets in same VN), organizations may want only specific subnets to have internet egress
  • Compliance risk: Some subnets (e.g., database tier) may require air-gapped isolation from external networks
  • Audit complexity: Cannot prove which subnet initiated outbound traffic when all subnets share the same SNAT IP

Current mitigation: Operators can create separate VirtualNetworks per tier, but this increases configuration overhead and complicates cross-tier communication (requires future VN peering).

Consider documenting this as a known limitation with a future enhancement path for per-subnet NAT association (e.g., adding an optional subnet field to NATGatewaySpec to scope it to specific subnets within the VN).

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@enhancements/unified-networking/README.md` around lines 428 - 430, Document
the VN-wide scope limitation for NATGateway and add a recommended future
enhancement: update the NATGateway documentation and the NATGatewaySpec schema
description to note that "All subnets in the VN use the gateway for outbound
traffic" is a known limitation for compliance-sensitive deployments, and propose
adding an optional subnet scoping field (e.g., an optional subnet reference
field on NATGatewaySpec) to allow per-subnet NAT association; reference
NATGateway, VirtualNetwork, and NATGatewaySpec in the note and include a short
migration/operational workaround recommending separate VirtualNetworks or VN
peering until per-subnet scoping is implemented.

304-337: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

NATGateway public_ip mutability is not specified.

Line 318 declares public_ip as "required" but does not specify whether it is mutable or immutable. Allowing runtime changes to the public IP could:

Risk impact:

  • Severity: Major — Runtime IP changes break active connections and firewall rules
  • Security concern: IP switching could bypass audit trails or circumvent IP-based access controls
  • Operational risk: Changing the SNAT IP without proper cleanup leaves stale NAT mappings

Recommendation: Mark public_ip as immutable (like PublicIPAttachment's public_ip field on line 291). If IP changes are needed, the operator should delete and recreate the NATGateway.

🔒 Proposed fix
 message NATGatewaySpec {
   string virtual_network = 1;  // parent VN ID, required, immutable
-  string public_ip = 2;        // PublicIP ID for outbound NAT, required
+  string public_ip = 2;        // PublicIP ID for outbound NAT, required, immutable
 }

Update the admission webhook or CRD validation to reject updates to spec.public_ip.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@enhancements/unified-networking/README.md` around lines 304 - 337, The
NATGatewaySpec.public_ip field must be treated as immutable to avoid runtime IP
switches; update the NATGateway documentation and schema to mark spec.public_ip
as immutable (consistent with PublicIPAttachment.public_ip) and enforce this in
the control plane by adding a validation rule in the CRD or admission webhook
that rejects updates which change NATGateway.spec.public_ip (use the NATGateway
and NATGatewaySpec identifiers when locating the code that creates validation or
webhook logic).

431-443: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

DNS ownership inconsistency between API boundary and CaaS workflow.

Lines 440-442 state "DNS record management is a service-integration concern, not part of the networking API," but line 389 (CaaS workflow) states "DNS records created by the CaaS template role."

Impact:

  • Minor documentation inconsistency — The boundary is actually clear (DNS is handled by service templates, not the networking API), but the phrasing could be misread as conflicting
  • Future API risk: If DNS becomes part of the networking API later, the "not part of" statement creates technical debt

Clarification needed: The DNS section should explicitly state that while DNS is not exposed as a networking API resource, service-specific template roles (like CaaS) integrate with DNS infrastructure as part of their provisioning logic.

📝 Suggested clarification
 #### DNS
 
-DNS record management is a service-integration concern, not part of the networking API. CaaS template roles create DNS records. A DNS API is a future enhancement.
+DNS record management is a service-integration concern, not exposed as part of the networking API. Service-specific template roles (e.g., CaaS) integrate with the provider's DNS infrastructure during provisioning (as mentioned in the CaaS workflow, line 389). A unified DNS API resource is a future enhancement.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@enhancements/unified-networking/README.md` around lines 431 - 443, Update the
"DNS" section to remove ambiguity by explicitly stating that DNS is not an
exposed networking API resource but is handled by service provisioning logic
(e.g., "CaaS template roles create DNS records")—keep both sentences and
rephrase so they read together (DNS is managed by service integration, not the
networking API; CaaS workflow creates records via templates), and add a short
caveat that this boundary may change if DNS becomes a networking API feature in
future to avoid technical debt; reference the "DNS" heading and the "CaaS
workflow"/"CaaS template roles create DNS records" phrasing when making the
edit.

374-391: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

CaaS prerequisite validation mechanism is under-specified, creating bypass risk.

Lines 383-385 state that the system "validates prerequisites exist" and that PublicIPAttachments target the cluster "by name before it exists," but the document does not specify:

Risk impact:

  • Severity: Major — Missing or weak validation allows cluster creation without required networking, causing runtime failures or hairpin NAT breakage
  • TOCTOU risk: If cluster names are not scoped by tenant/namespace, an attacker could create a cluster with the same name as pending PublicIPAttachments intended for another tenant's cluster, hijacking the public IPs
  • Bypass risk: Without explicit validation hooks, operators could skip prerequisites

Gaps:

  1. Validation mechanism: Admission webhook? Controller reconciliation? Both? When does it run?
  2. Name scoping: Are cluster names unique per tenant/namespace? The document references Metadata (line 312) but doesn't show namespace scoping for Cluster resources
  3. Prerequisite enforcement: What happens if prerequisites are deleted after cluster creation? Are they immutable references?
  4. CIDR overlap: No validation mentioned to prevent network.pod_cidr / service_cidr (lines 244-246) from overlapping with network_attachments subnet CIDRs

Recommended mitigations:

  1. Specify validation mechanism: Document whether validation is performed at admission (webhook) or during controller reconciliation, or both
  2. Add cluster name uniqueness constraint: State that cluster names must be unique within tenant namespace (assuming standard OSAC Metadata includes namespace)
  3. Add prerequisite immutability: Once a cluster references PublicIPAttachments/NATGateway, those resources should be immutable or have deletion protection
  4. Add CIDR overlap validation: Validate that network.pod_cidr and network.service_cidr do not overlap with any subnet CIDR in network_attachments
🔒 Suggested documentation additions

Add to the CaaS workflow section:

**Prerequisite Validation (Cluster Admission Webhook):**
- Cluster names must be unique within the tenant's namespace
- When cluster.spec.network_attachments references subnets:
  - Subnets must exist and have scope=cluster
  - At least one PublicIPAttachment must exist with target=<cluster-name> and target_endpoint="api"
  - At least one PublicIPAttachment must exist with target=<cluster-name> and target_endpoint="ingress"
  - At least one NATGateway must exist on the VirtualNetwork
- Cluster-internal CIDRs (pod_cidr, service_cidr) must not overlap with subnet CIDRs

**Prerequisite Protection:**
- Once a Cluster references a PublicIPAttachment or NATGateway, those resources gain a finalizer
- Deletion is blocked until the Cluster is deleted
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@enhancements/unified-networking/README.md` around lines 374 - 391, Document
and implement explicit prerequisite validation for CaaS: specify that Cluster
admission uses an admission webhook (and optional controller reconciliation) to
validate Cluster.spec.network_attachments and network.pod_cidr/service_cidr;
require Cluster names be unique within the tenant namespace (OSAC Metadata
namespace) and ensure PublicIPAttachment resources with target=<cluster-name>
and target_endpoint="api"/"ingress" plus a NATGateway on the referenced
VirtualNetwork exist before admitting the Cluster; mark referenced
PublicIPAttachment/NATGateway resources immutable (add finalizer/deletion
protection while Cluster exists) and enforce CIDR non-overlap between
network.pod_cidr/service_cidr and any subnet CIDRs in network_attachments.

280-303: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Target-endpoint validation gaps create traffic misrouting risk.

Line 298 leaves target_endpoint optional, but line 302 states that clusters have multiple endpoints requiring "explicit targeting." This creates a security and reliability risk:

Risk impact:

  • Severity: Major — Ambiguous defaults could route traffic to the wrong endpoint (API server vs. ingress controller)
  • Security exposure: If defaulted incorrectly, could expose cluster API server when only ingress was intended, or vice versa
  • User experience: Silent misrouting is hard to debug

Missing validations:

  1. target_endpoint should be required when target = cluster
  2. target_endpoint values should be constrained to a closed enum (api, ingress) at the validation layer
  3. No validation is specified to enforce scope compatibility (e.g., rejecting a PublicIPAttachment targeting a ComputeInstance on a baremetal-scoped VN)
🔒 Proposed validation rules

Add to the protobuf validation or admission webhook logic:

 message PublicIPAttachmentSpec {
   string public_ip = 1;           // PublicIP ID, required, immutable
   
   oneof target {                   // EXTENDED: add Cluster and BaremetalInstance
     string compute_instance = 2;   // ComputeInstance ID (existing)
     string cluster = 3;            // Cluster ID (new)
     string baremetal_instance = 4; // BaremetalInstance ID (new)
   }
-  string target_endpoint = 5;     // optional: for Cluster, specifies "api" or "ingress"
+  string target_endpoint = 5;     // required when target=cluster; enum: "api" | "ingress"
+                                  // validation: must be empty for compute_instance and baremetal_instance
 }
+
+// Validation rules (admission webhook or CRD validation):
+// 1. If target == cluster: target_endpoint is required and must be in {"api", "ingress"}
+// 2. If target != cluster: target_endpoint must be empty
+// 3. Target resource's VirtualNetwork.scope must be compatible with target type
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@enhancements/unified-networking/README.md` around lines 280 - 303, The
PublicIPAttachmentSpec validation is incomplete: update validation (protobuf
rules or the admission webhook) to (1) require target_endpoint when the oneof
target is cluster (i.e., when cluster field in PublicIPAttachmentSpec is set),
(2) restrict target_endpoint to a closed enum with only "api" and "ingress"
values, and (3) add scope compatibility checks that reject attachments that
mismatch network scope (e.g., prevent compute_instance from being attached to a
VN that is marked baremetal-scoped or other incompatible combinations involving
baremetal_instance and cluster). Ensure these checks run where
PublicIPAttachmentSpec is validated so the oneof (compute_instance, cluster,
baremetal_instance) and target_endpoint semantics are enforced atomically.

444-452: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Risks table is incomplete; missing several major security and reliability concerns.

The risks table (lines 446-451) captures backend complexity and scope limitations but omits several major risks identified in the API and workflow sections:

Missing risks:

  1. PublicIPAttachment target_endpoint ambiguity — Optional target_endpoint for cluster targets creates traffic misrouting and security exposure risk (severity: major)
  2. NATGateway public_ip mutability — Unspecified mutability allows runtime IP changes that break connections and bypass audit trails (severity: major)
  3. CaaS prerequisite TOCTOU — Cluster name targeting without namespace scoping allows name collision attacks (severity: major)
  4. CIDR overlap between cluster-internal and OSAC networks — No validation prevents pod_cidr/service_cidr from overlapping with network_attachments subnet CIDRs (severity: major)
🔒 Proposed risk additions

Add the following rows to the Risks and Mitigations table:

Risk Impact Mitigation
PublicIPAttachment cluster target ambiguity Traffic misrouting or security exposure if target_endpoint defaults incorrectly Make target_endpoint required when target=cluster; validate enum values {"api", "ingress"} at admission
NATGateway public_ip runtime changes Connection breakage, stale NAT mappings, audit trail bypass Mark public_ip as immutable; require delete/recreate for IP changes
CaaS cluster name TOCTOU PublicIPAttachments targeting wrong cluster if names collide across tenants Enforce cluster name uniqueness per tenant namespace; validate scope compatibility
Cluster CIDR overlap Routing conflicts between cluster-internal CNI and OSAC tenant network Validate pod_cidr/service_cidr do not overlap with network_attachments subnet CIDRs
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@enhancements/unified-networking/README.md` around lines 444 - 452, Update the
Risks and Mitigations table to add four rows covering the missing major issues:
(1) PublicIPAttachment cluster target ambiguity — reference the
PublicIPAttachment resource and its target_endpoint field and state "Make
target_endpoint required when target=cluster; validate enum values
{\"api\",\"ingress\"} at admission" as the mitigation; (2) NATGateway public_ip
mutability — reference the NATGateway resource and its public_ip field and mark
public_ip immutable with "require delete/recreate for IP changes" as the
mitigation; (3) CaaS cluster name TOCTOU — reference cluster name targeting in
PublicIPAttachment and require cluster name uniqueness per tenant/namespace and
scope compatibility validation; (4) Cluster CIDR overlap — reference
pod_cidr/service_cidr and network_attachments subnet CIDRs and add validation to
prevent overlap. Ensure each new row mirrors the existing table format and uses
the exact identifiers: PublicIPAttachment, target_endpoint, target=cluster,
NATGateway, public_ip, pod_cidr, service_cidr, and network_attachments.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@enhancements/unified-networking/README.md`:
- Around line 428-430: Document the VN-wide scope limitation for NATGateway and
add a recommended future enhancement: update the NATGateway documentation and
the NATGatewaySpec schema description to note that "All subnets in the VN use
the gateway for outbound traffic" is a known limitation for compliance-sensitive
deployments, and propose adding an optional subnet scoping field (e.g., an
optional subnet reference field on NATGatewaySpec) to allow per-subnet NAT
association; reference NATGateway, VirtualNetwork, and NATGatewaySpec in the
note and include a short migration/operational workaround recommending separate
VirtualNetworks or VN peering until per-subnet scoping is implemented.
- Around line 304-337: The NATGatewaySpec.public_ip field must be treated as
immutable to avoid runtime IP switches; update the NATGateway documentation and
schema to mark spec.public_ip as immutable (consistent with
PublicIPAttachment.public_ip) and enforce this in the control plane by adding a
validation rule in the CRD or admission webhook that rejects updates which
change NATGateway.spec.public_ip (use the NATGateway and NATGatewaySpec
identifiers when locating the code that creates validation or webhook logic).
- Around line 431-443: Update the "DNS" section to remove ambiguity by
explicitly stating that DNS is not an exposed networking API resource but is
handled by service provisioning logic (e.g., "CaaS template roles create DNS
records")—keep both sentences and rephrase so they read together (DNS is managed
by service integration, not the networking API; CaaS workflow creates records
via templates), and add a short caveat that this boundary may change if DNS
becomes a networking API feature in future to avoid technical debt; reference
the "DNS" heading and the "CaaS workflow"/"CaaS template roles create DNS
records" phrasing when making the edit.
- Around line 374-391: Document and implement explicit prerequisite validation
for CaaS: specify that Cluster admission uses an admission webhook (and optional
controller reconciliation) to validate Cluster.spec.network_attachments and
network.pod_cidr/service_cidr; require Cluster names be unique within the tenant
namespace (OSAC Metadata namespace) and ensure PublicIPAttachment resources with
target=<cluster-name> and target_endpoint="api"/"ingress" plus a NATGateway on
the referenced VirtualNetwork exist before admitting the Cluster; mark
referenced PublicIPAttachment/NATGateway resources immutable (add
finalizer/deletion protection while Cluster exists) and enforce CIDR non-overlap
between network.pod_cidr/service_cidr and any subnet CIDRs in
network_attachments.
- Around line 280-303: The PublicIPAttachmentSpec validation is incomplete:
update validation (protobuf rules or the admission webhook) to (1) require
target_endpoint when the oneof target is cluster (i.e., when cluster field in
PublicIPAttachmentSpec is set), (2) restrict target_endpoint to a closed enum
with only "api" and "ingress" values, and (3) add scope compatibility checks
that reject attachments that mismatch network scope (e.g., prevent
compute_instance from being attached to a VN that is marked baremetal-scoped or
other incompatible combinations involving baremetal_instance and cluster).
Ensure these checks run where PublicIPAttachmentSpec is validated so the oneof
(compute_instance, cluster, baremetal_instance) and target_endpoint semantics
are enforced atomically.
- Around line 444-452: Update the Risks and Mitigations table to add four rows
covering the missing major issues: (1) PublicIPAttachment cluster target
ambiguity — reference the PublicIPAttachment resource and its target_endpoint
field and state "Make target_endpoint required when target=cluster; validate
enum values {\"api\",\"ingress\"} at admission" as the mitigation; (2)
NATGateway public_ip mutability — reference the NATGateway resource and its
public_ip field and mark public_ip immutable with "require delete/recreate for
IP changes" as the mitigation; (3) CaaS cluster name TOCTOU — reference cluster
name targeting in PublicIPAttachment and require cluster name uniqueness per
tenant/namespace and scope compatibility validation; (4) Cluster CIDR overlap —
reference pod_cidr/service_cidr and network_attachments subnet CIDRs and add
validation to prevent overlap. Ensure each new row mirrors the existing table
format and uses the exact identifiers: PublicIPAttachment, target_endpoint,
target=cluster, NATGateway, public_ip, pod_cidr, service_cidr, and
network_attachments.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: osac-project/coderabbit/.coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 4dd1bbea-0ade-40ab-8fb8-943d1a80c983

📥 Commits

Reviewing files that changed from the base of the PR and between 6f84859 and a1ad750.

📒 Files selected for processing (2)
  • enhancements/unified-networking-prd/README.md
  • enhancements/unified-networking/README.md

string region = 1; // required, immutable
string ipv4_cidr = 2; // optional, immutable
string ipv6_cidr = 3; // optional, immutable
string scope = 4; // NEW: "compute", "cluster", or "baremetal"

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be enum, with the accepted values and their documentation.

message ClusterNetworkSpec {
string pod_cidr = 1; // default: 10.128.0.0/14
string service_cidr = 2; // default: 172.30.0.0/16
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does this need to be exposed to the user? Can't we internally decide what values to use?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can, this is optional in case the user want to set these values to something else

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

like assisted lets the user set it

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand. But we try (or tried) to keep things simple in the API. Unless there are users already asking for this we shouldn't expose it in the API.

string cluster = 3; // Cluster ID (new)
string baremetal_instance = 4; // BaremetalInstance ID (new)
}
string target_endpoint = 5; // optional: for Cluster, specifies "api" or "ingress"

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be an enum.

Comment thread enhancements/unified-networking-prd/README.md
Comment thread enhancements/unified-networking-prd/README.md Outdated
Comment on lines +376 to +377
The design must accommodate future mixed-scope VirtualNetworks (VMs and BM on
the same network) without requiring API changes.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This answers my previous question, but I think this explanation should be earlier in the doc.

1. **Create VirtualNetwork** (`--scope baremetal`) → resolves `fabric-managed` → Netris creates VPC (VRF)
2. **Create Subnet** → inherits `network: netris` → Netris creates V-Net + IPAM subnet
3. **Create SecurityGroup** → reads `acl: netris` → Netris ACL rules on fabric
4. **Create BaremetalInstance** with `network_attachments` → operator dispatches to AAP → HostLease CR → baremetal fulfillment component provisions server (switch port config, IP assignment, ACL rules)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How is the logic to actually get a host attached to a network defined? Does that end up just being some shared role in aap?

1. **Create VirtualNetwork** (`--scope baremetal`) → resolves `fabric-managed` → Netris creates VPC (VRF)
2. **Create Subnet** → inherits `network: netris` → Netris creates V-Net + IPAM subnet
3. **Create SecurityGroup** → reads `acl: netris` → Netris ACL rules on fabric
4. **Create BaremetalInstance** with `network_attachments` → operator dispatches to AAP → HostLease CR → baremetal fulfillment component provisions server (switch port config, IP assignment, ACL rules)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also are network_attachments mutable? I imagine this depends on the backend?
@trewest mentioned something about NICo not being able to change machine networks once they are created, is that a common limitation?

I think it's hard to reason about this if netris is the only backend we know about right now. Are there others that can inform these kinds of choices?

@tzumainn tzumainn left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for writing this up! I left a few comments/questions.

Comment thread enhancements/unified-networking/README.md Outdated
- **BMaaS** (Bare Metal as a Service): Provisions physical bare-metal
servers. The API resource is `BaremetalInstance` (defined in
[PR #45](https://github.com/osac-project/enhancement-proposals/pull/45),
unmerged).

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there (currently) a distinction between CaaS and BMaaS? If clusters are installed on bare metal, then the former would just be a subset of the latter. I know that down below you describe different workflows for each - but won't the added capabilities needed for CaaS also theoretically be available for BMaaS? Are there cases where someone may want to build out their cluster infrastructure with additional bare metal hosts running other bits of infrastructure?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The way CaaS will consume BMaaS / VMaaS (if we will run on vms as well) is still not defined very well and probably require an enhancement of its own. In the networking perspective, CaaS requires some things that BMaaS not - for example public ips for SNAT / DNAT to ingress / API

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those public IPs are pointing at internal IPs on the virtual network's subnet, correct? If so, that's a feature that BMaaS needs as well.

```protobuf
message NetworkAttachment {
string subnet = 1; // Subnet ID, required, immutable
repeated string security_groups = 2; // SecurityGroup IDs, optional, mutable

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might be obvious if I knew more about how this works for ComputeInstances, but how does this attachment know which host it's associated with?

Would this be the place to specify which bare metal port a network is/should be attached to?

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a new proposed field in BaremetalInstanceSpec just above this.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whoops, missed that! I think my question regarding bare metal ports stands, though - at the MOC there are cases where we have to attach network to specific bare metal ports that have specific characteristics.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right so I guess you're asking if we need another abstraction that includes a MAC or something to indicate which interface maps to which subnet? Good question unless I'm missing something.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, that's exactly what I mean!

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is addressed by the optional interface field in NetworkAttachments but I have questions in those relevant sections about how that will work.

@vladikr vladikr left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you! Looks like a great step forward!

and other resources. Both service types build ad-hoc networking outside the
API.

### 2. Tenants must choose networking backends

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, indeed.

One approach I'm thinking about is to make NetworkClass the main extension point (cluster-scoped), where it declares capabilities (I support: L2, L3, DPU support, etc.) and points to a driver implementation.

The osac-operator would then act as a dispatcher:
when reconciling a VirtualNetwork, it reads the NetworkClass and selects the right driver (CUDN/Netris/physical backends, and more to come).

This way tenants don't choose backends but they just get the right implementation based on what the provider configured in the NetworkClass.
I think it would also make it much easier to add new backends in the future without changing the API.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for PublicIP, and Netris for DNAT. The current design has no way to express
this composition — each resource independently declares its backend with no
validation that the combination is coherent.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This gap is pretty much related to how we model the NetworkClass.

If we will turn the NetworkClass into a self-describing object that advertises its capabilities (L2/L3, Public IPs, ...) and declares which driver implements it, then the operator can both discover capabilities and route work to the correct driver.
That was my thinking with the above comment.

This would also help with composition. Instead of every resource declaring its own implementation_strategy, the driver (referenced via NetworkClass) can be responsible for coordinating the right backends for VirtualNetwork, Subnet, SecurityGroup, and so on.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I M curios to understand how is this sill look like, can you perhaps share an example ?

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. Similar to what we've discussed offline, but now with multiple drivers/backends, as you pointed out, via actions (as in this proposal)

The main idea is that the NetworkClass becomes the main extension point. It's provider-only and cluster-scoped.

  • Provides capabilities that will help with discovery and future UI/validation.
  • This replaces the need for a separate ServiceNetworkClass.
  • Can map different actions to different drivers.
apiVersion: network.osac.io/v1alpha1
kind: NetworkClass
metadata:
  name: k8s-on-fabric
spec:
  description: "VMs on OVN with fabric-managed external access"

  drivers:
    network: cudn_net              # VirtualNetwork + Subnet
    acl: network_policy            # SecurityGroup rules
    externalAccess: netris         # PublicIP, DNAT, SNAT
    fabricBridge: metallb_l2       # Bridge OVN to physical fabric

  capabilities:
    supportsL2: true
    supportsL3: true
    supportsPublicIPs: true
    supportsDPU: false

The VirtualNetwork CR:

  • Tenants mostly use scope.
  • They can optionally pin a specific networkClass if needed.
  • The NetworkClass is resolved by the operator (via scope + region, or explicit reference).
apiVersion: network.osac.io/v1alpha1
kind: VirtualNetwork
metadata:
  name: my-cluster-network
  namespace: tenant-project
spec:
  region: moc-region-1
  scope: cluster                    # compute | cluster | baremetal
  networkClass: k8s-on-fabric       # optional – can be defaulted by scope + region
  ipv4_cidr: 10.0.0.0/16

When any networking resource is created/updated, the Dispatcher in the osac-operator does this:

  • Resolves the NetworkClass for the resource
  • Looks at the action required by the resource:
    • -VirtualNetwork / Subnet: action network
      • SecurityGroup: action acl
      • PublicIPAttachment / NATGateway: action externalAccess
  • Calls the registered Driver for each action
  • The Driver executes and updates status/conditions.

In terms of validation, we could have a simple wenhook that validates the structure, but it's the controller eventually that will validate whether a given driver supports an action and raise a relevant condition.
How we register the driver is an implementation detail, I think, but I can share my thoughts.
You proposal is already great, I just believe that we can simpify the both the mental model and have less CRDs.
With this the NetworkClass becomes the single place where providers define capabilities and select drivers.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so if i understand you correctly you are saying:

  • Drop the NetworkBackend CRs - This means we don't have a way to specify what actions backend support.
  • Drop the ServiceNetworkClass CRs - The ones composing the drivers.
  • Use NetworkClass to compose capabilities and drivers instead, instead of using one NetwokClass use one per scope and infer the right one by the scope.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking that drivers wouldn’t need to be represented as CRDs.
A driver is basically an implementation (either a Go struct or a thin wrapper that triggers an AAP job).
There’s nothing to reconcile for the driver itself, so I don’t see a strong reason to introduce a NetworkBackend CRD just for registration.

I thought that each driver could declare the list of actions it supports.
This can be done in code (for Go drivers) or via a small metadata/config for ansible drivers.
Ideally, we could have a thing go layer that represents and registers the driver

Just a thought:

type NetworkDriver interface {
    SupportedActions() []string
    ReconcileVirtualNetwork(ctx, vn) error
    ReconcileSubnet(ctx, subnet) error
    ReconcileSecurityGroup(ctx, sg) error
....

And then the registration (for example, for cudn_driver.go):

func init() {
    dispatcher.MustRegister("cudn_net", &CUDNDriver{})
}

The NetworkClass then maps actions to driver names:

drivers:
  network: cudn_net
  acl: network_policy
  externalAccess: netris
  fabricBridge: metallb_l2

Regarding discoverability, I agree it’s useful to know what drivers are available.
We could expose the registered drivers and their supported actions via the status of the operator, or come up with something else, instead of requiring a CRD per driver.
The idea of a NetworkBackend CRD is nice if it brings real value but I’d avoid it until we have more use cases (dynamic registration/pluggable)

Comment on lines +311 to +312
Tenants must not choose networking backends. The system must select the correct
backend based on the provider's infrastructure and the tenant's intent.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah yes! +1

I think making NetworkClass provider-only and using it as the driver/capability selector (instead of something tenants pick) is the right direction.

The operator can then dispatch to the appropriate driver based on the NetworkClass.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same response as #50 (comment), lets chat about it


**Key distinction:** `network_attachments` is the OSAC tenant network (where nodes physically sit). `network.pod_cidr`/`service_cidr` is the cluster-internal CNI — separate, non-overlapping address space.

#### BMaaS — End-to-End Flow

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So does that mean that BareMetalPool will consume VirtualNetwork as well, right? :)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAIU, BareMetalPool is not meant for the tenants, but for internal use when interacting with the fulfillment-service. The tenants will user BaremetalInstnace. Is that correct @adriengentil @tzumainn ?

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, maybe I don't have a full understanding, but why not to reuse the same mechanism across the project? I mean, the fulfillment-service could create a VirtualNetowrk + Subnet and BareMetalPool will consume it?

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might be missing something, but my thinking was that we want to make networking a reusable layer across VMaaS, CaaS, and BMaaS, which aligns with the direction of the unified networking proposal.

Instead of having BareMetalPool create networking internally via pool-network-create, I think we should move toward BareMetalPool consuming a VirtualNetwork (and Subnet) that was created either explicitly by the tenant or automatically by a higher-level component (such as the fulfillment-service or ClusterOrder Operator).

The BareMetalPool would then focus on host selection, attachment, and lifecycle, while delegating network creation/provisioning to the networking layer.

@danmanor danmanor Jun 4, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vladikr You are raising good points which weren't mentioned in the doc. I am not entirely sure what are the plans for BMaaS, if I would have to take a guess I would assume - BaremetalInstance -> HostLease / Host. The BareMetalPool controller will need to adapt to using the given network resources rather than creating them. cc @tzumainn @carbonin What do you think ?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no issue with the BareMetalPool controller adapting to that workflow; the current network creation step is simply defined as a template that's passed in by the profile, and it's not a problem to create a new template that doesn't do that.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@enhancements/unified-networking-prd/README.md`:
- Line 203: In the README sentence "VMs use OVN (Layer2 UDN) for isolation — a
software-defined overlay on the", change the networking term "Layer2" to the
spaced, standards-consistent form "Layer 2" so the line reads "VMs use OVN
(Layer 2 UDN) for isolation — a software-defined overlay on the"; update that
exact phrase wherever it appears to ensure consistent protocol-layer naming.
- Around line 324-335: The acceptance criterion currently labeled R1 in the
"Acceptance criteria" section is too strict compared to the Goals note; update
the README.md acceptance criteria by scoping R1 to the current phase (e.g., "v1
/ current phase") and either remove or rephrase the sentence "VMs in the same
Subnet are in the same broadcast domain regardless of which hypervisor node or
OpenShift cluster they run on" to explicitly state OVN-scoped behavior (or mark
it as a future/fabric-integration requirement). Locate the "Acceptance criteria"
block and the "Goals" note references to ensure consistency, and if helpful
split the list into "v1 (current scope)" vs "future scope" so validation tests
target only the deliverable behavior.

In `@enhancements/unified-networking/README.md`:
- Around line 304-307: The comment is ambiguous about enum semantics: update the
README validation text for the PublicIPAttachmentEndpoint field
`target_endpoint` to use proto3 enum semantics—specifically state that
`target_endpoint` MUST be UNSPECIFIED for `target=compute_instance` and
`target=baremetal_instance`, and MUST NOT be UNSPECIFIED for `target=cluster`
(i.e., reject if `target=cluster` and `target_endpoint==UNSPECIFIED`); apply the
same clear wording to the other occurrence on lines 310-311 so controllers
validate against UNSPECIFIED rather than an impossible “empty” value.
- Around line 624-631: The example CLI and docs currently permit
tenant-specified host class via the `--node-set` flag (e.g., `--node-set
workers=large,size=3`), which violates the immutable `host_class` contract;
update the CLI parsing/validation for `--node-set` to reject or ignore any
`host_class`/host-class token and only accept `size` (and other allowed
overrides), and update the README example to remove `workers=large` so it only
shows `workers=size=3`; reference the `--node-set` flag, the `host_class` term,
and the contract documented in `enhancements/caas/README.md` when making
changes.
- Line 43: Add a clear, explicit security statement in the README that
authentication, authorization, and tenant resource-scoping are handled centrally
by the Fulfillment Service as a cross-cutting concern and not by individual
modules; update sections referencing NetworkBackend, ServiceNetworkClass,
NetworkClass, and VirtualNetwork to state that the Fulfillment Service enforces
tenant ownership, RBAC/policies, and request-level authentication for all
Fulfillment validation flows, and require downstream modules to rely on those
assurances rather than reimplementing local access checks; apply the same
clarification to the other referenced sections (around lines 400-401 and
635-640) so the document consistently mandates centralized authn/authz and
tenant-scoping ownership.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: osac-project/coderabbit/.coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: ac13e8e2-a23e-48ca-80cc-9069159169ae

📥 Commits

Reviewing files that changed from the base of the PR and between 44a95b8 and 230e3b4.

📒 Files selected for processing (2)
  • enhancements/unified-networking-prd/README.md
  • enhancements/unified-networking/README.md

Comment thread enhancements/unified-networking-prd/README.md Outdated
Comment on lines +324 to +335
**Acceptance criteria:**
- Resources in different VirtualNetworks cannot communicate (full isolation)
- Resources in the same Subnet are in the same L2 broadcast domain (can
communicate at Layer 2)
- Resources in different Subnets within the same VirtualNetwork can
communicate via Layer 3 routing
- SecurityGroups control which traffic is permitted within these boundaries
- Bare-metal servers in the same Subnet are in the same broadcast domain
regardless of their physical location (rack, switch)
- VMs in the same Subnet are in the same broadcast domain regardless of
which hypervisor node or OpenShift cluster they run on

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Major: R1 acceptance criteria are stricter than the documented current architecture, creating a release-gating mismatch.

R1 currently reads as an immediate requirement (“VMs in the same Subnet … regardless of which hypervisor node or OpenShift cluster they run on”), but the Goals note states full VM/fabric integration is future work and current isolation remains OVN-scoped. This makes acceptance ambiguous and high-risk for delivery sign-off. Please scope R1 explicitly to what is required in this phase (or split into v1 vs future criteria) so validation is testable and non-contradictory.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@enhancements/unified-networking-prd/README.md` around lines 324 - 335, The
acceptance criterion currently labeled R1 in the "Acceptance criteria" section
is too strict compared to the Goals note; update the README.md acceptance
criteria by scoping R1 to the current phase (e.g., "v1 / current phase") and
either remove or rephrase the sentence "VMs in the same Subnet are in the same
broadcast domain regardless of which hypervisor node or OpenShift cluster they
run on" to explicitly state OVN-scoped behavior (or mark it as a
future/fabric-integration requirement). Locate the "Acceptance criteria" block
and the "Goals" note references to ensure consistency, and if helpful split the
list into "v1 (current scope)" vs "future scope" so validation tests target only
the deliverable behavior.

Comment thread enhancements/unified-networking/README.md Outdated
Comment thread enhancements/unified-networking/README.md Outdated
Comment on lines +624 to +631
```bash
osac create cluster \
--template ocp_4_17_small \
--network-attachment subnet=cluster-nodes-subnet,security-groups=cluster-sg \
--node-set workers=large,size=3 \
-p pull_secret="$(cat pull-secret.json)" \
-p ssh_public_key="$(cat ~/.ssh/id_rsa.pub)" \
--name my-cluster

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Major security/contract drift: cluster workflow appears to allow tenant host-class override.

--node-set workers=large,size=3 implies tenant-controlled host class selection, which conflicts with the CaaS contract where template-defined node set host_class is immutable. This can lead to unauthorized infrastructure shape changes and inconsistent scheduler/provisioning behavior across services.

Based on learnings: in enhancements/caas/README.md, template-defined node-set host_class is immutable and tenants may only override size.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@enhancements/unified-networking/README.md` around lines 624 - 631, The
example CLI and docs currently permit tenant-specified host class via the
`--node-set` flag (e.g., `--node-set workers=large,size=3`), which violates the
immutable `host_class` contract; update the CLI parsing/validation for
`--node-set` to reject or ignore any `host_class`/host-class token and only
accept `size` (and other allowed overrides), and update the README example to
remove `workers=large` so it only shows `workers=size=3`; reference the
`--node-set` flag, the `host_class` term, and the contract documented in
`enhancements/caas/README.md` when making changes.

configure NetworkClasses, and manage PublicIPPools. Tenants do not see
provider-level configuration.

- **Region**: A management boundary representing a deployment location (e.g.,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Keep in mind that the did not agree yet on what is a region in the scope of OSAC: #20

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand, Region was in the old enhancement so I kept it here as a datacenter concept. If we will need to change it, I will


### R9: Future mixed-scope VNs without API changes

The design must accommodate future mixed-scope VirtualNetworks (VMs and BM on

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Be aware that VMs might be running on a dedicated cluster. Today, the UDN implementation is aware of that and creates the network pieces in that dedicated cluster, so depending on your plans, you might have to think about this use case.
https://github.com/osac-project/osac-operator/blob/main/docs/vmaas-dedicated-cluster/README.md

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this suggestion mentions in its goals that the virtual network / subnet concepts must apply regardless of which cluster / node the VM run in (e.g. the VMs multi-tenancy need to participate in the fabric)

@ygalblum ygalblum requested review from ygalblum and removed request for vladikr June 4, 2026 14:32

@mhrivnak mhrivnak left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general there are a lot of technical designs and decisions baked into the PRD. We should be as disciplined as we can to keep PRDs focused on requirements in terms of user story outcomes without prescribing how to implement them.


- **OVN / UDN**: Open Virtual Network / User Defined Network. The
software-defined networking layer in OpenShift that provides VM isolation.
VMs on OVN have internal IPs that are not visible on the physical fabric.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not visible by default, but can be made visible via bgp and vfr-lite from a CUDN.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is mentioned elsewhere in the doc

### 7. VMs and bare metal cannot share a network

VMs use OVN (Layer2 UDN) for isolation — a software-defined overlay on the
OpenShift cluster. Bare-metal servers use physical VLANs configured on

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or vxlans

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

usually vxlans used together with vlans as a tagging mechanism


### 9. CaaS has unique prerequisite ordering

Cluster worker nodes reach the hosted control plane API server via hairpin

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a more complete description somewhere of this network setup?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

exposing implementation details to tenants
- As a provider, I want to compose different backends for different networking
functions (isolation, ACLs, external access, fabric bridging)
- As a provider, I want to register new backends by deploying NetworkBackend

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Prescribing the specific solution doesn't work so well in a user story. "As a user, I want to use the thing proposed in this document" doesn't add value. I would leave it at just "As a provider, I need to register backends."

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree

of choosing implementation details
- Define clear direction for PublicIPAttachment (inbound DNAT) and NATGateway
(outbound SNAT)
- Support air-gapped environments where "Public" IPs are data-center-routable,

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even in a connected environment, having IPs that are routable within a shared provider network enables one tenant to offer services to other tenants. I wouldn't over-complicate things right now by trying to have both public and semi-public IPs, but it would be reasonable to keep that in mind. I think it would be better to at least keep those concepts distinct and not necessarily assume that "public IPs" can easily become non-public when there's no internet connectivity.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure OSAC should care how "public" the IPs are. The provider defines the pools. Where the IPs reside doesn't change anything. They are public because they are not within the VirtualNetwork to which the network interface is attached.

- ComputeInstance, Cluster, and BaremetalInstance all have a `network_attachments`
field using the same `NetworkAttachment` message
- PublicIPAttachment supports all three as targets
- The tenant workflow for creating networking resources is identical regardless

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would be nice to have, but might not be practical. For example a cluster may need a load balancer as the endpoint where an IP gets attached.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Load Balancer is out of scope at the moment, we will add it in the future


**Acceptance criteria:**
- NetworkClass is not exposed in the tenant API
- VirtualNetwork has a `scope` field that describes what the network is for

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's avoid prescribing specific API changes in PRDs. We can discuss that in the solution doc.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agree

**Acceptance criteria:**
- The bridge plugin is specified in the provider configuration, not by tenants
- The operator orchestrates bridging and external access as independent steps
- Backend roles are atomic — no cross-dependencies between roles

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure how this relates to the rest of this section.

- "Public" IP means external to the VirtualNetwork, not internet-routable
- The API and workflow are identical for internet-facing and air-gapped
deployments
- CaaS clusters can provision using data-center-routable IPs for API server

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why? Why would we not always default to CaaS using tenant-private IPs, and they're only exposed outside the tenant's network if they opt-in to attaching an IP?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We will, but this was the implementation we decided on when implementing netris, so I have mentioned it here. I will remove it from the prd


### R7: CaaS prerequisite support

CaaS must be able to create PublicIPAttachments before the cluster exists, so

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why would a cluster require a public IP?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

limitation of the current netris implementation, hosted cluster nodes reach hostinh cluster via SNAT / DNAT instead of VPC peering

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This shouldn't require a public ip for the mere reason that you don't want to force the user to take an address from the pool unless they actually need it

@ygalblum ygalblum left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've only reviewed the PRD for now.


- As a tenant, I want to place my cluster's worker nodes on a Subnet in my
VirtualNetwork
- As a tenant, I want to attach PublicIPs to my cluster's API server and

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we mandate all or nothing here? Can the tenant attach a PublicIP only to the ingress but not to the API server?


### BMaaS-Specific Stories

- As a tenant, I want to place my BaremetalInstance on a Subnet in my

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This sentence is written in singular. What about multiple network interfaces? Can a single instance be placed in multiple VirtualNetworks?
Once we support multiple, there will need to be a way to match the physical interface with the attached network.
This also affects the next item. If there are multiple interfaces, to which one am I attaching the public ip?
These questions are also relevant for VMaaS.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see the rest of the document talks about interfaces and networks in plural. So, again, a small nit. on a Subnet -> on Subnets.
Also, can different interfaces be connected to Subnets in different VirtualNetwork? This sentence should clarify the range of options

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. Regarding cross-VN: all subnets in a multi-attachment must belong to the same VirtualNetwork — clarified in the validation rules.

physical location (rack, switch)
- These guarantees must hold for VMs regardless of which hypervisor node
or OpenShift cluster they run on — two VMs on the same Subnet must be
in the same broadcast domain even if they are on different nodes

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WDYM by different nodes. What are nodes in this context?

of choosing implementation details
- Define clear direction for PublicIPAttachment (inbound DNAT) and NATGateway
(outbound SNAT)
- Support air-gapped environments where "Public" IPs are data-center-routable,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure OSAC should care how "public" the IPs are. The provider defines the pools. Where the IPs reside doesn't change anything. They are public because they are not within the VirtualNetwork to which the network interface is attached.

Comment on lines +332 to +334
- Bare-metal servers in the same Subnet are in the same broadcast domain
regardless of their physical location (rack, switch)
- VMs in the same Subnet are in the same broadcast domain regardless of

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You list these two separately. Is this because for the first version they will not share a network?

**Acceptance criteria:**
- PublicIPAttachment handles inbound traffic only (DNAT)
- NATGateway handles outbound traffic only (SNAT)
- A resource needing both directions requires both resources explicitly

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this mean that unless explicitly attached, the resource (e.g. VM) will not have any egress? Isn't this a network level configuration? Or maybe I'm misunderstanding the NATGateway resource.


### R7: Air-gapped environment support

The API must work in air-gapped environments where IPs are data-center-routable

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I mentioned in the goals section, I don't think OSAC should care about it. Furthermore, the environment might not be air-gapped but the "public" IPs may still not be routable from the internet but rather only from the intranet. Think about services you need VPN connection to access but they themselves can access the internet.


### R7: CaaS prerequisite support

CaaS must be able to create PublicIPAttachments before the cluster exists, so

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This shouldn't require a public ip for the mere reason that you don't want to force the user to take an address from the pool unless they actually need it

@danmanor danmanor force-pushed the unified-networking-design branch from 2ca0fda to 25751d2 Compare June 7, 2026 14:50
coderabbitai[bot]
coderabbitai Bot previously requested changes Jun 7, 2026

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

♻️ Duplicate comments (4)
enhancements/unified-networking-prd/README.md (1)

310-321: ⚠️ Potential issue | 🟠 Major

Fix R1 acceptance criteria scope to match v1 single-hosting-cluster reality

R1 currently requires that VMs in the same Subnet share the same L2 broadcast domain “regardless of which hypervisor node or cluster they run on” (README.md lines 319-320), but the companion design states “V1 assumes one hosting cluster per region” and that multi-hosting-cluster support is a future enhancement. This makes the “or cluster” portion non-testable for v1.

Risk (medium): ambiguous validation can lead to tenants expecting cross-cluster L2 reachability/isolation guarantees that v1 doesn’t promise.

Scope the acceptance criteria explicitly to the single hosting cluster for v1 (remove/guard the “or cluster” text), or split into v1 vs future multi-cluster requirements.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@enhancements/unified-networking-prd/README.md` around lines 310 - 321, Update
the R1 acceptance criteria in the README to reflect v1 single-hosting-cluster
scope: locate the Acceptance criteria block and change the line that currently
reads "VMs in the same Subnet are in the same broadcast domain regardless of
which hypervisor node or cluster they run on" so it either removes "or cluster"
or is split into two bullets (one explicit v1 statement limited to the single
hosting cluster and a separate future enhancement bullet for multi-cluster
reachability); ensure you reference the R1 acceptance criteria wording in the
update so tests/QA validate the single-cluster promise only for v1.
enhancements/unified-networking/README.md (3)

37-43: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Major security gap: missing explicit authentication, authorization, and tenant-scoping ownership statement.

The document describes provider/tenant workflows and validation but does not explicitly state that authentication, authorization, and tenant resource-scoping are handled centrally by the Fulfillment Service as cross-cutting concerns. Without this, downstream modules may implement inconsistent or missing access controls.

Add a clear security statement (e.g., in a new "Security Model" subsection after line 43 or in the "Proposal" section) that:

  • Authentication, authorization, and tenant resource-scoping are enforced by the Fulfillment Service layer for all networking resources
  • Individual networking modules (NetworkClass, VirtualNetwork, PublicIP, etc.) rely on Fulfillment Service assurances rather than implementing local access checks
  • Tenant ownership validation for all resources is performed at the Fulfillment Service API boundary

Based on learnings: OSAC enhancement proposals should document authn/authz/resource scoping as Fulfillment Service cross-cutting concerns and avoid module-local security redefinition.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@enhancements/unified-networking/README.md` around lines 37 - 43, Add a new
"Security Model" subsection (e.g., after the existing intro paragraph or inside
the "Proposal" section) that explicitly states: authentication, authorization,
and tenant resource-scoping are enforced centrally by the Fulfillment Service;
individual networking modules and resources (e.g., NetworkClass, VirtualNetwork,
PublicIP, BaremetalInstance) must rely on Fulfillment Service assurances and
must not implement independent access checks; and that tenant ownership
validation for all resources is performed at the Fulfillment Service API
boundary as a cross-cutting concern. Ensure the statement is unambiguous about
who enforces authn/authz, that module implementations assume those guarantees,
and that ownership checks occur at the Fulfillment Service layer. Include a
short note advising contributors to avoid module-local security redefinition.

Source: Learnings


582-584: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Potential CaaS contract violation: CLI example may allow tenant host_class override.

The example --node-set workers=large,size=3 is ambiguous. If workers=large allows tenants to specify or override the host_class of a node set, this violates the CaaS immutability contract where template-defined node-set host_class is immutable and tenants may only override size.

Clarify the CLI syntax:

  • If workers=large means "node-set named 'workers' using host_class 'large'", remove the large parameter and show only --node-set workers,size=3 or --node-set workers=size:3
  • If large is a different parameter (not host_class), document the syntax clearly to avoid confusion

Based on learnings: In the CaaS enhancement proposal, template-defined node-set host_class is immutable and cannot be overridden by tenants at cluster creation time. Tenants may only override size.

🔒 Proposed fix if this is a host_class override
 osac create cluster --template ocp_4_17_small \
   --network-attachment subnet=cluster-subnet,security-groups=cluster-sg \
-  --node-set workers=large,size=3 --name my-cluster
+  --node-set workers,size=3 --name my-cluster
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@enhancements/unified-networking/README.md` around lines 582 - 584, The CLI
example using "--node-set workers=large,size=3" is ambiguous about whether
"large" overrides the template-defined host_class (which would violate the CaaS
immutability contract); update the docs to disallow host_class overrides and
show only size overrides for node-sets (i.e., for the node-set named "workers"
allow specifying only size), or explicitly rename the parameter if "large" is a
different field (e.g., host_class=... vs size=...), and document the exact
syntax for the --node-set flag and the immutable host_class vs overridable size
behavior.

Source: Learnings


412-414: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Incorrect enum validation semantics in comment.

The comment states target_endpoint "must be empty otherwise" (for non-cluster targets), but proto3 enums cannot be "empty" — the zero value is UNSPECIFIED. The validation logic should state: target_endpoint MUST be UNSPECIFIED when target is compute_instance or baremetal_instance, and MUST NOT be UNSPECIFIED when target is cluster.

This ensures controllers validate against the correct proto3 enum semantics.

📝 Proposed fix for validation comment
   PublicIPAttachmentEndpoint target_endpoint = 5;
-  // Required when target=cluster; must be empty otherwise.
+  // Required (non-UNSPECIFIED) when target=cluster;
+  // must be UNSPECIFIED when target=compute_instance or baremetal_instance.
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@enhancements/unified-networking/README.md` around lines 412 - 414, Update the
comment for the PublicIPAttachmentEndpoint field `target_endpoint` to reflect
proto3 enum semantics: state that when `target` is `cluster` the
`target_endpoint` MUST NOT be `UNSPECIFIED` (i.e. must be set), and when
`target` is `compute_instance` or `baremetal_instance` the `target_endpoint`
MUST be `UNSPECIFIED` (i.e. must be the enum zero value). Refer to
`target_endpoint`, `PublicIPAttachmentEndpoint`, and the `target` enum values
`cluster`, `compute_instance`, and `baremetal_instance` when describing the
required validation.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@enhancements/unified-networking-prd/README.md`:
- Around line 397-404: Acceptance criteria for PublicIPAttachment pending
activation lack authorization checks; update the spec and implementation so that
when a PublicIPAttachment in Pending state is activated (i.e., when the target
resource is created) the system validates tenant ownership, namespace scope, and
RBAC permissions before binding the PublicIP. Concretely, add an acceptance
criterion like "Pending PublicIPAttachments must validate tenant ownership,
namespace scope, and RBAC permissions at activation time" and ensure the CaaS
template role activation flow (the component that "activates pending
attachments") performs these checks against the target resource and
tenant/namespace metadata before changing state from Pending to Active to
prevent cross-tenant or unauthorized attachments and dangling/leaked PublicIPs.

In `@enhancements/unified-networking/README.md`:
- Around line 455-462: Add a blank line before the fenced code block that starts
with the triple backticks and contains the "osac admin create publicippool"
command in the README (the fenced code block beginning at the line with
"```bash"); update the markdown so there is an empty line between the preceding
paragraph ("3. Provider creates PublicIPPool:") and the opening ```bash to
satisfy MD031 linting.

---

Duplicate comments:
In `@enhancements/unified-networking-prd/README.md`:
- Around line 310-321: Update the R1 acceptance criteria in the README to
reflect v1 single-hosting-cluster scope: locate the Acceptance criteria block
and change the line that currently reads "VMs in the same Subnet are in the same
broadcast domain regardless of which hypervisor node or cluster they run on" so
it either removes "or cluster" or is split into two bullets (one explicit v1
statement limited to the single hosting cluster and a separate future
enhancement bullet for multi-cluster reachability); ensure you reference the R1
acceptance criteria wording in the update so tests/QA validate the
single-cluster promise only for v1.

In `@enhancements/unified-networking/README.md`:
- Around line 37-43: Add a new "Security Model" subsection (e.g., after the
existing intro paragraph or inside the "Proposal" section) that explicitly
states: authentication, authorization, and tenant resource-scoping are enforced
centrally by the Fulfillment Service; individual networking modules and
resources (e.g., NetworkClass, VirtualNetwork, PublicIP, BaremetalInstance) must
rely on Fulfillment Service assurances and must not implement independent access
checks; and that tenant ownership validation for all resources is performed at
the Fulfillment Service API boundary as a cross-cutting concern. Ensure the
statement is unambiguous about who enforces authn/authz, that module
implementations assume those guarantees, and that ownership checks occur at the
Fulfillment Service layer. Include a short note advising contributors to avoid
module-local security redefinition.
- Around line 582-584: The CLI example using "--node-set workers=large,size=3"
is ambiguous about whether "large" overrides the template-defined host_class
(which would violate the CaaS immutability contract); update the docs to
disallow host_class overrides and show only size overrides for node-sets (i.e.,
for the node-set named "workers" allow specifying only size), or explicitly
rename the parameter if "large" is a different field (e.g., host_class=... vs
size=...), and document the exact syntax for the --node-set flag and the
immutable host_class vs overridable size behavior.
- Around line 412-414: Update the comment for the PublicIPAttachmentEndpoint
field `target_endpoint` to reflect proto3 enum semantics: state that when
`target` is `cluster` the `target_endpoint` MUST NOT be `UNSPECIFIED` (i.e. must
be set), and when `target` is `compute_instance` or `baremetal_instance` the
`target_endpoint` MUST be `UNSPECIFIED` (i.e. must be the enum zero value).
Refer to `target_endpoint`, `PublicIPAttachmentEndpoint`, and the `target` enum
values `cluster`, `compute_instance`, and `baremetal_instance` when describing
the required validation.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: osac-project/coderabbit/.coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 82e3bc8d-0b75-4eec-825e-8a5433968944

📥 Commits

Reviewing files that changed from the base of the PR and between 230e3b4 and 25751d2.

📒 Files selected for processing (2)
  • enhancements/unified-networking-prd/README.md
  • enhancements/unified-networking/README.md

Comment on lines +397 to +404
**Acceptance criteria:**
- PublicIPAttachments can target a resource that doesn't exist yet (Pending
state)
- Cluster creation validates that a PublicIPAttachment for the API endpoint
and a NATGateway exist (ingress attachment is optional — can be added
after provisioning)
- The CaaS template role activates pending attachments at the right time

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

R8 acceptance criteria missing security validation for pending PublicIPAttachments.

Line 398-399 allows PublicIPAttachments to target resources that don't exist yet (Pending state). The acceptance criteria do not specify ownership or authorization validation when pending attachments activate. Without proper validation:

  • A tenant could create a pending attachment for a resource ID they don't own, then activate it when the actual owner creates that resource.
  • Dangling pending attachments could leak PublicIPs or create billing/quota issues.
  • Cross-tenant or cross-namespace attachment could occur if namespace/RBAC validation is missing at activation time.

Risk severity: Major — Missing authorization validation on pending resource attachment creates potential privilege escalation and resource leakage vulnerabilities.

Recommend adding acceptance criterion: "Pending PublicIPAttachments validate tenant ownership, namespace scope, and RBAC permissions at activation time (when target resource is created) in addition to creation-time validation."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@enhancements/unified-networking-prd/README.md` around lines 397 - 404,
Acceptance criteria for PublicIPAttachment pending activation lack authorization
checks; update the spec and implementation so that when a PublicIPAttachment in
Pending state is activated (i.e., when the target resource is created) the
system validates tenant ownership, namespace scope, and RBAC permissions before
binding the PublicIP. Concretely, add an acceptance criterion like "Pending
PublicIPAttachments must validate tenant ownership, namespace scope, and RBAC
permissions at activation time" and ensure the CaaS template role activation
flow (the component that "activates pending attachments") performs these checks
against the target resource and tenant/namespace metadata before changing state
from Pending to Active to prevent cross-tenant or unauthorized attachments and
dangling/leaked PublicIPs.

Comment thread enhancements/unified-networking/README.md Outdated
@danmanor danmanor force-pushed the unified-networking-design branch 2 times, most recently from f40f848 to 734ca3b Compare June 7, 2026 15:41
@danmanor danmanor requested a review from vladikr June 11, 2026 09:29
region: moc-region-1
fabricManager: netris
k8sManager: cudn_localnet
capabilities:

@trewest trewest Jun 11, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is capabilities redefined in this CR? It already exists in the netris and cudn_localnet configmaps

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the capabilities are inferred from the managers, I would put them in the status and not in the spec. Then the provider admin does not need to set it thus reducing the chance for error.
With this path, I would consider disable_capabilities in the spec. This will allow the admin to make sure certain capabilities are disabled regardless to whether they are supported by the manager. But, this configuration will also need to be pushed to the manager.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed and implemented — capabilities are now inferred from the managers and published in status.capabilities. Added spec.disableCapabilities for providers who need to restrict specific capabilities.

@vladikr

vladikr commented Jun 11, 2026

Copy link
Copy Markdown

@danmanor I had some thoughts after reading the latest update.
I see that the proposal moved away from the driver registration model we had discussed earlier.
The current approach using labeled ConfigMaps for driver discovery is clearly more pragmatic
and it allows adding new drivers without changing the API or the operator, and reuses the existing ansible workflow.
I understand why you chose that direction...

That said, I do feel we lost some of the original cleanliness we were aiming for, at least from my pov.
That's specifically around having a clearer Driver interface, stronger validation of supported actions. That would also make it easier to introduce pure Go drivers for Kubernetes native functionality like CUDN.

However, I’m not strongly attached to the original idea :)
The current proposal is simpler to implement with the existing tooling and patterns, which is a valid and important consideration.
I’d be happy to go with the current direction.
Maybe I was overthinking the driver abstraction.

`network_attachments` are immutable after resource creation — changing
network attachment requires recreating the resource.

The `interface` field is used for bare-metal resources only. It references a

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had a comment that seems to have been lost:

Do we imagine having other optional fields for other kinds of resources? Maybe we should do separate types with a kind of inheritance here instead of all trying to share the same NetworkAttachment type?

Maybe BareMetalNetworkAttachment and ClusterNetworkAttachment, etc.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree, changed it accordingly

Comment on lines +418 to +424
# spec_defaults.interfaces:
# - name: data-0
# description: "25Gbps data interface"
# - name: data-1
# description: "25Gbps data interface"
# - name: mgmt
# description: "1Gbps management interface"

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think another comment got lost here from before:

@adriengentil do you think the template is where this kind of thing would be? This means it would be up to the cloud admin to create this rather than us pulling it from the backend, right? (i.e. it could be wrong and we need to deal with that)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Waited for @adriengentil response. I think the admin will have to provide this info anyway, so whether it is in the template or via another way should be decided in the BMaaS flow, probably less relevant to this proposal, I just mentioned how it will look like with some assumptions

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That should be part of the HostType that we need to design, templates does not expose any inventory information, and are not tied to a "HostType" in particular.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay so I think we can just make this less specific since this is TBD on the bare metal side.

For this proposal let's just say that the user will have access to some interface identifier from BMaaS then leave the rest about how someone would specify the network attachment given that they have the interface info and subnet.

Comment thread enhancements/unified-networking/README.md Outdated
Comment on lines +94 to +95
creates the K8s overlay and bridges it to the fabric segment. Only needed
for regions that host VMs.

@akshaynadkarni akshaynadkarni Jun 13, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only needed for regions that host VMs

What does this mean? Are you referring to hub's hosting VMs vs. remote VM clusters?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean it is only needed in deployments of osac serving VMaaS, someone might not offer that

@danmanor danmanor force-pushed the unified-networking-design branch from 099bd6a to 34c9e52 Compare June 14, 2026 08:09
@danmanor danmanor force-pushed the unified-networking-design branch 2 times, most recently from 563fc25 to e547199 Compare June 14, 2026 08:43

- As a tenant, I want to create isolated VirtualNetworks and Subnets for my
workloads without choosing a networking backend
- As a tenant, I want to define SecurityGroups to control traffic to my

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small nit. Security groups control both ingress and egress. So:to -> to and from

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed, thanks!


### BMaaS-Specific Stories

- As a tenant, I want to place my BaremetalInstance on a Subnet in my

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see the rest of the document talks about interfaces and networks in plural. So, again, a small nit. on a Subnet -> on Subnets.
Also, can different interfaces be connected to Subnets in different VirtualNetwork? This sentence should clarify the range of options

region: moc-region-1
fabricManager: netris
k8sManager: cudn_localnet
capabilities:

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the capabilities are inferred from the managers, I would put them in the status and not in the spec. Then the provider admin does not need to set it thus reducing the chance for error.
With this path, I would consider disable_capabilities in the spec. This will allow the admin to make sure certain capabilities are disabled regardless to whether they are supported by the manager. But, this configuration will also need to be pushed to the manager.

name: k8s-manager-cudn-localnet
namespace: osac
labels:
osac.openshift.io/k8s-manager: "true"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would make the label more network specific to avoid future collisions. E.g. osac.openshift.io/network/k8s-manager

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call, updated to osac.openshift.io/network/fabric-manager and osac.openshift.io/network/k8s-manager.


```yaml
apiVersion: v1
kind: ConfigMap

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why ConfigMap? Why not add a CRD?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Keeping ConfigMaps for simplicity — lowest barrier to adding a new manager (just kubectl apply a ConfigMap + deploy an Ansible role). No operator schema changes needed. Can revisit if registration needs evolve.

|-----------|----------------|
| VN create/delete | `fabricManager` |
| Subnet create/delete | `fabricManager` + `k8sManager` (per hosting cluster) |
| SecurityGroup create/delete | `fabricManager` |

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question, once this is in place, VMs will not have an IP address that is internal to the cluster. That is, not through the fabric. Right? I'm asking to make sure we don't miss a different path that the Security Groups are missing

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct — once VMs are bridged to the fabric (via the k8sManager), they get a fabric IP. All traffic goes through the fabric, so SecurityGroup enforcement via fabric ACLs covers all paths. There is no cluster-internal IP path that bypasses the fabric.

--compute-instance my-vm --name vm-att

# Attach to a BM server (new target type)
osac create externalipattachment --externalip my-ip \

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the BM has multiple interfaces, how does this command know to which interface the external IP should be attached?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ExternalIPAttachment targets the resource, not a specific interface. The fabric manager DNATs to the resource's primary subnet IP. Clarified in the latest push.

# description: "1Gbps management interface"
```

The tenant then specifies which interface to attach to which subnet. Each

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some questions:

  1. Can one BareMetalInstance have more than one BareMetalNetworkAttachment without an interface?
  2. When is the matching done? What happens if a interface does not exist? What happens if there are more attachments than interfaces?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in the latest push — added explicit validation rules:

  1. Multiple attachments without interface is invalid — if more than one, each must specify interface
  2. Validated at creation time — invalid interface identifiers or exceeding available interfaces are rejected

during provisioning:

```protobuf
message Cluster {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this be:

Suggested change
message Cluster {
message ClusterStatus {

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed, thanks!

@danmanor danmanor force-pushed the unified-networking-design branch from e547199 to 8fef7e5 Compare June 16, 2026 10:16
@danmanor danmanor enabled auto-merge (squash) June 17, 2026 07:27
@danmanor danmanor dismissed stale reviews from mhrivnak and coderabbitai[bot] June 17, 2026 07:28

handled

One NATGateway per VirtualNetwork. All subnets in the VN use the gateway.
Per-subnet NAT association is a future enhancement.

#### Multi-NIC Support

@adriengentil adriengentil Jun 17, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we can attach to multiple subnets, how we define the default interface (the one with the default gateway)? If dhcp is used, how can we expect ipam to be configured (the "default" subnet will provide a default gateway, while the other one won't distribute it)?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question. Added a note in Multi-NIC Support: the default gateway is provided by the fabric manager on the first (primary) subnet. Additional subnets provide connectivity without a default gateway — the fabric handles inter-subnet routing within the VN.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

by first, you mean the first network attachment in the list?

@adriengentil adriengentil Jun 17, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the fabric handles inter-subnet routing within the VN

I case of 2 subnets, the BM will have a directly connected route to each subnet, so I'll bypass the inter-subnet routing

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated — default gateway and DHCP configuration for multi-homed BM is an implementation detail of the fabric manager. Removed the ambiguous "first" reference.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch — you're right, a multi-homed BM has directly connected routes to both subnets. Removed the inaccurate claim about fabric inter-subnet routing. The note now defers gateway/DHCP config to the fabric manager implementation.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still think it's important from tenant user point of view to be able to decide which subnet will be the default one, so I think the API should be clear about it.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make sense to add an explicit default: true field on NetworkAttachment (and make sure that there is only one attachment per BaremetalInstance)
This would make the tenant clearly express their intent through the API

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it makes sense

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree — multi-NIC behavior (which subnet gets the default gateway, how the tenant controls it, DHCP configuration per subnet) deserves its own focused discussion. This proposal establishes that multi-NIC is supported via the attachment types, but the specifics of how default gateway selection works for both VMs and BMs will be addressed in a follow-up PR.

Comment thread enhancements/unified-networking/README.md
Comment thread enhancements/unified-networking/README.md Outdated
Introduces a redesigned networking architecture that generalizes the OSAC
Networking API across all three service types with infrastructure-agnostic
subnets where VMs participate in the fabric.

Key design decisions:
- NetworkClass with fabricManager + optional k8sManager (two fields)
- ExternalIP (renamed from PublicIP) — external to VN, not internet-routable
- Fabric manager handles all physical networking as a single product
- K8s manager bridges OVN overlay to fabric (VMs become part of fabric)
- Uniform dispatcher — no VM-vs-BM distinction for ExternalIP or NATGateway
- Fabric-only security enforcement (no separate k8s ACL)
- BM multi-interface selection via template discovery
- CaaS prerequisite ordering handled by template, not API mandate

Signed-off-by: Dan Manor <dmanor@redhat.com>
@danmanor danmanor force-pushed the unified-networking-design branch from a92f576 to e9035d7 Compare June 17, 2026 09:44
cluster. Bare-metal servers use physical VLANs configured on switches in the
fabric. These are fundamentally different L2 domains. A K8s manager using
LocalNet mode can bridge OVN to the physical fabric, making VMs first-class
participants alongside BM servers. The current design does not address how

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens in CUDN+LocalNet (where OVN-K still does IPAM for VMs) but Netris does it for BM? Who is the source of truth?

@danmanor danmanor merged commit 3724292 into main Jun 17, 2026
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.