OSAC-1029: Unified Networking API for VMaaS, CaaS, and BMaaS#50
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
WalkthroughAdds a PRD and complementary enhancement design defining a provider-only unified networking model (NetworkClass/NetworkBackend/ServiceNetworkClass), API changes (VirtualNetwork.scope, network_attachments), expanded PublicIPAttachment/NATGateway semantics, end-to-end workflows and ordering rules. Risk: Medium — impacts tenant/provider networking semantics and may affect availability and security. ChangesUnified Networking Enhancement
Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 5 | ❌ 6❌ Failed checks (6 inconclusive)
✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@enhancements/unified-networking/README.md`:
- Around line 240-252: The fenced code blocks in the README containing the
topology diagram (lines showing "NetworkClass (per region, provider-only)",
"VirtualNetwork ...", "PublicIPPool ...") and the other step-flow blocks
referenced (also at 467-477) are missing language identifiers; update those
fences to include a language tag (use "text" for the hierarchy/diagram blocks
and "text" or "bash" for any step/command flows) so markdown linters (MD040)
pass, i.e., locate the triple-backtick fences around the diagram and step flows
and replace ``` with ```text or ```bash as appropriate.
- Around line 351-356: The PublicIPAttachment schema leaves target_endpoint
optional which allows ambiguous behavior for target=cluster; update the
PublicIPAttachment validation to require target_endpoint whenever target ==
"cluster" and restrict its allowed values to the closed enum {"api", "ingress"}
so callers must explicitly choose the cluster endpoint; locate and change the
validation logic for PublicIPAttachment (the target and target_endpoint fields)
and enforce the conditional required rule and enum constraint at the
schema/validation layer.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: osac-project/coderabbit/.coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 274546ac-ac4e-4942-bab1-125f1365a466
📒 Files selected for processing (1)
enhancements/unified-networking/README.md
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (6)
enhancements/unified-networking/README.md (6)
428-430: 🧹 Nitpick | 🔵 TrivialNATGateway VN-wide scoping may be overly broad for compliance scenarios.
Line 429 states "All subnets in the VN use the gateway for outbound traffic." While this simplifies v1, it may create compliance or security posture challenges:
Operational concern:
- Impact: In multi-tier architectures (e.g., web/app/db subnets in same VN), organizations may want only specific subnets to have internet egress
- Compliance risk: Some subnets (e.g., database tier) may require air-gapped isolation from external networks
- Audit complexity: Cannot prove which subnet initiated outbound traffic when all subnets share the same SNAT IP
Current mitigation: Operators can create separate VirtualNetworks per tier, but this increases configuration overhead and complicates cross-tier communication (requires future VN peering).
Consider documenting this as a known limitation with a future enhancement path for per-subnet NAT association (e.g., adding an optional
subnetfield toNATGatewaySpecto scope it to specific subnets within the VN).🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@enhancements/unified-networking/README.md` around lines 428 - 430, Document the VN-wide scope limitation for NATGateway and add a recommended future enhancement: update the NATGateway documentation and the NATGatewaySpec schema description to note that "All subnets in the VN use the gateway for outbound traffic" is a known limitation for compliance-sensitive deployments, and propose adding an optional subnet scoping field (e.g., an optional subnet reference field on NATGatewaySpec) to allow per-subnet NAT association; reference NATGateway, VirtualNetwork, and NATGatewaySpec in the note and include a short migration/operational workaround recommending separate VirtualNetworks or VN peering until per-subnet scoping is implemented.
304-337:⚠️ Potential issue | 🟠 Major | ⚡ Quick winNATGateway public_ip mutability is not specified.
Line 318 declares
public_ipas "required" but does not specify whether it is mutable or immutable. Allowing runtime changes to the public IP could:Risk impact:
- Severity: Major — Runtime IP changes break active connections and firewall rules
- Security concern: IP switching could bypass audit trails or circumvent IP-based access controls
- Operational risk: Changing the SNAT IP without proper cleanup leaves stale NAT mappings
Recommendation: Mark
public_ipas immutable (like PublicIPAttachment'spublic_ipfield on line 291). If IP changes are needed, the operator should delete and recreate the NATGateway.🔒 Proposed fix
message NATGatewaySpec { string virtual_network = 1; // parent VN ID, required, immutable - string public_ip = 2; // PublicIP ID for outbound NAT, required + string public_ip = 2; // PublicIP ID for outbound NAT, required, immutable }Update the admission webhook or CRD validation to reject updates to
spec.public_ip.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@enhancements/unified-networking/README.md` around lines 304 - 337, The NATGatewaySpec.public_ip field must be treated as immutable to avoid runtime IP switches; update the NATGateway documentation and schema to mark spec.public_ip as immutable (consistent with PublicIPAttachment.public_ip) and enforce this in the control plane by adding a validation rule in the CRD or admission webhook that rejects updates which change NATGateway.spec.public_ip (use the NATGateway and NATGatewaySpec identifiers when locating the code that creates validation or webhook logic).
431-443:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winDNS ownership inconsistency between API boundary and CaaS workflow.
Lines 440-442 state "DNS record management is a service-integration concern, not part of the networking API," but line 389 (CaaS workflow) states "DNS records created by the CaaS template role."
Impact:
- Minor documentation inconsistency — The boundary is actually clear (DNS is handled by service templates, not the networking API), but the phrasing could be misread as conflicting
- Future API risk: If DNS becomes part of the networking API later, the "not part of" statement creates technical debt
Clarification needed: The DNS section should explicitly state that while DNS is not exposed as a networking API resource, service-specific template roles (like CaaS) integrate with DNS infrastructure as part of their provisioning logic.
📝 Suggested clarification
#### DNS -DNS record management is a service-integration concern, not part of the networking API. CaaS template roles create DNS records. A DNS API is a future enhancement. +DNS record management is a service-integration concern, not exposed as part of the networking API. Service-specific template roles (e.g., CaaS) integrate with the provider's DNS infrastructure during provisioning (as mentioned in the CaaS workflow, line 389). A unified DNS API resource is a future enhancement.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@enhancements/unified-networking/README.md` around lines 431 - 443, Update the "DNS" section to remove ambiguity by explicitly stating that DNS is not an exposed networking API resource but is handled by service provisioning logic (e.g., "CaaS template roles create DNS records")—keep both sentences and rephrase so they read together (DNS is managed by service integration, not the networking API; CaaS workflow creates records via templates), and add a short caveat that this boundary may change if DNS becomes a networking API feature in future to avoid technical debt; reference the "DNS" heading and the "CaaS workflow"/"CaaS template roles create DNS records" phrasing when making the edit.
374-391:⚠️ Potential issue | 🟠 Major | 🏗️ Heavy liftCaaS prerequisite validation mechanism is under-specified, creating bypass risk.
Lines 383-385 state that the system "validates prerequisites exist" and that PublicIPAttachments target the cluster "by name before it exists," but the document does not specify:
Risk impact:
- Severity: Major — Missing or weak validation allows cluster creation without required networking, causing runtime failures or hairpin NAT breakage
- TOCTOU risk: If cluster names are not scoped by tenant/namespace, an attacker could create a cluster with the same name as pending PublicIPAttachments intended for another tenant's cluster, hijacking the public IPs
- Bypass risk: Without explicit validation hooks, operators could skip prerequisites
Gaps:
- Validation mechanism: Admission webhook? Controller reconciliation? Both? When does it run?
- Name scoping: Are cluster names unique per tenant/namespace? The document references
Metadata(line 312) but doesn't show namespace scoping for Cluster resources- Prerequisite enforcement: What happens if prerequisites are deleted after cluster creation? Are they immutable references?
- CIDR overlap: No validation mentioned to prevent
network.pod_cidr/service_cidr(lines 244-246) from overlapping withnetwork_attachmentssubnet CIDRsRecommended mitigations:
- Specify validation mechanism: Document whether validation is performed at admission (webhook) or during controller reconciliation, or both
- Add cluster name uniqueness constraint: State that cluster names must be unique within tenant namespace (assuming standard OSAC Metadata includes namespace)
- Add prerequisite immutability: Once a cluster references PublicIPAttachments/NATGateway, those resources should be immutable or have deletion protection
- Add CIDR overlap validation: Validate that
network.pod_cidrandnetwork.service_cidrdo not overlap with any subnet CIDR innetwork_attachments🔒 Suggested documentation additions
Add to the CaaS workflow section:
**Prerequisite Validation (Cluster Admission Webhook):** - Cluster names must be unique within the tenant's namespace - When cluster.spec.network_attachments references subnets: - Subnets must exist and have scope=cluster - At least one PublicIPAttachment must exist with target=<cluster-name> and target_endpoint="api" - At least one PublicIPAttachment must exist with target=<cluster-name> and target_endpoint="ingress" - At least one NATGateway must exist on the VirtualNetwork - Cluster-internal CIDRs (pod_cidr, service_cidr) must not overlap with subnet CIDRs **Prerequisite Protection:** - Once a Cluster references a PublicIPAttachment or NATGateway, those resources gain a finalizer - Deletion is blocked until the Cluster is deleted🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@enhancements/unified-networking/README.md` around lines 374 - 391, Document and implement explicit prerequisite validation for CaaS: specify that Cluster admission uses an admission webhook (and optional controller reconciliation) to validate Cluster.spec.network_attachments and network.pod_cidr/service_cidr; require Cluster names be unique within the tenant namespace (OSAC Metadata namespace) and ensure PublicIPAttachment resources with target=<cluster-name> and target_endpoint="api"/"ingress" plus a NATGateway on the referenced VirtualNetwork exist before admitting the Cluster; mark referenced PublicIPAttachment/NATGateway resources immutable (add finalizer/deletion protection while Cluster exists) and enforce CIDR non-overlap between network.pod_cidr/service_cidr and any subnet CIDRs in network_attachments.
280-303:⚠️ Potential issue | 🟠 Major | ⚡ Quick winTarget-endpoint validation gaps create traffic misrouting risk.
Line 298 leaves
target_endpointoptional, but line 302 states that clusters have multiple endpoints requiring "explicit targeting." This creates a security and reliability risk:Risk impact:
- Severity: Major — Ambiguous defaults could route traffic to the wrong endpoint (API server vs. ingress controller)
- Security exposure: If defaulted incorrectly, could expose cluster API server when only ingress was intended, or vice versa
- User experience: Silent misrouting is hard to debug
Missing validations:
target_endpointshould be required whentarget = clustertarget_endpointvalues should be constrained to a closed enum (api,ingress) at the validation layer- No validation is specified to enforce scope compatibility (e.g., rejecting a PublicIPAttachment targeting a ComputeInstance on a
baremetal-scoped VN)🔒 Proposed validation rules
Add to the protobuf validation or admission webhook logic:
message PublicIPAttachmentSpec { string public_ip = 1; // PublicIP ID, required, immutable oneof target { // EXTENDED: add Cluster and BaremetalInstance string compute_instance = 2; // ComputeInstance ID (existing) string cluster = 3; // Cluster ID (new) string baremetal_instance = 4; // BaremetalInstance ID (new) } - string target_endpoint = 5; // optional: for Cluster, specifies "api" or "ingress" + string target_endpoint = 5; // required when target=cluster; enum: "api" | "ingress" + // validation: must be empty for compute_instance and baremetal_instance } + +// Validation rules (admission webhook or CRD validation): +// 1. If target == cluster: target_endpoint is required and must be in {"api", "ingress"} +// 2. If target != cluster: target_endpoint must be empty +// 3. Target resource's VirtualNetwork.scope must be compatible with target type🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@enhancements/unified-networking/README.md` around lines 280 - 303, The PublicIPAttachmentSpec validation is incomplete: update validation (protobuf rules or the admission webhook) to (1) require target_endpoint when the oneof target is cluster (i.e., when cluster field in PublicIPAttachmentSpec is set), (2) restrict target_endpoint to a closed enum with only "api" and "ingress" values, and (3) add scope compatibility checks that reject attachments that mismatch network scope (e.g., prevent compute_instance from being attached to a VN that is marked baremetal-scoped or other incompatible combinations involving baremetal_instance and cluster). Ensure these checks run where PublicIPAttachmentSpec is validated so the oneof (compute_instance, cluster, baremetal_instance) and target_endpoint semantics are enforced atomically.
444-452:⚠️ Potential issue | 🟠 Major | ⚡ Quick winRisks table is incomplete; missing several major security and reliability concerns.
The risks table (lines 446-451) captures backend complexity and scope limitations but omits several major risks identified in the API and workflow sections:
Missing risks:
- PublicIPAttachment target_endpoint ambiguity — Optional
target_endpointfor cluster targets creates traffic misrouting and security exposure risk (severity: major)- NATGateway public_ip mutability — Unspecified mutability allows runtime IP changes that break connections and bypass audit trails (severity: major)
- CaaS prerequisite TOCTOU — Cluster name targeting without namespace scoping allows name collision attacks (severity: major)
- CIDR overlap between cluster-internal and OSAC networks — No validation prevents pod_cidr/service_cidr from overlapping with network_attachments subnet CIDRs (severity: major)
🔒 Proposed risk additions
Add the following rows to the Risks and Mitigations table:
Risk Impact Mitigation PublicIPAttachment cluster target ambiguity Traffic misrouting or security exposure if target_endpoint defaults incorrectly Make target_endpoint required when target=cluster; validate enum values {"api", "ingress"} at admission NATGateway public_ip runtime changes Connection breakage, stale NAT mappings, audit trail bypass Mark public_ip as immutable; require delete/recreate for IP changes CaaS cluster name TOCTOU PublicIPAttachments targeting wrong cluster if names collide across tenants Enforce cluster name uniqueness per tenant namespace; validate scope compatibility Cluster CIDR overlap Routing conflicts between cluster-internal CNI and OSAC tenant network Validate pod_cidr/service_cidr do not overlap with network_attachments subnet CIDRs 🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@enhancements/unified-networking/README.md` around lines 444 - 452, Update the Risks and Mitigations table to add four rows covering the missing major issues: (1) PublicIPAttachment cluster target ambiguity — reference the PublicIPAttachment resource and its target_endpoint field and state "Make target_endpoint required when target=cluster; validate enum values {\"api\",\"ingress\"} at admission" as the mitigation; (2) NATGateway public_ip mutability — reference the NATGateway resource and its public_ip field and mark public_ip immutable with "require delete/recreate for IP changes" as the mitigation; (3) CaaS cluster name TOCTOU — reference cluster name targeting in PublicIPAttachment and require cluster name uniqueness per tenant/namespace and scope compatibility validation; (4) Cluster CIDR overlap — reference pod_cidr/service_cidr and network_attachments subnet CIDRs and add validation to prevent overlap. Ensure each new row mirrors the existing table format and uses the exact identifiers: PublicIPAttachment, target_endpoint, target=cluster, NATGateway, public_ip, pod_cidr, service_cidr, and network_attachments.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@enhancements/unified-networking/README.md`:
- Around line 428-430: Document the VN-wide scope limitation for NATGateway and
add a recommended future enhancement: update the NATGateway documentation and
the NATGatewaySpec schema description to note that "All subnets in the VN use
the gateway for outbound traffic" is a known limitation for compliance-sensitive
deployments, and propose adding an optional subnet scoping field (e.g., an
optional subnet reference field on NATGatewaySpec) to allow per-subnet NAT
association; reference NATGateway, VirtualNetwork, and NATGatewaySpec in the
note and include a short migration/operational workaround recommending separate
VirtualNetworks or VN peering until per-subnet scoping is implemented.
- Around line 304-337: The NATGatewaySpec.public_ip field must be treated as
immutable to avoid runtime IP switches; update the NATGateway documentation and
schema to mark spec.public_ip as immutable (consistent with
PublicIPAttachment.public_ip) and enforce this in the control plane by adding a
validation rule in the CRD or admission webhook that rejects updates which
change NATGateway.spec.public_ip (use the NATGateway and NATGatewaySpec
identifiers when locating the code that creates validation or webhook logic).
- Around line 431-443: Update the "DNS" section to remove ambiguity by
explicitly stating that DNS is not an exposed networking API resource but is
handled by service provisioning logic (e.g., "CaaS template roles create DNS
records")—keep both sentences and rephrase so they read together (DNS is managed
by service integration, not the networking API; CaaS workflow creates records
via templates), and add a short caveat that this boundary may change if DNS
becomes a networking API feature in future to avoid technical debt; reference
the "DNS" heading and the "CaaS workflow"/"CaaS template roles create DNS
records" phrasing when making the edit.
- Around line 374-391: Document and implement explicit prerequisite validation
for CaaS: specify that Cluster admission uses an admission webhook (and optional
controller reconciliation) to validate Cluster.spec.network_attachments and
network.pod_cidr/service_cidr; require Cluster names be unique within the tenant
namespace (OSAC Metadata namespace) and ensure PublicIPAttachment resources with
target=<cluster-name> and target_endpoint="api"/"ingress" plus a NATGateway on
the referenced VirtualNetwork exist before admitting the Cluster; mark
referenced PublicIPAttachment/NATGateway resources immutable (add
finalizer/deletion protection while Cluster exists) and enforce CIDR non-overlap
between network.pod_cidr/service_cidr and any subnet CIDRs in
network_attachments.
- Around line 280-303: The PublicIPAttachmentSpec validation is incomplete:
update validation (protobuf rules or the admission webhook) to (1) require
target_endpoint when the oneof target is cluster (i.e., when cluster field in
PublicIPAttachmentSpec is set), (2) restrict target_endpoint to a closed enum
with only "api" and "ingress" values, and (3) add scope compatibility checks
that reject attachments that mismatch network scope (e.g., prevent
compute_instance from being attached to a VN that is marked baremetal-scoped or
other incompatible combinations involving baremetal_instance and cluster).
Ensure these checks run where PublicIPAttachmentSpec is validated so the oneof
(compute_instance, cluster, baremetal_instance) and target_endpoint semantics
are enforced atomically.
- Around line 444-452: Update the Risks and Mitigations table to add four rows
covering the missing major issues: (1) PublicIPAttachment cluster target
ambiguity — reference the PublicIPAttachment resource and its target_endpoint
field and state "Make target_endpoint required when target=cluster; validate
enum values {\"api\",\"ingress\"} at admission" as the mitigation; (2)
NATGateway public_ip mutability — reference the NATGateway resource and its
public_ip field and mark public_ip immutable with "require delete/recreate for
IP changes" as the mitigation; (3) CaaS cluster name TOCTOU — reference cluster
name targeting in PublicIPAttachment and require cluster name uniqueness per
tenant/namespace and scope compatibility validation; (4) Cluster CIDR overlap —
reference pod_cidr/service_cidr and network_attachments subnet CIDRs and add
validation to prevent overlap. Ensure each new row mirrors the existing table
format and uses the exact identifiers: PublicIPAttachment, target_endpoint,
target=cluster, NATGateway, public_ip, pod_cidr, service_cidr, and
network_attachments.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: osac-project/coderabbit/.coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 4dd1bbea-0ade-40ab-8fb8-943d1a80c983
📒 Files selected for processing (2)
enhancements/unified-networking-prd/README.mdenhancements/unified-networking/README.md
| string region = 1; // required, immutable | ||
| string ipv4_cidr = 2; // optional, immutable | ||
| string ipv6_cidr = 3; // optional, immutable | ||
| string scope = 4; // NEW: "compute", "cluster", or "baremetal" |
There was a problem hiding this comment.
This should be enum, with the accepted values and their documentation.
| message ClusterNetworkSpec { | ||
| string pod_cidr = 1; // default: 10.128.0.0/14 | ||
| string service_cidr = 2; // default: 172.30.0.0/16 | ||
| } |
There was a problem hiding this comment.
Why does this need to be exposed to the user? Can't we internally decide what values to use?
There was a problem hiding this comment.
We can, this is optional in case the user want to set these values to something else
There was a problem hiding this comment.
like assisted lets the user set it
There was a problem hiding this comment.
I understand. But we try (or tried) to keep things simple in the API. Unless there are users already asking for this we shouldn't expose it in the API.
| string cluster = 3; // Cluster ID (new) | ||
| string baremetal_instance = 4; // BaremetalInstance ID (new) | ||
| } | ||
| string target_endpoint = 5; // optional: for Cluster, specifies "api" or "ingress" |
| The design must accommodate future mixed-scope VirtualNetworks (VMs and BM on | ||
| the same network) without requiring API changes. |
There was a problem hiding this comment.
This answers my previous question, but I think this explanation should be earlier in the doc.
| 1. **Create VirtualNetwork** (`--scope baremetal`) → resolves `fabric-managed` → Netris creates VPC (VRF) | ||
| 2. **Create Subnet** → inherits `network: netris` → Netris creates V-Net + IPAM subnet | ||
| 3. **Create SecurityGroup** → reads `acl: netris` → Netris ACL rules on fabric | ||
| 4. **Create BaremetalInstance** with `network_attachments` → operator dispatches to AAP → HostLease CR → baremetal fulfillment component provisions server (switch port config, IP assignment, ACL rules) |
There was a problem hiding this comment.
How is the logic to actually get a host attached to a network defined? Does that end up just being some shared role in aap?
| 1. **Create VirtualNetwork** (`--scope baremetal`) → resolves `fabric-managed` → Netris creates VPC (VRF) | ||
| 2. **Create Subnet** → inherits `network: netris` → Netris creates V-Net + IPAM subnet | ||
| 3. **Create SecurityGroup** → reads `acl: netris` → Netris ACL rules on fabric | ||
| 4. **Create BaremetalInstance** with `network_attachments` → operator dispatches to AAP → HostLease CR → baremetal fulfillment component provisions server (switch port config, IP assignment, ACL rules) |
There was a problem hiding this comment.
Also are network_attachments mutable? I imagine this depends on the backend?
@trewest mentioned something about NICo not being able to change machine networks once they are created, is that a common limitation?
I think it's hard to reason about this if netris is the only backend we know about right now. Are there others that can inform these kinds of choices?
tzumainn
left a comment
There was a problem hiding this comment.
Thanks for writing this up! I left a few comments/questions.
| - **BMaaS** (Bare Metal as a Service): Provisions physical bare-metal | ||
| servers. The API resource is `BaremetalInstance` (defined in | ||
| [PR #45](https://github.com/osac-project/enhancement-proposals/pull/45), | ||
| unmerged). |
There was a problem hiding this comment.
Is there (currently) a distinction between CaaS and BMaaS? If clusters are installed on bare metal, then the former would just be a subset of the latter. I know that down below you describe different workflows for each - but won't the added capabilities needed for CaaS also theoretically be available for BMaaS? Are there cases where someone may want to build out their cluster infrastructure with additional bare metal hosts running other bits of infrastructure?
There was a problem hiding this comment.
The way CaaS will consume BMaaS / VMaaS (if we will run on vms as well) is still not defined very well and probably require an enhancement of its own. In the networking perspective, CaaS requires some things that BMaaS not - for example public ips for SNAT / DNAT to ingress / API
There was a problem hiding this comment.
Those public IPs are pointing at internal IPs on the virtual network's subnet, correct? If so, that's a feature that BMaaS needs as well.
| ```protobuf | ||
| message NetworkAttachment { | ||
| string subnet = 1; // Subnet ID, required, immutable | ||
| repeated string security_groups = 2; // SecurityGroup IDs, optional, mutable |
There was a problem hiding this comment.
This might be obvious if I knew more about how this works for ComputeInstances, but how does this attachment know which host it's associated with?
Would this be the place to specify which bare metal port a network is/should be attached to?
There was a problem hiding this comment.
There's a new proposed field in BaremetalInstanceSpec just above this.
There was a problem hiding this comment.
Whoops, missed that! I think my question regarding bare metal ports stands, though - at the MOC there are cases where we have to attach network to specific bare metal ports that have specific characteristics.
There was a problem hiding this comment.
Right so I guess you're asking if we need another abstraction that includes a MAC or something to indicate which interface maps to which subnet? Good question unless I'm missing something.
There was a problem hiding this comment.
Yep, that's exactly what I mean!
There was a problem hiding this comment.
I think this is addressed by the optional interface field in NetworkAttachments but I have questions in those relevant sections about how that will work.
vladikr
left a comment
There was a problem hiding this comment.
Thank you! Looks like a great step forward!
| and other resources. Both service types build ad-hoc networking outside the | ||
| API. | ||
|
|
||
| ### 2. Tenants must choose networking backends |
There was a problem hiding this comment.
Yeah, indeed.
One approach I'm thinking about is to make NetworkClass the main extension point (cluster-scoped), where it declares capabilities (I support: L2, L3, DPU support, etc.) and points to a driver implementation.
The osac-operator would then act as a dispatcher:
when reconciling a VirtualNetwork, it reads the NetworkClass and selects the right driver (CUDN/Netris/physical backends, and more to come).
This way tenants don't choose backends but they just get the right implementation based on what the provider configured in the NetworkClass.
I think it would also make it much easier to add new backends in the future without changing the API.
There was a problem hiding this comment.
Discussion continued here - https://github.com/osac-project/enhancement-proposals/pull/50/changes#r3350027836
| for PublicIP, and Netris for DNAT. The current design has no way to express | ||
| this composition — each resource independently declares its backend with no | ||
| validation that the combination is coherent. | ||
|
|
There was a problem hiding this comment.
This gap is pretty much related to how we model the NetworkClass.
If we will turn the NetworkClass into a self-describing object that advertises its capabilities (L2/L3, Public IPs, ...) and declares which driver implements it, then the operator can both discover capabilities and route work to the correct driver.
That was my thinking with the above comment.
This would also help with composition. Instead of every resource declaring its own implementation_strategy, the driver (referenced via NetworkClass) can be responsible for coordinating the right backends for VirtualNetwork, Subnet, SecurityGroup, and so on.
There was a problem hiding this comment.
I M curios to understand how is this sill look like, can you perhaps share an example ?
There was a problem hiding this comment.
Sure. Similar to what we've discussed offline, but now with multiple drivers/backends, as you pointed out, via actions (as in this proposal)
The main idea is that the NetworkClass becomes the main extension point. It's provider-only and cluster-scoped.
- Provides capabilities that will help with discovery and future UI/validation.
- This replaces the need for a separate ServiceNetworkClass.
- Can map different actions to different drivers.
apiVersion: network.osac.io/v1alpha1
kind: NetworkClass
metadata:
name: k8s-on-fabric
spec:
description: "VMs on OVN with fabric-managed external access"
drivers:
network: cudn_net # VirtualNetwork + Subnet
acl: network_policy # SecurityGroup rules
externalAccess: netris # PublicIP, DNAT, SNAT
fabricBridge: metallb_l2 # Bridge OVN to physical fabric
capabilities:
supportsL2: true
supportsL3: true
supportsPublicIPs: true
supportsDPU: false
The VirtualNetwork CR:
- Tenants mostly use scope.
- They can optionally pin a specific networkClass if needed.
- The NetworkClass is resolved by the operator (via scope + region, or explicit reference).
apiVersion: network.osac.io/v1alpha1
kind: VirtualNetwork
metadata:
name: my-cluster-network
namespace: tenant-project
spec:
region: moc-region-1
scope: cluster # compute | cluster | baremetal
networkClass: k8s-on-fabric # optional – can be defaulted by scope + region
ipv4_cidr: 10.0.0.0/16
When any networking resource is created/updated, the Dispatcher in the osac-operator does this:
- Resolves the NetworkClass for the resource
- Looks at the action required by the resource:
- -VirtualNetwork / Subnet: action network
-
- SecurityGroup: action acl
-
- PublicIPAttachment / NATGateway: action externalAccess
- Calls the registered Driver for each action
- The Driver executes and updates status/conditions.
In terms of validation, we could have a simple wenhook that validates the structure, but it's the controller eventually that will validate whether a given driver supports an action and raise a relevant condition.
How we register the driver is an implementation detail, I think, but I can share my thoughts.
You proposal is already great, I just believe that we can simpify the both the mental model and have less CRDs.
With this the NetworkClass becomes the single place where providers define capabilities and select drivers.
There was a problem hiding this comment.
so if i understand you correctly you are saying:
- Drop the
NetworkBackendCRs - This means we don't have a way to specify whatactionsbackend support. - Drop the
ServiceNetworkClassCRs - The ones composing the drivers. - Use
NetworkClassto compose capabilities and drivers instead, instead of using oneNetwokClassuse one per scope and infer the right one by the scope.
There was a problem hiding this comment.
I was thinking that drivers wouldn’t need to be represented as CRDs.
A driver is basically an implementation (either a Go struct or a thin wrapper that triggers an AAP job).
There’s nothing to reconcile for the driver itself, so I don’t see a strong reason to introduce a NetworkBackend CRD just for registration.
I thought that each driver could declare the list of actions it supports.
This can be done in code (for Go drivers) or via a small metadata/config for ansible drivers.
Ideally, we could have a thing go layer that represents and registers the driver
Just a thought:
type NetworkDriver interface {
SupportedActions() []string
ReconcileVirtualNetwork(ctx, vn) error
ReconcileSubnet(ctx, subnet) error
ReconcileSecurityGroup(ctx, sg) error
....
And then the registration (for example, for cudn_driver.go):
func init() {
dispatcher.MustRegister("cudn_net", &CUDNDriver{})
}
The NetworkClass then maps actions to driver names:
drivers:
network: cudn_net
acl: network_policy
externalAccess: netris
fabricBridge: metallb_l2
Regarding discoverability, I agree it’s useful to know what drivers are available.
We could expose the registered drivers and their supported actions via the status of the operator, or come up with something else, instead of requiring a CRD per driver.
The idea of a NetworkBackend CRD is nice if it brings real value but I’d avoid it until we have more use cases (dynamic registration/pluggable)
| Tenants must not choose networking backends. The system must select the correct | ||
| backend based on the provider's infrastructure and the tenant's intent. |
There was a problem hiding this comment.
ah yes! +1
I think making NetworkClass provider-only and using it as the driver/capability selector (instead of something tenants pick) is the right direction.
The operator can then dispatch to the appropriate driver based on the NetworkClass.
There was a problem hiding this comment.
Same response as #50 (comment), lets chat about it
|
|
||
| **Key distinction:** `network_attachments` is the OSAC tenant network (where nodes physically sit). `network.pod_cidr`/`service_cidr` is the cluster-internal CNI — separate, non-overlapping address space. | ||
|
|
||
| #### BMaaS — End-to-End Flow |
There was a problem hiding this comment.
So does that mean that BareMetalPool will consume VirtualNetwork as well, right? :)
There was a problem hiding this comment.
AFAIU, BareMetalPool is not meant for the tenants, but for internal use when interacting with the fulfillment-service. The tenants will user BaremetalInstnace. Is that correct @adriengentil @tzumainn ?
There was a problem hiding this comment.
Ok, maybe I don't have a full understanding, but why not to reuse the same mechanism across the project? I mean, the fulfillment-service could create a VirtualNetowrk + Subnet and BareMetalPool will consume it?
There was a problem hiding this comment.
I might be missing something, but my thinking was that we want to make networking a reusable layer across VMaaS, CaaS, and BMaaS, which aligns with the direction of the unified networking proposal.
Instead of having BareMetalPool create networking internally via pool-network-create, I think we should move toward BareMetalPool consuming a VirtualNetwork (and Subnet) that was created either explicitly by the tenant or automatically by a higher-level component (such as the fulfillment-service or ClusterOrder Operator).
The BareMetalPool would then focus on host selection, attachment, and lifecycle, while delegating network creation/provisioning to the networking layer.
There was a problem hiding this comment.
@vladikr You are raising good points which weren't mentioned in the doc. I am not entirely sure what are the plans for BMaaS, if I would have to take a guess I would assume - BaremetalInstance -> HostLease / Host. The BareMetalPool controller will need to adapt to using the given network resources rather than creating them. cc @tzumainn @carbonin What do you think ?
There was a problem hiding this comment.
There is no issue with the BareMetalPool controller adapting to that workflow; the current network creation step is simply defined as a template that's passed in by the profile, and it's not a problem to create a new template that doesn't do that.
230e3b4 to
2ca0fda
Compare
There was a problem hiding this comment.
Actionable comments posted: 5
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@enhancements/unified-networking-prd/README.md`:
- Line 203: In the README sentence "VMs use OVN (Layer2 UDN) for isolation — a
software-defined overlay on the", change the networking term "Layer2" to the
spaced, standards-consistent form "Layer 2" so the line reads "VMs use OVN
(Layer 2 UDN) for isolation — a software-defined overlay on the"; update that
exact phrase wherever it appears to ensure consistent protocol-layer naming.
- Around line 324-335: The acceptance criterion currently labeled R1 in the
"Acceptance criteria" section is too strict compared to the Goals note; update
the README.md acceptance criteria by scoping R1 to the current phase (e.g., "v1
/ current phase") and either remove or rephrase the sentence "VMs in the same
Subnet are in the same broadcast domain regardless of which hypervisor node or
OpenShift cluster they run on" to explicitly state OVN-scoped behavior (or mark
it as a future/fabric-integration requirement). Locate the "Acceptance criteria"
block and the "Goals" note references to ensure consistency, and if helpful
split the list into "v1 (current scope)" vs "future scope" so validation tests
target only the deliverable behavior.
In `@enhancements/unified-networking/README.md`:
- Around line 304-307: The comment is ambiguous about enum semantics: update the
README validation text for the PublicIPAttachmentEndpoint field
`target_endpoint` to use proto3 enum semantics—specifically state that
`target_endpoint` MUST be UNSPECIFIED for `target=compute_instance` and
`target=baremetal_instance`, and MUST NOT be UNSPECIFIED for `target=cluster`
(i.e., reject if `target=cluster` and `target_endpoint==UNSPECIFIED`); apply the
same clear wording to the other occurrence on lines 310-311 so controllers
validate against UNSPECIFIED rather than an impossible “empty” value.
- Around line 624-631: The example CLI and docs currently permit
tenant-specified host class via the `--node-set` flag (e.g., `--node-set
workers=large,size=3`), which violates the immutable `host_class` contract;
update the CLI parsing/validation for `--node-set` to reject or ignore any
`host_class`/host-class token and only accept `size` (and other allowed
overrides), and update the README example to remove `workers=large` so it only
shows `workers=size=3`; reference the `--node-set` flag, the `host_class` term,
and the contract documented in `enhancements/caas/README.md` when making
changes.
- Line 43: Add a clear, explicit security statement in the README that
authentication, authorization, and tenant resource-scoping are handled centrally
by the Fulfillment Service as a cross-cutting concern and not by individual
modules; update sections referencing NetworkBackend, ServiceNetworkClass,
NetworkClass, and VirtualNetwork to state that the Fulfillment Service enforces
tenant ownership, RBAC/policies, and request-level authentication for all
Fulfillment validation flows, and require downstream modules to rely on those
assurances rather than reimplementing local access checks; apply the same
clarification to the other referenced sections (around lines 400-401 and
635-640) so the document consistently mandates centralized authn/authz and
tenant-scoping ownership.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: osac-project/coderabbit/.coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: ac13e8e2-a23e-48ca-80cc-9069159169ae
📒 Files selected for processing (2)
enhancements/unified-networking-prd/README.mdenhancements/unified-networking/README.md
| **Acceptance criteria:** | ||
| - Resources in different VirtualNetworks cannot communicate (full isolation) | ||
| - Resources in the same Subnet are in the same L2 broadcast domain (can | ||
| communicate at Layer 2) | ||
| - Resources in different Subnets within the same VirtualNetwork can | ||
| communicate via Layer 3 routing | ||
| - SecurityGroups control which traffic is permitted within these boundaries | ||
| - Bare-metal servers in the same Subnet are in the same broadcast domain | ||
| regardless of their physical location (rack, switch) | ||
| - VMs in the same Subnet are in the same broadcast domain regardless of | ||
| which hypervisor node or OpenShift cluster they run on | ||
|
|
There was a problem hiding this comment.
Major: R1 acceptance criteria are stricter than the documented current architecture, creating a release-gating mismatch.
R1 currently reads as an immediate requirement (“VMs in the same Subnet … regardless of which hypervisor node or OpenShift cluster they run on”), but the Goals note states full VM/fabric integration is future work and current isolation remains OVN-scoped. This makes acceptance ambiguous and high-risk for delivery sign-off. Please scope R1 explicitly to what is required in this phase (or split into v1 vs future criteria) so validation is testable and non-contradictory.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@enhancements/unified-networking-prd/README.md` around lines 324 - 335, The
acceptance criterion currently labeled R1 in the "Acceptance criteria" section
is too strict compared to the Goals note; update the README.md acceptance
criteria by scoping R1 to the current phase (e.g., "v1 / current phase") and
either remove or rephrase the sentence "VMs in the same Subnet are in the same
broadcast domain regardless of which hypervisor node or OpenShift cluster they
run on" to explicitly state OVN-scoped behavior (or mark it as a
future/fabric-integration requirement). Locate the "Acceptance criteria" block
and the "Goals" note references to ensure consistency, and if helpful split the
list into "v1 (current scope)" vs "future scope" so validation tests target only
the deliverable behavior.
| ```bash | ||
| osac create cluster \ | ||
| --template ocp_4_17_small \ | ||
| --network-attachment subnet=cluster-nodes-subnet,security-groups=cluster-sg \ | ||
| --node-set workers=large,size=3 \ | ||
| -p pull_secret="$(cat pull-secret.json)" \ | ||
| -p ssh_public_key="$(cat ~/.ssh/id_rsa.pub)" \ | ||
| --name my-cluster |
There was a problem hiding this comment.
Major security/contract drift: cluster workflow appears to allow tenant host-class override.
--node-set workers=large,size=3 implies tenant-controlled host class selection, which conflicts with the CaaS contract where template-defined node set host_class is immutable. This can lead to unauthorized infrastructure shape changes and inconsistent scheduler/provisioning behavior across services.
Based on learnings: in enhancements/caas/README.md, template-defined node-set host_class is immutable and tenants may only override size.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@enhancements/unified-networking/README.md` around lines 624 - 631, The
example CLI and docs currently permit tenant-specified host class via the
`--node-set` flag (e.g., `--node-set workers=large,size=3`), which violates the
immutable `host_class` contract; update the CLI parsing/validation for
`--node-set` to reject or ignore any `host_class`/host-class token and only
accept `size` (and other allowed overrides), and update the README example to
remove `workers=large` so it only shows `workers=size=3`; reference the
`--node-set` flag, the `host_class` term, and the contract documented in
`enhancements/caas/README.md` when making changes.
| configure NetworkClasses, and manage PublicIPPools. Tenants do not see | ||
| provider-level configuration. | ||
|
|
||
| - **Region**: A management boundary representing a deployment location (e.g., |
There was a problem hiding this comment.
Keep in mind that the did not agree yet on what is a region in the scope of OSAC: #20
There was a problem hiding this comment.
I understand, Region was in the old enhancement so I kept it here as a datacenter concept. If we will need to change it, I will
|
|
||
| ### R9: Future mixed-scope VNs without API changes | ||
|
|
||
| The design must accommodate future mixed-scope VirtualNetworks (VMs and BM on |
There was a problem hiding this comment.
Be aware that VMs might be running on a dedicated cluster. Today, the UDN implementation is aware of that and creates the network pieces in that dedicated cluster, so depending on your plans, you might have to think about this use case.
https://github.com/osac-project/osac-operator/blob/main/docs/vmaas-dedicated-cluster/README.md
There was a problem hiding this comment.
Yes, this suggestion mentions in its goals that the virtual network / subnet concepts must apply regardless of which cluster / node the VM run in (e.g. the VMs multi-tenancy need to participate in the fabric)
mhrivnak
left a comment
There was a problem hiding this comment.
In general there are a lot of technical designs and decisions baked into the PRD. We should be as disciplined as we can to keep PRDs focused on requirements in terms of user story outcomes without prescribing how to implement them.
|
|
||
| - **OVN / UDN**: Open Virtual Network / User Defined Network. The | ||
| software-defined networking layer in OpenShift that provides VM isolation. | ||
| VMs on OVN have internal IPs that are not visible on the physical fabric. |
There was a problem hiding this comment.
not visible by default, but can be made visible via bgp and vfr-lite from a CUDN.
There was a problem hiding this comment.
This is mentioned elsewhere in the doc
| ### 7. VMs and bare metal cannot share a network | ||
|
|
||
| VMs use OVN (Layer2 UDN) for isolation — a software-defined overlay on the | ||
| OpenShift cluster. Bare-metal servers use physical VLANs configured on |
There was a problem hiding this comment.
usually vxlans used together with vlans as a tagging mechanism
|
|
||
| ### 9. CaaS has unique prerequisite ordering | ||
|
|
||
| Cluster worker nodes reach the hosted control plane API server via hairpin |
There was a problem hiding this comment.
Is there a more complete description somewhere of this network setup?
There was a problem hiding this comment.
| exposing implementation details to tenants | ||
| - As a provider, I want to compose different backends for different networking | ||
| functions (isolation, ACLs, external access, fabric bridging) | ||
| - As a provider, I want to register new backends by deploying NetworkBackend |
There was a problem hiding this comment.
Prescribing the specific solution doesn't work so well in a user story. "As a user, I want to use the thing proposed in this document" doesn't add value. I would leave it at just "As a provider, I need to register backends."
| of choosing implementation details | ||
| - Define clear direction for PublicIPAttachment (inbound DNAT) and NATGateway | ||
| (outbound SNAT) | ||
| - Support air-gapped environments where "Public" IPs are data-center-routable, |
There was a problem hiding this comment.
Even in a connected environment, having IPs that are routable within a shared provider network enables one tenant to offer services to other tenants. I wouldn't over-complicate things right now by trying to have both public and semi-public IPs, but it would be reasonable to keep that in mind. I think it would be better to at least keep those concepts distinct and not necessarily assume that "public IPs" can easily become non-public when there's no internet connectivity.
There was a problem hiding this comment.
I'm not sure OSAC should care how "public" the IPs are. The provider defines the pools. Where the IPs reside doesn't change anything. They are public because they are not within the VirtualNetwork to which the network interface is attached.
| - ComputeInstance, Cluster, and BaremetalInstance all have a `network_attachments` | ||
| field using the same `NetworkAttachment` message | ||
| - PublicIPAttachment supports all three as targets | ||
| - The tenant workflow for creating networking resources is identical regardless |
There was a problem hiding this comment.
This would be nice to have, but might not be practical. For example a cluster may need a load balancer as the endpoint where an IP gets attached.
There was a problem hiding this comment.
Load Balancer is out of scope at the moment, we will add it in the future
|
|
||
| **Acceptance criteria:** | ||
| - NetworkClass is not exposed in the tenant API | ||
| - VirtualNetwork has a `scope` field that describes what the network is for |
There was a problem hiding this comment.
let's avoid prescribing specific API changes in PRDs. We can discuss that in the solution doc.
| **Acceptance criteria:** | ||
| - The bridge plugin is specified in the provider configuration, not by tenants | ||
| - The operator orchestrates bridging and external access as independent steps | ||
| - Backend roles are atomic — no cross-dependencies between roles |
There was a problem hiding this comment.
I'm not sure how this relates to the rest of this section.
| - "Public" IP means external to the VirtualNetwork, not internet-routable | ||
| - The API and workflow are identical for internet-facing and air-gapped | ||
| deployments | ||
| - CaaS clusters can provision using data-center-routable IPs for API server |
There was a problem hiding this comment.
why? Why would we not always default to CaaS using tenant-private IPs, and they're only exposed outside the tenant's network if they opt-in to attaching an IP?
There was a problem hiding this comment.
We will, but this was the implementation we decided on when implementing netris, so I have mentioned it here. I will remove it from the prd
|
|
||
| ### R7: CaaS prerequisite support | ||
|
|
||
| CaaS must be able to create PublicIPAttachments before the cluster exists, so |
There was a problem hiding this comment.
Why would a cluster require a public IP?
There was a problem hiding this comment.
limitation of the current netris implementation, hosted cluster nodes reach hostinh cluster via SNAT / DNAT instead of VPC peering
There was a problem hiding this comment.
This shouldn't require a public ip for the mere reason that you don't want to force the user to take an address from the pool unless they actually need it
ygalblum
left a comment
There was a problem hiding this comment.
I've only reviewed the PRD for now.
|
|
||
| - As a tenant, I want to place my cluster's worker nodes on a Subnet in my | ||
| VirtualNetwork | ||
| - As a tenant, I want to attach PublicIPs to my cluster's API server and |
There was a problem hiding this comment.
Do we mandate all or nothing here? Can the tenant attach a PublicIP only to the ingress but not to the API server?
|
|
||
| ### BMaaS-Specific Stories | ||
|
|
||
| - As a tenant, I want to place my BaremetalInstance on a Subnet in my |
There was a problem hiding this comment.
This sentence is written in singular. What about multiple network interfaces? Can a single instance be placed in multiple VirtualNetworks?
Once we support multiple, there will need to be a way to match the physical interface with the attached network.
This also affects the next item. If there are multiple interfaces, to which one am I attaching the public ip?
These questions are also relevant for VMaaS.
There was a problem hiding this comment.
I see the rest of the document talks about interfaces and networks in plural. So, again, a small nit. on a Subnet -> on Subnets.
Also, can different interfaces be connected to Subnets in different VirtualNetwork? This sentence should clarify the range of options
There was a problem hiding this comment.
Fixed. Regarding cross-VN: all subnets in a multi-attachment must belong to the same VirtualNetwork — clarified in the validation rules.
| physical location (rack, switch) | ||
| - These guarantees must hold for VMs regardless of which hypervisor node | ||
| or OpenShift cluster they run on — two VMs on the same Subnet must be | ||
| in the same broadcast domain even if they are on different nodes |
There was a problem hiding this comment.
WDYM by different nodes. What are nodes in this context?
| of choosing implementation details | ||
| - Define clear direction for PublicIPAttachment (inbound DNAT) and NATGateway | ||
| (outbound SNAT) | ||
| - Support air-gapped environments where "Public" IPs are data-center-routable, |
There was a problem hiding this comment.
I'm not sure OSAC should care how "public" the IPs are. The provider defines the pools. Where the IPs reside doesn't change anything. They are public because they are not within the VirtualNetwork to which the network interface is attached.
| - Bare-metal servers in the same Subnet are in the same broadcast domain | ||
| regardless of their physical location (rack, switch) | ||
| - VMs in the same Subnet are in the same broadcast domain regardless of |
There was a problem hiding this comment.
You list these two separately. Is this because for the first version they will not share a network?
| **Acceptance criteria:** | ||
| - PublicIPAttachment handles inbound traffic only (DNAT) | ||
| - NATGateway handles outbound traffic only (SNAT) | ||
| - A resource needing both directions requires both resources explicitly |
There was a problem hiding this comment.
Does this mean that unless explicitly attached, the resource (e.g. VM) will not have any egress? Isn't this a network level configuration? Or maybe I'm misunderstanding the NATGateway resource.
|
|
||
| ### R7: Air-gapped environment support | ||
|
|
||
| The API must work in air-gapped environments where IPs are data-center-routable |
There was a problem hiding this comment.
As I mentioned in the goals section, I don't think OSAC should care about it. Furthermore, the environment might not be air-gapped but the "public" IPs may still not be routable from the internet but rather only from the intranet. Think about services you need VPN connection to access but they themselves can access the internet.
|
|
||
| ### R7: CaaS prerequisite support | ||
|
|
||
| CaaS must be able to create PublicIPAttachments before the cluster exists, so |
There was a problem hiding this comment.
This shouldn't require a public ip for the mere reason that you don't want to force the user to take an address from the pool unless they actually need it
2ca0fda to
25751d2
Compare
There was a problem hiding this comment.
Actionable comments posted: 2
♻️ Duplicate comments (4)
enhancements/unified-networking-prd/README.md (1)
310-321:⚠️ Potential issue | 🟠 MajorFix R1 acceptance criteria scope to match v1 single-hosting-cluster reality
R1 currently requires that VMs in the same Subnet share the same L2 broadcast domain “regardless of which hypervisor node or cluster they run on” (README.md lines 319-320), but the companion design states “V1 assumes one hosting cluster per region” and that multi-hosting-cluster support is a future enhancement. This makes the “or cluster” portion non-testable for v1.
Risk (medium): ambiguous validation can lead to tenants expecting cross-cluster L2 reachability/isolation guarantees that v1 doesn’t promise.
Scope the acceptance criteria explicitly to the single hosting cluster for v1 (remove/guard the “or cluster” text), or split into v1 vs future multi-cluster requirements.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@enhancements/unified-networking-prd/README.md` around lines 310 - 321, Update the R1 acceptance criteria in the README to reflect v1 single-hosting-cluster scope: locate the Acceptance criteria block and change the line that currently reads "VMs in the same Subnet are in the same broadcast domain regardless of which hypervisor node or cluster they run on" so it either removes "or cluster" or is split into two bullets (one explicit v1 statement limited to the single hosting cluster and a separate future enhancement bullet for multi-cluster reachability); ensure you reference the R1 acceptance criteria wording in the update so tests/QA validate the single-cluster promise only for v1.enhancements/unified-networking/README.md (3)
37-43:⚠️ Potential issue | 🟠 Major | 🏗️ Heavy liftMajor security gap: missing explicit authentication, authorization, and tenant-scoping ownership statement.
The document describes provider/tenant workflows and validation but does not explicitly state that authentication, authorization, and tenant resource-scoping are handled centrally by the Fulfillment Service as cross-cutting concerns. Without this, downstream modules may implement inconsistent or missing access controls.
Add a clear security statement (e.g., in a new "Security Model" subsection after line 43 or in the "Proposal" section) that:
- Authentication, authorization, and tenant resource-scoping are enforced by the Fulfillment Service layer for all networking resources
- Individual networking modules (NetworkClass, VirtualNetwork, PublicIP, etc.) rely on Fulfillment Service assurances rather than implementing local access checks
- Tenant ownership validation for all resources is performed at the Fulfillment Service API boundary
Based on learnings: OSAC enhancement proposals should document authn/authz/resource scoping as Fulfillment Service cross-cutting concerns and avoid module-local security redefinition.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@enhancements/unified-networking/README.md` around lines 37 - 43, Add a new "Security Model" subsection (e.g., after the existing intro paragraph or inside the "Proposal" section) that explicitly states: authentication, authorization, and tenant resource-scoping are enforced centrally by the Fulfillment Service; individual networking modules and resources (e.g., NetworkClass, VirtualNetwork, PublicIP, BaremetalInstance) must rely on Fulfillment Service assurances and must not implement independent access checks; and that tenant ownership validation for all resources is performed at the Fulfillment Service API boundary as a cross-cutting concern. Ensure the statement is unambiguous about who enforces authn/authz, that module implementations assume those guarantees, and that ownership checks occur at the Fulfillment Service layer. Include a short note advising contributors to avoid module-local security redefinition.Source: Learnings
582-584:⚠️ Potential issue | 🟠 Major | ⚡ Quick winPotential CaaS contract violation: CLI example may allow tenant host_class override.
The example
--node-set workers=large,size=3is ambiguous. Ifworkers=largeallows tenants to specify or override thehost_classof a node set, this violates the CaaS immutability contract where template-defined node-sethost_classis immutable and tenants may only overridesize.Clarify the CLI syntax:
- If
workers=largemeans "node-set named 'workers' using host_class 'large'", remove thelargeparameter and show only--node-set workers,size=3or--node-set workers=size:3- If
largeis a different parameter (not host_class), document the syntax clearly to avoid confusionBased on learnings: In the CaaS enhancement proposal, template-defined node-set
host_classis immutable and cannot be overridden by tenants at cluster creation time. Tenants may only overridesize.🔒 Proposed fix if this is a host_class override
osac create cluster --template ocp_4_17_small \ --network-attachment subnet=cluster-subnet,security-groups=cluster-sg \ - --node-set workers=large,size=3 --name my-cluster + --node-set workers,size=3 --name my-cluster🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@enhancements/unified-networking/README.md` around lines 582 - 584, The CLI example using "--node-set workers=large,size=3" is ambiguous about whether "large" overrides the template-defined host_class (which would violate the CaaS immutability contract); update the docs to disallow host_class overrides and show only size overrides for node-sets (i.e., for the node-set named "workers" allow specifying only size), or explicitly rename the parameter if "large" is a different field (e.g., host_class=... vs size=...), and document the exact syntax for the --node-set flag and the immutable host_class vs overridable size behavior.Source: Learnings
412-414:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winIncorrect enum validation semantics in comment.
The comment states
target_endpoint"must be empty otherwise" (for non-cluster targets), but proto3 enums cannot be "empty" — the zero value isUNSPECIFIED. The validation logic should state:target_endpointMUST beUNSPECIFIEDwhentargetiscompute_instanceorbaremetal_instance, and MUST NOT beUNSPECIFIEDwhentargetiscluster.This ensures controllers validate against the correct proto3 enum semantics.
📝 Proposed fix for validation comment
PublicIPAttachmentEndpoint target_endpoint = 5; - // Required when target=cluster; must be empty otherwise. + // Required (non-UNSPECIFIED) when target=cluster; + // must be UNSPECIFIED when target=compute_instance or baremetal_instance. }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@enhancements/unified-networking/README.md` around lines 412 - 414, Update the comment for the PublicIPAttachmentEndpoint field `target_endpoint` to reflect proto3 enum semantics: state that when `target` is `cluster` the `target_endpoint` MUST NOT be `UNSPECIFIED` (i.e. must be set), and when `target` is `compute_instance` or `baremetal_instance` the `target_endpoint` MUST be `UNSPECIFIED` (i.e. must be the enum zero value). Refer to `target_endpoint`, `PublicIPAttachmentEndpoint`, and the `target` enum values `cluster`, `compute_instance`, and `baremetal_instance` when describing the required validation.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@enhancements/unified-networking-prd/README.md`:
- Around line 397-404: Acceptance criteria for PublicIPAttachment pending
activation lack authorization checks; update the spec and implementation so that
when a PublicIPAttachment in Pending state is activated (i.e., when the target
resource is created) the system validates tenant ownership, namespace scope, and
RBAC permissions before binding the PublicIP. Concretely, add an acceptance
criterion like "Pending PublicIPAttachments must validate tenant ownership,
namespace scope, and RBAC permissions at activation time" and ensure the CaaS
template role activation flow (the component that "activates pending
attachments") performs these checks against the target resource and
tenant/namespace metadata before changing state from Pending to Active to
prevent cross-tenant or unauthorized attachments and dangling/leaked PublicIPs.
In `@enhancements/unified-networking/README.md`:
- Around line 455-462: Add a blank line before the fenced code block that starts
with the triple backticks and contains the "osac admin create publicippool"
command in the README (the fenced code block beginning at the line with
"```bash"); update the markdown so there is an empty line between the preceding
paragraph ("3. Provider creates PublicIPPool:") and the opening ```bash to
satisfy MD031 linting.
---
Duplicate comments:
In `@enhancements/unified-networking-prd/README.md`:
- Around line 310-321: Update the R1 acceptance criteria in the README to
reflect v1 single-hosting-cluster scope: locate the Acceptance criteria block
and change the line that currently reads "VMs in the same Subnet are in the same
broadcast domain regardless of which hypervisor node or cluster they run on" so
it either removes "or cluster" or is split into two bullets (one explicit v1
statement limited to the single hosting cluster and a separate future
enhancement bullet for multi-cluster reachability); ensure you reference the R1
acceptance criteria wording in the update so tests/QA validate the
single-cluster promise only for v1.
In `@enhancements/unified-networking/README.md`:
- Around line 37-43: Add a new "Security Model" subsection (e.g., after the
existing intro paragraph or inside the "Proposal" section) that explicitly
states: authentication, authorization, and tenant resource-scoping are enforced
centrally by the Fulfillment Service; individual networking modules and
resources (e.g., NetworkClass, VirtualNetwork, PublicIP, BaremetalInstance) must
rely on Fulfillment Service assurances and must not implement independent access
checks; and that tenant ownership validation for all resources is performed at
the Fulfillment Service API boundary as a cross-cutting concern. Ensure the
statement is unambiguous about who enforces authn/authz, that module
implementations assume those guarantees, and that ownership checks occur at the
Fulfillment Service layer. Include a short note advising contributors to avoid
module-local security redefinition.
- Around line 582-584: The CLI example using "--node-set workers=large,size=3"
is ambiguous about whether "large" overrides the template-defined host_class
(which would violate the CaaS immutability contract); update the docs to
disallow host_class overrides and show only size overrides for node-sets (i.e.,
for the node-set named "workers" allow specifying only size), or explicitly
rename the parameter if "large" is a different field (e.g., host_class=... vs
size=...), and document the exact syntax for the --node-set flag and the
immutable host_class vs overridable size behavior.
- Around line 412-414: Update the comment for the PublicIPAttachmentEndpoint
field `target_endpoint` to reflect proto3 enum semantics: state that when
`target` is `cluster` the `target_endpoint` MUST NOT be `UNSPECIFIED` (i.e. must
be set), and when `target` is `compute_instance` or `baremetal_instance` the
`target_endpoint` MUST be `UNSPECIFIED` (i.e. must be the enum zero value).
Refer to `target_endpoint`, `PublicIPAttachmentEndpoint`, and the `target` enum
values `cluster`, `compute_instance`, and `baremetal_instance` when describing
the required validation.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: osac-project/coderabbit/.coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 82e3bc8d-0b75-4eec-825e-8a5433968944
📒 Files selected for processing (2)
enhancements/unified-networking-prd/README.mdenhancements/unified-networking/README.md
| **Acceptance criteria:** | ||
| - PublicIPAttachments can target a resource that doesn't exist yet (Pending | ||
| state) | ||
| - Cluster creation validates that a PublicIPAttachment for the API endpoint | ||
| and a NATGateway exist (ingress attachment is optional — can be added | ||
| after provisioning) | ||
| - The CaaS template role activates pending attachments at the right time | ||
|
|
There was a problem hiding this comment.
R8 acceptance criteria missing security validation for pending PublicIPAttachments.
Line 398-399 allows PublicIPAttachments to target resources that don't exist yet (Pending state). The acceptance criteria do not specify ownership or authorization validation when pending attachments activate. Without proper validation:
- A tenant could create a pending attachment for a resource ID they don't own, then activate it when the actual owner creates that resource.
- Dangling pending attachments could leak PublicIPs or create billing/quota issues.
- Cross-tenant or cross-namespace attachment could occur if namespace/RBAC validation is missing at activation time.
Risk severity: Major — Missing authorization validation on pending resource attachment creates potential privilege escalation and resource leakage vulnerabilities.
Recommend adding acceptance criterion: "Pending PublicIPAttachments validate tenant ownership, namespace scope, and RBAC permissions at activation time (when target resource is created) in addition to creation-time validation."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@enhancements/unified-networking-prd/README.md` around lines 397 - 404,
Acceptance criteria for PublicIPAttachment pending activation lack authorization
checks; update the spec and implementation so that when a PublicIPAttachment in
Pending state is activated (i.e., when the target resource is created) the
system validates tenant ownership, namespace scope, and RBAC permissions before
binding the PublicIP. Concretely, add an acceptance criterion like "Pending
PublicIPAttachments must validate tenant ownership, namespace scope, and RBAC
permissions at activation time" and ensure the CaaS template role activation
flow (the component that "activates pending attachments") performs these checks
against the target resource and tenant/namespace metadata before changing state
from Pending to Active to prevent cross-tenant or unauthorized attachments and
dangling/leaked PublicIPs.
f40f848 to
734ca3b
Compare
| region: moc-region-1 | ||
| fabricManager: netris | ||
| k8sManager: cudn_localnet | ||
| capabilities: |
There was a problem hiding this comment.
Why is capabilities redefined in this CR? It already exists in the netris and cudn_localnet configmaps
There was a problem hiding this comment.
If the capabilities are inferred from the managers, I would put them in the status and not in the spec. Then the provider admin does not need to set it thus reducing the chance for error.
With this path, I would consider disable_capabilities in the spec. This will allow the admin to make sure certain capabilities are disabled regardless to whether they are supported by the manager. But, this configuration will also need to be pushed to the manager.
There was a problem hiding this comment.
Agreed and implemented — capabilities are now inferred from the managers and published in status.capabilities. Added spec.disableCapabilities for providers who need to restrict specific capabilities.
|
@danmanor I had some thoughts after reading the latest update. That said, I do feel we lost some of the original cleanliness we were aiming for, at least from my pov. However, I’m not strongly attached to the original idea :) |
| `network_attachments` are immutable after resource creation — changing | ||
| network attachment requires recreating the resource. | ||
|
|
||
| The `interface` field is used for bare-metal resources only. It references a |
There was a problem hiding this comment.
I had a comment that seems to have been lost:
Do we imagine having other optional fields for other kinds of resources? Maybe we should do separate types with a kind of inheritance here instead of all trying to share the same NetworkAttachment type?
Maybe BareMetalNetworkAttachment and ClusterNetworkAttachment, etc.
There was a problem hiding this comment.
Agree, changed it accordingly
| # spec_defaults.interfaces: | ||
| # - name: data-0 | ||
| # description: "25Gbps data interface" | ||
| # - name: data-1 | ||
| # description: "25Gbps data interface" | ||
| # - name: mgmt | ||
| # description: "1Gbps management interface" |
There was a problem hiding this comment.
I think another comment got lost here from before:
@adriengentil do you think the template is where this kind of thing would be? This means it would be up to the cloud admin to create this rather than us pulling it from the backend, right? (i.e. it could be wrong and we need to deal with that)
There was a problem hiding this comment.
Waited for @adriengentil response. I think the admin will have to provide this info anyway, so whether it is in the template or via another way should be decided in the BMaaS flow, probably less relevant to this proposal, I just mentioned how it will look like with some assumptions
There was a problem hiding this comment.
That should be part of the HostType that we need to design, templates does not expose any inventory information, and are not tied to a "HostType" in particular.
There was a problem hiding this comment.
Okay so I think we can just make this less specific since this is TBD on the bare metal side.
For this proposal let's just say that the user will have access to some interface identifier from BMaaS then leave the rest about how someone would specify the network attachment given that they have the interface info and subnet.
| creates the K8s overlay and bridges it to the fabric segment. Only needed | ||
| for regions that host VMs. |
There was a problem hiding this comment.
Only needed for regions that host VMs
What does this mean? Are you referring to hub's hosting VMs vs. remote VM clusters?
There was a problem hiding this comment.
I mean it is only needed in deployments of osac serving VMaaS, someone might not offer that
099bd6a to
34c9e52
Compare
563fc25 to
e547199
Compare
|
|
||
| - As a tenant, I want to create isolated VirtualNetworks and Subnets for my | ||
| workloads without choosing a networking backend | ||
| - As a tenant, I want to define SecurityGroups to control traffic to my |
There was a problem hiding this comment.
Small nit. Security groups control both ingress and egress. So:to -> to and from
|
|
||
| ### BMaaS-Specific Stories | ||
|
|
||
| - As a tenant, I want to place my BaremetalInstance on a Subnet in my |
There was a problem hiding this comment.
I see the rest of the document talks about interfaces and networks in plural. So, again, a small nit. on a Subnet -> on Subnets.
Also, can different interfaces be connected to Subnets in different VirtualNetwork? This sentence should clarify the range of options
| region: moc-region-1 | ||
| fabricManager: netris | ||
| k8sManager: cudn_localnet | ||
| capabilities: |
There was a problem hiding this comment.
If the capabilities are inferred from the managers, I would put them in the status and not in the spec. Then the provider admin does not need to set it thus reducing the chance for error.
With this path, I would consider disable_capabilities in the spec. This will allow the admin to make sure certain capabilities are disabled regardless to whether they are supported by the manager. But, this configuration will also need to be pushed to the manager.
| name: k8s-manager-cudn-localnet | ||
| namespace: osac | ||
| labels: | ||
| osac.openshift.io/k8s-manager: "true" |
There was a problem hiding this comment.
I would make the label more network specific to avoid future collisions. E.g. osac.openshift.io/network/k8s-manager
There was a problem hiding this comment.
Good call, updated to osac.openshift.io/network/fabric-manager and osac.openshift.io/network/k8s-manager.
|
|
||
| ```yaml | ||
| apiVersion: v1 | ||
| kind: ConfigMap |
There was a problem hiding this comment.
Why ConfigMap? Why not add a CRD?
There was a problem hiding this comment.
Keeping ConfigMaps for simplicity — lowest barrier to adding a new manager (just kubectl apply a ConfigMap + deploy an Ansible role). No operator schema changes needed. Can revisit if registration needs evolve.
| |-----------|----------------| | ||
| | VN create/delete | `fabricManager` | | ||
| | Subnet create/delete | `fabricManager` + `k8sManager` (per hosting cluster) | | ||
| | SecurityGroup create/delete | `fabricManager` | |
There was a problem hiding this comment.
Question, once this is in place, VMs will not have an IP address that is internal to the cluster. That is, not through the fabric. Right? I'm asking to make sure we don't miss a different path that the Security Groups are missing
There was a problem hiding this comment.
Correct — once VMs are bridged to the fabric (via the k8sManager), they get a fabric IP. All traffic goes through the fabric, so SecurityGroup enforcement via fabric ACLs covers all paths. There is no cluster-internal IP path that bypasses the fabric.
| --compute-instance my-vm --name vm-att | ||
|
|
||
| # Attach to a BM server (new target type) | ||
| osac create externalipattachment --externalip my-ip \ |
There was a problem hiding this comment.
In the BM has multiple interfaces, how does this command know to which interface the external IP should be attached?
There was a problem hiding this comment.
ExternalIPAttachment targets the resource, not a specific interface. The fabric manager DNATs to the resource's primary subnet IP. Clarified in the latest push.
| # description: "1Gbps management interface" | ||
| ``` | ||
|
|
||
| The tenant then specifies which interface to attach to which subnet. Each |
There was a problem hiding this comment.
Some questions:
- Can one
BareMetalInstancehave more than oneBareMetalNetworkAttachmentwithout aninterface? - When is the matching done? What happens if a interface does not exist? What happens if there are more attachments than interfaces?
There was a problem hiding this comment.
Addressed in the latest push — added explicit validation rules:
- Multiple attachments without
interfaceis invalid — if more than one, each must specifyinterface - Validated at creation time — invalid interface identifiers or exceeding available interfaces are rejected
| during provisioning: | ||
|
|
||
| ```protobuf | ||
| message Cluster { |
There was a problem hiding this comment.
Shouldn't this be:
| message Cluster { | |
| message ClusterStatus { |
e547199 to
8fef7e5
Compare
| One NATGateway per VirtualNetwork. All subnets in the VN use the gateway. | ||
| Per-subnet NAT association is a future enhancement. | ||
|
|
||
| #### Multi-NIC Support |
There was a problem hiding this comment.
If we can attach to multiple subnets, how we define the default interface (the one with the default gateway)? If dhcp is used, how can we expect ipam to be configured (the "default" subnet will provide a default gateway, while the other one won't distribute it)?
There was a problem hiding this comment.
Good question. Added a note in Multi-NIC Support: the default gateway is provided by the fabric manager on the first (primary) subnet. Additional subnets provide connectivity without a default gateway — the fabric handles inter-subnet routing within the VN.
There was a problem hiding this comment.
by first, you mean the first network attachment in the list?
There was a problem hiding this comment.
the fabric handles inter-subnet routing within the VN
I case of 2 subnets, the BM will have a directly connected route to each subnet, so I'll bypass the inter-subnet routing
There was a problem hiding this comment.
Updated — default gateway and DHCP configuration for multi-homed BM is an implementation detail of the fabric manager. Removed the ambiguous "first" reference.
There was a problem hiding this comment.
Good catch — you're right, a multi-homed BM has directly connected routes to both subnets. Removed the inaccurate claim about fabric inter-subnet routing. The note now defers gateway/DHCP config to the fabric manager implementation.
There was a problem hiding this comment.
I still think it's important from tenant user point of view to be able to decide which subnet will be the default one, so I think the API should be clear about it.
There was a problem hiding this comment.
Would it make sense to add an explicit default: true field on NetworkAttachment (and make sure that there is only one attachment per BaremetalInstance)
This would make the tenant clearly express their intent through the API
There was a problem hiding this comment.
I think it makes sense
There was a problem hiding this comment.
Agree — multi-NIC behavior (which subnet gets the default gateway, how the tenant controls it, DHCP configuration per subnet) deserves its own focused discussion. This proposal establishes that multi-NIC is supported via the attachment types, but the specifics of how default gateway selection works for both VMs and BMs will be addressed in a follow-up PR.
8fef7e5 to
942316a
Compare
942316a to
a92f576
Compare
Introduces a redesigned networking architecture that generalizes the OSAC Networking API across all three service types with infrastructure-agnostic subnets where VMs participate in the fabric. Key design decisions: - NetworkClass with fabricManager + optional k8sManager (two fields) - ExternalIP (renamed from PublicIP) — external to VN, not internet-routable - Fabric manager handles all physical networking as a single product - K8s manager bridges OVN overlay to fabric (VMs become part of fabric) - Uniform dispatcher — no VM-vs-BM distinction for ExternalIP or NATGateway - Fabric-only security enforcement (no separate k8s ACL) - BM multi-interface selection via template discovery - CaaS prerequisite ordering handled by template, not API mandate Signed-off-by: Dan Manor <dmanor@redhat.com>
a92f576 to
e9035d7
Compare
| cluster. Bare-metal servers use physical VLANs configured on switches in the | ||
| fabric. These are fundamentally different L2 domains. A K8s manager using | ||
| LocalNet mode can bridge OVN to the physical fabric, making VMs first-class | ||
| participants alongside BM servers. The current design does not address how |
There was a problem hiding this comment.
What happens in CUDN+LocalNet (where OVN-K still does IPAM for VMs) but Netris does it for BM? Who is the source of truth?
Summary
Introduces a redesigned networking architecture that generalizes the OSAC Networking API across all three service types: VMaaS (ComputeInstance), CaaS (Cluster), and BMaaS (BaremetalInstance).
The core premise is that VMs are part of the fabric. OSAC runs VMs on OpenShift using KubeVirt (VMs encapsulated in pods, networking managed by OVN-Kubernetes). A K8s manager bridges the OVN overlay to the physical fabric, making VMs first-class participants alongside bare-metal servers and cluster nodes. Once on the fabric, all resource types are treated uniformly.
This enhancement replaces the original networking enhancement which listed CaaS and BMaaS as non-goals.
Key Design Decisions
fabricManager(handles all physical networking) and optionalk8sManager(bridges VMs to the fabric)clusterandbaremetal_instancetargets +target_endpointenum for CaaS (API/INGRESS)Related