
Connect-V2, SDK acceptance-test framework, and the 2.0/v2 migration #943

Merged
plorenz merged 19 commits into main from connect-v2
Jun 22, 2026


@plorenz plorenz commented Jun 6, 2026

Member

Implements the Connect-V2 sessionless SDK dial protocol — dials are authorized at the router via RDM instead of requiring a controller-issued service session token.

  • Adds CtrlClient.GetServiceEdgeRouters wrapping the existing GET /edge/client/v1/services/{id}/edge-routers endpoint for sessionless edge-router discovery.
  • Adds a per-service ER cache on ContextImpl.serviceEdgeRouters, refreshed on the sessionRefreshTimer cadence and warming router connections the same way the session cache does. Invalidates on service removal and on dial failure.
  • Refactors dialSession to skip service-session creation on the V2 path; V1 fallback now creates the service session lazily, only when the V1 wire is actually taken.
  • Forces UseXgressToSdkHeader=true on Connect-V2 dials (the go-SDK V2 path is always edgeConnXgress); the router continues to support both flow-control modes for other SDKs.
  • Adds DialOptions.ForceConnectV1 escape hatch to route through the V1 path even when the router advertises V2.
  • Removes route_circuit / pending_dials. The xgress CircuitStart handshake already serializes data flow, so the race window the module was guarding against does not exist. Reads the circuit ID from state_connected's CircuitIdHeader instead.
  • Uses the sessionless ER-list endpoint on controllers 1.0.0 and newer (0.0.0 dev builds included), gated by CtrlClient.supportsServiceEdgeRouterList. Older controllers don't expose the endpoint on the client API; for those the SDK falls back to creating a dial session and using its attached edge routers, with the session cache as the source of truth. While the controller version is unknown, dials take the V1 path and the version load is retried on the next capability check. If the sessionless endpoint errors despite the version saying it should exist, the dial falls back to the V1 session path at runtime and drops the cached ER list. Documents the supported controller set (the 1.6.x and 2.0.0 LTS releases) in CHANGELOG and bumps the version from 1.7 to 2.0.
  • Creates fallback sessions on the dial path via createSessionWithBackoff, so foreground dials get ctx cancellation, retry with backoff, re-auth on 401 and service-recreation handling; the background refresh loop uses a plain cached lookup.
  • Restructures createSessionWithBackoff to run the retry before returning the session, removing a dead pre-retry cache call and a return statement that relied on operand evaluation order.
  • Removes GetRouterId() from edge.Conn and edge.MsgChannel; documents the removal as a breaking change in CHANGELOG.
  • Bounds-checks splitMultipart length prefixes; returns descriptive errors instead of panicking on truncated input.
  • Fixes handlePayloadWithNoSink to send the constructed ack instead of the original message.
  • Removes the conn.Close() calls from setupXgressFlowControl's header-validation error paths so the conn no longer NPEs before xg / writeAdapter are populated.
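
The controller-version gate described in the list above can be sketched roughly as follows. This is a minimal illustration, not the SDK's implementation: the free function, its signature, and the hand-rolled version parsing are assumptions made to keep the example self-contained. Only the rule itself (1.0.0 and newer, with 0.0.0 dev builds included, and the V1 path while the version is unknown) comes from the description.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// supportsServiceEdgeRouterList is a hypothetical stand-in for the capability
// check: supported is true for controllers 1.0.0 and newer and for 0.0.0 dev
// builds. known is false when the version cannot be parsed, in which case the
// caller would stay on the V1 path and retry the version load later.
func supportsServiceEdgeRouterList(version string) (supported, known bool) {
	v := strings.TrimPrefix(strings.TrimSpace(version), "v")
	// Simplification: ignore prerelease/build suffixes such as "-rc1".
	if i := strings.IndexAny(v, "-+"); i >= 0 {
		v = v[:i]
	}
	parts := strings.Split(v, ".")
	if len(parts) != 3 {
		return false, false
	}
	nums := make([]int, 3)
	for i, p := range parts {
		n, err := strconv.Atoi(p)
		if err != nil || n < 0 {
			return false, false
		}
		nums[i] = n
	}
	if nums[0] == 0 && nums[1] == 0 && nums[2] == 0 {
		return true, true // 0.0.0 marks a dev build
	}
	return nums[0] >= 1, true
}

func main() {
	for _, v := range []string{"v2.0.0", "v1.6.17", "v0.34.2", "v0.0.0", ""} {
		supported, known := supportsServiceEdgeRouterList(v)
		fmt.Printf("%q supported=%v known=%v\n", v, supported, known)
	}
}
```

Returning a separate `known` flag keeps "version not loaded yet" distinct from "version too old", which matches the two different fallbacks described above.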

For openziti/ziti#3884.

Also in this PR (the 2.0 release)

This PR is now the 2.0 release. On top of Connect-V2 it lands the SDK acceptance-test framework, the xgress half-close back-compat fix, and the /v2 module migration.

  • SDK acceptance-test framework: exercises the SDK against real, versioned controllers/routers (release downloads and source builds), with a GitHub Actions matrix and a dedicated required-mode Connect-V2 coverage job. Design and decisions in acceptance-tests.md. Fixes #951 (Add an SDK acceptance-test framework).
  • Half-close back-compat: an xgress (Connect-V2) client half-closing to a router-bridged legacy edge host now emits the legacy edge FIN when the peer hasn't negotiated native xgress EOF, restoring half-close delivery to hosts that read to EOF. Covered by a red/green acceptance regression. Fixes #952 (xgress client half-close not delivered to legacy edge hosts).
  • 2.0 / v2 module path: the module is now github.com/openziti/sdk-golang/v2; consumers update their imports to the /v2 form. See CHANGELOG.

Note: the dedicated Connect-V2 CI job builds the ziti connect-v2 branch against this SDK, so it needs that branch to also import sdk-golang/v2.
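
For consumers, the /v2 change amounts to a mechanical rewrite of import paths. The sketch below shows that mapping as a small helper; the function name is hypothetical, and the subpackage paths in `main` are used only as sample inputs.

```go
package main

import (
	"fmt"
	"strings"
)

const (
	oldModule = "github.com/openziti/sdk-golang"
	newModule = "github.com/openziti/sdk-golang/v2"
)

// toV2Import rewrites an import path rooted at the old module path to the
// /v2 module path. Paths already on /v2 and paths from other modules are
// returned unchanged.
func toV2Import(path string) string {
	if path == newModule || strings.HasPrefix(path, newModule+"/") {
		return path
	}
	if path == oldModule {
		return newModule
	}
	if strings.HasPrefix(path, oldModule+"/") {
		return newModule + strings.TrimPrefix(path, oldModule)
	}
	return path
}

func main() {
	for _, p := range []string{
		"github.com/openziti/sdk-golang/ziti",
		"github.com/openziti/sdk-golang/ziti/edge",
		"github.com/openziti/sdk-golang/v2/ziti",
		"example.com/other/module",
	} {
		fmt.Printf("%s -> %s\n", p, toV2Import(p))
	}
}
```

The `go.mod` require line needs the same /v2 suffix, since Go treats the /v2 path as a distinct module.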

Fixes #936
Fixes #951
Fixes #952

@plorenz plorenz requested a review from a team as a code owner June 6, 2026 17:05
@plorenz plorenz force-pushed the connect-v2 branch 5 times, most recently from 6b8dafb to 5c5a659 Compare June 12, 2026 00:32
Comment thread ziti/edge/conn.go
Comment thread ziti/client.go
Comment thread ziti/client.go
Comment on lines +604 to +607
ers := resp.Payload.Data.EdgeRouters
for _, er := range ers {
self.sanitizeEdgeRouterUrls(er)
}

Member

Should probably nil check resp.Payload/Data

Member Author

fixed
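
The thread does not show the fix itself. A nil guard of the kind the reviewer asks for might look like the sketch below, where every type is a hypothetical stand-in that mirrors only the `resp.Payload.Data.EdgeRouters` shape from the snippet under review.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical stand-ins for the generated API response types.
type edgeRouter struct{ Name string }
type routerList struct{ EdgeRouters []*edgeRouter }
type envelope struct{ Data *routerList }
type response struct{ Payload *envelope }

// edgeRoutersFrom returns the edge-router list, or an error when any level
// of the response is nil, so callers never dereference a missing payload.
func edgeRoutersFrom(resp *response) ([]*edgeRouter, error) {
	if resp == nil || resp.Payload == nil || resp.Payload.Data == nil {
		return nil, errors.New("edge-router list response has no payload data")
	}
	return resp.Payload.Data.EdgeRouters, nil
}

func main() {
	// An empty response yields an error instead of a panic.
	_, err := edgeRoutersFrom(&response{})
	fmt.Println("empty response:", err)

	full := &response{Payload: &envelope{Data: &routerList{
		EdgeRouters: []*edgeRouter{{Name: "edge1"}},
	}}}
	ers, err := edgeRoutersFrom(full)
	fmt.Println("routers:", len(ers), "err:", err)
}
```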

Comment thread version Outdated
@@ -1 +1 @@
1.8

Member

I am confused now. 1.9 here but 2.0 in the changelog?

Member Author

fixed, it's all v2.0.0 now

@andrewpmartinez andrewpmartinez left a comment

Member

The 2.0 changelog, vs the 1.9 version file, vs the breaking changes need to be resolved.

@plorenz plorenz changed the title from "Implement Connect-V2. Fixes #936" to "Connect-V2, SDK acceptance-test framework, and the 2.0/v2 migration" on Jun 16, 2026
plorenz added 17 commits June 18, 2026 16:20
Implements the Connect-V2 sessionless SDK dial protocol — dials are
authorized at the router via RDM instead of requiring a controller-issued
service session token.

- Adds `CtrlClient.GetServiceEdgeRouters` wrapping the existing
  `GET /edge/client/v1/services/{id}/edge-routers` endpoint for
  sessionless edge-router discovery.
- Adds a per-service ER cache on `ContextImpl.serviceEdgeRouters`,
  refreshed on the `sessionRefreshTimer` cadence and warming router
  connections the same way the session cache does. Invalidates on
  service removal and on dial failure.
- Refactors `dialSession` to skip service-session creation on the V2
  path; V1 fallback now creates the service session lazily, only when
  the V1 wire is actually taken.
- Forces `UseXgressToSdkHeader=true` on Connect-V2 dials (the go-SDK
  V2 path is always `edgeConnXgress`); the router continues to support
  both flow-control modes for other SDKs.
- Adds `DialOptions.ForceConnectV1` escape hatch to route through the
  V1 path even when the router advertises V2.
- Removes `route_circuit` / `pending_dials`. The xgress `CircuitStart`
  handshake already serializes data flow, so the race window the
  module was guarding against does not exist. Reads the circuit ID
  from `state_connected`'s `CircuitIdHeader` instead.
- Registers a buffering sink with the mux before each dial request goes
  out (all three dial paths), then atomically swaps in the built conn
  via the new `ConnMux.Replace`, replaying anything buffered in order.
  The hosting side sends its e2e crypto header the moment it accepts,
  and that data could arrive before the dialing goroutine resumed from
  `SendForReply` and registered the conn — the mux dropped it, killing
  the read side with "failed to receive crypto header bytes".
- Uses the sessionless ER-list endpoint on controllers 1.0.0 and newer
  (`0.0.0` dev builds included), gated by
  `CtrlClient.supportsServiceEdgeRouterList`. Older controllers don't
  expose the endpoint on the client API; for those the SDK falls back
  to creating a dial session and using its attached edge routers, with
  the session cache as the source of truth. While the controller
  version is unknown, dials take the V1 path and the version load is
  retried on the next capability check. If the sessionless endpoint
  errors despite the version saying it should exist, the dial falls
  back to the V1 session path at runtime and drops the cached ER list.
  Documents the supported controller set (the 1.6.x and 2.0.0 LTS
  releases) in CHANGELOG and bumps the version from 1.7 to 2.0.
- Creates fallback sessions on the dial path via
  `createSessionWithBackoff`, so foreground dials get ctx
  cancellation, retry with backoff, re-auth on 401 and
  service-recreation handling; the background refresh loop uses a
  plain cached lookup.
- Restructures `createSessionWithBackoff` to run the retry before
  returning the session, removing a dead pre-retry cache call and a
  return statement that relied on operand evaluation order.
- Adds `EventDial` / `AddDialListener`: one event per dial attempt,
  successful or not, emitted from the dial path where the V1/V2 decision
  is made — so the negotiated protocol, target router, forced-V1 flag,
  timing, circuit id, and failure cause are observable without tracking
  any state on connections.
- Removes `GetRouterId()` from `edge.Conn` and `edge.MsgChannel`;
  documents the removal as a breaking change in CHANGELOG.
- Bounds-checks `splitMultipart` length prefixes; returns descriptive
  errors instead of panicking on truncated input.
- Fixes `handlePayloadWithNoSink` to send the constructed ack instead
  of the original message.
- Removes the `conn.Close()` calls from `setupXgressFlowControl`'s
  header-validation error paths so the conn no longer NPEs before
  `xg` / `writeAdapter` are populated.
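
The buffering-sink bullet above describes a register-then-swap pattern. The sketch below illustrates it with a toy mux; all types here are hypothetical, and only the idea of a `Replace` that replays buffered messages in order is taken from the commit message. Holding the lock during delivery is a simplification for the example.

```go
package main

import (
	"fmt"
	"sync"
)

// sink receives the messages the mux routes to one connection id.
type sink interface{ Accept(msg string) }

// bufferingSink holds messages that arrive before the real conn exists.
// It is only touched while the mux lock is held.
type bufferingSink struct{ pending []string }

func (b *bufferingSink) Accept(msg string) { b.pending = append(b.pending, msg) }

// recordingConn stands in for the fully built connection.
type recordingConn struct{ received []string }

func (c *recordingConn) Accept(msg string) { c.received = append(c.received, msg) }

type mux struct {
	mu    sync.Mutex
	sinks map[uint32]sink
}

func newMux() *mux { return &mux{sinks: map[uint32]sink{}} }

func (m *mux) Add(id uint32, s sink) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.sinks[id] = s
}

// Dispatch delivers msg to the sink registered for id and reports whether
// a sink existed. Without a sink the message would be dropped.
func (m *mux) Dispatch(id uint32, msg string) bool {
	m.mu.Lock()
	defer m.mu.Unlock()
	s, ok := m.sinks[id]
	if ok {
		s.Accept(msg)
	}
	return ok
}

// Replace swaps in next for id. If the current sink was buffering, its
// messages are replayed into next in arrival order under the same lock,
// so no newer message can overtake them.
func (m *mux) Replace(id uint32, next sink) {
	m.mu.Lock()
	defer m.mu.Unlock()
	if b, ok := m.sinks[id].(*bufferingSink); ok {
		for _, msg := range b.pending {
			next.Accept(msg)
		}
		b.pending = nil
	}
	m.sinks[id] = next
}

func main() {
	m := newMux()
	m.Add(7, &bufferingSink{})     // registered before the dial request goes out
	m.Dispatch(7, "crypto-header") // arrives before the conn is built
	conn := &recordingConn{}
	m.Replace(7, conn)
	m.Dispatch(7, "payload")
	fmt.Println(conn.received) // prints [crypto-header payload]
}
```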

For openziti/ziti#3884.
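
As a rough illustration of the `splitMultipart` bounds-check item above, the sketch below splits a length-prefixed buffer and returns descriptive errors on truncated input instead of panicking. The wire format used here (a 2-byte little-endian length before each part) is an assumption for the example, and the function name is hypothetical.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// splitLengthPrefixed splits data into parts, each preceded by an assumed
// 2-byte little-endian length. Every slice operation is bounds-checked
// first, so truncated input produces an error rather than a panic.
func splitLengthPrefixed(data []byte) ([][]byte, error) {
	var parts [][]byte
	for offset := 0; offset < len(data); {
		if left := len(data) - offset; left < 2 {
			return nil, fmt.Errorf("truncated length prefix at offset %d: %d byte(s) left, need 2", offset, left)
		}
		n := int(binary.LittleEndian.Uint16(data[offset:]))
		offset += 2
		if left := len(data) - offset; left < n {
			return nil, fmt.Errorf("truncated part at offset %d: prefix says %d byte(s), %d left", offset, n, left)
		}
		parts = append(parts, data[offset:offset+n])
		offset += n
	}
	return parts, nil
}

func main() {
	ok := []byte{2, 0, 'h', 'i', 1, 0, '!'}
	parts, err := splitLengthPrefixed(ok)
	fmt.Println(len(parts), err)

	// The prefix claims 5 bytes but only 1 follows.
	_, err = splitLengthPrefixed([]byte{5, 0, 'a'})
	fmt.Println(err)
}
```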
- adds acceptance-tests.md, the design for correctness-testing the SDK against multiple OpenZiti versions (LTS lines, latest release, branches/commits)
- adds the acceptance/ module scaffold with its own go.mod and a replace directive targeting the local SDK
- adds versions.yaml (label -> release pointers, source repo) with a strict-decode loader
- classifies ZITI_ACCEPTANCE_VERSION selectors into labels, release versions, and git refs
- resolves labels and release versions to concrete tags via the GitHub releases API, excluding drafts and prereleases and handling vM.m.x minor wildcards
- uses golang.org/x/mod/semver for version comparison
- tests resolution against unsorted, prerelease, draft, and paginated release fixtures
- ignores local review-tooling files (mercurius, .mcp.json)
- extends the GitHub client with release-by-tag asset lookup and download, resolving asset URLs from the API rather than constructing filenames
- extracts the ziti binary from release tar.gz archives at any path depth
- caches binaries keyed by immutable id (tag or SHA) with atomic install; ZITI_ACCEPTANCE_CACHE overrides the location
- selects platform assets with arch aliases (amd64/x86_64, arm64/aarch64), reporting zip-only matches distinctly
- tests download/extract/cache behavior against a counting fake release server
- adds an opt-in live test against the real GitHub API, gated on ZITI_ACCEPTANCE_LIVE=1
- adds an open item: promote internal/acquire to a sibling nested module (acquire/vX.Y.Z tag line) once the harness and source-build phases prove its API, so ziti/zititest can replace stageziti's GH-release fetch core with it
- records the hard rule that acquire imports no SDK packages, keeping the extraction mechanical
- adds the ziticli exec wrapper: every invocation runs with an isolated ZITI_CONFIG_DIR and a scrubbed ZITI_* environment, so harness logins never collide with the developer's own CLI state
- launches `ziti edge quickstart --no-router` as a long-lived controller-only child process with dynamic port allocation and captured logs
- gates readiness on the bootstrap contract: HTTPS 200 plus admin login plus a harmless admin operation, reporting an admin-gate failure as a directed bootstrap contract violation naming the version, with a controller log tail
- adds the harness package: StartShared/Start, Cli, Version with AtLeast (source builds satisfy every minimum), RequireMinVersion
- tags the bring-up test with the acceptance build tag so the default suite stays network-free; verified live against latest (v2.0.0), active-lts (v2.0.x wildcard), and maint-lts (v1.6.17)
- adds an open item to source LTS labels from the ziti repo's lts-versions.json once openziti/ziti#3962 merges, keeping our own resolution logic and versions.yaml for source/repo and overrides
- adds CreateIdentity: creates and enrolls identities via the versioned CLI per the setup contract, with per-test unique names and best-effort cleanup per the isolation contract
- adds NewSdkContext: builds an authenticated ziti.Context from a CLI-enrolled identity using the SDK in this tree via the module replace directive
- adds the first SDK smoke test: authenticate, list services, and current-identity round trip; verified live against latest (v2.0.0) and maint-lts (v1.6.17)
- adds cmd/matrix: runs the tagged acceptance suite once per version selector with a per-version pass/fail summary and non-zero exit on any failure
- defaults the selector list to the versions.yaml labels plus latest, so the matrix has a single source of truth; arguments select a subset and anything after -- passes through to go test (e.g. -run for one test across all versions)
- forces -count=1 since go test's cache cannot see controller-side state
- supports -fail-fast to stop at the first failing version
- exports acquire.FindVersionsFile and reuses it from the harness
- documents quick start, version selection, the matrix runner, the binary cache, opt-in live tests, module layout, and platform notes
- links the acceptance module from the top-level README package list
- adds AddRouter: creates, enrolls, and runs an edge router as its own child process via the version-stable CLI sequence, with config generation driven by ZITI_* env vars verified identical on 1.6 and main
- gates router readiness on TCP listen plus controller-reported online status, and supports Stop/Start for failover tests without re-enrollment
- adds service and policy helpers (CreateService, GrantDial/GrantBind, GrantRouterAccess, GrantServiceRouterAccess) that target entities by name per the isolation contract
- adds Test_DialHostEcho: SDK hosts and dials through the router with bounded first-dial retry, exercising half-close and EOF propagation in both directions
- verified live against latest (v2.0.0) and maint-lts (v1.6.17); full-matrix runs surfaced an intermittent SDK data-plane race, tracked separately
- skips the API entirely for selectors that pin a concrete tag (explicit versions and non-wildcard label values) when the binary is already cached; the cache entry is proof the release exists, so warm pinned runs make zero API calls
- adds acquire.ZitiMemoized, used by the harness: one resolution per selector per process, so a suite whose tests each start a harness stays off the rate limit and on a consistent version; failures are not cached
- renames acquire.Acquire to acquire.Ziti so call sites read without stutter (acquire.Ziti, acquire.ZitiMemoized)
- directs rate-limit failures: a 403 rate-limit response now names GITHUB_TOKEN as the fix
- documents the rate-limit behavior in the README
- tests the shortcut and memo against an all-erroring source (proving zero API calls) and the 403 hint against a fake server
- adds acceptance/tests with TestMain bringing up one shared environment per package via StartShared, per the design's Layer 5 model; per-test cost drops from a controller boot each to under a few seconds
- StartShared now materializes the default topology (controller plus the edge1 router); AddRouter and Router.Start become thin testing.TB wrappers over error-returning internals so TestMain has a TB-free bring-up path
- P0 #1: the dial/host echo smoke gains service-discovery content assertions (each identity sees the service with exactly its granted permissions, lookup by name agrees) alongside half-close/EOF and the dial-event protocol check
- P0 #1b: adds the SDK enrollment round-trip test, the one place enroll.Enroll is the system under test; adds CreateUnenrolledIdentity to support it
- P0 #3: adds the auth-modes tests: OIDC and forced-legacy contexts each complete an echo round trip with the session type asserted via the new NewLegacySdkContext and ApiSessionType helpers, and ext-JWT works as a primary credential via a fully headless flow (locally generated signer registered through the versioned CLI, locally minted JWT, JwtCredentials login, identity-match and GetExternalSigners assertions)
- extracts shared test helpers (echo server, echo round trip, dial retry); migrates the smoke tests from the harness package, which keeps only the per-version bring-up canary
- verified live against latest (v2.0.0) and maint-lts (v1.6.17); both lines negotiate OIDC by default and force legacy correctly
- resolves branch/tag refs to a full commit SHA via git ls-remote before any cache interaction, then shallow-fetches exactly that SHA so a moving branch cannot change what gets built
- builds the ziti binary from the checkout and installs it into the cache keyed by SHA
- adds co-development mode (ZITI_ACCEPTANCE_BUILD_WITH_LOCAL_SDK=true): replaces the ref's pinned sdk-golang with the local SDK tree, for ziti branches developed in lockstep with SDK branches; cache keys then include the local SDK commit, and a dirty SDK tree bypasses the cache so iteration never serves a stale binary
- pure-build compile failures carry a directed hint naming the co-development env var
- carries SourceBuilt through ResolvedID and Version (short-SHA display; source builds satisfy every version minimum)
- verified live: built openziti/ziti@connect-v2 against the local SDK and ran the bring-up canary against it
- adds Test_DialProtocolNegotiation: the SDK must take ConnectV2 exactly when the router advertises the capability and the session is OIDC, else legacy V1, asserted against the observed dial event (never inferred from a version), with the ForceConnectV1 escape hatch checked too
- adds ZITI_ACCEPTANCE_REQUIRE_V2: required mode fails if the environment can't exercise ConnectV2, so the dedicated CI job can't go green by adaptively passing on V1
- adds harness support: RouterSupportsConnectV2 (from the new SupportsConnectV2 field in router inspect), RequireV2, dialWithOptionsRetry, expectedDialProtocol
- the echo host now advertises SDK-hosted xgress on bind, so dials to it run SDK xgress on both ends where the router supports it (older routers fall back to a legacy terminator); the smoke test's protocol expectation is now capability-driven
- adds ZITI_ACCEPTANCE_DEBUG for SDK and router debug logging, and bounded read deadlines so a data-plane stall fails in seconds with a directed message rather than a test timeout
- replaces design-doc shorthand in test comments with descriptions of the actual behavior
- verified against latest and maint-lts (full suite, legacy terminator fallback) and source-built connect-v2 (required-V2 mode)
- splits startEchoServerFC out of startEchoServer with an explicit sdkXgress
  parameter, so tests can host a legacy (non-xgress) terminator the router
  bridges to
- keeps startEchoServer defaulting to SDK-hosted xgress, the path the suite
  primarily exercises
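
The negotiation rule that Test_DialProtocolNegotiation asserts, together with the required-mode check, can be summarized in a few lines. This is a sketch under assumptions: the type, field, and function names below are illustrative and are not the harness's actual helpers.

```go
package main

import (
	"errors"
	"fmt"
)

type dialProtocol string

const (
	connectV1 dialProtocol = "ConnectV1"
	connectV2 dialProtocol = "ConnectV2"
)

// dialEnv captures the inputs the rule depends on (illustrative names).
type dialEnv struct {
	RouterAdvertisesV2 bool
	OidcSession        bool
	ForceConnectV1     bool
}

// canExerciseV2 reports whether the environment is able to take ConnectV2.
func (e dialEnv) canExerciseV2() bool { return e.RouterAdvertisesV2 && e.OidcSession }

// expectedProtocol: ConnectV2 exactly when the router advertises it and the
// session is OIDC, unless the ForceConnectV1 escape hatch is set.
func expectedProtocol(e dialEnv) dialProtocol {
	if e.canExerciseV2() && !e.ForceConnectV1 {
		return connectV2
	}
	return connectV1
}

// checkRequiredMode fails when V2 coverage is required but the environment
// could only pass adaptively on V1.
func checkRequiredMode(requireV2 bool, e dialEnv) error {
	if requireV2 && !e.canExerciseV2() {
		return errors.New("required mode: environment cannot exercise ConnectV2")
	}
	return nil
}

func main() {
	legacy := dialEnv{RouterAdvertisesV2: false, OidcSession: true}
	modern := dialEnv{RouterAdvertisesV2: true, OidcSession: true}
	fmt.Println(expectedProtocol(legacy), checkRequiredMode(true, legacy))
	fmt.Println(expectedProtocol(modern), checkRequiredMode(true, modern))
}
```

In the suite itself the expectation is compared against the observed dial event rather than inferred from a version, which is what keeps the check honest.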
…ixes #952

Moving half-close into xgress replaced the edge FIN flag with the native
xgress EOF flag. Peers that don't honor native EOF, an older router bridging
to a legacy edge host or an older SDK, never see the half-close, so a host
that reads to EOF starves and the circuit stalls.

- restores the legacy signal in edgeConnXgress.CloseWrite: when the peer has
  not negotiated native EOF, half-close rides as a payload header the
  terminating router maps back onto edge.FlagsHeader, so the host sees an
  ordinary edge FIN; the native EOF flag is still used when the peer supports
  it
- exports Xgress.PeerSupportsEOF and splits CloseSendBufferWhenEmpty out of
  CloseRxTimeout so the edge layer can end its send half without emitting the
  native EOF that would tear the whole circuit down
- adds Test_HalfClose_XgressClientToLegacyHost, an acceptance regression for an
  xgress client half-closing to a router-bridged legacy host
- adds .github/workflows/acceptance.yml, running the SDK acceptance suite on
  pull requests and pushes to main
- runs a compatibility matrix over the active-lts, maint-lts, latest, and main
  selectors in adaptive mode, guaranteeing V1 compatibility across the supported
  lines; the label-to-version mapping stays in the versions.yaml manifest
- adds a dedicated connect-v2 job that builds the connect-v2 ziti branch against
  the SDK under test and runs in required mode, so V2 + OIDC coverage can't
  silently downgrade to a V1 pass
- caches acquired ziti binaries by their immutable id and serializes package
  test binaries so they share one acquisition
- records the V2 job's co-development build rationale in acceptance-tests.md
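
The half-close fix described above comes down to one branch on whether the peer negotiated native EOF. The sketch below models that branch with a recorder in place of real I/O; every name in it is hypothetical, and the event strings only paraphrase the commit message.

```go
package main

import "fmt"

// recorder collects the actions a half-close would take, in order.
type recorder struct{ events []string }

func (r *recorder) add(e string) { r.events = append(r.events, e) }

// closeWrite models the half-close decision: use the native xgress EOF when
// the peer supports it, otherwise send the legacy FIN as a payload header
// and drain the send half without emitting the native EOF.
func closeWrite(peerSupportsEOF bool, r *recorder) {
	if peerSupportsEOF {
		r.add("send native xgress EOF")
		return
	}
	r.add("send payload carrying legacy FIN header")
	r.add("close send buffer when empty, without native EOF")
}

func main() {
	for _, supports := range []bool{true, false} {
		r := &recorder{}
		closeWrite(supports, r)
		fmt.Printf("peerSupportsEOF=%v: %v\n", supports, r.events)
	}
}
```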
plorenz added 2 commits June 18, 2026 16:20
- changes the module path to github.com/openziti/sdk-golang/v2 and rewrites all
  import paths accordingly, per Go semantic import versioning
- repoints the acceptance and example modules' SDK require/replace and the
  acceptance co-development build to the /v2 path
- regenerates edge_client.pb.go for the updated go_package option
- bumps the version file to 2.0
- records the /v2 path change as a breaking change and lists the connect-v2,
  acceptance-framework (#951), and half-close (#952) issues in the changelog
The SDK moved to channel/v5 (the 2.0 line) and shares the sdk-golang/xgress
package with ziti, so the co-development build cannot compile ziti connect-v2,
which is still on channel/v4, against the SDK under test. ziti's channel/v5
migration is in progress but not yet landed or merged with connect-v2.

- disables the connect-v2 V2-coverage job with `if: false`, documenting how and
  when to re-enable it
@plorenz plorenz dismissed andrewpmartinez’s stale review June 18, 2026 21:41

I've added the AT framework as well, so starting fresh

@plorenz plorenz requested review from a team and andrewpmartinez June 18, 2026 21:41
@plorenz plorenz merged commit 5a7afe8 into main Jun 22, 2026
25 of 30 checks passed
@plorenz plorenz deleted the connect-v2 branch June 22, 2026 19:13

Development

Successfully merging this pull request may close these issues.

  • xgress client half-close not delivered to legacy edge hosts
  • Add an SDK acceptance-test framework
  • Implement Connect-V2: sessionless SDK dial

2 participants