[controller] Fix NPE in /job when parent version is absent #2835
Open
jingy-li wants to merge 2 commits into
VeniceParentHelixAdmin#getOffLineJobStatus looks up the requested version in parent store metadata. When the version is no longer present — retention eviction, supersession, controller leadership change, or manual cleanup between two consecutive VPJ polls — the lookup returns null and a downstream branch (the KILLED check in the non-terminal path) dereferences it, escaping the framework's catch-all as HTTP 500. VPJ retries on 5xx, error alarms fire, oncall is paged. The protocol already defines first-class terminal values (ARCHIVED, NOT_CREATED) for this state, and an adjacent branch in the same method already null-guards the same value. Short-circuit at the point of lookup and return ARCHIVED (if versionNum <= largestUsedVersionNumber) or NOT_CREATED (otherwise) with HTTP 200 so VPJ exits its poll loop cleanly. Adds two tests covering both branches.
misyel
reviewed
May 28, 2026
// letting an unguarded deref below escape as HTTP 500.
if (version == null) {
  ExecutionStatus terminalStatus = versionNum <= parentStore.getLargestUsedVersionNumber()
      ? ExecutionStatus.ARCHIVED
Contributor
For archived status, will VPJ stop polling and mark as failed?
Contributor
Author
Yes, VPJ will fail the push when it sees ARCHIVED
sofiaz11
reviewed
May 28, 2026
Address review: NOT_CREATED is non-terminal (isTerminal=false), so VPJ would keep polling rather than exiting its loop. Use the terminal ERROR status for the versionNum > largestUsedVersionNumber branch — a version that was never created is a genuine inconsistency, not a transient absence. Keep ARCHIVED for the retired branch (versionNum <= largestUsedVersionNumber) and differentiate the two cases via distinct status-details strings. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
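To make the terminal versus non-terminal distinction concrete, here is a minimal sketch of a status poll loop. The `Status` enum and `pollsUntilExit` are hypothetical stand-ins, not Venice's `ExecutionStatus` or VPJ's actual polling code. Only the terminal flags for NOT_CREATED, ERROR, and ARCHIVED are taken from the discussion above; the STARTED value is an assumption added for illustration.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;

/** Hypothetical sketch of a client poll loop; not the actual VPJ implementation. */
final class PollLoopSketch {
  /** Illustrative stand-in for ExecutionStatus: only a terminal flag is modeled. */
  enum Status {
    STARTED(false),     // assumed non-terminal, for illustration
    NOT_CREATED(false), // non-terminal, per the review note
    ERROR(true),
    ARCHIVED(true);

    private final boolean terminal;

    Status(boolean terminal) {
      this.terminal = terminal;
    }

    boolean isTerminal() {
      return terminal;
    }
  }

  /**
   * Consumes responses until one is terminal or maxPolls is reached.
   * Returns how many polls were issued.
   */
  static int pollsUntilExit(Iterator<Status> responses, int maxPolls) {
    int polls = 0;
    while (polls < maxPolls && responses.hasNext()) {
      polls++;
      if (responses.next().isTerminal()) {
        break; // a terminal status ends the loop immediately
      }
    }
    return polls;
  }

  public static void main(String[] args) {
    // A controller that keeps answering NOT_CREATED never releases the client early.
    int stuck = pollsUntilExit(Collections.nCopies(10, Status.NOT_CREATED).iterator(), 10);
    // Answering ERROR releases it on the first poll.
    int released = pollsUntilExit(Arrays.asList(Status.ERROR, Status.ERROR).iterator(), 10);
    System.out.println("NOT_CREATED polls: " + stuck + ", ERROR polls: " + released);
  }
}
```

In this sketch a NOT_CREATED response uses up the whole poll budget, while ERROR or ARCHIVED ends the loop on the first response that carries it, which is the behavior the review comment relies on.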
Problem Statement
Fixes VENG-12673: /job in venice-controller returns HTTP 500 from an unguarded NPE when the requested version is no longer present in parent store metadata.

VeniceParentHelixAdmin#getOffLineJobStatus looks up parentStore.getVersion(versionNum) (annotated @Nullable) and then dereferences it in the non-terminal branch (version.getStatus().equals(KILLED)). The version can legitimately be absent between two consecutive VPJ polls for several operational reasons: retention eviction, supersession, controller leadership change, or manual cleanup.

The unguarded deref escapes the framework's recognized-exception path as HTTP 500. VPJ retries on 5xx, error-rate alarms fire, and oncall is paged, all for a state the protocol already has terminal values for. An adjacent branch in the same method already null-guards the same value, confirming the omission is an oversight.
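The failure mode described above can be reproduced in isolation. The sketch below is a minimal illustration, assuming a map-backed store whose lookup returns null for an absent version. `Store`, `Version`, `VersionStatus`, and both `isKilled*` helpers are hypothetical names, not the Venice classes.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical reproduction of a nullable lookup followed by a dereference. */
final class NullableLookupSketch {
  enum VersionStatus { STARTED, KILLED }

  static final class Version {
    private final VersionStatus status;

    Version(VersionStatus status) {
      this.status = status;
    }

    VersionStatus getStatus() {
      return status;
    }
  }

  /** Stand-in for parent store metadata; getVersion returns null when absent. */
  static final class Store {
    private final Map<Integer, Version> versions = new HashMap<>();

    void addVersion(int versionNum, Version version) {
      versions.put(versionNum, version);
    }

    void deleteVersion(int versionNum) {
      versions.remove(versionNum);
    }

    Version getVersion(int versionNum) {
      return versions.get(versionNum);
    }
  }

  /** Mirrors the unguarded shape: throws NullPointerException once the version is gone. */
  static boolean isKilledUnguarded(Store store, int versionNum) {
    Version version = store.getVersion(versionNum);
    return version.getStatus().equals(VersionStatus.KILLED);
  }

  /** Mirrors the null-guarded shape of the adjacent branch. */
  static boolean isKilledGuarded(Store store, int versionNum) {
    Version version = store.getVersion(versionNum);
    return version != null && version.getStatus() == VersionStatus.KILLED;
  }

  public static void main(String[] args) {
    Store store = new Store();
    store.addVersion(3, new Version(VersionStatus.STARTED));
    store.deleteVersion(3); // e.g. cleanup between two polls

    System.out.println("guarded: " + isKilledGuarded(store, 3));
    try {
      isKilledUnguarded(store, 3);
    } catch (NullPointerException e) {
      System.out.println("unguarded: NullPointerException");
    }
  }
}
```

The two helpers differ only in the null check, which is why the same lookup can be safe in one branch of a method and fatal in another.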
Solution
Short-circuit at the point of lookup: if version == null, return a terminal status with HTTP 200 so VPJ exits its poll loop cleanly. Distinguish:

- ARCHIVED if versionNum <= parentStore.getLargestUsedVersionNumber() (version existed once and was retired).
- NOT_CREATED if versionNum > parentStore.getLargestUsedVersionNumber() (version was never created).

This also eliminates a second unguarded deref further down at version.getStatus() in the deferred-swap terminal branch, which would have been reachable when isTargetRegionPushWithDeferredSwap=true with version==null.

Code changes
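As a rough illustration of the short-circuit described under Solution, the sketch below classifies an absent version at the point of lookup, so that later branches can use the value without further null checks. It follows the description as written (ARCHIVED or NOT_CREATED); the review commit quoted earlier replaces NOT_CREATED with ERROR. `ParentStore`, both enums, and the mapping in the `switch` are hypothetical placeholders, not the Venice implementation.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of the short-circuit; the types below are not Venice classes. */
final class AbsentVersionSketch {
  enum JobStatus { STARTED, COMPLETED, ERROR, ARCHIVED, NOT_CREATED }

  enum VersionStatus { STARTED, ONLINE, KILLED }

  /** Minimal stand-in for parent store metadata. */
  static final class ParentStore {
    private final Map<Integer, VersionStatus> versions = new HashMap<>();
    private int largestUsedVersionNumber = 0;

    void addVersion(int versionNum, VersionStatus status) {
      versions.put(versionNum, status);
      largestUsedVersionNumber = Math.max(largestUsedVersionNumber, versionNum);
    }

    void retireVersion(int versionNum) {
      versions.remove(versionNum); // largestUsedVersionNumber is deliberately kept
    }

    /** Returns null when the version is absent. */
    VersionStatus getVersionStatus(int versionNum) {
      return versions.get(versionNum);
    }

    int getLargestUsedVersionNumber() {
      return largestUsedVersionNumber;
    }
  }

  static JobStatus getJobStatus(ParentStore store, int versionNum) {
    VersionStatus version = store.getVersionStatus(versionNum);
    if (version == null) {
      // Absent but previously used: retired. Absent and above the high-water mark: never created.
      return versionNum <= store.getLargestUsedVersionNumber()
          ? JobStatus.ARCHIVED
          : JobStatus.NOT_CREATED;
    }
    // Every branch below may use `version` without a null check.
    switch (version) {
      case KILLED:
        return JobStatus.ERROR; // placeholder mapping, for illustration only
      case ONLINE:
        return JobStatus.COMPLETED; // placeholder mapping, for illustration only
      default:
        return JobStatus.STARTED;
    }
  }

  public static void main(String[] args) {
    ParentStore store = new ParentStore();
    store.addVersion(1, VersionStatus.ONLINE);
    store.addVersion(2, VersionStatus.STARTED);
    store.retireVersion(1);
    System.out.println(getJobStatus(store, 1)); // retired version
    System.out.println(getJobStatus(store, 3)); // version above the largest used number
  }
}
```

Placing the guard directly after the lookup, rather than at each dereference, is what lets a single check cover both the non-terminal branch and the deferred-swap branch mentioned above.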
Concurrency-Specific Checks
Both reviewer and PR author to verify
synchronized, RWLock) are used where needed.
ConcurrentHashMap, CopyOnWriteArrayList).
How was this PR tested?
Does this PR introduce any user-facing or breaking changes?