Skip to content

[controller] Fix NPE in /job when parent version is absent#2835

Open
jingy-li wants to merge 2 commits into
linkedin:mainfrom
jingy-li:jingyli/veng-12673-fix-job-npe
Open

[controller] Fix NPE in /job when parent version is absent#2835
jingy-li wants to merge 2 commits into
linkedin:mainfrom
jingy-li:jingyli/veng-12673-fix-job-npe

Conversation

@jingy-li

Copy link
Copy Markdown
Contributor

Problem Statement

Fixes VENG-12673/job venice-controller returns HTTP 500 from an unguarded NPE when the requested version is no longer present in parent store metadata.

VeniceParentHelixAdmin#getOffLineJobStatus looks up parentStore.getVersion(versionNum) (annotated @Nullable) and then dereferences it in the non-terminal branch (version.getStatus().equals(KILLED)). The version can legitimately be absent between two consecutive VPJ polls for several operational reasons:

  • Retention eviction (version exceeded retain-count).
  • Supersession (newer push completed, older retired).
  • Controller leadership change (new leader's metadata view briefly differs during refresh).
  • Manual cleanup or repush truncation.

The unguarded deref escapes the framework's recognized-exception path as HTTP 500. VPJ retries on 5xx, error-rate alarms fire, oncall is paged — all for a state the protocol already has terminal values for. An adjacent branch in the same method already null-guards the same value, confirming the omission is an oversight.

Solution

Short-circuit at the point of lookup: if version == null, return a terminal status with HTTP 200 so VPJ exits its poll loop cleanly. Distinguish:

  • ARCHIVED if versionNum <= parentStore.getLargestUsedVersionNumber() (version existed once and was retired).
  • NOT_CREATED if versionNum > parentStore.getLargestUsedVersionNumber() (version was never created).

This also eliminates a second unguarded deref further down at version.getStatus() in the deferred-swap terminal branch, which would have been reachable when isTargetRegionPushWithDeferredSwap=true with version==null.

Code changes

  • Added new code behind a config. If so list the config names and their default values in the PR description.
  • Introduced new log lines.
    • Confirmed if logs need to be rate limited to avoid excessive logging.

Concurrency-Specific Checks

Both reviewer and PR author to verify

  • Code has no race conditions or thread safety issues.
  • Proper synchronization mechanisms (e.g., synchronized, RWLock) are used where needed.
  • No blocking calls inside critical sections that could lead to deadlocks or performance degradation.
  • Verified thread-safe collections are used (e.g., ConcurrentHashMap, CopyOnWriteArrayList).
  • Validated proper exception handling in multi-threaded code to avoid silent thread termination.

How was this PR tested?

  • New unit tests added.
  • New integration tests added.
  • Modified or extended existing tests.
  • Verified backward compatibility (if applicable).

Does this PR introduce any user-facing or breaking changes?

  • No. You can skip the rest of this section.
  • Yes. Clearly explain the behavior change and its impact.

VeniceParentHelixAdmin#getOffLineJobStatus looks up the requested version in
parent store metadata. When the version is no longer present — retention
eviction, supersession, controller leadership change, or manual cleanup
between two consecutive VPJ polls — the lookup returns null and a downstream
branch (the KILLED check in the non-terminal path) dereferences it,
escaping the framework's catch-all as HTTP 500. VPJ retries on 5xx, error
alarms fire, oncall is paged.

The protocol already defines first-class terminal values (ARCHIVED,
NOT_CREATED) for this state, and an adjacent branch in the same method
already null-guards the same value. Short-circuit at the point of lookup
and return ARCHIVED (if versionNum <= largestUsedVersionNumber) or
NOT_CREATED (otherwise) with HTTP 200 so VPJ exits its poll loop cleanly.

Adds two tests covering both branches.
@jingy-li jingy-li requested review from misyel and sofiaz11 May 28, 2026 21:43
// letting an unguarded deref below escape as HTTP 500.
if (version == null) {
ExecutionStatus terminalStatus = versionNum <= parentStore.getLargestUsedVersionNumber()
? ExecutionStatus.ARCHIVED

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For archived status, will VPJ stop polling and mark as failed?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, VPJ will fail the push when it sees ARCHIVED

Address review: NOT_CREATED is non-terminal (isTerminal=false), so VPJ would
keep polling rather than exiting its loop. Use the terminal ERROR status for
the versionNum > largestUsedVersionNumber branch — a version that was never
created is a genuine inconsistency, not a transient absence. Keep ARCHIVED for
the retired branch (versionNum <= largestUsedVersionNumber) and differentiate
the two cases via distinct status-details strings.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@jingy-li jingy-li requested a review from sofiaz11 June 4, 2026 19:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants