@pthirun pthirun commented Feb 3, 2026

Problem Statement

Venice currently doesn't provide a way to configure unclean.leader.election.enable for Kafka topics. For realtime topics, disabling unclean leader election is important to prevent data loss at the cost of availability. Operators need fine-grained control to disable unclean leader election specifically for RT topics while keeping the cluster default for version topics.

Solution

Added a new configuration, kafka.unclean.leader.election.enable.rt.topics, that lets operators control unclean leader election for realtime topics only:

  • When set to false, prevents data loss by ensuring only in-sync replicas can become leaders for RT topics
  • If not set, uses Kafka cluster's default configuration
  • Configuration is RT-topic specific and doesn't affect version topics

Implementation details:

  • Added uncleanLeaderElectionEnable field to PubSubTopicConfiguration
  • Updated ApacheKafkaAdminAdapter to marshall/unmarshall the Kafka topic config property
  • Modified TopicManager to accept and propagate the config when creating topics
  • Updated VeniceHelixAdmin and RealTimeTopicSwitcher to read from cluster config and pass to RT topic creation
  • Maintained backward compatibility with existing constructors
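
To make the "set only if present" behavior concrete, here is a minimal sketch of how an optional flag could be marshalled into Kafka topic properties and read back. The class and method names (`TopicPropertiesSketch`, `toTopicProperties`, `fromTopicProperties`) are hypothetical and are not the actual ApacheKafkaAdminAdapter code; only the Kafka topic-level key `unclean.leader.election.enable` is taken from Kafka itself.

```java
import java.util.Optional;
import java.util.Properties;

// Hypothetical helper for illustration; not the real ApacheKafkaAdminAdapter.
final class TopicPropertiesSketch {
  // Kafka's topic-level config key for unclean leader election.
  static final String UNCLEAN_LEADER_ELECTION_ENABLE = "unclean.leader.election.enable";

  // Marshal: write the property only when a value was configured, so that an
  // absent value leaves the Kafka cluster default in effect.
  static Properties toTopicProperties(Optional<Boolean> uncleanLeaderElectionEnable) {
    Properties topicProperties = new Properties();
    uncleanLeaderElectionEnable.ifPresent(
        enabled -> topicProperties.setProperty(UNCLEAN_LEADER_ELECTION_ENABLE, Boolean.toString(enabled)));
    return topicProperties;
  }

  // Unmarshal: a missing property maps back to Optional.empty().
  static Optional<Boolean> fromTopicProperties(Properties topicProperties) {
    return Optional.ofNullable(topicProperties.getProperty(UNCLEAN_LEADER_ELECTION_ENABLE))
        .map(Boolean::parseBoolean);
  }
}
```

Keeping the value as `Optional<Boolean>` rather than a plain `boolean` is what distinguishes "explicitly false" from "not configured" in this sketch.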

Testing:

  • Added comprehensive unit tests for config storage, marshalling, and propagation
  • Added integration test verifying the config is correctly applied to actual Kafka topics
  • All existing tests continue to pass

Code changes

  • Added new code behind a config. Config name: kafka.unclean.leader.election.enable.rt.topics, default: uses Kafka cluster default (not explicitly set)
  • Introduced new log lines.

Concurrency-Specific Checks

Both reviewer and PR author to verify

  • Code has no race conditions or thread safety issues - only adds config propagation, no new concurrency
  • Proper synchronization mechanisms - N/A, no new concurrent code
  • No blocking calls inside critical sections - N/A
  • Verified thread-safe collections - N/A
  • Validated proper exception handling - Uses existing exception handling patterns

How was this PR tested?

  • New unit tests added:
    • PubSubTopicConfigurationTest.testUncleanLeaderElectionEnableConfiguration - tests field storage and retrieval
    • ApacheKafkaAdminAdapterTest.testUncleanLeaderElectionEnableConfig - tests marshalling/unmarshalling
    • RealTimeTopicSwitcherTest.testUncleanLeaderElectionConfigForRTTopics - tests config propagation
  • New integration tests added:
    • TopicManagerE2ETest.testUncleanLeaderElectionConfigForRealtimeTopic - end-to-end test verifying config is correctly applied to actual Kafka topics
  • Modified or extended existing tests - Updated RealTimeTopicSwitcherTest.testEnsurePreconditions to handle new parameter
  • Verified backward compatibility - existing constructors delegate to new ones with Optional.empty()
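
The backward-compatibility point above can be illustrated with a small sketch of constructor delegation. `RtTopicCreatorSketch` and its fields are hypothetical stand-ins, not the real Venice classes; the only detail taken from the PR description is that the older constructor passes `Optional.empty()` to the new one.

```java
import java.util.Optional;

// Hypothetical class showing the delegation pattern; not Venice's actual code.
final class RtTopicCreatorSketch {
  private final int minIsr;
  private final Optional<Boolean> uncleanLeaderElectionEnable;

  // Pre-existing constructor shape: callers unaware of the new setting keep
  // compiling and get "not configured" semantics.
  RtTopicCreatorSketch(int minIsr) {
    this(minIsr, Optional.empty());
  }

  // New constructor that accepts the optional setting.
  RtTopicCreatorSketch(int minIsr, Optional<Boolean> uncleanLeaderElectionEnable) {
    this.minIsr = minIsr;
    this.uncleanLeaderElectionEnable = uncleanLeaderElectionEnable;
  }

  int getMinIsr() {
    return minIsr;
  }

  Optional<Boolean> getUncleanLeaderElectionEnable() {
    return uncleanLeaderElectionEnable;
  }
}
```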

Does this PR introduce any user-facing or breaking changes?

  • No. You can skip the rest of this section.

This is a new optional configuration that defaults to not being set (uses cluster defaults). Existing behavior is unchanged unless operators explicitly configure it.

    .put(TopicConfig.MIN_IN_SYNC_REPLICAS_CONFIG, Integer.toString(minIsrConfig)));
pubSubTopicConfiguration.getUncleanLeaderElectionEnable()
    .ifPresent(
        uncleanLeaderElection -> topicProperties
Contributor Author

The setting is applied here. unmarshallProperties is called from the createTopic method.

Contributor Author

The property is set only if the setting is present, so the Kafka cluster configuration applies by default.

@pthirun pthirun changed the title from "[Controller] Add RT topic creation config to set unclean leader election" to "[controller][common] Add support for configuring unclean leader election for realtime topics" Feb 3, 2026
@pthirun pthirun changed the title from "[controller][common] Add support for configuring unclean leader election for realtime topics" to "[controller] Add support for configuring unclean leader election for realtime topics" Feb 3, 2026
@mynameborat
Contributor

Do we need to synchronize this configuration across the realtime topic and the version topic? If not, it might be better to clarify the decision in this PR or through code comments as part of the configuration.

  • My gut says we should have this configuration for both RT and VT. However, I do see the other side: VT is relatively ephemeral compared to RT, but it still cannot afford data loss.
  • What about the RT repartitioning feature? Aren't we creating a new RT for it, in which case shouldn't it respect what the original RT topic was configured with, as opposed to reading this configuration from the controller cluster configuration? The former is needed to maintain functional parity, while the latter could regress the behavior (likely with store migration + RT repartitioning or rollback + RT repartitioning).

@pthirun pthirun marked this pull request as draft February 3, 2026 21:17

pthirun commented Feb 3, 2026

TODO: This PR needs to add the ULE config to the store configuration so we can keep track of which stores have this setting enabled or disabled. Currently, the config only shows the setting that new RTs will adopt.

@pthirun pthirun force-pushed the kafka-add-unclean-leader-election-config branch from 13938a8 to 2f0d39c Compare February 12, 2026 01:22
Add uncleanLeaderElectionEnabledForRTTopics as a store-level config using
the tri-state pattern (NOT_SPECIFIED/ENABLED/DISABLED). When NOT_SPECIFIED,
falls back to the cluster-level config. This enables per-store tracking of
ULE settings and preserves the setting during store migration.

Changes:
- New Avro schema versions (StoreMetaValue v41, AdminOperation v96)
- Store interface/impl (Store, ZKStore, ReadOnlyStore, SystemStore, StoreInfo)
- UpdateStoreQueryParams with migration constructor support
- Controller logic (VeniceParentHelixAdmin, AdminExecutionTask,
VeniceHelixAdmin)
- RealTimeTopicSwitcher store-level override with cluster-level fallback
- resolveUncleanLeaderElection helper for store-then-cluster resolution
- Tests for override, fallback, and resolution logic

Co-Authored-By: Claude Opus 4.6 <[email protected]>
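
The store-then-cluster resolution described in this commit message could look roughly like the sketch below. The enum values mirror the tri-state named above (NOT_SPECIFIED/ENABLED/DISABLED), but the class `UleResolutionSketch`, the `TriState` type name, and the `resolve` signature are assumptions made for illustration; the real resolveUncleanLeaderElection helper may differ.

```java
import java.util.Optional;

// Hypothetical sketch of store-level override with cluster-level fallback.
final class UleResolutionSketch {
  enum TriState {
    NOT_SPECIFIED, ENABLED, DISABLED
  }

  // An explicit store-level value wins; NOT_SPECIFIED falls back to the
  // cluster-level config, which may itself be absent (Kafka default applies).
  static Optional<Boolean> resolve(TriState storeLevel, Optional<Boolean> clusterLevel) {
    switch (storeLevel) {
      case ENABLED:
        return Optional.of(true);
      case DISABLED:
        return Optional.of(false);
      default:
        return clusterLevel;
    }
  }
}
```

Under these assumptions, a store that explicitly sets DISABLED keeps that value through migration regardless of the destination cluster's config, while a store left at NOT_SPECIFIED follows whichever cluster it lands in.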