
Fixed checking if append_entries_request batches are already present in follower log #25018

Open

mmaslankaprv wants to merge 7 commits into dev from fix-matching-entries-check
Conversation

mmaslankaprv
Member

@mmaslankaprv mmaslankaprv commented Feb 4, 2025

Background

When a follower receives an append entries request whose prev_log_index is smaller than its own prev_log_index, it validates whether the batches from the request match its own log (by checking each batch offset and corresponding term). If that is the case, the batches are skipped to prevent truncation of valid batches and avoid data loss.
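
A minimal sketch of this matching check, using hypothetical types and names (the real code operates on model::record_batch, model::offset and model::term_id):

#include <cstdint>
#include <functional>
#include <optional>
#include <vector>

// Hypothetical batch descriptor used only for this illustration.
struct batch_info {
    int64_t base_offset;
    int64_t last_offset;
    int64_t term;
};

// Walk the request batches in order and return the base offset of the first
// batch that does not already match the follower's log, or std::nullopt when
// every batch matches (nothing needs to be truncated or appended).
std::optional<int64_t> first_non_matching_offset(
  const std::vector<batch_info>& request_batches,
  const std::function<std::optional<int64_t>(int64_t)>& local_term_at) {
    for (const auto& b : request_batches) {
        auto local_term = local_term_at(b.base_offset);
        if (!local_term || *local_term != b.term) {
            return b.base_offset; // first mismatch: truncate/append from here
        }
    }
    return std::nullopt;
}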

Negative append_entries_request::prev_log_index

The validation of already matching batches was broken if they happened to be at the beginning of the log. In this case the prev_log_index is not initialised (it is negative). This case was not correctly handled by the logic that calculates the next offset when checking for matching batches.
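
As a hedged illustration of the fix (not the actual patch): an uninitialised prev_log_index is represented by a negative value, so the first offset to compare against must fall back to 0 instead of being derived from the negative sentinel.

#include <cstdint>

// Hypothetical helper: offset of the first batch to compare against the
// follower's log. A negative (uninitialised) prev_log_index means the request
// starts at the very beginning of the log, i.e. at offset 0.
int64_t first_offset_to_check(int64_t prev_log_index) {
    return prev_log_index < 0 ? int64_t{0} : prev_log_index + 1;
}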

Replying with success when all request batches match

When a follower receives an append entries request with a vector of records that are all present in its own log, and their offsets and terms match, it should reply with success and the correct last_dirty_log_index.
This way the leader, instead of moving the follower's next_offset backwards, can start the recovery process and deliver the batches which the follower is missing.
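
A simplified sketch of this reply path, with hypothetical field names (the real append_entries_reply carries more state):

#include <cstdint>

// Hypothetical, trimmed-down reply type used only for this illustration.
struct append_entries_reply {
    bool success{false};
    int64_t last_dirty_log_index{-1};
};

// When every batch in the request already matches the follower's log, reply
// with success and the last matching offset, so the leader can start recovery
// from the first batch the follower is actually missing instead of repeatedly
// moving next_offset backwards.
append_entries_reply ack_all_matching(int64_t last_matching_offset) {
    append_entries_reply reply;
    reply.success = true;
    reply.last_dirty_log_index = last_matching_offset;
    return reply;
}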

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v24.3.x
  • v24.2.x
  • v24.1.x

Release Notes

Bug Fixes

  • Fixes a very rare situation in which a Raft leader can enter an infinite loop while trying to recover a follower.


reply.last_dirty_log_index = adjusted_prev_log_index;
// limit the last flushed offset as the adjusted_prev_log_index
// may have not yet been flushed.
Contributor

Does it mean _flushed_offset may store an offset that has not been flushed? In this case, could you add a comment to _flushed_offset to explain what it actually denotes?

Member Author

the adjusted_prev_log_index may be greater than _flushed_offset

Contributor

yeah, but why can't we reply with _flushed_offset if it is larger than adjusted_prev_log_index?

Member Author

we do not want the leader to see a flushed offset which is larger than the last log offset.

Contributor

so is reply.last_flushed_log_index the latest flushed log index that matches with the leader's log?

Member Author

no, not really. This check is to hold an invariant of flushed_offset <= log_end_offset

Contributor

What is log_end_offset here? Is it a field of any structure?

Contributor

reply.last_flushed_log_index = std::min(adjusted_prev_log_index, _flushed_offset);

does this run a risk of last_flushed_log_index moving backwards, as seen by the leader? (which I think will have implications on leader commit index computation)

Perhaps we should just set

reply.last_flushed_log_index = _flushed_offset;

and let the _flushed_offset computation be monotonic
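
To make the trade-off discussed in this thread concrete, here is a small hypothetical sketch of the two options (field names mirror the snippet above, but this is not the actual implementation):

#include <algorithm>
#include <cstdint>

// Trimmed-down reply type for illustration only.
struct reply_t {
    int64_t last_dirty_log_index{-1};
    int64_t last_flushed_log_index{-1};
};

reply_t make_reply(int64_t adjusted_prev_log_index, int64_t flushed_offset) {
    reply_t reply;
    reply.last_dirty_log_index = adjusted_prev_log_index;
    // Option taken in the patch: clamp, so the invariant
    //   last_flushed_log_index <= last_dirty_log_index
    // holds even when flushed_offset is ahead of the index being reported.
    reply.last_flushed_log_index
      = std::min(adjusted_prev_log_index, flushed_offset);
    // Alternative raised in review:
    //   reply.last_flushed_log_index = flushed_offset;
    // relying on flushed_offset itself being monotonic.
    return reply;
}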

@mmaslankaprv mmaslankaprv marked this pull request as ready for review February 4, 2025 16:19
@vbotbuildovich
Collaborator

vbotbuildovich commented Feb 4, 2025

CI test results

test results on build#61563
test_id test_kind job_url test_status passed
rptest.tests.compaction_recovery_test.CompactionRecoveryTest.test_index_recovery ducktape https://buildkite.com/redpanda/redpanda/builds/61563#0194d20c-ff99-4b94-b2f7-a64d44ed7679 FLAKY 1/3
rptest.tests.datalake.compaction_test.CompactionGapsTest.test_translation_no_gaps.cloud_storage_type=CloudStorageType.S3.catalog_type=CatalogType.REST_JDBC ducktape https://buildkite.com/redpanda/redpanda/builds/61563#0194d20c-ff98-4ec0-9f56-38befe604032 FLAKY 1/2
test results on build#61682
test_id test_kind job_url test_status passed
rptest.tests.compaction_recovery_test.CompactionRecoveryTest.test_index_recovery ducktape https://buildkite.com/redpanda/redpanda/builds/61682#0194dcb6-b3e7-4275-b585-63769e3a91eb FLAKY 1/3
rptest.tests.compaction_recovery_test.CompactionRecoveryUpgradeTest.test_index_recovery_after_upgrade ducktape https://buildkite.com/redpanda/redpanda/builds/61682#0194dc9a-225b-4365-b41c-e42b927c3e92 FLAKY 1/2
rptest.tests.datalake.compaction_test.CompactionGapsTest.test_translation_no_gaps.cloud_storage_type=CloudStorageType.S3.catalog_type=CatalogType.REST_HADOOP ducktape https://buildkite.com/redpanda/redpanda/builds/61682#0194dcb6-b3e7-4275-b585-63769e3a91eb FLAKY 1/2
rptest.tests.datalake.compaction_test.CompactionGapsTest.test_translation_no_gaps.cloud_storage_type=CloudStorageType.S3.catalog_type=CatalogType.REST_JDBC ducktape https://buildkite.com/redpanda/redpanda/builds/61682#0194dcb6-b3e4-449e-a254-c66f8797a6ea FLAKY 1/2
rptest.tests.datalake.custom_partitioning_test.DatalakeCustomPartitioningTest.test_basic.cloud_storage_type=CloudStorageType.S3.catalog_type=CatalogType.REST_HADOOP ducktape https://buildkite.com/redpanda/redpanda/builds/61682#0194dcb6-b3e5-4007-9090-5e5e97766310 FLAKY 1/2
rptest.tests.partition_movement_test.PartitionMovementTest.test_availability_when_one_node_down ducktape https://buildkite.com/redpanda/redpanda/builds/61682#0194dc9a-225a-472c-827d-daaa26f07098 FLAKY 1/2
rptest.tests.scaling_up_test.ScalingUpTest.test_scaling_up_with_recovered_topic ducktape https://buildkite.com/redpanda/redpanda/builds/61682#0194dcb6-b3e6-4ddf-b17d-2a71ef0b0f40 FLAKY 1/2
rptest.tests.write_caching_fi_test.WriteCachingFailureInjectionTest.test_crash_all ducktape https://buildkite.com/redpanda/redpanda/builds/61682#0194dc9a-225b-4ae0-9014-9e69b7cda65e FLAKY 1/2

private:
model::node_id _id;
model::revision_id _revision;
prefix_logger _logger;
ss::sstring _base_directory;
config::mock_property<size_t> _max_inflight_requests{16};
config::mock_property<size_t> _max_queued_bytes{1_MiB};
config::mock_property<size_t> _default_recovery_read_size{32_KiB};
Contributor

any reason we change it for existing tests?

Member Author

no particular reason, i will make sure it is the same as before

@bashtanov
Contributor

Assertions triggered in a function body are not propagated to the test itself

Why is that? Anything wrong with the macro? AFAIK it's meant to work with both gtest and boost.

Contributor

@bharathv bharathv left a comment

lgtm modulo one question, took me a bit to digest the change, had to dig up Alexey's change that added these checks. Would be nice to get a blessing from @ztlpn too.


reply.last_dirty_log_index = adjusted_prev_log_index;
// limit the last flushed offset as the adjusted_prev_log_index
// may have not yet been flushed.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reply.last_flushed_log_index = std::min(adjusted_prev_log_index, _flushed_offset);

does this run a risk of last_flushed_log_index moving backwards, as seen by the leader? (which I think will have implications on leader commit index computation?

Perhaps we should just set

reply.last_flushed_log_index = _flushed_offset;

and let the _flushed_offset computation be monotonic

src/v/raft/tests/raft_fixture.cc (outdated comment, resolved)
When a follower receives an append entries request whose `prev_log_index` is
smaller than its own `prev_log_index`, it validates whether the
batches from the request match its own log (by checking each batch
offset and corresponding term). If that is the case, the batches are
skipped to prevent truncation of valid batches and avoid data loss.

The validation of already matching batches was broken if they happened
to be at the beginning of the log. In this case the `prev_log_index` is
not initialized, being negative. This case was not correctly handled by
the logic calculating the next offset when checking matching batches.

That led to a situation in which a range of batches starting at offset 0
never matched.

Fixed the issue by correctly adjusting the `prev_log_index` if it is
uninitialized.

Signed-off-by: Michał Maślanka <[email protected]>
When a follower receives an append entries request with a vector of
records that are all present in its own log, and their offsets and terms
match, it should reply with success and the correct `last_dirty_log_index`.
This way the leader, instead of moving the follower's `next_offset` backwards,
can start the recovery process and deliver the batches which the follower is
missing.

Signed-off-by: Michał Maślanka <[email protected]>
@mmaslankaprv mmaslankaprv force-pushed the fix-matching-entries-check branch from e67b4df to 12b3024 on February 6, 2025 12:49
Assertions triggered in a function body are not propagated to the test
itself. Change the method to throw an exception in case of a timeout
instead of using an assertion.

Signed-off-by: Michał Maślanka <[email protected]>
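
A sketch of what such a change can look like in plain C++; the actual fixture uses Seastar futures, so this is only illustrative:

#include <chrono>
#include <functional>
#include <stdexcept>
#include <thread>

// Instead of asserting inside the helper (which the calling test may never
// observe as a failure), throw on timeout so the error propagates to the
// test framework as a regular test failure.
void wait_until(
  const std::function<bool()>& predicate,
  std::chrono::milliseconds timeout) {
    const auto deadline = std::chrono::steady_clock::now() + timeout;
    while (!predicate()) {
        if (std::chrono::steady_clock::now() > deadline) {
            throw std::runtime_error("timed out waiting for condition");
        }
        std::this_thread::sleep_for(std::chrono::milliseconds(10));
    }
}
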
The reply interceptor allows the test creator to modify or drop a reply
that is about to be processed by the RPC requester. This allows tests to
take more control over the Raft protocol behavior and to test some rare
edge cases which might be hard to trigger otherwise.

Signed-off-by: Michał Maślanka <[email protected]>
@mmaslankaprv mmaslankaprv force-pushed the fix-matching-entries-check branch from 12b3024 to 5c4c17b on February 6, 2025 12:57
@mmaslankaprv mmaslankaprv force-pushed the fix-matching-entries-check branch from 5c4c17b to ed45488 on February 6, 2025 17:35
Contributor

@bashtanov bashtanov left a comment

A few questions as I'm not sure I understand the test.

Comment on lines +707 to +711
std::ranges::copy(
_nodes | std::views::keys
| std::views::filter(
[leader_id](model::node_id id) { return id != leader_id; }),
std::back_inserter(followers));
Contributor

nit: use copy_if?
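
A sketch of what the suggested copy_if variant might look like, assuming the same _nodes map and followers vector as in the snippet above (requires <algorithm>, <ranges> and <iterator>):

std::ranges::copy_if(
  _nodes | std::views::keys,
  std::back_inserter(followers),
  [leader_id](model::node_id id) { return id != leader_id; });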

/**
* Recover communication and wait for the intercept to trigger
*/
new_leader_node.reset_dispatch_handlers();
Contributor

This will enable the new leader to send vote requests to the old leader. I guess it won't anyway, as it has been elected already. Do we need this?

@@ -395,7 +395,8 @@ class raft_fixture
chunked_vector<model::record_batch> make_batches(
size_t batch_count,
size_t batch_record_count,
size_t record_payload_size) {
size_t record_payload_size,
model::term_id term = model::term_id(0)) {
Contributor

did you decide to keep it just in case it is needed in future?

* Recover communication and wait for the intercept to trigger
*/
new_leader_node.reset_dispatch_handlers();
co_await reply_intercepted.wait([&] { return intercept_count > 5; });
Contributor

We don't produce anything after the second election. What are the 20+ messages that are replicated from the new leader to the old one?
