Test `find_end_of_wal_segment` and fix its contrecord handling #1574

yeputons · 2022-04-25T17:49:32Z

Related to #544: addresses some problems with handling WAL records that cross a segment boundary.

This PR starts fixing them:

- Add a separate library/binary for carefully crafting weird WALs using a Postgres server. Right now it can craft three kinds of WAL: a simple one, one ending in a record that crosses a segment boundary, and a similar one with a small message appended.
  - The latter two were not handled correctly, and prior to "Create timeline on safekeepers explicitly" #1280 (or somewhere around it), it was even possible to crash Safekeeper. It is harder now because Safekeeper starts find_end_of_wal from the commit LSN.
- Add lots of tracing to find_end_of_wal_segment. It's called on Safekeeper startup only, so it should not incur a big performance penalty.
- Fix a bug in find_end_of_wal_segment: it worked incorrectly when the first record in the last segment is a contrecord. The old logic was to crash; the new logic is to skip this record and find at least one full valid record in the segment.

I think it's better to get the barebones in first, but there are some TODOs left:

- Handle corner cases in find_end_of_wal_segment marked as TODO in this PR. I believe they never happen in our tests.
- Backtrack if there is no full valid record in the last segment file. The test for this case is currently #[ignore]d.

libs/postgres_ffi/src/xlog_utils.rs

antons-antons

I was looking at #932 and had a similar change to yours, but you beat me to the CR and have tests. Want to link the issue with this PR?

Also, what about the case when there are no log records in the segment? It's certainly a corner case, but nonetheless possible (e.g. with xact_commit, or with logical)

antons-antons · 2022-04-28T18:26:15Z

libs/postgres_ffi/src/xlog_utils.rs

    let mut rec_hdr = [0u8; XLOG_RECORD_CRC_OFFS];

+    trace!("find_end_of_wal_segment(data_dir={}, segno={}, tli={}, wal_seg_size={}, start_offset=0x{:x})", data_dir.display(), segno, tli, wal_seg_size, start_offset);


Should we validate here that:

- start_offset is < wal_seg_size?
- wal_seg_size is a power of 2 and between 1 MB and 1 GB?
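The two checks suggested above could look roughly like the sketch below. It is not code from this PR: `validate_segment_args` and the two size constants are hypothetical names, and the 1 MB / 1 GB bounds are simply the limits mentioned in the comment.

```rust
/// Smallest and largest WAL segment sizes considered valid (1 MB and 1 GB),
/// taken from the review comment; the constant names are illustrative.
const WAL_SEG_MIN_SIZE: usize = 1024 * 1024;
const WAL_SEG_MAX_SIZE: usize = 1024 * 1024 * 1024;

/// Hypothetical helper, not part of the PR: reject obviously invalid arguments
/// before scanning a segment.
fn validate_segment_args(wal_seg_size: usize, start_offset: usize) -> Result<(), String> {
    // A valid segment size is a power of two...
    if !wal_seg_size.is_power_of_two() {
        return Err(format!("wal_seg_size {} is not a power of 2", wal_seg_size));
    }
    // ...within the allowed range.
    if !(WAL_SEG_MIN_SIZE..=WAL_SEG_MAX_SIZE).contains(&wal_seg_size) {
        return Err(format!("wal_seg_size {} is out of range", wal_seg_size));
    }
    // The scan must start inside the segment.
    if start_offset >= wal_seg_size {
        return Err(format!(
            "start_offset 0x{:x} is not inside a segment of size {}",
            start_offset, wal_seg_size
        ));
    }
    Ok(())
}
```

Returning an error instead of asserting would let the caller decide whether bad arguments are fatal.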

antons-antons · 2022-04-28T18:35:03Z

libs/postgres_ffi/src/xlog_utils.rs

+                offs += XLOG_SIZE_OF_XLOG_LONG_PHD;
                if (xlp_info & XLP_FIRST_IS_CONTRECORD) != 0 {
-                    offs += ((xlp_rem_len + 7) & !7) as usize;
+                    trace!("  first record is contrecord");
+                    skipping_first_contrecord = true;
+                    contlen = xlp_rem_len as usize;
+                    // we will immediately skip to start_offset, so adjust contlen
+                    if offs < start_offset {
+                        assert!(start_offset < XLOG_BLCKSZ);
+                        // TODO: test this case
+                        if offs + contlen <= start_offset {
+                            contlen = 0;
+                            // keep skipping_first_contrecord even if contlen == 0, the flag will become false on the next iteration
+                        } else {
+                            warn!("trying to find_end_of_wal with start_offset ({}) in the middle of the first contrecord (xlp_rem_len={})", start_offset, xlp_rem_len);
+                            contlen -= start_offset - offs;
+                        }
+                    }
+                } else {
+                    trace!("  first record is not contrecord");
                }
            } else {
                offs += XLOG_SIZE_OF_XLOG_SHORT_PHD;
            }


We're handling XLP_FIRST_IS_CONTRECORD only for the first page in the segment, but a page in the middle of the segment can also have this flag. Should we handle it?
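To illustrate the question, here is a minimal sketch that lists every page of a segment whose header has the flag set, not just the first one. It assumes the standard Postgres page header layout (`xlp_magic: u16` followed by `xlp_info: u16`, little-endian) and the default 8192-byte page size; `pages_starting_with_contrecord` is a hypothetical helper, not a function from this PR.

```rust
/// Flag bit in `xlp_info`: the page starts with the continuation of a record.
const XLP_FIRST_IS_CONTRECORD: u16 = 0x0001;
/// WAL page size; 8192 is the Postgres default and is an assumption here.
const XLOG_BLCKSZ: usize = 8192;

/// Hypothetical helper: indexes of the pages in `segment` whose header has
/// XLP_FIRST_IS_CONTRECORD set. Assumes `xlp_info` is a little-endian u16
/// at byte offset 2 of each page.
fn pages_starting_with_contrecord(segment: &[u8]) -> Vec<usize> {
    segment
        .chunks_exact(XLOG_BLCKSZ)
        .enumerate()
        .filter(|(_, page)| {
            let xlp_info = u16::from_le_bytes([page[2], page[3]]);
            xlp_info & XLP_FIRST_IS_CONTRECORD != 0
        })
        .map(|(page_no, _)| page_no)
        .collect()
}
```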

antons-antons · 2022-04-28T18:36:56Z

libs/postgres_ffi/src/xlog_utils.rs

        } else if contlen == 0 {
            let page_offs = offs % XLOG_BLCKSZ;


I find this function has unnecessary nesting, do you think we can simplify this?

antons-antons · 2022-04-28T18:45:24Z

libs/postgres_ffi/src/xlog_utils.rs

        // beginning of the next record
        } else if contlen == 0 {
            let page_offs = offs % XLOG_BLCKSZ;
            let xl_tot_len = LittleEndian::read_u32(&buf[page_offs..page_offs + 4]) as usize;
+            trace!("offs=0x{:x}: new record, xl_tot_len={}", offs, xl_tot_len);
            if xl_tot_len == 0 {


I think we should handle the case of xl_tot_len < XLOG_SIZE_OF_XLOG_RECORD and at least warn about this.
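A sketch of what such a check might look like, assuming the fixed record header is 24 bytes (the CRC offset of 20 used elsewhere in this thread plus a 4-byte `xl_crc`). The `TotLenCheck` enum and `check_xl_tot_len` are hypothetical names; the real function would presumably `warn!` in the `TooShort` case.

```rust
/// Size of the fixed XLogRecord header; 24 bytes is assumed here.
const XLOG_SIZE_OF_XLOG_RECORD: usize = 24;

/// Hypothetical classification of the `xl_tot_len` field of a new record.
#[derive(Debug, PartialEq)]
enum TotLenCheck {
    /// Zero length: nothing was written here, treat as end of WAL.
    EndOfWal,
    /// Non-zero but shorter than the header itself: corrupt or garbage data.
    TooShort(usize),
    /// At least large enough to hold a header.
    Plausible(usize),
}

/// Read the little-endian `xl_tot_len` at `page_offs` and classify it.
fn check_xl_tot_len(buf: &[u8], page_offs: usize) -> TotLenCheck {
    let mut raw = [0u8; 4];
    raw.copy_from_slice(&buf[page_offs..page_offs + 4]);
    let xl_tot_len = u32::from_le_bytes(raw) as usize;
    if xl_tot_len == 0 {
        TotLenCheck::EndOfWal
    } else if xl_tot_len < XLOG_SIZE_OF_XLOG_RECORD {
        TotLenCheck::TooShort(xl_tot_len)
    } else {
        TotLenCheck::Plausible(xl_tot_len)
    }
}
```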

antons-antons · 2022-04-28T18:49:53Z

libs/postgres_ffi/src/xlog_utils.rs

+                if skipping_first_contrecord {
+                    // do nothing, the flag will go down on next iteration
+                    trace!("  first conrecord has been just completed");
+                } else if crc == xl_crc {


In addition to the crc validation we can check if xl_prev points to last_valid_rec_pos (with the only exception of the first log record received by SK)
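A rough sketch of that extra check, under two assumptions: `xl_prev` is the little-endian `u64` at bytes 8..16 of the record header (after `xl_tot_len` and `xl_xid`), and the caller knows the LSN of the previous valid record. `xl_prev_matches` is a hypothetical helper; `None` stands for the exception mentioned above, the first record received by the Safekeeper.

```rust
/// Hypothetical helper: does the header's `xl_prev` point at the previous
/// valid record? `expected_prev_lsn` is `None` when there is no previous
/// record to compare against, in which case the check is skipped.
fn xl_prev_matches(rec_hdr: &[u8], expected_prev_lsn: Option<u64>) -> bool {
    // xl_tot_len (4 bytes) and xl_xid (4 bytes) precede xl_prev.
    let mut raw = [0u8; 8];
    raw.copy_from_slice(&rec_hdr[8..16]);
    let xl_prev = u64::from_le_bytes(raw);
    match expected_prev_lsn {
        None => true,
        Some(lsn) => xl_prev == lsn,
    }
}
```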

libs/postgres_ffi/wal_generate/src/lib.rs

This function is to simplify complex WAL generation in neondatabase/neon#1574. `pg_logical_emit_message` is the easiest way to get a big WAL record, but:

* If it's transactional, it gets a `COMMIT` record right after.
* If it's not, WAL is not flushed at all.

The function helps here, so we don't rely on the background WAL writer. I suspect the plain `xlogflush()` name may collide in the future, hence the prefix.

yeputons · 2022-05-18T15:01:34Z

Removing the complex WAL generation logic from 891d052 in favor of a simpler strategy: emit the necessary logical message, flush it to disk via neon_xlogflush from neondatabase/postgres#160, kill the server. No COMMIT record.

Disabling autovacuum just in case, at Anton's suggestion. Maybe some other stuff needs to be disabled as well.

yeputons · 2022-05-18T23:48:09Z

Will rebase 8e5562bf2c2f419482bd75cfda995f03568823db onto main and squash all commits, so it's ready to merge after review.

libs/postgres_ffi/src/xlog_utils.rs

* Actual generation logic is in a separate crate `postgres_ffi/wal_generate`
* The crate also provides a binary for debug purposes akin to `initdb`
* Two tests currently fail and are ignored
* There is no easy way to test this directly in Safekeeper as it starts restoring from commit_lsn. So testing would require disconnecting Safekeeper just after it has received the WAL, but before it is committed.

Previous invariant: `crc` contains an "unfinalized" CRC32 value, its one's complement, like in Postgres before FIN_CRC32C. New invariant: `crc` always contains a "finalized" CRC32 value; this is the semantics of crc32c_append, so we don't need to invert the CRC manually.
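The difference between the two invariants can be shown with a small self-contained sketch. This is a plain bitwise CRC-32C written for illustration, not the crate the PR uses; `crc32c_update_raw` is a hypothetical name for the "unfinalized" step, and `crc32c_append` here only mimics the finalized-in, finalized-out semantics described above.

```rust
/// One "unfinalized" CRC-32C step: `state` is the raw register value
/// (the one's complement of the finalized CRC). Bitwise, reflected
/// Castagnoli polynomial; slow but easy to verify.
fn crc32c_update_raw(mut state: u32, data: &[u8]) -> u32 {
    for &b in data {
        state ^= b as u32;
        for _ in 0..8 {
            state = if state & 1 != 0 {
                (state >> 1) ^ 0x82F6_3B78
            } else {
                state >> 1
            };
        }
    }
    state
}

/// "Finalized" semantics: takes and returns a finalized CRC, so callers can
/// chain calls across chunks without inverting anything themselves.
fn crc32c_append(crc: u32, data: &[u8]) -> u32 {
    !crc32c_update_raw(!crc, data)
}
```

With the finalized convention a record body split across pages can be fed chunk by chunk, starting from 0, and the result compared directly with the stored `xl_crc`.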

Now it reflects the field it's mirroring.

Also enable corresponding test.

yeputons · 2022-05-19T01:13:22Z

Code is rebased, ready for review and rebase-merge. One can review commit-by-commit or everything at once.

I think @antons-antons' concerns are valid but are better addressed in a separate PR. I prefer to focus this one on adding a test framework and fixing the issue I already know how to reproduce. After all, the find_end_of_wal_segment function will probably need some re-thinking to handle all these cases, support trailing multi-segment records, and avoid failing because of invalid data like in #932 (incorrect length caused integer overflow and panic).

yeputons · 2022-05-19T11:07:48Z

libs/utils/src/lib.rs

+macro_rules! const_assert {
+    ($($args:tt)*) => {
+        const _: () = assert!($($args)*);
+    };
+}


There is a popular static_assertions crate, but it hasn't been updated for a while.
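A small usage sketch of the macro follows. The constants and their values are assumptions chosen for illustration (a 24-byte record header with a 4-byte CRC at offset 20), and `crc_field_range` is a hypothetical helper; a failing `const_assert!` stops compilation instead of panicking at run time.

```rust
macro_rules! const_assert {
    ($($args:tt)*) => {
        const _: () = assert!($($args)*);
    };
}

// Illustrative values, assumed for this sketch.
const XLOG_SIZE_OF_XLOG_RECORD: usize = 24;
const XLOG_RECORD_CRC_OFFS: usize = 20;

// Checked at compile time: xl_crc is the last 4 bytes of the header.
const_assert!(XLOG_RECORD_CRC_OFFS + 4 == XLOG_SIZE_OF_XLOG_RECORD);

/// Byte range of the CRC field inside the record header.
fn crc_field_range() -> std::ops::Range<usize> {
    XLOG_RECORD_CRC_OFFS..XLOG_SIZE_OF_XLOG_RECORD
}
```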

yeputons · 2022-05-20T20:54:48Z

Wow, it's a real flaky failure introduced (or reproduced?) by this PR in test_restarts_frequent_checkpoints! Investigating. Does not reproduce locally. Logs of a single Safekeeper:

2022-05-20T17:11:33.180580Z ERROR {tid=8}: query handler for 'START_WAL_PUSH postgresql://no_user:@localhost:18106' failed: failed to restore shared state

Caused by:
    Condition failed: `rec_offs + n <= XLOG_RECORD_CRC_OFFS` (23 vs 20)

Stack backtrace:
   0: anyhow::ensure::render
             at /home/circleci/.cargo/registry/src/github.com-1ecc6299db9ec823/anyhow-1.0.53/src/ensure.rs:97:20
   1: <(A,B) as anyhow::ensure::BothDebug>::__dispatch_ensure
             at /home/circleci/.cargo/registry/src/github.com-1ecc6299db9ec823/anyhow-1.0.53/src/ensure.rs:20:9
      postgres_ffi::xlog_utils::find_end_of_wal_segment
             at /home/circleci/project/libs/postgres_ffi/src/xlog_utils.rs:331:17
      postgres_ffi::xlog_utils::find_end_of_wal
             at /home/circleci/project/libs/postgres_ffi/src/xlog_utils.rs:439:25
   2: <safekeeper::wal_storage::PhysicalStorage as safekeeper::wal_storage::Storage>::init_storage
             at /home/circleci/project/safekeeper/src/wal_storage.rs:329:17
   3: safekeeper::safekeeper::SafeKeeper<CTRL,WAL>::new
             at /home/circleci/project/safekeeper/src/safekeeper.rs:554:9
   4: safekeeper::timeline::SharedState::restore
             at /home/circleci/project/safekeeper/src/timeline.rs:130:17
      safekeeper::timeline::GlobalTimelines::get
             at /home/circleci/project/safekeeper/src/timeline.rs:598:21
   5: <core::option::Option<alloc::sync::Arc<safekeeper::timeline::Timeline>> as safekeeper::timeline::TimelineTools>::set
             at /home/circleci/project/safekeeper/src/timeline.rs:513:22
      <safekeeper::handler::SafekeeperPostgresHandler as utils::postgres_backend::Handler>::process_query
             at /home/circleci/project/safekeeper/src/handler.rs:108:13
   6: utils::postgres_backend::PostgresBackend::process_message
             at /home/circleci/project/libs/utils/src/postgres_backend.rs:431:33
   7: utils::postgres_backend::PostgresBackend::run_message_loop
             at /home/circleci/project/libs/utils/src/postgres_backend.rs:283:31
      utils::postgres_backend::PostgresBackend::run
             at /home/circleci/project/libs/utils/src/postgres_backend.rs:265:19
   8: safekeeper::wal_service::handle_socket
             at /home/circleci/project/safekeeper/src/wal_service.rs:55:5
   9: safekeeper::wal_service::thread_main::{{closure}}
             at /home/circleci/project/safekeeper/src/wal_service.rs:26:43
      std::sys_common::backtrace::__rust_begin_short_backtrace
             at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/std/src/sys_common/backtrace.rs:123:18
  10: std::thread::Builder::spawn_unchecked::{{closure}}::{{closure}}
             at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/std/src/thread/mod.rs:484:17
      <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once
             at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/core/src/panic/unwind_safe.rs:271:9
      std::panicking::try::do_call
             at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/std/src/panicking.rs:406:40
      std::panicking::try
             at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/std/src/panicking.rs:370:19
      std::panic::catch_unwind
             at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/std/src/panic.rs:133:14
      std::thread::Builder::spawn_unchecked::{{closure}}
             at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/std/src/thread/mod.rs:483:30
      core::ops::function::FnOnce::call_once{{vtable.shim}}
             at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/core/src/ops/function.rs:227:5
  11: <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once
             at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/alloc/src/boxed.rs:1694:9
      <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once
             at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/alloc/src/boxed.rs:1694:9
      std::sys::unix::thread::Thread::new::thread_start
             at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/std/src/sys/unix/thread.rs:106:17
  12: start_thread
  13: clone
2022-05-20T17:11:33.444770Z  INFO Got SIGQUIT. Terminating in immediate shutdown mode

yeputons · 2022-05-20T23:51:55Z

Gosh, I love programming. No sarcasm.

- Unable to reproduce locally.
- Re-running the test in CI made it pass, so it's flaky, and I was fortunate that it failed the first time. Another test failed, though: test_replace_safekeeper, and for the same reason. After another re-run the latter just hung.
- Compiling with --release and running the test multiple times got me a local reproduction.

So, the corner case: if a record gets split between two segments, find_end_of_wal_segment does not care and still reads the second half of the record as if it were a correct record with a header, CRC, and so on; it just does not check it. However, that behavior changed in my last commit: I ensured that the record never stops inside the xl_crc field. That would be correct if we were reading a valid record, but we are reading the tail of a record. Hence the failure.

Nice coincidence, IMHO:

- A record got split between two pages; and
- Its second half has a length of exactly 21, 22, or 23 bytes; and
- It got caught by a completely unrelated test in our CI, which actually provided useful logs suggesting that this PR is the reason.

Quick fix (amended the last commit): check less when skipping the first contrecord. A probably cleaner one would be to clearly specify the current state of find_end_of_wal_segment (e.g. "skipping the first contrecord", "reading the header", "reading the record body"), but I'd leave it for a separate PR, see #544 (comment)

arssher

IOW, rec_offs is unknown in case of record split between segments.
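The failure and the quick fix can be sketched as follows. This is not the code from the PR: `check_chunk_end` is a hypothetical stand-in for the failing condition `rec_offs + n <= XLOG_RECORD_CRC_OFFS` from the log, with the header CRC assumed to occupy bytes 20..24.

```rust
/// Offset of `xl_crc` in the record header (the "20" in "23 vs 20").
const XLOG_RECORD_CRC_OFFS: usize = 20;
/// Size of the fixed record header, assuming a 4-byte CRC at its end.
const XLOG_SIZE_OF_XLOG_RECORD: usize = 24;

/// Hypothetical sketch of the check: a chunk of `n` bytes starting at
/// `rec_offs` within a record must not end inside the xl_crc field.
/// While skipping the first contrecord only the tail of a record is visible,
/// so `rec_offs` is meaningless and the check is skipped.
fn check_chunk_end(rec_offs: usize, n: usize, skipping_first_contrecord: bool) -> Result<(), String> {
    if skipping_first_contrecord {
        return Ok(());
    }
    let end = rec_offs + n;
    if end > XLOG_RECORD_CRC_OFFS && end < XLOG_SIZE_OF_XLOG_RECORD {
        return Err(format!(
            "chunk ends inside xl_crc ({} vs {})",
            end, XLOG_RECORD_CRC_OFFS
        ));
    }
    Ok(())
}
```

A 23-byte tail treated as a fresh record (`rec_offs == 0`) trips the check, which matches the CI failure; with the skipping flag set it passes.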

yeputons requested review from knizhnik and arssher April 25, 2022 17:49

knizhnik approved these changes Apr 25, 2022

antons-antons reviewed Apr 28, 2022

arssher requested changes Apr 29, 2022

yeputons mentioned this pull request May 10, 2022

Add test_end_of_wal_multiple_segments #1255

Closed

yeputons mentioned this pull request May 10, 2022

zenith_test_utils extension: add neon_xlogflush() neondatabase/postgres#160

Merged

yeputons force-pushed the find-end-of-wal-test branch from 37a7366 to 5eb43b5 Compare May 18, 2022 19:13

yeputons force-pushed the find-end-of-wal-test branch from 8e5562b to 51b34d6 Compare May 19, 2022 00:24

yeputons and others added 6 commits May 19, 2022 03:31

postgres_ffi: find_end_of_wal_segment: improve name of wal_crc variable

9fea880

Now it reflects the field it's mirroring.

utils: add const_assert! macro

29c804f

postgres_ffi: find_end_of_wal_segment: add lots of trace

0898d5c

postgres_ffi: find_end_of_wal_segment: fix contrecord skipping

2c4a86e

Also enable corresponding test.

yeputons force-pushed the find-end-of-wal-test branch from 51b34d6 to cd87f85 Compare May 19, 2022 00:31

yeputons requested a review from arssher May 19, 2022 01:13

yeputons mentioned this pull request May 20, 2022

deal with wal records bigger than wal_seg_size #544

Closed

arssher approved these changes May 20, 2022

yeputons force-pushed the find-end-of-wal-test branch from cd87f85 to 3400f1b Compare May 20, 2022 16:56

postgres_ffi: find_end_of_wal_segment: clarify code around xl_crc ret…

aeac127

…rieval It would be better to not update xl_crc/rec_hdr at all when skipping contrecord, but I would prefer to keep PR #1574 small. Better audit of `find_end_of_wal_segment` is coming anyway in #544.

yeputons force-pushed the find-end-of-wal-test branch from 3400f1b to aeac127 Compare May 20, 2022 23:49

yeputons requested a review from arssher May 21, 2022 00:46

arssher approved these changes May 21, 2022

yeputons merged commit 73187bf into main May 21, 2022

yeputons deleted the find-end-of-wal-test branch May 21, 2022 03:25
