Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Test find_end_of_wal_segment and fix its contrecord handling #1574

Merged
merged 7 commits into from
May 21, 2022

Conversation

yeputons
Copy link
Contributor

Related to #544: fixes some troubles with handling WAL records that cross a segment boundary.

This PR starts fixing them:

  • Add a separate library/binary for carefully crafting weird WALs using a Postgres server. Right now it can craft three kinds of WAL: simple, one ending in a record that crosses segment boundary, and a similar one with a small message appended.
  • Add lots of tracing to find_end_of_wal_segment. It's called on Safekeeper startup only, so should not yield a big performance penalty.
  • Fix find_end_of_wal_segment's bug: it worked incorrectly when the first record in the last segment is a contrecord. The old logic was to crash, new logic is to skip this record and find at least one full valid record in the segment.

I think it's better to get the barebones in first, but there are some todos left:

  • Handle corner cases in find_end_of_wal_segment marked as TODO in this PR. I believe they never happen in our tests.
  • Backtrack if there is no full valid record in the last segment file. This test is currently #[ignore]d.

@yeputons yeputons requested review from knizhnik and arssher April 25, 2022 17:49
Copy link
Contributor

@antons-antons antons-antons left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was looking at #932 and had a similar change to yours, but you have beaten me to the CR and have tests, wanna link the issue with this PR?

Also, what about the case when there're no Log Records in the segment? it's certainly a corner case but nonetheless possible (e.g. with xact_commit, or with logical)

Comment on lines 156 to +167
let mut rec_hdr = [0u8; XLOG_RECORD_CRC_OFFS];

trace!("find_end_of_wal_segment(data_dir={}, segno={}, tli={}, wal_seg_size={}, start_offset=0x{:x})", data_dir.display(), segno, tli, wal_seg_size, start_offset);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we validate here that

  1. start_offset is < wal_seg_size?
  2. wal_seg_size is a power of 2 and between 1Mb and 1Gb?

Comment on lines 192 to 219
offs += XLOG_SIZE_OF_XLOG_LONG_PHD;
if (xlp_info & XLP_FIRST_IS_CONTRECORD) != 0 {
offs += ((xlp_rem_len + 7) & !7) as usize;
trace!(" first record is contrecord");
skipping_first_contrecord = true;
contlen = xlp_rem_len as usize;
// we will immediately skip to start_offset, so adjust contlen
if offs < start_offset {
assert!(start_offset < XLOG_BLCKSZ);
// TODO: test this case
if offs + contlen <= start_offset {
contlen = 0;
// keep skipping_first_contrecord even if contlen == 0, the flag will become false on the next iteration
} else {
warn!("trying to find_end_of_wal with start_offset ({}) in the middle of the first contrecord (xlp_rem_len={})", start_offset, xlp_rem_len);
contlen -= start_offset - offs;
}
}
} else {
trace!(" first record is not contrecord");
}
} else {
offs += XLOG_SIZE_OF_XLOG_SHORT_PHD;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we're handling the first page in the segment with XLP_FIRST_IS_CONTRECORD while it's possible that a page in the middle of the segment with this flag, should we handle it?

Comment on lines 219 to 225
} else if contlen == 0 {
let page_offs = offs % XLOG_BLCKSZ;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find this function has unnecessary nesting, do you think we can simplify this?

// beginning of the next record
} else if contlen == 0 {
let page_offs = offs % XLOG_BLCKSZ;
let xl_tot_len = LittleEndian::read_u32(&buf[page_offs..page_offs + 4]) as usize;
trace!("offs=0x{:x}: new record, xl_tot_len={}", offs, xl_tot_len);
if xl_tot_len == 0 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should handle the case of xl_tot_len < XLOG_SIZE_OF_XLOG_RECORD and at least warn about this.

if skipping_first_contrecord {
// do nothing, the flag will go down on next iteration
trace!(" first conrecord has been just completed");
} else if crc == xl_crc {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In addition to the crc validation we can check if xl_prev points to last_valid_rec_pos (with the only exception of the first log record received by SK)

yeputons added a commit to neondatabase/postgres that referenced this pull request May 10, 2022
This function is to simplify complex WAL generation in neondatabase/neon#1574
`pg_logical_emit_message` is the easiest way to get a big WAL record, but:

* If it's transactional, it gets `COMMIT` record right after
* If it's not, WAL is not flushed at all. The function helps here.
yeputons added a commit to neondatabase/postgres that referenced this pull request May 10, 2022
This function is to simplify complex WAL generation in neondatabase/neon#1574

`pg_logical_emit_message` is the easiest way to get a big WAL record, but:
* If it's transactional, it gets `COMMIT` record right after
* If it's not, WAL is not flushed at all. The function helps here, so we
  don't rely on the background WAL writer.
yeputons added a commit to neondatabase/postgres that referenced this pull request May 10, 2022
This function is to simplify complex WAL generation in neondatabase/neon#1574

`pg_logical_emit_message` is the easiest way to get a big WAL record, but:
* If it's transactional, it gets `COMMIT` record right after
* If it's not, WAL is not flushed at all. The function helps here, so we
  don't rely on the background WAL writer.

I suspect plain `xlogflush()` name may collide in the future, hence the prefix.
yeputons added a commit to neondatabase/postgres that referenced this pull request May 10, 2022
This function is to simplify complex WAL generation in neondatabase/neon#1574

`pg_logical_emit_message` is the easiest way to get a big WAL record, but:
* If it's transactional, it gets `COMMIT` record right after
* If it's not, WAL is not flushed at all. The function helps here, so we
  don't rely on the background WAL writer.

I suspect the plain `xlogflush()` name may collide in the future, hence the prefix.
yeputons added a commit to neondatabase/postgres that referenced this pull request May 12, 2022
This function is to simplify complex WAL generation in neondatabase/neon#1574

`pg_logical_emit_message` is the easiest way to get a big WAL record, but:
* If it's transactional, it gets `COMMIT` record right after
* If it's not, WAL is not flushed at all. The function helps here, so we
  don't rely on the background WAL writer.

I suspect the plain `xlogflush()` name may collide in the future, hence the prefix.
@yeputons
Copy link
Contributor Author

Removing complex WAL generation logic from 891d052 in favor of a simpler strategy: emit necessary logical message, flush it to the disk via neon_xlogflush from neondatabase/postgres#160, kill the server. No COMMIT record.

Disabling autovacuum just in case on Anton's suggestion. Maybe some other stuff needs to be disabled as well.

@yeputons yeputons force-pushed the find-end-of-wal-test branch from 37a7366 to 5eb43b5 Compare May 18, 2022 19:13
@yeputons
Copy link
Contributor Author

Will rebase 8e5562bf2c2f419482bd75cfda995f03568823db onto main and squash all commits, so it's ready to merge after review.

@yeputons yeputons force-pushed the find-end-of-wal-test branch from 8e5562b to 51b34d6 Compare May 19, 2022 00:24
yeputons and others added 6 commits May 19, 2022 03:31
* Actual generation logic is in a separate crate `postgres_ffi/wal_generate`
* The create also provides a binary for debug purposes akin to `initdb`
* Two tests currently fail and are ignored
* There is no easy way to test this directly in Safekeeper as it starts restoring from commit_lsn.
  So testing would require disconnecting Safekeeper just after it has received the WAL,
  but before it is committed.
Previous invariant: `crc` contains an "unfinalized" CRC32 value,
its one complement, like in postgres before FIN_CRC32C.

New invariant: `crc` always contains a "finalized" CRC32 value,
this is the semantics of crc32c_append, so we don't need to invert CRC manually.
@yeputons yeputons force-pushed the find-end-of-wal-test branch from 51b34d6 to cd87f85 Compare May 19, 2022 00:31
@yeputons
Copy link
Contributor Author

Code is rebased, ready for review and rebase-merge. One can review commit-by-commit or everything at once.

I think @antons-antons' concerns are valid but are better addressed in a separate PR. I prefer to focus this one on adding a test framework and fixing the issue I already know how to reproduce. After all, the find_end_of_wal_segment function will probably need some re-thinking to handle all these cases and support trailing multi-segment records and avoid failing because of invalid data like in #932 (incorrect length caused integer overflow and panic).

@yeputons yeputons requested a review from arssher May 19, 2022 01:13
Comment on lines +101 to +105
macro_rules! const_assert {
($($args:tt)*) => {
const _: () = assert!($($args)*);
};
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a popular static_assertions crate, but it wasn't updated for a while.

@yeputons yeputons force-pushed the find-end-of-wal-test branch from cd87f85 to 3400f1b Compare May 20, 2022 16:56
@yeputons
Copy link
Contributor Author

yeputons commented May 20, 2022

Wow, it's a real flaky failure introduced (or reproduced?) by this PR in test_restarts_frequent_checkpoints! Investigating. Does not reproduce locally. Logs of a single Safekeeper:

2022-05-20T17:11:33.180580Z ERROR {tid=8}: query handler for 'START_WAL_PUSH postgresql://no_user:@localhost:18106' failed: failed to restore shared state

Caused by:
    Condition failed: `rec_offs + n <= XLOG_RECORD_CRC_OFFS` (23 vs 20)

Stack backtrace:
   0: anyhow::ensure::render
             at /home/circleci/.cargo/registry/src/github.com-1ecc6299db9ec823/anyhow-1.0.53/src/ensure.rs:97:20
   1: <(A,B) as anyhow::ensure::BothDebug>::__dispatch_ensure
             at /home/circleci/.cargo/registry/src/github.com-1ecc6299db9ec823/anyhow-1.0.53/src/ensure.rs:20:9
      postgres_ffi::xlog_utils::find_end_of_wal_segment
             at /home/circleci/project/libs/postgres_ffi/src/xlog_utils.rs:331:17
      postgres_ffi::xlog_utils::find_end_of_wal
             at /home/circleci/project/libs/postgres_ffi/src/xlog_utils.rs:439:25
   2: <safekeeper::wal_storage::PhysicalStorage as safekeeper::wal_storage::Storage>::init_storage
             at /home/circleci/project/safekeeper/src/wal_storage.rs:329:17
   3: safekeeper::safekeeper::SafeKeeper<CTRL,WAL>::new
             at /home/circleci/project/safekeeper/src/safekeeper.rs:554:9
   4: safekeeper::timeline::SharedState::restore
             at /home/circleci/project/safekeeper/src/timeline.rs:130:17
      safekeeper::timeline::GlobalTimelines::get
             at /home/circleci/project/safekeeper/src/timeline.rs:598:21
   5: <core::option::Option<alloc::sync::Arc<safekeeper::timeline::Timeline>> as safekeeper::timeline::TimelineTools>::set
             at /home/circleci/project/safekeeper/src/timeline.rs:513:22
      <safekeeper::handler::SafekeeperPostgresHandler as utils::postgres_backend::Handler>::process_query
             at /home/circleci/project/safekeeper/src/handler.rs:108:13
   6: utils::postgres_backend::PostgresBackend::process_message
             at /home/circleci/project/libs/utils/src/postgres_backend.rs:431:33
   7: utils::postgres_backend::PostgresBackend::run_message_loop
             at /home/circleci/project/libs/utils/src/postgres_backend.rs:283:31
      utils::postgres_backend::PostgresBackend::run
             at /home/circleci/project/libs/utils/src/postgres_backend.rs:265:19
   8: safekeeper::wal_service::handle_socket
             at /home/circleci/project/safekeeper/src/wal_service.rs:55:5
   9: safekeeper::wal_service::thread_main::{{closure}}
             at /home/circleci/project/safekeeper/src/wal_service.rs:26:43
      std::sys_common::backtrace::__rust_begin_short_backtrace
             at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/std/src/sys_common/backtrace.rs:123:18
  10: std::thread::Builder::spawn_unchecked::{{closure}}::{{closure}}
             at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/std/src/thread/mod.rs:484:17
      <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once
             at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/core/src/panic/unwind_safe.rs:271:9
      std::panicking::try::do_call
             at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/std/src/panicking.rs:406:40
      std::panicking::try
             at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/std/src/panicking.rs:370:19
      std::panic::catch_unwind
             at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/std/src/panic.rs:133:14
      std::thread::Builder::spawn_unchecked::{{closure}}
             at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/std/src/thread/mod.rs:483:30
      core::ops::function::FnOnce::call_once{{vtable.shim}}
             at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/core/src/ops/function.rs:227:5
  11: <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once
             at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/alloc/src/boxed.rs:1694:9
      <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once
             at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/alloc/src/boxed.rs:1694:9
      std::sys::unix::thread::Thread::new::thread_start
             at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/std/src/sys/unix/thread.rs:106:17
  12: start_thread
  13: clone
2022-05-20T17:11:33.444770Z  INFO Got SIGQUIT. Terminating in immediate shutdown mode

…rieval

It would be better to not update xl_crc/rec_hdr at all when skipping contrecord,
but I would prefer to keep PR #1574 small.
Better audit of `find_end_of_wal_segment` is coming anyway in #544.
@yeputons yeputons force-pushed the find-end-of-wal-test branch from 3400f1b to aeac127 Compare May 20, 2022 23:49
@yeputons
Copy link
Contributor Author

yeputons commented May 20, 2022

Gosh, I love programming. No sarcasm.

  1. Unable to reproduce locally.
  2. Re-running the test in CI made it pass, so it's flaky, and I got fortunate it failed the first time. Another test failed, though: test_replace_safekeeper, and for the same reason. After another re-run the latter just hang.
  3. Compiling with --release and running the test multiple times got me the local reproduction.

So, the corner case: if a record gets split between two segments, find_end_of_wal_segment does not care and still reads the second half of the record as if it was a correct record with a header, CRC, and stuff, but just does not check it. However, that behavior changed in my last commit: I ensured that the record never stops inside the xl_crc field. That would be correct for if we were reading a valid record, but we are reading a tail of a record. Hence the failure.

Nice coincidence, IMHO:

  1. A record got split between two pages; and
  2. Its second half has a length of either 21, 22, or 23 bytes exactly; and
  3. It got caught by a completely unrelated test in or CI which actually provided useful logs suggesting that this PR is the reason.

Quick fix (amended the last commit): check less when skipping the first contrecord. A probably cleaner one would to clearly specify the current state of find_end_of_wal_segment (e.g. "skipping the first cont record", "reading the header", "reading the record body"), but I'd leave it for a separate PR, see #544 (comment)

@yeputons yeputons requested a review from arssher May 21, 2022 00:46
Copy link
Contributor

@arssher arssher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IOW, rec_offs is unknown in case of record split between segments.

@yeputons yeputons merged commit 73187bf into main May 21, 2022
@yeputons yeputons deleted the find-end-of-wal-test branch May 21, 2022 03:25
MMeent pushed a commit to neondatabase/postgres that referenced this pull request Jul 7, 2022
This function is to simplify complex WAL generation in neondatabase/neon#1574

`pg_logical_emit_message` is the easiest way to get a big WAL record, but:
* If it's transactional, it gets `COMMIT` record right after
* If it's not, WAL is not flushed at all. The function helps here, so we
  don't rely on the background WAL writer.

I suspect the plain `xlogflush()` name may collide in the future, hence the prefix.
MMeent pushed a commit to neondatabase/postgres that referenced this pull request Aug 18, 2022
This function is to simplify complex WAL generation in neondatabase/neon#1574

`pg_logical_emit_message` is the easiest way to get a big WAL record, but:
* If it's transactional, it gets `COMMIT` record right after
* If it's not, WAL is not flushed at all. The function helps here, so we
  don't rely on the background WAL writer.

I suspect the plain `xlogflush()` name may collide in the future, hence the prefix.
lubennikovaav pushed a commit to neondatabase/postgres that referenced this pull request Nov 21, 2022
This function is to simplify complex WAL generation in neondatabase/neon#1574

`pg_logical_emit_message` is the easiest way to get a big WAL record, but:
* If it's transactional, it gets `COMMIT` record right after
* If it's not, WAL is not flushed at all. The function helps here, so we
  don't rely on the background WAL writer.

I suspect the plain `xlogflush()` name may collide in the future, hence the prefix.
MMeent pushed a commit to neondatabase/postgres that referenced this pull request Feb 10, 2023
This function is to simplify complex WAL generation in neondatabase/neon#1574

`pg_logical_emit_message` is the easiest way to get a big WAL record, but:
* If it's transactional, it gets `COMMIT` record right after
* If it's not, WAL is not flushed at all. The function helps here, so we
  don't rely on the background WAL writer.

I suspect the plain `xlogflush()` name may collide in the future, hence the prefix.
MMeent pushed a commit to neondatabase/postgres that referenced this pull request Feb 10, 2023
This function is to simplify complex WAL generation in neondatabase/neon#1574

`pg_logical_emit_message` is the easiest way to get a big WAL record, but:
* If it's transactional, it gets `COMMIT` record right after
* If it's not, WAL is not flushed at all. The function helps here, so we
  don't rely on the background WAL writer.

I suspect the plain `xlogflush()` name may collide in the future, hence the prefix.
MMeent pushed a commit to neondatabase/postgres that referenced this pull request May 11, 2023
This function is to simplify complex WAL generation in neondatabase/neon#1574

`pg_logical_emit_message` is the easiest way to get a big WAL record, but:
* If it's transactional, it gets `COMMIT` record right after
* If it's not, WAL is not flushed at all. The function helps here, so we
  don't rely on the background WAL writer.

I suspect the plain `xlogflush()` name may collide in the future, hence the prefix.
tristan957 pushed a commit to neondatabase/postgres that referenced this pull request Aug 10, 2023
This function is to simplify complex WAL generation in neondatabase/neon#1574

`pg_logical_emit_message` is the easiest way to get a big WAL record, but:
* If it's transactional, it gets `COMMIT` record right after
* If it's not, WAL is not flushed at all. The function helps here, so we
  don't rely on the background WAL writer.

I suspect the plain `xlogflush()` name may collide in the future, hence the prefix.
tristan957 pushed a commit to neondatabase/postgres that referenced this pull request Nov 8, 2023
This function is to simplify complex WAL generation in neondatabase/neon#1574

`pg_logical_emit_message` is the easiest way to get a big WAL record, but:
* If it's transactional, it gets `COMMIT` record right after
* If it's not, WAL is not flushed at all. The function helps here, so we
  don't rely on the background WAL writer.

I suspect the plain `xlogflush()` name may collide in the future, hence the prefix.
tristan957 pushed a commit to neondatabase/postgres that referenced this pull request Nov 8, 2023
This function is to simplify complex WAL generation in neondatabase/neon#1574

`pg_logical_emit_message` is the easiest way to get a big WAL record, but:
* If it's transactional, it gets `COMMIT` record right after
* If it's not, WAL is not flushed at all. The function helps here, so we
  don't rely on the background WAL writer.

I suspect the plain `xlogflush()` name may collide in the future, hence the prefix.
tristan957 pushed a commit to neondatabase/postgres that referenced this pull request Feb 5, 2024
This function is to simplify complex WAL generation in neondatabase/neon#1574

`pg_logical_emit_message` is the easiest way to get a big WAL record, but:
* If it's transactional, it gets `COMMIT` record right after
* If it's not, WAL is not flushed at all. The function helps here, so we
  don't rely on the background WAL writer.

I suspect the plain `xlogflush()` name may collide in the future, hence the prefix.
tristan957 pushed a commit to neondatabase/postgres that referenced this pull request Feb 5, 2024
This function is to simplify complex WAL generation in neondatabase/neon#1574

`pg_logical_emit_message` is the easiest way to get a big WAL record, but:
* If it's transactional, it gets `COMMIT` record right after
* If it's not, WAL is not flushed at all. The function helps here, so we
  don't rely on the background WAL writer.

I suspect the plain `xlogflush()` name may collide in the future, hence the prefix.
tristan957 pushed a commit to neondatabase/postgres that referenced this pull request Feb 6, 2024
This function is to simplify complex WAL generation in neondatabase/neon#1574

`pg_logical_emit_message` is the easiest way to get a big WAL record, but:
* If it's transactional, it gets `COMMIT` record right after
* If it's not, WAL is not flushed at all. The function helps here, so we
  don't rely on the background WAL writer.

I suspect the plain `xlogflush()` name may collide in the future, hence the prefix.
tristan957 pushed a commit to neondatabase/postgres that referenced this pull request May 10, 2024
This function is to simplify complex WAL generation in neondatabase/neon#1574

`pg_logical_emit_message` is the easiest way to get a big WAL record, but:
* If it's transactional, it gets `COMMIT` record right after
* If it's not, WAL is not flushed at all. The function helps here, so we
  don't rely on the background WAL writer.

I suspect the plain `xlogflush()` name may collide in the future, hence the prefix.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants