Test find_end_of_wal_segment and fix its contrecord handling #1574
I was looking at #932 and had a similar change to yours, but you beat me to the CR and you have tests. Want to link the issue to this PR?
Also, what about the case when there are no log records in the segment? It's certainly a corner case, but nonetheless possible (e.g. with xact_commit, or with logical)
let mut rec_hdr = [0u8; XLOG_RECORD_CRC_OFFS];

trace!("find_end_of_wal_segment(data_dir={}, segno={}, tli={}, wal_seg_size={}, start_offset=0x{:x})", data_dir.display(), segno, tli, wal_seg_size, start_offset);
Should we validate here that
- start_offset is < wal_seg_size?
- wal_seg_size is a power of 2 and between 1 MB and 1 GB?
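A minimal sketch of what such a validation could look like. The helper name `validate_wal_args` and the two bound constants are hypothetical and not part of the PR; the bounds follow the 1 MB to 1 GB range mentioned above.

```rust
/// Bounds used by this sketch (1 MB and 1 GB); the constant names are hypothetical.
const WAL_SEG_MIN_SIZE: usize = 1024 * 1024;
const WAL_SEG_MAX_SIZE: usize = 1024 * 1024 * 1024;

/// Hypothetical helper, not part of the PR: rejects argument combinations
/// that cannot describe a position inside a valid WAL segment.
fn validate_wal_args(start_offset: usize, wal_seg_size: usize) -> Result<(), String> {
    if !wal_seg_size.is_power_of_two() {
        return Err(format!("wal_seg_size {} is not a power of 2", wal_seg_size));
    }
    if !(WAL_SEG_MIN_SIZE..=WAL_SEG_MAX_SIZE).contains(&wal_seg_size) {
        return Err(format!("wal_seg_size {} is outside 1 MB..=1 GB", wal_seg_size));
    }
    if start_offset >= wal_seg_size {
        return Err(format!(
            "start_offset 0x{:x} is not inside a segment of size {}",
            start_offset, wal_seg_size
        ));
    }
    Ok(())
}

fn main() {
    // 16 MB segment with an offset just past a page header: accepted.
    println!("{:?}", validate_wal_args(0x28, 16 * 1024 * 1024));
    // Offset equal to the segment size: rejected.
    println!("{:?}", validate_wal_args(16 * 1024 * 1024, 16 * 1024 * 1024));
}
```

Returning a `Result` rather than asserting keeps the decision (crash or report) with the caller.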
libs/postgres_ffi/src/xlog_utils.rs
Outdated
    offs += XLOG_SIZE_OF_XLOG_LONG_PHD;
    if (xlp_info & XLP_FIRST_IS_CONTRECORD) != 0 {
        offs += ((xlp_rem_len + 7) & !7) as usize;
        trace!(" first record is contrecord");
        skipping_first_contrecord = true;
        contlen = xlp_rem_len as usize;
        // we will immediately skip to start_offset, so adjust contlen
        if offs < start_offset {
            assert!(start_offset < XLOG_BLCKSZ);
            // TODO: test this case
            if offs + contlen <= start_offset {
                contlen = 0;
                // keep skipping_first_contrecord even if contlen == 0, the flag will become false on the next iteration
            } else {
                warn!("trying to find_end_of_wal with start_offset ({}) in the middle of the first contrecord (xlp_rem_len={})", start_offset, xlp_rem_len);
                contlen -= start_offset - offs;
            }
        }
    } else {
        trace!(" first record is not contrecord");
    }
} else {
    offs += XLOG_SIZE_OF_XLOG_SHORT_PHD;
}
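To make the arithmetic in this hunk easier to follow, here is a small standalone sketch of its two ingredients: the 8-byte alignment `(len + 7) & !7`, and the question of how many contrecord bytes remain once we jump to `start_offset`. The helper names are hypothetical, and the value 40 for the long page header size is an assumption used only for illustration. This is one way to reason about the case, not the PR's code.

```rust
/// Long page header size; 40 bytes is assumed here for illustration only.
const XLOG_SIZE_OF_XLOG_LONG_PHD: usize = 40;

/// Round `len` up to the next multiple of 8, same as `(len + 7) & !7` in the hunk.
fn max_align8(len: usize) -> usize {
    (len + 7) & !7
}

/// Hypothetical helper: the contrecord payload occupies
/// `[data_start, data_start + rem_len)`. Returns how many of those bytes
/// still lie at or after `start_offset`.
fn contrecord_bytes_left(data_start: usize, rem_len: usize, start_offset: usize) -> usize {
    let data_end = data_start + rem_len;
    if start_offset <= data_start {
        rem_len // nothing has been skipped yet
    } else if start_offset >= data_end {
        0 // start_offset is already past the contrecord
    } else {
        data_end - start_offset // start_offset falls inside the contrecord
    }
}

fn main() {
    let data_start = XLOG_SIZE_OF_XLOG_LONG_PHD;
    // start_offset inside a 100-byte contrecord that begins at offset 40.
    println!("bytes left: {}", contrecord_bytes_left(data_start, 100, 90));
    // Aligned end of that contrecord.
    println!("aligned end: {}", max_align8(data_start + 100));
}
```

Keeping the payload start separate from the already advanced `offs` makes each of the three cases explicit.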
We're handling XLP_FIRST_IS_CONTRECORD only on the first page of the segment, but a page in the middle of the segment can have this flag too. Should we handle it?
} else if contlen == 0 {
    let page_offs = offs % XLOG_BLCKSZ;
I find this function has unnecessary nesting. Do you think we can simplify it?
    // beginning of the next record
} else if contlen == 0 {
    let page_offs = offs % XLOG_BLCKSZ;
    let xl_tot_len = LittleEndian::read_u32(&buf[page_offs..page_offs + 4]) as usize;
    trace!("offs=0x{:x}: new record, xl_tot_len={}", offs, xl_tot_len);
    if xl_tot_len == 0 {
I think we should handle the case of xl_tot_len < XLOG_SIZE_OF_XLOG_RECORD and at least warn about this.
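A rough sketch of the three-way check being suggested. The enum and function names are hypothetical, the 24-byte record header size is an assumption for illustration, and treating `xl_tot_len == 0` as the end of WAL is my reading of the hunk rather than something this comment states.

```rust
/// Fixed record header size; 24 bytes is assumed here for illustration only.
const XLOG_SIZE_OF_XLOG_RECORD: usize = 24;

/// Hypothetical classification of the length field read at a record start.
#[derive(Debug, PartialEq)]
enum TotLenCheck {
    /// Zero length: assumed to mean no more records follow.
    EndOfWal,
    /// Non-zero but shorter than the header: cannot be a real record, worth a warning.
    TooShort(usize),
    /// Long enough to contain a header; still needs CRC validation.
    Plausible(usize),
}

fn check_xl_tot_len(xl_tot_len: usize) -> TotLenCheck {
    if xl_tot_len == 0 {
        TotLenCheck::EndOfWal
    } else if xl_tot_len < XLOG_SIZE_OF_XLOG_RECORD {
        TotLenCheck::TooShort(xl_tot_len)
    } else {
        TotLenCheck::Plausible(xl_tot_len)
    }
}

fn main() {
    for len in [0usize, 7, 24, 8192] {
        println!("xl_tot_len={} -> {:?}", len, check_xl_tot_len(len));
    }
}
```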
if skipping_first_contrecord {
    // do nothing, the flag will go down on next iteration
    trace!(" first conrecord has been just completed");
} else if crc == xl_crc {
In addition to the CRC validation, we can check whether xl_prev points to last_valid_rec_pos (the only exception being the first log record received by SK).
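Something along these lines, as a sketch only. The helper name is hypothetical, and `last_valid_rec_pos` is modeled as an `Option` so that the first record, which has no known predecessor, skips the link check.

```rust
/// Hypothetical helper: a record is accepted only if its CRC matched and its
/// `xl_prev` points at the previous valid record. `last_valid_rec_pos` is `None`
/// for the first record seen, where the back link cannot be checked.
fn record_links_ok(crc_ok: bool, xl_prev: u64, last_valid_rec_pos: Option<u64>) -> bool {
    let prev_ok = match last_valid_rec_pos {
        Some(pos) => xl_prev == pos,
        None => true,
    };
    crc_ok && prev_ok
}

fn main() {
    // First record: only the CRC matters.
    println!("{}", record_links_ok(true, 0, None));
    // Later record whose back link does not match the previous record position.
    println!("{}", record_links_ok(true, 0x0100_0028, Some(0x0100_0060)));
}
```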
This function is to simplify complex WAL generation in neondatabase/neon#1574. `pg_logical_emit_message` is the easiest way to get a big WAL record, but:
* If it's transactional, it gets `COMMIT` record right after
* If it's not, WAL is not flushed at all.
The function helps here, so we don't rely on the background WAL writer. I suspect the plain `xlogflush()` name may collide in the future, hence the prefix.
Removing complex WAL generation logic from 891d052 in favor of a simpler strategy: emit necessary logical message, flush it to the disk via Disabling autovacuum just in case on Anton's suggestion. Maybe some other stuff needs to be disabled as well. |
37a7366 to 5eb43b5
Will rebase 8e5562bf2c2f419482bd75cfda995f03568823db onto |
8e5562b to 51b34d6
* Actual generation logic is in a separate crate `postgres_ffi/wal_generate`
* The crate also provides a binary for debug purposes akin to `initdb`
* Two tests currently fail and are ignored
* There is no easy way to test this directly in Safekeeper as it starts restoring from commit_lsn. So testing would require disconnecting Safekeeper just after it has received the WAL, but before it is committed.
Previous invariant: `crc` contains an "unfinalized" CRC32 value (the ones' complement of the final value), like in postgres before FIN_CRC32C. New invariant: `crc` always contains a "finalized" CRC32 value. These are the semantics of `crc32c_append`, so we don't need to invert the CRC manually.
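The difference between the two invariants can be illustrated with a small bitwise CRC-32C. This is a standalone sketch, not the crate used by the PR: `crc32c_append_sketch` is a hypothetical name, and it only mirrors the "takes a finalized CRC, returns a finalized CRC" convention described above.

```rust
/// Bitwise CRC-32C (Castagnoli, reflected polynomial 0x82F63B78).
/// Hypothetical stand-in for an append-style API: both the `crc` argument
/// and the return value are *finalized* CRCs.
fn crc32c_append_sketch(crc: u32, data: &[u8]) -> u32 {
    // Undo the finalization to recover the running ("unfinalized") state.
    let mut state = !crc;
    for &byte in data {
        state ^= byte as u32;
        for _ in 0..8 {
            // All ones if the low bit is set, zero otherwise.
            let mask = (state & 1).wrapping_neg();
            state = (state >> 1) ^ (0x82F6_3B78 & mask);
        }
    }
    // Finalize again before handing the value back.
    !state
}

fn main() {
    // One call over the whole input...
    let whole = crc32c_append_sketch(0, b"123456789");
    // ...matches chained calls, with no manual inversion in between.
    let first = crc32c_append_sketch(0, b"1234");
    let chained = crc32c_append_sketch(first, b"56789");
    assert_eq!(whole, chained);
    println!("{:#010x}", whole);
}
```

Because the inversion happens inside the function on both entry and exit, callers can compare the returned value to a stored `xl_crc` directly.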
Now it reflects the field it's mirroring.
Also enable corresponding test.
51b34d6 to cd87f85
Code is rebased, ready for review and rebase-merge. One can review commit-by-commit or everything at once. I think @antons-antons' concerns are valid but are better addressed in a separate PR. I prefer to focus this one on adding a test framework and fixing the issue I already know how to reproduce. After all, the |
macro_rules! const_assert {
    ($($args:tt)*) => {
        const _: () = assert!($($args)*);
    };
}
There is a popular `static_assertions` crate, but it hasn't been updated for a while.
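For reference, a sketch of how the macro above can be used at item level. The constants and their values are assumptions chosen for illustration, not taken from the PR, and a failing condition would stop compilation instead of panicking at run time.

```rust
macro_rules! const_assert {
    ($($args:tt)*) => {
        const _: () = assert!($($args)*);
    };
}

// Illustrative constants; the values are assumptions for this sketch.
const XLOG_BLCKSZ: usize = 8192;
const XLOG_SIZE_OF_XLOG_SHORT_PHD: usize = 24;

// Each line expands to an anonymous constant evaluated at compile time.
const_assert!(XLOG_BLCKSZ.is_power_of_two());
const_assert!(XLOG_SIZE_OF_XLOG_SHORT_PHD % 8 == 0);
const_assert!(XLOG_SIZE_OF_XLOG_SHORT_PHD < XLOG_BLCKSZ);

fn main() {
    println!("compile-time checks passed, XLOG_BLCKSZ = {}", XLOG_BLCKSZ);
}
```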
cd87f85 to 3400f1b
Wow, it's a real flaky failure introduced (or reproduced?) by this PR in
3400f1b to aeac127
Gosh, I love programming. No sarcasm.
So, the corner case: if a record gets split between two segments, Nice coincidence, IMHO:
Quick fix (amended the last commit): check less when skipping the first contrecord. A probably cleaner one would be to clearly specify the current state of
IOW, rec_offs is unknown in case of record split between segments.
Related to #544: fixes some troubles with handling WAL records that cross a segment boundary.
This PR starts fixing them:
* `find_end_of_wal` from commit lsn.
* `find_end_of_wal_segment`. It's called on Safekeeper startup only, so should not yield a big performance penalty.
* `find_end_of_wal_segment`'s bug: it worked incorrectly when the first record in the last segment is a contrecord. The old logic was to crash, new logic is to skip this record and find at least one full valid record in the segment.

I think it's better to get the barebones in first, but there are some todos left:
* `find_end_of_wal_segment` marked as `TODO` in this PR. I believe they never happen in our tests.
* `#[ignore]`d.