deal with wal records bigger than wal_seg_size #544

Closed
kelvich opened this issue Sep 6, 2021 · 6 comments · Fixed by #2262
Labels
c/storage/safekeeper Component: storage: safekeeper t/bug Issue Type: Bug

Comments

@kelvich
Contributor

kelvich commented Sep 6, 2021

No description provided.

@hlinnaka
Contributor

Got some context on this? Where do we not handle them correctly?

@kelvich
Contributor Author

kelvich commented Sep 17, 2021

That was about find_end_of_wal() in the safekeeper -- when it faces a continuation record it iterates backwards over the WAL to reconstruct the whole record and check its CRC, and it assumes the record starts in the immediately previous segment. IIUC there is code that fixes that behavior for cases where we need to look further back: https://github.com/zenithdb/zenith/commits/safe_flush_ptr. This issue is about extracting and testing that.
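
For illustration, a minimal, hypothetical sketch (not the safekeeper's actual code) of why stopping at the immediately previous segment is not enough: from the remaining length of a continuation record one can estimate how many segments back its header lies, and for a record bigger than wal_seg_size that is more than one segment.

```rust
// Hypothetical sketch, not the safekeeper's actual code: estimate how many
// segments back the header of a continuation record lies, given how many
// bytes of it still remain (xlp_rem_len) at the start of the current segment.
// A record bigger than wal_seg_size necessarily starts more than one segment
// back, so stopping at the previous segment is not enough.
const XLOG_BLCKSZ: u64 = 8192;
const XLOG_SIZE_OF_XLOG_SHORT_PHD: u64 = 24;

fn segments_to_look_back(xlp_rem_len: u64, wal_seg_size: u64) -> u64 {
    // Usable payload per page, ignoring the long header on the first page of
    // each segment (which only makes the estimate slightly optimistic).
    let payload_per_page = XLOG_BLCKSZ - XLOG_SIZE_OF_XLOG_SHORT_PHD;
    let payload_per_segment = (wal_seg_size / XLOG_BLCKSZ) * payload_per_page;
    // Round up: the record header may sit anywhere inside that segment.
    (xlp_rem_len + payload_per_segment - 1) / payload_per_segment
}

fn main() {
    // With 16 MiB segments, a record with ~40 MiB still remaining must have
    // started at least three segments back.
    assert_eq!(segments_to_look_back(40 * 1024 * 1024, 16 * 1024 * 1024), 3);
}
```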

@arssher arssher added the c/storage/safekeeper Component: storage: safekeeper label Sep 20, 2021
@stepashka stepashka added this to the Limited Preview milestone Dec 27, 2021
@stepashka stepashka added the t/bug Issue Type: Bug label Dec 27, 2021
@stepashka
Member

cc @petuhovskiy , you've mentioned that you're testing this already, but the test hadn't failed :)
can you share some more details? thanks!

@petuhovskiy
Member

cc @petuhovskiy , you've mentioned that you're testing this already, but the test hadn't failed :) can you share some more details? thanks!

4468797 is the commit with the test that wasn't failing for me at the time; one can look at it to create a failing test.

@yeputons
Contributor

yeputons commented Feb 5, 2022

I'm wondering if there is some bug in find_end_of_wal_segment's logic here? In my test, xlp_rem_len in adjacent pages of a partial WAL record differs by either 8152 bytes (8192 - 40 = XLOG_BLCKSZ - XLOG_SIZE_OF_XLOG_LONG_PHD) or 8168 bytes (8192 - 24 = XLOG_BLCKSZ - XLOG_SIZE_OF_XLOG_SHORT_PHD), not by exactly 8192. So it seems to me that xlp_rem_len does not account for page headers, while the code assumes it does and that it's safe to just offset inside the WAL file instead of re-reading all pages/blocks (not sure what the right term is here).

So my test does not actually need to make records bigger than wal_seg_size; bigger than XLOG_BLCKSZ is already enough.
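
A minimal sketch of that arithmetic (the constants match PostgreSQL; the helper itself is hypothetical): between two consecutive pages covered by one continuation record, xlp_rem_len shrinks by the payload capacity of the earlier page, i.e. XLOG_BLCKSZ minus that page's header size, never by XLOG_BLCKSZ itself.

```rust
// Minimal sketch of the arithmetic above; the helper is hypothetical, the
// constants match PostgreSQL. Between two consecutive pages covered by one
// continuation record, xlp_rem_len shrinks by the payload capacity of the
// earlier page: XLOG_BLCKSZ minus that page's header size.
const XLOG_BLCKSZ: u32 = 8192;
const XLOG_SIZE_OF_XLOG_SHORT_PHD: u32 = 24; // regular page header
const XLOG_SIZE_OF_XLOG_LONG_PHD: u32 = 40; // first page of a segment

fn rem_len_on_next_page(rem_len: u32, page_is_first_in_segment: bool) -> u32 {
    let header = if page_is_first_in_segment {
        XLOG_SIZE_OF_XLOG_LONG_PHD
    } else {
        XLOG_SIZE_OF_XLOG_SHORT_PHD
    };
    rem_len.saturating_sub(XLOG_BLCKSZ - header)
}

fn main() {
    // The deltas match the 8168 (= 8192 - 24) and 8152 (= 8192 - 40) observed above.
    assert_eq!(100_000 - rem_len_on_next_page(100_000, false), 8168);
    assert_eq!(100_000 - rem_len_on_next_page(100_000, true), 8152);
}
```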

Needs further investigation.

@yeputons
Contributor

yeputons commented May 20, 2022

Progress so far:

  • Add some corner-case unit tests for find_end_of_wal
  • Correctly skip contrecords in the last segment (see the sketch after this list)
  • Handle WAL ending in a big multi-segment record. That requires reading previous segments (only the first few bytes of each, though) if we want to validate this last record.
  • Address a bunch of review comments from "Test find_end_of_wal_segment and fix its contrecord handling" (#1574), including my own (one can even craft a test case for it); they may reveal more bugs and corner cases. Maybe reusing the pageserver's code is a good idea, unless it does lots of parsing that we want to avoid in the safekeeper.
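
For the contrecord-skipping item above, here is a rough, simplified sketch of what skipping a contrecord at the start of a segment means; the page-header struct is a stand-in for illustration, not the postgres_ffi type.

```rust
// Rough, simplified sketch; the page-header struct is a stand-in, not the
// postgres_ffi type. If the first page of a segment has XLP_FIRST_IS_CONTRECORD
// set, the first xlp_rem_len payload bytes (MAXALIGN-padded) belong to a record
// that started in an earlier segment, so the scan for new records must begin
// after them.
const XLP_FIRST_IS_CONTRECORD: u16 = 0x0001;

struct PageHeader {
    xlp_info: u16,
    xlp_rem_len: u32,
    header_size: u64, // 40 bytes (long header) on the first page of a segment
}

fn maxalign(v: u64) -> u64 {
    (v + 7) & !7
}

/// Offset within the segment where the first record *not* started in an
/// earlier segment can begin. Only handles the case where the continuation
/// tail fits on the first page; otherwise the headers of the following pages
/// must be skipped too, page by page.
fn first_new_record_offset(first_page: &PageHeader) -> u64 {
    if first_page.xlp_info & XLP_FIRST_IS_CONTRECORD != 0 {
        first_page.header_size + maxalign(first_page.xlp_rem_len as u64)
    } else {
        first_page.header_size
    }
}

fn main() {
    let page = PageHeader { xlp_info: XLP_FIRST_IS_CONTRECORD, xlp_rem_len: 1000, header_size: 40 };
    assert_eq!(first_new_record_offset(&page), 40 + 1000); // 1000 is already MAXALIGNed
}
```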

yeputons added a commit that referenced this issue May 20, 2022
…rieval

It would be better to not update xl_crc/rec_hdr at all when skipping contrecord,
but I would prefer to keep PR #1574 small.
Better audit of `find_end_of_wal_segment` is coming anyway in #544.
yeputons added a commit that referenced this issue May 21, 2022
…rieval

It would be better to not update xl_crc/rec_hdr at all when skipping contrecord,
but I would prefer to keep PR #1574 small.
Better audit of `find_end_of_wal_segment` is coming anyway in #544.
@stepashka stepashka modified the milestones: 2022/06, 2022/07 Jul 14, 2022
@stepashka stepashka modified the milestones: 2022/07, 2022/08 Aug 11, 2022
arssher added a commit that referenced this issue Aug 12, 2022
We could make it inside wal_storage.rs, but taking into account that
 - wal_storage.rs reading is async
 - we don't need s3 here
 - error handling is different; error during decoding is normal
I decided to put it separately.

Test
cargo test test_find_end_of_wal_last_crossing_segment
prepared earlier by @yeputons passes now.

Fixes #544
      neondatabase/cloud#2004
Supersedes #2066
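
The eventual fix (#2262) takes the approach named in the commit title further down the thread, "Find end of WAL on safekeepers using WalStreamDecoder": decode records one by one and take the position right after the last record that decodes cleanly. Below is a rough sketch of that loop against a deliberately minimal, hypothetical decoder interface, not the real WalStreamDecoder API.

```rust
// Rough sketch of "find end of WAL by decoding": the decoder interface here is
// deliberately minimal and hypothetical, not the real postgres_ffi
// WalStreamDecoder API. Positions are plain offsets into the supplied buffer;
// in the real code they would be LSNs.
struct DecodedRecord {
    end_pos: u64, // position right after this record
}

trait RecordDecoder {
    /// Try to decode the next record from `wal_tail`, starting at `pos`.
    /// Returns the decoded record, None if more bytes are needed, or an
    /// error for garbage / CRC mismatch (i.e. end of valid WAL).
    fn poll_decode(&mut self, wal_tail: &[u8], pos: u64) -> Result<Option<DecodedRecord>, String>;
}

fn find_end_of_wal<D: RecordDecoder>(decoder: &mut D, wal_tail: &[u8], start_pos: u64) -> u64 {
    let mut end = start_pos;
    loop {
        match decoder.poll_decode(wal_tail, end) {
            // A record decoded cleanly: the end of WAL moves past it.
            Ok(Some(rec)) => end = rec.end_pos,
            // Not enough bytes, or the bytes no longer form a valid record:
            // everything up to `end` is the valid WAL we keep.
            Ok(None) | Err(_) => return end,
        }
    }
}
```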
arssher added a commit that referenced this issue Aug 14, 2022
We could make it inside wal_storage.rs, but taking into account that
 - wal_storage.rs reading is async
 - we don't need s3 here
 - error handling is different; error during decoding is normal
I decided to put it separately.

Test
cargo test test_find_end_of_wal_last_crossing_segment
prepared earlier by @yeputons passes now.

Fixes #544
      neondatabase/cloud#2004
Supersedes #2066
petuhovskiy added a commit that referenced this issue Aug 15, 2022
* github/workflows: Fix git dubious ownership (#2223)

* Move relation size cache from WalIngest to DatadirTimeline (#2094)

* Move relation size cache to layered timeline

* Fix obtaining current LSN for relation size cache

* Resolve merge conflicts

* Resolve merge conflicts

* Restore 'lsn' field in DatadirModification

* adjust DatadirModification lsn in ingest_record

* Fix formatting

* Pass lsn to get_relsize

* Fix merge conflict

* Update pageserver/src/pgdatadir_mapping.rs

Co-authored-by: Heikki Linnakangas <[email protected]>

* Update pageserver/src/pgdatadir_mapping.rs

Co-authored-by: Heikki Linnakangas <[email protected]>

Co-authored-by: Heikki Linnakangas <[email protected]>

* refactor: replace lazy-static with once-cell (#2195)

- Replacing all the occurrences of lazy-static with `once-cell::sync::Lazy`
- fixes #1147

Signed-off-by: Ankur Srivastava <[email protected]>

* Add more buckets to pageserver latency metrics (#2225)

* ignore record property warning to fix benchmarks

* increase statement timeout

* use event so it fires only if workload thread successfully finished

* remove debug log

* increase timeout to pass test with real s3

* avoid duplicate parameter, increase timeout

* Major migration script (#2073)

This script can be used to migrate a tenant across breaking storage versions, or (in the future) to upgrade postgres versions. See the comment at the top for an overview.

Co-authored-by: Anastasia Lubennikova <[email protected]>

* Fix etcd typos

* Fix links to safekeeper protocol docs. (#2188)

safekeeper/README_PROTO.md was moved to docs/safekeeper-protocol.md in
commit 0b14fdb, as part of reorganizing the docs into 'mdbook' format.

Fixes issue #1475. Thanks to @banks for spotting the outdated references.

In addition to fixing the above issue, this patch also fixes other broken links as a result of 0b14fdb. See #2188 (review).

Co-authored-by: Heikki Linnakangas <[email protected]>
Co-authored-by: Thang Pham <[email protected]>

* Update CONTRIBUTING.md

* Update CONTRIBUTING.md

* support node id and remote storage params in docker_entrypoint.sh

* Safe truncate (#2218)

* Move relation size cache to layered timeline

* Fix obtaining current LSN for relation size cache

* Resolve merge conflicts

* Resolve merge conflicts

* Restore 'lsn' field in DatadirModification

* adjust DatadirModification lsn in ingest_record

* Fix formatting

* Pass lsn to get_relsize

* Fix merge conflict

* Update pageserver/src/pgdatadir_mapping.rs

Co-authored-by: Heikki Linnakangas <[email protected]>

* Update pageserver/src/pgdatadir_mapping.rs

Co-authored-by: Heikki Linnakangas <[email protected]>

* Check if relation exists before trying to truncate it

refers to #1932

* Add test reproducing FSM truncate problem

Co-authored-by: Heikki Linnakangas <[email protected]>

* Fix exponential backoff values

* Update `vendor/postgres` back; it was changed accidentally. (#2251)

Commit 4227cfc accidentally reverted vendor/postgres to an older
version. Update it back.

* Add pageserver checkpoint_timeout option.

To flush the in-memory layer eventually when no new data arrives, which helps
safekeepers suspend activity (stop pushing to the broker). The default of 10m
should be ok.

* Share exponential backoff code and fix logic for delete task failure (#2252)

* Fix bug when import large (>1GB) relations (#2172)

Resolves #2097 

- use timeline modification's `lsn` and timeline's `last_record_lsn` to determine the corresponding LSN to query data in `DatadirModification::get`
- update `test_import_from_pageserver`. Split the test into 2 variants: `small` and `multisegment`. 
  + `small` is the old test
  + `multisegment` is to simulate #2097 by using a larger number of inserted rows to create multiple segment files of a relation. `multisegment` is configured to only run with a `release` build

* Fix timeline physical size flaky tests (#2244)

Resolves #2212.

- use `wait_for_last_flush_lsn` in `test_timeline_physical_size_*` tests

## Context
We need to wait for the pageserver to catch up with the compute's last flush LSN because, during the timeline physical size API call, it's possible that there are running `LayerFlushThread` threads. These threads flush new layers to disk and hence update the physical size. This results in a mismatch between the physical size reported by the API and the actual physical size on disk.

### Note
The `LayerFlushThread` threads are processed **concurrently**, so it's possible that the above error still persists even with this patch. However, making the tests wait for all the WAL to be processed (not flushed) before calculating the physical size should reduce the "flakiness" significantly.

* postgres_ffi/waldecoder: validate more header fields

* postgres_ffi/waldecoder: remove unused startlsn

* postgres_ffi/waldecoder: introduce explicit `enum State`

Previously it was emulated with a combination of nullable fields.
This change should make the logic more readable.
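
As an illustration of the pattern this commit describes (the field and variant names below are made up, not the actual waldecoder code): state that was previously implied by which nullable fields happened to be set becomes a single explicit enum, so impossible combinations are no longer representable.

```rust
// Illustrative only; the names are made up, not the actual waldecoder code.
// Before: the decoder's phase was implied by which Option fields were Some,
// and inconsistent combinations were representable.
struct DecoderBefore {
    contlen: Option<u32>,            // Some(..) only while inside a record
    partial_record: Option<Vec<u8>>, // must be kept in sync with contlen by hand
}

// After: one explicit enum makes each phase, and the data valid in it, clear.
enum State {
    WaitingForRecord,
    ReassemblingRecord { partial_record: Vec<u8>, bytes_remaining: u32 },
    SkippingEverything { skip_until: u64 },
}

struct DecoderAfter {
    state: State,
}
```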

* disable `test_import_from_pageserver_multisegment` (#2258)

This test fails consistently on `main` now. It's better to temporarily disable it to avoid blocking others' PRs while investigating the root cause of the test failure.

See: #2255, #2256

* get_binaries uses DOCKER_TAG taken from docker image build step (#2260)

* [proxy] Rework wire format of the password hack and some errors (#2236)

The new format has a few benefits: it's shorter, simpler, and
human-readable as well. We don't use base64 anymore, since
URL encoding has us covered.

We also show a better error in case we couldn't parse the
payload; the users should know it's all about passing the
correct project name.

* test_runner/pg_clients: collect docker logs (#2259)

* get_binaries script fix (#2263)

* get_binaries uses DOCKER_TAG taken from docker image build step

* remove docker tag discovery altogether and fix get_binaries for the version variable

* Better storage sync logs (#2268)

* Find end of WAL on safekeepers using WalStreamDecoder.

We could make it inside wal_storage.rs, but taking into account that
 - wal_storage.rs reading is async
 - we don't need s3 here
 - error handling is different; error during decoding is normal
I decided to put it separately.

Test
cargo test test_find_end_of_wal_last_crossing_segment
prepared earlier by @yeputons passes now.

Fixes #544
      neondatabase/cloud#2004
Supersedes #2066

* Improve walreceiver logic (#2253)

This patch makes walreceiver logic more complicated, but it should work better in most cases. Added `test_wal_lagging` to test scenarios where alive safekeepers can lag behind other alive safekeepers.

- There was a bug where the filter `etcd_info.timeline.commit_lsn > Some(self.local_timeline.get_last_record_lsn())` filtered out all safekeepers in some strange cases. I removed this filter; it should probably help with #2237
- Now walreceiver_connection reports status, including commit_lsn. This allows keeping safekeeper connection even when etcd is down.
- Safekeeper connection now fails if pageserver doesn't receive safekeeper messages for some time. Usually safekeeper sends messages at least once per second.
- `LaggingWal` check now uses `commit_lsn` directly from the safekeeper. This fixes the issue with frequent reconnects when the compute generates WAL really fast.
- `NoWalTimeout` is rewritten to trigger only when we know about new WAL and the connected safekeeper doesn't stream any WAL. This allows setting a small `lagging_wal_timeout`, because it will trigger only when we observe that the connected safekeeper is stuck (a rough sketch of this check follows below).
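
A rough sketch of the reworked `NoWalTimeout` condition from the last item; the types and names are illustrative, not the actual walreceiver code.

```rust
// Illustrative sketch of the NoWalTimeout condition described above: fire only
// when we *know* newer WAL exists (seen from some safekeeper) and the currently
// connected safekeeper has not streamed anything for `lagging_wal_timeout`.
// Types and names are made up, not the actual walreceiver code.
use std::time::{Duration, Instant};

struct ConnectionStatus {
    last_wal_received_at: Instant, // when the connected safekeeper last sent WAL
    streamed_up_to: u64,           // LSN streamed by the connected safekeeper
}

fn no_wal_timeout(
    status: &ConnectionStatus,
    newest_known_commit_lsn: u64, // max commit_lsn seen from any safekeeper
    lagging_wal_timeout: Duration,
    now: Instant,
) -> bool {
    let newer_wal_exists = newest_known_commit_lsn > status.streamed_up_to;
    let silent_too_long = now.duration_since(status.last_wal_received_at) > lagging_wal_timeout;
    newer_wal_exists && silent_too_long
}
```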

* increase timeout in wait_for_upload to avoid spurious failures when testing with real s3

* Bump vendor/postgres to include XLP_FIRST_IS_CONTRECORD fix. (#2274)

* Set up a workflow to run pgbench against captest (#2077)

Signed-off-by: Ankur Srivastava <[email protected]>
Co-authored-by: Alexander Bayandin <[email protected]>
Co-authored-by: Konstantin Knizhnik <[email protected]>
Co-authored-by: Heikki Linnakangas <[email protected]>
Co-authored-by: Ankur Srivastava <[email protected]>
Co-authored-by: bojanserafimov <[email protected]>
Co-authored-by: Dmitry Rodionov <[email protected]>
Co-authored-by: Anastasia Lubennikova <[email protected]>
Co-authored-by: Kirill Bulatov <[email protected]>
Co-authored-by: Heikki Linnakangas <[email protected]>
Co-authored-by: Thang Pham <[email protected]>
Co-authored-by: Stas Kelvich <[email protected]>
Co-authored-by: Arseny Sher <[email protected]>
Co-authored-by: Egor Suvorov <[email protected]>
Co-authored-by: Andrey Taranik <[email protected]>
Co-authored-by: Dmitry Ivanov <[email protected]>