deal with wal records bigger than wal_seg_size #544

Closed
kelvich opened this issue Sep 6, 2021 · 6 comments · Fixed by #2262
Labels
c/storage/safekeeper Component: storage: safekeeper t/bug Issue Type: Bug

Comments

@kelvich
Contributor

kelvich commented Sep 6, 2021

No description provided.

@hlinnaka
Contributor

Got some context on this? Where do we not handle them correctly?

@kelvich
Contributor Author

kelvich commented Sep 17, 2021

That was about find_end_of_wal() in the safekeeper -- when it faces a continuation record it iterates backwards over the WAL to reconstruct the whole record and check its CRC, and it assumes the record starts in the immediately previous segment. IIUC there is code that fixes that behavior for cases where we need to look further back: https://github.com/zenithdb/zenith/commits/safe_flush_ptr. This issue is about extracting and testing that.
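
For illustration, a minimal, hypothetical sketch (not the safekeeper's actual code) of why stopping at the immediately previous segment is not enough: from the remaining length of a continuation record one can estimate how many segments back its header lies, and for a record bigger than wal_seg_size that is more than one segment.

```rust
// Hypothetical sketch, not the safekeeper's actual code: estimate how many
// segments back the header of a continuation record lies, given how many
// bytes of it still remain (xlp_rem_len) at the start of the current segment.
// A record bigger than wal_seg_size necessarily starts more than one segment
// back, so stopping at the previous segment is not enough.
const XLOG_BLCKSZ: u64 = 8192;
const XLOG_SIZE_OF_XLOG_SHORT_PHD: u64 = 24;

fn segments_to_look_back(xlp_rem_len: u64, wal_seg_size: u64) -> u64 {
    // Usable payload per page, ignoring the long header on the first page of
    // each segment (which only makes the estimate slightly optimistic).
    let payload_per_page = XLOG_BLCKSZ - XLOG_SIZE_OF_XLOG_SHORT_PHD;
    let payload_per_segment = (wal_seg_size / XLOG_BLCKSZ) * payload_per_page;
    // Round up: the record header may sit anywhere inside that segment.
    (xlp_rem_len + payload_per_segment - 1) / payload_per_segment
}

fn main() {
    // With 16 MiB segments, a record with ~40 MiB still remaining must have
    // started at least three segments back.
    assert_eq!(segments_to_look_back(40 * 1024 * 1024, 16 * 1024 * 1024), 3);
}
```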

@arssher arssher added the c/storage/safekeeper Component: storage: safekeeper label Sep 20, 2021
@stepashka stepashka added this to the Limited Preview milestone Dec 27, 2021
@stepashka stepashka added the t/bug Issue Type: Bug label Dec 27, 2021
@stepashka
Member

cc @petuhovskiy , you've mentioned that you're testing this already, but the test hadn't failed :)
can you share some more details? thanks!

@petuhovskiy
Member

cc @petuhovskiy , you've mentioned that you're testing this already, but the test hadn't failed :) can you share some more details? thanks!

4468797 is the commit with the test that wasn't failing for me at the time; one can look at it to create a failing test.

@yeputons
Contributor

yeputons commented Feb 5, 2022

I'm wondering if there is some bug in find_end_of_wal_segment's logic here? In my test, xlp_rem_len in adjacent pages of a partial WAL record differs by either 8152 bytes (8192 - 40 = XLOG_BLCKSZ - XLOG_SIZE_OF_XLOG_LONG_PHD) or 8168 bytes (8192 - 24 = XLOG_BLCKSZ - XLOG_SIZE_OF_XLOG_SHORT_PHD), not by exactly 8192. So it seems to me that xlp_rem_len does not account for page headers, while the code assumes it does and that it's safe to just offset inside the WAL file instead of re-reading all pages/blocks (not sure what the right term is here).

So my test does not actually need to make records bigger than wal_seg_size; bigger than XLOG_BLCKSZ is already enough.
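
A minimal sketch of that arithmetic (the constants match PostgreSQL; the helper itself is hypothetical): between two consecutive pages covered by one continuation record, xlp_rem_len shrinks by the payload capacity of the earlier page, i.e. XLOG_BLCKSZ minus that page's header size, never by XLOG_BLCKSZ itself.

```rust
// Minimal sketch of the arithmetic above; the helper is hypothetical, the
// constants match PostgreSQL. Between two consecutive pages covered by one
// continuation record, xlp_rem_len shrinks by the payload capacity of the
// earlier page: XLOG_BLCKSZ minus that page's header size.
const XLOG_BLCKSZ: u32 = 8192;
const XLOG_SIZE_OF_XLOG_SHORT_PHD: u32 = 24; // regular page header
const XLOG_SIZE_OF_XLOG_LONG_PHD: u32 = 40; // first page of a segment

fn rem_len_on_next_page(rem_len: u32, page_is_first_in_segment: bool) -> u32 {
    let header = if page_is_first_in_segment {
        XLOG_SIZE_OF_XLOG_LONG_PHD
    } else {
        XLOG_SIZE_OF_XLOG_SHORT_PHD
    };
    rem_len.saturating_sub(XLOG_BLCKSZ - header)
}

fn main() {
    // The deltas match the 8168 (= 8192 - 24) and 8152 (= 8192 - 40) observed above.
    assert_eq!(100_000 - rem_len_on_next_page(100_000, false), 8168);
    assert_eq!(100_000 - rem_len_on_next_page(100_000, true), 8152);
}
```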

Needs further investigation.

@yeputons
Contributor

yeputons commented May 20, 2022

Progress so far:

  • Add some corner-case unit tests for find_end_of_wal
  • Correctly skip contrecords in the last segment (see the sketch after this list)
  • Handle WAL ending in a big multi-segment record. That requires reading previous segments (only the first few bytes of each, though) if we want to validate this last record.
  • Address a bunch of review comments from "Test find_end_of_wal_segment and fix its contrecord handling" (#1574), including my own (one can even craft a test case for it); they may reveal more bugs and corner cases. Maybe reusing the pageserver's code is a good idea, unless it does lots of parsing that we want to avoid in the safekeeper.
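
For the contrecord-skipping item above, here is a rough, simplified sketch of what skipping a contrecord at the start of a segment means; the page-header struct is a stand-in for illustration, not the postgres_ffi type.

```rust
// Rough, simplified sketch; the page-header struct is a stand-in, not the
// postgres_ffi type. If the first page of a segment has XLP_FIRST_IS_CONTRECORD
// set, the first xlp_rem_len payload bytes (MAXALIGN-padded) belong to a record
// that started in an earlier segment, so the scan for new records must begin
// after them.
const XLP_FIRST_IS_CONTRECORD: u16 = 0x0001;

struct PageHeader {
    xlp_info: u16,
    xlp_rem_len: u32,
    header_size: u64, // 40 bytes (long header) on the first page of a segment
}

fn maxalign(v: u64) -> u64 {
    (v + 7) & !7
}

/// Offset within the segment where the first record *not* started in an
/// earlier segment can begin. Only handles the case where the continuation
/// tail fits on the first page; otherwise the headers of the following pages
/// must be skipped too, page by page.
fn first_new_record_offset(first_page: &PageHeader) -> u64 {
    if first_page.xlp_info & XLP_FIRST_IS_CONTRECORD != 0 {
        first_page.header_size + maxalign(first_page.xlp_rem_len as u64)
    } else {
        first_page.header_size
    }
}

fn main() {
    let page = PageHeader { xlp_info: XLP_FIRST_IS_CONTRECORD, xlp_rem_len: 1000, header_size: 40 };
    assert_eq!(first_new_record_offset(&page), 40 + 1000); // 1000 is already MAXALIGNed
}
```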

yeputons added a commit that referenced this issue May 20, 2022
…rieval

It would be better to not update xl_crc/rec_hdr at all when skipping contrecord,
but I would prefer to keep PR #1574 small.
Better audit of `find_end_of_wal_segment` is coming anyway in #544.
yeputons added a commit that referenced this issue May 21, 2022
…rieval

It would be better to not update xl_crc/rec_hdr at all when skipping contrecord,
but I would prefer to keep PR #1574 small.
Better audit of `find_end_of_wal_segment` is coming anyway in #544.
@stepashka stepashka modified the milestones: 2022/06, 2022/07 Jul 14, 2022
@stepashka stepashka modified the milestones: 2022/07, 2022/08 Aug 11, 2022
arssher added a commit that referenced this issue Aug 12, 2022
We could make it inside wal_storage.rs, but taking into account that
 - wal_storage.rs reading is async
 - we don't need s3 here
 - error handling is different; error during decoding is normal
I decided to put it separately.

Test
cargo test test_find_end_of_wal_last_crossing_segment
prepared earlier by @yeputons passes now.

Fixes #544
      neondatabase/cloud#2004
Supersedes #2066
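
The eventual fix (#2262) takes the approach named in the commit title further down the thread, "Find end of WAL on safekeepers using WalStreamDecoder": decode records one by one and take the position right after the last record that decodes cleanly. Below is a rough sketch of that loop against a deliberately minimal, hypothetical decoder interface, not the real WalStreamDecoder API.

```rust
// Rough sketch of "find end of WAL by decoding": the decoder interface here is
// deliberately minimal and hypothetical, not the real postgres_ffi
// WalStreamDecoder API. Positions are plain offsets into the supplied buffer;
// in the real code they would be LSNs.
struct DecodedRecord {
    end_pos: u64, // position right after this record
}

trait RecordDecoder {
    /// Try to decode the next record from `wal_tail`, starting at `pos`.
    /// Returns the decoded record, None if more bytes are needed, or an
    /// error for garbage / CRC mismatch (i.e. end of valid WAL).
    fn poll_decode(&mut self, wal_tail: &[u8], pos: u64) -> Result<Option<DecodedRecord>, String>;
}

fn find_end_of_wal<D: RecordDecoder>(decoder: &mut D, wal_tail: &[u8], start_pos: u64) -> u64 {
    let mut end = start_pos;
    loop {
        match decoder.poll_decode(wal_tail, end) {
            // A record decoded cleanly: the end of WAL moves past it.
            Ok(Some(rec)) => end = rec.end_pos,
            // Not enough bytes, or the bytes no longer form a valid record:
            // everything up to `end` is the valid WAL we keep.
            Ok(None) | Err(_) => return end,
        }
    }
}
```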
arssher added a commit that referenced this issue Aug 14, 2022
We could make it inside wal_storage.rs, but taking into account that
 - wal_storage.rs reading is async
 - we don't need s3 here
 - error handling is different; error during decoding is normal
I decided to put it separately.

Test
cargo test test_find_end_of_wal_last_crossing_segment
prepared earlier by @yeputons passes now.

Fixes #544
      neondatabase/cloud#2004
Supersedes #2066
petuhovskiy added a commit that referenced this issue Aug 15, 2022
* github/workflows: Fix git dubious ownership (#2223)

* Move relation size cache from WalIngest to DatadirTimeline (#2094)

* Move relation size cache to layered timeline

* Fix obtaining current LSN for relation size cache

* Resolve merge conflicts

* Resolve merge conflicts

* Restore 'lsn' field in DatadirModification

* adjust DatadirModification lsn in ingest_record

* Fix formatting

* Pass lsn to get_relsize

* Fix merge conflict

* Update pageserver/src/pgdatadir_mapping.rs

Co-authored-by: Heikki Linnakangas <[email protected]>

* Update pageserver/src/pgdatadir_mapping.rs

Co-authored-by: Heikki Linnakangas <[email protected]>

Co-authored-by: Heikki Linnakangas <[email protected]>

* refactor: replace lazy-static with once-cell (#2195)

- Replacing all the occurrences of lazy-static with `once-cell::sync::Lazy`
- fixes #1147

Signed-off-by: Ankur Srivastava <[email protected]>

* Add more buckets to pageserver latency metrics (#2225)

* ignore record property warning to fix benchmarks

* increase statement timeout

* use event so it fires only if workload thread successfully finished

* remove debug log

* increase timeout to pass test with real s3

* avoid duplicate parameter, increase timeout

* Major migration script (#2073)

This script can be used to migrate a tenant across breaking storage versions, or (in the future) to upgrade postgres versions. See the comment at the top for an overview.

Co-authored-by: Anastasia Lubennikova <[email protected]>

* Fix etcd typos

* Fix links to safekeeper protocol docs. (#2188)

safekeeper/README_PROTO.md was moved to docs/safekeeper-protocol.md in
commit 0b14fdb, as part of reorganizing the docs into 'mdbook' format.

Fixes issue #1475. Thanks to @banks for spotting the outdated references.

In addition to fixing the above issue, this patch also fixes other broken links as a result of 0b14fdb. See #2188 (review).

Co-authored-by: Heikki Linnakangas <[email protected]>
Co-authored-by: Thang Pham <[email protected]>

* Update CONTRIBUTING.md

* Update CONTRIBUTING.md

* support node id and remote storage params in docker_entrypoint.sh

* Safe truncate (#2218)

* Move relation size cache to layered timeline

* Fix obtaining current LSN for relation size cache

* Resolve merge conflicts

* Resolve merge conflicts

* Restore 'lsn' field in DatadirModification

* adjust DatadirModification lsn in ingest_record

* Fix formatting

* Pass lsn to get_relsize

* Fix merge conflict

* Update pageserver/src/pgdatadir_mapping.rs

Co-authored-by: Heikki Linnakangas <[email protected]>

* Update pageserver/src/pgdatadir_mapping.rs

Co-authored-by: Heikki Linnakangas <[email protected]>

* Check if relation exists before trying to truncate it

refers to #1932

* Add test reproducing FSM truncate problem

Co-authored-by: Heikki Linnakangas <[email protected]>

* Fix exponential backoff values

* Update `vendor/postgres` back; it was changed accidentally. (#2251)

Commit 4227cfc accidentally reverted vendor/postgres to an older
version. Update it back.

* Add pageserver checkpoint_timeout option.

To flush the in-memory layer eventually when no new data arrives, which helps
safekeepers suspend activity (stop pushing to the broker). The default of 10m
should be ok.

* Share exponential backoff code and fix logic for delete task failure (#2252)

* Fix bug when import large (>1GB) relations (#2172)

Resolves #2097 

- use timeline modification's `lsn` and timeline's `last_record_lsn` to determine the corresponding LSN to query data in `DatadirModification::get`
- update `test_import_from_pageserver`. Split the test into 2 variants: `small` and `multisegment`. 
  + `small` is the old test
  + `multisegment` is to simulate #2097 by using a larger number of inserted rows to create multiple segment files of a relation. `multisegment` is configured to only run with a `release` build

* Fix timeline physical size flaky tests (#2244)

Resolves #2212.

- use `wait_for_last_flush_lsn` in `test_timeline_physical_size_*` tests

## Context
We need to wait for the pageserver to catch up with the compute's last flush LSN because, during the timeline physical size API call, it's possible that there are running `LayerFlushThread` threads. These threads flush new layers to disk and hence update the physical size. This results in a mismatch between the physical size reported by the API and the actual physical size on disk.

### Note
The `LayerFlushThread` threads are processed **concurrently**, so it's possible that the above error still persists even with this patch. However, making the tests wait for all the WAL to be processed (not flushed) before calculating the physical size should reduce the "flakiness" significantly.

* postgres_ffi/waldecoder: validate more header fields

* postgres_ffi/waldecoder: remove unused startlsn

* postgres_ffi/waldecoder: introduce explicit `enum State`

Previously it was emulated with a combination of nullable fields.
This change should make the logic more readable.
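
As an illustration of the pattern this commit describes (the field and variant names below are made up, not the actual waldecoder code): state that was previously implied by which nullable fields happened to be set becomes a single explicit enum, so impossible combinations are no longer representable.

```rust
// Illustrative only; the names are made up, not the actual waldecoder code.
// Before: the decoder's phase was implied by which Option fields were Some,
// and inconsistent combinations were representable.
struct DecoderBefore {
    contlen: Option<u32>,            // Some(..) only while inside a record
    partial_record: Option<Vec<u8>>, // must be kept in sync with contlen by hand
}

// After: one explicit enum makes each phase, and the data valid in it, clear.
enum State {
    WaitingForRecord,
    ReassemblingRecord { partial_record: Vec<u8>, bytes_remaining: u32 },
    SkippingEverything { skip_until: u64 },
}

struct DecoderAfter {
    state: State,
}
```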

* disable `test_import_from_pageserver_multisegment` (#2258)

This test fails consistently on `main` now. It's better to temporarily disable it to avoid blocking others' PRs while investigating the root cause of the test failure.

See: #2255, #2256

* get_binaries uses DOCKER_TAG taken from docker image build step (#2260)

* [proxy] Rework wire format of the password hack and some errors (#2236)

The new format has a few benefits: it's shorter, simpler, and
human-readable as well. We don't use base64 anymore, since
URL encoding has us covered.

We also show a better error in case we couldn't parse the
payload; the users should know it's all about passing the
correct project name.

* test_runner/pg_clients: collect docker logs (#2259)

* get_binaries script fix (#2263)

* get_binaries uses DOCKER_TAG taken from docker image build step

* remove docker tag discovery altogether and fix get_binaries for the version variable

* Better storage sync logs (#2268)

* Find end of WAL on safekeepers using WalStreamDecoder.

We could make it inside wal_storage.rs, but taking into account that
 - wal_storage.rs reading is async
 - we don't need s3 here
 - error handling is different; error during decoding is normal
I decided to put it separately.

Test
cargo test test_find_end_of_wal_last_crossing_segment
prepared earlier by @yeputons passes now.

Fixes #544
      neondatabase/cloud#2004
Supersedes #2066

* Improve walreceiver logic (#2253)

This patch makes walreceiver logic more complicated, but it should work better in most cases. Added `test_wal_lagging` to test scenarios where alive safekeepers can lag behind other alive safekeepers.

- There was a bug where the filter `etcd_info.timeline.commit_lsn > Some(self.local_timeline.get_last_record_lsn())` filtered out all safekeepers in some strange cases. I removed this filter; it should probably help with #2237
- Now walreceiver_connection reports status, including commit_lsn. This allows keeping safekeeper connection even when etcd is down.
- Safekeeper connection now fails if pageserver doesn't receive safekeeper messages for some time. Usually safekeeper sends messages at least once per second.
- `LaggingWal` check now uses `commit_lsn` directly from the safekeeper. This fixes the issue with frequent reconnects when the compute generates WAL really fast.
- `NoWalTimeout` is rewritten to trigger only when we know about new WAL and the connected safekeeper doesn't stream any WAL. This allows setting a small `lagging_wal_timeout`, because it will trigger only when we observe that the connected safekeeper is stuck (a rough sketch of this check follows below).
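
A rough sketch of the reworked `NoWalTimeout` condition from the last item; the types and names are illustrative, not the actual walreceiver code.

```rust
// Illustrative sketch of the NoWalTimeout condition described above: fire only
// when we *know* newer WAL exists (seen from some safekeeper) and the currently
// connected safekeeper has not streamed anything for `lagging_wal_timeout`.
// Types and names are made up, not the actual walreceiver code.
use std::time::{Duration, Instant};

struct ConnectionStatus {
    last_wal_received_at: Instant, // when the connected safekeeper last sent WAL
    streamed_up_to: u64,           // LSN streamed by the connected safekeeper
}

fn no_wal_timeout(
    status: &ConnectionStatus,
    newest_known_commit_lsn: u64, // max commit_lsn seen from any safekeeper
    lagging_wal_timeout: Duration,
    now: Instant,
) -> bool {
    let newer_wal_exists = newest_known_commit_lsn > status.streamed_up_to;
    let silent_too_long = now.duration_since(status.last_wal_received_at) > lagging_wal_timeout;
    newer_wal_exists && silent_too_long
}
```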

* increase timeout in wait_for_upload to avoid spurious failures when testing with real s3

* Bump vendor/postgres to include XLP_FIRST_IS_CONTRECORD fix. (#2274)

* Set up a workflow to run pgbench against captest (#2077)

Signed-off-by: Ankur Srivastava <[email protected]>
Co-authored-by: Alexander Bayandin <[email protected]>
Co-authored-by: Konstantin Knizhnik <[email protected]>
Co-authored-by: Heikki Linnakangas <[email protected]>
Co-authored-by: Ankur Srivastava <[email protected]>
Co-authored-by: bojanserafimov <[email protected]>
Co-authored-by: Dmitry Rodionov <[email protected]>
Co-authored-by: Anastasia Lubennikova <[email protected]>
Co-authored-by: Kirill Bulatov <[email protected]>
Co-authored-by: Heikki Linnakangas <[email protected]>
Co-authored-by: Thang Pham <[email protected]>
Co-authored-by: Stas Kelvich <[email protected]>
Co-authored-by: Arseny Sher <[email protected]>
Co-authored-by: Egor Suvorov <[email protected]>
Co-authored-by: Andrey Taranik <[email protected]>
Co-authored-by: Dmitry Ivanov <[email protected]>