Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merge RocksDB 8.9.1 to master #318

Merged
merged 749 commits into from
Feb 23, 2024
Merged

Merge RocksDB 8.9.1 to master #318

merged 749 commits into from
Feb 23, 2024

Conversation

igorcanadi
Copy link
Collaborator

No description provided.

jowlyzhang and others added 30 commits August 21, 2023 12:14
Summary:
This piggy back the existing last level file temperature statistics test to test the default temperature becoming effective.

While adding this unit test, I found that the approach to swap out and use default temperature in `VersionBuilder::LoadTableHandlers` will miss the L0 files created from flush, and only work for existing SST files, SST files created by compaction. So this PR moves that logic to `TableCache::GetTableReader`.

Pull Request resolved: facebook#11722

Test Plan:
```
./db_test2 --gtest_filter="*LastLevelStatistics*"
make all check
```

Reviewed By: pdillinger

Differential Revision: D48489171

Pulled By: jowlyzhang

fbshipit-source-id: ac29f7d484916f3218729594c5bb35c4f2979ac2
…bench_tool (facebook#11727)

Summary:
As the new API to wait for compaction is available (facebook#11436), we can now replace the existing logic of waiting in db_bench_tool with the new API.

Pull Request resolved: facebook#11727

Test Plan:
```
./db_bench --benchmarks="fillrandom,compactall,waitforcompaction,readrandom"
```
**Before change**
```
Set seed to 1692635571470041 because --seed was 0
Initializing RocksDB Options from the specified file
Initializing RocksDB Options from command-line flags
Integrated BlobDB: blob cache disabled
RocksDB:    version 8.6.0
Date:       Mon Aug 21 09:33:40 2023
CPU:        80 * Intel(R) Xeon(R) Gold 6138 CPU @ 2.00GHz
CPUCache:   28160 KB
Keys:       16 bytes each (+ 0 bytes user-defined timestamp)
Values:     100 bytes each (50 bytes after compression)
Entries:    1000000
Prefix:    0 bytes
Keys per prefix:    0
RawSize:    110.6 MB (estimated)
FileSize:   62.9 MB (estimated)
Write rate: 0 bytes/second
Read rate: 0 ops/second
Compression: Snappy
Compression sampling rate: 0
Memtablerep: SkipListFactory
Perf Level: 1
WARNING: Optimization is disabled: benchmarks unnecessarily slow
WARNING: Assertions are enabled; benchmarks unnecessarily slow
------------------------------------------------
Initializing RocksDB Options from the specified file
Initializing RocksDB Options from command-line flags
Integrated BlobDB: blob cache disabled
DB path: [/tmp/rocksdbtest-226125/dbbench]
fillrandom   :      51.826 micros/op 19295 ops/sec 51.826 seconds 1000000 operations;    2.1 MB/s
waitforcompaction(/tmp/rocksdbtest-226125/dbbench): started
waitforcompaction(/tmp/rocksdbtest-226125/dbbench): finished
waitforcompaction(/tmp/rocksdbtest-226125/dbbench): started
waitforcompaction(/tmp/rocksdbtest-226125/dbbench): finished
DB path: [/tmp/rocksdbtest-226125/dbbench]
readrandom   :      39.042 micros/op 25613 ops/sec 39.042 seconds 1000000 operations;    1.8 MB/s (632886 of 1000000 found)
```
**After change**
```
Set seed to 1692636574431745 because --seed was 0
Initializing RocksDB Options from the specified file
Initializing RocksDB Options from command-line flags
Integrated BlobDB: blob cache disabled
RocksDB:    version 8.6.0
Date:       Mon Aug 21 09:49:34 2023
CPU:        80 * Intel(R) Xeon(R) Gold 6138 CPU @ 2.00GHz
CPUCache:   28160 KB
Keys:       16 bytes each (+ 0 bytes user-defined timestamp)
Values:     100 bytes each (50 bytes after compression)
Entries:    1000000
Prefix:    0 bytes
Keys per prefix:    0
RawSize:    110.6 MB (estimated)
FileSize:   62.9 MB (estimated)
Write rate: 0 bytes/second
Read rate: 0 ops/second
Compression: Snappy
Compression sampling rate: 0
Memtablerep: SkipListFactory
Perf Level: 1
WARNING: Optimization is disabled: benchmarks unnecessarily slow
WARNING: Assertions are enabled; benchmarks unnecessarily slow
------------------------------------------------
Initializing RocksDB Options from the specified file
Initializing RocksDB Options from command-line flags
Integrated BlobDB: blob cache disabled
DB path: [/tmp/rocksdbtest-226125/dbbench]
fillrandom   :      51.271 micros/op 19504 ops/sec 51.271 seconds 1000000 operations;    2.2 MB/s
waitforcompaction(/tmp/rocksdbtest-226125/dbbench): started
waitforcompaction(/tmp/rocksdbtest-226125/dbbench): finished with status (OK)
DB path: [/tmp/rocksdbtest-226125/dbbench]
readrandom   :      39.264 micros/op 25468 ops/sec 39.264 seconds 1000000 operations;    1.8 MB/s (632921 of 1000000 found)
```

Reviewed By: ajkr

Differential Revision: D48524667

Pulled By: jaykorean

fbshipit-source-id: 1052a15b2ed79a35165ec4d9998d0454b2552ef4
Summary: Pull Request resolved: facebook#11728

Reviewed By: jaykorean, jowlyzhang

Differential Revision: D48527100

Pulled By: anand1976

fbshipit-source-id: c48baa44e538fb6bfd3fe7f19046746d3540763f
Summary:
For some ldb commands that doesn't need to open the DB, it's still useful to use the DB's existing OPTIONS file if it's available.

Pull Request resolved: facebook#11721

Reviewed By: pdillinger

Differential Revision: D48485540

Pulled By: jowlyzhang

fbshipit-source-id: 2d2db837523044066f1a2c4b59a5c03f6cd35e6b
Summary:
Currently the stress test does not support restoring expected state (to a specific sequence number) when there is unsynced data loss during the reopen phase. This causes a few internal stress test failure with errors like inconsistent value. This PR disables dropping unsynced data during reopen to avoid failures due to this issue. We can re-enable later after we decide to support unsynced data loss during DB reopen in stress test.

Pull Request resolved: facebook#11731

Test Plan:
* Running this test a few times can fail for inconsistent value before this change
```
./db_stress --acquire_snapshot_one_in=10000 --adaptive_readahead=1 --allow_concurrent_memtable_write=1 --allow_data_in_errors=True --async_io=0 --avoid_flush_during_recovery=0 --avoid_unnecessary_blocking_io=0 --backup_max_size=104857600 --backup_one_in=0 --batch_protection_bytes_per_key=0 --block_protection_bytes_per_key=8 --block_size=16384 --bloom_bits=20.57166126835524 --bottommost_compression_type=disable --bytes_per_sync=262144 --cache_index_and_filter_blocks=1 --cache_size=8388608 --cache_type=auto_hyper_clock_cache --charge_compression_dictionary_building_buffer=1 --charge_file_metadata=1 --charge_filter_construction=0 --charge_table_reader=1 --checkpoint_one_in=0 --checksum_type=kxxHash --clear_column_family_one_in=0 --column_families=1 --compact_files_one_in=1000000 --compact_range_one_in=1000000 --compaction_pri=3 --compaction_style=1 --compaction_ttl=100 --compression_max_dict_buffer_bytes=0 --compression_max_dict_bytes=0 --compression_parallel_threads=1 --compression_type=zstd --compression_use_zstd_dict_trainer=1 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --data_block_index_type=0 --db=/dev/shm/rocksdb_test/rocksdb_crashtest_whitebox --db_write_buffer_size=0 --delpercent=4 --delrangepercent=1 --destroy_db_initially=0 --detect_filter_construct_corruption=0 --disable_wal=0 --enable_compaction_filter=0 --enable_pipelined_write=1 --enable_thread_tracking=0 --expected_values_dir=/dev/shm/rocksdb_test/rocksdb_crashtest_expected --fail_if_options_file_error=1 --fifo_allow_compaction=1 --file_checksum_impl=big --flush_one_in=1000000 --format_version=3 --get_current_wal_file_one_in=0 --get_live_files_one_in=1000000 --get_property_one_in=1000000 --get_sorted_wal_files_one_in=0 --index_block_restart_interval=6 --index_type=3 --ingest_external_file_one_in=0 --initial_auto_readahead_size=16384 --iterpercent=10 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=1 --lock_wal_one_in=1000000 --log2_keys_per_lock=10 --long_running_snapshots=1 --manual_wal_flush_one_in=1000000 --mark_for_compaction_one_file_in=10 --max_auto_readahead_size=0 --max_background_compactions=1 --max_bytes_for_level_base=67108864 --max_key=25000000 --max_key_len=3 --max_manifest_file_size=1073741824 --max_write_batch_group_size_bytes=16777216 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=0 --memtable_max_range_deletions=100 --memtable_prefix_bloom_size_ratio=0 --memtable_protection_bytes_per_key=1 --memtable_whole_key_filtering=0 --memtablerep=skip_list --min_write_buffer_number_to_merge=2 --mmap_read=0 --mock_direct_io=False --nooverwritepercent=1 --num_file_reads_for_auto_readahead=1 --open_files=-1 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=5 --open_write_fault_one_in=0 --ops_per_thread=200000 --optimize_filters_for_memory=0 --paranoid_file_checks=1 --partition_filters=0 --partition_pinning=3 --pause_background_one_in=1000000 --periodic_compaction_seconds=10 --prefix_size=-1 --prefixpercent=0 --prepopulate_block_cache=1 --preserve_internal_time_seconds=0 --progress_reports=0 --read_fault_one_in=1000 --readahead_size=524288 --readpercent=50 --recycle_log_file_num=0 --reopen=20 --ribbon_starting_level=0 --secondary_cache_fault_one_in=32 --snapshot_hold_ops=100000 --sst_file_manager_bytes_per_sec=0 --sst_file_manager_bytes_per_truncate=0 --stats_dump_period_sec=10 --subcompactions=3 --sync=0 --sync_fault_injection=1 --target_file_size_base=16777216 --target_file_size_multiplier=1 --test_batches_snapshots=0 --top_level_index_pinning=2 --unpartitioned_pinning=1 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=0 --use_get_entity=1 --use_merge=0 --use_multi_get_entity=0 --use_multiget=1 --use_put_entity_one_in=1 --user_timestamp_size=0 --value_size_mult=32 --verify_checksum=1 --verify_checksum_one_in=1000000 --verify_db_one_in=100000 --verify_file_checksums_one_in=1000000 --verify_iterator_with_expected_state_one_in=5 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=524288 --wal_compression=zstd --write_buffer_size=33554432 --write_dbid_to_manifest=1 --writepercent=35```

Reviewed By: hx235

Differential Revision: D48537494

Pulled By: cbi42

fbshipit-source-id: ddae21b9bb6ee8d67229121f58513e95f7ef6d8d
…book#11725)

Summary:
Fix facebook#6470

Ensure TransactionLogIter being initialized correctly with SYNC_POINT API when calling `GetSortedWALFiles`.

Pull Request resolved: facebook#11725

Reviewed By: akankshamahajan15

Differential Revision: D48529411

Pulled By: ajkr

fbshipit-source-id: 970ca1a6259ed996c6d87f7fcd40f95acf441517
Summary:
Add a bunch of C API functions to expose new `WaitForCompact` function and related options.

Pull Request resolved: facebook#11737

Test Plan: unit tests

Reviewed By: jaykorean

Differential Revision: D48568239

Pulled By: abulimov

fbshipit-source-id: 1ff35972d7abacd7e1e17fe2ada1e20cdc88d8de
…cebook#11740)

Summary:
**Context/Summary:**
Same intention as facebook#2693 - basically we now pick from the last sorted run and expand forward till we can't

Pull Request resolved: facebook#11740

Test Plan:
Existing UT
Stress test

Reviewed By: ajkr

Differential Revision: D48586475

Pulled By: hx235

fbshipit-source-id: 3eb3c3ee1d5f7e0b0d6d649baaeb8c6990fee398
…ebook#11592)

Summary:
In blackbox tests, db_stress command always run with timeout. Timeout can happen during validation, leaving some of the keys not checked. Since key validation is done in order, it is quite likely that keys those are towards to the end of the set are never validated. This PR adds a final execution, without timeout, to ensure validation is executed for all keys, at least once.

Pull Request resolved: facebook#11592

Reviewed By: cbi42

Differential Revision: D48003998

Pulled By: hx235

fbshipit-source-id: 72543475a932f12cf0f57534b7e3b6e07e87080f
…facebook#11755)

Summary: Pull Request resolved: facebook#11755

Reviewed By: anand1976

Differential Revision: D48656627

Pulled By: hx235

fbshipit-source-id: 568fa7749cbf6ecf65102b4513fa3af975fd91b8
…b_stress (facebook#11729)

Summary:
Same as title

Pull Request resolved: facebook#11729

Test Plan: make crash_test -j32

Reviewed By: anand1976

Differential Revision: D48534820

Pulled By: akankshamahajan15

fbshipit-source-id: 3a2a28af98dfad164b82ddaaf9fddb94c53a652e
Summary:
* Add some options to cache_bench to use JemallocNodumpAllocator
* Make num_shard_bits option use and report cache-specific defaults
* Add a usleep option to sleep between operations, for simulating a workload with more CPU idle/wait time.
* Use const& for JemallocAllocatorOptions, to improve API usability (e.g. can bind to temporary `{}`)
* InstallStackTraceHandler()

Pull Request resolved: facebook#11758

Test Plan: manual

Reviewed By: jowlyzhang

Differential Revision: D48668479

Pulled By: pdillinger

fbshipit-source-id: b6032fbe09444cdb8f1443a5e017d2eea4f6205a
Summary:
Fix seg fault in auto_readahead_size
```
db_stress:
internal_repo_rocksdb/repo/table/block_based/partitioned_index_iterator.h:70: virtual rocksdb::IndexValue rocksdb::PartitionedIndexIterator::value() const: Assertion `Valid()' failed.
```

During seek, after calculating readahead_size, db_stress can inject IOError resulting in failure to index_iter_->Seek and making index_iter_ invalid.

Pull Request resolved: facebook#11761

Test Plan: Reproducible locally and passed with this fix

Reviewed By: anand1976

Differential Revision: D48696248

Pulled By: akankshamahajan15

fbshipit-source-id: 2be43bf56ad0fc2f95f9093c19c9a1b15a716091
…rift_models/StoryFeedMediaSticker/ttypes.py

Summary: Python3 makes the use of `(object)` in class inheritance unnecessary. Let's modernize our code by eliminating this.

Reviewed By: itamaro

Differential Revision: D48673915

fbshipit-source-id: a1a6ae8572271eb2898b748c8216ea68e362f06a
Summary:
`VersionBuilderMap` type alias definition seem unused.
If this PR can be compiled fine then the alias is probably not needed anymore.

Pull Request resolved: facebook#11286

Reviewed By: jaykorean

Differential Revision: D48656747

Pulled By: ajkr

fbshipit-source-id: ac8554922aead7dc3d24fe7e6544a4622578c514
Summary:
Fixes facebook#11742

Even after performing duty (1) ("Waiting for the next refill time"), it is possible the remaining threads are all in `Wait()`. Waking up at least one thread is enough to ensure progress continues, even if no new requests arrive.

The repro unit test (facebook@bb54245e6) is not included as it depends on an unlanded PR (facebook#11753)

Pull Request resolved: facebook#11763

Reviewed By: jaykorean

Differential Revision: D48710130

Pulled By: ajkr

fbshipit-source-id: 9d166bd577ea3a96ccd81dde85871fec5e85a4eb
Summary:
Fix seg fault in auto_readahead_size with async_io when readahead_size = 0. If readahead_size is trimmed and is 0, it's not eligible for further prefetching and should return.

Error occured when the first buffer already contains data and it goes for prefetching in second buffer leading to assertion failure -

`assert(roundup_len1 >= alignment);
`
because roundup_len1 = length + readahead_size.
length is 0 and readahead_size is also 0.

Pull Request resolved: facebook#11769

Test Plan: Reproducible with db_stress with async_io enabled.

Reviewed By: anand1976

Differential Revision: D48743031

Pulled By: akankshamahajan15

fbshipit-source-id: 0e08c41f862f6287ca223fbfaf6cd42fc97b3c87
Summary:
The user-defined timestamps feature only enforces that for the same key, user-defined timestamps should be non-decreasing. For the user-defined timestamps in memtable only feature, during flush, we check the user-defined timestamps in each memtable to examine if the data is considered expired with regard to `full_history_ts_low`. In this process, it's assuming that a newer memtable should not have smaller user-defined timestamps than an older memtable. This check however is enforcing ordering of user-defined timestamps across keys, as apposed to the vanilla UDT feature, that only enforce ordering of user-defined timestamps for the same key.

This more strict user-defined timestamp ordering requirement could be an issue for secondary instances where commits can be out of order. And after thinking more about it, this requirement is really an overkill to keep the invariants of `full_history_ts_low` which are:

1) users cannot read below `full_history_ts_low`
2) users cannot write at or below `full_history_ts_low`
3) `full_history_ts_low` can only be increasing

As long as RocksDB enforces these 3 checks, we can prohibit inconsistent read that returns a different value. And these three checks are covered in existing APIs.

So this PR removes the extra checks in the UDT in memtable only feature that requires user-defined timestamps to be non decreasing across keys.

Pull Request resolved: facebook#11732

Reviewed By: ltamasi

Differential Revision: D48541466

Pulled By: jowlyzhang

fbshipit-source-id: 95453c6e391cbd511c0feab05f0b11c312d17186
Summary:
when a key is recorded for locking in a pessimistic transaction, the key is first looked up in a map, and then inserted into the map if it was not already contained.
this can be simplified to an unconditional insert. in the ideal case that all keys are unique, this saves all the find() operations.

Pull Request resolved: facebook#11743

Reviewed By: anand1976

Differential Revision: D48656798

Pulled By: ajkr

fbshipit-source-id: d0150de2db757e0c05e1797cfc24380790c71276
Summary:
Having a synthetic implementation of `TimedWait()` in `SystemClock` will allow us to add `SyncPoint`s while mutex is released, which was previously impossible since the lock was released and reacquired all within `pthread_cond_timedwait()`. Additionally, integrating `TimedWait()` with `MockSystemClock` allows us to cleanup some workarounds in the test code. In this PR I only cleaned up the `GenericRateLimiter` test workaround.

This is related to the intended follow-up mentioned in facebook#7101 description. There are a couple differences:

(1) This PR does not include removing the particular workaround that initially motivated it. Actually, the `Timer` class uses `InstrumentedCondVar`, so the interface introduced here is inadequate to remove that workaround. On the bright side, the interface introduced in this PR can be changed as needed since it can neither be used nor extended externally, due to using forward-declared `port::CondVar*` in the interface.
(2) This PR only makes the change in `SystemClock` not `Env`. Older revisions of this PR included `Env::TimedWait()` and `SpecialEnv::TimedWait()`; however, since they were unused it probably makes sense to defer adding them until when they are needed.

Pull Request resolved: facebook#11753

Reviewed By: pdillinger

Differential Revision: D48654995

Pulled By: ajkr

fbshipit-source-id: 15e19f2454b64d4ec7f50e328691c66ca9911122
Summary:
the value of `done` is always false here, so the sub-condition `!done` will always be true and the check can be removed.

Pull Request resolved: facebook#11746

Reviewed By: anand1976

Differential Revision: D48656845

Pulled By: ajkr

fbshipit-source-id: 523ba3d07b3af7880c8c8ccb20442fd7c0f49417
…book#11774)

Summary:
This PR adds a missing piece for the UDT in memtable only feature, which is to automatically increase `full_history_ts_low` when flush happens during recovery.

Pull Request resolved: facebook#11774

Test Plan:
Added unit test
make all check

Reviewed By: ltamasi

Differential Revision: D48799109

Pulled By: jowlyzhang

fbshipit-source-id: fd681ed66d9d40904ca2c919b2618eb692686035
facebook#11773)

Summary:
…n change (facebook#11755)"

This reverts commit 4513165.

Pull Request resolved: facebook#11773

Reviewed By: ajkr

Differential Revision: D48832320

Pulled By: hx235

fbshipit-source-id: 96cef26a885134360766a83505f6717598eac6a9
Summary:
wide_columns can now be pretty-printed in the following commands
- `./ldb dump_wal`
- `./ldb dump`
- `./ldb idump`
- `./ldb dump_live_files`
- `./ldb scan`
- `./sst_dump --command=scan`

There are opportunities to refactor to reduce some nearly identical code. This PR is initial change to add wide column support in `ldb` and `sst_dump` tool. More PRs to come for the refactor.

Pull Request resolved: facebook#11754

Test Plan:
**New Tests added**
- `WideColumnsHelperTest::DumpWideColumns`
- `WideColumnsHelperTest::DumpSliceAsWideColumns`

**Changes added to existing tests**
- `ExternalSSTFileTest::BasicMixed` added to cover mixed case (This test should have been added in facebook#11688). This test does not verify the ldb or sst_dump output. This test was used to create test SST files having some rows with wide columns and some without and the generated SST files were used to manually test sst_dump_tool.
- `createSST()` in `sst_dump_test` now takes `wide_column_one_in` to add wide column value in SST

**dump_wal**
```
./ldb dump_wal --walfile=/tmp/rocksdbtest-226125/db_wide_basic_test_2675429_2308393776696827948/000004.log --print_value --header
```
```
Sequence,Count,ByteSize,Physical Offset,Key(s) : value
1,1,59,0,PUT_ENTITY(0) : 0x:0x68656C6C6F 0x617474725F6E616D6531:0x666F6F 0x617474725F6E616D6532:0x626172
2,1,34,42,PUT_ENTITY(0) : 0x617474725F6F6E65:0x74776F 0x617474725F7468726565:0x666F7572
3,1,17,7d,PUT(0) : 0x7468697264 : 0x62617A
```

**idump**
```
./ldb --db=/tmp/rocksdbtest-226125/db_wide_basic_test_3481961_2308393776696827948/ idump
```
```
'first' seq:1, type:22 => :hello attr_name1:foo attr_name2:bar
'second' seq:2, type:22 => attr_one:two attr_three:four
'third' seq:3, type:1 => baz
Internal keys in range: 3
```

**SST Dump from dump_live_files**
```
./ldb --db=/tmp/rocksdbtest-226125/db_wide_basic_test_3481961_2308393776696827948/ compact
./ldb --db=/tmp/rocksdbtest-226125/db_wide_basic_test_3481961_2308393776696827948/ dump_live_files
```
```
...
==============================
SST Files
==============================
/tmp/rocksdbtest-226125/db_wide_basic_test_3481961_2308393776696827948/000013.sst level:1
------------------------------
Process /tmp/rocksdbtest-226125/db_wide_basic_test_3481961_2308393776696827948/000013.sst
Sst file format: block-based
'first' seq:0, type:22 => :hello attr_name1:foo attr_name2:bar
'second' seq:0, type:22 => attr_one:two attr_three:four
'third' seq:0, type:1 => baz
...
```

**dump**
```
./ldb --db=/tmp/rocksdbtest-226125/db_wide_basic_test_3481961_2308393776696827948/ dump
```
```
first ==> :hello attr_name1:foo attr_name2:bar
second ==> attr_one:two attr_three:four
third ==> baz
Keys in range: 3
```

**scan**
```
./ldb --db=/tmp/rocksdbtest-226125/db_wide_basic_test_3481961_2308393776696827948/ scan
```
```
first : :hello attr_name1:foo attr_name2:bar
second : attr_one:two attr_three:four
third : baz
```

**sst_dump**
```
./sst_dump --file=/tmp/rocksdbtest-226125/db_wide_basic_test_3481961_2308393776696827948/000013.sst --command=scan
```
```
options.env is 0x7ff54b296000
Process /tmp/rocksdbtest-226125/db_wide_basic_test_3481961_2308393776696827948/000013.sst
Sst file format: block-based
from [] to []
'first' seq:0, type:22 => :hello attr_name1:foo attr_name2:bar
'second' seq:0, type:22 => attr_one:two attr_three:four
'third' seq:0, type:1 => baz
```

Reviewed By: ltamasi

Differential Revision: D48837999

Pulled By: jaykorean

fbshipit-source-id: b0280f0589d2b9716bb9b50530ffcabb397d140f
Summary:
For a SST file that uses user-defined timestamp aware comparators, if a lower or upper bound is set, sst_dump tool doesn't handle it well. This PR adds support for that. While working on this `MaybeAddTimestampsToRange` is moved to the udt_util.h file to be shared.

Pull Request resolved: facebook#11757

Test Plan:
make all check
for changes in db_impl.cc and db_impl_compaction_flush.cc

for changes in sst_file_dumper.cc, I manually tested this change handles specifying bounds for UDT use cases. It probably should have a unit test file eventually.

Reviewed By: ltamasi

Differential Revision: D48668048

Pulled By: jowlyzhang

fbshipit-source-id: 1560465f40e44668d6d82a7439fe9012be0e74a8
Summary:
**Context/Summary:**
After facebook#11631, we rely on `compaction_readahead_size` for how much to read ahead for compaction read under non-direct IO case. facebook#11658 therefore also sanitized 0 `compaction_readahead_size` to 2MB under non-direct IO, which is consistent with the existing sanitization with direct IO.

However, this makes disabling compaction readahead impossible as well as add one more scenario to the inconsistent effects between `Options.compaction_readahead_size=0` during DB open and `SetDBOptions("compaction_readahead_size", "0")` .
- `SetDBOptions("compaction_readahead_size", "0")` will disable compaction readahead as its logic never goes through sanitization above while `Options.compaction_readahead_size=0` will go through sanitization.

Therefore we decided to do this PR.

Pull Request resolved: facebook#11762

Test Plan: Modified existing UTs to cover this PR

Reviewed By: ajkr

Differential Revision: D48759560

Pulled By: hx235

fbshipit-source-id: b3f85e58bda362a6fa1dc26bd8a87aa0e171af79
Summary:
... in info_log. Becoming more important with disaggregated storage.

Pull Request resolved: facebook#11776

Test Plan: manual

Reviewed By: jaykorean

Differential Revision: D48849471

Pulled By: pdillinger

fbshipit-source-id: 9a8fd8b2564a4f133526ecd7c1414cb667e4ba54
facebook#11777)

Summary:
As mentioned in facebook#11754 , refactor to clean up some nearly identical logic. This PR changes the existing debugging string format of Scan command as the following.

```
❯ ./ldb --db=/tmp/rocksdbtest-226125/db_wide_basic_test_2675429_2308393776696827948/ scan --hex
```

Before
```
0x6669727374 : :0x68656C6C6F 0x617474725F6E616D6531:0x666F6F 0x617474725F6E616D6532:0x626172
0x7365636F6E64 : 0x617474725F6F6E65:0x74776F 0x617474725F7468726565:0x666F7572
0x7468697264 : 0x62617A
```
After
```
0x6669727374 ==> :0x68656C6C6F 0x617474725F6E616D6531:0x666F6F 0x617474725F6E616D6532:0x626172
0x7365636F6E64 ==> 0x617474725F6F6E65:0x74776F 0x617474725F7468726565:0x666F7572
0x7468697264 ==> 0x62617A
```

Pull Request resolved: facebook#11777

Test Plan:
```
❯ ./ldb --db=/tmp/rocksdbtest-226125/db_wide_basic_test_2675429_2308393776696827948/ dump
first ==> :hello attr_name1:foo attr_name2:bar
second ==> attr_one:two attr_three:four
third ==> baz
Keys in range: 3

❯ ./ldb --db=/tmp/rocksdbtest-226125/db_wide_basic_test_2675429_2308393776696827948/ scan
first ==> :hello attr_name1:foo attr_name2:bar
second ==> attr_one:two attr_three:four
third ==> baz

❯ ./ldb --db=/tmp/rocksdbtest-226125/db_wide_basic_test_2675429_2308393776696827948/ dump --hex
0x6669727374 ==> :0x68656C6C6F 0x617474725F6E616D6531:0x666F6F 0x617474725F6E616D6532:0x626172
0x7365636F6E64 ==> 0x617474725F6F6E65:0x74776F 0x617474725F7468726565:0x666F7572
0x7468697264 ==> 0x62617A
Keys in range: 3

❯ ./ldb --db=/tmp/rocksdbtest-226125/db_wide_basic_test_2675429_2308393776696827948/ scan --hex
0x6669727374 ==> :0x68656C6C6F 0x617474725F6E616D6531:0x666F6F 0x617474725F6E616D6532:0x626172
0x7365636F6E64 ==> 0x617474725F6F6E65:0x74776F 0x617474725F7468726565:0x666F7572
0x7468697264 ==> 0x62617A
```

Reviewed By: jowlyzhang

Differential Revision: D48876755

Pulled By: jaykorean

fbshipit-source-id: b1c608a810fe038999ac528b690d398abf5f21d7
Summary:
This happens in (Compaction)MergingIterator layer, and can cause data loss during compaction or read/scan return incorrect result

Pull Request resolved: facebook#11782

Reviewed By: ajkr

Differential Revision: D48880575

Pulled By: cbi42

fbshipit-source-id: 2294ad284a6d653d3674bebe55380f12ee4b645b
…users (facebook#11786)

Summary:
This should only affect iterator when
- user uses DeleteRange(),
- An iterator from level L has a non-ok status (such non-ok status may not be caught before the bug fix in facebook#11783), and
- A range tombstone covers a key from level > L and triggers a reseek sets the status_ to OK in SeekImpl()/SeekPrevImpl() e.g. https://github.com/facebook/rocksdb/blob/bd6a8340c3a2db764620e90b3ac5be173fc68a0c/table/merging_iterator.cc#L801

Pull Request resolved: facebook#11786

Differential Revision: D48908830

Pulled By: cbi42

fbshipit-source-id: eb564be375af4e33dc27542eff753260186e6d5d
brodyhuang and others added 27 commits November 13, 2023 12:09
Summary:
When I call `DBWithTTLImpl::Resume()`, it returns `Status::NotSupported`.  Did `StackableDB` miss this API ?
Thanks !

Pull Request resolved: facebook#12060

Reviewed By: jaykorean

Differential Revision: D51202742

Pulled By: ajkr

fbshipit-source-id: 5e01a54a42efd81fd57b3c992b9af8bc45c59c9c
Summary:
When delay didn't happen, histogram WRITE_STALL is still recorded, and ticker STALL_MICROS is not recorded.

This is a bug, neither WRITE_STALL or STALL_MICROS should not be recorded when delay did not happen.

Pull Request resolved: facebook#12067

Reviewed By: cbi42

Differential Revision: D51263133

Pulled By: ajkr

fbshipit-source-id: bd82d8328fe088d613991966e83854afdabc6a25
…k#12057)

Summary:
- Add missing null check for ColumnFamilyHandle in `GetEntity()`
- `FailIfCfHasTs()` now returns `Status::InvalidArgument()` if `column_family` is null. `MultiGetEntity()` can rely on this for cfh null check.
- Added `DeleteRange` API using Default Column Family to be consistent with other major APIs (This was also causing Java Test failure after the `FailIfCfHasTs()` change)

Pull Request resolved: facebook#12057

Test Plan:
- Updated `DBWideBasicTest::GetEntityAsPinnableAttributeGroups` to include null CF case
- Updated `DBWideBasicTest::MultiCFMultiGetEntityAsPinnableAttributeGroups` to include null CF case

Reviewed By: jowlyzhang

Differential Revision: D51167445

Pulled By: jaykorean

fbshipit-source-id: 1c1e44fd7b7df4d2dc3bb2d7d251da85bad7d664
…Overlap` (facebook#12064)

Summary:
Fixes facebook#11909. The test passes after the change in facebook#11917 to start mock clock from a non-zero time.

The reason for test failing is a bit complicated:
- The Put here https://github.com/pdillinger/rocksdb/blob/e4ad4a0ef1b852dc203311fb885c673c891f08e0/db/compaction/tiered_compaction_test.cc#L2045 happens before mock clock advances beyond 0.
- This causes oldest_key_time_ to be 0 for memtable.
- oldest_ancester_time of the first L0 file becomes 0
- L0 -> L5/6 compaction output files sets `oldest_ancestoer_time` to the current time due to these lines: https://github.com/facebook/rocksdb/blob/509947ce2c970d296fd0d868455d560c7f778a57/db/compaction/compaction_job.cc#L1898C34-L1904.
- This causes some small sequence number to be mapped to current time: https://github.com/facebook/rocksdb/blob/509947ce2c970d296fd0d868455d560c7f778a57/db/compaction/compaction_job.cc#L301
- Keys in L6 is being moved up to L5 due to the unexpected seqno_to_time mapping
- When compacting keys from last level to the penultimate level, we only check keys to be within user key range of penultimate level input files. If we compact the following file 3 with file 1 and output keys to L5, we can get the reported inconsistency bug.
```
L5: file 1 [K5@20, K10@kMaxSeqno], file 2 [K10@30, K14@34)
L6: file 3 [K6@5, K10@20]
```

facebook#12063 will add fixes to check internal key range when compacting keys from last level up to the penultimate level.

Pull Request resolved: facebook#12064

Test Plan: the unit test passes

Reviewed By: ajkr

Differential Revision: D51281149

Pulled By: cbi42

fbshipit-source-id: 00b7f026c453454d9f3af5b2de441383a96f0c62
Summary:
Fixes facebook#11457.

Pull Request resolved: facebook#12066

Reviewed By: cbi42

Differential Revision: D51259966

Pulled By: ajkr

fbshipit-source-id: a158b6f341b6b48233d917bfe4d00b639dbd8619
Summary: Pull Request resolved: facebook#12070

Reviewed By: jaykorean

Differential Revision: D51307148

Pulled By: ajkr

fbshipit-source-id: d04335506becd5970802f87ab0573b6307479222
Summary:
At the moment RocksDBJava uses the default CIrcleCI JVM on Windows builds. This can and has changed in the past and can cause some incompatibilities.

This PR addresses the problem of explicitly installing and using Liberica JDK 8 as Java 8 Is the primary target for RocksdbJava.

Pull Request resolved: facebook#12068

Reviewed By: cbi42

Differential Revision: D51307233

Pulled By: ajkr

fbshipit-source-id: 9cb4e173d8a9ac42e5f9fda1daf012302942fdbc
…cebook#12059)

Summary:
`CacheWithSecondaryAdapter` can distribute placeholder reservations across the primary and secondary caches. The current implementation of the accounting is quite complicated in order to avoid using a mutex. This may cause the accounting to be slightly off after changes to the cache capacity and ratio, resulting in assertion failures. There's also a bug in the unlikely event that the total reservation exceeds the cache capacity. Furthermore, the current implementation is difficult to reason about.

This PR simplifies it by doing the accounting while holding a mutex. The reservations are processed in 1MB chunks in order to avoid taking a lock too frequently. As a side effect, this also removes the restriction of not allowing to increase the compressed secondary cache capacity after decreasing it to 0.

Pull Request resolved: facebook#12059

Test Plan: Existing unit tests, and a new test for capacity increase from 0

Reviewed By: pdillinger

Differential Revision: D51278686

Pulled By: anand1976

fbshipit-source-id: 7e1ad2c50694772997072dd59cab35c93c12ba4f
Summary:
Fixes facebook#11000.

That issue pointed out that RocksDB was slow to delete archived WALs in case time-based and size-based expiration were enabled, and the time-based threshold (`WAL_ttl_seconds`) was small. This PR prevents the delay by taking into account `WAL_ttl_seconds` when deciding the frequency to process archived WALs for deletion.

Pull Request resolved: facebook#12069

Reviewed By: pdillinger

Differential Revision: D51262589

Pulled By: ajkr

fbshipit-source-id: e65431a06ee96f4c599ba84a27d1aedebecbb003
Summary:
I want to use the `WriteBufferManager` in my rust project, which requires exposing it through the c api, just like `Cache` is.

Hopefully the changes are fairly straightfoward!

Pull Request resolved: facebook#11710

Reviewed By: cbi42

Differential Revision: D51166518

Pulled By: ajkr

fbshipit-source-id: cd266ff1e4a7ab145d05385cd125a8390f51f3fc
Summary: Pull Request resolved: facebook#12072

Reviewed By: cbi42

Differential Revision: D51398080

Pulled By: ajkr

fbshipit-source-id: 1043f2b012bd744e9c53c638e1ba56a3e0392e11
Summary:
Fix facebook#11510

Pull Request resolved: facebook#12065

Reviewed By: ajkr

Differential Revision: D51406695

Pulled By: cbi42

fbshipit-source-id: b9e32da5f9bcafb5365e4349f7295be90d5aa7ba
…te level (facebook#12063)

Summary:
The test failure in facebook#11909 shows that we may compact keys outside of internal key range of penultimate level input files from last level to penultimate level, which can potentially cause overlapping files in the penultimate level. This PR updates the  `Compaction::WithinPenultimateLevelOutputRange()` to check internal key range instead of user key.

Other fixes:
* skip range del sentinels when deciding output level for tiered compaction

Pull Request resolved: facebook#12063

Test Plan:
- existing unit tests
- apply the fix to facebook#11905 and run `./tiered_compaction_test --gtest_filter="*RangeDelsCauseFileEndpointsToOverlap*"`

Reviewed By: ajkr

Differential Revision: D51288985

Pulled By: cbi42

fbshipit-source-id: 70085db5f5c3b15300bcbc39057d57b83fd9902a
…facebook#12007)

Summary:
following facebook#11710
 - add test on wbm c api
- add a setter for WBM in `DBOptions`

Pull Request resolved: facebook#12007

Reviewed By: cbi42

Differential Revision: D51430042

Pulled By: ajkr

fbshipit-source-id: 608bc4d3ed35a84200459d0230b35be64b3475f7
Summary:
Required for open source repo.

Pull Request resolved: facebook#12076

Reviewed By: ajkr

Differential Revision: D51449839

Pulled By: cbi42

fbshipit-source-id: 4a25a3422880db3f28a2834d966341935db32530
Summary:
Fixes facebook#11218

Changes from facebook#10881 broke FreeBSD builds with:

    env/io_posix.h:39:9: error: 'POSIX_MADV_NORMAL' macro redefined [-Werror,-Wmacro-redefined]

This commit fixes FreeBSD builds by ignoring MADV defines.

Pull Request resolved: facebook#12078

Reviewed By: cbi42

Differential Revision: D51452802

Pulled By: ajkr

fbshipit-source-id: 0a1f5a90954e7d257a95794277a843ac77f3a709
…book#12081)

Summary:
`WithinPenultimateLevelOutputRange()` is updated in facebook#12063 to check internal key range. However, op_type of a key can change during compaction, e.g. MERGE -> PUT, which makes a key larger and becomes out of penultimate output range. This has caused stress test failures with error message "Unsafe to store Seq later than snapshot in the last level if per_key_placement is enabled". So update `WithinPenultimateLevelOutputRange()` to only check user key and sequence number.

Pull Request resolved: facebook#12081

Test Plan:
* This repro can produce the corruption within a few runs. Ran it a few times after the fix and did not see Corruption failure.
```
python3 ./tools/db_crashtest.py whitebox --test_tiered_storage --random_kill_odd=888887 --use_merge=1 --writepercent=100 --readpercent=0 --prefixpercent=0 --delpercent=0 --delrangepercent=0 --iterpercent=0 --write_buffer_size=419430 --column_families=1 --read_fault_one_in=0 --write_fault_one_in=0
```

Reviewed By: ajkr

Differential Revision: D51481202

Pulled By: cbi42

fbshipit-source-id: cad6b65099733e03071b496e752bbdb09cf4db82
Summary:
Add some asserts in the `CacheWithSecondaryAdapter` destructor to help debug a crash test failure.

Pull Request resolved: facebook#12082

Reviewed By: cbi42

Differential Revision: D51486041

Pulled By: anand1976

fbshipit-source-id: 76537beed31ba27ab9ac8b4ce6deb775629e3be5
Summary:
Fixes facebook#12079.

Fixed missing licenses in "\*.h" and "\*.cc" files

Pull Request resolved: facebook#12083

Reviewed By: cbi42

Differential Revision: D51489634

Pulled By: ajkr

fbshipit-source-id: 764bfee257b9d6603fd7606a55664b7537e1898f
Summary:
The option "write_buffer_size" has changed from 4MB for 64MB by default, and the compact_files_example will not work as expected, as the test data written is only about 50MB and will not trigger compaction.

Pull Request resolved: facebook#12084

Reviewed By: cbi42

Differential Revision: D51499959

Pulled By: ajkr

fbshipit-source-id: 4f4b25ebc4b6bb568501adc8e97813edcddceea8
…y feature (facebook#12062)

Summary:
These bugs surfaced while I was trying to add the stress test for the feature:

Bug 1) On the index building path: the optimization to use user key instead of internal key as separator needed a bit tweak for when user defined timestamps can be removed. Because even though the user key look different now and eligible to be used as separator, when their user-defined timestamps are removed, they could be equal and that invariant no longer stands.

Bug 2) On the index reading path: one path that builds the second level index iterator for `PartitionedIndexReader` are not passing the corresponding `user_defined_timestamps_persisted` flag. As a result, the default `true` value be used leading to no minimum timestamps padded when they should be.

Pull Request resolved: facebook#12062

Test Plan:
For bug 1): added separate unit test `BlockBasedTableReaderTest::Get` to exercise the `Get` API. It's a different code path from `MultiGet` so worth having its own test. Also in order to cover the bug, the test is modified to generate key values with the same user provided key, different timestamps and different sequence numbers. The test reads back different versions of the same user provided key.  `MultiGet` takes one `ReadOptions` with one read timestamp so we cannot test retrieving different versions of the same key easily.

For bug 2): simply added options `BlockBasedTableOptions.metadata_cache_options.partition_pinning = PinningTier::kAll` to exercise all the index iterator creating paths.

Reviewed By: ltamasi

Differential Revision: D51508280

Pulled By: jowlyzhang

fbshipit-source-id: 8b174d3d70373c0599266ac1f467f2bd4d7ea6e5
Summary:

Test Plan:

Reviewers:

Subscribers:

Tasks:

Tags:
…ingleton (facebook#12128)

Summary:
Pull Request resolved: facebook#12128

The patch turns the `Timer` Meyers singleton in `PeriodicTaskScheduler::Default()` into one of the leaky variety in order to prevent static destruction order issues.

Reviewed By: akankshamahajan15

Differential Revision: D51963950

fbshipit-source-id: 0fc34113ad03c51fdc83bdb8c2cfb6c9f6913948
@igorcanadi igorcanadi requested a review from seckcoder February 23, 2024 16:43
Copy link

@seckcoder seckcoder left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💯 glad that our recent changes didn't cause too much troubles in merging.

@igorcanadi igorcanadi merged commit 9b9c622 into master Feb 23, 2024
0 of 2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.