Initial disk mode support #401

MathieuBordere · 2022-09-15T16:07:03Z

~~WIP - don't review~~

I have tried to run all existing tests with the new disk VFS. Most failures come from the fsm snapshot functionality not being in place for the on-disk case.
I'm not yet 100% sure about the abstraction. I feel a user should be able to choose the VFS per database, but currently I just put the whole of dqlite in disk mode, i.e. every database is stored on disk. If the snapshotting behaviour differs between in-memory and on-disk, different VFSs for different databases could lead to complex snapshotting behaviour in raft, where we would have to mix different snapshot methods. I would like to avoid that.
SYNCs are still turned off. The real transaction is the raft log being stored to disk, so I think the database writes from SQLite to disk don't have to be synced.
Still needs a bunch of cleanup.

codecov · 2022-10-28T12:54:05Z

Codecov Report

Merging #401 (b838f0f) into master (b10ee0e) will increase coverage by 0.35%.
The diff coverage is 77.64%.

@@            Coverage Diff             @@
##           master     #401      +/-   ##
==========================================
+ Coverage   73.87%   74.22%   +0.35%     
==========================================
  Files          31       32       +1     
  Lines        4685     5378     +693     
  Branches     1467     1680     +213     
==========================================
+ Hits         3461     3992     +531     
- Misses        734      823      +89     
- Partials      490      563      +73

| Impacted Files | Coverage Δ | |
| --- | --- | --- |
| src/db.c | `52.56% <53.84%> (-4.11%)` | ⬇️ |
| src/server.c | `68.19% <69.56%> (-1.06%)` | ⬇️ |
| src/dqlite.c | `84.84% <75.00%> (-5.63%)` | ⬇️ |
| src/leader.c | `69.96% <75.00%> (ø)` | |
| src/vfs.c | `83.77% <77.33%> (-2.37%)` | ⬇️ |
| src/fsm.c | `75.00% <82.15%> (+3.94%)` | ⬆️ |
| src/lib/fs.c | `83.33% <83.33%> (ø)` | |
| src/config.c | `95.65% <100.00%> (+0.65%)` | ⬆️ |
| src/transport.c | `69.93% <0.00%> (-0.66%)` | ⬇️ |
... and 1 more


MathieuBordere · 2022-10-28T14:52:45Z

This PR introduces a "disk-mode" for dqlite where the main SQLite database is
stored in a regular SQLite file on disk instead of storing the pages of the
database in memory.

disk-mode is considered experimental. Users can try it out and report any
issues with the new SQLite VFS or raft fsm.

Design considerations:

- The SQLite database files live in a folder named `database` in the raft dir.
  This folder is emptied on every startup. The user is urged not to interact
  directly with the database file while the system is running.
- The database file is not the complete picture of the SQLite database. The
  WAL still lives in memory and is not persisted to disk.
- Snapshots are taken using the async snapshot interface of raft, in 3
  (simplified) steps:
  1. Copy the WAL and disable checkpoints. This guarantees the contents of
     the disk file will not be overwritten during snapshotting.
  2. mmap the database files and write the snapshot to disk.
  3. Enable checkpoints again and clean up resources.
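
Step 2 above could be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `snapshot_copy_file` is a hypothetical helper that maps a file read-only with `mmap` and writes its contents to a destination path. It assumes checkpoints are already disabled (step 1), so the mapped file does not change during the copy.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper (not dqlite or raft API): copy the file at `src` to
 * `dst` by mapping `src` read-only. Returns 0 on success, -1 on error. */
int snapshot_copy_file(const char *src, const char *dst)
{
	struct stat st;
	char *map;
	size_t len;
	size_t off = 0;
	int in;
	int out;
	int rv = -1;

	assert(src != NULL && dst != NULL);

	in = open(src, O_RDONLY);
	if (in < 0) {
		return -1;
	}
	out = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0600);
	if (out < 0) {
		close(in);
		return -1;
	}
	if (fstat(in, &st) != 0) {
		goto done;
	}
	len = (size_t)st.st_size;
	if (len == 0) {
		/* mmap rejects zero-length mappings; nothing to copy. */
		rv = 0;
		goto done;
	}
	map = mmap(NULL, len, PROT_READ, MAP_PRIVATE, in, 0);
	if (map == MAP_FAILED) {
		goto done;
	}
	/* write() may be partial, so loop until all bytes are written. */
	while (off < len) {
		ssize_t n = write(out, map + off, len - off);
		if (n <= 0) {
			break;
		}
		off += (size_t)n;
	}
	if (off == len) {
		rv = 0;
	}
	munmap(map, len);
done:
	close(out);
	close(in);
	return rv;
}
```

A plain destination file is used here only to keep the sketch self-contained. The point of mapping instead of reading is that the database pages are not first copied into a separately allocated buffer.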

Known limitations:

- The blocking SQLite API calls are executed in the raft/dqlite main loop,
  which can block the main event loop. One possible fix could involve the
  use of coroutines (to be investigated).
- Raft snapshots of the whole database are still loaded into memory in one
  chunk when restoring the database. This happens on startup or when
  installing a snapshot after receiving a raft InstallSnapshot RPC. So, while
  memory usage should decrease greatly when using this mode, large memory
  spikes will still occur. libraft will have to be adapted to allow:
  a) Sending a snapshot in chunks, with the receiver of the snapshot
     accumulating the chunks on disk.
  b) Restoring a snapshot in chunks.
  c) Taking a snapshot in chunks.
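
To illustrate the receiving side of chunked sending, accumulating chunks on disk could look roughly like the sketch below. libraft has no such interface today, so `chunk_store` and its parameters are purely hypothetical. The sketch assumes each chunk arrives with its byte offset within the snapshot, so only one chunk needs to be held in memory at a time.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper (not libraft API): write one snapshot chunk of `len`
 * bytes at byte `offset` of the file open on `fd`. Returns 0 on success,
 * -1 on error. */
int chunk_store(int fd, off_t offset, const void *data, size_t len)
{
	const char *p = data;
	size_t done = 0;

	assert(fd >= 0 && data != NULL);

	/* pwrite() may be partial, so loop until the whole chunk is stored. */
	while (done < len) {
		ssize_t n = pwrite(fd, p + done, len - done, offset + (off_t)done);
		if (n <= 0) {
			return -1;
		}
		done += (size_t)n;
	}
	return 0;
}
```

Because `pwrite` takes an explicit offset, the receiver does not depend on the file position, and a retransmitted chunk simply overwrites the same byte range.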

freeekanayaka

I just skimmed the diff and left a comment. Overall, I like the clear cut in the vfs.c implementation, where the new disk mode does not mix with the old in-memory mode, so we can experiment without the risk of regression.

I'll try to take a closer look in the next few days.

src/vfs.c

freeekanayaka · 2022-11-13T09:37:50Z

I believe supporting disk mode as a global on/off switch (vs a per-database one) is fine. We don't support multiple databases anyway at the moment I think.

freeekanayaka

Looks good to me, with the caveats already explained in the PR description.

src/lib/fs.c

MathieuBordere · 2022-11-15T07:32:25Z

> I believe supporting disk mode as a global on/off switch (vs a per-database one) is fine. We don't support multiple databases anyway at the moment I think.

Thanks, I thought we did support multiple databases, I tested out a few cases and it appeared to be working (might have been a lucky shot, don't know). Will give it a closer look.

MathieuBordere · 2022-11-15T07:34:14Z

I've run the Jepsen test suite against this branch and noticed a very frequent failure. I'll try to figure it out before merging this, and will convert to Draft to prevent an accidental merge.

freeekanayaka · 2022-11-15T11:58:38Z

> > I believe supporting disk mode as a global on/off switch (vs a per-database one) is fine. We don't support multiple databases anyway at the moment I think.
>
> Thanks, I thought we did support multiple databases, I tested out a few cases and it appeared to be working (might have been a lucky shot, don't know). Will give it a closer look.

It's kind of surprising, given the code here:

https://github.com/canonical/dqlite/blob/master/src/gateway.c#L130

but perhaps it works when using different connections.

MathieuBordere · 2022-11-15T12:23:28Z

> > > I believe supporting disk mode as a global on/off switch (vs a per-database one) is fine. We don't support multiple databases anyway at the moment I think.
> >
> > Thanks, I thought we did support multiple databases, I tested out a few cases and it appeared to be working (might have been a lucky shot, don't know). Will give it a closer look.
>
> It's kind of surprising, given the code here:
>
> https://github.com/canonical/dqlite/blob/master/src/gateway.c#L130
>
> but perhaps it works when using different connections.

Indeed, when using multiple connections.

freeekanayaka · 2022-11-16T08:44:31Z

> I've run the Jepsen test suite against this branch and noticed a very frequent failure. I'll try to figure it out before merging this, and will convert to Draft to prevent an accidental merge.

Could it be that the event loop gets blocked for too long? For example, when a checkpoint happens I'd expect the event loop to be blocked for a relatively long time, since we might need to write as many as 1000 pages to disk (the default checkpoint threshold is 1000).

Just thinking out loud.

MathieuBordere · 2022-11-16T08:52:00Z

> > I've run the Jepsen test suite against this branch and noticed a very frequent failure. I'll try to figure it out before merging this, and will convert to Draft to prevent an accidental merge.
>
> Could it be that the event loop gets blocked for too long? For example, when a checkpoint happens I'd expect the event loop to be blocked for a relatively long time, since we might need to write as many as 1000 pages to disk (the default checkpoint threshold is 1000).
>
> Just thinking out loud.

I first implemented this with sync snapshots and later converted to async snapshots, but I did a sloppy job: raft copies the WAL in a uv worker thread while it is actively being written, whereas it should copy the WAL in the main thread. Fixing it.

MathieuBordere · 2022-11-18T14:49:35Z

Jepsen tests are looking good. I'm still investigating a failure that occurs with the disk-nemesis; it is possibly not a bug in this PR but in how we handle a raft node converting to RAFT_UNAVAILABLE after failing to apply a command to the state machine.

MathieuBordere · 2022-11-21T12:54:51Z

The remaining failure in the Jepsen suite is related to canonical/go-dqlite#213. apply_frames fails because sqlite3_open_v2 fails when the disk is full, and the node becomes unavailable. This happens to every node in the cluster, resulting in an unavailable cluster.

MathieuBordere · 2022-11-23T14:49:15Z

I think this can be merged. The failure does not appear to be related to this PR.

MathieuBordere force-pushed the disk-mode branch from 9395649 to 61a3d82 Compare October 28, 2022 12:49

MathieuBordere force-pushed the disk-mode branch 2 times, most recently from 2f070d5 to 53e9696 Compare October 28, 2022 14:50

MathieuBordere changed the title ~~DRAFT / WIP: disk-mode~~ disk-mode - 1st iteration Oct 28, 2022

MathieuBordere marked this pull request as ready for review October 28, 2022 14:53

MathieuBordere requested review from freeekanayaka and cole-miller October 28, 2022 14:59

freeekanayaka reviewed Oct 28, 2022

MathieuBordere force-pushed the disk-mode branch from 53e9696 to beea17f Compare October 31, 2022 11:16

MathieuBordere force-pushed the disk-mode branch from beea17f to 8500ef3 Compare November 10, 2022 12:51

freeekanayaka approved these changes Nov 13, 2022

MathieuBordere marked this pull request as draft November 15, 2022 07:43

MathieuBordere force-pushed the disk-mode branch from 8500ef3 to 17ff3ab Compare November 18, 2022 14:27

Mathieu Borderé added 6 commits November 21, 2022 11:05

vfs: Fix obsolete comment

eae25d3

test/lib/sqlite: initialize instead of shutdown

3befa98

Signed-off-by: Mathieu Borderé <[email protected]>

db config: Add directory

d0a03d3

The directory will be used to store the SQLite database. Signed-off-by: Mathieu Borderé <[email protected]>

vfs/fsm: Add disk methods

14e14c9

Signed-off-by: Mathieu Borderé <[email protected]>

test_fsm: Add disk fsm tests.

9fea33c

Signed-off-by: Mathieu Borderé <[email protected]>

test_vfs: Add disk vfs unit tests

b838f0f

Signed-off-by: Mathieu Borderé <[email protected]>

MathieuBordere force-pushed the disk-mode branch from 17ff3ab to b838f0f Compare November 21, 2022 10:08

MathieuBordere marked this pull request as ready for review November 23, 2022 14:48

stgraber changed the title ~~disk-mode - 1st iteration~~ Initial disk mode support Nov 23, 2022

stgraber merged commit d35c087 into canonical:master Nov 23, 2022

MathieuBordere deleted the disk-mode branch December 9, 2022 10:27
