Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Initial disk mode support #401

Merged
merged 6 commits into from
Nov 23, 2022
Merged

Conversation

MathieuBordere
Copy link
Contributor

@MathieuBordere MathieuBordere commented Sep 15, 2022

WIP - don't review

  • Have tried to run all existing tests with the new disk vfs. Most failures come from the fsm snapshot functionality not being in place for the on-disk case.
  • Not yet 100% sure about the abstraction, I feel like a user should be able to choose the VFS per database, but currently I just put the whole of dqlite in disk-mode, i.e. every database will be stored on disk. If the snapshotting behaviour is different for in-memory/on-disk, different VFS's for different databases could lead to complex snapshotting behaviour in raft, where we have to mix different methods to snapshot. Would like to avoid that.
  • SYNCs are still turned off, the real transaction is the raft log being stored to disk, the database writes to disk from SQLite don't have to be synced I think.
  • Still needs a bunch of cleanup.

@codecov
Copy link

codecov bot commented Oct 28, 2022

Codecov Report

Merging #401 (b838f0f) into master (b10ee0e) will increase coverage by 0.35%.
The diff coverage is 77.64%.

@@            Coverage Diff             @@
##           master     #401      +/-   ##
==========================================
+ Coverage   73.87%   74.22%   +0.35%     
==========================================
  Files          31       32       +1     
  Lines        4685     5378     +693     
  Branches     1467     1680     +213     
==========================================
+ Hits         3461     3992     +531     
- Misses        734      823      +89     
- Partials      490      563      +73     
Impacted Files Coverage Δ
src/db.c 52.56% <53.84%> (-4.11%) ⬇️
src/server.c 68.19% <69.56%> (-1.06%) ⬇️
src/dqlite.c 84.84% <75.00%> (-5.63%) ⬇️
src/leader.c 69.96% <75.00%> (ø)
src/vfs.c 83.77% <77.33%> (-2.37%) ⬇️
src/fsm.c 75.00% <82.15%> (+3.94%) ⬆️
src/lib/fs.c 83.33% <83.33%> (ø)
src/config.c 95.65% <100.00%> (+0.65%) ⬆️
src/transport.c 69.93% <0.00%> (-0.66%) ⬇️
... and 1 more

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

@MathieuBordere MathieuBordere force-pushed the disk-mode branch 2 times, most recently from 2f070d5 to 53e9696 Compare October 28, 2022 14:50
@MathieuBordere
Copy link
Contributor Author

This PR introduces a "disk-mode" for dqlite where the main SQLite database is
stored in a regular SQLite file on disk instead of storing the pages of the
database in memory.

disk-mode is considered experimental. Users can try it out and report any
issues with the new SQLite VFS or raft fsm.

Design considerations:

  • The SQLite database files live in a folder database in the raft dir. This
    folder is emptied on every startup. The user is urged not to interact
    directly with the database file while the system is running.
    The database file is not the complete picture of the SQLite database. The
    WAL still lives in memory and is not persisted to disk.
  • Snapshots are taken using the async snapshot interface of raft and are taken
    in 3 (simplified) steps.
    1. copy the WAL and disable checkpoints, this guarantees the contents of the
      disk file will not be overwritten during snapshotting.
    2. mmap the database files and write the snapshot to disk
    3. enable checkpoints again and clean up resources.

Known limitations:

  • The blocking SQLite API calls are executed in the raft/dqlite main loop,
    which can lead to blocking the main event loop. One solution to fix this
    could involve the use of coroutines (to be investigated).
  • Raft snapshots of the whole database are still loaded in memory in 1 chunck
    when restoring the database. This happens on startup or when installing a
    snapshot after receiving a raft InstallSnapshot RPC. So, while memory usage
    should decrease greatly when using this mode, large memory spikes will still
    occur. libraft will have to be adapted to allow.
    a) Sending a snapshot in chunks, with the receiver of the snapshot
    accumulating the chunks on disk.
    b) Restoring a snapshot in chunks.
    c) Taking a snapshot in chunks.

@MathieuBordere MathieuBordere changed the title DRAFT / WIP: disk-mode disk-mode - 1st iteration Oct 28, 2022
@MathieuBordere MathieuBordere marked this pull request as ready for review October 28, 2022 14:53
Copy link
Contributor

@freeekanayaka freeekanayaka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just skimmed the diff and left a comment. Overall, I like the clear-cut in the vfs.c implementation, where the new disk mode does not mix with the the old in-memory mode, so we can experiment without the risk of regression.

I'll try to give a closer look in the next days.

@freeekanayaka
Copy link
Contributor

I believe supporting disk mode as a global on/off switch (vs a per-database one) is fine. We don't support multiple databases anyway at the moment I think.

Copy link
Contributor

@freeekanayaka freeekanayaka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, with the caveats already explained in the PR description.

@MathieuBordere
Copy link
Contributor Author

I believe supporting disk mode as a global on/off switch (vs a per-database one) is fine. We don't support multiple databases anyway at the moment I think.

Thanks, I thought we did support multiple databases, I tested out a few cases and it appeared to be working (might have been a lucky shot, don't know). Will give it a closer look.

@MathieuBordere
Copy link
Contributor Author

MathieuBordere commented Nov 15, 2022

I've run a Jepsen test-suite run against this branch and noticed a - very - frequent failure, will try to figure it out first before merging this, will convert to Draft to inhibit accidental merge.

@MathieuBordere MathieuBordere marked this pull request as draft November 15, 2022 07:43
@freeekanayaka
Copy link
Contributor

I believe supporting disk mode as a global on/off switch (vs a per-database one) is fine. We don't support multiple databases anyway at the moment I think.

Thanks, I thought we did support multiple databases, I tested out a few cases and it appeared to be working (might have been a lucky shot, don't know). Will give it a closer look.

It's kind of surprising, given the code here:

https://github.com/canonical/dqlite/blob/master/src/gateway.c#L130

but perhaps it works when using different connections.

@MathieuBordere
Copy link
Contributor Author

I believe supporting disk mode as a global on/off switch (vs a per-database one) is fine. We don't support multiple databases anyway at the moment I think.

Thanks, I thought we did support multiple databases, I tested out a few cases and it appeared to be working (might have been a lucky shot, don't know). Will give it a closer look.

It's kind of surprising, given the code here:

https://github.com/canonical/dqlite/blob/master/src/gateway.c#L130

but perhaps it works when using different connections.

Indeed, when using multiple connections.

@freeekanayaka
Copy link
Contributor

I've run a Jepsen test-suite run against this branch and noticed a - very - frequent failure, will try to figure it out first before merging this, will convert to Draft to inhibit accidental merge.

Could it be that the event look gets blocked for too long? For example when a checkpoint happens I'd expect the event loop to be blocked for a relatively long time, since we might need to write as many as 1000 pages to disk (the default checkpoint threshold is 1000).

Just thinking loud.

@MathieuBordere
Copy link
Contributor Author

MathieuBordere commented Nov 16, 2022

I've run a Jepsen test-suite run against this branch and noticed a - very - frequent failure, will try to figure it out first before merging this, will convert to Draft to inhibit accidental merge.

Could it be that the event look gets blocked for too long? For example when a checkpoint happens I'd expect the event loop to be blocked for a relatively long time, since we might need to write as many as 1000 pages to disk (the default checkpoint threshold is 1000).

Just thinking loud.

I first made the implementation with sync snapshots and later converted to async snapshots, but I did a sloppy job and raft copies the WAL in a uv worker thread while it is actively being written, while it should have copied the WAL in the main thread. Fixing it.

@MathieuBordere
Copy link
Contributor Author

Jepsen tests are looking good, still investigating a failure that occurs with the disk-nemesis, possibly not a bug in this PR but in managing the behaviour when a raft node converts to RAFT_UNAVAILABLE after failing to apply a command to the state machine.

Mathieu Borderé added 6 commits November 21, 2022 11:05
The directory will be used to store the SQLite database.

Signed-off-by: Mathieu Borderé <[email protected]>
Signed-off-by: Mathieu Borderé <[email protected]>
Signed-off-by: Mathieu Borderé <[email protected]>
Signed-off-by: Mathieu Borderé <[email protected]>
@MathieuBordere
Copy link
Contributor Author

The remaining failure in the Jepsen suite is related to canonical/go-dqlite#213 . apply_frames fails because sqlite3_open_v2 fails due to the disk being full and the node becomes unavailable. This happens to every node in the cluster, resulting in an unavailable cluster.

@MathieuBordere MathieuBordere marked this pull request as ready for review November 23, 2022 14:48
@MathieuBordere
Copy link
Contributor Author

I think this can be merged. The failure looks not to be related to this PR.

@stgraber stgraber changed the title disk-mode - 1st iteration Initial disk mode support Nov 23, 2022
@stgraber stgraber merged commit d35c087 into canonical:master Nov 23, 2022
@MathieuBordere MathieuBordere deleted the disk-mode branch December 9, 2022 10:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants