-
Notifications
You must be signed in to change notification settings - Fork 95
Object Chunking and Garbage Collection
The large file material here should be valid for Riak CS 1.0, and the GC implementation described is currently under development, to be released in Riak CS 1.1.
The simple goal is the support objects up to 5GB in size, with the same API and behavioral characteristics as S3. S3 offers eventually consistent access to objects, with no concurrency control other than last-write-wins. This does mean that there might be one "version" of a file being read on one machine, while another is accepting a write for the same file, and yet another is deleting the file (but who's view for the delete is yet another version). All of this has to happen while also garbage collecting overwritten and failed-uploaded files. To accomplish this, we make heavy use of sibling resolution and Riak's eventual consistency properties.
Riak CS splits files up into small blocks, and stores each block as a separate Riak object. Files less than the size of one block will simply be stored as a single block. There is a separate manifest object that is used to record metadata about the file, such as the key, MD5, and how many blocks have been written so far. Manifests and individual blocks are stored in different buckets. A user bucket called "foo", will, for example, correspond to two buckets in Riak, one for the manifests, and one for the blocks.
There is a manifest for each PUT operation
on a particular key, each with a different
UUID. Because of things like garbage
collection and concurrent PUTs, the object stored at the
manifest {Bucket, Key} is a collection of manifests (an orddict),
not just a single manifest. Furthermore, {allow_mult, true}
is
set on this bucket, so there may be several siblings, each which
are a collection of manifests. Depending on the operation, a piece
of code may examine all of the manifests, or only a particular one.
The erlang record definition for manifests is well commented
here.
Individual blocks are immutable, as they include the UUID as part of their key. They can however, be deleted.
Manifests can be in one of four different states:
-
writing
: A manifest is in this state when it is first created, and as the blocks are being written to Riak. -
active
: A manifest is in this state once all of the blocks have been written to Riak. This is the only state that a manifest can be in and serve GET requests. -
pending_delete
: When a user deletes a file or a file is overwritten, the manifest is first put in thepending_delete
state, before the move to the GC bucket is successful. Once the manifest has been successfully moved, it will go into thescheduled_delete
state. -
scheduled_delete
: A manifest is in this state when its information has been written to theriak-cs-gc
bucket thus scheduling the file blocks for deletion. The manifest stays around for as long asleeway_seconds
, and is pruned lazily.
When two manifests with the same UUID are in conflict
(because they came from manifest-collection siblings),
they are resolved with this code.
NOTE/TO-UPDATE: riak_moss_manifest_resolution.erl
has been changed by the gc147-gc feature branch.
-
GET
:- Retrieve manifests
- Resolve siblings
- Select all
active
manifests, and choose the one with the most recentwrite_start_time
(a property of the manifest). If there are noactive
manifests, return404
.
Note, we specifically don't write the resolved manifests back to Riak, as this can create a situation where to actors both resolve siblings, therefore creating siblings again. This follows the general rule-of-thumb, don't make a write to Riak just to resolve siblings.
-
PUT
:- Create a new manifest (with fresh UUID)
in the
writing
state. Once all of the blocks have been written, change the state toactive
. - Follow the same steps as in
DELETE
, to delete any manifests that this overwrites with the following exception: Manifests found in thewriting
state are not marked aspending_delete
unless theirlast_block_written_time
isleeway_seconds
in the past.
Note, each time the manifest is written back to Riak, we resolve siblings, and write back with the correct vector clock.
- Create a new manifest (with fresh UUID)
in the
-
DELETE
:- Retrieve manifests
- Resolve siblings
- Select all
active
andwriting
manifests, and mark them aspending_delete
. - "Move" these manifests to the GC bucket
- Mark them as
scheduled_delete
First, a reminder that for a given named object, multiple internal versions of that object may be stored in the system at one time. Each version of the object is accessible by an object manifest that includes a UUID identifying that particular version. There is at most one active manifest for a named object at any one time and the active version is the only one that is externally available to a Riak CS user.
Garbage collection of an object version involves several different actions. These actions can be divided into synchronous actions that occur while the user is waiting for notification of successful command completion and asynchronous actions that are not directly tied to user actions and occur in the background. These two action groups are described in more detail in the following sections.
There are two direct actions a user may take to initiate the garbage collection of an object version: overwriting the object with a new version or deleting the object.
When an object version is overwritten a new object manifest is written
with the state set to active
and this new version becomes what is
available to the user, but in the case of a delete the object is no
longer externally available.
Also, as part of the overwrite or delete action, a set of eligible
manifest versions are determined and the state of each eligible
manifest is changed to pending_delete
and the delete_marked_time
field is set to a time value representing the current time.
The method for compiling the list of eligible manifests is dependent on the operation.
For object overwrites, the previously active
manifest version is
selected along with any manifest versions that are in the writing
state where the last_block_written_time
field (or the
write_start_time
if last_block_written_time
is undefined) of the
manifest represents a time value greater than leeway_seconds
seconds
ago. If a manifest version remains in the writing
state for greater
than leeway_seconds
seconds, it is assumed that that manifest
version represents a failed upload attempt and therefore it is
acceptable to reap any object blocks that may have been
written. Manifest versions in the writing
state whose
last_block_written_time
has not exceeded the leeway_seconds
threshold are not deemed eligible because they could represent an
object version that is still in the progress of writing its blocks.
Object deletes are more straightforward. Since no object is externally
available to the user after the delete operation, then any manifest
versions in the active
or writing
state are eligible to be
cleaned up. There is no concern about reaping the object version
that is currently being written to become the next active
version.
Once the states of the eligible manifests have been updated to
pending_delete
the manifest information for any pending_delete
manifest versions are collected into a CRDT set and the set is written
as a value to the riak-cs-gc
bucket keyed by a time value
representing the current epoch time plus the leeway interval (i.e.
the leeway_seconds
configuration option). If that write is
successful then the state for each manifest in the set is updated to
scheduled_delete
. This indicates that the blocks of the object have
been scheduled for deletion by the garbage collection daemon and
avoids other manifest resolution processes for the object from
scheduling unnecessary deletions.
Once the manifest enters the scheduled_delete
state it remains as a
tombstone for a minimum of leeway_seconds
.
After these actions have been attempted, the synchronous portion of the garbage collection process is concluded and a response is return to the user who issued the request.
The asynchronous portion of the garbage collection process is
orchestrated by the garbage collection daemon that wakes up at
specific intervals and checks the riak-cs-gc
bucket for any
scheduled entries that are eligible for reaping. The daemon enters a
working state and begins to delete the object blocks associated with
the eligible keys and continues until all keys have been
processed. The duration of this working state varies depending on
the number of keys involved and the size of the objects they
represent. The daemon checks for messages after processing each object
so that the work interval may be manually interrupted if needed.
Deletion eligibility is determined using the key values in
the riak-cs-gc
bucket which are time values. If the current time
according to the daemon is later than the time represented by a key
plus the leeway interval (gc_seconds_per_slice
* num_leeway_slices
seconds), the blocks for the object manifest stored at that key are
eligible for deletion and the daemon attempts to delete them.
The daemon gathers the eligible keys for deletion by performing a
secondary index range query on the $key
index with a lower bound of
time 0 and an upper bound of the current time. This allows the
daemon to collect all the keys that are eligible for deletion and have
some way of accounting for clock skew.
Once the object blocks represented by a key in the riak-cs-gc
bucket
have all been deleted, the key is deleted from the riak-cs-gc
bucket.
The garbage collection daemon may be queried and manipulated using the
riak-cs-gc
script. The script is installed to the sbin
directory
along with the primary riak-cs
script. The available commands that can
be used with the riak-cs-gc
script are listed below. Running the script
with no command provided displays a list of the available commands.
- batch - Manually start garbage collection for a batch of eligible objects
- status - Get the current status of the garbage collection daemon. The output is dependent on the current state of the daemon.
- pause - Pause the current batch of object garbage collection. It has no effect if there is no active batch.
- resume - Resume a paused garbage collection batch. It has no effect if there is no previously paused batch.
Manifest versions are retrieved and updated by the
riak_moss_manifest_fsm
module with very few exceptions. This module
encapsulates the logic needed to retrieve the manifests, resolve any
conflicts due to siblings, and write updated manifest versions back to
Riak.
The actual deletion of the blocks of an object is managed by the
riak_cs_delete_fsm
module. It starts up a number of delete workers
(based on the configured delete concurrency) and passes off object
block information to those workers who in turn carry out the actual
delete operation for that block. The delete workers are instances of
the riak_moss_block_server
module.
Once a worker deletes a block it notifies the delete fsm and waits for notification about another block to delete. Once all blocks of an object are deleted then the delete fsm starts an instance of the manifest fsm to handle deleting the manifest version from the object manifest data structure and if there are no remaining manifest versions to delete the entire object manifest data structure. The goal of this final step is to avoid the cost of scanning through empty manifest keys that could linger indefinitely.
- An slow reader may have blocks GC'd as it is reading an object if the read exceeds the leeway interval.
- There is some reliance on system clocks and this could lead to object blocks being deleted earlier or later than their intended eligibility window dictates due to clock skew.
- A network partition (or machine failure) lasting longer than
leeway_seconds
could cause a manifest to "come back to life" and appear active, it would then continually serve requests whose blocks could not be found.
The GC implementation gives the deployer two knobs to tune things like slow, used previously:
leeway_seconds
gc_interval
gc_retry_interval
They are commented here.