Commit 615de98

doc: Add error handling of SXM
Signed-off-by: Vincent Liu <[email protected]>
1 parent 4e34d8d commit 615de98

1 file changed

+102
-2
lines changed
  • doc/content/xapi/storage

doc/content/xapi/storage/sxm.md

Lines changed: 102 additions & 2 deletions
@@ -8,7 +8,14 @@ Title: Storage migration
88
- [But we have storage\_mux.ml](#but-we-have-storage_muxml)
99
- [Thought experiments on an alternative design](#thought-experiments-on-an-alternative-design)
1010
- [Design](#design)
11-
- [SMAPIv1 Migration](#smapiv1-migration)
11+
- [SMAPIv1 migration](#smapiv1-migration)
12+
- [SMAPIv3 migration](#smapiv3-migration)
13+
- [Error Handling](#error-handling)
14+
- [Preparation (SMAPIv1 and SMAPIv3)](#preparation-smapiv1-and-smapiv3)
15+
- [Snapshot and mirror failure (SMAPIv1)](#snapshot-and-mirror-failure-smapiv1)
16+
- [Mirror failure (SMAPIv3)](#mirror-failure-smapiv3)
17+
- [Copy failure (SMAPIv1)](#copy-failure-smapiv1)
18+
- [SMAPIv1 Migration implementation detail](#smapiv1-migration-implementation-detail)
1219
- [Receiving SXM](#receiving-sxm)
1320
- [Xapi code](#xapi-code)
1421
- [Storage code](#storage-code)
@@ -113,8 +120,100 @@ Note that later on storage_smapi{v1,v3}_migrate.ml will still have the flexibili
113120
to call remote SMAPIv2 functions, such as `Remote.VDI.attach dest_sr vdi`, and
114121
it will be handled just as before.
115122

123+
## SMAPIv1 migration
116124

117-
## SMAPIv1 Migration
125+
At a high level, mirror establishment for SMAPIv1 works as follows:
126+
127+
1. Take a snapshot of a VDI that is attached to VM1. This gives us an immutable
128+
copy of the current state of the VDI, with all the data up to the point we took
129+
the snapshot. This is illustrated in the diagram as a VDI and its snapshot connecting
130+
to a shared parent, which stores the shared content for the snapshot and the writable
131+
VDI from which we took the snapshot (snapshot)
132+
2. Mirror the writable VDI to the server hosts: this means that all writes that go to the
133+
client VDI will also be written to the mirrored VDI on the remote host (mirror)
134+
3. Copy the immutable snapshot from our local host to the remote (copy)
135+
4. Compose the mirror and the snapshot to form a single VDI
136+
5. Destroy the snapshot on the local host (cleanup)
137+
138+
139+
More detail to come...
140+
141+
## SMAPIv3 migration
142+
143+
More detail to come...
144+
145+
## Error Handling
146+
147+
Storage migration is a long-running process, and is prone to failures in each
148+
step. Hence it is important to specify which errors can be raised at each step
149+
and their significance. This is beneficial both for the user and for triaging.
150+
151+
There are two general cleanup functions in SXM: `MIRROR.receive_cancel` and
152+
`MIRROR.stop`. The former is for cleaning up whatever has been created by `MIRROR.receive_start`
153+
on the destination host (such as VDIs for receiving mirrored data). The latter is
154+
a more comprehensive function that attempts to "undo" all the side effects that
155+
were made during SXM; it also calls `receive_cancel` as part of its operation.
156+
157+
Currently, error handling is done by building up a list of cleanup functions in
158+
the `on_fail` list ref as the function executes. For example, if `receive_start`
159+
has been completed successfully, add `receive_cancel` to the list of cleanup functions.
160+
Whenever an exception is encountered, we just execute whatever has been added
161+
to the `on_fail` list ref. This is convenient, but does entangle all the error
162+
handling logic with the core SXM logic itself, making the code rather hard
163+
to understand and maintain.
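
The `on_fail` pattern described above can be sketched roughly as follows. This is a minimal illustration and not the code in xapi: `events`, `record`, `run_cleanups` and `migrate` are hypothetical names, the real `receive_start` and `receive_cancel` calls are replaced by log entries, and running the cleanups in reverse registration order is an assumption of this sketch.

```ocaml
(* Hypothetical sketch: none of these names are xapi's real API. *)

(* Trace of what happened, newest entry first. *)
let events : string list ref = ref []

let record msg = events := msg :: !events

(* Cleanup actions accumulated as the migration progresses. *)
let on_fail : (unit -> unit) list ref = ref []

let run_cleanups () =
  (* New cleanups are prepended, so the most recently registered one runs
     first (an assumption of this sketch). A failing cleanup must not stop
     the remaining ones from running. *)
  List.iter (fun f -> try f () with _ -> ()) !on_fail

let migrate ~fail_at_mirror =
  on_fail := [] ;
  try
    record "receive_start" ;
    (* receive_start succeeded, so register its undo action. *)
    on_fail := (fun () -> record "receive_cancel") :: !on_fail ;
    if fail_at_mirror then failwith "mirror failed" ;
    record "mirror" ;
    Ok ()
  with e ->
    run_cleanups () ;
    Error (Printexc.to_string e)
```

Note how the registration of each cleanup is interleaved with the main logic, which is the entanglement the paragraph above refers to.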
164+
165+
The idea to fix this is to introduce explicit "stages" during the SXM and define
166+
explicitly what error handling should be done if it fails at a certain stage. This
167+
helps separate the error handling logic into the `with` part of a `try with` block,
168+
which is where it is supposed to be. Since we need to accommodate the existing
169+
SMAPIv1 migration (which has more stages than SMAPIv3), the following stages are
170+
introduced: preparation (v1, v3), snapshot (v1), mirror (v1, v3), copy (v1). Note that
171+
each stage also roughly corresponds to a helper function that is called within `MIRROR.start`,
172+
which is the wrapper function that initiates storage migration. Each helper
173+
function also has its own error handling logic as
174+
needed (e.g. see `Storage_smapiv1_migrate.receive_start`) to deal with exceptions
175+
that happen within that helper function.
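
As a rough sketch of the staged approach, the mapping from a failed stage to its stage-level cleanup could look like the code below. It models the SMAPIv1 path only, and every name in it (`stage`, `Stage_failed`, `cleanup_for`, `run_stage`, `start`) is hypothetical; `receive_cancel` and `mirror_stop` are stand-ins that only record that they were called.

```ocaml
(* Hypothetical sketch of stage-level error handling (SMAPIv1 path only). *)
type stage = Preparation | Snapshot | Mirror | Copy

exception Stage_failed of stage

let events : string list ref = ref []

let record msg = events := msg :: !events

(* Stand-ins for MIRROR.receive_cancel and MIRROR.stop. *)
let receive_cancel () = record "receive_cancel"

let mirror_stop () = record "stop"

(* What the top-level function does when a given stage fails. *)
let cleanup_for = function
  | Preparation ->
      () (* receive_start handles its own partial failures *)
  | Snapshot | Mirror ->
      receive_cancel ()
  | Copy ->
      mirror_stop () (* the real stop also calls receive_cancel *)

(* Simulate a stage; [fails] names the stage that should raise, if any. *)
let run_stage stage ~fails =
  if fails = Some stage then raise (Stage_failed stage)

let start ~fails =
  try
    List.iter (fun s -> run_stage s ~fails) [Preparation; Snapshot; Mirror; Copy] ;
    Ok ()
  with Stage_failed s ->
    (* All stage-level cleanup lives in the handler, not in the main flow. *)
    cleanup_for s ;
    Error s
```

For example, `start ~fails:(Some Copy)` returns `Error Copy` after recording a single call to the `stop` stand-in.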
176+
177+
### Preparation (SMAPIv1 and SMAPIv3)
178+
179+
The preparation stage generally corresponds to what is done in `receive_start`, and
180+
this function handles exceptions itself when there is a partial failure within
181+
it, such as an exception raised after the receiving VDI has been created.
182+
It will use the old-style `on_fail` mechanism, but only with a limited scope.
183+
184+
There is nothing to be done at a higher level (i.e. within `MIRROR.start` which
185+
calls `receive_start`) if preparation has failed.
186+
187+
### Snapshot and mirror failure (SMAPIv1)
188+
189+
For SMAPIv1, the mirror is done in a somewhat cumbersome way. The end goal is to establish
190+
connections between two tapdisk processes on the source and destination hosts.
191+
To achieve this goal, xapi will do two main jobs: 1. create a connection between two
192+
hosts and pass the connection to tapdisk; 2. create a snapshot as a starting point
193+
of the mirroring process.
194+
195+
Therefore the handling of failures at these two stages is similar: clean up what was
196+
done in the preparation stage by calling `receive_cancel`, and that is almost it.
197+
Again, we will leave whatever is needed for partial failure handling within those
198+
functions themselves and only clean up at the stage level in `storage_migrate.ml`.
199+
200+
Note that `receive_cancel` is a multiplexed function for SMAPIv1 and SMAPIv3, which
201+
means different cleanup logic will be executed depending on what type of SR we
202+
are migrating from.
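
One way to picture this multiplexing is a dispatch on the SR's backend type, sketched below with first-class modules. This is only an illustration under assumed names: the `backend` type, the `MIRROR` signature shown here, and the `V1`/`V3` modules are hypothetical and much simpler than the real interfaces, and the returned strings stand in for the actual cleanup work.

```ocaml
(* Hypothetical sketch: names and signatures are illustrative only. *)
type backend = SMAPIv1 | SMAPIv3

module type MIRROR = sig
  val receive_cancel : unit -> string
end

module V1 : MIRROR = struct
  let receive_cancel () = "SMAPIv1 cleanup"
end

module V3 : MIRROR = struct
  let receive_cancel () = "SMAPIv3 cleanup"
end

(* Pick the implementation that matches the SR being migrated from. *)
let choose = function
  | SMAPIv1 -> (module V1 : MIRROR)
  | SMAPIv3 -> (module V3 : MIRROR)

(* Single entry point; the caller does not need to know which backend runs. *)
let receive_cancel backend =
  let module M = (val choose backend : MIRROR) in
  M.receive_cancel ()
```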
203+
204+
### Mirror failure (SMAPIv3)
205+
206+
To be filled...
207+
208+
### Copy failure (SMAPIv1)
209+
210+
The final step of storage migration for SMAPIv1 is to copy the snapshot from the
211+
source to the destination. At this stage, most of the side-effecting work has been
212+
done, so we do need to call `MIRROR.stop` to clean things up if we experience a
213+
failure during copying.
214+
215+
216+
## SMAPIv1 Migration implementation detail
118217

119218
```mermaid
120219
sequenceDiagram
@@ -1877,3 +1976,4 @@ let pre_deactivate_hook ~dbg ~dp ~sr ~vdi =
18771976
s.failed <- true
18781977
)
18791978
```
1979+
