.. _admin-disaster_recovery:

Disaster recovery
=================

The minimal fault-tolerant Tarantool configuration is a :ref:`replica set <replication-architecture>`
that includes a master and a replica, or two masters.
The basic recommendation is to configure all Tarantool instances in a replica set to create :ref:`snapshot files <index-box_persistence>` on a regular basis.

Here are action plans for typical crash scenarios.
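
For example, in the YAML configuration of Tarantool 3.x, snapshot creation can be tuned with the ``snapshot`` options. This is a sketch; the values below are illustrative, not recommendations:

.. code-block:: yaml

   snapshot:
     by:
       interval: 3600   # create a snapshot every hour
     count: 2           # keep the two most recent snapshot files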

.. _admin-disaster_recovery-master_replica:

Master-replica
--------------

.. _admin-disaster_recovery-master_replica_manual_failover:

Master crash: manual failover
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**Configuration:** master-replica (:ref:`manual failover <replication-master_replica_bootstrap>`).

**Problem:** The master has crashed.

**Actions:**

1. Ensure the master is stopped.
   For example, log in to the master machine and use ``tt stop``.

2. Configure a new replica set leader using the :ref:`<replicaset_name>.leader <configuration_reference_replicasets_name_leader>` option.

3. Reload the configuration on all instances using :ref:`config:reload() <config-module>`.

4. Make sure that the new replica set leader is a master using :ref:`box.info.ro <box_introspection-box_info>`.

5. On the new master, :ref:`remove the crashed instance from the '_cluster' space <replication-remove_instances-remove_cluster>`.

6. Set up a replacement for the crashed master on a spare host.
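
In a YAML-based configuration, steps 2 and 3 might look as follows. The group, replica set, and instance names (``group001``, ``replicaset001``, ``instance002``) are placeholders for the names used in your configuration:

.. code-block:: yaml

   groups:
     group001:
       replicasets:
         replicaset001:
           leader: instance002

After ``config:reload()``, ``box.info.ro`` should return ``false`` on the new leader.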

See also: :ref:`Performing manual failover <replication-controlled_failover>`.


.. _admin-disaster_recovery-master_replica_auto_failover:

Master crash: automated failover
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**Configuration:** master-replica (:ref:`automated failover <replication-bootstrap-auto>`).

**Problem:** The master has crashed.

**Actions:**

1. Use ``box.info.election`` to make sure a new master is elected automatically.
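
A sample check on the instance that is expected to become the leader. The prompt and the field values are illustrative, and the output is abridged:

.. code-block:: tarantoolsession

   app:instance002> box.info.election
   ---
   - state: leader
     vote: 2
     leader: 2
     term: 3
   ...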

2. On the new master, :ref:`remove the crashed instance from the '_cluster' space <replication-remove_instances-remove_cluster>`.

3. Set up a replacement for the crashed master on a spare host.

See also: :ref:`Testing automated failover <replication-automated-failover-testing>`.


.. _admin-disaster_recovery-master_replica_data_loss:

Data loss
~~~~~~~~~

**Configuration:** master-replica.

**Problem:** Some transactions are missing on the replica after the master has crashed.

**Actions:**

You lose the few transactions in the master
:ref:`write-ahead log file <index-box_persistence>` that may not have been
transferred to the replica before the crash. If you were able to salvage the master
``.xlog`` file, you may be able to recover these transactions:

1. Find out the instance UUID from the crashed master's :ref:`xlog <internals-wal>`:

   .. code-block:: console

      $ head -5 var/lib/instance001/*.xlog | grep Instance
      Instance: 9bb111c2-3ff5-36a7-00f4-2b9a573ea660

2. On the new master, use the UUID to find the position:

   .. code-block:: tarantoolsession

      app:instance002> box.info.vclock[box.space._cluster.index.uuid:select{'9bb111c2-3ff5-36a7-00f4-2b9a573ea660'}[1][1]]
      ---
      - 999
      ...

3. :ref:`Play the records <tt-play>` from the crashed ``.xlog`` to the new master, starting from the
   new master position:

   .. code-block:: console

      $ tt play 127.0.0.1:3302 var/lib/instance001/00000000000000000000.xlog \
                --from 1000 \
                --replica 1 \
                --username admin --password secret


.. _admin-disaster_recovery-master_master:

Master-master
-------------

**Configuration:** :ref:`master-master <replication-bootstrap-master-master>`.

**Problem:** One master has crashed.

**Actions:**

1. Let the load be handled by the other master alone.

2. Remove the crashed master from the replica set.

3. Set up a replacement for the crashed master on a spare host.
   Learn more from :ref:`Adding and removing instances <replication-master-master-add-remove-instances>`.
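
In a YAML-based configuration, step 2 might amount to deleting the crashed instance's entry and reloading the configuration. All names and addresses below are placeholders:

.. code-block:: yaml

   replicasets:
     replicaset001:
       instances:
         # the crashed instance001 entry has been removed
         instance002:
           iproto:
             listen:
             - uri: 127.0.0.1:3302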

.. _admin-disaster_recovery-data_loss:

Master-replica/master-master: data loss
---------------------------------------

**Configuration:** master-replica or master-master.

**Problem:** Data was deleted on one master, and this data loss was propagated to the other node (master or replica).

**Actions:**

1. Put all nodes in read-only mode.
   Depending on the :ref:`replication.failover <configuration_reference_replication_failover>` mode, this can be done as follows:

   - ``manual``: change the replica set leader to ``null``.
   - ``election``: set :ref:`replication.election_mode <configuration_reference_replication_election_mode>` to ``voter`` or ``off`` at the replica set level.
   - ``off``: set ``database.mode`` to ``ro``.

   Reload the configuration on all instances using the ``reload()`` function provided by the :ref:`config <config-module>` module.
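
For example, with ``replication.failover: off``, a sketch of the read-only setting applied at the instance level looks like this:

.. code-block:: yaml

   database:
     mode: ro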

2. Turn off deletion of expired checkpoints with :doc:`/reference/reference_lua/box_backup/start`.
   This prevents the Tarantool garbage collector from removing files
   made with older checkpoints until :doc:`/reference/reference_lua/box_backup/stop` is called.

3. Get the latest valid :ref:`.snap file <internals-snapshot>` and
   use the ``tt cat`` command to calculate at which LSN the data loss occurred.
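
For example, you might inspect the write-ahead logs for the unwanted delete operations. The path is a placeholder, and the available options may differ between ``tt`` versions (see ``tt cat --help``):

.. code-block:: console

   $ tt cat var/lib/instance001/*.xlog --show-system | less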

4. Start a new instance and use the :ref:`tt play <tt-play>` command to
   play to it the contents of the ``.snap`` and ``.xlog`` files up to the calculated LSN.

5. Bootstrap a new replica from the recovered master.

.. NOTE::

   The steps above are applicable only to data in the memtx storage engine.