
Improve failure mode, add multiple DCs #1273


Draft · wants to merge 3 commits into base: main
14 changes: 8 additions & 6 deletions pages/clustering/high-availability.mdx
@@ -601,8 +601,8 @@ the wrong state of other clusters, it can become a leader without being connecte

## Recovering from errors

- Distributed systems can fail in numerous ways. With the current implementation, Memgraph instances are resilient to occasional network
- failures and independent machine failures. Byzantine failures aren't handled since the Raft consensus protocol cannot deal with them either.
+ Distributed systems can fail in numerous ways. Memgraph processes are resilient to network
+ failures, omission faults, and independent machine failures. Byzantine failures aren't tolerated, since the Raft consensus protocol cannot handle them either.

Recovery Time Objective (RTO) is an often used term for measuring the maximum tolerable length of time that an instance or cluster can be down.
Since every highly available Memgraph cluster has two types of instances, we need to analyze the failures of each separately.
@@ -618,9 +618,6 @@ and the time needed to realize the instance is down (`--instance-down-timeout-se
using just a handful of RPC messages (the exact time depends on the network distance between instances). It is important to note that the whole failover is performed without the loss of committed data,
provided the newly chosen MAIN (previously a REPLICA) had all up-to-date data.
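The RTO reasoning above can be sketched as a quick back-of-the-envelope calculation. Everything here is illustrative: the helper name, the per-message RPC cost, and the numeric defaults are assumptions, not documented Memgraph values.

```python
# Hypothetical worst-case RTO estimate for a REPLICA -> MAIN failover:
# the instance fails right after a successful health check, so detection
# takes one full health-check period plus the down-timeout, and the
# failover itself costs a handful of RPC round trips.
def estimate_failover_rto_ms(
    health_check_ms: int = 1000,       # assumed leader health-check period
    instance_down_timeout_ms: int = 5000,  # assumed down-timeout, in ms
    rpc_messages: int = 4,             # "a handful" of failover RPCs (assumed)
    rpc_round_trip_ms: int = 10,       # per-message latency between instances
) -> int:
    detection = health_check_ms + instance_down_timeout_ms
    failover = rpc_messages * rpc_round_trip_ms
    return detection + failover

print(estimate_failover_rto_ms())  # 6040 ms with the assumed defaults
```

The detection term dominates, which is why tuning the down-timeout is the main lever on RTO.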

- Current deployment assumes the existence of only one datacenter, which automatically means that Memgraph won't be available in the case the whole datacenter goes down. We are actively
- working on 2 datacenter (2-DC) architecture.

## Raft configuration parameters

Several Raft-related parameters are important for the correct functioning of the cluster. The leader coordinator sends a heartbeat
@@ -630,9 +627,14 @@ expiration is set to 2000ms so that cluster can never get into situation where m
the ability to survive occasional network hiccups without triggering leadership changes.
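The timing invariant this paragraph describes can be sketched as follows. The 2000 ms expiration comes from the text above; the heartbeat interval is an assumed value, not a documented Memgraph default.

```python
# Sketch of the Raft timing invariant: heartbeats must keep arriving well
# within the leader expiration window, so a short network hiccup (a few
# dropped heartbeats) does not trigger a leadership change.
HEARTBEAT_INTERVAL_MS = 100   # assumed leader heartbeat period
LEADER_EXPIRATION_MS = 2000   # expiration from the docs above

def survives_hiccup(missed_heartbeats: int) -> bool:
    """True if followers still consider the leader alive after the hiccup."""
    return missed_heartbeats * HEARTBEAT_INTERVAL_MS < LEADER_EXPIRATION_MS

print(survives_hiccup(5))   # True: 500 ms of silence is within the window
print(survives_hiccup(25))  # False: 2500 ms of silence starts a new election
```

A large expiration-to-heartbeat ratio is what buys tolerance to occasional hiccups at the cost of slower leader-failure detection.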


+ ## Data center failure
+
+ The current architecture allows deploying coordinators across 3 data centers and hence tolerating the failure of a whole data center. Data instances can be
+ distributed between data centers in any way you want. The failover time will be slightly increased due to the extra network communication needed.
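The quorum arithmetic behind the 3-data-center recommendation can be shown in a few lines; the one-coordinator-per-DC layout below is illustrative.

```python
# Raft needs a strict majority of coordinators to elect and keep a leader.
# With one coordinator per data center, losing an entire DC still leaves
# 2 of 3 coordinators alive, so the cluster keeps a quorum.
def has_quorum(alive: int, total: int) -> bool:
    return alive > total // 2

coordinators_per_dc = {"dc1": 1, "dc2": 1, "dc3": 1}
total = sum(coordinators_per_dc.values())

alive_after_one_dc_down = total - coordinators_per_dc["dc3"]
print(has_quorum(alive_after_one_dc_down, total))  # True: 2 of 3 remain

alive_after_two_dcs_down = total - coordinators_per_dc["dc2"] - coordinators_per_dc["dc3"]
print(has_quorum(alive_after_two_dcs_down, total))  # False: 1 of 3 is not a majority
```

The same arithmetic explains why 2 data centers do not help: whichever DC hosts the majority of coordinators becomes a single point of failure.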

## Kubernetes

- We support deploying Memgraph HA instances as part of the Kubernetes cluster.
+ We support deploying Memgraph HA as part of a Kubernetes cluster through Helm charts.
You can see example configurations [here](/getting-started/install-memgraph/kubernetes#memgraph-high-availability-helm-chart).

## Docker Compose