[DOCS] Master cluster formation troubleshooting. Opster Migration #950

Open · wants to merge 3 commits into base: main

33 changes: 32 additions & 1 deletion troubleshoot/elasticsearch/discovery-troubleshooting.md
@@ -16,6 +16,33 @@

The following sections describe some common discovery and election problems.

## First-time cluster formation issues [discovery-bootstrap]

If your cluster has never successfully formed before and you see this message in the logs:

`Master node not discovered yet this node has not previously joined a bootstrapped cluster`

Contributor:
The message is `master not discovered yet, this node has not previously joined a bootstrapped cluster, and ...` (always with additional information).


This usually indicates a misconfiguration in your initial cluster settings. Note that this guidance applies to self-hosted clusters. In this case, verify the following:

Contributor:
I'm not sure about "usually". Often I see clusters failing to form due to connectivity issues.


1. The `discovery.seed_hosts` setting must contain the IP addresses or hostnames of other nodes in the cluster. At least one of these hosts must be reachable for discovery to work (see the connectivity check after this list).
```sh
discovery.seed_hosts:
- 192.168.1.1:9300
- 192.168.1.2
- nodes.mycluster.com
```
2. For the first cluster startup, you must also configure `cluster.initial_master_nodes` with the node names (not IPs) of the initial set of master-eligible nodes. This setting is required when bootstrapping a new cluster and is ignored on subsequent starts.

Contributor:
Suggested change:
- 2. For the first cluster startup, you must also configure `cluster.initial_master_nodes` with the node names (not IPs) of the initial set of master-eligible nodes. This setting is required when bootstrapping a new cluster and is ignored on subsequent starts.
+ 2. For the first cluster startup, you must also configure `cluster.initial_master_nodes` with the node names (not IPs) of the initial set of master-eligible nodes. This setting is required when bootstrapping a new cluster and must be removed on subsequent starts.

Please also link to the docs about the initial_master_nodes setting here.

```sh
cluster.initial_master_nodes:
- master-node-name1
- master-node-name2
- master-node-name3
```
If this setting is omitted during the first cluster formation, no master election can occur.
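
If you need to confirm that the seed hosts listed in step 1 are actually reachable, a quick probe of the transport port from the node that fails to join can help. This is a minimal sketch, assuming the default transport port `9300` and the example hosts shown above:

```sh
# Probe each configured seed host on the transport port (9300 by default).
# The addresses below are the examples from step 1; substitute your own.
nc -zv 192.168.1.1 9300
nc -zv 192.168.1.2 9300
nc -zv nodes.mycluster.com 9300
```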

Only nodes with `node.master: true` are eligible to become master nodes and participate in elections. Make sure the nodes listed in `cluster.initial_master_nodes` are properly configured as master-eligible. Nodes with `node.voting_only: true` can participate in voting but cannot become master themselves. See [this guide](/deploy-manage/distributed-architecture/discovery-cluster-formation/discovery-hosts-providers.md) for more information.
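
As a rough illustration of these settings, the following `elasticsearch.yml` fragments contrast a master-eligible node with a voting-only node. The node names are hypothetical, and the legacy boolean role settings are used to match the flags mentioned above:

```sh
# Fragment for a hypothetical master-eligible node: it can vote and be elected master.
node.name: master-node-name1
node.master: true

# Fragment for a hypothetical voting-only node: it is master-eligible and votes in
# elections, but it is never elected master itself.
node.name: voting-only-node-1
node.master: true
node.voting_only: true
```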

An {{es}} cluster requires a quorum of master-eligible nodes to elect a master. A quorum is defined as `(N/2 + 1)`, where N is the number of master-eligible nodes. If fewer than this number are available, the cluster will not elect a master and will not form. This quorum mechanism helps prevent split-brain scenarios where multiple nodes mistakenly believe they are the master. For more details, see [Quorum-based decision making](../../deploy-manage/distributed-architecture/discovery-cluster-formation/modules-discovery-quorums.md).

Contributor:
Did you mean (N+1)/2? Normally N is odd so this is probably the most useful way to express it. Otherwise to be fully technically correct the formula is ⌈(N+1)/2⌉. Or we can just say "majority" instead of "quorum" and avoid all this.


## No master is elected [discovery-no-master]

@@ -42,7 +69,7 @@

The threads involved in discovery and cluster membership are mainly `transport_worker` and `cluster_coordination` threads, for which there should never be a long wait. There may also be evidence of long waits for threads in the {{es}} logs, particularly looking at warning logs from `org.elasticsearch.transport.InboundHandler`. See [Networking threading model](elasticsearch://reference/elasticsearch/configuration-reference/networking-settings.md#modules-network-threading-model) for more information.


If your cluster has recently lost one or more master-eligible nodes and the logs indicate that no master can be elected, verify that a quorum still exists. A master election requires a majority of the master-eligible nodes to be available (for example, 2 out of 3, or 3 out of 5). If the quorum cannot be met, the cluster will remain unformed until enough nodes are restored. This quorum mechanism is essential for ensuring consistency and preventing split-brain conditions.

Contributor:
I think this duplicates the information above:

If the logs or the health report indicate that {{es}} can’t discover enough nodes to form a quorum...

and

If the logs or the health report indicate that {{es}} has discovered a possible quorum of nodes...

No need for the user to have to work out what a quorum/majority really is, and indeed they often get confused because they need a majority of the master-eligible nodes that previously made up the cluster, it's not enough to start some new nodes because those nodes' votes won't yet count in the election. I'd rather we didn't add this paragraph.


## Master is elected but unstable [discovery-master-unstable]

@@ -53,6 +80,10 @@
* Packet captures will reveal system-level and network-level faults, especially if you capture the network traffic simultaneously at all relevant nodes and analyse it alongside the {{es}} logs from those nodes. You should be able to observe any retransmissions, packet loss, or other delays on the connections between the nodes.
* Long waits for particular threads to be available can be identified by taking stack dumps of the main {{es}} process (for example, using `jstack`) or a profiling trace (for example, using Java Flight Recorder) in the few seconds leading up to the relevant log message, as sketched after this list.
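
As a rough sketch of taking a thread dump and a Flight Recorder trace (the process ID and output paths are placeholders):

```sh
# Capture a thread dump of the Elasticsearch JVM; replace $ES_PID with the real process ID.
jstack $ES_PID > /tmp/es-threads.txt

# Start a 60-second Java Flight Recorder recording and write it to a file.
jcmd $ES_PID JFR.start duration=60s filename=/tmp/es-recording.jfr
```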

If your master node is also acting as a data node under heavy indexing or search load, this can cause instability. In clusters under high demand, it is recommended to use [dedicated master nodes](/deploy-manage/distributed-architecture/clusters-nodes-shards.md/node-roles#dedicated-master-node) (nodes configured with `node.master: true` and `node.data: false`) to reduce load and improve election reliability.
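
A minimal sketch of such a dedicated master node in `elasticsearch.yml`, with a hypothetical node name and the legacy boolean role settings used in the paragraph above:

```sh
# Hypothetical dedicated master node: master-eligible, holds no data, performs no ingest.
node.name: dedicated-master-1
node.master: true
node.data: false
node.ingest: false
```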

Check failure on line 83 in troubleshoot/elasticsearch/discovery-troubleshooting.md (GitHub Actions / preview / build): `/deploy-manage/distributed-architecture/clusters-nodes-shards.md/node-roles` does not exist; resolved to `/github/workspace/deploy-manage/distributed-architecture/clusters-nodes-shards.md/node-roles`.

Contributor:
I don't think this is true any more. We don't use dedicated master nodes at all in serverless for instance.


Additionally, ensure that the master node is not affected by resource contention from other applications. This is especially important when running in containers (e.g., Docker or Kubernetes), where CPU throttling, memory limits, or pod evictions can disrupt stability. Ensure adequate resource allocation and isolate master nodes from other workloads whenever possible.
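
For example, when running under Docker you can reserve CPU and memory for the container so that co-located workloads cannot starve it. This is only a sketch; the container name, resource values, and image tag are placeholders, and the rest of the node configuration is omitted:

```sh
# Hypothetical example: pin CPU and memory for a containerised master-eligible node.
# The image tag is a placeholder; use the version you actually run.
docker run --name es-master-1 \
  --cpus=2 \
  --memory=8g \
  docker.elastic.co/elasticsearch/elasticsearch:8.17.0
```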

The [Nodes hot threads](https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-nodes-hot-threads) API sometimes yields useful information, but bear in mind that this API also requires a number of `transport_worker` and `generic` threads across all the nodes in the cluster. The API may be affected by the very problem you’re trying to diagnose. `jstack` is much more reliable since it doesn’t require any JVM threads.
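
For reference, a hedged example of calling that API, assuming the cluster is reachable on `localhost:9200` and that security is disabled or credentials are supplied separately:

```sh
# Ask every node for its hottest threads; raise ?threads= to capture more threads per node.
curl -s 'http://localhost:9200/_nodes/hot_threads?threads=5'
```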

The threads involved in discovery and cluster membership are mainly `transport_worker` and `cluster_coordination` threads, for which there should never be a long wait. There may also be evidence of long waits for threads in the {{es}} logs, particularly looking at warning logs from `org.elasticsearch.transport.InboundHandler`. See [Networking threading model](elasticsearch://reference/elasticsearch/configuration-reference/networking-settings.md#modules-network-threading-model) for more information.