fix: allow read-only operations when all masters are down #497

Open
p0rtale wants to merge 1 commit into master from fix/read-only-operations-with-dead-masters

p0rtale (Contributor) commented May 24, 2026

Read-only operations (get, select, pairs, count, len, min, max) used to fail with connection errors if all masters in the cluster were unavailable, even when healthy replicas were up and failover had not yet completed.

To resolve this, the following improvements were made:

  • Introduced a read_only flag to utils.get_space[s] to fetch cluster schema from any healthy replica if masters are down.
  • Updated get, select, pairs, count, len, min, max to use this new flag.
  • Rewrote call.any to iterate through all replicasets and utilize vshard's callro instead of call to fetch metadata from replicas.

Closes TNTP-7102
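
The `call.any` change described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code from this PR: `make_replicaset` is a hypothetical stub standing in for a vshard replicaset object, `call_any` is an illustrative name, and only the `replicaset:callro(func_name, args, opts)` call shape is taken from vshard.

```lua
-- Hypothetical stand-in for a vshard replicaset object; only `callro` is modeled.
local function make_replicaset(name, healthy)
    return {
        name = name,
        -- Same call shape as vshard's replicaset:callro(func_name, args, opts).
        callro = function(self, func_name, args, opts)
            if not healthy then
                return nil, {message = 'no reachable instance in ' .. self.name}
            end
            return {replicaset = self.name, func = func_name}
        end,
    }
end

-- Try every replicaset in turn and return the first successful read-only call.
-- If all of them fail, return the last error seen.
local function call_any(replicasets, func_name, args, opts)
    local last_err
    for _, replicaset in pairs(replicasets) do
        local res, err = replicaset:callro(func_name, args, opts)
        if err == nil then
            return res
        end
        last_err = err
    end
    return nil, last_err or {message = 'no replicasets available'}
end
```

Because `callro` may be served by a replica, a loop of this shape can succeed even when no replicaset currently has a reachable master.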

p0rtale self-assigned this May 24, 2026
p0rtale force-pushed the fix/read-only-operations-with-dead-masters branch 3 times, most recently from a489998 to 427fc6a, May 25, 2026 18:06
p0rtale force-pushed the fix/read-only-operations-with-dead-masters branch from 427fc6a to c93121e, June 1, 2026 21:38
p0rtale requested review from Satbek, ita-sammann and vakhov, June 4, 2026 08:43
p0rtale force-pushed the fix/read-only-operations-with-dead-masters branch from c93121e to 3dd8fbe, June 9, 2026 09:55
Comment thread crud/common/utils.lua

function utils.get_spaces(vshard_router, timeout, replica_id)
local replicasets, replicaset, replicaset_id, master
local function find_any_healthy_replica_conn(replicasets)
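
The body of `find_any_healthy_replica_conn` is not shown in this excerpt. One plausible shape, given here only as a sketch, is a nested scan over replicasets and their replicas. The field names used below (`replicas`, `conn`, `conn.error`) are assumptions about the replicaset structure, not something this thread confirms.

```lua
-- Sketch only: return the first connection that exists and reports no error.
-- Assumed layout: replicasets[id].replicas[replica_id].conn, with conn.error
-- set when the connection is broken.
local function find_any_healthy_replica_conn(replicasets)
    for _, replicaset in pairs(replicasets) do
        for _, replica in pairs(replicaset.replicas or {}) do
            local conn = replica.conn
            if conn ~= nil and conn.error == nil then
                return conn
            end
        end
    end
    -- No healthy connection anywhere in the cluster.
    return nil
end
```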

Comment thread crud/len.lua
end

local results, err = call.map(vshard_router, CRUD_LEN_FUNC_NAME, {space_name}, {
mode = 'write',

Contributor:

Why?

Contributor Author:

Otherwise len() will fail if masters are unavailable

Contributor:

> Otherwise len() will fail if masters are unavailable

Maybe that is OK? We need to know the exact number of tuples on the master. No masters -> no result.

You could file a ticket for a flag on the len method that fetches the length from replicas when the master is absent.

Contributor Author:

Added the read-only options (mode, request_timeout, prefer_replica, balance) so that users can explicitly fetch the length from replicas if needed.
The default behavior is unchanged (mode = 'write'), so len still fails without masters, which guarantees an exact count.
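
How these options could select a vshard call can be sketched as below. This is an illustration, not the PR's code: `get_vshard_call_name` and `resolve_len_call` are names chosen for this example. The call names themselves (`callrw`, `callro`, `callbro`, `callre`, `callbre`) are vshard router methods.

```lua
-- Map read options onto a vshard router call name.
local function get_vshard_call_name(mode, prefer_replica, balance)
    if mode ~= 'read' and mode ~= 'write' then
        return nil, "unknown mode: expected 'read' or 'write'"
    end
    if mode == 'write' then
        return 'callrw'      -- master only
    end
    if prefer_replica and balance then
        return 'callbre'     -- balanced, replicas preferred
    elseif prefer_replica then
        return 'callre'      -- replicas preferred
    elseif balance then
        return 'callbro'     -- balanced across all instances
    end
    return 'callro'          -- any instance, master first
end

-- Defaulting to 'write' keeps the old behavior: no masters, no result.
local function resolve_len_call(opts)
    opts = opts or {}
    return get_vshard_call_name(opts.mode or 'write', opts.prefer_replica, opts.balance)
end
```

With this mapping, a caller who wants a length while masters are down would have to opt in explicitly, for example with `{mode = 'read'}`.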

Comment thread crud/common/utils.lua Outdated

function utils.get_space(space_name, vshard_router, timeout, replica_id)
local spaces, err, schema_version = utils.get_spaces(vshard_router, timeout, replica_id)
function utils.get_space(space_name, vshard_router, timeout, replica_id, read_only)

Contributor:

Maybe pass the timeout, replica_id and read_only args as a single opts table argument?

Contributor Author:

Done
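
The resulting signature is not shown in this thread. A minimal sketch of what an opts-table version might look like follows. `get_spaces` here is a hypothetical stub, and the router fields `masters_down`, `spaces` and `schema_version` exist only to make the example runnable.

```lua
-- Hypothetical stub: fails without masters unless read_only is requested.
local function get_spaces(vshard_router, opts)
    opts = opts or {}
    if vshard_router.masters_down and not opts.read_only then
        return nil, 'no master available'
    end
    return vshard_router.spaces, nil, vshard_router.schema_version
end

-- Optional arguments travel in one table instead of three positional ones.
local function get_space(space_name, vshard_router, opts)
    opts = opts or {}
    local spaces, err, schema_version = get_spaces(vshard_router, {
        timeout = opts.timeout,
        replica_id = opts.replica_id,
        read_only = opts.read_only,
    })
    if spaces == nil then
        return nil, err
    end
    return spaces[space_name], nil, schema_version
end
```

A call then reads `get_space('customers', router, {read_only = true})`, and adding another option later does not change the positional signature.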

Read-only operations (get, select, pairs, count, len, min, max) used to
fail with connection errors if all masters in the cluster were unavailable,
even when healthy replicas were up and failover hadn't processed yet.

To resolve this, the following improvements were made:
- Introduced a `read_only` flag to `utils.get_space[s]` to fetch cluster
  schema from any healthy replica if masters are down.
- Updated `get`, `select`, `pairs`, `count`, `len`, `min`, `max` to use
  this new flag.
- Rewrote `call.any` to iterate through all replicasets and utilize
  vshard's `callro` instead of `call` to fetch metadata from replicas.
- Added support for `mode`, `balance`, `prefer_replica`, and `request_timeout`
  options in `crud.len`. The default mode remains `write` to preserve
  backward compatibility.
p0rtale force-pushed the fix/read-only-operations-with-dead-masters branch from 3dd8fbe to 6a1fada, June 14, 2026 20:02
p0rtale requested a review from Satbek, June 15, 2026 08:36
3 participants