
Search only replicas (scale to zero) with Reader/Writer Separation #17299

Open · wants to merge 8 commits into main from searchonly-2

Conversation

prudhvigodithi
Member

@prudhvigodithi prudhvigodithi commented Feb 7, 2025

Description

  • The primary goal is to allow users to designate an index as search-only, so that only its search-only replicas keep running. This is enabled via an API call to _searchonly/enable (and can be disabled via _searchonly/disable).

  • With _searchonly/enable for an index, the process performs a two-phase scale-down: a temporary block is applied for the duration of the scale-down operation and is then explicitly replaced with a permanent block once all prerequisites (e.g., shard sync, flush, metadata updates) have been met.

From #17299 (comment), using the _scale API with search_only set to true or false:

curl -X POST "http://localhost:9200/my-index/_scale" \
-H "Content-Type: application/json" \
-d '{
  "search_only": true
}' 

curl -X POST "http://localhost:9200/my-index/_scale" \
-H "Content-Type: application/json" \
-d '{
  "search_only": false
}'
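
As a quick sanity check after scaling down, the cluster-state blocks API can confirm that the index carries the expected block. A minimal sketch; the exact ID and description of the permanent block added by this PR are implementation details, so treat the expected output as an assumption:

# Inspect index-level blocks in the cluster state; after _scale with
# search_only=true, the permanent search-only write block added by this
# PR should appear under my-index (block ID/description may differ).
curl -s "localhost:9200/_cluster/state/blocks/my-index?pretty"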

Related Issues

#16720 and part of #15306

Check List

  • Functionality includes testing.
  • API changes companion pull request created, if applicable.
  • Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Contributor

github-actions bot commented Feb 7, 2025

❌ Gradle check result for e89b812: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@prudhvigodithi
Member Author

While I refactor the code and add additional tests, I'm creating this PR to gather early feedback. Please take a look and add your thoughts; I will share the testing results in the comments. Thanks!
@mch2 @shwetathareja @msfroh @getsaurabh02

Contributor

github-actions bot commented Feb 7, 2025

❌ Gradle check result for 1bd7c6a: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@prudhvigodithi
Member Author

I went through and tested the following scenarios:

Scenario 1: Search-Only Replica Recovery with a Persistent Data Directory and cluster.remote_store.state.enabled set to false

OpenSearch was started using:

./gradlew clean run -PnumNodes=6 --data-dir=/tmp/foo
OpenSearch settings:

setting 'path.repo', '["/tmp/my-repo"]'
setting 'opensearch.experimental.feature.read.write.split.enabled', 'true'
setting 'node.attr.remote_store.segment.repository', 'my-repository'
setting 'node.attr.remote_store.translog.repository', 'my-repository'
setting 'node.attr.remote_store.repository.my-repository.type', 'fs'
setting 'node.attr.remote_store.state.repository', 'my-repository'
setting 'node.attr.remote_store.repository.my-repository.settings.location', '/tmp/my-repo'

Shard Allocation Before Recovery

curl -X GET "localhost:9200/_cat/shards/my-index?v&h=index,shard,prirep,state,unassigned.reason,node,searchOnly"

index    shard prirep state   unassigned.reason node
my-index 0     p      STARTED                   runTask-0
my-index 0     s      STARTED                   runTask-4
my-index 0     r      STARTED                   runTask-2
my-index 1     p      STARTED                   runTask-3
my-index 1     r      STARTED                   runTask-1
my-index 1     s      STARTED                   runTask-5

On restart (terminating the process), everything comes back as running. With search-only enabled (/_searchonly/enable), after restart only the search replicas come up and everything works as expected.

curl -X GET "localhost:9200/_cat/shards/my-index?v&h=index,shard,prirep,state,unassigned.reason,node,searchOnly"
index    shard prirep state   unassigned.reason node
my-index 0     s      STARTED                   runTask-2
my-index 1     s      STARTED                   runTask-1
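
To double-check the cluster view in this state, index-level health can be queried as well. A minimal sketch; whether this reports green with zero primaries assigned depends on this PR's health accounting for search-only indices, so the expected status is an assumption:

# Index-level health with only search-only replicas active; the reported
# status depends on this PR's health handling for search-only indices.
curl -s "localhost:9200/_cluster/health/my-index?pretty"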

Scenario 2: No Data Directory Preservation and cluster.remote_store.state.enabled set to false – Index Lost After Process Restart (Recovery)

In this scenario, OpenSearch is started without preserving the data directory, meaning that all local shard data is lost upon recovery.

./gradlew clean run -PnumNodes=6
OpenSearch settings:

setting 'path.repo', '["/tmp/my-repo"]'
setting 'opensearch.experimental.feature.read.write.split.enabled', 'true'
setting 'node.attr.remote_store.segment.repository', 'my-repository'
setting 'node.attr.remote_store.translog.repository', 'my-repository'
setting 'node.attr.remote_store.repository.my-repository.type', 'fs'
setting 'node.attr.remote_store.state.repository', 'my-repository'
setting 'node.attr.remote_store.repository.my-repository.settings.location', '/tmp/my-repo'

Behavior After Recovery:

  • Upon terminating the process and restarting OpenSearch, the index is completely lost.
  • Any attempt to retrieve the shard state results in an index_not_found_exception:
{"error":{"root_cause":[{"type":"index_not_found_exception","reason":"no such index [my-index]","index":"my-index","resource.id":"my-index","resource.type":"index_or_alias","index_uuid":"_na_"}],"type":"index_not_found_exception","reason":"no such index [my-index]","index":"my-index","resource.id":"my-index","resource.type":"index_or_alias","index_uuid":"_na_"},"status":404}
  • Even with a remote restore (_remotestore/_restore?restore_all_shards=true), the index remains unavailable (see the command sketch after this list).
  • Even after recreating the index manually and attempting to restore, documents do not get picked up.
  • Since --data-dir was not used during testing, local data (including cluster metadata) is wiped on recovery.
  • Because the cluster state is lost, OpenSearch no longer has any reference to the index.
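
For reference, the restore attempt mentioned in the list above had this shape (a sketch using the documented remote store restore API):

# Try to restore all shards of my-index from the remote store; in this
# scenario it still fails because the cluster metadata referencing the
# index was wiped together with the data directory.
curl -X POST "localhost:9200/_remotestore/_restore?restore_all_shards=true" \
-H 'Content-Type: application/json' \
-d '{"indices": ["my-index"]}'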

Scenario 3: Cluster Remote Store State Enabled (cluster.remote_store.state.enabled set to true, no persistent data directory) – Primary Shards Remain Unassigned After Recovery

./gradlew clean run -PnumNodes=6
OpenSearch settings:

setting 'path.repo', '["/tmp/my-repo"]'
setting 'opensearch.experimental.feature.read.write.split.enabled', 'true'
setting 'node.attr.remote_store.segment.repository', 'my-repository'
setting 'node.attr.remote_store.translog.repository', 'my-repository'
setting 'node.attr.remote_store.repository.my-repository.type', 'fs'
setting 'node.attr.remote_store.state.repository', 'my-repository'
setting 'node.attr.remote_store.repository.my-repository.settings.location', '/tmp/my-repo'
setting 'cluster.remote_store.state.enabled', 'true'

Shard Allocation After Recovery

curl -X GET "localhost:9200/_cat/shards/my-index?v&h=index,shard,prirep,state,unassigned.reason,node,searchOnly"
index    shard prirep state      unassigned.reason node
my-index 0     p      UNASSIGNED CLUSTER_RECOVERED 
my-index 0     s      UNASSIGNED CLUSTER_RECOVERED 
my-index 0     r      UNASSIGNED CLUSTER_RECOVERED 
my-index 1     p      UNASSIGNED CLUSTER_RECOVERED 
my-index 1     r      UNASSIGNED CLUSTER_RECOVERED 
my-index 1     s      UNASSIGNED CLUSTER_RECOVERED 

Issue: Primary Shards Remain Unassigned
Despite cluster.remote_store.state.enabled being true, the primary shards are not automatically assigned after restart (replicating the recovery). The allocation explanation states:

"allocate_explanation": "cannot allocate because a previous copy of the primary shard existed but can no longer be found on the nodes in the cluster"
  • Remote store only contains segments and translogs, NOT active shard copies.
  • Since --data-dir was not used (the data directory is not preserved), local copies of primary shards are lost.
  • OpenSearch does not automatically restore primaries from the remote store without explicit intervention.
curl -X POST "http://localhost:9200/_remotestore/_restore" -H 'Content-Type: application/json' -d'  
{                
  "indices": ["my-index"]
}
'
  • However, with this PR, when _searchonly is enabled, search-only replicas recover without a primary. Since cluster.remote_store.state.enabled is true, OpenSearch remembers that the index exists after restart, and the allocation logic skips the active-primary check for search-only replicas. This allows search-only replicas to be assigned to a node even without an existing primary. Without _searchonly, the behavior is the same for all replicas; the intent is to give an advantage to users with _searchonly-enabled indices, since for these indices they should not have to care about _remotestore/_restore, as no primaries are involved.

    • Search-only replicas can recover automatically from the remote store.
    • Search queries remain functional.
    • Cluster state correctly remembers the index, but does not bring up primaries as _searchonly is enabled.
  • The default behavior is that OpenSearch does not assume lost primaries should be re-created from remote storage. It waits for explicit user intervention to restore primary shards (_remotestore/_restore). Is this by design?

Scenario 4: Persistent Data Directory with Remote Store State – Seamless Recovery of Primaries, Replicas, and Search-Only Replicas

./gradlew clean run -PnumNodes=6 --data-dir=/tmp/foo
OpenSearch settings:

setting 'path.repo', '["/tmp/my-repo"]'
setting 'opensearch.experimental.feature.read.write.split.enabled', 'true'
setting 'node.attr.remote_store.segment.repository', 'my-repository'
setting 'node.attr.remote_store.translog.repository', 'my-repository'
setting 'node.attr.remote_store.repository.my-repository.type', 'fs'
setting 'node.attr.remote_store.state.repository', 'my-repository'
setting 'node.attr.remote_store.repository.my-repository.settings.location', '/tmp/my-repo'
setting 'cluster.remote_store.state.enabled', 'true'

Upon recovery (no intervention is required):

curl -X GET "localhost:9200/_cat/shards/my-index?v&h=index,shard,prirep,state,unassigned.reason,node,searchOnly"
index    shard prirep state   unassigned.reason node
my-index 0     p      STARTED                   runTask-0
my-index 0     r      STARTED                   runTask-2
my-index 0     s      STARTED                   runTask-4
my-index 1     p      STARTED                   runTask-5
my-index 1     r      STARTED                   runTask-3
my-index 1     s      STARTED                   runTask-1
  • All primary and replica shards successfully recover since the cluster metadata is retained in the persistent data directory.

If search-only mode is enabled for the index, OpenSearch correctly brings up only the search replicas while removing the primary and regular replicas.

curl -X GET "localhost:9200/_cat/shards/my-index?v&h=index,shard,prirep,state,unassigned.reason,node,searchOnly"
index    shard prirep state   unassigned.reason node
my-index 0     s      STARTED                   runTask-3
my-index 1     s      STARTED                   runTask-3
  • Only search replicas (SORs) are restored, as expected.
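
To return the index to full read-write operation after these tests, the disable endpoint from this PR can be called (matching the routes shown later in this review); a minimal sketch:

# Scale the index back up: primaries and regular replicas are restored.
curl -X POST "localhost:9200/my-index/_searchonly/disable"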

@prudhvigodithi
Member Author

Coming from #17299 (comment), @shwetathareja can you please go over scenarios 2 and 3 and check whether they make sense? I wanted to understand why _remotestore/_restore is required in these scenarios, and I want to give users the advantage of removing this intervention for search-only indices.
Thanks
@mch2

@prudhvigodithi prudhvigodithi force-pushed the searchonly-2 branch 3 times, most recently from 8f1d4ea to 7fa5133 Compare February 7, 2025 23:32
Contributor

github-actions bot commented Feb 7, 2025

❌ Gradle check result for 7fa5133: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@prudhvigodithi
Member Author

I have updated the PR to adjust the cluster health calculation when only search replicas are running, and to incorporate the changes needed when _searchonly is enabled. The change is not too big, hence going with the same PR.

Contributor

github-actions bot commented Feb 8, 2025

❌ Gradle check result for 64bb954: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@prudhvigodithi prudhvigodithi self-assigned this Feb 10, 2025
@github-actions github-actions bot added enhancement Enhancement or improvement to existing feature or request Roadmap:Search Project-wide roadmap label Search:Performance v3.0.0 Issues and PRs related to version 3.0.0 labels Feb 12, 2025
Contributor

❌ Gradle check result for 470c0ea: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@prudhvigodithi
Member Author

Adding @sachinpkale: can you please take a look at this comment #17299 (comment) and provide your thoughts on why _remotestore/_restore is required (Scenario 3 from #17299 (comment)) and why the cluster cannot be auto-recovered? Is there any strong reason for this manual intervention to run the API?

curl -X GET "localhost:9200/_cat/shards/my-index?v&h=index,shard,prirep,state,unassigned.reason,node,searchOnly"
index    shard prirep state      unassigned.reason node
my-index 0     p      UNASSIGNED CLUSTER_RECOVERED 
my-index 0     s      UNASSIGNED CLUSTER_RECOVERED 
my-index 0     r      UNASSIGNED CLUSTER_RECOVERED 
my-index 1     p      UNASSIGNED CLUSTER_RECOVERED 
my-index 1     s      UNASSIGNED CLUSTER_RECOVERED 
my-index 1     r      UNASSIGNED CLUSTER_RECOVERED 

I didn't get much information from the docs: https://opensearch.org/docs/latest/tuning-your-cluster/availability-and-recovery/remote-store/index/#restoring-from-a-backup.


@Override
public List<Route> routes() {
    return asList(new Route(POST, "/{index}/_searchonly/enable"), new Route(POST, "/{index}/_searchonly/disable"));
Collaborator

I would rename _searchonly; it is better to have a verb denoting an action on an index, like _scale, and to use search-only as a query parameter/request body so the API finds wider applicability.

Member Author

@prudhvigodithi prudhvigodithi Feb 13, 2025

Thanks, I will take a look at this and go with something generic that has wider applicability.

Member Author

Initially I started with _scale #16720 (comment). Maybe we can have:

POST /{index}/_scale

{
  "search-only": true
}

Adding @msfroh @mch2 @getsaurabh02

Member

As per the original discussion @prudhvigodithi, _scale is more intuitive.

Member Author

Thanks, I have updated to use _scale. Example:

curl -X POST "http://localhost:9200/my-index/_scale" \
-H "Content-Type: application/json" \
-d '{
  "search_only": true
}' 

Collaborator

Unfortunately you'll need to make a corresponding set of changes in the action names and probably the class names for consistency

Member Author

Sure, will do that. I can change it to RestScaleSearchOnlyAction. Thanks.

setting 'node.attr.remote_store.translog.repository', 'my-repository'
setting 'node.attr.remote_store.repository.my-repository.type', 'fs'
setting 'node.attr.remote_store.state.repository', 'my-repository'
setting 'node.attr.remote_store.repository.my-repository.settings.location', '/tmp/my-repo'
Member Author

I have used these settings to test search-only replicas with a filesystem remote store; I will remove them and change back to the defaults once tests are green and the PR review is finalized.

Member

I'm wondering if we should add a param to easily set these from command line with ./gradlew run.

@prudhvigodithi prudhvigodithi force-pushed the searchonly-2 branch 2 times, most recently from 34d8bd5 to 285fc87 Compare February 24, 2025 22:23
Contributor

❌ Gradle check result for 285fc87: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: Prudhvi Godithi <[email protected]>
Contributor

❌ Gradle check result for cb515eb: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Contributor

✅ Gradle check result for cb515eb: SUCCESS


import static org.opensearch.test.hamcrest.OpenSearchAssertions.assertHitCount;

@OpenSearchIntegTestCase.ClusterScope(scope = OpenSearchIntegTestCase.Scope.TEST, numDataNodes = 0)
public class SearchOnlyIT extends RemoteStoreBaseIntegTestCase {
Member

nitpick - we have a few tests with a generic "SearchOnly" name; can we narrow this naming to the Scale functionality?

totalSearchReplicas += shardTable.searchOnlyReplicas().stream().filter(ShardRouting::active).count();
}
assertEquals("Expected 1 primary", 1, totalPrimaries);
assertEquals("Expected 0 writer replicas", 1, totalWriterReplicas);
Member

You shouldn't have to make these assertions on the routing table, as it's implied by ensureGreen that all desired shards are active.

assertEquals(1, shardTable.searchOnlyReplicas().stream().filter(ShardRouting::active).count());

// Scale down to search-only mode
assertAcked(client().admin().indices().prepareSearchOnly(TEST_INDEX).setSearchOnly(true).get());
Member

The prepareSearchOnly API would need to be updated to use your Scale naming.


// Verify we can still search
SearchResponse searchResponse = client().prepareSearch(TEST_INDEX).get();
assertHitCount(searchResponse, 10);
Member

This is great - though this still uses segrep with remote store. Post indexing above, we should add a wait condition that 10 docs are active, to avoid flakiness. Or is it a condition of the scale API and the block presence that the replica will be up to date before those are ack'd? If that's the case, please add a small comment.

@@ -94,6 +94,7 @@
import java.util.TreeSet;
import java.util.function.Function;

import static org.opensearch.action.admin.indices.scale.searchonly.TransportSearchOnlyAction.INDEX_SEARCHONLY_BLOCK;
Member

Let's define the new block along with all the others here in IndexMetadata instead of the transport layer.

@@ -94,6 +96,7 @@ public class IndexRoutingTable extends AbstractDiffable<IndexRoutingTable>
private final Map<Integer, IndexShardRoutingTable> shards;

private final List<ShardRouting> allActiveShards;
protected final Logger logger = LogManager.getLogger(this.getClass());
Member

Why protected?

+ indexMetadata.getNumberOfSearchOnlyReplicas()
+ "], got ["
+ routingNumberOfReplicas
+ "]"
);
} else if (routingNumberOfReplicas != expectedReplicas) {
// Just log if there's a mismatch but the index is search-only.
Member

I don't think we need to log here if this is an expected state?

@@ -425,8 +442,33 @@ public Builder initializeAsNew(IndexMetadata indexMetadata) {
/**
* Initializes an existing index.
*/

public Builder initializeAsRecovery(IndexMetadata indexMetadata) {
Member

What about the other cases here?
initializeAsFromCloseToOpen
initializeAsFromDangling
initializeAsFromOpenToClose

Member

does this check need to be in initializeEmpty instead?

Member Author

I see initializeEmpty is generically called across those methods, let me update.

Member Author

Now, after thinking about it: it's logical that these operations would need different behavior than recovery. When reopening or importing an index, we should typically want all shards (primaries, regular replicas, and search replicas) to be initialized properly, not just search replicas. WDYT?

Member Author

Thinking back, yes, we can move this logic to the common method initializeEmpty, but there is currently an issue with the _close API when search replicas are enabled #15306 (comment). For this PR I will move and keep the code in initializeEmpty.


if (mdFile == null) {
    listener.onResponse(new CheckpointInfoResponse(indexShard.getLatestReplicationCheckpoint(), Collections.emptyMap(), null));
Member

I think this change is so that SRs can become active and sync with the remote before a primary has uploaded anything, correct? I don't think we want to make the same change for other replicas that rely on primary initialization completing first.

Member Author

I see your point; let me handle a case for isSearchOnly, something like } else if (indexShard.routingEntry().isSearchOnly()) {.

Successfully merging this pull request may close these issues.

[Feature Request] Scale to Zero with Reader/Writer Separation.