Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add health check to route queries to healthy cluster and router jmx counter updates #24449

Open
wants to merge 28 commits into
base: master
Choose a base branch
from

Conversation

auden-woolfson
Copy link
Contributor

@auden-woolfson auden-woolfson commented Jan 28, 2025

Description

Add coodinator health checks to presto router to ensure queries are sent to active/healthy clusters. Part of presto router forward fit. Also includes code implemented to patch CVEs.

== RELEASE NOTES ==

General Changes
* Add health coordinator health checks to presto router
* Add counter JMX metrics to presto router

@prestodb-ci prestodb-ci added the from:IBM PR from IBM label Jan 28, 2025
Copy link

linux-foundation-easycla bot commented Jan 28, 2025

CLA Missing ID CLA Not Signed

@auden-woolfson auden-woolfson marked this pull request as ready for review February 3, 2025 22:29
@auden-woolfson auden-woolfson requested review from vinothchandar, 7c00 and a team as code owners February 3, 2025 22:29
@auden-woolfson
Copy link
Contributor Author

Will fix CLA and squash commits upon review

@saravanan19
Copy link

@auden Woolfson - do we plan on including counter jmx metrics as part of this PR? is that intentional? If so, we need the PR title and description updated. If not, need a separate PR.

@auden-woolfson auden-woolfson changed the title Add health check to route queries to healthy cluster Add health check to route queries to healthy cluster and router jmx counter updates Feb 17, 2025
Copy link
Contributor

@aaneja aaneja left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall approach looks good. Please squash and split into logically cohesive commits

public void startConfigReloadTask()
{
File routerConfigFile = new File(routerConfig.getConfigFile());
//ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(1);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove

{
this.routerConfig = config;
this.scheduledExecutorService = scheduledExecutorService;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not just instantiate the scheduledExecutorService here ? Why inject it ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure, @saravanan19 is there a reason for this?

}
lastConfigUpdate.set(newConfigUpdateTime);
}
}, 0L, (long) 5, TimeUnit.SECONDS);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(long) 5 -> 5L

}

@PostConstruct
public void startConfigReloadTask()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider using a FileWatcher instead.
1/ Updates are lower latency
2/ Saves on having an extra bg thread


public RemoteState(HttpClient httpClient, URI remoteUri)
private Boolean isHealthy = false;
private long lastHealthyResponseTime;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of long, prefer using Instant to track an accurate timestamp since last response

}

scheduler.setCandidates(healthyClusterURIs);
if (schedulerType == WEIGHTED_RANDOM_CHOICE || schedulerType == WEIGHTED_ROUND_ROBIN) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like an unrelated change to health check. Can you make a new PR for this change instead ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jp-sivaprasad, can you please add this to #24580?

binder.bind(ClusterManager.class).in(Scopes.SINGLETON);
binder.bind(RemoteInfoFactory.class).in(Scopes.SINGLETON);

bindHttpClient(binder, QUERY_TRACKER, ForQueryInfoTracker.class, IDLE_TIMEOUT_SECOND, REQUEST_TIMEOUT_SECOND);
bindHttpClient(binder, QUERY_TRACKER, ForClusterInfoTracker.class, IDLE_TIMEOUT_SECOND, REQUEST_TIMEOUT_SECOND);

//Determine the NodeVersion
NodeVersion nodeVersion = new NodeVersion(serverConfig.getPrestoVersion());
binder.bind(NodeVersion.class).toInstance(nodeVersion);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this used ?

@@ -17,6 +17,11 @@
</properties>

<dependencies>
<dependency>
<groupId>com.facebook.presto</groupId>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is only TestingPrestoServer used from presto-main ? Or are there other types ? This can be a test-only dependency ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We may be able to make this a test only dep for this PR, but there are other parts of router that we are forward fitting that will rely on this dependency in the main module. It might be beneficial to just leave this

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a fan of pulling in a huge dependency like presto-main into presto-router. What other PRs are bringing this in ? Let's see if we can avoid this by building good mocks for the Presto server. I think we may get by just having a few API endpoint's mocked

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe the authentication piece uses this. Other might as well. We can switch this for this PR but I can't guarantee that we will be able to leave it like this. We can cross that bridge when we get to it

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually on second thought we might still need this. We are binding multiple classes (ServerConfig, WebUiResource, PluginManagerConfig) in the RouterModule from presto main.

@Test(enabled = false)
public void testHealthChecks()
{
prestoServers.get(0).stopResponding();
Copy link
Contributor

@aaneja aaneja Feb 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a test where a server becomes unresponsive (i.e unhealthy), is removed out of rotation, and then becomes responsive again and is added back to the rotation ?

}

@PostConstruct
public void startConfigReloadTask()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a test for this scenario - the file config gets updated, old servers are removed. New ones get added, existing ones stay as-is

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants