Add health check to route queries to healthy cluster and router jmx counter updates #24449

auden-woolfson · 2025-01-28T22:44:56Z

Description

Add coordinator health checks to the Presto router to ensure queries are sent only to active, healthy clusters. Part of the Presto router forward fit. Also includes code implemented to patch CVEs.

== RELEASE NOTES ==

General Changes
* Add coordinator health checks to presto router
* Add counter JMX metrics to presto router

linux-foundation-easycla · 2025-01-28T22:45:00Z

✅ login: auden-woolfson / name: Auden Woolfson (490281b, 84c49f5, 590fffb, 32a0f2c, 217e359, 0d7cafe, 235a9a2, 8b82a56, 0472c5a, 6b91289, e8efa14, 8800275, 938a364, 9965e6e, a3643b3, 9f4cdd1, 0aa374e, 5f6115c, 794d01e, db481aa, d31afa5, e572ebb, 1954f4c, a9c285d, 4d94830, 4033720, 88c41f0)
❌ The email address for the commit (ccaa707) is not linked to the GitHub account, preventing the EasyCLA check. Consult this Help Article and GitHub Help to resolve. (To view the commit's email address, add .patch at the end of this PR page's URL.) For further assistance with EasyCLA, please submit a support request ticket.

auden-woolfson · 2025-02-03T22:29:40Z

Will fix CLA and squash commits upon review

saravanan19 · 2025-02-17T18:21:17Z

@auden-woolfson - do we plan on including counter JMX metrics as part of this PR? Is that intentional? If so, we need the PR title and description updated. If not, we need a separate PR.

aaneja

Overall approach looks good. Please squash and split into logically cohesive commits

aaneja · 2025-02-18T12:39:28Z

presto-router/src/main/java/com/facebook/presto/router/cluster/ClusterManager.java

+    public void startConfigReloadTask()
+    {
+        File routerConfigFile = new File(routerConfig.getConfigFile());
+        //ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(1);


aaneja · 2025-02-18T12:43:08Z

presto-router/src/main/java/com/facebook/presto/router/cluster/ClusterManager.java

    {
+        this.routerConfig = config;
+        this.scheduledExecutorService = scheduledExecutorService;


Why not just instantiate the scheduledExecutorService here ? Why inject it ?

I'm not sure, @saravanan19 is there a reason for this?

aaneja · 2025-02-18T12:45:55Z

presto-router/src/main/java/com/facebook/presto/router/cluster/ClusterManager.java

+                }
+                lastConfigUpdate.set(newConfigUpdateTime);
+            }
+        }, 0L, (long) 5, TimeUnit.SECONDS);


(long) 5 -> 5L

aaneja · 2025-02-18T13:19:39Z

presto-router/src/main/java/com/facebook/presto/router/cluster/ClusterManager.java

+    }
+
+    @PostConstruct
+    public void startConfigReloadTask()


Consider using a FileWatcher instead.
1/ Updates are lower latency
2/ Saves on having an extra bg thread
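For reference, a minimal sketch of what a watcher-based reload trigger could look like, assuming the suggestion refers to something like the JDK's `java.nio.file.WatchService`. `ConfigFileWatcher` and `awaitChange` are hypothetical names for illustration, not classes from presto-router. Note that `WatchService` watches directories rather than individual files, and that on some platforms the JDK implementation falls back to polling, so the latency benefit is platform dependent.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.StandardWatchEventKinds;
import java.nio.file.WatchEvent;
import java.nio.file.WatchKey;
import java.nio.file.WatchService;
import java.util.concurrent.TimeUnit;

// Hypothetical helper for illustration; not a class from presto-router.
class ConfigFileWatcher
        implements AutoCloseable
{
    private final Path configFile;
    private final WatchService watchService;

    ConfigFileWatcher(Path configFile)
            throws IOException
    {
        this.configFile = configFile.toAbsolutePath();
        this.watchService = this.configFile.getFileSystem().newWatchService();
        // WatchService watches directories, so register the parent of the config file.
        this.configFile.getParent().register(
                watchService,
                StandardWatchEventKinds.ENTRY_CREATE,
                StandardWatchEventKinds.ENTRY_MODIFY);
    }

    // Blocks for up to the given timeout; returns true once the config file
    // has been created or modified, false if the timeout elapses first.
    boolean awaitChange(long timeout, TimeUnit unit)
            throws InterruptedException
    {
        long deadline = System.nanoTime() + unit.toNanos(timeout);
        while (true) {
            long remaining = deadline - System.nanoTime();
            if (remaining <= 0) {
                return false;
            }
            WatchKey key = watchService.poll(remaining, TimeUnit.NANOSECONDS);
            if (key == null) {
                return false;
            }
            boolean changed = false;
            for (WatchEvent<?> event : key.pollEvents()) {
                // OVERFLOW means events were lost, so conservatively report a change.
                if (event.kind() == StandardWatchEventKinds.OVERFLOW
                        || configFile.getFileName().equals(event.context())) {
                    changed = true;
                }
            }
            // Re-arm the key so later events are delivered.
            key.reset();
            if (changed) {
                return true;
            }
        }
    }

    @Override
    public void close()
            throws IOException
    {
        watchService.close();
    }
}
```

A caller would still need one thread blocked in `awaitChange`, so whether this actually saves a background thread depends on how the reload is wired in.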

aaneja · 2025-02-18T13:32:47Z

presto-router/src/main/java/com/facebook/presto/router/cluster/RemoteState.java


-    public RemoteState(HttpClient httpClient, URI remoteUri)
+    private Boolean isHealthy = false;
+    private long lastHealthyResponseTime;


Instead of long, prefer using Instant to track an accurate timestamp since last response
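A rough sketch of the `Instant`-based variant, to make the suggestion concrete. `HealthStamp`, `staleAfter` and the method names are assumptions for illustration and do not reflect the actual `RemoteState` fields. The current time is passed in explicitly here only so the behaviour can be checked deterministically.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical stand-in for the health-related part of RemoteState.
class HealthStamp
{
    private final Duration staleAfter;
    // null until the first healthy response has been recorded
    private volatile Instant lastHealthyResponse;

    HealthStamp(Duration staleAfter)
    {
        this.staleAfter = staleAfter;
    }

    void recordHealthyResponse(Instant now)
    {
        lastHealthyResponse = now;
    }

    // Healthy only if a healthy response was seen within the staleAfter window.
    boolean isHealthy(Instant now)
    {
        Instant last = lastHealthyResponse;
        return last != null
                && Duration.between(last, now).compareTo(staleAfter) <= 0;
    }
}
```

Compared with a raw `long`, the `Instant` and `Duration` types carry their units with them, so the staleness comparison cannot silently mix milliseconds and nanoseconds.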

aaneja · 2025-02-18T15:12:17Z

presto-router/src/main/java/com/facebook/presto/router/cluster/ClusterManager.java

        }

+        scheduler.setCandidates(healthyClusterURIs);
+        if (schedulerType == WEIGHTED_RANDOM_CHOICE || schedulerType == WEIGHTED_ROUND_ROBIN) {


This looks like an unrelated change to health check. Can you make a new PR for this change instead ?

@jp-sivaprasad, can you please add this to #24580?

aaneja · 2025-02-18T15:13:46Z

presto-router/src/main/java/com/facebook/presto/router/RouterModule.java

        binder.bind(ClusterManager.class).in(Scopes.SINGLETON);
        binder.bind(RemoteInfoFactory.class).in(Scopes.SINGLETON);

        bindHttpClient(binder, QUERY_TRACKER, ForQueryInfoTracker.class, IDLE_TIMEOUT_SECOND, REQUEST_TIMEOUT_SECOND);
        bindHttpClient(binder, QUERY_TRACKER, ForClusterInfoTracker.class, IDLE_TIMEOUT_SECOND, REQUEST_TIMEOUT_SECOND);

+        //Determine the NodeVersion
+        NodeVersion nodeVersion = new NodeVersion(serverConfig.getPrestoVersion());
+        binder.bind(NodeVersion.class).toInstance(nodeVersion);


Is this used ?

aaneja · 2025-02-18T15:20:24Z

presto-router/pom.xml

@@ -17,6 +17,11 @@
    </properties>

    <dependencies>
+        <dependency>
+            <groupId>com.facebook.presto</groupId>


Is only TestingPrestoServer used from presto-main ? Or are there other types ? This can be a test-only dependency ?

We may be able to make this a test only dep for this PR, but there are other parts of router that we are forward fitting that will rely on this dependency in the main module. It might be beneficial to just leave this

Not a fan of pulling in a huge dependency like presto-main into presto-router. What other PRs are bringing this in? Let's see if we can avoid this by building good mocks for the Presto server. I think we may get by with just a few API endpoints mocked.

I believe the authentication piece uses this. Others might as well. We can switch this for this PR, but I can't guarantee that we will be able to leave it like this. We can cross that bridge when we get to it.

Actually on second thought we might still need this. We are binding multiple classes (ServerConfig, WebUiResource, PluginManagerConfig) in the RouterModule from presto main.

aaneja · 2025-02-18T17:11:59Z

presto-router/src/test/java/com/facebook/presto/router/TestHealthChecks.java

+    @Test(enabled = false)
+    public void testHealthChecks()
+    {
+        prestoServers.get(0).stopResponding();


Can you add a test where a server becomes unresponsive (i.e., unhealthy), is removed from rotation, and then becomes responsive again and is added back to the rotation?
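The expected behaviour in that scenario can be sketched with a small in-memory model. This is only an illustration of the rotation logic the test would assert on: `ClusterRotation`, `recordHealthCheck` and `healthyClusters` are hypothetical names, and a real test would presumably drive this through the router and test servers instead.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of "which clusters are currently in rotation".
class ClusterRotation
{
    // Insertion-ordered so the candidate list is stable across health flips.
    private final Map<URI, Boolean> health = new LinkedHashMap<>();

    ClusterRotation(List<URI> clusters)
    {
        for (URI cluster : clusters) {
            health.put(cluster, true);
        }
    }

    // Records the outcome of one health check; unknown clusters are ignored.
    synchronized void recordHealthCheck(URI cluster, boolean responded)
    {
        if (health.containsKey(cluster)) {
            health.put(cluster, responded);
        }
    }

    // Clusters that queries may currently be routed to.
    synchronized List<URI> healthyClusters()
    {
        List<URI> result = new ArrayList<>();
        for (Map.Entry<URI, Boolean> entry : health.entrySet()) {
            if (entry.getValue()) {
                result.add(entry.getKey());
            }
        }
        return result;
    }
}
```

The test would then check three states in order: both clusters in rotation, only the responsive one after a failed check, and both again after the first cluster recovers.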

aaneja · 2025-02-18T17:13:51Z

presto-router/src/main/java/com/facebook/presto/router/cluster/ClusterManager.java

+    }
+
+    @PostConstruct
+    public void startConfigReloadTask()


Can you add a test for this scenario: the config file gets updated, old servers are removed, new ones are added, and existing ones stay as-is?
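One way to picture the reconciliation such a test would cover is below. It is a sketch under assumptions: `ClusterReconciler`, `reconcile` and the generic per-cluster state `S` are hypothetical and not taken from `ClusterManager`.

```java
import java.net.URI;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

// Hypothetical helper showing the remove / add / keep-as-is behaviour.
final class ClusterReconciler
{
    private ClusterReconciler() {}

    // Builds the next cluster map from the reloaded config:
    // - URIs no longer configured are dropped,
    // - newly configured URIs get fresh state from stateFactory,
    // - URIs present before and after keep their existing state object.
    static <S> Map<URI, S> reconcile(
            Map<URI, S> current,
            Set<URI> configured,
            Function<URI, S> stateFactory)
    {
        Map<URI, S> next = new LinkedHashMap<>();
        for (URI uri : configured) {
            S existing = current.get(uri);
            next.put(uri, existing != null ? existing : stateFactory.apply(uri));
        }
        return next;
    }
}
```

Asserting on object identity for the retained entries is what distinguishes "stays as-is" from "removed and re-created", which matters if the per-cluster state carries health history.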


prestodb-ci added the from:IBM PR from IBM label Jan 28, 2025

auden-woolfson added the presto-router label Jan 28, 2025

auden-woolfson force-pushed the router_health_checks branch from be503f7 to e4de4ba Compare January 30, 2025 06:06

auden-woolfson marked this pull request as ready for review February 3, 2025 22:29

auden-woolfson requested review from vinothchandar, 7c00 and a team as code owners February 3, 2025 22:29

auden-woolfson requested a review from presto-oss February 3, 2025 22:29

Auden Woolfson and others added 2 commits February 3, 2025 15:40

Add health check to route queries to healthy cluster

ccaa707

clean files

0d7cafe

auden-woolfson force-pushed the router_health_checks branch from e4de4ba to 0d7cafe Compare February 3, 2025 23:41

auden-woolfson added 11 commits February 4, 2025 10:04

add back router config

6b91289

organize pom

938a364

format

e8efa14

fix poms

9965e6e

rm router_ui

235a9a2

revert ui

32a0f2c

create health checks test

490281b

disable test

217e359

add licence header

590fffb

fix checkstyle;

84c49f5

fix modernizer

db481aa

auden-woolfson changed the title ~~Add health check to route queries to healthy cluster~~ Add health check to route queries to healthy cluster and router jmx counter updates Feb 17, 2025

aaneja requested changes Feb 18, 2025

View reviewed changes

auden-woolfson added 3 commits February 18, 2025 12:01

rm config reload task

9f4cdd1

rm comment

0472c5a

long formatting

a3643b3

auden-woolfson and others added 11 commits February 18, 2025 13:19

add file watcher

8800275

boolean

0aa374e

Update presto-router/src/main/java/com/facebook/presto/router/cluster…

8b82a56

…/RemoteState.java Co-authored-by: Anant Aneja <[email protected]>

fix remote state

794d01e

fix

e572ebb

rm node from router module

5f6115c

style

1954f4c

use java time duration

4d94830

style

d31afa5

make cluster status tracker inner class

a9c285d

add presto.version

4033720

aaneja mentioned this pull request Feb 20, 2025

Metrics based custom scheduler plugin #24439

Open

6 tasks

rm node from router module

88c41f0
