Skip to content

CachingRouteLocator does not properly refresh routes upon RefreshRoutesEvent due to a racing condition #3005

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
ATAlight opened this issue Jul 17, 2023 · 6 comments

Comments

@ATAlight
Copy link

Describe the bug
I have a spring cloud gateway configured to derive routes from Consul discovery client. When a service gets registered or deregistered in Consul, the gateway application receives the event, correctly identifies that there's been a configuration change and correctly fires event RefreshRoutesEvent.

As expected, listener method in CachingRouteLocator

public void onApplicationEvent(RefreshRoutesEvent event) {
	try {
		fetch().collect(Collectors.toList()).subscribe(
				list -> Flux.fromIterable(list).materialize().collect(Collectors.toList()).subscribe(signals -> {
					applicationEventPublisher.publishEvent(new RefreshRoutesResultEvent(this));
					cache.put(CACHE_KEY, signals);
				}, this::handleRefreshError), this::handleRefreshError);
	}
	catch (Throwable e) {
		handleRefreshError(e);
	}
}

receives the event and updates the routes. However, the list of routes it receives to update cache here: cache.put(CACHE_KEY, signals); is stale. It reflects the previous configuration. So if a new service was registered, it's still missing in the cached list resulting in a missing route configuration for the newly registered service.

If I try to set a breakpoint somewhere, it starts working properly. My assumption is there's a racing condition somewhere. It could either happen locally or maybe Consul restful API has some kind of delay and does not provide latest configuration even after it has sent an update event to the gateway.

Introducing a listener that processes RefreshRoutesResultEvent fired from code above like this:

@EventListener(RefreshRoutesResultEvent.class)
public void onApplicationEvent(RefreshRoutesResultEvent event) {
	if (routeLocator instanceof CachingRouteLocator) {
		((CachingRouteLocator) routeLocator).refresh().subscribe(route -> {
			System.out.println("Routes refreshed: " + route.getId());
		});
	}
}

seems to be fixing the problem. I suspect this works due to the fact that RefreshRoutesResultEvent goes to the end of the event queue and causes to delay the underlying calls to Consul (or to some kind of cache in between). The problem is that I'm not supposed to do it. Also, it may not guarantee a bulletproof fix to a racing condition. Plus it's too heavy to refresh it multiple times.

This seems to be a bug especially since slowing down a thread corrects the problem. It's not supposed to allow for racing conditions like that.

This is the version I'm currently using: spring-cloud-gateway-server-3.1.8.jar

@clepet
Copy link

clepet commented Feb 20, 2024

We are facing exactly the same issue in our application.
Whenever a service gets registered in Consul all following requests from our frontend to this newly registered service end up in HTTP 404. When I then manually refresh the route cache of the gateway all requests are resolved and function correctly.
It seems like the problem ist the same as described here: spring-cloud/spring-cloud-consul#771
Especially the following part:

When queryPassing is true and the service is started after gateway, we observe:
Service starts and registers with consul - endpoints not yet started.
Route is not registered with gateway because /actuator/health not yet up on the service.
Service health check goes to passing.
Service route returns 404.
Another service starts and registers with consul, same thing happens, but first service now has its route registered because healthcheck is passing.

We are using spring-cloud-gateway-server 4.0.9.

We in our team are thinking that this bug is a pretty big issue because with it we cannot really guarantee the stability of our application. Are there any plans for fixing this bug?

@ATAlight unfortunately it seems like your workaround doesn't work for spring-cloud-gateway-server 4.0.9 anymore.

Edit:

We are using this solution as a workaround now. It does the same as triggering the "Refreshing the Route Cache" Actuator API every 30 seconds:

@Component
public class RefreshGatewayCacheTask {

    private final AbstractGatewayControllerEndpoint gatewayControllerEndpoint;

    public RefreshGatewayCacheTask(AbstractGatewayControllerEndpoint gatewayControllerEndpoint) {
        this.gatewayControllerEndpoint = gatewayControllerEndpoint;
    }

    @Scheduled(fixedDelay = 30000)
    public void refreshGatewayCache() {
        // log something like "Scheduled Task: Refreshing the gateway cache"
        this.gatewayControllerEndpoint.refresh(Collections.emptyList()).subscribe();
    }
}

For this to work you have to set the following property: management.endpoint.gateway.enabled=true so the AbstractGatewayControllerEndpoint gets created and can be used.

That seems to do the trick for now. I don't know however if this ends in weird behaviour or bad performance in the long run.

@spencergibb
Copy link
Member

There was a recent change around this, can you verify it is still an issue in 4.1.2-SNAPSHOT?

@spring-cloud-issues
Copy link

If you would like us to look at this issue, please provide the requested information. If the information is not provided within the next 7 days this issue will be closed.

@clepet
Copy link

clepet commented Mar 20, 2024

Thank you @spencergibb for replying and looking into this issue 👍 .
Unfortunately I think I won't be able to test the new snapshot until next week because of release stress in my project right now.
I will try to reply early next week.

@spencergibb
Copy link
Member

No problem

@clepet
Copy link

clepet commented Mar 21, 2024

I've found some time to test it with 4.1.2-SNAPSHOT.

It works!
Although the gateway seems to need around 30 - 60 seconds after a new service instance is healthy and registered in consul until it realizes the new service instance. Before that it still tries to reroute requests through the old, not existing service instance.

But that is still ok and way better than before 👍 .

What I did to test the behavior:

  1. Deployed the gateway with 4.1.2-SNAPSHOT (and without my previous workaround)
  2. Killed the only instance of another service via portainer
  3. Waited for the automatically started new service instance to be healthy and registered in consul
  4. Navigated our frontend in a way that it makes requests against the newly deployed service instance

For reference here are the exceptions I got in those 30 - 60 seconds until the gateway found the new service instance:

io.netty.channel.AbstractChannel$AnnotatedNoRouteToHostException: null: /<IP:PORT>
  Caused by: java.net.NoRouteToHostException: null 
     at io.netty.channel.unix.Errors.newConnectException0(Errors.java:158) 
     at io.netty.channel.unix.Errors.handleConnectErrno(Errors.java:131) 
     at io.netty.channel.unix.Socket.finishConnect(Socket.java:359) 
     at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.doFinishConnect(AbstractEpollChannel.java:710) 
     at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.finishConnect(AbstractEpollChannel.java:687) 
     at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.epollOutReady(AbstractEpollChannel.java:567) 
     at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:499) 
     at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:407) 
     at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997) 
     at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) 
     at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) 
     at java.base/java.lang.Thread.run(Thread.java:833)
io.netty.channel.AbstractChannel$AnnotatedNoRouteToHostException: null: /<IP:PORT>
  Suppressed: reactor.core.publisher.FluxOnAssembly$OnAssemblyException:  
  Error has been observed    at the following site(s): 
  *__checkpoint ⇢ <INTERNAL>
  *__checkpoint ⇢ org.springframework.cloud.gateway.filter.WeightCalculatorWebFilter [DefaultWebFilterChain] 
  *__checkpoint ⇢ AuthorizationWebFilter [DefaultWebFilterChain] 
  *__checkpoint ⇢ ExceptionTranslationWebFilter [DefaultWebFilterChain] 
  *__checkpoint ⇢ LogoutWebFilter [DefaultWebFilterChain] 
  *__checkpoint ⇢ ServerRequestCacheWebFilter [DefaultWebFilterChain] 
  *__checkpoint ⇢ SecurityContextServerWebExchangeWebFilter [DefaultWebFilterChain] 
  *__checkpoint ⇢ AuthenticationWebFilter [DefaultWebFilterChain] 
  *__checkpoint ⇢ ReactorContextWebFilter [DefaultWebFilterChain] 
  *__checkpoint ⇢ CsrfWebFilter [DefaultWebFilterChain] 
  *__checkpoint ⇢ HttpHeaderWriterWebFilter [DefaultWebFilterChain] 
  *__checkpoint ⇢ ServerWebExchangeReactorContextWebFilter [DefaultWebFilterChain] 
  *__checkpoint ⇢ org.springframework.security.web.server.WebFilterChainProxy [DefaultWebFilterChain] 
  *__checkpoint ⇢ org.springframework.web.filter.reactive.ServerHttpObservationFilter [DefaultWebFilterChain] 
  *__checkpoint ⇢ HTTP GET "<API>" [ExceptionHandlingWebHandler] 
  Original Stack Trace: 
  Caused by: java.net.NoRouteToHostException: null 
     at io.netty.channel.unix.Errors.newConnectException0(Errors.java:158) 
     at io.netty.channel.unix.Errors.handleConnectErrno(Errors.java:131) 
     at io.netty.channel.unix.Socket.finishConnect(Socket.java:359) 
     at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.doFinishConnect(AbstractEpollChannel.java:710) 
     at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.finishConnect(AbstractEpollChannel.java:687) 
     at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.epollOutReady(AbstractEpollChannel.java:567) 
     at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:499) 
     at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:407) 
     at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997) 
     at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) 
     at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) 
     at java.base/java.lang.Thread.run(Thread.java:833)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants