Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add try/catch logic for xds stream initialization #1012

Merged
merged 1 commit into from
Aug 7, 2024

Conversation

brycezhongqing
Copy link
Collaborator

@brycezhongqing brycezhongqing commented Aug 7, 2024

Summary

For the _executorService in the xdsClientImpl, there is no try/catch for the xds stream initialization so the runtime exception is swallowed and hard to find out the root cause for INDIS connect creation failed. Here we add the try/catch to the startRpcStreamLocal to get the error for debugging more convinient in the future.

📔: I have also tried to setUncaughtExceptionHandler to the _executorService in container, like

final ScheduledExecutorServiceFactory.Cfg xdsExecutorServiceCfg = new ScheduledExecutorServiceFactory.Cfg();
    xdsExecutorServiceCfg.traceContext = "xdsExecutor";
    xdsExecutorServiceCfg.threadType = ServiceCallExecutorService.ThreadType.WORKER;
    xdsExecutorServiceCfg.threadFactory = new NamedThreadFactory("Indis xDS client executor", new UncaughtExceptionHandler());
    ScheduledExecutorService xdsExecutor = configOverrideAndGetBean(ScheduledExecutorServiceFactory.class,
        view.getScope().child(XDS_EXECUTOR_PREFIX), xdsExecutorServiceCfg);

but it didn't work to get the throwable as there mentioned
an UncaughtExceptionHandler can be specified in the constructor if uncaught exceptions need to be caught. If the NamedThreadFactory is used with an ExecutorService then the exceptions are caught and stashed away, so there is no point in using specifying an UncaughtExceptionHandler in those cases.

Then I found for

configOverrideAndGetBean(ScheduledExecutorServiceFactory.class,
        view.getScope().child(XDS_EXECUTOR_PREFIX), xdsExecutorServiceCfg);

what finally return is ScheduledThreadPoolExecutor

I also wrote the unit test to verify and confirm setUncaughtExceptionHandler would never be triggered for ScheduledExecutorService. I thought it's related to java bug

package com.linkedin.d2.client.factory;

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ThreadFactory;

public class Example {
  public static void main(String[] args) {
    // Create a NamedThreadFactory with a custom name
    ThreadFactory namedThreadFactory = r -> {
      Thread t = new Thread(r);
      System.out.println("Created new thread: " + t);
      t.setUncaughtExceptionHandler((t1, e) -> System.out.println("Uncaught exception: " + e));
      return t;
    };

    ScheduledExecutorService executorService = Executors.newSingleThreadScheduledExecutor(namedThreadFactory);

    executorService.execute(() -> {
      throw new RuntimeException("This exception will be caught by the UncaughtExceptionHandler");
    });
    executorService.shutdown();
  
  }
}

Test Done

  1. Add exclude group: 'io.envoyproxy.controlplane' to the build.gradle
  2. deploy in qei and check the log
2024/08/06 22:31:27.158 ERROR [XdsClientImpl] [Indis xDS client executor-4-1] [seas-federated-gateway-war] [AAYfETi+GiKzsvuxJY7QfQ==] Unexpected exception while starting RPC stream
java.lang.NoClassDefFoundError: io/envoyproxy/envoy/service/discovery/v3/AggregatedDiscoveryServiceGrpc
	at com.linkedin.d2.xds.XdsClientImpl.startRpcStreamLocal(XdsClientImpl.java:206) ~[d2-29.58.1-SNAPSHOT.jar:?]
	at com.linkedin.d2.xds.XdsClientImpl.lambda$startRpcStream$3(XdsClientImpl.java:178) ~[d2-29.58.1-SNAPSHOT.jar:?]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?]
	at com.linkedin.container.servicecall.trace.internal.ServiceCallTraceHandlerImpl.repairCallTree(ServiceCallTraceHandlerImpl.java:129) [com.linkedin.container-core.container-servicecall-trace-impl-1.3.274.jar:?]
	at com.linkedin.container.servicecall.ServiceCallExecutorService$TraceDataCallableDecorator.call(ServiceCallExecutorService.java:139) [com.linkedin.container-core.container-servicecall-api-1.3.274.jar:?]
	at com.linkedin.container.servicecall.internal.IC1ExecutionContextPropagator$IC1CallableDecorator.call(IC1ExecutionContextPropagator.java:60) [com.linkedin.container-core.container-servicecall-impl-1.3.274.jar:?]
	at com.linkedin.container.servicecall.internal.IC1ExecutionContextPropagator.call(IC1ExecutionContextPropagator.java:33) [com.linkedin.container-core.container-servicecall-impl-1.3.274.jar:?]
	at com.linkedin.container.servicecall.internal.ExecutionContextManagerImpl$1.call(ExecutionContextManagerImpl.java:118) [com.linkedin.container-core.container-servicecall-impl-1.3.274.jar:?]
	at io.grpc.Context.call(Context.java:551) [io.grpc.grpc-api-1.59.1.jar:1.59.1]
	at com.linkedin.opentelemetry.context.IoContextPropagator$IoExecutionContext.call(IoContextPropagator.java:40) [com.linkedin.container-core.opentelemetry-sdk-context-1.3.274.jar:?]
	at com.linkedin.opentelemetry.context.IoContextPropagator.call(IoContextPropagator.java:26) [com.linkedin.container-core.opentelemetry-sdk-context-1.3.274.jar:?]
	at com.linkedin.container.servicecall.internal.ExecutionContextManagerImpl$1.call(ExecutionContextManagerImpl.java:118) [com.linkedin.container-core.container-servicecall-impl-1.3.274.jar:?]
	at com.linkedin.container.servicecall.internal.DefaultExecutionContextPropagator$MDCCallableDecorator.call(DefaultExecutionContextPropagator.java:100) [com.linkedin.container-core.container-servicecall-impl-1.3.274.jar:?]
	at com.linkedin.container.ic2.internal.ThreadLocalICManager.callWithIC(ThreadLocalICManager.java:120) [com.linkedin.container-core.container-ic-impl-1.3.274.jar:?]
	at com.linkedin.container.ic2.internal.ThreadLocalICManager.callWithNewIC(ThreadLocalICManager.java:101) [com.linkedin.container-core.container-ic-impl-1.3.274.jar:?]
	at com.linkedin.container.servicecall.internal.DefaultExecutionContextPropagator.call(DefaultExecutionContextPropagator.java:48) [com.linkedin.container-core.container-servicecall-impl-1.3.274.jar:?]
	at com.linkedin.container.servicecall.internal.ExecutionContextManagerImpl$1.call(ExecutionContextManagerImpl.java:118) [com.linkedin.container-core.container-servicecall-impl-1.3.274.jar:?]
	at com.linkedin.container.servicecall.internal.ExecutionContextManagerImpl.call(ExecutionContextManagerImpl.java:76) [com.linkedin.container-core.container-servicecall-impl-1.3.274.jar:?]
	at com.linkedin.container.servicecall.ServiceCallExecutorService$DecoratedCallable.call(ServiceCallExecutorService.java:117) [com.linkedin.container-core.container-servicecall-api-1.3.274.jar:?]
	at com.linkedin.util.concurrent.AbstractExecutorServiceDecorator$2.run(AbstractExecutorServiceDecorator.java:56) [com.linkedin.util.util-core-28.3.101.jar:?]
	at com.linkedin.container.concurrent.InstrumentedScheduledThreadPoolExecutor$InstrumentedRunnable.run(InstrumentedScheduledThreadPoolExecutor.java:98) [com.linkedin.container-core.container-impl-1.3.274.jar:?]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?]
	at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) [?:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
	at java.lang.Thread.run(Thread.java:834) [?:?]
Caused by: java.lang.ClassNotFoundException: io.envoyproxy.envoy.service.discovery.v3.AggregatedDiscoveryServiceGrpc
	at jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:581) ~[?:?]
	at jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:178) ~[?:?]
	at java.lang.ClassLoader.loadClass(ClassLoader.java:522) ~[?:?]
	at org.eclipse.jetty.webapp.WebAppClassLoader.loadClass(WebAppClassLoader.java:538) ~[?:?]
	at java.lang.ClassLoader.loadClass(ClassLoader.java:522) ~[?:?]
	... 27 more

test.log

@PapaCharlie
Copy link
Member

Wow, I'm surprised there's just a Java bug here.... That's crazy. Either way, thanks for adding this log in! Now we will know ahead of time if this breaks for the same reason!

@brycezhongqing brycezhongqing merged commit cd385ff into master Aug 7, 2024
2 checks passed
@brycezhongqing brycezhongqing deleted the zhonchen/try_catch_xds_init_error branch August 7, 2024 16:27
Copy link
Contributor

@bohhyang bohhyang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.
Why this line was needed btw?
"xdsExecutorServiceCfg.threadType = ServiceCallExecutorService.ThreadType.WORKER;"

@brycezhongqing
Copy link
Collaborator Author

LGTM. Why this line was needed btw? "xdsExecutorServiceCfg.threadType = ServiceCallExecutorService.ThreadType.WORKER;"

Hey @bohhyang I have tried to add / not add ServiceCallExecutorService.ThreadType.WORKER there just to avoid any misconfig, actually
WORKER associates a copy of the request thread's IC with the worker thread we could ignore this param

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants