Description
During investigation of ICM 699060688, we identified a gap in our ability to proactively detect transient gRPC communication failures between the Functions Host and language workers. Currently, when gRPC communication fails (connection refused, timeout, etc.), we lack metrics to aggregate and alert on these failures.
Background:
- SLA site ping failure occurred due to transient gRPC communication issue with Java 11 worker
- Issue self-mitigated but we had no visibility into failure frequency
- Similar incidents could occur without detection until they impact customer-facing metrics
Proposed Changes:
- Add new metric constants to src/WebJobs.Script/Diagnostics/MetricEventNames.cs:
    // gRPC communication metrics
    public const string WorkerGrpcConnectionFailed = "{0}worker.grpc.connection.failed";
    public const string WorkerGrpcMessageSendFailed = "{0}worker.grpc.message.send.failed";
    public const string WorkerGrpcInitTimeout = "{0}worker.grpc.init.timeout";
    public const string WorkerGrpcTransientError = "{0}worker.grpc.transient.error";
- Emit metrics in src/WebJobs.Script.Grpc/Channel/GrpcWorkerChannel.cs:
    - HandleWorkerStartStreamError → emit WorkerGrpcConnectionFailed
    - HandleWorkerInitError → emit WorkerGrpcInitTimeout (when TimeoutException)
    - SendStreamingMessageAsync failure path → emit WorkerGrpcMessageSendFailed
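To make the emission points above concrete, here is a rough sketch of how the three call sites might map to the new constants. It is not the host's actual code: `IMetricSink`, `InMemoryMetricSink`, and `GrpcFailureMetrics` are hypothetical stand-ins invented for illustration, and treating `{0}` as a worker-runtime prefix is an assumption based on the constant format in this proposal.

```csharp
using System;
using System.Collections.Generic;

// Constants copied from the proposal. {0} is ASSUMED to be a worker-runtime prefix.
public static class MetricEventNames
{
    public const string WorkerGrpcConnectionFailed = "{0}worker.grpc.connection.failed";
    public const string WorkerGrpcMessageSendFailed = "{0}worker.grpc.message.send.failed";
    public const string WorkerGrpcInitTimeout = "{0}worker.grpc.init.timeout";
}

// HYPOTHETICAL stand-in for whatever metrics logger the host exposes.
public interface IMetricSink
{
    void LogEvent(string eventName);
}

// HYPOTHETICAL in-memory sink, handy for unit-testing the emission logic.
public sealed class InMemoryMetricSink : IMetricSink
{
    public List<string> Events { get; } = new List<string>();

    public void LogEvent(string eventName) => Events.Add(eventName);
}

// HYPOTHETICAL helper mirroring the three emission points listed above.
public sealed class GrpcFailureMetrics
{
    private readonly IMetricSink _sink;
    private readonly string _runtime;

    public GrpcFailureMetrics(IMetricSink sink, string runtime)
    {
        _sink = sink ?? throw new ArgumentNullException(nameof(sink));
        _runtime = runtime ?? string.Empty;
    }

    // Would be called from HandleWorkerStartStreamError.
    public void OnStartStreamError() => Emit(MetricEventNames.WorkerGrpcConnectionFailed);

    // Would be called from HandleWorkerInitError; only timeouts are counted.
    public void OnInitError(Exception error)
    {
        if (error is TimeoutException)
        {
            Emit(MetricEventNames.WorkerGrpcInitTimeout);
        }
    }

    // Would be called from the SendStreamingMessageAsync failure path.
    public void OnSendFailure() => Emit(MetricEventNames.WorkerGrpcMessageSendFailed);

    private void Emit(string template) => _sink.LogEvent(string.Format(template, _runtime));
}
```

With a runtime prefix of `"java"`, calling `OnStartStreamError()` would record `javaworker.grpc.connection.failed`. Keeping the emission behind a small interface like this would let the failure paths be unit-tested without a live gRPC channel, though the real change may simply call the existing metrics logger inline.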