Add gRPC communication failure metrics for worker channel diagnostics #11595

@liliankasem

Description

During investigation of ICM 699060688, we identified a gap in our ability to proactively detect transient gRPC communication failures between the Functions Host and language workers. Currently, when gRPC communication fails (connection refused, timeout, etc.), we lack metrics to aggregate and alert on these failures.

Background:

  • SLA site ping failure occurred due to transient gRPC communication issue with Java 11 worker
  • Issue self-mitigated but we had no visibility into failure frequency
  • Similar incidents could occur without detection until they impact customer-facing metrics

Proposed Changes:

  1. Add new metric constants to src/WebJobs.Script/Diagnostics/MetricEventNames.cs:

     // gRPC communication metrics
     public const string WorkerGrpcConnectionFailed = "{0}worker.grpc.connection.failed";
     public const string WorkerGrpcMessageSendFailed = "{0}worker.grpc.message.send.failed";
     public const string WorkerGrpcInitTimeout = "{0}worker.grpc.init.timeout";
     public const string WorkerGrpcTransientError = "{0}worker.grpc.transient.error";

  2. Emit metrics in src/WebJobs.Script.Grpc/Channel/GrpcWorkerChannel.cs:
    • HandleWorkerStartStreamError → emit WorkerGrpcConnectionFailed
    • HandleWorkerInitError → emit WorkerGrpcInitTimeout (when TimeoutException)
    • SendStreamingMessageAsync failure path → emit WorkerGrpcMessageSendFailed
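
To make the mapping between handlers and metrics concrete, here is a minimal sketch of the three emit points. It is not the actual GrpcWorkerChannel code. IMetricSink, InMemoryMetricSink, ChannelMetricsSketch, and OnSendStreamingMessageFailed are hypothetical names introduced only for illustration, and treating the {0} placeholder as a runtime-specific prefix such as "java." is an assumption. WorkerGrpcTransientError is left out because no emit point is named for it above.

```csharp
using System;
using System.Collections.Generic;

// Constants copied from the proposal above.
public static class MetricEventNames
{
    public const string WorkerGrpcConnectionFailed = "{0}worker.grpc.connection.failed";
    public const string WorkerGrpcMessageSendFailed = "{0}worker.grpc.message.send.failed";
    public const string WorkerGrpcInitTimeout = "{0}worker.grpc.init.timeout";
}

// Hypothetical stand-in for the host's metrics logger; not the real interface.
public interface IMetricSink
{
    void LogEvent(string eventName);
}

// Test double that records every emitted event name in order.
public sealed class InMemoryMetricSink : IMetricSink
{
    public List<string> Events { get; } = new List<string>();

    public void LogEvent(string eventName) => Events.Add(eventName);
}

// Hypothetical sketch of the three emit points; the real channel holds far more state.
public sealed class ChannelMetricsSketch
{
    private readonly IMetricSink _metrics;
    private readonly string _prefix; // Assumption: {0} is a runtime prefix such as "java."

    public ChannelMetricsSketch(IMetricSink metrics, string prefix)
    {
        _metrics = metrics;
        _prefix = prefix;
    }

    // Start-stream errors count as connection failures.
    public void HandleWorkerStartStreamError(Exception ex) =>
        Emit(MetricEventNames.WorkerGrpcConnectionFailed);

    public void HandleWorkerInitError(Exception ex)
    {
        // Only timeouts map to the init-timeout metric, per the proposal.
        if (ex is TimeoutException)
        {
            Emit(MetricEventNames.WorkerGrpcInitTimeout);
        }
    }

    // Called from the failure path of a send; the method name is hypothetical.
    public void OnSendStreamingMessageFailed(Exception ex) =>
        Emit(MetricEventNames.WorkerGrpcMessageSendFailed);

    private void Emit(string template) =>
        _metrics.LogEvent(string.Format(template, _prefix));
}
```

Routing every emit through one helper keeps the name formatting in a single place, so an alert rule can aggregate on the suffix (for example worker.grpc.connection.failed) regardless of the prefix.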
