
Error handling ADR #2571

Open
scottgerring opened this issue Jan 28, 2025 · 19 comments

Comments

@scottgerring
Contributor

scottgerring commented Jan 28, 2025

Error handling patterns in public API interfaces

This issue follows the MADR template and is written in the style of an architectural decision record. It attempts to provide a top-level view of how we should handle errors in our public traits, something that's come up in a number of discrete issues and a PR now.

My hope is that we can collectively make a design decision and argue the details of that here, and then the actual PRs and implementations become straightforward.

Additionally, if we like this format (ADRs!), we could begin capturing these in the repository itself for future reference, along with other architectural artefacts.

Context and Problem Statement

There is uncertainty around how to model errors in the opentelemetry-rust public API interfaces - that is, the APIs facing consumers. At the time of writing this is an important issue to resolve, as beginning to move the signals towards RC and eventually a stable release is an urgent priority.

The situation is as follows; a concrete example is given, but the issue holds across various public traits, in particular the exporters:

  • A given public interface in opentelemetry-sdk, such as trait LogExporter
  • ... exposes multiple discrete actions with logically disjoint error types (e.g. export and shutdown) - that is, the class of errors returned for each of these actions is foreseeably very different, as is the caller's reaction to them
  • ... is implemented by multiple concrete types such as InMemoryLogExporter, OtlpLogExporter, StdOutLogExporter that have different error requirements - for instance, an OtlpLogExporter will experience network failures, while an InMemoryLogExporter will not
  • ... potentially has operations on the API that, either in the direct implementation or in a derived utility that builds on it, call multiple API actions and therefore need to return an aggregated error type

Today, we have a situation where a single error type is used per API trait, and some methods simply swallow their errors. In the example above of LogExporter, shutdown swallows errors, and export returns the LogError type - a type that could conceptually be thought of as belonging to the entire trait, not a particular method. For the exporters, the opentelemetry-specification tells us that they need to indicate success or failure, with a distinction made between 'failed' and 'timed out'.

There are also similar examples in the builders for providers and exporters.
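To make the status quo concrete, here is a simplified sketch of the shape described above (signatures and types are illustrative, not the exact current API):

pub struct LogRecord; // placeholder payload type, for illustration only

#[derive(Debug)]
pub struct LogError(pub String); // the single trait-wide error type (simplified)

pub trait LogExporter {
    // Every fallible method funnels into the one trait-wide error type...
    fn export(&mut self, batch: Vec<LogRecord>) -> Result<(), LogError>;

    // ...while shutdown swallows its errors entirely.
    fn shutdown(&mut self) {}
}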

Considered Options

Option 1: Continue as is
Continue the status quo, returning a mix of either nothing or the trait-wide error type. This is inconsistent and limits the caller's ability to handle errors.

Option 2: Extend trait-wide error type to all methods on trait
In this option we keep the existing error type, add it to the remaining methods on the trait, and extend it to cover the new error conditions. This will mean that callers have to know how and when to discard errors from a particular API call, based on an understanding of which subset of errors that particular call can produce.

Conversely, it will reduce the number of error types in the code base.
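A sketch of the resulting shape (variant names are hypothetical):

pub struct LogRecord; // placeholder payload type, for illustration only

// Option 2 sketch: one widened error type shared by every trait method.
#[derive(Debug)]
pub enum LogError {
    ExportFailed,    // only ever produced by export()
    ExportTimedOut,  // only ever produced by export()
    ShutdownFailed,  // only ever produced by shutdown()
}

pub trait LogExporter {
    // Both methods share LogError, so a caller matching on the result of
    // shutdown() must know that the Export* variants can never occur there.
    fn export(&mut self, batch: Vec<LogRecord>) -> Result<(), LogError>;
    fn shutdown(&mut self) -> Result<(), LogError>;
}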

Option 3: Introduce an error-type per fallible operation, aggregate these into a single trait-wide error type

For example, in the above we'd have something like:

use std::time::Duration; // for ExportTimedOut

pub trait LogExporter {
    fn export(...) -> Result<..., ExportError>;
    fn shutdown(...) -> Result<..., ShutdownError>;
}

// Concrete errors for an export operation
#[derive(Debug, thiserror::Error)]
pub enum ExportError {
    // The distinction between failed and timed out is part of the OTEL spec
    // we need to meet.
    #[error("Export failed")]
    ExportFailed,

    #[error("Export timed out after {0:?}")]
    ExportTimedOut(Duration),

    // Allow impls to box up errors that can't be logically mapped
    // back to one of the API's errors
    #[error("Unknown error (should not occur): {source:?}")]
    Unknown {
        source: Box<dyn std::error::Error + Send + Sync>,
    },
}

// Aggregate error type for convenience
// Note: This will be added in response to need, not pre-emptively.
// ShutdownError is defined analogously to ExportError (not shown).
#[derive(Debug, thiserror::Error)]
pub enum LogError {
    #[error("Export error: {0}")]
    ExportError(#[from] ExportError),

    #[error("Shutdown error: {0}")]
    ShutdownError(#[from] ShutdownError),
}

// A downcast helper for callers that need to work with impl-specific
// unknown errors concretely
impl ExportError {
    /// Attempt to downcast the inner `source` error to a specific type `T`
    pub fn downcast_ref<T: std::error::Error + 'static>(&self) -> Option<&T> {
        if let ExportError::Unknown { source } = self {
            source.downcast_ref::<T>()
        } else {
            None
        }
    }
}
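For illustration, a caller that knows its concrete exporter could use the helper like this (GrpcTransportError is a made-up stand-in for whatever impl-specific error the exporter boxed up):

#[derive(Debug, thiserror::Error)]
#[error("gRPC transport error: {0}")]
struct GrpcTransportError(String); // hypothetical impl-specific error

fn handle_export_error(err: &ExportError) {
    if let Some(grpc) = err.downcast_ref::<GrpcTransportError>() {
        // Impl-specific remediation, e.g. rebuilding the channel
        eprintln!("gRPC transport failed: {grpc}");
    }
}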

Decision Outcome

Chosen option: "Option 3: Introduce an error-type per fallible operation, aggregate these into a single trait-wide error type"

Consequences

  • Good, because callers can handle focussed errors with focussed remediation
  • Good, because implementors of the pub traits can box up custom errors in a fashion that follows Canonical's error and panic discipline guide, without losing access to impl-specific errors
  • Good, because the per-trait error type (LogError for LogExporter above) provides consumers of the trait that hit multiple methods in a single operation an error type they can use
  • Bad, because there's more code than a single error type
  • Bad, because a caller may need to use downcast_ref if they have a known trait impl and want to handle an Unknown error
@scottgerring
Contributor Author

scottgerring commented Jan 29, 2025

There is one thing that irks me, which is that Canonical's error handling guide discourages type-erased errors - e.g. Box<dyn Error> - in library crates, because the compiler can't help you match exhaustively.

My gut feeling is that having a reasonably expressive domain-specific error on the trait is helpful - e.g. we can assume that ExportError::ExportTimedOut is likely to be useful for multiple impls - so there's some balance here - but I welcome opinions.

More Explicit 1 - Nest error type in trait

An alternative approach is found in the std library, where the trait uses a nested type for its error:

pub trait TryFrom<T>: Sized {
    // Our error type
    type Error;

    // Our function that returns some impl-specific error
    fn try_from(value: T) -> Result<Self, Self::Error>;
}

struct GreaterThanZero(i32);

impl TryFrom<i32> for GreaterThanZero {
    // In this case, let's make our error type a string
    type Error = &'static str;

    fn try_from(value: i32) -> Result<Self, Self::Error> {
        if value > 0 {
            Ok(GreaterThanZero(value))
        } else {
            Err("value must be greater than zero")
        }
    }
}
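Applied to LogExporter, that approach would look roughly like this (a sketch, assuming a placeholder LogRecord payload type):

pub struct LogRecord; // placeholder payload type, for illustration only

pub trait LogExporter {
    // Each impl picks its own concrete error; generic callers only know
    // that it implements std::error::Error.
    type Error: std::error::Error;

    fn export(&mut self, batch: Vec<LogRecord>) -> Result<(), Self::Error>;
}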

But this would make handling the error in generic callers difficult, because each impl chooses its own opaque error type.

@cijothomas
Member

// Aggregate error type for convenience
#[derive(Debug, thiserror::Error)]
pub enum LogError {

Can you elaborate on the reason for having this?

@cijothomas
Member

For components like Exporter's Export() method:
Is there any advantage having more variants than the below two?

ExportFailed,      
ExportTimedOut(Duration),

The processor is invoking exporter.export(), and apart from logging the fact that export() failed or timed out, what else could a processor do? A more specific reason why the export failed would be part of the Exporter's internal logs. For example, the OTLP gRPC exporter can fail due to various issues with the network - its gRPC library will return errors, and the Exporter can do appropriate logging with details on these, and simply return ExportFailed back to the caller. (Or ExportFailed(String).)

Trying to see if there are any actual advantages of letting the processor be more adventurous and try to downcast and get hold of the actual Error?

@cijothomas
Member

Another discussion on this topic that someone shared with me: BurntSushi/jiff#8

@cijothomas cijothomas added this to the 0.28.0 milestone Jan 29, 2025
@scottgerring
Contributor Author

// Aggregate error type for convenience
#[derive(Debug, thiserror::Error)]
pub enum LogError {

Can you elaborate on the reason for having this?

In this example, imagine some other thing uses LogExporter, and does an aggregate operation that uses both export and shutdown. This could also be a function on the trait itself, or a helper on an impl - flushThenShutdown; the detail isn't important, just the idea that such an aggregate operation will likely come up in various pub traits.

This gives us a convenient way of dealing with it. It could also be an identified pattern to use if and when the need arises, rather than preemptively.
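For illustration, with the #[from] conversions on LogError such a helper falls out naturally. This is a sketch: flush_then_shutdown and pending_batch are hypothetical, and the signatures assume the shapes sketched in the ADR above.

// Hypothetical aggregate operation: `?` converts each method's error
// into the aggregate LogError via the #[from] impls.
fn flush_then_shutdown(exporter: &mut impl LogExporter) -> Result<(), LogError> {
    exporter.export(pending_batch())?; // ExportError -> LogError
    exporter.shutdown()?;              // ShutdownError -> LogError
    Ok(())
}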

@cijothomas
Member

// Aggregate error type for convenience
#[derive(Debug, thiserror::Error)]
pub enum LogError {

Can you elaborate on the reason for having this?

In this example, imagine some other thing uses LogExporter, and does an aggregate operation that uses both export and shutdown. This could also be a function on the trait itself, or a helper on an impl - flushThenShutdown; the detail isn't important, just the idea that such an aggregate operation will likely come up in various pub traits.

This gives us a convenient way of dealing with it. It could also be an identified pattern to use if and when the need arises, rather than preemptively.

I think we can start without it, and if a future need arises, we can add it wherever it makes sense?

@scottgerring
Contributor Author

scottgerring commented Jan 29, 2025

For components like Exporter's Export() method: Is there any advantage having more variants than the below two?

ExportFailed,      
ExportTimedOut(Duration),

The processor is invoking exporter.export(), and apart from logging the fact that export() failed or timed out, what else could a processor do? A more specific reason why the export failed would be part of the Exporter's internal logs. For example, the OTLP gRPC exporter can fail due to various issues with the network - its gRPC library will return errors, and the Exporter can do appropriate logging with details on these, and simply return ExportFailed back to the caller. (Or ExportFailed(String).)

Trying to see if there are any actual advantages of letting the processor be more adventurous and try to downcast and get hold of the actual Error?

If we think about it from a "general pattern" perspective, I think it's foreseeable that an impl will have some ✨ special failure mode ✨ that could be retried by a savvy caller that depends on that particular impl directly. The trait doesn't have to foresee that exact interaction, it just has to provide an ergonomic escape hatch to let the trait impl and its dependent do their thing. My aim here is to tease out a pattern that can work in general in any "public trait with some extensible errors" case.

Super concretely, for the exporters, what are the foreseeable reactions a client could have to a failure? I think at least: retry or not. Here the particular error is a signal to the caller. Of these two, I would understand a timeout to be a retryable error, and a failure not to be. We'd probably want to be super clear on the semantics so the caller gets a hint about this - see the sketch below.
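A sketch of that caller-side handling (schedule_retry is a hypothetical helper, and the timeout-is-retryable semantics are the proposal here, not settled API):

match exporter.export(batch.clone()) {
    Ok(()) => {}
    Err(ExportError::ExportTimedOut(elapsed)) => {
        // Retryable: the backend may be slow, or may even have received
        // the data; back off and try again.
        schedule_retry(batch, elapsed); // hypothetical helper
    }
    Err(err) => {
        // Terminal: drop the batch and log the failure.
        eprintln!("export failed permanently: {err}");
    }
}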

What do you think?

@cijothomas
Member

Super concretely, for the exporters, what are the foreseeable reactions a client could have to a failure?

It is explicitly spec-ed out in OTel that the Processor should not take that responsibility. Retry, if any, is fully the Exporter's responsibility.

https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/logs/sdk.md#export

Concurrent requests and retry logic is the responsibility of the exporter. The default SDK's LogRecordProcessors SHOULD NOT implement retry logic, as the required logic is likely to depend heavily on the specific protocol and backend the logs are being sent to.

We can also literally follow the OTel spec wording - the exporter.export() method returns an indication of success/failure alone.
In Rust terms it can be a Result whose error variant is just the below.

ExportFailed,      
ExportTimedOut,

Ok - the equivalent of what the spec calls Success.
The Error variant is the equivalent of what the spec calls Failure. We distinguish Timeout, as this is the processor giving up on the exporter due to timeout, not necessarily a failure inside the exporter.

(I can imagine someone writing a custom exporting processor wanting more than that, but it is rare, and we can keep things simpler...)
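Spelled out in full, that literal mapping would be something like (a sketch):

// Ok(()) is the spec's Success; the error enum carries only the
// Failure / Timeout distinction.
#[derive(Debug)]
pub enum ExportError {
    ExportFailed,
    ExportTimedOut,
}

pub type ExportResult = Result<(), ExportError>;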

@lalitb
Member

lalitb commented Jan 29, 2025

ExportFailed,
ExportTimedOut,

Neither failure nor timeout is actionable by the processors. The retry (if any) is always supposed to be done by the exporter - do we even want to distinguish them? Which means simply keeping:

fn export() -> Result<(), ()>;

also works?

@scottgerring
Contributor Author

ExportFailed,
ExportTimedOut,

Neither failure nor timeout is actionable by the processors. The retry (if any) is always supposed to be done by the exporter - do we even want to distinguish them? Which means simply keeping:

fn export() -> Result<(), ()>;
also works?

I think it's important to think about the pattern in general as well as the behaviour in this particular instance. If we say fallible operations should return errors in a particular fashion, I reckon they all should. Also, returning a unit error seems kinda awkward, but happy to be corrected if this is a pattern seen elsewhere?

@lalitb
Member

lalitb commented Jan 29, 2025

I think it's important to think about the pattern in general as well as the behaviour in this particular instance. If we say fallible operations should return errors in a particular fashion, I reckon they all should. Also, returning a unit error seems kinda awkward, but happy to be corrected if this is a pattern seen elsewhere?

Yes, I am definitely aligned with the Rust idiomatic way of thinking about the patterns, provided this doesn't add extra complexity and doesn't expose non-actionable APIs. My thoughts are influenced by the work I did on the otel-cpp exporter interface, where the exporter doesn't return anything and instead delegates error handling to a default/custom error handler.

virtual opentelemetry::sdk::common::ExportResult Export(const ResourceMetrics &data) noexcept = 0;

@cijothomas
Member

do we even want to distinguish them

Between ExportFailed vs TimedOut, yes. I think that'd allow the user to increase the timeout next time. Also, a timeout is not a confirmation that the export failed. For a network call, the backend may have received everything, but the call timed out while the server sent its ACK back to the client. This is different from Failure, where there is no such hope.

@scottgerring
Contributor Author

scottgerring commented Jan 30, 2025

Between ExportFailed vs TimedOut, yes. I think that'd allow the user to increase the timeout next time. Also, a timeout is not a confirmation that the export failed. For a network call, the backend may have received everything, but the call timed out while the server sent its ACK back to the client. This is different from Failure, where there is no such hope.

Devil's advocate: you mentioned above that the exporter should handle retrying internally where appropriate; at the point that you've got a failure back in the thing driving the exporter, there is no more hope, no? Reading the spec you linked earlier:

Failure - exporting failed. The batch must be dropped. For example, this can happen when the batch contains bad data and cannot be serialized.

I see where @lalitb's coming from - you get a failure, there's nothing you can do. I still don't think we should return Result<T, ()>; if we return an error, we should put something in there the caller can introspect, but perhaps there's very little to that error type in this instance.

@scottgerring
Contributor Author

scottgerring commented Jan 30, 2025

We've got really focused on the exporter traits, but I'd like to generalize this out to see if there are other patterns for error handling already in action. Are there other traits that we know have fallible operations? I had a bit of a hunt and only came across:

MetricReader
Uses a single MetricResult<T> for collect, force_flush, and shutdown.

@cijothomas
Member

@scottgerring @lalitb @cijothomas had a brief call to close this, and here are the notes:

Should every operation with discrete failure modes have its own error type?

Shutdown / ForceFlush / Export (mostly for internal use, not directly called by the end user) / Provider(Exporter)-Build
There're <10 such operations; we'll focus on the main ones above for 0.28

Yes, this is generally agreed. Lalit/Cijo/Scott

Should error types box or stringify inner errors for "Custom"?

Failed(String) vs Failed<Box>

Boxing the error will not prevent stringifying for logging purposes, but retains flexibility in case anyone wants to use the original Error.
Perf is not a concern.
It also aligns with Canonical's best practices.

Let's go with Box dyn error. Lalit/Cijo/Scott.

Should shutdown have a distinction between Failed and Timeout?

Yes, there should be a distinction between them; the spec also distinguishes these.

Will apply these decisions to https://github.com/open-telemetry/opentelemetry-rust/pull/2573/files and #2381, and use them as references for applying this throughout the repo.

This is considered a 0.28 blocker, so this is planned to be finished by mid next week, with 0.28 released at the end of next week.
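For illustration, applying the box-don't-stringify decision and the Failed/Timeout distinction to shutdown gives roughly the following (a sketch; names are illustrative, not the final API):

// Box the inner error rather than stringifying it, and keep the
// Failed / Timeout distinction the spec calls for.
#[derive(Debug, thiserror::Error)]
pub enum ShutdownError {
    #[error("Shutdown failed: {source}")]
    Failed {
        // Boxed, so callers can still Display it for logging, or
        // downcast to the concrete impl error if they need to
        source: Box<dyn std::error::Error + Send + Sync>,
    },

    #[error("Shutdown timed out after {0:?}")]
    TimedOut(std::time::Duration),
}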

@scottgerring
Contributor Author

scottgerring commented Jan 31, 2025

@cijothomas @lalitb I lightly updated the ADR at the top of the issue.
I've left the internal error type as the boxed version, pending resolution of your PR #2573 @cijothomas!

It just occurred to me that there's no point in my touching LogExporter / TraceExporter until your PR introducing the common error types is merged.

@cijothomas
Member

Removing from the 0.28 milestone, as the core pieces have been implemented. There is still further cleanup required (OTLP ExporterBuilders, Metric Views), so the issue is kept open. Also, we may want to capture the decision in some doc in the repo itself.

@cijothomas cijothomas removed this from the 0.28.0 milestone Feb 6, 2025
@scottgerring
Contributor Author

@cijothomas @lalitb I'd love to clean this up to reflect what we've implemented, and I found a "docs" subdirectory that we can store architectural decision discussions and other such bits in. If you are happy, I'll open a PR!

@cijothomas
Member

Yes, let's use the docs directory for design docs. (I'll submit one there soon too!)
