
Error handling ADR #2571

Open
scottgerring opened this issue Jan 28, 2025 · 19 comments

Comments

@scottgerring
Contributor

scottgerring commented Jan 28, 2025

Error handling patterns in public API interfaces

This issue follows the MADR template and is written in the style of an architectural decision record. It attempts to provide a top-level view of how we should handle errors in our public traits, something that's come up in a number of discrete issues and a PR now.

My hope is that we can collectively make a design decision and argue the details of that here, and then the actual PRs and implementations become straightforward.

Additionally, if we like this format (ADRs!), we could begin capturing these in the repository itself for future reference, along with other architectural artefacts.

Context and Problem Statement

There is uncertainty around how to model errors in the opentelemetry-rust public API interfaces - that is, the APIs facing consumers. At the time of writing this is an important issue to resolve, as beginning to move the signals towards RC and eventually a stable release is an urgent priority.

The situation is as follows; a concrete example is given, but the issue holds across various public traits, in particular the exporters:

  • A given public interface in opentelemetry-sdk, such as trait LogExporter
  • ... exposes multiple discrete actions with logically disjoint error types (e.g. export and shutdown) - that is, the class of errors returned for each of these actions is foreseeably very different, as is the caller's reaction to them
  • ... is implemented by multiple concrete types such as InMemoryLogExporter, OtlpLogExporter, StdOutLogExporter that have different error requirements - for instance, an OtlpLogExporter will experience network failures, while an InMemoryLogExporter will not
  • ... potentially has operations on the API that, either in the direct implementation or in a derived utility that builds on it, call multiple API actions and therefore need to return an aggregated error type

Today, we have a situation where a single error type is used per API trait, and some methods simply swallow their errors. In the example above of LogExporter, shutdown swallows errors, and export returns the LogError type - a type that could conceptually be thought of as belonging to the entire trait, not a particular method. For the exporters, the opentelemetry-specification tells us that they need to indicate success or failure, with a distinction made between 'failed' and 'timed out'.

There are also similar examples in the builders for providers and exporters.
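To make the status quo concrete, here is a simplified sketch of the shape described above (signatures and types are illustrative, not the exact current API):

pub struct LogRecord; // placeholder payload type, for illustration only

#[derive(Debug)]
pub struct LogError(pub String); // the single trait-wide error type (simplified)

pub trait LogExporter {
    // Every fallible method funnels into the one trait-wide error type...
    fn export(&mut self, batch: Vec<LogRecord>) -> Result<(), LogError>;

    // ...while shutdown swallows its errors entirely.
    fn shutdown(&mut self) {}
}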

Considered Options

Option 1: Continue as is
Continue the status quo, returning a mix of either nothing or the trait-wide error type. This is inconsistent and limits the caller's ability to handle errors.

Option 2: Extend trait-wide error type to all methods on trait
In this option we keep the existing error type, add it to the remaining methods on the trait, and extend it to cover the new error conditions. This will mean that callers have to know how and when to discard errors from a particular API call, based on an understanding of which subset of errors that particular call can produce.

Conversely, it will reduce the number of error types in the code base.
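A sketch of the resulting shape (variant names are hypothetical):

pub struct LogRecord; // placeholder payload type, for illustration only

// Option 2 sketch: one widened error type shared by every trait method.
#[derive(Debug)]
pub enum LogError {
    ExportFailed,    // only ever produced by export()
    ExportTimedOut,  // only ever produced by export()
    ShutdownFailed,  // only ever produced by shutdown()
}

pub trait LogExporter {
    // Both methods share LogError, so a caller matching on the result of
    // shutdown() must know that the Export* variants can never occur there.
    fn export(&mut self, batch: Vec<LogRecord>) -> Result<(), LogError>;
    fn shutdown(&mut self) -> Result<(), LogError>;
}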

Option 3: Introduce an error-type per fallible operation, aggregate these into a single trait-wide error type

For example, in the above we'd have something like:

use std::time::Duration; // for ExportTimedOut

pub trait LogExporter {
    fn export(...) -> Result<..., ExportError>;
    fn shutdown(...) -> Result<..., ShutdownError>;
}

// Concrete errors for an export operation
#[derive(Debug, thiserror::Error)]
pub enum ExportError {
    // The distinction between failed and timed out is part of the OTEL spec
    // we need to meet.
    #[error("Export failed")]
    ExportFailed,

    #[error("Export timed out after {0:?}")]
    ExportTimedOut(Duration),

    // Allow impls to box up errors that can't be logically mapped
    // back to one of the API's errors
    #[error("Unknown error (should not occur): {source:?}")]
    Unknown {
        source: Box<dyn std::error::Error + Send + Sync>,
    },
}

// Aggregate error type for convenience
// Note: This will be added in response to need, not pre-emptively.
// ShutdownError is defined analogously to ExportError (not shown).
#[derive(Debug, thiserror::Error)]
pub enum LogError {
    #[error("Export error: {0}")]
    ExportError(#[from] ExportError),

    #[error("Shutdown error: {0}")]
    ShutdownError(#[from] ShutdownError),
}

// A downcast helper for callers that need to work with impl-specific
// unknown errors concretely
impl ExportError {
    /// Attempt to downcast the inner `source` error to a specific type `T`
    pub fn downcast_ref<T: std::error::Error + 'static>(&self) -> Option<&T> {
        if let ExportError::Unknown { source } = self {
            source.downcast_ref::<T>()
        } else {
            None
        }
    }
}
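For illustration, a caller that knows its concrete exporter could use the helper like this (GrpcTransportError is a made-up stand-in for whatever impl-specific error the exporter boxed up):

#[derive(Debug, thiserror::Error)]
#[error("gRPC transport error: {0}")]
struct GrpcTransportError(String); // hypothetical impl-specific error

fn handle_export_error(err: &ExportError) {
    if let Some(grpc) = err.downcast_ref::<GrpcTransportError>() {
        // Impl-specific remediation, e.g. rebuilding the channel
        eprintln!("gRPC transport failed: {grpc}");
    }
}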

Decision Outcome

Chosen option: "Option 3: Introduce an error-type per fallible operation, aggregate these into a single trait-wide error type"

Consequences

  • Good, because callers can handle focussed errors with focussed remediation
  • Good, because implementors of the pub traits can box up custom errors in a fashion that follows Canonical's error and panic discipline guide, without losing access to impl-specific errors
  • Good, because the per-trait error type (LogError for LogExporter above) provides consumers of the trait that hit multiple methods in a single operation an error type they can use
  • Bad, because there's more code than a single error type
  • Bad, because a caller may need to use downcast_ref if they have a known trait impl and want to handle an Unknown error
@scottgerring
Contributor Author

scottgerring commented Jan 29, 2025

There is one thing that irks me, which is that Canonical's error handling guide discourages type-erased errors - e.g. Box<dyn Error> - in library crates, because the compiler can't help you match exhaustively.

My gut feeling is that having a reasonably expressive domain-specific error on the trait is helpful - e.g. we can assume that ExportError::ExportTimedOut is likely to be useful for multiple impls - so there's some balance here - but I welcome opinions.

More Explicit 1 - Nest error type in trait

An alternative approach is found in the std library, where the trait uses a nested type for its error:

pub trait TryFrom<T>: Sized {
    // Our error type
    type Error;

    // Our function that returns some impl-specific error
    fn try_from(value: T) -> Result<Self, Self::Error>;
}

struct GreaterThanZero(i32);

impl TryFrom<i32> for GreaterThanZero {
    // In this case, let's make our error type a string
    type Error = &'static str;

    fn try_from(value: i32) -> Result<Self, Self::Error> {
        if value > 0 {
            Ok(GreaterThanZero(value))
        } else {
            Err("value must be greater than zero")
        }
    }
}
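Applied to LogExporter, that approach would look roughly like this (a sketch, assuming a placeholder LogRecord payload type):

pub struct LogRecord; // placeholder payload type, for illustration only

pub trait LogExporter {
    // Each impl picks its own concrete error; generic callers only know
    // that it implements std::error::Error.
    type Error: std::error::Error;

    fn export(&mut self, batch: Vec<LogRecord>) -> Result<(), Self::Error>;
}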

But this would make handling the error in generic callers difficult, because each impl chooses its own opaque error type.

@cijothomas
Member

// Aggregate error type for convenience
#[derive(Debug, thiserror::Error)]
pub enum LogError {

Can you elaborate on the reason for having this?

@cijothomas
Member

For components like Exporter's Export() method:
Is there any advantage having more variants than the below two?

ExportFailed,      
ExportTimedOut(Duration),

The processor is invoking exporter.export(), and apart from logging the fact that export() failed or timed out, what else could a processor do? A more specific reason why the export failed would be part of the Exporter's internal logs. For example, the OTLP gRPC exporter can fail due to various issues with the network - its gRPC library will return errors, and the Exporter can do appropriate logging with details on these, and simply return ExportFailed back to the caller. (Or ExportFailed(String).)

Trying to see if there are any actual advantages of letting the processor be more adventurous and try to downcast and get hold of the actual Error?

@cijothomas
Member

Another discussion on this topic that someone shared with me: BurntSushi/jiff#8

@cijothomas cijothomas added this to the 0.28.0 milestone Jan 29, 2025
@scottgerring
Contributor Author

// Aggregate error type for convenience
#[derive(Debug, thiserror::Error)]
pub enum LogError {

Can you elaborate on the reason for having this?

In this example, imagine some other thing uses LogExporter, and does an aggregate operation that uses both export and shutdown. This could also be a function on the trait itself, or a helper on an impl - flushThenShutdown; the detail isn't important, just the idea that such an aggregate operation will likely come up in various pub traits.

This gives us a convenient way of dealing with it. It could also be an identified pattern to use if and when the need arises, rather than preemptively.
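For illustration, with the #[from] conversions on LogError such a helper falls out naturally. This is a sketch: flush_then_shutdown and pending_batch are hypothetical, and the signatures assume the shapes sketched in the ADR above.

// Hypothetical aggregate operation: `?` converts each method's error
// into the aggregate LogError via the #[from] impls.
fn flush_then_shutdown(exporter: &mut impl LogExporter) -> Result<(), LogError> {
    exporter.export(pending_batch())?; // ExportError -> LogError
    exporter.shutdown()?;              // ShutdownError -> LogError
    Ok(())
}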

@cijothomas
Member

// Aggregate error type for convenience
#[derive(Debug, thiserror::Error)]
pub enum LogError {

Can you elaborate on the reason for having this?

In this example, imagine some other thing uses LogExporter, and does an aggregate operation that uses both export and shutdown. This could also be a function on the trait itself, or a helper on an impl - flushThenShutdown; the detail isn't important, just the idea that such an aggregate operation will likely come up in various pub traits.

This gives us a convenient way of dealing with it. It could also be an identified pattern to use if and when the need arises, rather than preemptively.

I think we can start without it, and if a future need arises, we can add it wherever it makes sense?

@scottgerring
Contributor Author

scottgerring commented Jan 29, 2025

For components like Exporter's Export() method: Is there any advantage having more variants than the below two?

ExportFailed,      
ExportTimedOut(Duration),

The processor is invoking exporter.export(), and apart from logging the fact that export() failed or timed out, what else could a processor do? A more specific reason why the export failed would be part of the Exporter's internal logs. For example, the OTLP gRPC exporter can fail due to various issues with the network - its gRPC library will return errors, and the Exporter can do appropriate logging with details on these, and simply return ExportFailed back to the caller. (Or ExportFailed(String).)

Trying to see if there are any actual advantages of letting the processor be more adventurous and try to downcast and get hold of the actual Error?

If we think about it from a "general pattern" perspective, I think it's foreseeable that an impl will have some ✨ special failure mode ✨ that could be retried by a savvy caller that depends on that particular impl directly. The trait doesn't have to foresee that exact interaction, it just has to provide an ergonomic escape hatch to let the trait impl and its dependent do their thing. My aim here is to tease out a pattern that can work in general in any "public trait with some extensible errors" case.

Super concretely, for the exporters, what are the foreseeable reactions a client could have to a failure? I think at least: retry or not. Here the particular error is a signal to the caller. Of these two, I would understand a timeout to be a retryable error, and a failure not to be. We'd probably want to be super clear on the semantics so the caller gets a hint about this - see the sketch below.
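A sketch of that caller-side handling (schedule_retry is a hypothetical helper, and the timeout-is-retryable semantics are the proposal here, not settled API):

match exporter.export(batch.clone()) {
    Ok(()) => {}
    Err(ExportError::ExportTimedOut(elapsed)) => {
        // Retryable: the backend may be slow, or may even have received
        // the data; back off and try again.
        schedule_retry(batch, elapsed); // hypothetical helper
    }
    Err(err) => {
        // Terminal: drop the batch and log the failure.
        eprintln!("export failed permanently: {err}");
    }
}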

What do you think?

@cijothomas
Member

Super concretely, for the exporters, what are the foreseeable reactions a client could have to a failure?

It is explicitly spec-ed out in OTel that the Processor should not take that responsibility. Retry, if any, is fully the Exporter's responsibility.

https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/logs/sdk.md#export

Concurrent requests and retry logic is the responsibility of the exporter. The default SDK's LogRecordProcessors SHOULD NOT implement retry logic, as the required logic is likely to depend heavily on the specific protocol and backend the logs are being sent to.

We can also literally follow the OTel spec wording - the exporter.export() method returns an indication of success/failure alone.
In Rust terms it can be a Result whose error variant is just the below.

ExportFailed,      
ExportTimedOut,

Ok - the equivalent of what the spec calls Success.
The Error variant is the equivalent of what the spec calls Failure. We distinguish Timeout, as this is the processor giving up on the exporter due to timeout, not necessarily a failure inside the exporter.

(I can imagine someone writing a custom exporting processor wanting more than that, but it is rare, and we can keep things simpler...)
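Spelled out in full, that literal mapping would be something like (a sketch):

// Ok(()) is the spec's Success; the error enum carries only the
// Failure / Timeout distinction.
#[derive(Debug)]
pub enum ExportError {
    ExportFailed,
    ExportTimedOut,
}

pub type ExportResult = Result<(), ExportError>;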

@lalitb
Member

lalitb commented Jan 29, 2025

ExportFailed,
ExportTimedOut,

Neither failure nor timeout is actionable by the processors. The retry (if any) is always supposed to be done by the exporter - do we even want to distinguish them? Which means simply keeping:

fn export() -> Result<(), ()>;

also works?

@scottgerring
Contributor Author

ExportFailed,
ExportTimedOut,

Neither failure nor timeout is actionable by the processors. The retry (if any) is always supposed to be done by the exporter - do we even want to distinguish them? Which means simply keeping:

fn export() -> Result<(), ()>;
also works?

I think it's important to think about the pattern in general as well as the behaviour in this particular instance. If we say fallible operations should return errors in a particular fashion, I reckon they all should. Also, returning a unit error seems kinda awkward, but happy to be corrected if this is a pattern seen elsewhere?

@lalitb
Member

lalitb commented Jan 29, 2025

I think it's important to think about the pattern in general as well as the behaviour in this particular instance. If we say fallible operations should return errors in a particular fashion, I reckon they all should. Also, returning a unit error seems kinda awkward, but happy to be corrected if this is a pattern seen elsewhere?

Yes, I am definitely aligned with the Rust idiomatic way of thinking about the patterns, provided this doesn't add extra complexity and doesn't expose non-actionable APIs. My thoughts are influenced by the work I did on the otel-cpp exporter interface, where the exporter doesn't return anything and instead delegates error handling to a default/custom error handler.

virtual opentelemetry::sdk::common::ExportResult Export(const ResourceMetrics &data) noexcept = 0;

@cijothomas
Member

do we even want to distinguish them

Between ExportFailed vs TimedOut, yes. I think that'd allow the user to increase the timeout next time. Also, a timeout is not a confirmation that the export failed. For a network call, the backend may have received everything, but the call timed out while the server sent its ACK back to the client. This is different from Failure, where there is no such hope.

@scottgerring
Contributor Author

scottgerring commented Jan 30, 2025

Between ExportFailed vs TimedOut, yes. I think that'd allow the user to increase the timeout next time. Also, a timeout is not a confirmation that the export failed. For a network call, the backend may have received everything, but the call timed out while the server sent its ACK back to the client. This is different from Failure, where there is no such hope.

Devil's advocate: you mentioned above that the exporter should handle retrying internally where appropriate; at the point that you've got a failure back in the thing driving the exporter, there is no more hope, no? Reading the spec you linked earlier:

Failure - exporting failed. The batch must be dropped. For example, this can happen when the batch contains bad data and cannot be serialized.

I see where @lalitb's coming from - you get a failure, there's nothing you can do. I still don't think we should return Result<T, ()>; if we return an error, we should put something in there the caller can introspect, but perhaps there's very little to that error type in this instance.

@scottgerring
Contributor Author

scottgerring commented Jan 30, 2025

We've got really focused on the exporter traits, but I'd like to generalize this out to see if there are other patterns for error handling already in action. Are there other traits that we know have fallible operations? I had a bit of a hunt and only came across:

MetricReader
Uses a single MetricResult<T> for collect, force_flush, and shutdown.

@cijothomas
Member

@scottgerring @lalitb @cijothomas had a brief call to close this, and here are the notes:

Should every operation with discrete failure modes have its own error type?

Shutdown / ForceFlush / Export (mostly for internal use, not directly called by the end user) / Provider(Exporter)-Build
There're <10 such operations; we'll focus on the main ones above for 0.28

Yes, this is generally agreed. Lalit/Cijo/Scott

Should error types box or stringify inner errors for "Custom"?

Failed(String) vs Failed<Box>

Boxing the error will not prevent stringifying for logging purposes, but retains flexibility in case anyone wants to use the original Error.
Perf is not a concern.
It also aligns with Canonical's best practices.

Let's go with Box dyn error. Lalit/Cijo/Scott.

Should shutdown have a distinction between Failed and Timeout?

Yes, there should be a distinction between them; the spec also distinguishes these.

Will apply these decisions to https://github.com/open-telemetry/opentelemetry-rust/pull/2573/files and #2381, and use them as references for applying this throughout the repo.

This is considered a 0.28 blocker, so this is planned to be finished by mid next week, with 0.28 released at the end of next week.
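For illustration, applying the box-don't-stringify decision and the Failed/Timeout distinction to shutdown gives roughly the following (a sketch; names are illustrative, not the final API):

// Box the inner error rather than stringifying it, and keep the
// Failed / Timeout distinction the spec calls for.
#[derive(Debug, thiserror::Error)]
pub enum ShutdownError {
    #[error("Shutdown failed: {source}")]
    Failed {
        // Boxed, so callers can still Display it for logging, or
        // downcast to the concrete impl error if they need to
        source: Box<dyn std::error::Error + Send + Sync>,
    },

    #[error("Shutdown timed out after {0:?}")]
    TimedOut(std::time::Duration),
}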

@scottgerring
Contributor Author

scottgerring commented Jan 31, 2025

@cijothomas @lalitb I lightly updated the ADR at the top of the issue.
I've left the internal error type as the boxed version, pending resolution of your PR #2573 @cijothomas!

It just occurred to me that there's no point in my touching LogExporter / TraceExporter until your PR introducing the common error types is merged.

@cijothomas
Member

Removing from the 0.28 milestone, as the core pieces have been implemented. There is still further cleanup required (OTLP ExporterBuilders, Metric Views), so the issue is kept open. Also, we may want to capture the decision in some doc in the repo itself.

@cijothomas cijothomas removed this from the 0.28.0 milestone Feb 6, 2025
@scottgerring
Contributor Author

@cijothomas @lalitb I'd love to clean this up to reflect what we've implemented, and I found a "docs" subdirectory that we can store architectural decision discussions and other such bits in. If you are happy, I'll open a PR!

@cijothomas
Member

Yes, let's use the docs directory for design docs. (I'll submit one there soon too!)
