-
Notifications
You must be signed in to change notification settings - Fork 485
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Error handling ADR #2571
Comments
There is one thing that irks me, which is that canonical's error handling guide discourages type-erased errors - e.g. My gut feeling is that having a reasonably expressive domain-specific error on the trait is helpful -e.g. we can assume that More Explicit 1 - Nest error type in traitAn alternative approach is found in the std library, where the trait uses a nested type for its error: pub trait TryFrom<T>: [Sized] {
// Our error type
type Error;
// Our function that returns some impl-specific error
fn try_from(value: T) -> [Result]<Self, Self::Error>;
}
impl TryFrom<i32> for GreaterThanZero {
// In this case, let's make our our type a string
type Error = &'static str;
fn try_from(value: i32) -> Result<Self, Self::Error> {
} But this would make handling the error in callers difficult, because the type has been erased. |
Can you elaborate on the reason for having this? |
For components like Exporter's Export() method:
The processor is invoking exporter.export(), and apart from logging the fact that export() failed or timed-out, what else can a processor could do? A more specific reason why the export failed would be part of Exporter's internal logs. For example, the OTLP gRPC exporter can fail due to various issues with network - its gRPC library will return errors, and the Exporter can do appropriate logging with details on these, and simply return ExportFailed back to the caller. (Or ExportFailed(String)). Trying to see if there are any actual advantages of letting processor be more adventurous and try to down_cast and get hold of actual Error? |
BurntSushi/jiff#8 Another discussion shared by someone on this topic to me. |
In this example, imagine some other thing uses This gives us a convenient way of dealing with it. It could also be an identified pattern to use if and when the need arises, rather than preemptively. |
I think we can start without it, and if a future need arises, we can add it wherever it makes sense? |
If we think about it from a "general pattern" perspective, I think it's foreseeable that an impl will have some ✨ special failure mode ✨ that could be retried by a savvy caller that depends on that particular impl directly. The trait doesn't have to foresee that exact interaction, it just has to provide an ergonomic escape hatch to let the trait impl and its dependent do their thing. My aim here is to tease out a pattern that can work in general in any "public trait with some extensible errors" case. Super concretely, for the exporters, what are the foreseeable reactions to a failure that a client could do? I think at least - retry or not. Here the particular error is a signal to the caller. From these two, I would understand a timeout to be a retryable error, and a failure to not be. We'd probably want to be super clear on the semantics so the caller gets a hint about this. What do you think? |
It is explicitly spec-ed out in OTel that, Processor should not take that responsibility. Retry, if any, is fully Exporter responsibility.
We can also literally follow the OTel spec wording - exporter.export() methods returns an indication of success/failure alone.
Ok - equivalent of what spec calls Success (I can imagine someone writing custom exporting processor wanting more than that, but it is rare, and we can keep things simpler...) |
Both retry, and timeout is not actionable by the processors. The retry (if any) is always supposed to be done by the exporter - do we even want to distinguish them. Which means, simply keeping: fn export() -> Result<(), ()>; also works? |
I think it’s important to think about the pattern in general and the behaviour in this particular instance. If we say fallible operations should return errors in a particular fashion, I reckon they all should. Also returning unit error seems kinda awkward, but happy to be corrected if this is a pattern seen elsewhere ? |
Yes, I am definitely aligned with the Rust idiomatic way of thinking about the patterns if this doesn't add extra-complexity and doesn't expose the non-actionable APIs. My thoughts are influenced by the work I did for the otel-cpp exporter interface, where the exporter doesn’t return anything and instead delegates error handling to a default/custom error handler. virtual opentelemetry::sdk::common::ExportResult Export(const ResourceMetrics &data) noexcept = 0; |
Between ExportFailed vs Timedout , yes. I think that'd allow user to increase timeout next time. Also, timeout is not a confirmation that export failed. For a network call, the backend may have received everything, but got timeout while the server send its ACK back to client. This is different from Failure, where there is no such hope. |
Devils advocate: you mentioned above the exporter should handle retrying where appropriate internally; at the point that you've got a failure back in the thing driving the exporter, there is no more hope, no? Reading the spec you linked earlier:
I see where @lalitb 's coming from - you get a failure, there's nothing you can do. I still don't think we should return |
We've got really focused on the exporter traits, but I'd like to generalize this out to see if there are other patterns for error handling already in action. Are there other traits that we know have fallible operations? I had a bit of a hunt and came only across: MetricReader |
@scottgerring @lalitb @cijothomas had a brief call to close this, and here is the notes:
Shutdown/ForceFlush/Export(mostly for internal use, not directly called by enduser)/Provider(Exporter)-Build Yes this is generally agreed. Lalit/Cijo/Scott
Failed(String) vs Failed<Box> Boxing the error will not prevent stringifying for logging purposes, but retains flexibility incase anyone wants to use the original Error. Let's go with Box dyn error. Lalit/Cijo/Scott.
Yes there should be distinction between them. Spec also distinguishes between these. Will apply this decisions to https://github.com/open-telemetry/opentelemetry-rust/pull/2573/files and #2381 and use them as reference for applying this throughout the repo. This is considered 0.28 blocker, so this is planned to be finished by mid next week and release 0.28 end of next week. |
@cijothomas @lalitb I lightly updated the ADR at the top of the issue. It just occurred to me that there's no point my touching |
Removing from 0.28 milestone, and have implemented the core pieces. There is still further cleanup required (OTLP ExporterBuilders, Metric Views.), so the issue is kept open. Also, we may want to capture the decision into some doc in the repo itself. |
@cijothomas @lalitb I'd love to clean this up to reflect what we've implemented, and found a "docs" subdirectory that we can store architectural decision discussion and other such bits in. If you are happy i'll open a PR! |
Yes, let's use docs directly to put design docs. (I'll submit one soon there too!) |
Error handling patterns in public API interfaces
This issue follows the mADR template following the ADR structure and is written in the style of an architectural decision record. It attempts to provide a top-level view on how we should handle errors in our public traits, something that's come up in a number of discrete issues and a PR now:
My hope is that we can collectively make a design decision and argue the details of that here, and then the actual PRs and implementations become straightforward.
Additionally, If we like this format (ADRs!), we could begin capturing these in the repository itself for future reference, along with other architectural artefacts.
Context and Problem Statement
There is uncertainty around how to model errors in in the
opentelemetry-rust
public API interfaces - that is, APIs facing the consumers. At the time of writing this is an important issue to resolve as moving beginning to move the signals towards RC and eventually a stable release is an urgent priority.The situation is as follows; a concrete example is given, but the issue holds across various public traits, in particular the exporters:
opentelemetry-sdk
,such as trait LogExporterInMemoryLogExporter
,OtlpLogExporter
,StdOutLogExporter
that have different error requirements - for instance, anOtlpLogExporter
will experience network failures, anInMemoryLogExporter
will notToday, we have a situation where a single error type is used per API-trait, and some methods simply swallow their errors. In the example above of
LogExporter
,shutdown
swallows errors, andexport
returns theLogError
type, a type that could conceptually be thought of as belonging to the entire trait, not a particular method. For the exporters, the opentelemetry-specification tells us that they need to indicate success or failure, with a distinction made between 'failed' and 'timed out'.There are also similar examples in the builders for providers and exports.
Considered Options
Option 1: Continue as is
Continue the status quo, returning a mix of either nothing or the trait-wide error type. This is inconsistent and limits the caller's ability to handle errors.
Option 2: Extend trait-wide error type to all methods on trait
In this option we keep the existing error type, add it to the remaining methods on the trait, and extend the error type to include errors covering the new error conditions. This will mean that callers will have to know how and when to discard errors from a particular API call based on an understanding of which subset of errors that particular call can make.
Conversely, it will reduce the number of error types in the code base.
Option 3: Introduce an error-type per fallible operation, aggregate these into a single trait-wide error type
For example, in the above we'd have something like:
Decision Outcome
Chosen option: "Option 3: Introduce an error-type per fallible operation, aggregate these into a single trait-wide error type"
Consequences
pub trait
s can box up custom errors in a fashion that follow's canonical's error and panic discipline guide, by avoiding type erasure of impl-specific errorsLogError
forLogExporter
above) provides consumers of the trait that hit multiple methods in a single method an error type they can usedowncast_ref
if they have a known trait impl and want to handle aUnknown
errorThe text was updated successfully, but these errors were encountered: