
Support persistent batch processor to prevent telemetry data loss #6940

Open
xhyzzZ opened this issue Dec 11, 2024 · 3 comments
Labels
blocked:spec (blocked on open or unresolved spec), Feature Request (Suggest an idea for this project)

Comments


xhyzzZ commented Dec 11, 2024

Is your feature request related to a problem? Please describe.
We are using BatchSpanProcessor and have a scenario where, when traffic bursts, we lose some of the spans because we can't always tune the processor configs perfectly ahead of time. Hence I am wondering whether there is a way to persist the data so that nothing is lost when traffic spikes.

Reference: if the configs are not tuned well, spans are dropped silently here, with only limited metrics recorded: https://github.com/open-telemetry/opentelemetry-java/blob/main/sdk/trace/src/main/java/io/opentelemetry/sdk/trace/export/BatchSpanProcessor.java#L238
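
For context, these are the knobs we tune today on the batch span processor builder; a minimal sketch with illustrative values (the OTLP exporter and the numbers are just an example, not a recommendation):

```java
import io.opentelemetry.exporter.otlp.trace.OtlpGrpcSpanExporter;
import io.opentelemetry.sdk.trace.SdkTracerProvider;
import io.opentelemetry.sdk.trace.export.BatchSpanProcessor;
import java.time.Duration;

public final class TracingSetup {
  // A larger queue and a shorter schedule delay reduce (but cannot eliminate)
  // drops during traffic bursts; spans are still dropped once the queue fills.
  public static SdkTracerProvider create() {
    return SdkTracerProvider.builder()
        .addSpanProcessor(
            BatchSpanProcessor.builder(OtlpGrpcSpanExporter.getDefault())
                .setMaxQueueSize(8192)                     // default: 2048
                .setMaxExportBatchSize(512)                // default: 512
                .setScheduleDelay(Duration.ofMillis(1000)) // default: 5000 ms
                .build())
        .build();
  }
}
```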

Describe the solution you'd like

  1. Replacing the in-memory queue implementation with a local persistent queue solution such as https://github.com/bulldog2011/bigqueue?
  2. Using MpscUnboundedArrayQueue instead of the bounded MpscArrayQueue returned by the SDK's internal newFixedSizeQueue(int capacity) factory? (A rough sketch of this option follows below.)
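
To illustrate option 2, here is a rough sketch of the shape of the change, assuming JCTools is on the classpath; QueueFactorySketch and newUnboundedQueue are made-up names, not the SDK's actual internals:

```java
import java.util.Queue;
import org.jctools.queues.MpscUnboundedArrayQueue;

public final class QueueFactorySketch {
  // Sketch only: an unbounded MPSC queue absorbs bursts in memory instead of
  // dropping spans, but it trades span loss for unbounded heap growth under
  // sustained overload, which is presumably why the SDK bounds the queue today.
  public static <T> Queue<T> newUnboundedQueue(int initialChunkSize) {
    return new MpscUnboundedArrayQueue<>(initialChunkSize);
  }
}
```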

Describe alternatives you've considered
N/A

Additional context
N/A

@xhyzzZ added the Feature Request label on Dec 11, 2024
jack-berg (Member) commented:

The behavior of the batch span processor is dictated by the specification, so giving it a persistent queue would require a spec change, which I expect would be an involved process.

I have heard a variety of people interested in increasing the reliability of telemetry delivery. This requires several design changes:

  • We need a solution for handling bursts of spans that overwhelm the buffer contained in batch span processor. This could be local persistence on disk, but this is not without problems. Something still has to serialize to disk and deserialize later. Can this serialization keep up with bursts of spans? The disk itself needs to be thought through. What happens if an app suddenly is OOM killed while un-exported spans remain on disk? Is the disk ephemeral or persistent across app restarts? If an app comes back up and has a bunch of unexported spans on disk, what's the priority between exporting those spans and new spans?
  • The OTLP protocol itself needs to be enhanced for more reliability. Right now, OTLP is prone to duplicate data: if an export request is transmitted and the success response never makes it back to the client, the client will typically retry (see the retry configuration sketch after this list).
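
To make the retry point concrete: retries are configured on the exporter builder (assuming a recent SDK version where setRetryPolicy is exposed there), and any retried request whose original success response was lost can land twice on the backend. Illustrative sketch only:

```java
import io.opentelemetry.exporter.otlp.trace.OtlpGrpcSpanExporter;
import io.opentelemetry.sdk.common.export.RetryPolicy;

public final class OtlpExporterSetup {
  // With retries enabled, a request whose success response is lost on the wire
  // is re-sent, so the backend can receive the same spans twice.
  public static OtlpGrpcSpanExporter create() {
    return OtlpGrpcSpanExporter.builder()
        .setRetryPolicy(RetryPolicy.getDefault())
        .build();
  }
}
```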

These are not trivial challenges. There's currently a proposal to start an Audit logging SIG. On the surface, this seems unrelated to your request, but when you dig deeper, one of the primary challenges the audit SIG would face is improving the reliability of OpenTelemetry data delivery. Whatever improvements they make for reliable delivery of audit logs will also likely be available as an opt-in feature for traces, metrics, and logs. I think progress on this request is most likely to come from that area, so I encourage you to check it out and comment.

@jack-berg added the blocked:spec label on Dec 12, 2024
breedx-splk (Contributor) commented:

Hi @xhyzzZ. Sorry for the delay in responding, but I thought I should chime in to let you know about the disk-buffering module in opentelemetry-java-contrib. It was primarily built for the Android/mobile use case, where the network is unreliable. As such, it may not entirely fit your use case (or could need some tuning), but if the telemetry can stream to disk/persistent storage before being sent over the network, it might solve your problem.

I think several of us would be curious to hear your thoughts, and we hope you can evaluate it. Thanks!


xhyzzZ commented Feb 4, 2025

Hey @breedx-splk, thanks for the info. I just read the doc briefly. We have implemented our own persistent storage solution, and it looks almost the same as yours (two threads: one for writing and one for reading); a rough sketch of the shape is below.
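
Heavily simplified, the shape of what we built is roughly the following; DiskBufferingExporter, appendToDisk, readBatchFromDisk, and markBatchConsumed are placeholders for our internal implementation, not real APIs:

```java
import io.opentelemetry.sdk.common.CompletableResultCode;
import io.opentelemetry.sdk.trace.data.SpanData;
import io.opentelemetry.sdk.trace.export.SpanExporter;
import java.util.Collection;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public final class DiskBufferingExporter implements SpanExporter {
  private final SpanExporter delegate;
  private final ScheduledExecutorService reader =
      Executors.newSingleThreadScheduledExecutor();

  public DiskBufferingExporter(SpanExporter delegate) {
    this.delegate = delegate;
    // Reader thread: periodically drain the disk buffer to the real exporter.
    reader.scheduleWithFixedDelay(this::drainOnce, 1, 1, TimeUnit.SECONDS);
  }

  @Override
  public CompletableResultCode export(Collection<SpanData> spans) {
    // Writer side: a cheap local append that never blocks on the network.
    appendToDisk(spans);
    return CompletableResultCode.ofSuccess();
  }

  private void drainOnce() {
    List<SpanData> batch = readBatchFromDisk();
    if (!batch.isEmpty()
        && delegate.export(batch).join(10, TimeUnit.SECONDS).isSuccess()) {
      // Only advance the read offset after the downstream export succeeded.
      markBatchConsumed();
    }
  }

  @Override
  public CompletableResultCode flush() {
    return delegate.flush();
  }

  @Override
  public CompletableResultCode shutdown() {
    reader.shutdown();
    return delegate.shutdown();
  }

  // Placeholders for the file-backed queue implementation.
  private void appendToDisk(Collection<SpanData> spans) { /* write + fsync */ }

  private List<SpanData> readBatchFromDisk() { return List.of(); }

  private void markBatchConsumed() { /* persist the new read offset */ }
}
```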

I think there are two issues here that cause dropped spans:

  1. The batch span processor will drop spans if the config is not tuned well, because of its internal bounded in-memory queue.
  2. The service will drop spans if spans are still sitting in the batch span processor when the host crashes; after a restart, those spans are lost silently.

If I understand correctly, the disk buffering will resolve most of the dropped-span issues, with some limitations. But I have several questions I would like to confirm:

  1. From the code example, it looks like fromDiskExporter.exportStoredBatch acts as the actual batch span processor, exporting batches from files. Could we still use a batch span processor for the write side when creating the SdkTracerProvider (see the wiring sketch after this list)? If yes, issue 1 is out of scope and we would still need to change the spec. Also, if we don't use a batch span processor for receiving spans, it looks like the speed of writing buffers to disk will suffer (a simple span processor will not perform well)?
  2. Although it doesn't seem to be mentioned in the design, if the application crashes and is restarted, should the reader be able to pick up from the latest offset in the buffer files?
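
To make question 1 concrete, the wiring I have in mind looks roughly like this, using the DiskBufferingExporter sketch above as a stand-in for the module's to-disk exporter:

```java
import io.opentelemetry.exporter.otlp.trace.OtlpGrpcSpanExporter;
import io.opentelemetry.sdk.trace.SdkTracerProvider;
import io.opentelemetry.sdk.trace.export.BatchSpanProcessor;

public final class DiskBufferedTracing {
  public static SdkTracerProvider create() {
    // The batch span processor batches in memory and hands batches to the
    // disk-writing exporter; the reader side drains the disk to the network
    // exporter (OTLP here) on its own schedule.
    DiskBufferingExporter toDisk =
        new DiskBufferingExporter(OtlpGrpcSpanExporter.getDefault());
    return SdkTracerProvider.builder()
        .addSpanProcessor(BatchSpanProcessor.builder(toDisk).build())
        .build();
  }
}
```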
