
Support persistent batch processor to prevent telemetry data loss #6940

Open
xhyzzZ opened this issue Dec 11, 2024 · 3 comments
Labels
blocked:spec (blocked on open or unresolved spec), Feature Request (Suggest an idea for this project)

Comments


xhyzzZ commented Dec 11, 2024

Is your feature request related to a problem? Please describe.
We are using BatchSpanProcessor and have a scenario where, when traffic bursts, we lose some of the spans because we can't always tune the processor configs perfectly ahead of time. Hence I am wondering whether there is a way to persist the data so that nothing is lost when traffic spikes.

Reference: if the configs are not tuned well, spans are dropped silently here, with only limited metrics recorded: https://github.com/open-telemetry/opentelemetry-java/blob/main/sdk/trace/src/main/java/io/opentelemetry/sdk/trace/export/BatchSpanProcessor.java#L238
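
For context, these are the knobs we tune today on the batch span processor builder; a minimal sketch with illustrative values (the OTLP exporter and the numbers are just an example, not a recommendation):

```java
import io.opentelemetry.exporter.otlp.trace.OtlpGrpcSpanExporter;
import io.opentelemetry.sdk.trace.SdkTracerProvider;
import io.opentelemetry.sdk.trace.export.BatchSpanProcessor;
import java.time.Duration;

public final class TracingSetup {
  // A larger queue and a shorter schedule delay reduce (but cannot eliminate)
  // drops during traffic bursts; spans are still dropped once the queue fills.
  public static SdkTracerProvider create() {
    return SdkTracerProvider.builder()
        .addSpanProcessor(
            BatchSpanProcessor.builder(OtlpGrpcSpanExporter.getDefault())
                .setMaxQueueSize(8192)                     // default: 2048
                .setMaxExportBatchSize(512)                // default: 512
                .setScheduleDelay(Duration.ofMillis(1000)) // default: 5000 ms
                .build())
        .build();
  }
}
```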

Describe the solution you'd like

  1. Replacing the in-memory queue implementation with a local persistent queue solution such as https://github.com/bulldog2011/bigqueue?
  2. Using MpscUnboundedArrayQueue instead of the bounded MpscArrayQueue returned by the SDK's internal newFixedSizeQueue(int capacity) factory? (A rough sketch of this option follows below.)
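
To illustrate option 2, here is a rough sketch of the shape of the change, assuming JCTools is on the classpath; QueueFactorySketch and newUnboundedQueue are made-up names, not the SDK's actual internals:

```java
import java.util.Queue;
import org.jctools.queues.MpscUnboundedArrayQueue;

public final class QueueFactorySketch {
  // Sketch only: an unbounded MPSC queue absorbs bursts in memory instead of
  // dropping spans, but it trades span loss for unbounded heap growth under
  // sustained overload, which is presumably why the SDK bounds the queue today.
  public static <T> Queue<T> newUnboundedQueue(int initialChunkSize) {
    return new MpscUnboundedArrayQueue<>(initialChunkSize);
  }
}
```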

Describe alternatives you've considered
N/A

Additional context
N/A

@xhyzzZ added the Feature Request label on Dec 11, 2024
jack-berg (Member) commented:

The behavior of the batch span processor is dictated by the specification, so giving it a persistent queue would require a spec change, which I expect would be an involved process.

I have heard a variety of people interested in increasing the reliability of telemetry delivery. This requires several design changes:

  • We need a solution for handling bursts of spans that overwhelm the buffer contained in batch span processor. This could be local persistence on disk, but this is not without problems. Something still has to serialize to disk and deserialize later. Can this serialization keep up with bursts of spans? The disk itself needs to be thought through. What happens if an app suddenly is OOM killed while un-exported spans remain on disk? Is the disk ephemeral or persistent across app restarts? If an app comes back up and has a bunch of unexported spans on disk, what's the priority between exporting those spans and new spans?
  • The OTLP protocol itself needs to be enhanced for more reliability. Right now, OTLP is prone to duplicate data: if an export request is transmitted and the success response never makes it back to the client, the client will typically retry (see the retry configuration sketch after this list).
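
To make the retry point concrete: retries are configured on the exporter builder (assuming a recent SDK version where setRetryPolicy is exposed there), and any retried request whose original success response was lost can land twice on the backend. Illustrative sketch only:

```java
import io.opentelemetry.exporter.otlp.trace.OtlpGrpcSpanExporter;
import io.opentelemetry.sdk.common.export.RetryPolicy;

public final class OtlpExporterSetup {
  // With retries enabled, a request whose success response is lost on the wire
  // is re-sent, so the backend can receive the same spans twice.
  public static OtlpGrpcSpanExporter create() {
    return OtlpGrpcSpanExporter.builder()
        .setRetryPolicy(RetryPolicy.getDefault())
        .build();
  }
}
```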

These are not trivial challenges. There's currently a proposal to start an Audit logging SIG. On the surface, this seems unrelated to your request, but when you dig deeper, one of the primary challenges the audit SIG would face is improving the reliability of OpenTelemetry data delivery. Whatever improvements they make for reliable delivery of audit logs will also likely be available as an opt-in feature for traces, metrics, and logs. I think progress on this request is most likely to come from that area, so I encourage you to check it out and comment.

@jack-berg added the blocked:spec label on Dec 12, 2024
breedx-splk (Contributor) commented:

Hi @xhyzzZ. Sorry for the delay in responding, but I thought I should chime in to let you know about the disk-buffering module in opentelemetry-java-contrib. It was primarily built for the Android/mobile use case, where the network is unreliable. As such, it may not entirely fit your use case (or could need some tuning), but if the telemetry can stream to disk/persistent storage before being sent over the network, it might solve your problem.

I think several of us would be curious to hear your thoughts, and we hope you can evaluate it. Thanks!


xhyzzZ commented Feb 4, 2025

Hey @breedx-splk, thanks for the info. I just read the doc briefly. We have implemented our own persistent storage solution, and it looks almost the same as yours (two threads: one for writing and one for reading); a rough sketch of the shape is below.
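
Heavily simplified, the shape of what we built is roughly the following; DiskBufferingExporter, appendToDisk, readBatchFromDisk, and markBatchConsumed are placeholders for our internal implementation, not real APIs:

```java
import io.opentelemetry.sdk.common.CompletableResultCode;
import io.opentelemetry.sdk.trace.data.SpanData;
import io.opentelemetry.sdk.trace.export.SpanExporter;
import java.util.Collection;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public final class DiskBufferingExporter implements SpanExporter {
  private final SpanExporter delegate;
  private final ScheduledExecutorService reader =
      Executors.newSingleThreadScheduledExecutor();

  public DiskBufferingExporter(SpanExporter delegate) {
    this.delegate = delegate;
    // Reader thread: periodically drain the disk buffer to the real exporter.
    reader.scheduleWithFixedDelay(this::drainOnce, 1, 1, TimeUnit.SECONDS);
  }

  @Override
  public CompletableResultCode export(Collection<SpanData> spans) {
    // Writer side: a cheap local append that never blocks on the network.
    appendToDisk(spans);
    return CompletableResultCode.ofSuccess();
  }

  private void drainOnce() {
    List<SpanData> batch = readBatchFromDisk();
    if (!batch.isEmpty()
        && delegate.export(batch).join(10, TimeUnit.SECONDS).isSuccess()) {
      // Only advance the read offset after the downstream export succeeded.
      markBatchConsumed();
    }
  }

  @Override
  public CompletableResultCode flush() {
    return delegate.flush();
  }

  @Override
  public CompletableResultCode shutdown() {
    reader.shutdown();
    return delegate.shutdown();
  }

  // Placeholders for the file-backed queue implementation.
  private void appendToDisk(Collection<SpanData> spans) { /* write + fsync */ }

  private List<SpanData> readBatchFromDisk() { return List.of(); }

  private void markBatchConsumed() { /* persist the new read offset */ }
}
```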

I think there are two issues here that cause dropped spans:

  1. The batch span processor will drop spans if the config is not tuned well, because of its internal bounded in-memory queue.
  2. The service will drop spans if spans are still sitting in the batch span processor when the host crashes; after a restart, those spans are lost silently.

If I understand correctly, the disk buffering will resolve most of the dropped-span issues, with some limitations. But I have several questions I would like to confirm:

  1. From the code example, it looks like fromDiskExporter.exportStoredBatch acts as the actual batch span processor, exporting batches from files. Could we still use a batch span processor for the write side when creating the SdkTracerProvider (see the wiring sketch after this list)? If yes, issue 1 is out of scope and we would still need to change the spec. Also, if we don't use a batch span processor for receiving spans, it looks like the speed of writing buffers to disk will suffer (a simple span processor will not perform well)?
  2. Although it doesn't seem to be mentioned in the design, if the application crashes and is restarted, should the reader be able to pick up from the latest offset in the buffer files?
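
To make question 1 concrete, the wiring I have in mind looks roughly like this, using the DiskBufferingExporter sketch above as a stand-in for the module's to-disk exporter:

```java
import io.opentelemetry.exporter.otlp.trace.OtlpGrpcSpanExporter;
import io.opentelemetry.sdk.trace.SdkTracerProvider;
import io.opentelemetry.sdk.trace.export.BatchSpanProcessor;

public final class DiskBufferedTracing {
  public static SdkTracerProvider create() {
    // The batch span processor batches in memory and hands batches to the
    // disk-writing exporter; the reader side drains the disk to the network
    // exporter (OTLP here) on its own schedule.
    DiskBufferingExporter toDisk =
        new DiskBufferingExporter(OtlpGrpcSpanExporter.getDefault());
    return SdkTracerProvider.builder()
        .addSpanProcessor(BatchSpanProcessor.builder(toDisk).build())
        .build();
  }
}
```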
