feat(llmobs): support evp proxy for evaluation metrics #12966

Yun-Kim · 2025-03-29T00:53:37Z

MLOB-2495
This PR does four things:

Adds support for submitting eval metrics to LLM Observability using the agent as a proxy. Previously, eval metrics could only be sent via agentless intake.
Adds retry logic for submitting eval metric payloads, up to a maximum of 3 attempts on failure/error. This should limit the number of eval payloads dropped due to intermittent connection errors.
(Non-user facing) Refactors the LLMObs writer classes (eval and span writers) to be subclasses of the same BaseLLMObsWriter class. Additionally, we simplify the writer constructor such that only interval/timeout is required, and api_key/site/agentless_enabled/_agentless_url are optional and otherwise inferred internally. _agentless_url is now an internal optional override param for testing.
(Non-user facing) Refactors the TestLLMObsSpanWriter class to be defined in a single tests/llmobs/_utils.py file rather than in each integration test's conftest file.
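
As a rough illustration of the retry behaviour in the second item, here is a minimal sketch of "up to 3 attempts on failure/error". It is not the code from this PR: `send_with_retries`, the `send` callable, and the `backoff_s` parameter are hypothetical names chosen for the example, and treating any status below 400 as success is an assumption.

```python
import time
from typing import Callable

MAX_ATTEMPTS = 3  # the PR description states a maximum of 3 attempts


def send_with_retries(
    send: Callable[[], int],
    max_attempts: int = MAX_ATTEMPTS,
    backoff_s: float = 0.0,
) -> bool:
    """Call send() until it reports success or the attempts run out.

    `send` is a hypothetical callable that performs one HTTP submission and
    returns the response status code, raising OSError on connection problems.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            status = send()
            if status < 400:  # assumption: any non-error status counts as success
                return True
        except OSError:
            # ConnectionError and TimeoutError are subclasses of OSError.
            pass
        if attempt < max_attempts and backoff_s:
            time.sleep(backoff_s * attempt)  # simple linear backoff (assumption)
    return False
```

With this shape, a payload is only dropped after all attempts have failed, which is the outcome the retry logic is meant to make less likely.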

Eval support for Agent Proxy

We were previously only submitting eval metrics via agentless intake. However, we recently discovered that the agent proxy does support submitting directly to the LLMObs API intake (with the correct subdomain headers). This PR adds that support so that users do not need to specify an API key to submit eval metrics, especially if they already rely on the agent proxy to submit traces.

Writer Refactor

We were previously making LLMObsSpanWriter a subclass of HTTPWriter, which is the APM tracer's base writer class for submitting traces. We initially reused this class because it provided a lot of convenience methods/functionality, such as retry logic and payload submission logic specific to the agent (when we were implementing agent proxy support for spans).

However, we've decided that there is a bit too much functionality and abstraction in the HTTPWriter class, including the writer client / encoder structure, vague error logs on unsuccessful writes, and the slightly convoluted codepath for actually flushing/submitting payloads. For example, the HTTPWriter structure involves:

enqueuing event to be written: enqueue() --> write() --> write_with_client() --> client.encoder.put()
flushing and writing payload: periodic() --> flush_queue() --> _flush_queue_with_client() --> client.encoder.encode() + _send_payload()

This mismatch in functionality/requirements is due in part to the fact that LLMObs span and eval metric payloads are JSON encoded, meaning there is no real need for separate encoder-specific logic.

The refactored writer directly manages its own buffer, enqueuing, encoding, and flushing. The logic looks roughly like this:

enqueuing event to be written: enqueue(event) --> BaseLLMObsWriter._buffer.append(event)
flushing and writing payload: periodic() --> BaseLLMObsWriter._encode() + BaseLLMObsWriter._send_payload()
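
The simplified enqueue/flush path above can be sketched as follows. This is a minimal stand-in under stated assumptions, not the actual BaseLLMObsWriter: the class name `BufferedJSONWriter`, the injected `send` callable, the `buffer_limit` parameter, and the drop-when-full policy are all hypothetical.

```python
import json
import threading
from typing import Any, Callable, Dict, List


class BufferedJSONWriter:
    """Hypothetical stand-in for a writer that owns its buffer directly."""

    def __init__(self, send: Callable[[bytes], None], buffer_limit: int = 1000) -> None:
        self._send = send  # assumption: performs the actual HTTP submission
        self._buffer: List[Dict[str, Any]] = []
        self._buffer_limit = buffer_limit
        self._lock = threading.Lock()

    def enqueue(self, event: Dict[str, Any]) -> None:
        # enqueue(event) --> _buffer.append(event)
        with self._lock:
            if len(self._buffer) >= self._buffer_limit:
                return  # drop the event when the buffer is full (assumption)
            self._buffer.append(event)

    def _encode(self, events: List[Dict[str, Any]]) -> bytes:
        # Payloads are plain JSON, so no separate encoder object is needed.
        return json.dumps(events).encode("utf-8")

    def periodic(self) -> None:
        # periodic() --> _encode() + send
        with self._lock:
            if not self._buffer:
                return
            events, self._buffer = self._buffer, []
        self._send(self._encode(events))
```

Swapping the buffer out under the lock and sending outside of it keeps `enqueue()` from blocking on network I/O, which is one reasonable way to arrange the two steps.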

Once the encoding schema becomes more complex than the simple json.dumps() call we use today, we can go back to reusing the HTTPWriter class and its encoder/client pattern. Until then, we don't really need the complex functionality provided by HTTPWriter.

Writer Initialization

The Writer classes now take in the following arguments:

required args: interval, timeout, is_agentless
optional kwargs: _site, _api_key (taken from config if not provided)
one last optional kwarg: _override_url (for testing purposes only, to point at non-prod/stg intake URLs, for example the local mock backend server that is set in tests/llmobs/conftest.py)

In the previous design, each subclass constructor individually handled setting the intake/endpoint/url/headers for its own case. Now each Span/Eval writer class has the following class attributes, which the base class constructor uses to build the actual intake/endpoint/headers:

EVENT_TYPE: "span" or "evaluation_metric" for debugging/telemetry purposes
EVP_SUBDOMAIN_HEADER_VALUE: for agent proxy - either evals subdomain header value "api" or spans subdomain header value "llmobs-intake"
EVP_PROXY_ENDPOINT: for agent proxy - "/evp_proxy/v2/api/intake/llm-obs/v2/eval-metric" for evals or "/evp_proxy/v2/api/v2/llmobs" for spans
AGENTLESS_BASE_URL: for agentless - either "https://api" for evals or "https://llmobs-intake" for spans
AGENTLESS_ENDPOINT: for agentless - either "/api/intake/llm-obs/v2/eval-metric" for evals or "/api/v2/llmobs" for spans
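
To make the attribute-driven setup concrete, here is a hedged sketch of how a base constructor could derive the intake, endpoint, and headers from these class attributes. The attribute names and values come from the list above; everything else is an assumption for illustration, including the class names, the `X-Datadog-EVP-Subdomain` and `DD-API-KEY` header names, the default agent URL, and joining the base URL to the site with a dot.

```python
from typing import Dict, Optional

# Assumption: header name used to select the EVP proxy subdomain.
EVP_SUBDOMAIN_HEADER_NAME = "X-Datadog-EVP-Subdomain"


class BaseWriterConfig:
    """Hypothetical base class; subclasses only declare class attributes."""

    EVENT_TYPE = ""
    EVP_SUBDOMAIN_HEADER_VALUE = ""
    EVP_PROXY_ENDPOINT = ""
    AGENTLESS_BASE_URL = ""
    AGENTLESS_ENDPOINT = ""

    def __init__(
        self,
        is_agentless: bool,
        site: str = "datadoghq.com",
        api_key: str = "",
        agent_url: str = "http://localhost:8126",  # assumed default agent address
        override_url: Optional[str] = None,  # testing-only override
    ) -> None:
        self.headers: Dict[str, str] = {"Content-Type": "application/json"}
        if is_agentless:
            # Assumption: intake host is "<base>.<site>".
            self.intake = override_url or f"{self.AGENTLESS_BASE_URL}.{site}"
            self.endpoint = self.AGENTLESS_ENDPOINT
            self.headers["DD-API-KEY"] = api_key
        else:
            # Agent proxy: no API key needed, only the subdomain header.
            self.intake = override_url or agent_url
            self.endpoint = self.EVP_PROXY_ENDPOINT
            self.headers[EVP_SUBDOMAIN_HEADER_NAME] = self.EVP_SUBDOMAIN_HEADER_VALUE

    @property
    def url(self) -> str:
        return self.intake + self.endpoint


class EvalWriterConfig(BaseWriterConfig):
    EVENT_TYPE = "evaluation_metric"
    EVP_SUBDOMAIN_HEADER_VALUE = "api"
    EVP_PROXY_ENDPOINT = "/evp_proxy/v2/api/intake/llm-obs/v2/eval-metric"
    AGENTLESS_BASE_URL = "https://api"
    AGENTLESS_ENDPOINT = "/api/intake/llm-obs/v2/eval-metric"


class SpanWriterConfig(BaseWriterConfig):
    EVENT_TYPE = "span"
    EVP_SUBDOMAIN_HEADER_VALUE = "llmobs-intake"
    EVP_PROXY_ENDPOINT = "/evp_proxy/v2/api/v2/llmobs"
    AGENTLESS_BASE_URL = "https://llmobs-intake"
    AGENTLESS_ENDPOINT = "/api/v2/llmobs"
```

The point of this layout is that adding a new writer type only requires declaring five attributes, while the agentless/agent-proxy branching lives in one place.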

Checklist

PR author has checked that all the criteria below are met
The PR description includes an overview of the change
The PR description articulates the motivation for the change
The change includes tests OR the PR description describes a testing strategy
The PR description notes risks associated with the change, if any
Newly-added code is easy to change
The change follows the library release note guidelines
The change includes or references documentation updates if necessary
Backport labels are set (if applicable)

Reviewer Checklist

Reviewer has checked that all the criteria below are met
Title is accurate
All changes are related to the pull request's stated goal
Avoids breaking API changes
Testing strategy adequately addresses listed risks
Newly-added code is easy to change
Release note makes sense to a user of the library
If necessary, author has acknowledged and discussed the performance implications of this PR as reported in the benchmarks PR comment
Backport labels are set in a manner that is consistent with the release branch maintenance policy

github-actions · 2025-03-29T00:54:31Z

CODEOWNERS have been resolved as:

releasenotes/notes/feat-llmobs-evals-agent-proxy-55e35060f1aa2555.yaml  @DataDog/apm-python
tests/llmobs/llmobs_cassettes/tests.llmobs.test_llmobs_span_agentless_writer.test_send_chat_completion_event.yaml  @DataDog/ml-observability
tests/llmobs/llmobs_cassettes/tests.llmobs.test_llmobs_span_agentless_writer.test_send_completion_bad_api_key.yaml  @DataDog/ml-observability
tests/llmobs/llmobs_cassettes/tests.llmobs.test_llmobs_span_agentless_writer.test_send_completion_event.yaml  @DataDog/ml-observability
tests/llmobs/llmobs_cassettes/tests.llmobs.test_llmobs_span_agentless_writer.test_send_multiple_events.yaml  @DataDog/ml-observability
tests/llmobs/llmobs_cassettes/tests.llmobs.test_llmobs_span_agentless_writer.test_send_timed_events.yaml  @DataDog/ml-observability
tests/llmobs/test_llmobs_eval_metric_agent_writer.py                    @DataDog/ml-observability
ddtrace/llmobs/_constants.py                                            @DataDog/ml-observability
ddtrace/llmobs/_llmobs.py                                               @DataDog/ml-observability
ddtrace/llmobs/_telemetry.py                                            @DataDog/ml-observability
ddtrace/llmobs/_writer.py                                               @DataDog/ml-observability
tests/contrib/botocore/test_bedrock_llmobs.py                           @DataDog/ml-observability
tests/contrib/crewai/conftest.py                                        @DataDog/ml-observability
tests/contrib/langgraph/conftest.py                                     @DataDog/ml-observability
tests/contrib/openai_agents/conftest.py                                 @DataDog/ml-observability
tests/llmobs/_utils.py                                                  @DataDog/ml-observability
tests/llmobs/conftest.py                                                @DataDog/ml-observability
tests/llmobs/test_llmobs_evaluator_runner.py                            @DataDog/ml-observability
tests/llmobs/test_llmobs_ragas_evaluators.py                            @DataDog/ml-observability
tests/llmobs/test_llmobs_service.py                                     @DataDog/ml-observability
tests/llmobs/test_llmobs_span_agent_writer.py                           @DataDog/ml-observability
tests/llmobs/test_llmobs_span_agentless_writer.py                       @DataDog/ml-observability
tests/llmobs/test_llmobs_span_encoder.py                                @DataDog/ml-observability
tests/llmobs/llmobs_cassettes/tests.llmobs.test_llmobs_eval_metric_agentless_writer.send_score_metric.yaml  @DataDog/ml-observability
tests/llmobs/llmobs_cassettes/tests.llmobs.test_llmobs_eval_metric_agentless_writer.test_send_categorical_metric.yaml  @DataDog/ml-observability
tests/llmobs/llmobs_cassettes/tests.llmobs.test_llmobs_eval_metric_agentless_writer.test_send_metric_bad_api_key.yaml  @DataDog/ml-observability
tests/llmobs/llmobs_cassettes/tests.llmobs.test_llmobs_eval_metric_agentless_writer.test_send_multiple_events.yaml  @DataDog/ml-observability
tests/llmobs/llmobs_cassettes/tests.llmobs.test_llmobs_eval_metric_agentless_writer.test_send_score_metric.yaml  @DataDog/ml-observability
tests/llmobs/llmobs_cassettes/tests.llmobs.test_llmobs_eval_metric_agentless_writer.test_send_timed_events.yaml  @DataDog/ml-observability
tests/llmobs/test_llmobs_eval_metric_agentless_writer.py                @DataDog/ml-observability

github-actions · 2025-03-29T01:14:07Z

Bootstrap import analysis

Comparison of import times between this PR and base.

Summary

The average import time from this PR is: 229 ± 2 ms.

The average import time from base is: 231 ± 1 ms.

The import time difference between this PR and base is: -2.02 ± 0.07 ms.

Import time breakdown

The following import paths have shrunk:

ddtrace.auto 1.975 ms (0.86%)

ddtrace.bootstrap.sitecustomize 1.277 ms (0.56%)

ddtrace.bootstrap.preload 1.277 ms (0.56%)

ddtrace.internal.products 1.277 ms (0.56%)

ddtrace.internal.remoteconfig.client 0.614 ms (0.27%)

ddtrace 0.697 ms (0.30%)

ddtrace._logger 0.027 ms (0.01%)

logging 0.027 ms (0.01%)

traceback 0.027 ms (0.01%)

contextlib 0.027 ms (0.01%)

pr-commenter · 2025-03-29T01:35:01Z

Benchmarks

Benchmark execution time: 2025-04-17 16:19:26

Comparing candidate commit 832dfda in PR branch yunkim/llmobs-refactor-writer with baseline commit 31d13cf in branch main.

Found 7 performance improvements and 0 performance regressions! Performance is the same for 487 metrics, 2 unstable metrics.

scenario:iast_aspects-ospathsplitdrive_aspect

🟩 execution_time [-506.955ns; -408.782ns] or [-12.298%; -9.916%]

scenario:iast_aspects-ospathsplitext_aspect

🟩 execution_time [-660.899ns; -533.154ns] or [-12.790%; -10.318%]

scenario:otelspan-start-finish

🟩 execution_time [-18.584ms; -18.233ms] or [-23.602%; -23.156%]

scenario:otelspan-start-finish-telemetry

🟩 execution_time [-19.457ms; -19.031ms] or [-24.124%; -23.596%]

scenario:span-start-finish

🟩 execution_time [-16.267ms; -15.850ms] or [-33.663%; -32.800%]

scenario:span-start-finish-telemetry

🟩 execution_time [-16.282ms; -16.003ms] or [-32.979%; -32.414%]

scenario:span-start-finish-traceid128

🟩 execution_time [-18.587ms; -18.160ms] or [-36.197%; -35.367%]

tests/llmobs/_utils.py

ddtrace/llmobs/_constants.py

ddtrace/llmobs/_llmobs.py

sabrenner

did a first pass, looks really good to me!! this should make editing/testing/debugging the writers a lot easier. and thanks for throwing agent-proxy evals in too! mostly just left some questions/suggestions and a couple style nits.

ddtrace/llmobs/_llmobs.py

ddtrace/llmobs/_constants.py

ddtrace/llmobs/_telemetry.py

ddtrace/llmobs/_writer.py

tests/contrib/openai_agents/conftest.py

tests/llmobs/test_llmobs_evaluator_runner.py

tests/llmobs/test_llmobs_eval_metric_writer.py

ddtrace/llmobs/_writer.py

sabrenner

did another pass - just a couple questions and one bug i found when trying it out locally, but i think it should be good after!

ddtrace/llmobs/_writer.py

sabrenner

works locally for me now, everything looks great!

@IAL32

Resolves #13336. Credit to @IAL32 and a cherry-pick from #13338.

When we moved from the shared HTTPWriter to our own BaseLLMObsWriter class to submit spans and evals in #12966, we used our own `_get_connection()` to return HTTP/HTTPS connections. However, we forgot to include the UDSHTTP connection (for the Unix socket case), which means UDS support was broken until now.

Why was this a problem in the first place?

We used our own `_get_connection()` in #12966 because creating the shared HTTPConnection helper class was leading to MRO superclass constructor issues in our tests. At the time we thought this was caused by the shared HTTPConnection helper class having multiple superclasses, combined with an issue in Python 3.10 in general. It turns out to be caused by vcrpy mocking HTTPConnection entirely, so it only affects tests that rely on vcrpy. This PR avoids using vcrpy where it is not necessary and makes better assertions to ensure that spans are being sent (in most tests they do not need to be accepted).
https://datadoghq.atlassian.net/browse/MLOB-2725

Co-authored-by: IAL32 <[email protected]>
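
The gap described in the commit message above (a connection helper that handles HTTP and HTTPS but not Unix domain sockets) can be illustrated with a small sketch. This is not ddtrace's shared helper: `UDSHTTPConnection` and `get_connection` here are hypothetical names built only on the standard library, and the `unix://` URL scheme is an assumption for the example.

```python
import http.client
import socket
from urllib.parse import urlparse


class UDSHTTPConnection(http.client.HTTPConnection):
    """HTTP over a Unix domain socket (hypothetical, stdlib-only sketch)."""

    def __init__(self, path: str, timeout: float = 2.0) -> None:
        # The host is only used for the Host header; the socket path is what matters.
        super().__init__("localhost", timeout=timeout)
        self._uds_path = path

    def connect(self) -> None:
        sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        sock.settimeout(self.timeout)
        sock.connect(self._uds_path)
        self.sock = sock


def get_connection(url: str, timeout: float = 2.0) -> http.client.HTTPConnection:
    """Pick a connection class from the URL scheme, including the UDS case."""
    parsed = urlparse(url)
    if parsed.scheme == "unix":
        return UDSHTTPConnection(parsed.path, timeout=timeout)
    if parsed.scheme == "https":
        return http.client.HTTPSConnection(parsed.hostname, parsed.port, timeout=timeout)
    if parsed.scheme == "http":
        return http.client.HTTPConnection(parsed.hostname, parsed.port, timeout=timeout)
    raise ValueError(f"Unsupported URL scheme: {parsed.scheme!r}")
```

Omitting the `unix` branch is the kind of omission that would leave HTTP/HTTPS working while silently breaking socket-based agent setups.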

Refactor llmobs writers

61a71e7

Yun-Kim added 3 commits April 4, 2025 15:20

Use http helper for getting connection

9826477

Refactor writers to share base class

fac5c92

Merge remote-tracking branch 'origin' into yunkim/llmobs-refactor-writer

75ca16d

datadog-datadog-prod-us1 bot reviewed Apr 7, 2025

tests/llmobs/_utils.py (outdated, resolved)

Refactor test llmobs writer out to utils, add release note

cd62fb9

Yun-Kim force-pushed the yunkim/llmobs-refactor-writer branch from 243a0b1 to cd62fb9 on April 7, 2025 22:36

Yun-Kim changed the title ~~chore(llmobs): refactor span writer~~ feat(llmobs): support evp proxy for evaluation metrics Apr 7, 2025

Yun-Kim and others added 5 commits April 7, 2025 19:00

typing

5bc6dd7

wip fix writer tests

0353d63

Fix LLMObs tests

929ba5e

Merge branch 'main' into yunkim/llmobs-refactor-writer

835bf84

fmt

f561ec9

Yun-Kim force-pushed the yunkim/llmobs-refactor-writer branch from bfe40fe to f561ec9 on April 8, 2025 22:10

Yun-Kim added 2 commits April 9, 2025 10:18

fix test fixtures

2a8aa8d

remove fix, evan can fix this when he comes back

3e8c16a

Yun-Kim commented Apr 10, 2025

ddtrace/llmobs/_constants.py (resolved)

Yun-Kim commented Apr 10, 2025

ddtrace/llmobs/_llmobs.py (outdated, resolved)

Yun-Kim commented Apr 10, 2025

ddtrace/llmobs/_llmobs.py (outdated, resolved)

Make site/agentless_url an internal detail

b3b960b

Yun-Kim force-pushed the yunkim/llmobs-refactor-writer branch from abd46cb to b3b960b on April 10, 2025 16:19

Yun-Kim added 2 commits April 10, 2025 14:56

fix tests, remove comment

9ed288b

test fixtures

e1b2f9d

Yun-Kim marked this pull request as ready for review April 10, 2025 22:30

Yun-Kim requested review from a team as code owners April 10, 2025 22:30

Yun-Kim requested a review from a team as a code owner April 10, 2025 22:30

Yun-Kim requested review from avara1986, erikayasuda and wantsui April 10, 2025 22:30

Yun-Kim mentioned this pull request Apr 10, 2025

feat(litellm): add llmobs tracing for litellm #12885

Merged

2 tasks

avara1986 approved these changes Apr 11, 2025

Merge branch 'main' into yunkim/llmobs-refactor-writer

bb591c3

sabrenner reviewed Apr 11, 2025

emmettbutler approved these changes Apr 14, 2025

Yun-Kim added 2 commits April 15, 2025 18:04

Further refactor

d3cebfc

Add eval metric proxy test file

c99fec7

sabrenner reviewed Apr 16, 2025

ddtrace/llmobs/_writer.py (outdated, resolved)

ddtrace/llmobs/_writer.py (outdated, resolved)

ddtrace/llmobs/_writer.py (outdated, resolved)

ddtrace/llmobs/_writer.py (resolved)

ddtrace/llmobs/_writer.py (resolved)

fmt

2544c96

Yun-Kim force-pushed the yunkim/llmobs-refactor-writer branch from d6a7eac to 2544c96 on April 16, 2025 18:03

Fix agentless, intake, endpoint setting

b00924e

sabrenner reviewed Apr 16, 2025

ddtrace/llmobs/_writer.py (outdated, resolved)

Yun-Kim added 3 commits April 16, 2025 17:02

Clean up writers

b607133

rename test

69b9dd5

Fix flaky tests

5caf7e9

sabrenner approved these changes Apr 17, 2025

Yun-Kim enabled auto-merge (squash) April 17, 2025 14:48

Yun-Kim disabled auto-merge April 17, 2025 14:48

Import order, fmt

832dfda

Yun-Kim enabled auto-merge (squash) April 17, 2025 15:20

Yun-Kim merged commit 691c82d into main Apr 17, 2025
338 checks passed

Yun-Kim deleted the yunkim/llmobs-refactor-writer branch April 17, 2025 16:20

Yun-Kim mentioned this pull request Apr 25, 2025

chore(llmobs): minimize number of span/eval encoding #13273

Closed

2 tasks

Yun-Kim mentioned this pull request May 6, 2025

fix(llmobs): reuse shared conn #13339

Merged

2 tasks
