feat(llmobs): support evp proxy for evaluation metrics #12966

Merged
merged 24 commits into from
Apr 17, 2025

@Yun-Kim Yun-Kim commented Mar 29, 2025

MLOB-2495
This PR does four things:

  1. Adds support for submitting eval metrics to LLM Observability using the agent as a proxy. Previously, eval metrics could only be sent via agentless intake.
  2. Adds retry logic for submitting eval metric payloads, up to a maximum of 3 attempts on failure/error. This should limit the number of eval payloads dropped due to random connection errors.
  3. (Non-user facing) Refactors the LLMObs writer classes (eval and span writers) to be subclasses of the same BaseLLMObsWriter class. Additionally, we simplify the writer constructor such that only interval/timeout are required, and api_key/site/agentless_enabled/_agentless_url are optional and otherwise inferred internally. _agentless_url is now an internal optional override param for testing.
  4. (Non-user facing) Refactors the TestLLMObsSpanWriter class to be defined in a single tests/llmobs/_utils.py file rather than per integration test conftest file.
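
The retry behaviour from item 2 can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `send_with_retry` and the `send` callable are hypothetical names, and only the limit of 3 attempts comes from the description above.

```python
MAX_ATTEMPTS = 3  # from the PR description: up to 3 attempts


def send_with_retry(send, payload, max_attempts=MAX_ATTEMPTS):
    """Call send(payload), retrying on any exception.

    `send` is a hypothetical callable that submits one encoded payload.
    The last error is re-raised once all attempts have failed.
    """
    last_error = None
    for _attempt in range(max_attempts):
        try:
            return send(payload)
        except Exception as exc:  # a real writer would likely narrow this
            last_error = exc
    raise last_error
```

A real writer would probably also log the final failure and wait between attempts; both are left out here to keep the sketch short.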

Eval support for Agent Proxy

We were previously only submitting eval metrics via agentless intake. However we recently discovered that agent proxy does support submitting to direct LLMObs API intake (with the correct subdomain headers). This PR adds that support so that users do not need to specify an API key to submit eval metrics, especially if they are relying on agent proxy to submit traces.

Writer Refactor

We were previously making LLMObsSpanWriter subclass off HTTPWriter, which is the APM tracer's base writer class for submitting traces. We initially reused this class as it provided a lot of convenience methods/functionality like retry logic, payload submission logic specifically to the agent (when we were implementing agent proxy support for spans).

However, we've decided that there is a bit too much functionality and abstraction in the HTTPWriter class, including the writer client / encoder structure, vague error logs on unsuccessful writes, as well as the slightly convoluted codepath for actually flushing/submitting payloads. For example, the HTTPWriter structure involves:

  • enqueuing event to be written: enqueue() --> write() --> write_with_client() --> client.encoder.put()
  • flushing and writing payload: periodic() --> flush_queue() --> _flush_queue_with_client() --> client.encoder.encode() + _send_payload()

This mismatch in functionality/requirements is due in part to the fact that LLMObs span and eval metric payloads are JSON encoded, meaning there is no real need for separate encoder-specific logic.

This new refactored writer directly manages its buffer, enqueuing, encoding, and flushing. This logic looks something like this:

  • enqueuing event to be written: enqueue(event) --> BaseLLMObsWriter._buffer.append(event).
  • flushing and writing payload: periodic() --> BaseLLMObsWriter._encode() + BaseLLMObsWriter._send_payload()

Once we eventually specify the encoding schema to be more complex than the simple json.dumps() that we are currently doing, we can reuse the HTTPWriter class and its encoder/client pattern. Until then, we don't really need the complex functionality provided by HTTPWriter.
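
To make the simplified flow above concrete, here is a rough sketch of a writer that owns its buffer, encoding, and flushing. It is an illustration under assumptions, not the actual BaseLLMObsWriter: the class name `SketchWriter`, the lock, and the injected `send` callable are hypothetical; only the `enqueue()` / `periodic()` / `_encode()` / `_send_payload()` shape and the use of `json.dumps()` come from the description.

```python
import json
import threading


class SketchWriter:
    """Hypothetical stand-in for the refactored writer (not ddtrace code)."""

    def __init__(self, send):
        self._buffer = []
        self._lock = threading.Lock()
        self._send = send  # hypothetical callable that ships encoded bytes

    def enqueue(self, event):
        # enqueue(event) --> _buffer.append(event)
        with self._lock:
            self._buffer.append(event)

    def _encode(self, events):
        # Payloads are plain JSON, so no separate encoder object is needed.
        return json.dumps(events).encode("utf-8")

    def _send_payload(self, payload):
        self._send(payload)

    def periodic(self):
        # periodic() --> _encode() + _send_payload()
        with self._lock:
            if not self._buffer:
                return
            events, self._buffer = self._buffer, []
        self._send_payload(self._encode(events))
```

Swapping the buffer out under the lock and encoding afterwards keeps `enqueue()` callers from blocking on network I/O; whether the real writer does the same is not stated here.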

Writer Initialization

The Writer classes now take in the following arguments:

  • required args: interval, timeout, is_agentless
  • optional kwargs: _site, _api_key (taken from config if not provided)
  • one last optional kwarg: _override_url (for testing purposes only, to point at non-prod/stg intake URLs, for example the local mock backend server that is set in tests/llmobs/conftest.py)

Instead of the previous design, where each subclass constructor individually set the intake/endpoint/url/headers for its own case, each Span/Eval writer class now has the following class attributes, which the base class constructor uses to build the actual intake/endpoint/headers:

  • EVENT_TYPE: "span" or "evaluation_metric" for debugging/telemetry purposes
  • EVP_SUBDOMAIN_HEADER_VALUE: for agent proxy - either evals subdomain header value "api" or spans subdomain header value "llmobs-intake"
  • EVP_PROXY_ENDPOINT: for agent proxy - "/evp_proxy/v2/api/intake/llm-obs/v2/eval-metric" for evals or "/evp_proxy/v2/api/v2/llmobs" for spans
  • AGENTLESS_BASE_URL: for agentless - either "https://api" for evals or "https://llmobs-intake" for spans
  • AGENTLESS_ENDPOINT: for agentless - either "/api/intake/llm-obs/v2/eval-metric" for evals or "/api/v2/llmobs" for spans
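
As a sketch of how a base constructor could combine these class attributes, consider the following. The attribute names and values are taken from the list above; everything else is an assumption made for illustration: the class names, the `_agent_url` parameter and its default, the header names `DD-API-KEY` and `X-Datadog-EVP-Subdomain`, and joining AGENTLESS_BASE_URL and the site with a dot.

```python
class SketchBaseWriter:
    """Hypothetical base class; only the class attribute names come from the PR."""

    EVENT_TYPE = ""
    EVP_SUBDOMAIN_HEADER_VALUE = ""
    EVP_PROXY_ENDPOINT = ""
    AGENTLESS_BASE_URL = ""
    AGENTLESS_ENDPOINT = ""

    def __init__(self, interval, timeout, is_agentless, _site="", _api_key="",
                 _override_url="", _agent_url="http://localhost:8126"):
        self.interval = interval
        self.timeout = timeout
        self.headers = {"Content-Type": "application/json"}
        if is_agentless:
            # Assumption: base URL and site are joined with a dot.
            base = _override_url or "{}.{}".format(self.AGENTLESS_BASE_URL, _site)
            self.url = base + self.AGENTLESS_ENDPOINT
            self.headers["DD-API-KEY"] = _api_key  # assumed header name
        else:
            # Assumption: the agent URL is passed in; the real writer infers it.
            base = _override_url or _agent_url
            self.url = base + self.EVP_PROXY_ENDPOINT
            # Assumed header name for routing through the agent's EVP proxy.
            self.headers["X-Datadog-EVP-Subdomain"] = self.EVP_SUBDOMAIN_HEADER_VALUE


class SketchEvalWriter(SketchBaseWriter):
    EVENT_TYPE = "evaluation_metric"
    EVP_SUBDOMAIN_HEADER_VALUE = "api"
    EVP_PROXY_ENDPOINT = "/evp_proxy/v2/api/intake/llm-obs/v2/eval-metric"
    AGENTLESS_BASE_URL = "https://api"
    AGENTLESS_ENDPOINT = "/api/intake/llm-obs/v2/eval-metric"


class SketchSpanWriter(SketchBaseWriter):
    EVENT_TYPE = "span"
    EVP_SUBDOMAIN_HEADER_VALUE = "llmobs-intake"
    EVP_PROXY_ENDPOINT = "/evp_proxy/v2/api/v2/llmobs"
    AGENTLESS_BASE_URL = "https://llmobs-intake"
    AGENTLESS_ENDPOINT = "/api/v2/llmobs"
```

With this layout a subclass only declares constants, and the agentless versus agent-proxy branching lives in one place.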

Checklist

  • PR author has checked that all the criteria below are met
  • The PR description includes an overview of the change
  • The PR description articulates the motivation for the change
  • The change includes tests OR the PR description describes a testing strategy
  • The PR description notes risks associated with the change, if any
  • Newly-added code is easy to change
  • The change follows the library release note guidelines
  • The change includes or references documentation updates if necessary
  • Backport labels are set (if applicable)

Reviewer Checklist

  • Reviewer has checked that all the criteria below are met
  • Title is accurate
  • All changes are related to the pull request's stated goal
  • Avoids breaking API changes
  • Testing strategy adequately addresses listed risks
  • Newly-added code is easy to change
  • Release note makes sense to a user of the library
  • If necessary, author has acknowledged and discussed the performance implications of this PR as reported in the benchmarks PR comment
  • Backport labels are set in a manner that is consistent with the release branch maintenance policy

github-actions bot commented Mar 29, 2025

CODEOWNERS have been resolved as:

releasenotes/notes/feat-llmobs-evals-agent-proxy-55e35060f1aa2555.yaml  @DataDog/apm-python
tests/llmobs/llmobs_cassettes/tests.llmobs.test_llmobs_span_agentless_writer.test_send_chat_completion_event.yaml  @DataDog/ml-observability
tests/llmobs/llmobs_cassettes/tests.llmobs.test_llmobs_span_agentless_writer.test_send_completion_bad_api_key.yaml  @DataDog/ml-observability
tests/llmobs/llmobs_cassettes/tests.llmobs.test_llmobs_span_agentless_writer.test_send_completion_event.yaml  @DataDog/ml-observability
tests/llmobs/llmobs_cassettes/tests.llmobs.test_llmobs_span_agentless_writer.test_send_multiple_events.yaml  @DataDog/ml-observability
tests/llmobs/llmobs_cassettes/tests.llmobs.test_llmobs_span_agentless_writer.test_send_timed_events.yaml  @DataDog/ml-observability
tests/llmobs/test_llmobs_eval_metric_agent_writer.py                    @DataDog/ml-observability
ddtrace/llmobs/_constants.py                                            @DataDog/ml-observability
ddtrace/llmobs/_llmobs.py                                               @DataDog/ml-observability
ddtrace/llmobs/_telemetry.py                                            @DataDog/ml-observability
ddtrace/llmobs/_writer.py                                               @DataDog/ml-observability
tests/contrib/botocore/test_bedrock_llmobs.py                           @DataDog/ml-observability
tests/contrib/crewai/conftest.py                                        @DataDog/ml-observability
tests/contrib/langgraph/conftest.py                                     @DataDog/ml-observability
tests/contrib/openai_agents/conftest.py                                 @DataDog/ml-observability
tests/llmobs/_utils.py                                                  @DataDog/ml-observability
tests/llmobs/conftest.py                                                @DataDog/ml-observability
tests/llmobs/test_llmobs_evaluator_runner.py                            @DataDog/ml-observability
tests/llmobs/test_llmobs_ragas_evaluators.py                            @DataDog/ml-observability
tests/llmobs/test_llmobs_service.py                                     @DataDog/ml-observability
tests/llmobs/test_llmobs_span_agent_writer.py                           @DataDog/ml-observability
tests/llmobs/test_llmobs_span_agentless_writer.py                       @DataDog/ml-observability
tests/llmobs/test_llmobs_span_encoder.py                                @DataDog/ml-observability
tests/llmobs/llmobs_cassettes/tests.llmobs.test_llmobs_eval_metric_agentless_writer.send_score_metric.yaml  @DataDog/ml-observability
tests/llmobs/llmobs_cassettes/tests.llmobs.test_llmobs_eval_metric_agentless_writer.test_send_categorical_metric.yaml  @DataDog/ml-observability
tests/llmobs/llmobs_cassettes/tests.llmobs.test_llmobs_eval_metric_agentless_writer.test_send_metric_bad_api_key.yaml  @DataDog/ml-observability
tests/llmobs/llmobs_cassettes/tests.llmobs.test_llmobs_eval_metric_agentless_writer.test_send_multiple_events.yaml  @DataDog/ml-observability
tests/llmobs/llmobs_cassettes/tests.llmobs.test_llmobs_eval_metric_agentless_writer.test_send_score_metric.yaml  @DataDog/ml-observability
tests/llmobs/llmobs_cassettes/tests.llmobs.test_llmobs_eval_metric_agentless_writer.test_send_timed_events.yaml  @DataDog/ml-observability
tests/llmobs/test_llmobs_eval_metric_agentless_writer.py                @DataDog/ml-observability

github-actions bot commented Mar 29, 2025

Bootstrap import analysis

Comparison of import times between this PR and base.

Summary

The average import time from this PR is: 229 ± 2 ms.

The average import time from base is: 231 ± 1 ms.

The import time difference between this PR and base is: -2.02 ± 0.07 ms.

Import time breakdown

The following import paths have shrunk:

ddtrace.auto 1.975 ms (0.86%)
ddtrace.bootstrap.sitecustomize 1.277 ms (0.56%)
ddtrace.bootstrap.preload 1.277 ms (0.56%)
ddtrace.internal.products 1.277 ms (0.56%)
ddtrace.internal.remoteconfig.client 0.614 ms (0.27%)
ddtrace 0.697 ms (0.30%)
ddtrace._logger 0.027 ms (0.01%)
logging 0.027 ms (0.01%)
traceback 0.027 ms (0.01%)
contextlib 0.027 ms (0.01%)

pr-commenter bot commented Mar 29, 2025

Benchmarks

Benchmark execution time: 2025-04-17 16:19:26

Comparing candidate commit 832dfda in PR branch yunkim/llmobs-refactor-writer with baseline commit 31d13cf in branch main.

Found 7 performance improvements and 0 performance regressions! Performance is the same for 487 metrics, 2 unstable metrics.

scenario:iast_aspects-ospathsplitdrive_aspect

  • 🟩 execution_time [-506.955ns; -408.782ns] or [-12.298%; -9.916%]

scenario:iast_aspects-ospathsplitext_aspect

  • 🟩 execution_time [-660.899ns; -533.154ns] or [-12.790%; -10.318%]

scenario:otelspan-start-finish

  • 🟩 execution_time [-18.584ms; -18.233ms] or [-23.602%; -23.156%]

scenario:otelspan-start-finish-telemetry

  • 🟩 execution_time [-19.457ms; -19.031ms] or [-24.124%; -23.596%]

scenario:span-start-finish

  • 🟩 execution_time [-16.267ms; -15.850ms] or [-33.663%; -32.800%]

scenario:span-start-finish-telemetry

  • 🟩 execution_time [-16.282ms; -16.003ms] or [-32.979%; -32.414%]

scenario:span-start-finish-traceid128

  • 🟩 execution_time [-18.587ms; -18.160ms] or [-36.197%; -35.367%]

@Yun-Kim Yun-Kim force-pushed the yunkim/llmobs-refactor-writer branch from 243a0b1 to cd62fb9 Compare April 7, 2025 22:36
@Yun-Kim Yun-Kim changed the title chore(llmobs): refactor span writer feat(llmobs): support evp proxy for evaluation metrics Apr 7, 2025
@Yun-Kim Yun-Kim force-pushed the yunkim/llmobs-refactor-writer branch from bfe40fe to f561ec9 Compare April 8, 2025 22:10
@Yun-Kim Yun-Kim force-pushed the yunkim/llmobs-refactor-writer branch from abd46cb to b3b960b Compare April 10, 2025 16:19
@Yun-Kim Yun-Kim marked this pull request as ready for review April 10, 2025 22:30
@Yun-Kim Yun-Kim requested review from a team as code owners April 10, 2025 22:30
@Yun-Kim Yun-Kim requested a review from a team as a code owner April 10, 2025 22:30
@sabrenner sabrenner left a comment

did a first pass, looks really good to me!! this should make editing/testing/debugging the writers a lot easier. and thanks for throwing agent-proxy evals in too! mostly just left some questions/suggestions and a couple style nits.

@sabrenner sabrenner left a comment

did another pass - just a couple questions and one bug i found when trying it out locally, but i think it should be good after!

@Yun-Kim Yun-Kim force-pushed the yunkim/llmobs-refactor-writer branch from d6a7eac to 2544c96 Compare April 16, 2025 18:03
@sabrenner sabrenner left a comment

works locally for me now, everything looks great!

@Yun-Kim Yun-Kim enabled auto-merge (squash) April 17, 2025 14:48
@Yun-Kim Yun-Kim disabled auto-merge April 17, 2025 14:48
@Yun-Kim Yun-Kim enabled auto-merge (squash) April 17, 2025 15:20
@Yun-Kim Yun-Kim merged commit 691c82d into main Apr 17, 2025
338 checks passed
@Yun-Kim Yun-Kim deleted the yunkim/llmobs-refactor-writer branch April 17, 2025 16:20
@Yun-Kim Yun-Kim mentioned this pull request May 6, 2025
2 tasks
Yun-Kim added a commit that referenced this pull request May 7, 2025
Resolves #13336. Credit to @IAL32 and a cherry-pick from #13338.

When we made the jump from using the shared HTTPWriter to our own
BaseLLMObsWriter class to submit spans and evals #12966, we used our own
`_get_connection()` to return HTTP/HTTPS connections. However we forgot
to include UDSHTTP connection (for the unix socket case), which means we
broke UDS support until now.
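
For illustration, a connection helper that covers all three cases could look roughly like the sketch below. It uses only the standard library and is not the ddtrace implementation: `SketchUDSConnection`, `get_connection`, and the `unix://` URL form are assumptions made for this example.

```python
import http.client
import socket
from urllib.parse import urlparse


class SketchUDSConnection(http.client.HTTPConnection):
    """Hypothetical HTTP-over-unix-socket connection (stdlib only)."""

    def __init__(self, path, timeout=2.0):
        # The host is only used for the Host header; traffic goes via the socket.
        super().__init__("localhost", timeout=timeout)
        self._uds_path = path

    def connect(self):
        sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        sock.settimeout(self.timeout)
        sock.connect(self._uds_path)
        self.sock = sock


def get_connection(url, timeout=2.0):
    """Return a connection object matching the URL scheme."""
    parsed = urlparse(url)
    if parsed.scheme == "http":
        return http.client.HTTPConnection(parsed.hostname, parsed.port, timeout=timeout)
    if parsed.scheme == "https":
        return http.client.HTTPSConnection(parsed.hostname, parsed.port, timeout=timeout)
    if parsed.scheme == "unix":
        # This is the branch that was missing when UDS support broke.
        return SketchUDSConnection(parsed.path, timeout=timeout)
    raise ValueError("Unsupported scheme: %r" % parsed.scheme)
```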

### Why was this a problem in the first place? 
We used our own `_get_connection()` in #12966 because of an issue where
creating the shared HTTPConnection helper class was leading to MRO
superclass constructor issues in our tests. At the time we thought this
was due to the shared HTTPConnection helper class having multiple
superclasses and an issue with Python 3.10 in general, but this turns
out to be due to vcrpy mocking HTTPConnection entirely and only being an
issue in tests that rely on vcrpy. This PR makes some changes to avoid
using vcrpy when not necessary, and making better assertions to ensure
that spans are being sent (not necessary in most tests to have them be
accepted).

## Checklist
- [x] PR author has checked that all the criteria below are met
- The PR description includes an overview of the change
- The PR description articulates the motivation for the change
- The change includes tests OR the PR description describes a testing
strategy
- The PR description notes risks associated with the change, if any
- Newly-added code is easy to change
- The change follows the [library release note
guidelines](https://ddtrace.readthedocs.io/en/stable/releasenotes.html)
- The change includes or references documentation updates if necessary
- Backport labels are set (if
[applicable](https://ddtrace.readthedocs.io/en/latest/contributing.html#backporting))

## Reviewer Checklist
- [x] Reviewer has checked that all the criteria below are met 
- Title is accurate
- All changes are related to the pull request's stated goal
- Avoids breaking
[API](https://ddtrace.readthedocs.io/en/stable/versioning.html#interfaces)
changes
- Testing strategy adequately addresses listed risks
- Newly-added code is easy to change
- Release note makes sense to a user of the library
- If necessary, author has acknowledged and discussed the performance
implications of this PR as reported in the benchmarks PR comment
- Backport labels are set in a manner that is consistent with the
[release branch maintenance
policy](https://ddtrace.readthedocs.io/en/latest/contributing.html#backporting)

[](https://datadoghq.atlassian.net/browse/MLOB-2725)

---------

Co-authored-by: IAL32 <[email protected]>