test(recorded): add replay harness (1/5) #1974

Merged
Pouyanpi merged 10 commits into develop from stack/recorded-tests-01-harness
Jun 26, 2026

Conversation

@Pouyanpi

@Pouyanpi Pouyanpi commented Jun 3, 2026

Collaborator

Summary

Adds the recorded-test replay harness, cassette sanitization, pytest markers, refresh target, and harness self-tests.

Why

The replay infrastructure needs to be reviewed separately from provider-specific coverage so cassette handling and sanitization rules are clear.

What Changed

  • Adds pytest-recording, inline-snapshot, and dirty-equals dev dependencies.
  • Adds cassette serialization, sanitization, fixture, assertion, inspection, and fake-cassette helpers.
  • Adds recorded-test marker registration, README guidance, and make target for refresh/replay.

Review Notes

No provider coverage is introduced in this PR beyond helper tests for the harness itself.

Stack Position

Part 1 of 5.

Stack Context

This stack decomposes recorded end-to-end replay coverage into reviewable slices. Please review each PR against its parent branch in the stack, not directly against the root base branch; part 1 is the exception and targets develop directly.

| Order | PR | Branch | Base |
|---|---|---|---|
| 1 | #1974 | stack/recorded-tests-01-harness | develop |
| 2 | #1975 | stack/recorded-tests-02-deterministic-library-load | stack/recorded-tests-01-harness |
| 3 | #1976 | stack/recorded-tests-03-clients | stack/recorded-tests-02-deterministic-library-load |
| 4 | #1977 | stack/recorded-tests-04-public-api | stack/recorded-tests-03-clients |
| 5 | #1978 | stack/recorded-tests-05-library-rails | stack/recorded-tests-04-public-api |

Validation

poetry check --lock
poetry lock --no-update
poetry install --with dev
poetry run pytest tests/recorded --block-network -q
pre-commit hooks passed during commit creation

Summary by CodeRabbit

  • New Features

    • Introduced recorded testing framework enabling deterministic integration tests using cassette-based API interactions without live network access.
  • Tests

    • Added comprehensive assertion utilities, cassette sanitization, and test helpers for recorded integration tests.
  • Documentation

    • Added guide for recorded testing conventions, cassette naming, and snapshot workflows.
  • Chores

    • Added record-tests build target and updated project configuration for cassette recording.

@github-actions

github-actions Bot commented Jun 3, 2026

Contributor

Documentation preview

https://nvidia-nemo.github.io/Guardrails/review/pr-1974

@codecov

codecov Bot commented Jun 3, 2026


Codecov Report

✅ All modified and coverable lines are covered by tests.

@m-misiura m-misiura left a comment

Contributor

Nice work; I'll come back with questions / comments later in the week :)

@Pouyanpi Pouyanpi force-pushed the stack/recorded-tests-01-harness branch from 59c19ea to c0ff0c7 Compare June 9, 2026 16:22
Comment thread tests/recorded/cassette.py
Comment thread tests/recorded/conftest.py
Comment thread tests/recorded/inspect_cassette.py Outdated
Comment thread tests/recorded/cassette.py Outdated
@Pouyanpi Pouyanpi marked this pull request as ready for review June 11, 2026 10:33
@coderabbitai

coderabbitai Bot commented Jun 11, 2026

Contributor

Walkthrough

This PR introduces a comprehensive recorded/cassette-based integration testing framework using pytest-recording and VCR. It adds cassette body processing with normalization, pytest/VCR integration with YAML serialization and request matching, test assertions and helpers, validation utilities, cassette inspection tooling, Rails-specific configuration, and extensive documentation.

Changes

Recorded Cassette Testing Framework

| Layer | File(s) | Summary |
|---|---|---|
| Build configuration and project setup | Makefile, pyproject.toml, pytest.ini, tests/recorded/__init__.py, tests/recorded/rails/__init__.py | Makefile adds record-tests target; pyproject.toml adds pytest-recording, inline-snapshot, dirty-equals dependencies and snapshot formatting configuration; pytest.ini registers recorded, live, vcr, and fake_cassette markers; package modules include Apache-2.0 license headers. |
| Cassette data processing and normalization | tests/recorded/sanitization.py, tests/recorded/cassette.py | Sanitization module defines filtered headers/query parameters, allowed headers, volatile response fields, and secret detection patterns. Cassette module implements smart-character normalization, JSON/SSE body parsing with rehydration, token usage standardization across provider naming conventions, and streaming/non-streaming response parsing into RecordedChatResponse dataclass. |
| VCR and pytest integration | tests/recorded/conftest.py | Implements VCR/pytest-recording configuration including custom YAML serializer for readable cassette storage, request/response body matching with normalization, before-record hooks for header filtering and secret scrubbing, volatile field normalization (with special jailbreak/score handling), SSE response scrubbing, HTTP client lifecycle management with leak detection, and provider-specific API key fixtures. |
| Test assertion and normalization helpers | tests/recorded/assertions.py, tests/recorded/normalization.py, tests/recorded/utils.py, tests/recorded/fake_cassettes.py, tests/recorded/snapshots.py | Assertions module validates Rails results, generation responses, streaming contracts, and LLM token usage. Normalization converts RailsResult/GenerationResponse/stream chunks to dicts. Utilities provide API key management for record mode selection. Fake cassettes validates YAML header metadata. Snapshots re-exports inline-snapshot. |
| Cassette inspection and analysis tooling | tests/recorded/inspect_cassette.py | CLI utility reads cassette YAML, extracts request/response metadata (model, stream flags, usage, response text), decodes SSE and JSON payloads with error handling, and outputs formatted JSON summaries. |
| Infrastructure validation test suite | tests/recorded/test_cassette_sanitization.py, tests/recorded/test_fake_cassettes.py, tests/recorded/test_inspect_cassette.py | Comprehensive tests validating cassette sanitization, YAML serialization, body matching, header filtering, token redaction, smart-character normalization, SSE parsing, response metadata preservation, fake cassette metadata, and cassette inspection functionality. |
| Rails-specific test configuration | tests/recorded/rails_config.py, tests/recorded/rails/conftest.py | Config loading from filesystem or inline YAML/Colang with caching and deep-copying. Streaming enablement with optional chunk/context/stream-first parameter overrides. Async fixture for managing default LLM framework state during tests. |
| Documentation and package structure | tests/recorded/README.md | Comprehensive guide covering test conventions, module-level markers, fixture-based credentials, replay/refresh commands, cassette naming and serialization, inline snapshot workflows, and fake cassette requirements including metadata validation. |
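
The "token usage standardization across provider naming conventions" mentioned for cassette.py can be pictured roughly as follows. This is a minimal sketch, not the harness code: the function name `normalize_usage`, the alias table, and the choice of canonical keys are all assumptions made for illustration. The only grounded detail is that providers disagree on key names (for example `prompt_tokens`/`completion_tokens` versus `input_tokens`/`output_tokens`).

```python
from typing import Any, Mapping, Optional

# Hypothetical alias table: provider-specific key -> canonical key.
# The real mapping in tests/recorded/cassette.py is not shown in this PR page.
_USAGE_ALIASES = {
    "prompt_tokens": "input_tokens",
    "input_tokens": "input_tokens",
    "completion_tokens": "output_tokens",
    "output_tokens": "output_tokens",
    "total_tokens": "total_tokens",
}


def normalize_usage(usage: Optional[Mapping[str, Any]]) -> dict[str, int]:
    """Map provider-specific usage keys onto one canonical shape."""
    normalized: dict[str, int] = {}
    for key, value in (usage or {}).items():
        canonical = _USAGE_ALIASES.get(key)
        # Skip unknown keys and non-integer values (bool is an int subclass).
        if canonical is None or isinstance(value, bool) or not isinstance(value, int):
            continue
        normalized[canonical] = value
    # Derive the total when the provider did not report one.
    if "total_tokens" not in normalized and {"input_tokens", "output_tokens"} <= normalized.keys():
        normalized["total_tokens"] = normalized["input_tokens"] + normalized["output_tokens"]
    return normalized
```

With a single canonical shape, assertions on usage can stay identical across providers instead of branching on key names.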

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested labels

enhancement

🚥 Pre-merge checks | ✅ 4 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 5.50% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Test Results For Major Changes | ❓ Inconclusive | Repository clone failed, so this custom check could not run with code access. | Retry the review run. If this persists, inspect pre-merge custom-check logs for infrastructure or agent runtime failures. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title "test(recorded): add replay harness (1/5)" clearly describes the main change: introducing a recorded-test replay harness infrastructure as part 1 of a multi-part stack. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@coderabbitai coderabbitai Bot left a comment

Contributor

Actionable comments posted: 3

🧹 Nitpick comments (2)
Makefile (1)

61-75: ⚡ Quick win

Document the new record-tests target in the help section.

The help output does not mention the newly added record-tests target. Users running make help will not discover this workflow.

📝 Proposed addition to help output
 help:
 	@echo '----'
 	@echo 'test                         - run unit tests'
 	@echo 'tests                        - run unit tests'
 	@echo 'test TEST_FILE=<test_file>   - run all tests in given file'
 	@echo 'test_watch                   - run unit tests in watch mode'
 	@echo 'test_coverage                - run unit tests with coverage'
+	@echo 'record-tests                 - refresh recorded-test cassettes against live providers (requires API keys)'
 	@echo 'docs                         - build docs, if you installed the docs dependencies'
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@Makefile` around lines 61 - 75, The help target is missing the newly added
record-tests make target; update the help recipe (the help: target) to include a
line describing "record-tests - run tests in record mode" (or similar) so users
see it when running make help; locate the help target block and add a new echo
line for "record-tests" matching the existing formatting used for other targets.
tests/recorded/inspect_cassette.py (1)

25-25: ⚡ Quick win

Consider making the imported functions public or restructuring.

The import of _decode_body_json, _decode_body_text, and _stream_payloads_from_body couples this module to internal implementation details of the cassette module. Since these functions are being reused in a related tool, consider either:

  1. Making these functions public in cassette.py (removing the underscore prefix)
  2. Moving cassette inspection functionality into cassette.py as a public API

This concern was previously raised in an earlier review.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/recorded/inspect_cassette.py` at line 25, This module imports private
functions _decode_body_json, _decode_body_text, and _stream_payloads_from_body
from cassette, which couples tests/recorded/inspect_cassette.py to cassette's
internals; either make those functions public by renaming to decode_body_json,
decode_body_text, and stream_payloads_from_body in cassette (and update
callers), or move the cassette inspection utilities into a new public API in
cassette (expose functions with those public names) and update inspect_cassette
to import the public symbols instead; update any references to the old
underscored names to use the new public function names (decode_body_json,
decode_body_text, stream_payloads_from_body) so the dependency is on the public
API rather than private internals.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tests/recorded/cassette.py`:
- Around line 123-147: The SSE helpers _sse_body_payloads and _sse_payloads_body
must only attempt a structured parse/rehydration when the input exactly matches
the simple single-line-per-event "data: <json>" format; update
_sse_body_payloads to validate the entire stream (reject any lines starting with
"event:", "id:", ":" comment lines, blank lines between events, or multi-line
data: blocks) and return None to force raw-body storage if it isn't a strict
single-line JSON-per-data event stream, and update _sse_payloads_body to only
rehydrate when given payloads that were produced by that strict parser
(otherwise avoid emitting a transformed body); apply the same strict-format
guard to the other SSE helpers referenced in the diff so replay always falls
back to raw-body unless a lossless round-trip is guaranteed.
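
One way to read the "lossless round-trip" requirement above is a parser that returns payloads only when re-serializing them reproduces the original body exactly, and `None` otherwise. The sketch below is a hedged illustration, not the code in cassette.py: the names `strict_sse_payloads` and `_render`, the `[DONE]` sentinel handling, and the canonical form (`data: <compact JSON>` followed by one blank line per event) are assumptions.

```python
import json
from typing import Any, Optional

DONE = "[DONE]"  # sentinel some providers send as the last event (assumption)


def _render(payloads: list[Any]) -> str:
    """Serialize payloads in the assumed canonical form: one data line plus a blank line."""
    parts = []
    for payload in payloads:
        data = DONE if payload == DONE else json.dumps(payload, separators=(",", ":"), ensure_ascii=False)
        parts.append(f"data: {data}\n\n")
    return "".join(parts)


def strict_sse_payloads(body: str) -> Optional[list[Any]]:
    """Return parsed payloads only if the stream round-trips byte for byte, else None."""
    payloads: list[Any] = []
    for line in body.split("\n"):
        if line == "":
            continue  # separators are validated by the round-trip comparison below
        if not line.startswith("data: "):
            return None  # event:, id:, ':' comment lines and anything else
        data = line[len("data: "):]
        if data == DONE:
            payloads.append(DONE)
            continue
        try:
            payloads.append(json.loads(data))
        except json.JSONDecodeError:
            return None
    # Multi-line data blocks, odd spacing, or missing separators fail this check.
    return payloads if _render(payloads) == body else None
```

A `None` result is the signal to store the raw body untouched, so any stream the sketch cannot reproduce exactly is never rewritten.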

In `@tests/recorded/conftest.py`:
- Around line 249-255: The current logic only scrubs when
_decode_json_body(request.body) returns a dict, leaving non-object JSON (lists,
strings, numbers, booleans) un-scrubbed; change the branch so that whenever
_decode_json_body succeeds (returns any JSON value), you call
_scrub_request_json(data) and then set request.body =
_encode_body_like(request.body, <scrubbed>), i.e. remove the isinstance(data,
dict) guard and always re-encode the scrubbed result (still keeping the existing
except block for decode errors) so batch payloads like lists get scrubbed too.
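
The point about non-object JSON can be illustrated with a scrubber that recurses through any JSON value rather than assuming a top-level dict. This is a sketch under stated assumptions: `scrub_json` and the single `sk-` pattern are hypothetical stand-ins, since the real `_scrub_request_json` and `SECRET_PATTERNS` are not shown on this page.

```python
import re
from typing import Any

# Hypothetical pattern for illustration only; the real SECRET_PATTERNS
# live in tests/recorded/sanitization.py and are not reproduced here.
_SECRET = re.compile(r"sk-[A-Za-z0-9_-]{8,}")


def scrub_json(value: Any) -> Any:
    """Redact secrets in any decoded JSON value: object, array, or scalar."""
    if isinstance(value, str):
        return _SECRET.sub("REDACTED", value)
    if isinstance(value, list):
        return [scrub_json(item) for item in value]
    if isinstance(value, dict):
        return {key: scrub_json(item) for key, item in value.items()}
    # Numbers, booleans, and None carry no secrets and pass through unchanged.
    return value
```

Because the function accepts every JSON type, the caller no longer needs an `isinstance(data, dict)` guard before re-encoding the body.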

In `@tests/recorded/normalization.py`:
- Line 82: The code sets text = chunk.get("text") if it's a string but falls
back to chunk.get("content") without validation; update the assignment so the
fallback is only used when chunk.get("content") is also a string (e.g., text =
chunk.get("text") if isinstance(chunk.get("text"), str) else
chunk.get("content") if isinstance(chunk.get("content"), str) else None or ""),
ensuring both chunk.get("text") and chunk.get("content") are type-checked before
assigning to text.
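
A compact way to express the type-checked fallback described above is to try each key in order and accept only string values. The helper name `chunk_text` is hypothetical; the surrounding code in normalization.py is not shown here.

```python
from typing import Any, Mapping, Optional


def chunk_text(chunk: Mapping[str, Any]) -> Optional[str]:
    """Return the first string found under "text" or "content", else None."""
    for key in ("text", "content"):
        value = chunk.get(key)
        if isinstance(value, str):
            return value
    return None
```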

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: c85977e8-5813-462b-90cc-5fc52b1fe6d4

📥 Commits

Reviewing files that changed from the base of the PR and between 7285f2c and c0ff0c7.

⛔ Files ignored due to path filters (1)
  • poetry.lock is excluded by !**/*.lock
📒 Files selected for processing (20)
  • Makefile
  • pyproject.toml
  • pytest.ini
  • tests/recorded/README.md
  • tests/recorded/__init__.py
  • tests/recorded/assertions.py
  • tests/recorded/cassette.py
  • tests/recorded/conftest.py
  • tests/recorded/fake_cassettes.py
  • tests/recorded/inspect_cassette.py
  • tests/recorded/normalization.py
  • tests/recorded/rails/__init__.py
  • tests/recorded/rails/conftest.py
  • tests/recorded/rails_config.py
  • tests/recorded/sanitization.py
  • tests/recorded/snapshots.py
  • tests/recorded/test_cassette_sanitization.py
  • tests/recorded/test_fake_cassettes.py
  • tests/recorded/test_inspect_cassette.py
  • tests/recorded/utils.py

Comment thread tests/recorded/cassette.py
Comment thread tests/recorded/conftest.py
Comment thread tests/recorded/normalization.py Outdated
@greptile-apps

greptile-apps Bot commented Jun 11, 2026

Contributor

Greptile Summary

This PR introduces the recorded-test replay harness: cassette serialization with readable parsed_body blocks, header/secret sanitization, VCR configuration, pytest fixtures for API-key management, and helpers for asserting cassette content — along with self-tests for all of these components.

  • Cassette I/O (cassette.py): lru_cache + deepcopy for mutation-safe interactions, isinstance guard on yaml.safe_load, or [] on interactions, and a defensive try/except in _stream_payloads — previously flagged regressions are all resolved.
  • Sanitization (conftest.py, sanitization.py): before_record_request/before_record_response scrub secrets, volatile headers, and SSE events before VCR writes cassettes; patterns cover OpenAI keys, NVIDIA keys, bearer tokens, and org/project IDs.
  • Makefile target: three-step record-tests workflow (record → snapshot-fill → verify) with correct fake_cassette exclusion from the record step only.
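
The "lru_cache + deepcopy" pattern named in the first bullet can be sketched as below. It is an illustration under assumptions, not the cassette.py implementation: the function names are hypothetical, and `json.loads` stands in for `yaml.safe_load` so the example needs only the standard library.

```python
import copy
import json
from functools import lru_cache
from pathlib import Path
from typing import Any


@lru_cache(maxsize=None)
def _load_interactions(path: str) -> list[dict[str, Any]]:
    """Parse the file once per path; the cached list is never handed out directly."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    # Guard against files whose top-level value is not a mapping (e.g. null).
    if not isinstance(data, dict):
        return []
    # "or []" covers both a missing key and an explicit null value.
    return data.get("interactions") or []


def cassette_interactions(path: str) -> list[dict[str, Any]]:
    """Return a private deep copy so callers cannot mutate the cached data."""
    return copy.deepcopy(_load_interactions(path))
```

The cache keeps repeated reads cheap, while the deep copy means a test that edits an interaction cannot leak that change into later tests.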

Confidence Score: 5/5

Safe to merge — all changes are test infrastructure additions with no impact on production code paths.

The harness is self-contained test infrastructure. The two findings are both in assertion helpers used during test execution: unguarded json.loads calls that would surface as JSONDecodeError instead of a clean AssertionError on a malformed error chunk. No production code is touched and the cassette sanitization logic is well-tested by the new harness self-tests.

tests/recorded/normalization.py and tests/recorded/assertions.py — the unguarded json.loads calls on error-prefix chunks.
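
A guarded parse along the following lines would turn a malformed error chunk into a readable assertion failure. This is a sketch only: `parse_error_chunk` is a hypothetical helper, and the prefix is passed in as a parameter because the actual error prefix used by the harness is not shown here.

```python
import json
from typing import Any


def parse_error_chunk(chunk: str, prefix: str) -> dict[str, Any]:
    """Decode the JSON after `prefix`, failing with AssertionError on bad input."""
    assert chunk.startswith(prefix), f"chunk does not start with {prefix!r}: {chunk!r}"
    try:
        payload = json.loads(chunk[len(prefix):])
    except json.JSONDecodeError as exc:
        # Re-raise as an assertion so pytest reports a test failure, not an error.
        raise AssertionError(f"malformed error chunk: {chunk!r}") from exc
    assert isinstance(payload, dict), f"error payload is not an object: {payload!r}"
    return payload
```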

Important Files Changed

| Filename | Overview |
|---|---|
| tests/recorded/cassette.py | Core cassette helpers: deepcopy mutation-safety, isinstance guard on yaml.safe_load, or-[] on interactions, defensive try/except in _stream_payloads. All previously flagged issues are addressed. |
| tests/recorded/conftest.py | Fixtures, VCR config, scrubbing hooks: logic is sound; the autouse asyncio.run() teardown is function-scoped and enforced by monkeypatch dependency. |
| tests/recorded/sanitization.py | Sanitization constants and regex patterns look comprehensive; SECRET_PATTERNS cover OpenAI keys, NVIDIA keys, bearer tokens, org/project IDs. |
| tests/recorded/utils.py | set_api_key_for_record_mode correctly sets the real env key in record mode but always returns dummy_value, so fixture consumers that use the return value directly receive the wrong key (previously flagged). |
| tests/recorded/normalization.py | normalize_stream_chunks calls json.loads without try/except on error-prefix strings; a malformed error chunk raises JSONDecodeError instead of a clean assertion failure. |
| tests/recorded/assertions.py | assert_blocked_stream_error uses an unguarded json.loads list comprehension on error-prefix chunks; same class of issue as normalization.py. |
| Makefile | record-tests target is well-structured: record → snapshot-fill → verify; fake_cassette exclusion is correctly applied only to the record step. |
| tests/recorded/rails_config.py | RailsConfigSource descriptor and lru_cache-backed loader with model_copy deep copy look correct. |
| tests/recorded/fake_cassettes.py | Fake-cassette header parsing and metadata validation are straightforward and correct. |
| tests/recorded/inspect_cassette.py | cassette_summary uses isinstance(data, dict) guard (from the previous fix); CLI entry point looks clean. |

Sequence Diagram

%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
    participant Dev as Developer
    participant Make as make record-tests
    participant VCR as pytest-recording (VCR)
    participant Provider as LLM Provider
    participant Cassette as cassette .yaml

    Dev->>Make: make record-tests
    Make->>VCR: "pytest --record-mode=all -m 'not fake_cassette'"
    VCR->>Provider: real HTTP request (scrubbed via before_record_request)
    Provider-->>VCR: real HTTP response
    VCR->>VCR: before_record_response (sanitize secrets, normalize SSE)
    VCR->>Cassette: write readable YAML (parsed_body blocks)

    Make->>VCR: "pytest --block-network --inline-snapshot=create"
    VCR->>Cassette: read + cassette_with_rehydrated_bodies
    VCR-->>VCR: replay recorded response
    VCR->>VCR: assertions (recorded_chat_response, cassette_request_jsons)

    Make->>VCR: pytest --block-network (verify snapshots)
    VCR->>Cassette: read + replay
    VCR-->>Dev: all tests pass

Reviews (8): Last reviewed commit: "test(recorded): make replay proxy-indepe..." | Re-trigger Greptile

Comment thread tests/recorded/cassette.py Outdated
Comment thread tests/recorded/cassette.py
Comment thread tests/recorded/inspect_cassette.py Outdated
Comment thread tests/recorded/conftest.py Outdated
@Pouyanpi Pouyanpi force-pushed the stack/recorded-tests-01-harness branch 2 times, most recently from b3c164e to f2aa414 Compare June 11, 2026 12:44
Comment thread tests/recorded/cassette.py Outdated
Comment thread tests/recorded/cassette.py Outdated
@Pouyanpi Pouyanpi force-pushed the stack/recorded-tests-01-harness branch from d44488d to 68db0d2 Compare June 15, 2026 12:00
Comment thread tests/recorded/utils.py Outdated
@Pouyanpi Pouyanpi force-pushed the stack/recorded-tests-01-harness branch from 6bf95dc to b6d7a5b Compare June 23, 2026 10:16

@tgasser-nv tgasser-nv left a comment

Collaborator

Looks good, just a few minor comments before merging

Comment thread tests/recorded/conftest.py Outdated
Comment thread pyproject.toml Outdated
Comment thread tests/recorded/conftest.py Outdated
Comment thread tests/recorded/rails/__init__.py
Comment thread tests/recorded/README.md Outdated
Pouyanpi added 10 commits June 26, 2026 09:12
Foundation for converging the recorded suite's cross-surface drift, consumed by the
public_api and library layers above:

- rails/helpers.py: shared build_rails() construction helper + async_chunks()
  (replaces the LLMRails(load_config(...)) boilerplate inlined per test, D11/F).
- assertions.py: assert_blocked_generation() asserts refusal + rail stop semantics,
  not just non-empty text (D6).

Replay under --block-network must not depend on ambient proxy env: a SOCKS
proxy makes httpx raise ImportError (missing socksio) on a cassette hit,
turning a deterministic replay into a shell-dependent error. Add an autouse
fixture that strips proxy vars during replay (record_mode == none) while
leaving them intact for recording.

Also fix the README 'Adding a test' snippet to include the imports it relies
on (LLMRails, load_config, suite-local snapshot, OPENAI_BASELINE_CONFIG) so a
new contributor can copy-paste it and land on the intended snapshot re-export.
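
The proxy-stripping rule in the commit message above (strip during replay, keep while recording) can be sketched as a pure function. The name `env_without_proxies` and the exact list of variables are assumptions; the commit message only says "proxy vars" and `record_mode == none`.

```python
from typing import Mapping

# Assumed set of proxy variables; names are matched case-insensitively
# because both HTTP_PROXY and http_proxy spellings are common.
PROXY_VARS = ("HTTP_PROXY", "HTTPS_PROXY", "ALL_PROXY")


def env_without_proxies(env: Mapping[str, str], record_mode: str) -> dict[str, str]:
    """Drop proxy variables for replay ("none"); leave them intact when recording."""
    if record_mode != "none":
        return dict(env)
    blocked = {name.upper() for name in PROXY_VARS}
    return {key: value for key, value in env.items() if key.upper() not in blocked}
```

Inside an autouse pytest fixture the same rule could be applied with `monkeypatch.delenv(name, raising=False)` for each variable, which restores the original environment after the test.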
@Pouyanpi Pouyanpi force-pushed the stack/recorded-tests-01-harness branch from e06a4fe to e522751 Compare June 26, 2026 07:18
@Pouyanpi Pouyanpi self-assigned this Jun 26, 2026
@Pouyanpi Pouyanpi added this to the v0.23.0 milestone Jun 26, 2026
@Pouyanpi Pouyanpi merged commit bed94c2 into develop Jun 26, 2026
14 checks passed
@Pouyanpi Pouyanpi deleted the stack/recorded-tests-01-harness branch June 26, 2026 07:24
3 participants