Skip to content

test: add unit coverage for eval, evaluate, server, and actions#2078

Draft
Pouyanpi wants to merge 2 commits into
developfrom
chore/coverage-server-tests
Draft

test: add unit coverage for eval, evaluate, server, and actions#2078
Pouyanpi wants to merge 2 commits into
developfrom
chore/coverage-server-tests

Conversation

@Pouyanpi

@Pouyanpi Pouyanpi commented Jun 26, 2026

Copy link
Copy Markdown
Collaborator

Description

Add focused unit tests across the eval and evaluate pipelines, the FastAPI server and chat app, the math and validation actions, and the v2_x generation actions.

Fix LLMJudgeComplianceChecker model initialization to use init_llm_model; the prior LLMRails.get_model_cls_and_kwargs path was removed in an earlier refactor.

Related Issue(s)

Verification

AI Assistance

  • No AI tools were used.
  • AI tools were used; a human reviewed and can explain every change (tool: ___).

Checklist

  • I've read the CONTRIBUTING guidelines.
  • This PR links to a triaged issue assigned to me.
  • My PR title follows the project commit convention.
  • I've updated the documentation if applicable.
  • I've added tests if applicable.
  • I've noted any verification beyond CI and any checks I couldn't run.
  • I did not update generated changelog files manually.
  • I addressed all CodeRabbit, Greptile, and other review comments, or replied with why no change is needed.
  • @mentions of the person or team responsible for reviewing proposed changes.

Summary by CodeRabbit

  • Bug Fixes

    • Improved evaluation and compliance workflows for more reliable results and clearer handling of edge cases.
    • Fixed authentication precedence so environment-provided API keys override forwarded request credentials when present.
    • Made server and streaming behavior more robust, including better config loading, caching, and fallback handling.
  • New Features

    • Added broader test coverage for evaluation, UI, server, and action flows to help prevent regressions.
    • Expanded support for model selection and judge initialization in evaluation workflows.

@Pouyanpi Pouyanpi added the status: triaged Triaged by a maintainer; eligible for automated review (CodeRabbit/Greptile). label Jun 26, 2026
@Pouyanpi Pouyanpi self-assigned this Jun 26, 2026
@github-actions github-actions Bot added status: needs triage New issues that have not yet been reviewed or categorized. size: XL and removed status: needs triage New issues that have not yet been reviewed or categorized. labels Jun 26, 2026
@Pouyanpi Pouyanpi added status: triaged Triaged by a maintainer; eligible for automated review (CodeRabbit/Greptile). and removed status: triaged Triaged by a maintainer; eligible for automated review (CodeRabbit/Greptile). labels Jun 26, 2026
@codecov

codecov Bot commented Jun 26, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

@greptile-apps

greptile-apps Bot commented Jun 26, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR adds broad unit test coverage across the eval/evaluate pipelines, FastAPI server, actions, and v2.x generation, and fixes a broken LLMJudgeComplianceChecker initialization path that referenced the removed LLMRails.get_model_cls_and_kwargs method.

  • nemoguardrails/eval/check.py: Replaces the removed LLMRails.get_model_cls_and_kwargs call with init_llm_model, adds a _prepare_model_kwargs helper, and improves the placeholder "main" model creation to use model_copy instead of a blank Model(engine=\"\", model=\"\").
  • 15 new or updated test files: Unit tests for compliance checking, eval CLI/runtime/UI, evaluate CLI/runtime classes/topical evaluation, server API/coverage/OpenAI integration, schema utils, Wolfram Alpha math action, filter_secrets validation, and v2.x generation actions; also tightens live-test guards in test_openai_integration.py to require LIVE_TEST_MODE in addition to an API key.

Confidence Score: 5/5

Safe to merge — the only production change is a well-motivated refactoring in check.py (replacing a removed API with the correct one), and all new code is test-only.

The production change in check.py correctly replaces the removed LLMRails.get_model_cls_and_kwargs with init_llm_model and improves the placeholder main-model creation. All other changes are new or updated unit tests. No broken logic, data loss, or auth regressions were found.

No files require special attention. The empty-string api_key edge case in _prepare_model_kwargs was flagged in a prior review cycle.

Important Files Changed

Filename Overview
nemoguardrails/eval/check.py Core production change: replaces removed LLMRails.get_model_cls_and_kwargs with init_llm_model; adds _prepare_model_kwargs helper; improves placeholder main-model creation via model_copy. The empty-string env var case in _prepare_model_kwargs was flagged in a prior review.
tests/eval/test_eval_check.py New tests covering LLMJudgeComplianceChecker init, print helpers, all compliance verdict branches (yes/no/n/a/invalid), reset/force flags, and the full run() orchestration. Well-structured with a shared new-based builder fixture.
tests/server/test_server_coverage.py New tests covering lifespan startup paths, _get_rails caching/validation/model-override, streaming formatter, auto-reload monitoring, chat-app message flow (success/unsupported/error), and Redis/DataStore base classes.
tests/server/test_api.py Adds test_list_models_prefers_openai_api_key_over_auth_header; fixes pre-existing tests to pin MAIN_MODEL_ENGINE=openai and OPENAI_API_KEY="" so ambient env keys cannot interfere with auth-forwarding assertions.
tests/server/test_openai_integration.py Guards all live-provider tests behind LIVE_TEST_MODE flag; refactors openai_client fixture into a factory; switches os.environ mutations to monkeypatch for proper cleanup.
tests/eval/test_eval_ui_utils.py New module-level plotly stub + cleanup fixture; tests EvalData helpers, compliance/resource dataframes, chart utils rendering, streamlit sidebar, README page, and run summary filtering. The module-level stub injection timing may interact with other test workers collecting this module alongside files that import plotly (noted in a prior outside-diff comment).
tests/v2_x/test_generation_actions_unit.py New unit tests for LLMGenerationActionsV2dotx: flow-index building, user-intent collection, embedding-only fallback, LLM-based intent/bot-action generation, passthrough, flow helpers, and generate_value/flow/continuation error paths.
tests/evaluate/test_evaluate_runtime_classes.py New tests for ModerationRailsEvaluation, HallucinationRailsEvaluation, and FactCheckEvaluation: init, LLM call flows, retry exhaustion, output writing, and negative-sample generation.
tests/eval/test_eval_cli.py New tests for eval CLI: run command validation, invocation of run_eval, check-compliance with explicit/discovered output paths, UI launch, and missing-streamlit exit.
tests/eval/test_eval_runtime.py New tests for eval runtime: interaction-output extraction, incremental output loading, span tracing, log extraction, span metric aggregation, compliance computation, dict helpers, and full run_eval end-to-end.
tests/evaluate/test_evaluate_cli_and_data.py New tests for evaluate CLI (topical/moderation/hallucination/fact-checking commands), Anthropic dataset processing, SimplifyFormatter, and DatasetConnector helpers including Banking77 and ChitChat connectors.
tests/evaluate/test_evaluate_topical.py New tests for TopicalRailsEvaluation: dataset loading, cosine similarity, sync_wrapper, test-set splitting, embeddings model init/compute, intent helpers, main-model detection, and full evaluate_topical_rails end-to-end.
tests/server/conftest.py New conftest autouse fixture that sets NEMO_GUARDRAILS_DISABLE_CHAT_UI=true for all server tests, preventing chat-UI imports from affecting test isolation.
tests/test_actions_math.py New tests for wolfram_alpha_request: missing query, missing APP_ID, successful 200 response, and non-200 error ActionResult.
tests/test_actions_validation.py Extends existing validation tests with contains_secrets coverage: detected secret, all-false result, and missing detect_secrets dependency error handling.
tests/server/test_schema_utils.py Adds test_fetch_env_auth_precedes_forwarded_auth; fixes test_fetch_auth_forwarded to set OPENAI_API_KEY="" so the env-key guard does not shadow the forwarded-token path in CI.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[LLMJudgeComplianceChecker.__init__] --> B[EvalConfig.from_path]
    B --> C{Find model_config\nmatching llm_judge_model}
    C -->|not found| D[exit 1]
    C -->|found| E[_prepare_model_kwargs\nparameters + api_key_env_var]
    E --> F[init_llm_model\nmodel_name, provider_name, mode, kwargs]
    F --> G[self.llm]
    A --> H{any model.type == main?}
    H -->|yes| I[use eval_config.models as-is]
    H -->|no| J[model_config.model_copy\nupdate type to main]
    J --> I
    I --> K[RailsConfig models=task_manager_models]
    K --> L[LLMTaskManager config]
    L --> M[self.llm_task_manager]
    G & M --> N[EvalData initialized\nready for compliance checks]
Loading
%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
flowchart TD
    A[LLMJudgeComplianceChecker.__init__] --> B[EvalConfig.from_path]
    B --> C{Find model_config\nmatching llm_judge_model}
    C -->|not found| D[exit 1]
    C -->|found| E[_prepare_model_kwargs\nparameters + api_key_env_var]
    E --> F[init_llm_model\nmodel_name, provider_name, mode, kwargs]
    F --> G[self.llm]
    A --> H{any model.type == main?}
    H -->|yes| I[use eval_config.models as-is]
    H -->|no| J[model_config.model_copy\nupdate type to main]
    J --> I
    I --> K[RailsConfig models=task_manager_models]
    K --> L[LLMTaskManager config]
    L --> M[self.llm_task_manager]
    G & M --> N[EvalData initialized\nready for compliance checks]
Loading

Reviews (3): Last reviewed commit: "apply review suggestions" | Re-trigger Greptile

Comment on lines +51 to +54
if model_config.api_key_env_var:
api_key = os.environ.get(model_config.api_key_env_var)
if api_key:
kwargs["api_key"] = api_key

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 When api_key_env_var is configured but the environment variable is set to an empty string (e.g. export JUDGE_API_KEY=""), the check if api_key: silently drops the key from kwargs. Providers that require an explicit API key will then fail at inference time with an opaque authentication error rather than a clear initialization error. Consider using is not None to distinguish "not set" from "explicitly empty".

Suggested change
if model_config.api_key_env_var:
api_key = os.environ.get(model_config.api_key_env_var)
if api_key:
kwargs["api_key"] = api_key
if model_config.api_key_env_var:
api_key = os.environ.get(model_config.api_key_env_var)
if api_key is not None:
kwargs["api_key"] = api_key
Prompt To Fix With AI
This is a comment left during a code review.
Path: nemoguardrails/eval/check.py
Line: 51-54

Comment:
When `api_key_env_var` is configured but the environment variable is set to an empty string (e.g. `export JUDGE_API_KEY=""`), the check `if api_key:` silently drops the key from `kwargs`. Providers that require an explicit API key will then fail at inference time with an opaque authentication error rather than a clear initialization error. Consider using `is not None` to distinguish "not set" from "explicitly empty".

```suggestion
    if model_config.api_key_env_var:
        api_key = os.environ.get(model_config.api_key_env_var)
        if api_key is not None:
            kwargs["api_key"] = api_key
```

How can I resolve this? If you propose a fix, please make it concise.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@greptileai this is resolved in the latest commit. Do you confirm?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The issue has not been resolved. The current code on the PR head still reads:

if model_config.api_key_env_var:
    api_key = os.environ.get(model_config.api_key_env_var)
    if api_key:
        kwargs["api_key"] = api_key

The inner if api_key: still silently drops the key when the environment variable is set to an empty string. The fix would be:

Suggested change
if model_config.api_key_env_var:
api_key = os.environ.get(model_config.api_key_env_var)
if api_key:
kwargs["api_key"] = api_key
if model_config.api_key_env_var:
api_key = os.environ.get(model_config.api_key_env_var)
if api_key is not None:
kwargs["api_key"] = api_key

@coderabbitai

coderabbitai Bot commented Jun 26, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

📝 Walkthrough

Walkthrough

The judge checker now initializes through init_llm_model with kwargs derived from model parameters and env-backed API keys. The PR also adds tests across eval, evaluate, server, actions, and v2 generation flows for CLI, runtime, auth, and parsing behavior.

Changes

Eval checker and UI coverage

Layer / File(s) Summary
Judge model bootstrap
nemoguardrails/eval/check.py, tests/eval/test_eval_check.py
LLMJudgeComplianceChecker now initializes its judge model through init_llm_model, and the init test asserts the updated task-manager model setup and env-backed API key handling.
Compliance checks
tests/eval/test_eval_check.py
Compliance parsing, policy inclusion handling, force/reset behavior, invalid output handling, and progress printing are covered.
CLI entrypoints
tests/eval/test_eval_cli.py
The eval run, check-compliance, and ui commands are tested for argument parsing, output-path discovery, and Streamlit launch failure handling.
Runtime metrics and tables
tests/eval/test_eval_runtime.py, tests/eval/test_eval_ui_utils.py
Eval output extraction, span/log aggregation, dict/file utilities, chart creation, and run_eval integration are exercised with mocked data.
UI rendering and dashboard behavior
tests/eval/test_eval_ui_utils.py
The Streamlit dashboard helpers build compliance and resource views, render charts and tables, compute span colors, reload eval data, and filter summary/sidebar selections.

Evaluate package command coverage

Layer / File(s) Summary
CLI wiring
tests/evaluate/test_evaluate_cli_and_data.py
The topical, moderation, hallucination, and fact-checking CLI commands are wired to the expected evaluator constructors and execution methods.
Dataset connectors
tests/evaluate/test_evaluate_cli_and_data.py
Log formatting, Anthropic dataset processing, and the Banking77 and ChitChat connector helpers load, transform, and write dataset content.
Topical evaluation flow
tests/evaluate/test_evaluate_topical.py
Dataset loading, seeded splits, embedding initialization, intent similarity, and topical result generation are tested.
Runtime evaluator classes
tests/evaluate/test_evaluate_runtime_classes.py
Moderation, hallucination, and fact-check evaluator classes are initialized and exercised through retry logic, scoring, and output writing.

Server auth and lifecycle coverage

Layer / File(s) Summary
Auth header precedence
tests/server/test_api.py, tests/server/test_schema_utils.py
Model-list forwarding is tested for plain request headers and for environment-provided OPENAI_API_KEY precedence.
Live model-list integration setup
tests/server/test_openai_integration.py
Live model-list tests gate execution on LIVE_TEST_MODE, build clients through TestClient(api.app), and cover OpenAI, Anthropic, Cohere, and Azure variants.
Server lifecycle and storage coverage
tests/server/conftest.py, tests/server/test_server_coverage.py
Server lifespan, auto-reload, config loading, chat app behavior, cache invalidation, and RedisStore/datastore helpers are covered with new fixtures and stubs.

Action helper coverage

Layer / File(s) Summary
Wolfram Alpha action flow
tests/test_actions_math.py
Wolfram Alpha action tests cover missing query, missing app ID, success response parsing, and non-200 failures.
Secret detection validation
tests/test_actions_validation.py
contains_secrets tests cover positive and negative detector results and the missing detect_secrets error path.

V2 generation actions coverage

Layer / File(s) Summary
Flow index and helpers
tests/v2_x/test_generation_actions_unit.py
The test module adds fake index and task-manager helpers, a runtime flow config helper, and the shared actions fixture used across the suite.
User-intent generation and passthrough
tests/v2_x/test_generation_actions_unit.py
User-intent collection, embedding-only behavior, LLM fallback parsing, and passthrough action validation are covered.
Flow matching and continuation
tests/v2_x/test_generation_actions_unit.py
Flow existence checks, instruction and name-based flow generation, and continuation handling are covered.
Flow assembly and value parsing
tests/v2_x/test_generation_actions_unit.py
Flow creation, value parsing, and end-to-end generate_flow output assembly are tested, including missing-input error cases.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested reviewers

  • cparisien
🚥 Pre-merge checks | ✅ 4 | ❌ 2

❌ Failed checks (2 warnings)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 6.31% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Test Results For Major Changes ⚠️ Warning The PR is a large test/refactor change, but the description’s Verification section is blank and includes no test results or benchmark info. Add concrete verification details to the PR description: commands run, outcomes, and any relevant performance/numerics checks if applicable.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly summarizes the main change by describing added unit test coverage across eval, evaluate, server, and actions.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch chore/coverage-server-tests

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 8

🧹 Nitpick comments (4)
tests/eval/test_eval_check.py (1)

81-87: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Exercise the real RailsConfig/LLMTaskManager construction path.

This test mocks the exact contract changed in LLMJudgeComplianceChecker.__init__, so a broken RailsConfig(models=..., prompts=...) shape would still pass. Since init_llm_model is already patched, using the real config/task-manager path should stay offline.

Proposed test tightening
-    task_manager = MagicMock()
-
     with (
         patch("nemoguardrails.eval.check.EvalConfig.from_path", return_value=eval_config) as mock_from_path,
         patch("nemoguardrails.eval.check.init_llm_model", return_value="llm") as mock_init_llm_model,
-        patch(
-            "nemoguardrails.eval.check.RailsConfig", side_effect=lambda **kwargs: SimpleNamespace(**kwargs)
-        ) as mock_config,
-        patch("nemoguardrails.eval.check.LLMTaskManager", return_value=task_manager) as mock_task_manager,
     ):
         checker = LLMJudgeComplianceChecker(
             eval_config_path=str(tmp_path),
@@
-    task_manager_models = mock_config.call_args.kwargs["models"]
+    task_manager_models = checker.llm_task_manager.config.models
     assert task_manager_models[0] == eval_config.models[0]
     assert task_manager_models[1].type == "main"
     assert task_manager_models[1].model == "judge-model"
-    mock_task_manager.assert_called_once()
+    assert checker.llm_task_manager.config.prompts == eval_config.prompts

As per coding guidelines, “For maintainer-approved refactors, add or update characterization tests before changing subtle behavior.”

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/eval/test_eval_check.py` around lines 81 - 87, The test is over-mocking
the construction path used by LLMJudgeComplianceChecker.__init__, so it won’t
catch regressions in how RailsConfig and LLMTaskManager are actually built.
Update the test to use the real RailsConfig and LLMTaskManager creation path
while keeping init_llm_model patched, and keep only the external dependencies
mocked so the test still runs offline. This will validate the changed
models/prompts contract through the real constructor flow rather than a
SimpleNamespace stand-in.

Source: Coding guidelines

tests/evaluate/test_evaluate_topical.py (2)

48-52: 🎯 Functional Correctness | 🔵 Trivial | ⚡ Quick win

Cover the asyncio.run(...) fallback explicitly.

This only validates whichever branch asyncio.get_event_loop() takes under the current test runner. sync_wrapper has separate error-handling behavior when get_event_loop() raises RuntimeError, so add a monkeypatched case for that branch to keep the fallback from regressing unnoticed.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/evaluate/test_evaluate_topical.py` around lines 48 - 52, The current
test only covers the normal event-loop path in sync_wrapper and misses the
asyncio.run fallback when asyncio.get_event_loop() raises RuntimeError. Update
test_sync_wrapper_runs_async_function to add a monkeypatched case that forces
get_event_loop() to fail, then verify sync_wrapper(add) still returns the
expected result through the fallback branch so that path stays covered.

55-69: 🎯 Functional Correctness | 🔵 Trivial | ⚡ Quick win

Add a random_seed=0 characterization case.

_split_test_set_from_config currently checks if random_seed:, so seed 0 skips the deterministic shuffle path. This test covers seeded behavior, but it misses the one boundary value that flips the branch and would let that regression pass silently. As per coding guidelines, "For maintainer-approved refactors, add or update characterization tests before changing subtle behavior; when the existing suite does not cover the refactored code, keep the equivalence check used to prove behavior is unchanged."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/evaluate/test_evaluate_topical.py` around lines 55 - 69, The
`_split_test_set_from_config` test currently covers a nonzero seed but misses
the `random_seed=0` branch boundary where `if random_seed:` skips deterministic
shuffling. Update the characterization tests in
`test_split_test_set_from_config_uses_seed_and_limits_remaining_samples` (or add
a sibling case) to exercise `random_seed=0` and verify the split remains
deterministic and equivalent to the seeded path, using
`_split_test_set_from_config`, `config.user_messages`, and `test_set` to assert
the same invariants.

Source: Coding guidelines

tests/test_actions_math.py (1)

67-73: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Assert the utterance event on both failure paths.

These tests only verify the BotIntent sequence. nemoguardrails/actions/math.py:49-68 also defines a StartUtteranceBotAction.script for both fallback branches, so a regression in the user-facing message would still pass here. Add one assertion for that event in each test.

Also applies to: 97-102

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/test_actions_math.py` around lines 67 - 73, The math action tests only
assert the BotIntent sequence, so they miss regressions in the user-facing
fallback message. In the relevant test cases for the failure paths, also assert
the StartUtteranceBotAction.script event produced by the math action so both
branches are covered. Use the existing result.events checks in
tests/test_actions_math.py and match the utterance event emitted by the math
action logic in nemoguardrails/actions/math.py.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tests/eval/test_eval_ui_utils.py`:
- Around line 33-56: The Plotly stub cleanup only removes plotly modules, but
ui_common and chart_utils also stay cached after importing with the stubbed
plotly.express, making later tests order-dependent. Update the
_cleanup_plotly_stub fixture in test_eval_ui_utils.py to also remove the cached
nemoguardrails.eval.ui.common and nemoguardrails.eval.ui.chart_utils modules
during teardown, alongside the existing plotly modules, so future imports rebind
the real dependencies.

In `@tests/evaluate/test_evaluate_runtime_classes.py`:
- Around line 390-404: The fact-check runtime test currently does not verify
that the value returned by create_negative_samples is actually used by run()
before check_facts is called. Update
test_factcheck_run_creates_negative_samples_when_enabled to have
create_negative_samples return a distinct sentinel dataset and assert that the
evaluator’s dataset, or the argument consumed by check_facts, reflects that
returned value. Keep the existing assertions around create_negative_samples and
check_facts call count so the test proves the coroutine result is assigned back
into the flow.
- Around line 88-99: The retry behavior in get_jailbreak_results is not being
verified, since the test only checks the final error state and would still pass
if retries stop early. Update the test around _moderation_evaluator and
nemoguardrails.evaluate.evaluate_moderation.llm_call to capture the mock and
assert it is awaited three times, so the max_tries loop is actually exercised.
Also apply the same characterization check in the hallucination test by
asserting two calls to evaluator.llm, using the existing evaluator methods and
mocks to confirm the retry path really exhausts the configured attempts.

In `@tests/server/conftest.py`:
- Around line 21-24: The autouse fixture `_disable_chat_ui` should force the
test environment into the isolated path instead of preserving any pre-existing
value. Update the fixture so it always sets `NEMO_GUARDRAILS_DISABLE_CHAT_UI` to
`"true"` via `monkeypatch.setenv`, regardless of whether the variable is already
present in `os.environ`, to keep the server tests hermetic.

In `@tests/server/test_api.py`:
- Line 1031: Pin the provider environment in the `/v1/models` route tests by
setting MAIN_MODEL_ENGINE to openai in the patch.dict blocks used around the
affected assertions. The tests currently rely on whatever MAIN_MODEL_ENGINE was
left by prior tests, so update the env setup in the relevant test cases in
test_api.py to control that value alongside MAIN_MODEL_BASE_URL and
OPENAI_API_KEY. Use the existing route test setup around fetch_models and the
/v1/models auth assertions as the place to make the change.

In `@tests/server/test_openai_integration.py`:
- Around line 328-329: The live OpenAI integration tests are mutating global
environment state by assigning MAIN_MODEL_ENGINE directly, which can leak the
provider selection into later tests. Update each affected test block in
test_openai_integration.py to use monkeypatch.setenv(...) or patch.dict(...)
instead of bare os.environ assignments, and ensure the environment is
automatically restored after each test; apply the same fix at all mentioned
MAIN_MODEL_ENGINE writes.

In `@tests/v2_x/test_generation_actions_unit.py`:
- Around line 207-221: Tighten the negative-path fixtures in
generate_user_intent_and_bot_action tests so each assertion isolates the
intended failure mode: use a syntactically valid LLM payload for the call that
should fail, then vary only the specific extractor or field being exercised.
Update the affected test cases around patch_llm_call, get_first_bot_action, and
the LlmResponseError assertions so failures come from bot-action
extraction/parsing behavior rather than an already malformed response.
- Around line 361-362: The test for generate_value currently catches base
Exception, which is too broad and can hide unrelated failures. Update this
assertion to expect the specific exception type raised by actions.generate_value
for invalid LLM responses, using the generate_value() path and the
invalid-response contract message to locate the correct exception class.

---

Nitpick comments:
In `@tests/eval/test_eval_check.py`:
- Around line 81-87: The test is over-mocking the construction path used by
LLMJudgeComplianceChecker.__init__, so it won’t catch regressions in how
RailsConfig and LLMTaskManager are actually built. Update the test to use the
real RailsConfig and LLMTaskManager creation path while keeping init_llm_model
patched, and keep only the external dependencies mocked so the test still runs
offline. This will validate the changed models/prompts contract through the real
constructor flow rather than a SimpleNamespace stand-in.

In `@tests/evaluate/test_evaluate_topical.py`:
- Around line 48-52: The current test only covers the normal event-loop path in
sync_wrapper and misses the asyncio.run fallback when asyncio.get_event_loop()
raises RuntimeError. Update test_sync_wrapper_runs_async_function to add a
monkeypatched case that forces get_event_loop() to fail, then verify
sync_wrapper(add) still returns the expected result through the fallback branch
so that path stays covered.
- Around line 55-69: The `_split_test_set_from_config` test currently covers a
nonzero seed but misses the `random_seed=0` branch boundary where `if
random_seed:` skips deterministic shuffling. Update the characterization tests
in `test_split_test_set_from_config_uses_seed_and_limits_remaining_samples` (or
add a sibling case) to exercise `random_seed=0` and verify the split remains
deterministic and equivalent to the seeded path, using
`_split_test_set_from_config`, `config.user_messages`, and `test_set` to assert
the same invariants.

In `@tests/test_actions_math.py`:
- Around line 67-73: The math action tests only assert the BotIntent sequence,
so they miss regressions in the user-facing fallback message. In the relevant
test cases for the failure paths, also assert the StartUtteranceBotAction.script
event produced by the math action so both branches are covered. Use the existing
result.events checks in tests/test_actions_math.py and match the utterance event
emitted by the math action logic in nemoguardrails/actions/math.py.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: cd8771b7-17e6-4add-a105-1390e279e63d

📥 Commits

Reviewing files that changed from the base of the PR and between 15f87ee and 0458b87.

📒 Files selected for processing (16)
  • nemoguardrails/eval/check.py
  • tests/eval/test_eval_check.py
  • tests/eval/test_eval_cli.py
  • tests/eval/test_eval_runtime.py
  • tests/eval/test_eval_ui_utils.py
  • tests/evaluate/test_evaluate_cli_and_data.py
  • tests/evaluate/test_evaluate_runtime_classes.py
  • tests/evaluate/test_evaluate_topical.py
  • tests/server/conftest.py
  • tests/server/test_api.py
  • tests/server/test_openai_integration.py
  • tests/server/test_schema_utils.py
  • tests/server/test_server_coverage.py
  • tests/test_actions_math.py
  • tests/test_actions_validation.py
  • tests/v2_x/test_generation_actions_unit.py

Comment on lines +33 to +56
# The eval UI modules import `plotly.express` at module load time, so a stub must
# be present before the imports below. We only inject it when plotly is absent and
# remove our own stubs after this module's tests so they do not leak into other
# files sharing the same worker.
_injected_plotly_modules = {}
for _module_name in ("plotly", "plotly.express"):
if _module_name not in sys.modules:
_injected_plotly_modules[_module_name] = types.ModuleType(_module_name)
sys.modules[_module_name] = _injected_plotly_modules[_module_name]


@pytest.fixture(scope="module", autouse=True)
def _cleanup_plotly_stub():
yield
for _module_name, stub_module in _injected_plotly_modules.items():
if sys.modules.get(_module_name) is stub_module:
del sys.modules[_module_name]


ui_common = importlib.import_module("nemoguardrails.eval.ui.common")
ui_utils = importlib.import_module("nemoguardrails.eval.ui.utils")
chart_utils = importlib.import_module("nemoguardrails.eval.ui.chart_utils")
streamlit_utils = importlib.import_module("nemoguardrails.eval.ui.streamlit_utils")
readme_page = importlib.import_module("nemoguardrails.eval.ui.README")

@coderabbitai coderabbitai Bot Jun 26, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

printf '\n== test file ==\n'
cat -n tests/eval/test_eval_ui_utils.py | sed -n '1,120p'

printf '\n== related UI modules/tests ==\n'
rg -n "nemoguardrails\.eval\.ui\.(common|utils|chart_utils|streamlit_utils|README)|plotly\.express|import plotly" tests nemoguardrails/eval/ui -g '!**/__pycache__/**'

Repository: NVIDIA-NeMo/Guardrails

Length of output: 7216


🏁 Script executed:

#!/bin/bash
set -euo pipefail

printf '\n== file list ==\n'
git ls-files | rg '(^|/)(tests/eval/test_eval_ui_utils\.py|nemoguardrails/eval/ui/|tests/eval/|tests/).*\.py$'

printf '\n== module outlines ==\n'
for f in nemoguardrails/eval/ui/common.py nemoguardrails/eval/ui/utils.py nemoguardrails/eval/ui/chart_utils.py nemoguardrails/eval/ui/streamlit_utils.py nemoguardrails/eval/ui/README.py; do
  if [ -f "$f" ]; then
    echo "--- $f ---"
    wc -l "$f"
    sed -n '1,120p' "$f"
  fi
done

Repository: NVIDIA-NeMo/Guardrails

Length of output: 30699


🏁 Script executed:

#!/bin/bash
set -euo pipefail

printf '\n== test file around stub/import block ==\n'
sed -n '1,120p' tests/eval/test_eval_ui_utils.py | cat -n

printf '\n== references in tests ==\n'
rg -n "nemoguardrails\.eval\.ui\.(common|utils|chart_utils|streamlit_utils|README)|plotly\.express|import plotly" tests -g '!**/__pycache__/**'

printf '\n== references in ui package ==\n'
rg -n "plotly\.express|import plotly|from plotly" nemoguardrails/eval/ui -g '!**/__pycache__/**'

Repository: NVIDIA-NeMo/Guardrails

Length of output: 6259


🏁 Script executed:

#!/bin/bash
set -euo pipefail

printf '\n== plotly imports in UI modules ==\n'
for f in nemoguardrails/eval/ui/common.py nemoguardrails/eval/ui/utils.py nemoguardrails/eval/ui/chart_utils.py nemoguardrails/eval/ui/streamlit_utils.py nemoguardrails/eval/ui/README.py; do
  if [ -f "$f" ]; then
    echo "--- $f ---"
    rg -n "plotly|px" "$f"
  fi
done

printf '\n== other tests importing these modules ==\n'
rg -n "nemoguardrails\.eval\.ui\.(common|utils|chart_utils|streamlit_utils|README)" tests -g '!tests/eval/test_eval_ui_utils.py'

Repository: NVIDIA-NeMo/Guardrails

Length of output: 237


Clean up the cached UI modules that import Plotly. nemoguardrails.eval.ui.chart_utils binds the stubbed plotly.express at import time, and nemoguardrails.eval.ui.common keeps those helpers cached. Removing only plotly* leaves later imports of common/chart_utils order-dependent; pop both from sys.modules in teardown.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/eval/test_eval_ui_utils.py` around lines 33 - 56, The Plotly stub
cleanup only removes plotly modules, but ui_common and chart_utils also stay
cached after importing with the stubbed plotly.express, making later tests
order-dependent. Update the _cleanup_plotly_stub fixture in
test_eval_ui_utils.py to also remove the cached nemoguardrails.eval.ui.common
and nemoguardrails.eval.ui.chart_utils modules during teardown, alongside the
existing plotly modules, so future imports rebind the real dependencies.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this resolved now?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

printf '\n== current cleanup fixture ==\n'
sed -n '1,90p' tests/eval/test_eval_ui_utils.py | cat -n

printf '\n== exact sys.modules cleanup lines ==\n'
rg -n -C2 'sys\.modules|_cleanup_plotly_stub|nemoguardrails\.eval\.ui\.(common|chart_utils)' tests/eval/test_eval_ui_utils.py

printf '\n== any explicit module cache cleanup for UI modules ==\n'
python - <<'PY'
from pathlib import Path
text = Path("tests/eval/test_eval_ui_utils.py").read_text()
targets = [
    "nemoguardrails.eval.ui.common",
    "nemoguardrails.eval.ui.chart_utils",
    "nemoguardrails.eval.ui.utils",
    "nemoguardrails.eval.ui.streamlit_utils",
    "nemoguardrails.eval.ui.README",
]
for t in targets:
    print(f"{t}: {'YES' if t in text else 'NO'}")
PY

Length of output: 6754



@Pouyanpi Yes — this looks resolved now.

tests/eval/test_eval_ui_utils.py no longer only removes the plotly* stubs. In _cleanup_plotly_stub(), Line 57 through Line 59 also pop:

  • nemoguardrails.eval.ui.common
  • nemoguardrails.eval.ui.chart_utils

That addresses the order-dependent caching issue I called out, since later imports can now rebind against the real plotly.express when it exists.


🧠 Learnings used
Learnt from: Pouyanpi
Repo: NVIDIA-NeMo/Guardrails PR: 1814
File: tests/integrations/langchain/test_openai_param_filter.py:51-52
Timestamp: 2026-04-23T09:43:00.117Z
Learning: In NVIDIA-NeMo/Guardrails tests, recognize that env-based feature flags may be intentionally implemented using the truthiness pattern `os.environ.get("VAR_NAME") or os.environ.get("OTHER_VAR")` (where the env var values are treated as non-empty strings, e.g., `LIVE_TEST_MODE`, `TEST_LIVE_MODE`). During code review, avoid requesting per-file boolean normalization changes for this pattern; any correction should be handled as a repository-wide cleanup rather than a localized refactor in a single test file.

Learnt from: Pouyanpi
Repo: NVIDIA-NeMo/Guardrails PR: 1970
File: tests/v2_x/test_state_serialization.py:144-144
Timestamp: 2026-06-04T08:31:21.557Z
Learning: In the NVIDIA-NeMo/Guardrails repo, do not include wall-clock/performance timing assertions as part of the unit-test suite (e.g., tests that measure execution time and fail based on a threshold). Such tests are inherently flaky on shared CI and provide little/none functional coverage. During review, require that any existing wall-clock performance test be permanently removed or kept skipped (e.g., `pytest.mark.skip(reason="Flaky wall-clock performance assertion.")`). Do not suggest re-enabling or unskipping these tests under any circumstances.

Comment on lines +88 to +99
def test_moderation_get_jailbreak_results_records_error_after_retries():
evaluator = _moderation_evaluator()
results = {"flagged": 0, "correct": 0, "error": 0, "label": "yes"}

with patch(
"nemoguardrails.evaluate.evaluate_moderation.llm_call",
AsyncMock(side_effect=RuntimeError("failed")),
):
prediction, updated = evaluator.get_jailbreak_results("prompt", results)

assert prediction is None
assert updated["error"] == 1

@coderabbitai coderabbitai Bot Jun 26, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

Assert the retry loop actually exhausts max_tries.

Both tests only check the terminal error state. They still pass if the implementation stops after the first failure, so the retry behavior this PR is supposed to cover is not actually locked in. Capture the mock and assert three awaits for llm_call here and two calls to evaluator.llm in the hallucination case.

As per coding guidelines, "For maintainer-approved refactors, add or update characterization tests before changing subtle behavior; when the existing suite does not cover the refactored code, keep the equivalence check used to prove behavior is unchanged."

Also applies to: 235-239

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/evaluate/test_evaluate_runtime_classes.py` around lines 88 - 99, The
retry behavior in get_jailbreak_results is not being verified, since the test
only checks the final error state and would still pass if retries stop early.
Update the test around _moderation_evaluator and
nemoguardrails.evaluate.evaluate_moderation.llm_call to capture the mock and
assert it is awaited three times, so the max_tries loop is actually exercised.
Also apply the same characterization check in the hallucination test by
asserting two calls to evaluator.llm, using the existing evaluator methods and
mocks to confirm the retry path really exhausts the configured attempts.

Source: Coding guidelines

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this resolved now?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

FILE="tests/evaluate/test_evaluate_runtime_classes.py"

ast-grep outline "$FILE" --view expanded || true

printf '\n===== moderation section =====\n'
sed -n '60,130p' "$FILE"

printf '\n===== hallucination section =====\n'
sed -n '210,260p' "$FILE"

Length of output: 6520



@Pouyanpi Yes — this looks resolved now.

I checked the current file, and both retry-path assertions are present:

  • test_moderation_get_jailbreak_results_records_error_after_retries() now asserts mock_llm_call.await_count == 3
  • test_hallucination_get_response_with_retries_returns_none_after_errors() now asserts evaluator.llm.call_count == 2

That closes the gap I called out: the tests now verify the retry loops actually exhaust the configured attempts instead of only checking the terminal error state.


🧠 Learnings used
Learnt from: Pouyanpi
Repo: NVIDIA-NeMo/Guardrails PR: 1814
File: tests/integrations/langchain/test_openai_param_filter.py:51-52
Timestamp: 2026-04-23T09:43:00.117Z
Learning: In NVIDIA-NeMo/Guardrails tests, recognize that env-based feature flags may be intentionally implemented using the truthiness pattern `os.environ.get("VAR_NAME") or os.environ.get("OTHER_VAR")` (where the env var values are treated as non-empty strings, e.g., `LIVE_TEST_MODE`, `TEST_LIVE_MODE`). During code review, avoid requesting per-file boolean normalization changes for this pattern; any correction should be handled as a repository-wide cleanup rather than a localized refactor in a single test file.

Learnt from: Pouyanpi
Repo: NVIDIA-NeMo/Guardrails PR: 1970
File: tests/v2_x/test_state_serialization.py:144-144
Timestamp: 2026-06-04T08:31:21.557Z
Learning: In the NVIDIA-NeMo/Guardrails repo, do not include wall-clock/performance timing assertions as part of the unit-test suite (e.g., tests that measure execution time and fail based on a threshold). Such tests are inherently flaky on shared CI and provide little/none functional coverage. During review, require that any existing wall-clock performance test be permanently removed or kept skipped (e.g., `pytest.mark.skip(reason="Flaky wall-clock performance assertion.")`). Do not suggest re-enabling or unskipping these tests under any circumstances.

Comment on lines +390 to +404
def test_factcheck_run_creates_negative_samples_when_enabled(tmp_path):
evaluator = _factcheck_evaluator()
evaluator.create_negatives = True
evaluator.create_negative_samples = AsyncMock(return_value=evaluator.dataset)
evaluator.check_facts = MagicMock(
side_effect=[
([{"label": "yes"}], 1, 0.1),
([{"label": "no"}], 1, 0.2),
]
)

evaluator.run()

evaluator.create_negative_samples.assert_awaited_once_with(evaluator.dataset)
assert evaluator.check_facts.call_count == 2

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

Make this test prove the returned negative-sample dataset is used.

create_negative_samples returns evaluator.dataset unchanged here, so this still passes if run() forgets to assign the coroutine result back before calling check_facts. Returning a distinct sentinel dataset and asserting evaluator.dataset (or what check_facts consumes) would cover the real contract.

As per coding guidelines, "For maintainer-approved refactors, add or update characterization tests before changing subtle behavior; when the existing suite does not cover the refactored code, keep the equivalence check used to prove behavior is unchanged."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/evaluate/test_evaluate_runtime_classes.py` around lines 390 - 404, The
fact-check runtime test currently does not verify that the value returned by
create_negative_samples is actually used by run() before check_facts is called.
Update test_factcheck_run_creates_negative_samples_when_enabled to have
create_negative_samples return a distinct sentinel dataset and assert that the
evaluator’s dataset, or the argument consumed by check_facts, reflects that
returned value. Keep the existing assertions around create_negative_samples and
check_facts call count so the test proves the coroutine result is assigned back
into the flow.

Source: Coding guidelines

Comment thread tests/server/conftest.py Outdated
Comment on lines +21 to +24
@pytest.fixture(autouse=True)
def _disable_chat_ui(monkeypatch):
if "NEMO_GUARDRAILS_DISABLE_CHAT_UI" not in os.environ:
monkeypatch.setenv("NEMO_GUARDRAILS_DISABLE_CHAT_UI", "true")

@coderabbitai coderabbitai Bot Jun 26, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🟡 Minor | ⚡ Quick win

Make the fixture override the env unconditionally.

Lines 23-24 currently respect any pre-existing NEMO_GUARDRAILS_DISABLE_CHAT_UI value, so a developer/CI env set to "false" will silently re-enable the chat UI and make this test directory non-hermetic. Force "true" here so the server tests always exercise the intended isolated path.

Suggested change
-import os
-
 import pytest
 
 
 `@pytest.fixture`(autouse=True)
 def _disable_chat_ui(monkeypatch):
-    if "NEMO_GUARDRAILS_DISABLE_CHAT_UI" not in os.environ:
-        monkeypatch.setenv("NEMO_GUARDRAILS_DISABLE_CHAT_UI", "true")
+    monkeypatch.setenv("NEMO_GUARDRAILS_DISABLE_CHAT_UI", "true")
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
@pytest.fixture(autouse=True)
def _disable_chat_ui(monkeypatch):
if "NEMO_GUARDRAILS_DISABLE_CHAT_UI" not in os.environ:
monkeypatch.setenv("NEMO_GUARDRAILS_DISABLE_CHAT_UI", "true")
import pytest
`@pytest.fixture`(autouse=True)
def _disable_chat_ui(monkeypatch):
monkeypatch.setenv("NEMO_GUARDRAILS_DISABLE_CHAT_UI", "true")
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/server/conftest.py` around lines 21 - 24, The autouse fixture
`_disable_chat_ui` should force the test environment into the isolated path
instead of preserving any pre-existing value. Update the fixture so it always
sets `NEMO_GUARDRAILS_DISABLE_CHAT_UI` to `"true"` via `monkeypatch.setenv`,
regardless of whether the variable is already present in `os.environ`, to keep
the server tests hermetic.

Source: Coding guidelines

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this resolved now?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

ast-grep outline tests/server/conftest.py --view expanded || true
echo "----- tests/server/conftest.py -----"
sed -n '1,80p' tests/server/conftest.py

echo "----- search for NEMO_GUARDRAILS_DISABLE_CHAT_UI in tests/server -----"
rg -n -C2 'NEMO_GUARDRAILS_DISABLE_CHAT_UI' tests/server

Length of output: 1886



@Pouyanpi Yes — this looks resolved now.

I checked the current tests/server/conftest.py, and at Line 23 the fixture now unconditionally does:

monkeypatch.setenv("NEMO_GUARDRAILS_DISABLE_CHAT_UI", "true")

That addresses the original concern about an ambient "false" value re-enabling the chat UI during server tests.


🧠 Learnings used
Learnt from: Pouyanpi
Repo: NVIDIA-NeMo/Guardrails PR: 1814
File: tests/integrations/langchain/test_openai_param_filter.py:51-52
Timestamp: 2026-04-23T09:43:00.117Z
Learning: In NVIDIA-NeMo/Guardrails tests, recognize that env-based feature flags may be intentionally implemented using the truthiness pattern `os.environ.get("VAR_NAME") or os.environ.get("OTHER_VAR")` (where the env var values are treated as non-empty strings, e.g., `LIVE_TEST_MODE`, `TEST_LIVE_MODE`). During code review, avoid requesting per-file boolean normalization changes for this pattern; any correction should be handled as a repository-wide cleanup rather than a localized refactor in a single test file.

Learnt from: Pouyanpi
Repo: NVIDIA-NeMo/Guardrails PR: 1970
File: tests/v2_x/test_state_serialization.py:144-144
Timestamp: 2026-06-04T08:31:21.557Z
Learning: In the NVIDIA-NeMo/Guardrails repo, do not include wall-clock/performance timing assertions as part of the unit-test suite (e.g., tests that measure execution time and fail based on a threshold). Such tests are inherently flaky on shared CI and provide little/none functional coverage. During review, require that any existing wall-clock performance test be permanently removed or kept skipped (e.g., `pytest.mark.skip(reason="Flaky wall-clock performance assertion.")`). Do not suggest re-enabling or unskipping these tests under any circumstances.

Comment thread tests/server/test_api.py Outdated
mock_client.__aexit__ = AsyncMock(return_value=False)

with patch.dict(os.environ, {"MAIN_MODEL_BASE_URL": "http://localhost:8000"}):
with patch.dict(os.environ, {"MAIN_MODEL_BASE_URL": "http://localhost:8000", "OPENAI_API_KEY": ""}):

@coderabbitai coderabbitai Bot Jun 26, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🟡 Minor | ⚡ Quick win

Pin MAIN_MODEL_ENGINE in these route tests.

/v1/models chooses the provider from MAIN_MODEL_ENGINE before it reaches fetch_models, so leaving that env var uncontrolled makes both tests order-dependent if another test has changed it. Add "MAIN_MODEL_ENGINE": "openai" to the patched env so the auth assertions stay deterministic.

Also applies to: 1051-1056

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/server/test_api.py` at line 1031, Pin the provider environment in the
`/v1/models` route tests by setting MAIN_MODEL_ENGINE to openai in the
patch.dict blocks used around the affected assertions. The tests currently rely
on whatever MAIN_MODEL_ENGINE was left by prior tests, so update the env setup
in the relevant test cases in test_api.py to control that value alongside
MAIN_MODEL_BASE_URL and OPENAI_API_KEY. Use the existing route test setup around
fetch_models and the /v1/models auth assertions as the place to make the change.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this resolved now?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Pouyanpi Yes — this looks resolved now.

Both affected /v1/models auth tests in tests/server/test_api.py now explicitly pin MAIN_MODEL_ENGINE to openai, and they also control OPENAI_API_KEY, so the assertions no longer depend on env state leaked from earlier tests.


🧠 Learnings used
Learnt from: Pouyanpi
Repo: NVIDIA-NeMo/Guardrails PR: 1814
File: tests/integrations/langchain/test_openai_param_filter.py:51-52
Timestamp: 2026-04-23T09:43:00.117Z
Learning: In NVIDIA-NeMo/Guardrails tests, recognize that env-based feature flags may be intentionally implemented using the truthiness pattern `os.environ.get("VAR_NAME") or os.environ.get("OTHER_VAR")` (where the env var values are treated as non-empty strings, e.g., `LIVE_TEST_MODE`, `TEST_LIVE_MODE`). During code review, avoid requesting per-file boolean normalization changes for this pattern; any correction should be handled as a repository-wide cleanup rather than a localized refactor in a single test file.

Learnt from: Pouyanpi
Repo: NVIDIA-NeMo/Guardrails PR: 1970
File: tests/v2_x/test_state_serialization.py:144-144
Timestamp: 2026-06-04T08:31:21.557Z
Learning: In the NVIDIA-NeMo/Guardrails repo, do not include wall-clock/performance timing assertions as part of the unit-test suite (e.g., tests that measure execution time and fail based on a threshold). Such tests are inherently flaky on shared CI and provide little/none functional coverage. During review, require that any existing wall-clock performance test be permanently removed or kept skipped (e.g., `pytest.mark.skip(reason="Flaky wall-clock performance assertion.")`). Do not suggest re-enabling or unskipping these tests under any circumstances.

Comment thread tests/server/test_openai_integration.py Outdated
Comment on lines 328 to 329
os.environ.setdefault("MAIN_MODEL_BASE_URL", "https://api.openai.com")
os.environ["MAIN_MODEL_ENGINE"] = "openai"

@coderabbitai coderabbitai Bot Jun 26, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🟡 Minor | ⚡ Quick win

Restore MAIN_MODEL_ENGINE after each live test.

These tests mutate os.environ directly, so the selected provider can leak into later tests and make failures order-dependent. Use monkeypatch.setenv(...) or patch.dict(...) here instead of bare assignments.

Also applies to: 347-348, 368-368, 387-387, 406-406

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/server/test_openai_integration.py` around lines 328 - 329, The live
OpenAI integration tests are mutating global environment state by assigning
MAIN_MODEL_ENGINE directly, which can leak the provider selection into later
tests. Update each affected test block in test_openai_integration.py to use
monkeypatch.setenv(...) or patch.dict(...) instead of bare os.environ
assignments, and ensure the environment is automatically restored after each
test; apply the same fix at all mentioned MAIN_MODEL_ENGINE writes.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this resolved now?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Pouyanpi Yes — for the originally flagged live tests, this looks resolved now.

The affected test_list_models_* cases in tests/server/test_openai_integration.py now use monkeypatch.setenv(...) instead of bare os.environ[...] = ..., so MAIN_MODEL_ENGINE / MAIN_MODEL_BASE_URL are automatically restored after each test and won’t leak into later tests.

Also, I checked the LIVE_TEST_MODE or TEST_LIVE_MODE gating pattern against prior repo context, and that pattern is consistent with existing conventions here.


🧠 Learnings used
Learnt from: Pouyanpi
Repo: NVIDIA-NeMo/Guardrails PR: 1814
File: tests/integrations/langchain/test_openai_param_filter.py:51-52
Timestamp: 2026-04-23T09:43:00.117Z
Learning: In NVIDIA-NeMo/Guardrails tests, recognize that env-based feature flags may be intentionally implemented using the truthiness pattern `os.environ.get("VAR_NAME") or os.environ.get("OTHER_VAR")` (where the env var values are treated as non-empty strings, e.g., `LIVE_TEST_MODE`, `TEST_LIVE_MODE`). During code review, avoid requesting per-file boolean normalization changes for this pattern; any correction should be handled as a repository-wide cleanup rather than a localized refactor in a single test file.

Learnt from: Pouyanpi
Repo: NVIDIA-NeMo/Guardrails PR: 1970
File: tests/v2_x/test_state_serialization.py:144-144
Timestamp: 2026-06-04T08:31:21.557Z
Learning: In the NVIDIA-NeMo/Guardrails repo, do not include wall-clock/performance timing assertions as part of the unit-test suite (e.g., tests that measure execution time and fail based on a threshold). Such tests are inherently flaky on shared CI and provide little/none functional coverage. During review, require that any existing wall-clock performance test be permanently removed or kept skipped (e.g., `pytest.mark.skip(reason="Flaky wall-clock performance assertion.")`). Do not suggest re-enabling or unskipping these tests under any circumstances.

Comment on lines +207 to +221
patch_llm_call(
monkeypatch,
'user intent: ask help\nbot intent: provide help\nbot action: bot say "Here"',
"assistant response only",
)

result = await actions.generate_user_intent_and_bot_action(state, [], "help")

assert result == {
"user_intent": "ask help",
"bot_intent": "provide help",
"bot_action": 'bot say "Here"',
}
monkeypatch.setattr(v2_generation, "get_first_bot_action", lambda lines: None)
with pytest.raises(LlmResponseError):

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

Tighten the parser-error fixtures.

Both negative-path checks use an already malformed LLM payload and patch get_first_bot_action() to return None. That means these assertions can pass because of some earlier parse failure, not necessarily because bot-action extraction failed. Reuse a syntactically valid response for the failing call and vary only the targeted extractor/field so the test actually pins that contract.

Also applies to: 311-326

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/v2_x/test_generation_actions_unit.py` around lines 207 - 221, Tighten
the negative-path fixtures in generate_user_intent_and_bot_action tests so each
assertion isolates the intended failure mode: use a syntactically valid LLM
payload for the call that should fail, then vary only the specific extractor or
field being exercised. Update the affected test cases around patch_llm_call,
get_first_bot_action, and the LlmResponseError assertions so failures come from
bot-action extraction/parsing behavior rather than an already malformed
response.

Comment on lines +361 to +362
with pytest.raises(Exception, match="Invalid LLM response"):
await actions.generate_value(state, "make bad", [], var_name="bad")

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

Don't assert on base Exception here.

This will pass for unrelated failures in prompt rendering/parsing too, so it doesn't protect the invalid-response contract. Assert the specific exception type raised by generate_value(...) instead.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/v2_x/test_generation_actions_unit.py` around lines 361 - 362, The test
for generate_value currently catches base Exception, which is too broad and can
hide unrelated failures. Update this assertion to expect the specific exception
type raised by actions.generate_value for invalid LLM responses, using the
generate_value() path and the invalid-response contract message to locate the
correct exception class.

Add focused unit tests across the eval and evaluate pipelines, the
FastAPI server and chat app, the math and validation actions, and the
v2_x generation actions.

Fix LLMJudgeComplianceChecker model initialization to use
init_llm_model; the prior LLMRails.get_model_cls_and_kwargs path was
removed in an earlier refactor.
@Pouyanpi Pouyanpi force-pushed the chore/coverage-server-tests branch from 0458b87 to 53cc53b Compare June 26, 2026 08:43
@Pouyanpi Pouyanpi marked this pull request as draft June 26, 2026 10:17
@Pouyanpi

Copy link
Copy Markdown
Collaborator Author

@CodeRabbit help

@coderabbitai

coderabbitai Bot commented Jun 26, 2026

Copy link
Copy Markdown
Contributor

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.

CodeRabbit commands

These commands are invoked using PR/Issue comments.

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai rate limit to show your current review rate limit status.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai evaluate custom pre-merge check --instructions <custom-checks-instructions> --name <custom-checks-title> [--mode <error|warning>] to validate the custom pre-merge checks instructions. Defaults to error when --mode is omitted.
  • @coderabbitai ignore pre-merge checks to override pre-merge checks and get an approval on PR.
  • @coderabbitai run pre-merge checks to run pre-merge checks on the pull request.
  • @coderabbitai generate docstrings to generate docstrings for this PR.
  • @coderabbitai generate sequence diagram to generate a sequence diagram of the changes in this PR.
  • @coderabbitai generate unit tests to generate unit tests for this PR.
  • @coderabbitai emit path instructions to emit generated path-specific review instructions for this repository.
  • @coderabbitai resolve merge conflict to automatically resolve merge conflicts.
  • @coderabbitai autofix to automatically fix issues identified in unresolved review comments.
  • @coderabbitai resolve to resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai generate configuration to create a PR that adds the current resolved configuration as .coderabbit.yaml (or show it if already present).
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore or @coderabbit ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai or @coderabbitai title anywhere in the PR title to generate the title automatically.

Status, support, documentation and community

  • Visit our status page to check the current availability of CodeRabbit.
  • Create a ticket on our support page for assistance with any issues or questions.
  • Visit our documentation site for detailed information on how to use CodeRabbit.
  • Join our Discord community to connect with other users and get help from the community.
  • Follow us on X/Twitter for updates and announcements.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size: XL status: triaged Triaged by a maintainer; eligible for automated review (CodeRabbit/Greptile).

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant