test: add unit coverage for eval, evaluate, server, and actions#2078
test: add unit coverage for eval, evaluate, server, and actions#2078Pouyanpi wants to merge 2 commits into
Conversation
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
Greptile SummaryThis PR adds broad unit test coverage across the eval/evaluate pipelines, FastAPI server, actions, and v2.x generation, and fixes a broken
|
| Filename | Overview |
|---|---|
| nemoguardrails/eval/check.py | Core production change: replaces removed LLMRails.get_model_cls_and_kwargs with init_llm_model; adds _prepare_model_kwargs helper; improves placeholder main-model creation via model_copy. The empty-string env var case in _prepare_model_kwargs was flagged in a prior review. |
| tests/eval/test_eval_check.py | New tests covering LLMJudgeComplianceChecker init, print helpers, all compliance verdict branches (yes/no/n/a/invalid), reset/force flags, and the full run() orchestration. Well-structured with a shared new-based builder fixture. |
| tests/server/test_server_coverage.py | New tests covering lifespan startup paths, _get_rails caching/validation/model-override, streaming formatter, auto-reload monitoring, chat-app message flow (success/unsupported/error), and Redis/DataStore base classes. |
| tests/server/test_api.py | Adds test_list_models_prefers_openai_api_key_over_auth_header; fixes pre-existing tests to pin MAIN_MODEL_ENGINE=openai and OPENAI_API_KEY="" so ambient env keys cannot interfere with auth-forwarding assertions. |
| tests/server/test_openai_integration.py | Guards all live-provider tests behind LIVE_TEST_MODE flag; refactors openai_client fixture into a factory; switches os.environ mutations to monkeypatch for proper cleanup. |
| tests/eval/test_eval_ui_utils.py | New module-level plotly stub + cleanup fixture; tests EvalData helpers, compliance/resource dataframes, chart utils rendering, streamlit sidebar, README page, and run summary filtering. The module-level stub injection timing may interact with other test workers collecting this module alongside files that import plotly (noted in a prior outside-diff comment). |
| tests/v2_x/test_generation_actions_unit.py | New unit tests for LLMGenerationActionsV2dotx: flow-index building, user-intent collection, embedding-only fallback, LLM-based intent/bot-action generation, passthrough, flow helpers, and generate_value/flow/continuation error paths. |
| tests/evaluate/test_evaluate_runtime_classes.py | New tests for ModerationRailsEvaluation, HallucinationRailsEvaluation, and FactCheckEvaluation: init, LLM call flows, retry exhaustion, output writing, and negative-sample generation. |
| tests/eval/test_eval_cli.py | New tests for eval CLI: run command validation, invocation of run_eval, check-compliance with explicit/discovered output paths, UI launch, and missing-streamlit exit. |
| tests/eval/test_eval_runtime.py | New tests for eval runtime: interaction-output extraction, incremental output loading, span tracing, log extraction, span metric aggregation, compliance computation, dict helpers, and full run_eval end-to-end. |
| tests/evaluate/test_evaluate_cli_and_data.py | New tests for evaluate CLI (topical/moderation/hallucination/fact-checking commands), Anthropic dataset processing, SimplifyFormatter, and DatasetConnector helpers including Banking77 and ChitChat connectors. |
| tests/evaluate/test_evaluate_topical.py | New tests for TopicalRailsEvaluation: dataset loading, cosine similarity, sync_wrapper, test-set splitting, embeddings model init/compute, intent helpers, main-model detection, and full evaluate_topical_rails end-to-end. |
| tests/server/conftest.py | New conftest autouse fixture that sets NEMO_GUARDRAILS_DISABLE_CHAT_UI=true for all server tests, preventing chat-UI imports from affecting test isolation. |
| tests/test_actions_math.py | New tests for wolfram_alpha_request: missing query, missing APP_ID, successful 200 response, and non-200 error ActionResult. |
| tests/test_actions_validation.py | Extends existing validation tests with contains_secrets coverage: detected secret, all-false result, and missing detect_secrets dependency error handling. |
| tests/server/test_schema_utils.py | Adds test_fetch_env_auth_precedes_forwarded_auth; fixes test_fetch_auth_forwarded to set OPENAI_API_KEY="" so the env-key guard does not shadow the forwarded-token path in CI. |
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[LLMJudgeComplianceChecker.__init__] --> B[EvalConfig.from_path]
B --> C{Find model_config\nmatching llm_judge_model}
C -->|not found| D[exit 1]
C -->|found| E[_prepare_model_kwargs\nparameters + api_key_env_var]
E --> F[init_llm_model\nmodel_name, provider_name, mode, kwargs]
F --> G[self.llm]
A --> H{any model.type == main?}
H -->|yes| I[use eval_config.models as-is]
H -->|no| J[model_config.model_copy\nupdate type to main]
J --> I
I --> K[RailsConfig models=task_manager_models]
K --> L[LLMTaskManager config]
L --> M[self.llm_task_manager]
G & M --> N[EvalData initialized\nready for compliance checks]
%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
flowchart TD
A[LLMJudgeComplianceChecker.__init__] --> B[EvalConfig.from_path]
B --> C{Find model_config\nmatching llm_judge_model}
C -->|not found| D[exit 1]
C -->|found| E[_prepare_model_kwargs\nparameters + api_key_env_var]
E --> F[init_llm_model\nmodel_name, provider_name, mode, kwargs]
F --> G[self.llm]
A --> H{any model.type == main?}
H -->|yes| I[use eval_config.models as-is]
H -->|no| J[model_config.model_copy\nupdate type to main]
J --> I
I --> K[RailsConfig models=task_manager_models]
K --> L[LLMTaskManager config]
L --> M[self.llm_task_manager]
G & M --> N[EvalData initialized\nready for compliance checks]
Reviews (3): Last reviewed commit: "apply review suggestions" | Re-trigger Greptile
| if model_config.api_key_env_var: | ||
| api_key = os.environ.get(model_config.api_key_env_var) | ||
| if api_key: | ||
| kwargs["api_key"] = api_key |
There was a problem hiding this comment.
When
api_key_env_var is configured but the environment variable is set to an empty string (e.g. export JUDGE_API_KEY=""), the check if api_key: silently drops the key from kwargs. Providers that require an explicit API key will then fail at inference time with an opaque authentication error rather than a clear initialization error. Consider using is not None to distinguish "not set" from "explicitly empty".
| if model_config.api_key_env_var: | |
| api_key = os.environ.get(model_config.api_key_env_var) | |
| if api_key: | |
| kwargs["api_key"] = api_key | |
| if model_config.api_key_env_var: | |
| api_key = os.environ.get(model_config.api_key_env_var) | |
| if api_key is not None: | |
| kwargs["api_key"] = api_key |
Prompt To Fix With AI
This is a comment left during a code review.
Path: nemoguardrails/eval/check.py
Line: 51-54
Comment:
When `api_key_env_var` is configured but the environment variable is set to an empty string (e.g. `export JUDGE_API_KEY=""`), the check `if api_key:` silently drops the key from `kwargs`. Providers that require an explicit API key will then fail at inference time with an opaque authentication error rather than a clear initialization error. Consider using `is not None` to distinguish "not set" from "explicitly empty".
```suggestion
if model_config.api_key_env_var:
api_key = os.environ.get(model_config.api_key_env_var)
if api_key is not None:
kwargs["api_key"] = api_key
```
How can I resolve this? If you propose a fix, please make it concise.There was a problem hiding this comment.
@greptileai this is resolved in the latest commit. Do you confirm?
There was a problem hiding this comment.
The issue has not been resolved. The current code on the PR head still reads:
if model_config.api_key_env_var:
api_key = os.environ.get(model_config.api_key_env_var)
if api_key:
kwargs["api_key"] = api_keyThe inner if api_key: still silently drops the key when the environment variable is set to an empty string. The fix would be:
| if model_config.api_key_env_var: | |
| api_key = os.environ.get(model_config.api_key_env_var) | |
| if api_key: | |
| kwargs["api_key"] = api_key | |
| if model_config.api_key_env_var: | |
| api_key = os.environ.get(model_config.api_key_env_var) | |
| if api_key is not None: | |
| kwargs["api_key"] = api_key |
📝 WalkthroughWalkthroughThe judge checker now initializes through ChangesEval checker and UI coverage
Evaluate package command coverage
Server auth and lifecycle coverage
Action helper coverage
V2 generation actions coverage
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Suggested reviewers
🚥 Pre-merge checks | ✅ 4 | ❌ 2❌ Failed checks (2 warnings)
✅ Passed checks (4 passed)
✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 8
🧹 Nitpick comments (4)
tests/eval/test_eval_check.py (1)
81-87: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winExercise the real RailsConfig/LLMTaskManager construction path.
This test mocks the exact contract changed in
LLMJudgeComplianceChecker.__init__, so a brokenRailsConfig(models=..., prompts=...)shape would still pass. Sinceinit_llm_modelis already patched, using the real config/task-manager path should stay offline.Proposed test tightening
- task_manager = MagicMock() - with ( patch("nemoguardrails.eval.check.EvalConfig.from_path", return_value=eval_config) as mock_from_path, patch("nemoguardrails.eval.check.init_llm_model", return_value="llm") as mock_init_llm_model, - patch( - "nemoguardrails.eval.check.RailsConfig", side_effect=lambda **kwargs: SimpleNamespace(**kwargs) - ) as mock_config, - patch("nemoguardrails.eval.check.LLMTaskManager", return_value=task_manager) as mock_task_manager, ): checker = LLMJudgeComplianceChecker( eval_config_path=str(tmp_path), @@ - task_manager_models = mock_config.call_args.kwargs["models"] + task_manager_models = checker.llm_task_manager.config.models assert task_manager_models[0] == eval_config.models[0] assert task_manager_models[1].type == "main" assert task_manager_models[1].model == "judge-model" - mock_task_manager.assert_called_once() + assert checker.llm_task_manager.config.prompts == eval_config.promptsAs per coding guidelines, “For maintainer-approved refactors, add or update characterization tests before changing subtle behavior.”
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/eval/test_eval_check.py` around lines 81 - 87, The test is over-mocking the construction path used by LLMJudgeComplianceChecker.__init__, so it won’t catch regressions in how RailsConfig and LLMTaskManager are actually built. Update the test to use the real RailsConfig and LLMTaskManager creation path while keeping init_llm_model patched, and keep only the external dependencies mocked so the test still runs offline. This will validate the changed models/prompts contract through the real constructor flow rather than a SimpleNamespace stand-in.Source: Coding guidelines
tests/evaluate/test_evaluate_topical.py (2)
48-52: 🎯 Functional Correctness | 🔵 Trivial | ⚡ Quick winCover the
asyncio.run(...)fallback explicitly.This only validates whichever branch
asyncio.get_event_loop()takes under the current test runner.sync_wrapperhas separate error-handling behavior whenget_event_loop()raisesRuntimeError, so add a monkeypatched case for that branch to keep the fallback from regressing unnoticed.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/evaluate/test_evaluate_topical.py` around lines 48 - 52, The current test only covers the normal event-loop path in sync_wrapper and misses the asyncio.run fallback when asyncio.get_event_loop() raises RuntimeError. Update test_sync_wrapper_runs_async_function to add a monkeypatched case that forces get_event_loop() to fail, then verify sync_wrapper(add) still returns the expected result through the fallback branch so that path stays covered.
55-69: 🎯 Functional Correctness | 🔵 Trivial | ⚡ Quick winAdd a
random_seed=0characterization case.
_split_test_set_from_configcurrently checksif random_seed:, so seed0skips the deterministic shuffle path. This test covers seeded behavior, but it misses the one boundary value that flips the branch and would let that regression pass silently. As per coding guidelines, "For maintainer-approved refactors, add or update characterization tests before changing subtle behavior; when the existing suite does not cover the refactored code, keep the equivalence check used to prove behavior is unchanged."🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/evaluate/test_evaluate_topical.py` around lines 55 - 69, The `_split_test_set_from_config` test currently covers a nonzero seed but misses the `random_seed=0` branch boundary where `if random_seed:` skips deterministic shuffling. Update the characterization tests in `test_split_test_set_from_config_uses_seed_and_limits_remaining_samples` (or add a sibling case) to exercise `random_seed=0` and verify the split remains deterministic and equivalent to the seeded path, using `_split_test_set_from_config`, `config.user_messages`, and `test_set` to assert the same invariants.Source: Coding guidelines
tests/test_actions_math.py (1)
67-73: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winAssert the utterance event on both failure paths.
These tests only verify the
BotIntentsequence.nemoguardrails/actions/math.py:49-68also defines aStartUtteranceBotAction.scriptfor both fallback branches, so a regression in the user-facing message would still pass here. Add one assertion for that event in each test.Also applies to: 97-102
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/test_actions_math.py` around lines 67 - 73, The math action tests only assert the BotIntent sequence, so they miss regressions in the user-facing fallback message. In the relevant test cases for the failure paths, also assert the StartUtteranceBotAction.script event produced by the math action so both branches are covered. Use the existing result.events checks in tests/test_actions_math.py and match the utterance event emitted by the math action logic in nemoguardrails/actions/math.py.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tests/eval/test_eval_ui_utils.py`:
- Around line 33-56: The Plotly stub cleanup only removes plotly modules, but
ui_common and chart_utils also stay cached after importing with the stubbed
plotly.express, making later tests order-dependent. Update the
_cleanup_plotly_stub fixture in test_eval_ui_utils.py to also remove the cached
nemoguardrails.eval.ui.common and nemoguardrails.eval.ui.chart_utils modules
during teardown, alongside the existing plotly modules, so future imports rebind
the real dependencies.
In `@tests/evaluate/test_evaluate_runtime_classes.py`:
- Around line 390-404: The fact-check runtime test currently does not verify
that the value returned by create_negative_samples is actually used by run()
before check_facts is called. Update
test_factcheck_run_creates_negative_samples_when_enabled to have
create_negative_samples return a distinct sentinel dataset and assert that the
evaluator’s dataset, or the argument consumed by check_facts, reflects that
returned value. Keep the existing assertions around create_negative_samples and
check_facts call count so the test proves the coroutine result is assigned back
into the flow.
- Around line 88-99: The retry behavior in get_jailbreak_results is not being
verified, since the test only checks the final error state and would still pass
if retries stop early. Update the test around _moderation_evaluator and
nemoguardrails.evaluate.evaluate_moderation.llm_call to capture the mock and
assert it is awaited three times, so the max_tries loop is actually exercised.
Also apply the same characterization check in the hallucination test by
asserting two calls to evaluator.llm, using the existing evaluator methods and
mocks to confirm the retry path really exhausts the configured attempts.
In `@tests/server/conftest.py`:
- Around line 21-24: The autouse fixture `_disable_chat_ui` should force the
test environment into the isolated path instead of preserving any pre-existing
value. Update the fixture so it always sets `NEMO_GUARDRAILS_DISABLE_CHAT_UI` to
`"true"` via `monkeypatch.setenv`, regardless of whether the variable is already
present in `os.environ`, to keep the server tests hermetic.
In `@tests/server/test_api.py`:
- Line 1031: Pin the provider environment in the `/v1/models` route tests by
setting MAIN_MODEL_ENGINE to openai in the patch.dict blocks used around the
affected assertions. The tests currently rely on whatever MAIN_MODEL_ENGINE was
left by prior tests, so update the env setup in the relevant test cases in
test_api.py to control that value alongside MAIN_MODEL_BASE_URL and
OPENAI_API_KEY. Use the existing route test setup around fetch_models and the
/v1/models auth assertions as the place to make the change.
In `@tests/server/test_openai_integration.py`:
- Around line 328-329: The live OpenAI integration tests are mutating global
environment state by assigning MAIN_MODEL_ENGINE directly, which can leak the
provider selection into later tests. Update each affected test block in
test_openai_integration.py to use monkeypatch.setenv(...) or patch.dict(...)
instead of bare os.environ assignments, and ensure the environment is
automatically restored after each test; apply the same fix at all mentioned
MAIN_MODEL_ENGINE writes.
In `@tests/v2_x/test_generation_actions_unit.py`:
- Around line 207-221: Tighten the negative-path fixtures in
generate_user_intent_and_bot_action tests so each assertion isolates the
intended failure mode: use a syntactically valid LLM payload for the call that
should fail, then vary only the specific extractor or field being exercised.
Update the affected test cases around patch_llm_call, get_first_bot_action, and
the LlmResponseError assertions so failures come from bot-action
extraction/parsing behavior rather than an already malformed response.
- Around line 361-362: The test for generate_value currently catches base
Exception, which is too broad and can hide unrelated failures. Update this
assertion to expect the specific exception type raised by actions.generate_value
for invalid LLM responses, using the generate_value() path and the
invalid-response contract message to locate the correct exception class.
---
Nitpick comments:
In `@tests/eval/test_eval_check.py`:
- Around line 81-87: The test is over-mocking the construction path used by
LLMJudgeComplianceChecker.__init__, so it won’t catch regressions in how
RailsConfig and LLMTaskManager are actually built. Update the test to use the
real RailsConfig and LLMTaskManager creation path while keeping init_llm_model
patched, and keep only the external dependencies mocked so the test still runs
offline. This will validate the changed models/prompts contract through the real
constructor flow rather than a SimpleNamespace stand-in.
In `@tests/evaluate/test_evaluate_topical.py`:
- Around line 48-52: The current test only covers the normal event-loop path in
sync_wrapper and misses the asyncio.run fallback when asyncio.get_event_loop()
raises RuntimeError. Update test_sync_wrapper_runs_async_function to add a
monkeypatched case that forces get_event_loop() to fail, then verify
sync_wrapper(add) still returns the expected result through the fallback branch
so that path stays covered.
- Around line 55-69: The `_split_test_set_from_config` test currently covers a
nonzero seed but misses the `random_seed=0` branch boundary where `if
random_seed:` skips deterministic shuffling. Update the characterization tests
in `test_split_test_set_from_config_uses_seed_and_limits_remaining_samples` (or
add a sibling case) to exercise `random_seed=0` and verify the split remains
deterministic and equivalent to the seeded path, using
`_split_test_set_from_config`, `config.user_messages`, and `test_set` to assert
the same invariants.
In `@tests/test_actions_math.py`:
- Around line 67-73: The math action tests only assert the BotIntent sequence,
so they miss regressions in the user-facing fallback message. In the relevant
test cases for the failure paths, also assert the StartUtteranceBotAction.script
event produced by the math action so both branches are covered. Use the existing
result.events checks in tests/test_actions_math.py and match the utterance event
emitted by the math action logic in nemoguardrails/actions/math.py.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: cd8771b7-17e6-4add-a105-1390e279e63d
📒 Files selected for processing (16)
nemoguardrails/eval/check.pytests/eval/test_eval_check.pytests/eval/test_eval_cli.pytests/eval/test_eval_runtime.pytests/eval/test_eval_ui_utils.pytests/evaluate/test_evaluate_cli_and_data.pytests/evaluate/test_evaluate_runtime_classes.pytests/evaluate/test_evaluate_topical.pytests/server/conftest.pytests/server/test_api.pytests/server/test_openai_integration.pytests/server/test_schema_utils.pytests/server/test_server_coverage.pytests/test_actions_math.pytests/test_actions_validation.pytests/v2_x/test_generation_actions_unit.py
| # The eval UI modules import `plotly.express` at module load time, so a stub must | ||
| # be present before the imports below. We only inject it when plotly is absent and | ||
| # remove our own stubs after this module's tests so they do not leak into other | ||
| # files sharing the same worker. | ||
| _injected_plotly_modules = {} | ||
| for _module_name in ("plotly", "plotly.express"): | ||
| if _module_name not in sys.modules: | ||
| _injected_plotly_modules[_module_name] = types.ModuleType(_module_name) | ||
| sys.modules[_module_name] = _injected_plotly_modules[_module_name] | ||
|
|
||
|
|
||
| @pytest.fixture(scope="module", autouse=True) | ||
| def _cleanup_plotly_stub(): | ||
| yield | ||
| for _module_name, stub_module in _injected_plotly_modules.items(): | ||
| if sys.modules.get(_module_name) is stub_module: | ||
| del sys.modules[_module_name] | ||
|
|
||
|
|
||
| ui_common = importlib.import_module("nemoguardrails.eval.ui.common") | ||
| ui_utils = importlib.import_module("nemoguardrails.eval.ui.utils") | ||
| chart_utils = importlib.import_module("nemoguardrails.eval.ui.chart_utils") | ||
| streamlit_utils = importlib.import_module("nemoguardrails.eval.ui.streamlit_utils") | ||
| readme_page = importlib.import_module("nemoguardrails.eval.ui.README") |
There was a problem hiding this comment.
🩺 Stability & Availability | 🟠 Major
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
printf '\n== test file ==\n'
cat -n tests/eval/test_eval_ui_utils.py | sed -n '1,120p'
printf '\n== related UI modules/tests ==\n'
rg -n "nemoguardrails\.eval\.ui\.(common|utils|chart_utils|streamlit_utils|README)|plotly\.express|import plotly" tests nemoguardrails/eval/ui -g '!**/__pycache__/**'Repository: NVIDIA-NeMo/Guardrails
Length of output: 7216
🏁 Script executed:
#!/bin/bash
set -euo pipefail
printf '\n== file list ==\n'
git ls-files | rg '(^|/)(tests/eval/test_eval_ui_utils\.py|nemoguardrails/eval/ui/|tests/eval/|tests/).*\.py$'
printf '\n== module outlines ==\n'
for f in nemoguardrails/eval/ui/common.py nemoguardrails/eval/ui/utils.py nemoguardrails/eval/ui/chart_utils.py nemoguardrails/eval/ui/streamlit_utils.py nemoguardrails/eval/ui/README.py; do
if [ -f "$f" ]; then
echo "--- $f ---"
wc -l "$f"
sed -n '1,120p' "$f"
fi
doneRepository: NVIDIA-NeMo/Guardrails
Length of output: 30699
🏁 Script executed:
#!/bin/bash
set -euo pipefail
printf '\n== test file around stub/import block ==\n'
sed -n '1,120p' tests/eval/test_eval_ui_utils.py | cat -n
printf '\n== references in tests ==\n'
rg -n "nemoguardrails\.eval\.ui\.(common|utils|chart_utils|streamlit_utils|README)|plotly\.express|import plotly" tests -g '!**/__pycache__/**'
printf '\n== references in ui package ==\n'
rg -n "plotly\.express|import plotly|from plotly" nemoguardrails/eval/ui -g '!**/__pycache__/**'Repository: NVIDIA-NeMo/Guardrails
Length of output: 6259
🏁 Script executed:
#!/bin/bash
set -euo pipefail
printf '\n== plotly imports in UI modules ==\n'
for f in nemoguardrails/eval/ui/common.py nemoguardrails/eval/ui/utils.py nemoguardrails/eval/ui/chart_utils.py nemoguardrails/eval/ui/streamlit_utils.py nemoguardrails/eval/ui/README.py; do
if [ -f "$f" ]; then
echo "--- $f ---"
rg -n "plotly|px" "$f"
fi
done
printf '\n== other tests importing these modules ==\n'
rg -n "nemoguardrails\.eval\.ui\.(common|utils|chart_utils|streamlit_utils|README)" tests -g '!tests/eval/test_eval_ui_utils.py'Repository: NVIDIA-NeMo/Guardrails
Length of output: 237
Clean up the cached UI modules that import Plotly. nemoguardrails.eval.ui.chart_utils binds the stubbed plotly.express at import time, and nemoguardrails.eval.ui.common keeps those helpers cached. Removing only plotly* leaves later imports of common/chart_utils order-dependent; pop both from sys.modules in teardown.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/eval/test_eval_ui_utils.py` around lines 33 - 56, The Plotly stub
cleanup only removes plotly modules, but ui_common and chart_utils also stay
cached after importing with the stubbed plotly.express, making later tests
order-dependent. Update the _cleanup_plotly_stub fixture in
test_eval_ui_utils.py to also remove the cached nemoguardrails.eval.ui.common
and nemoguardrails.eval.ui.chart_utils modules during teardown, alongside the
existing plotly modules, so future imports rebind the real dependencies.
There was a problem hiding this comment.
is this resolved now?
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
printf '\n== current cleanup fixture ==\n'
sed -n '1,90p' tests/eval/test_eval_ui_utils.py | cat -n
printf '\n== exact sys.modules cleanup lines ==\n'
rg -n -C2 'sys\.modules|_cleanup_plotly_stub|nemoguardrails\.eval\.ui\.(common|chart_utils)' tests/eval/test_eval_ui_utils.py
printf '\n== any explicit module cache cleanup for UI modules ==\n'
python - <<'PY'
from pathlib import Path
text = Path("tests/eval/test_eval_ui_utils.py").read_text()
targets = [
"nemoguardrails.eval.ui.common",
"nemoguardrails.eval.ui.chart_utils",
"nemoguardrails.eval.ui.utils",
"nemoguardrails.eval.ui.streamlit_utils",
"nemoguardrails.eval.ui.README",
]
for t in targets:
print(f"{t}: {'YES' if t in text else 'NO'}")
PYLength of output: 6754
@Pouyanpi Yes — this looks resolved now.
tests/eval/test_eval_ui_utils.py no longer only removes the plotly* stubs. In _cleanup_plotly_stub(), Line 57 through Line 59 also pop:
nemoguardrails.eval.ui.commonnemoguardrails.eval.ui.chart_utils
That addresses the order-dependent caching issue I called out, since later imports can now rebind against the real plotly.express when it exists.
🧠 Learnings used
Learnt from: Pouyanpi
Repo: NVIDIA-NeMo/Guardrails PR: 1814
File: tests/integrations/langchain/test_openai_param_filter.py:51-52
Timestamp: 2026-04-23T09:43:00.117Z
Learning: In NVIDIA-NeMo/Guardrails tests, recognize that env-based feature flags may be intentionally implemented using the truthiness pattern `os.environ.get("VAR_NAME") or os.environ.get("OTHER_VAR")` (where the env var values are treated as non-empty strings, e.g., `LIVE_TEST_MODE`, `TEST_LIVE_MODE`). During code review, avoid requesting per-file boolean normalization changes for this pattern; any correction should be handled as a repository-wide cleanup rather than a localized refactor in a single test file.
Learnt from: Pouyanpi
Repo: NVIDIA-NeMo/Guardrails PR: 1970
File: tests/v2_x/test_state_serialization.py:144-144
Timestamp: 2026-06-04T08:31:21.557Z
Learning: In the NVIDIA-NeMo/Guardrails repo, do not include wall-clock/performance timing assertions as part of the unit-test suite (e.g., tests that measure execution time and fail based on a threshold). Such tests are inherently flaky on shared CI and provide little/none functional coverage. During review, require that any existing wall-clock performance test be permanently removed or kept skipped (e.g., `pytest.mark.skip(reason="Flaky wall-clock performance assertion.")`). Do not suggest re-enabling or unskipping these tests under any circumstances.
| def test_moderation_get_jailbreak_results_records_error_after_retries(): | ||
| evaluator = _moderation_evaluator() | ||
| results = {"flagged": 0, "correct": 0, "error": 0, "label": "yes"} | ||
|
|
||
| with patch( | ||
| "nemoguardrails.evaluate.evaluate_moderation.llm_call", | ||
| AsyncMock(side_effect=RuntimeError("failed")), | ||
| ): | ||
| prediction, updated = evaluator.get_jailbreak_results("prompt", results) | ||
|
|
||
| assert prediction is None | ||
| assert updated["error"] == 1 |
There was a problem hiding this comment.
🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win
Assert the retry loop actually exhausts max_tries.
Both tests only check the terminal error state. They still pass if the implementation stops after the first failure, so the retry behavior this PR is supposed to cover is not actually locked in. Capture the mock and assert three awaits for llm_call here and two calls to evaluator.llm in the hallucination case.
As per coding guidelines, "For maintainer-approved refactors, add or update characterization tests before changing subtle behavior; when the existing suite does not cover the refactored code, keep the equivalence check used to prove behavior is unchanged."
Also applies to: 235-239
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/evaluate/test_evaluate_runtime_classes.py` around lines 88 - 99, The
retry behavior in get_jailbreak_results is not being verified, since the test
only checks the final error state and would still pass if retries stop early.
Update the test around _moderation_evaluator and
nemoguardrails.evaluate.evaluate_moderation.llm_call to capture the mock and
assert it is awaited three times, so the max_tries loop is actually exercised.
Also apply the same characterization check in the hallucination test by
asserting two calls to evaluator.llm, using the existing evaluator methods and
mocks to confirm the retry path really exhausts the configured attempts.
Source: Coding guidelines
There was a problem hiding this comment.
is this resolved now?
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
FILE="tests/evaluate/test_evaluate_runtime_classes.py"
ast-grep outline "$FILE" --view expanded || true
printf '\n===== moderation section =====\n'
sed -n '60,130p' "$FILE"
printf '\n===== hallucination section =====\n'
sed -n '210,260p' "$FILE"Length of output: 6520
@Pouyanpi Yes — this looks resolved now.
I checked the current file, and both retry-path assertions are present:
test_moderation_get_jailbreak_results_records_error_after_retries()now assertsmock_llm_call.await_count == 3test_hallucination_get_response_with_retries_returns_none_after_errors()now assertsevaluator.llm.call_count == 2
That closes the gap I called out: the tests now verify the retry loops actually exhaust the configured attempts instead of only checking the terminal error state.
🧠 Learnings used
Learnt from: Pouyanpi
Repo: NVIDIA-NeMo/Guardrails PR: 1814
File: tests/integrations/langchain/test_openai_param_filter.py:51-52
Timestamp: 2026-04-23T09:43:00.117Z
Learning: In NVIDIA-NeMo/Guardrails tests, recognize that env-based feature flags may be intentionally implemented using the truthiness pattern `os.environ.get("VAR_NAME") or os.environ.get("OTHER_VAR")` (where the env var values are treated as non-empty strings, e.g., `LIVE_TEST_MODE`, `TEST_LIVE_MODE`). During code review, avoid requesting per-file boolean normalization changes for this pattern; any correction should be handled as a repository-wide cleanup rather than a localized refactor in a single test file.
Learnt from: Pouyanpi
Repo: NVIDIA-NeMo/Guardrails PR: 1970
File: tests/v2_x/test_state_serialization.py:144-144
Timestamp: 2026-06-04T08:31:21.557Z
Learning: In the NVIDIA-NeMo/Guardrails repo, do not include wall-clock/performance timing assertions as part of the unit-test suite (e.g., tests that measure execution time and fail based on a threshold). Such tests are inherently flaky on shared CI and provide little/none functional coverage. During review, require that any existing wall-clock performance test be permanently removed or kept skipped (e.g., `pytest.mark.skip(reason="Flaky wall-clock performance assertion.")`). Do not suggest re-enabling or unskipping these tests under any circumstances.
| def test_factcheck_run_creates_negative_samples_when_enabled(tmp_path): | ||
| evaluator = _factcheck_evaluator() | ||
| evaluator.create_negatives = True | ||
| evaluator.create_negative_samples = AsyncMock(return_value=evaluator.dataset) | ||
| evaluator.check_facts = MagicMock( | ||
| side_effect=[ | ||
| ([{"label": "yes"}], 1, 0.1), | ||
| ([{"label": "no"}], 1, 0.2), | ||
| ] | ||
| ) | ||
|
|
||
| evaluator.run() | ||
|
|
||
| evaluator.create_negative_samples.assert_awaited_once_with(evaluator.dataset) | ||
| assert evaluator.check_facts.call_count == 2 |
There was a problem hiding this comment.
🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win
Make this test prove the returned negative-sample dataset is used.
create_negative_samples returns evaluator.dataset unchanged here, so this still passes if run() forgets to assign the coroutine result back before calling check_facts. Returning a distinct sentinel dataset and asserting evaluator.dataset (or what check_facts consumes) would cover the real contract.
As per coding guidelines, "For maintainer-approved refactors, add or update characterization tests before changing subtle behavior; when the existing suite does not cover the refactored code, keep the equivalence check used to prove behavior is unchanged."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/evaluate/test_evaluate_runtime_classes.py` around lines 390 - 404, The
fact-check runtime test currently does not verify that the value returned by
create_negative_samples is actually used by run() before check_facts is called.
Update test_factcheck_run_creates_negative_samples_when_enabled to have
create_negative_samples return a distinct sentinel dataset and assert that the
evaluator’s dataset, or the argument consumed by check_facts, reflects that
returned value. Keep the existing assertions around create_negative_samples and
check_facts call count so the test proves the coroutine result is assigned back
into the flow.
Source: Coding guidelines
| @pytest.fixture(autouse=True) | ||
| def _disable_chat_ui(monkeypatch): | ||
| if "NEMO_GUARDRAILS_DISABLE_CHAT_UI" not in os.environ: | ||
| monkeypatch.setenv("NEMO_GUARDRAILS_DISABLE_CHAT_UI", "true") |
There was a problem hiding this comment.
🩺 Stability & Availability | 🟡 Minor | ⚡ Quick win
Make the fixture override the env unconditionally.
Lines 23-24 currently respect any pre-existing NEMO_GUARDRAILS_DISABLE_CHAT_UI value, so a developer/CI env set to "false" will silently re-enable the chat UI and make this test directory non-hermetic. Force "true" here so the server tests always exercise the intended isolated path.
Suggested change
-import os
-
import pytest
`@pytest.fixture`(autouse=True)
def _disable_chat_ui(monkeypatch):
- if "NEMO_GUARDRAILS_DISABLE_CHAT_UI" not in os.environ:
- monkeypatch.setenv("NEMO_GUARDRAILS_DISABLE_CHAT_UI", "true")
+ monkeypatch.setenv("NEMO_GUARDRAILS_DISABLE_CHAT_UI", "true")📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| @pytest.fixture(autouse=True) | |
| def _disable_chat_ui(monkeypatch): | |
| if "NEMO_GUARDRAILS_DISABLE_CHAT_UI" not in os.environ: | |
| monkeypatch.setenv("NEMO_GUARDRAILS_DISABLE_CHAT_UI", "true") | |
| import pytest | |
| `@pytest.fixture`(autouse=True) | |
| def _disable_chat_ui(monkeypatch): | |
| monkeypatch.setenv("NEMO_GUARDRAILS_DISABLE_CHAT_UI", "true") |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/server/conftest.py` around lines 21 - 24, The autouse fixture
`_disable_chat_ui` should force the test environment into the isolated path
instead of preserving any pre-existing value. Update the fixture so it always
sets `NEMO_GUARDRAILS_DISABLE_CHAT_UI` to `"true"` via `monkeypatch.setenv`,
regardless of whether the variable is already present in `os.environ`, to keep
the server tests hermetic.
Source: Coding guidelines
There was a problem hiding this comment.
is this resolved now?
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
ast-grep outline tests/server/conftest.py --view expanded || true
echo "----- tests/server/conftest.py -----"
sed -n '1,80p' tests/server/conftest.py
echo "----- search for NEMO_GUARDRAILS_DISABLE_CHAT_UI in tests/server -----"
rg -n -C2 'NEMO_GUARDRAILS_DISABLE_CHAT_UI' tests/serverLength of output: 1886
@Pouyanpi Yes — this looks resolved now.
I checked the current tests/server/conftest.py, and at Line 23 the fixture now unconditionally does:
monkeypatch.setenv("NEMO_GUARDRAILS_DISABLE_CHAT_UI", "true")That addresses the original concern about an ambient "false" value re-enabling the chat UI during server tests.
🧠 Learnings used
Learnt from: Pouyanpi
Repo: NVIDIA-NeMo/Guardrails PR: 1814
File: tests/integrations/langchain/test_openai_param_filter.py:51-52
Timestamp: 2026-04-23T09:43:00.117Z
Learning: In NVIDIA-NeMo/Guardrails tests, recognize that env-based feature flags may be intentionally implemented using the truthiness pattern `os.environ.get("VAR_NAME") or os.environ.get("OTHER_VAR")` (where the env var values are treated as non-empty strings, e.g., `LIVE_TEST_MODE`, `TEST_LIVE_MODE`). During code review, avoid requesting per-file boolean normalization changes for this pattern; any correction should be handled as a repository-wide cleanup rather than a localized refactor in a single test file.
Learnt from: Pouyanpi
Repo: NVIDIA-NeMo/Guardrails PR: 1970
File: tests/v2_x/test_state_serialization.py:144-144
Timestamp: 2026-06-04T08:31:21.557Z
Learning: In the NVIDIA-NeMo/Guardrails repo, do not include wall-clock/performance timing assertions as part of the unit-test suite (e.g., tests that measure execution time and fail based on a threshold). Such tests are inherently flaky on shared CI and provide little/none functional coverage. During review, require that any existing wall-clock performance test be permanently removed or kept skipped (e.g., `pytest.mark.skip(reason="Flaky wall-clock performance assertion.")`). Do not suggest re-enabling or unskipping these tests under any circumstances.
| mock_client.__aexit__ = AsyncMock(return_value=False) | ||
|
|
||
| with patch.dict(os.environ, {"MAIN_MODEL_BASE_URL": "http://localhost:8000"}): | ||
| with patch.dict(os.environ, {"MAIN_MODEL_BASE_URL": "http://localhost:8000", "OPENAI_API_KEY": ""}): |
There was a problem hiding this comment.
🩺 Stability & Availability | 🟡 Minor | ⚡ Quick win
Pin MAIN_MODEL_ENGINE in these route tests.
/v1/models chooses the provider from MAIN_MODEL_ENGINE before it reaches fetch_models, so leaving that env var uncontrolled makes both tests order-dependent if another test has changed it. Add "MAIN_MODEL_ENGINE": "openai" to the patched env so the auth assertions stay deterministic.
Also applies to: 1051-1056
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/server/test_api.py` at line 1031, Pin the provider environment in the
`/v1/models` route tests by setting MAIN_MODEL_ENGINE to openai in the
patch.dict blocks used around the affected assertions. The tests currently rely
on whatever MAIN_MODEL_ENGINE was left by prior tests, so update the env setup
in the relevant test cases in test_api.py to control that value alongside
MAIN_MODEL_BASE_URL and OPENAI_API_KEY. Use the existing route test setup around
fetch_models and the /v1/models auth assertions as the place to make the change.
There was a problem hiding this comment.
is this resolved now?
There was a problem hiding this comment.
@Pouyanpi Yes — this looks resolved now.
Both affected /v1/models auth tests in tests/server/test_api.py now explicitly pin MAIN_MODEL_ENGINE to openai, and they also control OPENAI_API_KEY, so the assertions no longer depend on env state leaked from earlier tests.
🧠 Learnings used
Learnt from: Pouyanpi
Repo: NVIDIA-NeMo/Guardrails PR: 1814
File: tests/integrations/langchain/test_openai_param_filter.py:51-52
Timestamp: 2026-04-23T09:43:00.117Z
Learning: In NVIDIA-NeMo/Guardrails tests, recognize that env-based feature flags may be intentionally implemented using the truthiness pattern `os.environ.get("VAR_NAME") or os.environ.get("OTHER_VAR")` (where the env var values are treated as non-empty strings, e.g., `LIVE_TEST_MODE`, `TEST_LIVE_MODE`). During code review, avoid requesting per-file boolean normalization changes for this pattern; any correction should be handled as a repository-wide cleanup rather than a localized refactor in a single test file.
Learnt from: Pouyanpi
Repo: NVIDIA-NeMo/Guardrails PR: 1970
File: tests/v2_x/test_state_serialization.py:144-144
Timestamp: 2026-06-04T08:31:21.557Z
Learning: In the NVIDIA-NeMo/Guardrails repo, do not include wall-clock/performance timing assertions as part of the unit-test suite (e.g., tests that measure execution time and fail based on a threshold). Such tests are inherently flaky on shared CI and provide little/none functional coverage. During review, require that any existing wall-clock performance test be permanently removed or kept skipped (e.g., `pytest.mark.skip(reason="Flaky wall-clock performance assertion.")`). Do not suggest re-enabling or unskipping these tests under any circumstances.
| os.environ.setdefault("MAIN_MODEL_BASE_URL", "https://api.openai.com") | ||
| os.environ["MAIN_MODEL_ENGINE"] = "openai" |
There was a problem hiding this comment.
🩺 Stability & Availability | 🟡 Minor | ⚡ Quick win
Restore MAIN_MODEL_ENGINE after each live test.
These tests mutate os.environ directly, so the selected provider can leak into later tests and make failures order-dependent. Use monkeypatch.setenv(...) or patch.dict(...) here instead of bare assignments.
Also applies to: 347-348, 368-368, 387-387, 406-406
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/server/test_openai_integration.py` around lines 328 - 329, The live
OpenAI integration tests are mutating global environment state by assigning
MAIN_MODEL_ENGINE directly, which can leak the provider selection into later
tests. Update each affected test block in test_openai_integration.py to use
monkeypatch.setenv(...) or patch.dict(...) instead of bare os.environ
assignments, and ensure the environment is automatically restored after each
test; apply the same fix at all mentioned MAIN_MODEL_ENGINE writes.
There was a problem hiding this comment.
is this resolved now?
There was a problem hiding this comment.
@Pouyanpi Yes — for the originally flagged live tests, this looks resolved now.
The affected test_list_models_* cases in tests/server/test_openai_integration.py now use monkeypatch.setenv(...) instead of bare os.environ[...] = ..., so MAIN_MODEL_ENGINE / MAIN_MODEL_BASE_URL are automatically restored after each test and won’t leak into later tests.
Also, I checked the LIVE_TEST_MODE or TEST_LIVE_MODE gating pattern against prior repo context, and that pattern is consistent with existing conventions here.
🧠 Learnings used
Learnt from: Pouyanpi
Repo: NVIDIA-NeMo/Guardrails PR: 1814
File: tests/integrations/langchain/test_openai_param_filter.py:51-52
Timestamp: 2026-04-23T09:43:00.117Z
Learning: In NVIDIA-NeMo/Guardrails tests, recognize that env-based feature flags may be intentionally implemented using the truthiness pattern `os.environ.get("VAR_NAME") or os.environ.get("OTHER_VAR")` (where the env var values are treated as non-empty strings, e.g., `LIVE_TEST_MODE`, `TEST_LIVE_MODE`). During code review, avoid requesting per-file boolean normalization changes for this pattern; any correction should be handled as a repository-wide cleanup rather than a localized refactor in a single test file.
Learnt from: Pouyanpi
Repo: NVIDIA-NeMo/Guardrails PR: 1970
File: tests/v2_x/test_state_serialization.py:144-144
Timestamp: 2026-06-04T08:31:21.557Z
Learning: In the NVIDIA-NeMo/Guardrails repo, do not include wall-clock/performance timing assertions as part of the unit-test suite (e.g., tests that measure execution time and fail based on a threshold). Such tests are inherently flaky on shared CI and provide little/none functional coverage. During review, require that any existing wall-clock performance test be permanently removed or kept skipped (e.g., `pytest.mark.skip(reason="Flaky wall-clock performance assertion.")`). Do not suggest re-enabling or unskipping these tests under any circumstances.
| patch_llm_call( | ||
| monkeypatch, | ||
| 'user intent: ask help\nbot intent: provide help\nbot action: bot say "Here"', | ||
| "assistant response only", | ||
| ) | ||
|
|
||
| result = await actions.generate_user_intent_and_bot_action(state, [], "help") | ||
|
|
||
| assert result == { | ||
| "user_intent": "ask help", | ||
| "bot_intent": "provide help", | ||
| "bot_action": 'bot say "Here"', | ||
| } | ||
| monkeypatch.setattr(v2_generation, "get_first_bot_action", lambda lines: None) | ||
| with pytest.raises(LlmResponseError): |
There was a problem hiding this comment.
🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win
Tighten the parser-error fixtures.
Both negative-path checks use an already malformed LLM payload and patch get_first_bot_action() to return None. That means these assertions can pass because of some earlier parse failure, not necessarily because bot-action extraction failed. Reuse a syntactically valid response for the failing call and vary only the targeted extractor/field so the test actually pins that contract.
Also applies to: 311-326
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/v2_x/test_generation_actions_unit.py` around lines 207 - 221, Tighten
the negative-path fixtures in generate_user_intent_and_bot_action tests so each
assertion isolates the intended failure mode: use a syntactically valid LLM
payload for the call that should fail, then vary only the specific extractor or
field being exercised. Update the affected test cases around patch_llm_call,
get_first_bot_action, and the LlmResponseError assertions so failures come from
bot-action extraction/parsing behavior rather than an already malformed
response.
| with pytest.raises(Exception, match="Invalid LLM response"): | ||
| await actions.generate_value(state, "make bad", [], var_name="bad") |
There was a problem hiding this comment.
🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win
Don't assert on base Exception here.
This will pass for unrelated failures in prompt rendering/parsing too, so it doesn't protect the invalid-response contract. Assert the specific exception type raised by generate_value(...) instead.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/v2_x/test_generation_actions_unit.py` around lines 361 - 362, The test
for generate_value currently catches base Exception, which is too broad and can
hide unrelated failures. Update this assertion to expect the specific exception
type raised by actions.generate_value for invalid LLM responses, using the
generate_value() path and the invalid-response contract message to locate the
correct exception class.
Add focused unit tests across the eval and evaluate pipelines, the FastAPI server and chat app, the math and validation actions, and the v2_x generation actions. Fix LLMJudgeComplianceChecker model initialization to use init_llm_model; the prior LLMRails.get_model_cls_and_kwargs path was removed in an earlier refactor.
0458b87 to
53cc53b
Compare
|
@CodeRabbit help |
ChatThere are 3 ways to chat with CodeRabbit:
CodeRabbit commands
Other keywords and placeholders
Status, support, documentation and community
|
Description
Add focused unit tests across the eval and evaluate pipelines, the FastAPI server and chat app, the math and validation actions, and the v2_x generation actions.
Fix LLMJudgeComplianceChecker model initialization to use init_llm_model; the prior LLMRails.get_model_cls_and_kwargs path was removed in an earlier refactor.
Related Issue(s)
Verification
AI Assistance
Checklist
Summary by CodeRabbit
Bug Fixes
New Features