Clean up unit tests and enable coverage reporting #29
Conversation
Reduce low-signal duplication in data type suites so failures map to real behavior contracts instead of repeated equality/container boilerplate.
Align test names and fixtures with actual behavior contracts, and replace dataset-size-dependent assertions with deterministic filter-semantic checks.
Add high-signal negative and round-trip assertions across type, config, comparator, CLI, and data reader suites to harden behavior contracts and reduce brittle assumptions.
Preserve JSON string content during canonicalization and re-enable skipped network trace, JsonString, and Number hash coverage with deterministic assertions.
Run pytest with coverage reports in the test task and upload coverage.xml as a workflow artifact without enforcing a fail-under threshold.
Replace brittle skip gates with curated high-signal case generation for navigation and retrieval suites, keeping deterministic coverage while eliminating runtime skips.
Use a tuple-typed model for tuple serialization coverage so static type checks pass without weakening the test intent.

wip/unitest-tasks.md (outdated)

    @@ -0,0 +1,179 @@
    # Unit Test Cleanup Tasks

    "Navigation evaluation tests are unstable due to regex URL templates and evaluator strictness. "
    "See NEW_TESTS_ISSUES.md."
    )
    CURATED_NAVIGATION_TASK_IDS = (

We should evaluate each task at least once.
Pull request overview
This PR tightens and de-flakes the Python test suite while enabling coverage reporting in CI, aiming to increase signal and make failures more actionable.
Changes:
- Enable pytest-cov in the invoke test task and upload coverage.xml from CI.
- Replace/adjust skipped, redundant, or low-signal tests with deterministic, higher-signal assertions (including curated API evaluation cases).
- Strengthen unit coverage around normalization/serialization, CLI output contracts, config URL round-trips, and data-reader filtering.
Reviewed changes
Copilot reviewed 18 out of 18 changed files in this pull request and generated 2 comments.
Summary per file:
| File | Description |
|---|---|
| wip/unitest-tasks.md | Tracks test-suite cleanup actions and rationale for the refactors in this PR. |
| tests/types/test_network_trace.py | Unskips and de-brittles assertions around NetworkTrace.evaluation_events. |
| tests/types/test_eval_types.py | Adds typed-field serialization coverage relying on NormalizedType's production Pydantic serializer. |
| tests/types/test_config.py | Adds deterministic render_url/derender_url round-trip tests (single + list inputs). |
| tests/types/test_agent_response.py | Adds negative-path ValidationError coverage and alias precedence checks. |
| tests/core/evaluation/test_value_comparator.py | Adds edge-case comparator tests and renames mischaracterized "performance" test. |
| tests/core/evaluation/data_types/test_markdown_string.py | Removes low-signal container/law boilerplate tests, keeps behavior-focused coverage. |
| tests/core/evaluation/data_types/test_json_string.py | Removes duplicate normalization tests; unskips and enforces Unicode/special-char behavior. |
| tests/core/evaluation/data_types/test_base64_string.py | Removes redundant container and equality-law boilerplate tests. |
| tests/core/evaluation/data_types/test_base.py | Simplifies/strengthens equality semantics tests; unskips previously skipped hash case. |
| tests/cli/test_eval_commands.py | Adds filesystem output-contract assertions for eval runs (per-task + summary behavior). |
| tests/cli/test_dataset_commands.py | Strengthens JSON output and --fields parsing/normalization expectations. |
| tests/api/test_evaluation_api_retrieval_tasks.py | Replaces runtime skips with generation-time pruning for selected valid variations. |
| tests/api/test_evaluation_api_navigation_tasks.py | Removes blanket skip; replaces combinatorial generation with curated task/variation matrix. |
| tests/api/test_data_reader.py | Replaces brittle min-count assertions with deterministic filter equivalence checks + synthetic dataset expectations. |
| src/webarena_verified/core/evaluation/data_types/json_string.py | Overrides normalization pipeline to avoid string pre-normalization before JSON parsing/dumping. |
| dev/code_tasks.py | Enables coverage collection/report generation in the inv dev.code.test task. |
| .github/workflows/test.yml | Uploads coverage.xml artifact from CI. |
        # Dump to compact string with sorted keys (top-level only)
        return json.dumps(parsed, sort_keys=True, separators=(",", ":"), ensure_ascii=False)

    def _normalize_pipeline(self, value: Any) -> str:
        """Override base pipeline to preserve raw JSON string content.
The comment/doc here says key sorting is “top-level only”, but json.dumps(..., sort_keys=True) sorts keys recursively (nested dicts too). Since tests now assert nested keys are sorted, please update the inline comment/docstring to match the actual normalization behavior to avoid confusion for future maintainers.
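A quick check confirms the recursive behavior the reviewer describes: `sort_keys=True` orders dictionary keys at every nesting level, including dicts inside lists, not just at the top level.

```python
import json

# sort_keys=True orders keys recursively, including nested dicts
payload = {"b": {"d": 1, "c": 2}, "a": [{"z": 0, "y": 9}]}
canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
print(canonical)  # {"a":[{"y":9,"z":0}],"b":{"c":2,"d":1}}
```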
    @pytest.mark.parametrize(
        ("data_type", "value1", "value2"),
        ("left", "right"),
        [
            # Single values
            (NormalizedString, "success", "success"),
            (Number, 10, 10),
            # Alternatives with overlap
            (NormalizedString, ["success", "ok"], "success"),
            (NormalizedString, ["success", "ok"], ["success", "completed"]),
            (Number, [10, 17], 10),
            (Number, [10, 17], [17, 25]),
            # Multiple overlaps
            (NormalizedString, ["success", "ok"], ["success", "ok"]),
            (Number, [10, 17, 25], [17, 25, 30]),
            (NormalizedString(["success", "ok"]), NormalizedString("success")),
            (Number([10, 17]), Number([17, 25])),
This parametrization constructs NormalizedType instances at import/collection time. If normalization rules change or an input becomes invalid, pytest will error during collection instead of reporting a single failing test case. Consider parametrizing raw inputs (type + values) and instantiating inside the test body so failures are isolated and test IDs stay clearer.
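A minimal sketch of the suggested pattern, using a hypothetical stand-in class (`FakeAlternatives` is not the project's NormalizedString/Number; it only assumes those types accept a scalar or a list of alternatives): parametrize raw values and construct instances inside the test body.

```python
import pytest

class FakeAlternatives:
    """Hypothetical stand-in for a NormalizedType subclass (assumption:
    the real types accept a scalar or a list of alternative values)."""

    def __init__(self, value):
        self.values = value if isinstance(value, list) else [value]

    def overlaps(self, other: "FakeAlternatives") -> bool:
        return bool(set(self.values) & set(other.values))

# Raw inputs only: instances are built inside the test body, so an input
# that becomes invalid fails a single test case instead of aborting
# collection, and the generated test IDs stay readable.
@pytest.mark.parametrize(
    ("left_raw", "right_raw"),
    [
        (["success", "ok"], "success"),
        ([10, 17], [17, 25]),
    ],
)
def test_alternatives_overlap(left_raw, right_raw):
    left = FakeAlternatives(left_raw)
    right = FakeAlternatives(right_raw)
    assert left.overlaps(right)
```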
Remove the temporary wip task document and add navigation smoke coverage that evaluates every navigation task at least once while keeping high-signal strict assertions on a curated subset.
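The smoke-versus-strict split described above can be sketched roughly as follows; the task IDs, `evaluate()` stub, and status values are illustrative assumptions, not the project's actual API.

```python
# Illustrative sketch only: IDs and statuses are assumptions.
CURATED_NAVIGATION_TASK_IDS = frozenset({101, 205})  # hypothetical strict subset

def evaluate(task_id: int) -> dict:
    """Hypothetical evaluator stub standing in for a real navigation run."""
    return {"status": "PASS"}

def check_navigation_task(task_id: int) -> None:
    result = evaluate(task_id)
    if task_id in CURATED_NAVIGATION_TASK_IDS:
        # Strict assertions on the curated subset: must pass outright.
        assert result["status"] == "PASS"
    else:
        # Smoke coverage: every other task only has to evaluate cleanly.
        assert result["status"] in {"PASS", "FAIL", "ERROR"}
```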
Pull request overview
Copilot reviewed 17 out of 17 changed files in this pull request and generated 3 comments.
        # Dump to compact string with sorted keys (top-level only)
        return json.dumps(parsed, sort_keys=True, separators=(",", ":"), ensure_ascii=False)

    def _normalize_pipeline(self, value: Any) -> str:
        """Override base pipeline to preserve raw JSON string content.

        Json payloads must not be transliterated, case-folded, or otherwise pre-normalized
        before parsing. We only apply JSON parsing + canonical dumping in `_type_normalize`.
        """
The in-code comments/docstrings still state JsonString sorts keys “top-level only”, but the implementation uses json.dumps(sort_keys=True) (which sorts nested keys too) and the tests now assert nested sorting. Please update the comment/docstring to reflect actual behavior to avoid misleading future changes (or change the implementation if top-level-only is still desired).
    def _has_single_item_list(value: Any) -> bool:
        """Detect single-item list structures that represent malformed alternatives."""
        if isinstance(value, list):
            if len(value) == 1:
                return True
            return any(_has_single_item_list(item) for item in value)

        if isinstance(value, dict):
            return any(_has_single_item_list(item) for item in value.values())

        return False


    def _is_malformed_network_expected(task_id: int, dataset: MappingProxyType[int, MappingProxyType[str, Any]]) -> bool:
        """Return True when the task's expected network config is currently malformed."""
        network_config = _get_network_event_config(task_id, dataset)
        expected = network_config.get("expected", {})
        return _has_single_item_list(expected.get("post_data"))
_has_single_item_list treats any list of length 1 anywhere inside post_data as “malformed alternatives”. This is too broad: many tasks legitimately have single-element arrays (and may even include post_data_schema marking them as arrays), so the smoke test would incorrectly allow ERROR for tasks that should be expected to stay executable. Consider detecting “malformed” by attempting the same normalization the evaluator uses (e.g., ValueNormalizer with the task’s post_data_schema) and only flagging the specific failure mode (single-item alternatives) rather than any single-element array.
    project_root = Path(metafunc.config.rootpath)
    dataset = _load_dataset(project_root)
    navigation_task_ids = _get_navigation_task_ids(dataset)
pytest_generate_tests reloads and parses the full dataset JSON each time it’s called during collection. Since this hook runs once per collected test function, consider caching _load_dataset() (e.g., module-level cache or functools.lru_cache) to avoid repeated I/O/JSON parsing and keep test collection fast as the dataset grows.
Summary

- Upload coverage.xml as an artifact.

Validation

- uv run pytest --ignore=tests/dataset/test_hf_dataset.py --ignore=tests/integration/environment_control/ --ignore=tests/integration/environments/ -q -rs
- uv run pytest --cov=src/webarena_verified --cov-report=term-missing --cov-report=xml --ignore=tests/dataset/test_hf_dataset.py --ignore=tests/integration/environment_control/ --ignore=tests/integration/environments/ -q -rs