Clean up unit tests and enable coverage reporting #29
Conversation
Reduce low-signal duplication in data type suites so failures map to real behavior contracts instead of repeated equality/container boilerplate.
Align test names and fixtures with actual behavior contracts, and replace dataset-size-dependent assertions with deterministic filter-semantic checks.
Add high-signal negative and round-trip assertions across type, config, comparator, CLI, and data reader suites to harden behavior contracts and reduce brittle assumptions.
Preserve JSON string content during canonicalization and re-enable skipped network trace, JsonString, and Number hash coverage with deterministic assertions.
Run pytest with coverage reports in the test task and upload coverage.xml as a workflow artifact without enforcing a fail-under threshold.
Replace brittle skip gates with curated high-signal case generation for navigation and retrieval suites, keeping deterministic coverage while eliminating runtime skips.
Use a tuple-typed model for tuple serialization coverage so static type checks pass without weakening the test intent.

wip/unitest-tasks.md (outdated)

    @@ -0,0 +1,179 @@
    # Unit Test Cleanup Tasks

    "Navigation evaluation tests are unstable due to regex URL templates and evaluator strictness. "
    "See NEW_TESTS_ISSUES.md."
    )
    CURATED_NAVIGATION_TASK_IDS = (

We should evaluate each task at least once.
Pull request overview
This PR tightens and de-flakes the Python test suite while enabling coverage reporting in CI, aiming to increase signal and make failures more actionable.
Changes:
- Enable pytest-cov in the invoke test task and upload coverage.xml from CI.
- Replace/adjust skipped, redundant, or low-signal tests with deterministic, higher-signal assertions (including curated API evaluation cases).
- Strengthen unit coverage around normalization/serialization, CLI output contracts, config URL round-trips, and data-reader filtering.
Reviewed changes
Copilot reviewed 18 out of 18 changed files in this pull request and generated 2 comments.
Summary per file:
| File | Description |
|---|---|
| wip/unitest-tasks.md | Tracks test-suite cleanup actions and rationale for the refactors in this PR. |
| tests/types/test_network_trace.py | Unskips and de-brittles assertions around NetworkTrace.evaluation_events. |
| tests/types/test_eval_types.py | Adds typed-field serialization coverage relying on NormalizedType's production Pydantic serializer. |
| tests/types/test_config.py | Adds deterministic render_url/derender_url round-trip tests (single + list inputs). |
| tests/types/test_agent_response.py | Adds negative-path ValidationError coverage and alias precedence checks. |
| tests/core/evaluation/test_value_comparator.py | Adds edge-case comparator tests and renames mischaracterized "performance" test. |
| tests/core/evaluation/data_types/test_markdown_string.py | Removes low-signal container/law boilerplate tests, keeps behavior-focused coverage. |
| tests/core/evaluation/data_types/test_json_string.py | Removes duplicate normalization tests; unskips and enforces Unicode/special-char behavior. |
| tests/core/evaluation/data_types/test_base64_string.py | Removes redundant container and equality-law boilerplate tests. |
| tests/core/evaluation/data_types/test_base.py | Simplifies/strengthens equality semantics tests; unskips previously skipped hash case. |
| tests/cli/test_eval_commands.py | Adds filesystem output-contract assertions for eval runs (per-task + summary behavior). |
| tests/cli/test_dataset_commands.py | Strengthens JSON output and --fields parsing/normalization expectations. |
| tests/api/test_evaluation_api_retrieval_tasks.py | Replaces runtime skips with generation-time pruning for selected valid variations. |
| tests/api/test_evaluation_api_navigation_tasks.py | Removes blanket skip; replaces combinatorial generation with curated task/variation matrix. |
| tests/api/test_data_reader.py | Replaces brittle min-count assertions with deterministic filter equivalence checks + synthetic dataset expectations. |
| src/webarena_verified/core/evaluation/data_types/json_string.py | Overrides normalization pipeline to avoid string pre-normalization before JSON parsing/dumping. |
| dev/code_tasks.py | Enables coverage collection/report generation in the inv dev.code.test task. |
| .github/workflows/test.yml | Uploads coverage.xml artifact from CI. |
        # Dump to compact string with sorted keys (top-level only)
        return json.dumps(parsed, sort_keys=True, separators=(",", ":"), ensure_ascii=False)

    def _normalize_pipeline(self, value: Any) -> str:
        """Override base pipeline to preserve raw JSON string content.
The comment/doc here says key sorting is “top-level only”, but json.dumps(..., sort_keys=True) sorts keys recursively (nested dicts too). Since tests now assert nested keys are sorted, please update the inline comment/docstring to match the actual normalization behavior to avoid confusion for future maintainers.
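A quick check confirms the recursive behavior the reviewer describes: `sort_keys=True` orders dictionary keys at every nesting level, including dicts inside lists, not just at the top level.

```python
import json

# sort_keys=True orders keys recursively, including nested dicts
payload = {"b": {"d": 1, "c": 2}, "a": [{"z": 0, "y": 9}]}
canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
print(canonical)  # {"a":[{"y":9,"z":0}],"b":{"c":2,"d":1}}
```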
    @pytest.mark.parametrize(
        ("data_type", "value1", "value2"),
        ("left", "right"),
        [
            # Single values
            (NormalizedString, "success", "success"),
            (Number, 10, 10),
            # Alternatives with overlap
            (NormalizedString, ["success", "ok"], "success"),
            (NormalizedString, ["success", "ok"], ["success", "completed"]),
            (Number, [10, 17], 10),
            (Number, [10, 17], [17, 25]),
            # Multiple overlaps
            (NormalizedString, ["success", "ok"], ["success", "ok"]),
            (Number, [10, 17, 25], [17, 25, 30]),
            (NormalizedString(["success", "ok"]), NormalizedString("success")),
            (Number([10, 17]), Number([17, 25])),
This parametrization constructs NormalizedType instances at import/collection time. If normalization rules change or an input becomes invalid, pytest will error during collection instead of reporting a single failing test case. Consider parametrizing raw inputs (type + values) and instantiating inside the test body so failures are isolated and test IDs stay clearer.
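A minimal sketch of the suggested pattern, using a hypothetical stand-in class (`FakeAlternatives` is not the project's NormalizedString/Number; it only assumes those types accept a scalar or a list of alternatives): parametrize raw values and construct instances inside the test body.

```python
import pytest

class FakeAlternatives:
    """Hypothetical stand-in for a NormalizedType subclass (assumption:
    the real types accept a scalar or a list of alternative values)."""

    def __init__(self, value):
        self.values = value if isinstance(value, list) else [value]

    def overlaps(self, other: "FakeAlternatives") -> bool:
        return bool(set(self.values) & set(other.values))

# Raw inputs only: instances are built inside the test body, so an input
# that becomes invalid fails a single test case instead of aborting
# collection, and the generated test IDs stay readable.
@pytest.mark.parametrize(
    ("left_raw", "right_raw"),
    [
        (["success", "ok"], "success"),
        ([10, 17], [17, 25]),
    ],
)
def test_alternatives_overlap(left_raw, right_raw):
    left = FakeAlternatives(left_raw)
    right = FakeAlternatives(right_raw)
    assert left.overlaps(right)
```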
Remove the temporary wip task document and add navigation smoke coverage that evaluates every navigation task at least once while keeping high-signal strict assertions on a curated subset.
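The smoke-versus-strict split described above can be sketched roughly as follows; the task IDs, `evaluate()` stub, and status values are illustrative assumptions, not the project's actual API.

```python
# Illustrative sketch only: IDs and statuses are assumptions.
CURATED_NAVIGATION_TASK_IDS = frozenset({101, 205})  # hypothetical strict subset

def evaluate(task_id: int) -> dict:
    """Hypothetical evaluator stub standing in for a real navigation run."""
    return {"status": "PASS"}

def check_navigation_task(task_id: int) -> None:
    result = evaluate(task_id)
    if task_id in CURATED_NAVIGATION_TASK_IDS:
        # Strict assertions on the curated subset: must pass outright.
        assert result["status"] == "PASS"
    else:
        # Smoke coverage: every other task only has to evaluate cleanly.
        assert result["status"] in {"PASS", "FAIL", "ERROR"}
```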
Pull request overview
Copilot reviewed 17 out of 17 changed files in this pull request and generated 3 comments.
        # Dump to compact string with sorted keys (top-level only)
        return json.dumps(parsed, sort_keys=True, separators=(",", ":"), ensure_ascii=False)

    def _normalize_pipeline(self, value: Any) -> str:
        """Override base pipeline to preserve raw JSON string content.

        Json payloads must not be transliterated, case-folded, or otherwise pre-normalized
        before parsing. We only apply JSON parsing + canonical dumping in `_type_normalize`.
        """
The in-code comments/docstrings still state JsonString sorts keys “top-level only”, but the implementation uses json.dumps(sort_keys=True) (which sorts nested keys too) and the tests now assert nested sorting. Please update the comment/docstring to reflect actual behavior to avoid misleading future changes (or change the implementation if top-level-only is still desired).
    def _has_single_item_list(value: Any) -> bool:
        """Detect single-item list structures that represent malformed alternatives."""
        if isinstance(value, list):
            if len(value) == 1:
                return True
            return any(_has_single_item_list(item) for item in value)

        if isinstance(value, dict):
            return any(_has_single_item_list(item) for item in value.values())

        return False


    def _is_malformed_network_expected(task_id: int, dataset: MappingProxyType[int, MappingProxyType[str, Any]]) -> bool:
        """Return True when the task's expected network config is currently malformed."""
        network_config = _get_network_event_config(task_id, dataset)
        expected = network_config.get("expected", {})
        return _has_single_item_list(expected.get("post_data"))
_has_single_item_list treats any list of length 1 anywhere inside post_data as “malformed alternatives”. This is too broad: many tasks legitimately have single-element arrays (and may even include post_data_schema marking them as arrays), so the smoke test would incorrectly allow ERROR for tasks that should be expected to stay executable. Consider detecting “malformed” by attempting the same normalization the evaluator uses (e.g., ValueNormalizer with the task’s post_data_schema) and only flagging the specific failure mode (single-item alternatives) rather than any single-element array.
    project_root = Path(metafunc.config.rootpath)
    dataset = _load_dataset(project_root)
    navigation_task_ids = _get_navigation_task_ids(dataset)
pytest_generate_tests reloads and parses the full dataset JSON each time it’s called during collection. Since this hook runs once per collected test function, consider caching _load_dataset() (e.g., module-level cache or functools.lru_cache) to avoid repeated I/O/JSON parsing and keep test collection fast as the dataset grows.
Summary

- Upload coverage.xml as an artifact.

Validation

- uv run pytest --ignore=tests/dataset/test_hf_dataset.py --ignore=tests/integration/environment_control/ --ignore=tests/integration/environments/ -q -rs
- uv run pytest --cov=src/webarena_verified --cov-report=term-missing --cov-report=xml --ignore=tests/dataset/test_hf_dataset.py --ignore=tests/integration/environment_control/ --ignore=tests/integration/environments/ -q -rs