Tests review by ian-ross · Pull Request #136 · MIT-LAE/AEIC

ian-ross · 2026-04-15T17:06:00Z

Clanker test review and fixes. Fixes #115.

Some notes on the process here:

I got Claude Code to generate a test review document, classifying issues by severity, type and region of the code.
I then used a mixture of models (Claude Sonnet and Opus, OpenAI ChatGPT 5.5) to fix all the issues, in order from most severe to least.
I performed spot checks on some of the fixes to check them for sanity.
I did cross-model reviews using Claude Opus 4.7, OpenAI ChatGPT 5.5 and GitHub Copilot, fixing issues using agents (again!). No agent found any significant problems with the test changes made by any other agent. There were a few documentation cleanups found by Copilot at the end, but that was all.

Overall, my confidence in this process is much higher than if the work was done by a human (that human being me or anyone else). This work is tedious, boring and fiddly, and the agents did a great job. They don't get bored, they're good at tracing the consequences of decisions through logically, they don't get lost in a long list of issues, etc.

One other thing: I needed to rebase this branch at one point. There were only a very small number of apparent conflicts, but there was a conceptual change to how performance models were represented on the rebase target branch, so the required changes to testing were much more extensive than a naive analysis would reveal. ChatGPT 5.5 discovered the "rebase rot", identified the locations where changes were needed, made an interactive rebase plan to surgically correct the relevant commits while preserving the overall branch history, and performed the rebase without any drama at all. That would have taken me days, or I would have fallen back on a much less principled approach.

Coding agent PR description follows below the line.

Summary

Multi-session test-suite audit of AEIC plus the full remediation sweep that followed. The audit log and per-finding detail live in test-quality.md; this branch lands the resulting fixes.

By the numbers:

115 audit findings closed: 11 High / 52 Medium / 52 Low.
Test count: ~225 → 316 passing (+1 skipped slow test gated on --run-slow).
~108 commits, organised one finding per commit so each item is bisectable.
Suite still runs cleanly via uv run pytest from the repo root.

What changed, at a glance

Logical errors: fixed cases where a test passed without verifying its stated invariant (e.g. a no-op .all truthiness check, a @patch on the wrong target, mocked thrust-categorization that ran the real SUT).
Suspicious data: replaced SUT-self-generated expected values with notebook-cited rounded results, ISA-formula derivations, or honest smoke + physical-envelope checks; marked irreducible cases explicitly as regression guards.
Coverage gaps: added focused tests for previously unreached branches (PerformanceTable.__post_init__ validators, EDBEntry.get_engine per-sheet absence, Config.escape, default_data_file_location's FileNotFoundError, etc.).
Hygiene: parametrized monolithic tests, decorated NetCDF-touching tests with @pytest.mark.forked, gated long-running tests behind a --run-slow flag, seeded previously-unseeded RNGs, and stripped traffic lights from the audit document.
Bonus SUT observations (separate from the test-quality findings): swapped Config.escape's decorator order; pinned pytest to testpaths = ["tests"] (and migrated [tool.pytest] → [tool.pytest.ini_options], which had been silently ignored); replaced print() in Builder.fly()'s exception path with a module logger; replaced isinstance(x, str | None | …) with concrete tuples in Container equality; documented calculate_nvPM_scope11_LTO's SN ∈ {-1, 0} sentinel handling.

See test-quality.md for the full findings list, the rubric used to triage them, the prioritized remediation tiers, and per-finding remediation notes (each annotated *[DONE]*, with *Actual remediation:* clauses where the fix diverged from the original suggestion).

Test plan

uv run pytest from repo root: 316 passing, 1 skipped (the gated slow test).
uv run pytest --run-slow tests/test_storage.py::test_create_reopen_large collects and runs the previously-permanently-skipped scaling test.
No 🟢 / 🟡 / 🔴 traffic lights remain in test-quality.md.
Reviewer spot-check: pick any commit whose subject names a finding ID (Phase N Low #M, Headline #N) and cross-reference the matching bullet in test-quality.md for context.

np.isclose(result, out_result).all evaluated the truthiness of a bound method object (always True) rather than calling it. Replace with np.testing.assert_allclose which also reports element-wise diffs on failure. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

The test verifying that a sampled query uses a deterministic random function checked for 'random()' which matches both SQLite's built-in random() and the project's det_random(). Tighten to 'det_random()' to enforce the actual reproducibility contract. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

netCDF4/HDF5 maintains C-level global state that leaks across tests when run in the same process. Mark every test that opens a netCDF file with @pytest.mark.forked so each runs in an isolated subprocess: - test_emissions_storage, test_separate_emissions (test_emissions_storage.py) - test_trajectory_simulation_basic, test_trajectory_simulation_outside_weather_domain, test_trajectory_simulation_weather (test_trajectory_simulation.py) - test_compute_ground_speed (test_weather.py) - test_trajectory_simulation_golden (test_golden.py) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

The file collected zero pytest items (all logic was inside main() guarded by if __name__ == '__main__') and contained a hardcoded path to a different developer's home directory. The proper test successor is test_matlab_verification.py and the script successor already exists as scripts/run_legacy_verification.py. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

flight_time is the independent variable used to align the two trajectories, so comparing it against itself yields MAPE ≈ 0 by construction — not a meaningful check. Remove it from TRAJ_FIELDS and its now-orphaned entry in TRAJ_FIELD_UNITS. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

The 7-bit literal 0b0000000 was inconsistent with the 7-bit masks used elsewhere in the same test. Use 0b00000000 to match the 8-bit (7 days + 1 reserved) width of the other constants. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Replace the manual if/raise pattern with a plain assert so pytest can introspect the failure properly and display it consistently with other test failures. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Notebook (notebooks/test-cases.ipynb): - Sections 2–6: add a "Rounded results for use in test" markdown cell and code cell(s) so tests can cite a specific cell rather than citing the notebook generically. - Section 5 (SCOPE 11): fix Q-coefficient typo 0.766 → 0.776; the corrected value exactly reproduces the test expected arrays (verified analytically). Data READMEs (new files): - tests/data/verification/legacy/README.md — MATLAB v2.1 provenance, mission list, matlab-output-orig/ → matlab-output/ transformation. - tests/data/weather/README.md — ERA5 source, valid_time, spatial extent, download date, note on self-generated expected value. - tests/data/oag/README.md — OAG extract provenance, filtering rules, expected 8-flight count. - tests/data/missions/oag-2019-test-subset.README.md — ingestion provenance, table row counts, query verification table for all five asserted counts (1197 / 99 / 36 / 307 / 13). - src/AEIC/data/engines/README.md — ICAO EDB source for sample_edb.xlsx and APU_data.toml. Typo fix (src/AEIC/weather.py): - Docstring: azmiuth / azmith → azimuth. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Previously the test wrote custom fields to a TrajectoryStore, reopened it, and asserted only that the length matched — the #TODO acknowledged the real invariant was unverified. Now captures original field values before write and asserts np.array_equal on reload for every trajectory. Also seeds np.random.default_rng(0) and generates fields at the dtypes declared in FieldMetadata (float32 / int32) so round-trip is exact. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Adds DONE markers to headline findings whose remediation has shipped, and annotates open findings with traffic-light emojis by severity (red=High, amber=Medium, green=Low). Closed findings are left unmarked. Totals preserved: 115 findings — 88 open with traffic lights, 27 DONE. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

The test's expected_temp / expected_pressure / expected_mach arrays were captured from AtmosphericState itself (per notebook cell 11), making the check self-referential — a science bug that drifted consistently would be invisible. Now computes expected values in a new _isa_reference() helper using published ICAO Doc 7488 / US Std Atmosphere 1976 constants inlined locally, without importing AEIC.utils.standard_atmosphere (the SUT). Spot-check altitudes (h=0 sea level, h=11000 tropopause) match ICAO Table A published values directly. SLS block unchanged — separate sub-concern. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

test_atmospheric_state_and_sls_flow_shapes now uses independent ISA formulas; strip the red marker from the headline and Phase 4 detail. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

The expected dict was a flat 11-species snapshot that mirrored _gse_nominal_profile() plus the SUT's own NOx-split, sulfur-balance, and H2O derivations — a regression in any of those branches would pass because both sides drifted together. Now hoists the 5 primary per-LTO constants (CO2/NOx/HC/CO/PM10 for WIDE) into named locals with a provenance comment, and computes the 6 derived species (H2O, NO/NO2/HONO, SO4/SO2, nvPM) inline from first principles. Shrinks the self-referential footprint to the 5 nominal constants, which trace to the AEIC v2 MATLAB legacy; primary external source (Stettler 2011 / AEDT / EDMS) flagged as TODO pending back-trace. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

test_get_gse_emissions_matches_reference_profile now derives the expected dict from cited nominal constants plus inline algebra. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

test_compute_ground_speed asserted gs == 191.02855126751604 with rel=1e-4, a value captured from the SUT and re-adjusted when the SUT changed. Any "independent" recomputation against the real ERA5 fixture duplicates the SUT (ISA pressure lookup + xarray interp + hypot), so the assertion cannot be made independent in a useful way. Replaced with a finite + physical envelope (100 < gs < 300 for TAS=200 m/s). Algorithm correctness is now covered by the synthetic- fixture tests later in the same file, where wind values are known and the expected ground speed is derivable on paper. The real- fixture test stays as an honest smoke check that the on-disk 20240901.nc still parses end-to-end. test-quality.md: annotated the finding with the actual remediation approach — neither of the originally-suggested options (a/b) was viable because both route back through the SUT. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

test_oag_conversion previously checked only SELECT COUNT(*) FROM flights == 8 on the end-to-end convert_oag_data pipeline. A regression that inserted 8 malformed rows, failed to populate schedules, or emitted spurious warnings would pass. Now: - Full flights-row content for AS 1011 ORD->SEA (row 4 of the CSV, first row that survives is_row_valid). Asserts every column sourced from the CSV — carrier, flight_number, origin/destination IATA, day_of_week_mask, departure/arrival minutes, arrival_day_offset, service_type, aircraft_type (inpacft, not genacft), engine_type, seat_capacity, effective_from/to as ISO dates, number_of_flights (derived: 2 schedules fit the 2019-12-05..11 window at Wed+Thu), od_pair, and distance in km (statute miles x 1.609344). - COUNT(*) FROM schedules > 8 and COUNT(DISTINCT flight_id) == 8 — every accepted flight materialised at least one schedule row. - Warnings file NOT written, because rejected rows in the fixture are all filtered silently by is_row_valid (operating=N, service=V, stops>0) before distance / airport checks. A regression that started emitting spurious warnings on valid rows would fail. Out of scope (separate Medium COVERAGE-GAP finding): a second hand-crafted CSV fixture that deliberately triggers each of the four Warning.Type categories. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

test_oag_conversion now verifies row content, schedules population, and warnings-file silence — not just the flights count. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

The test used a helper (_expected_trajectory_indices) that called the same BFFM2_EINOx / EI_HCCO / get_SLS_equivalent_fuel_flow / AtmosphericState functions the SUT uses, then asserted the SUT outputs matched. A science bug in any of those helpers would be invisible — both sides inherit it. But the test does legitimately catch wiring regressions: wrong arg order, swapped EI_HC/EI_CO, missing NOx speciation, etc. Full Tier 3 (replace helper with notebook-cited rounded arrays) was examined and rejected: notebook Sections 2 and 4 use EDB inputs that don't match DummyPerformanceModel.lto, so citing their arrays either duplicates TestBFFM2_EINOx (already notebook-grounded) or requires new notebook cells. Science correctness for BFFM2 / HCCO / NOx speciation / atmospheric state is already covered independently by test_emission_functions.py and test_atmospheric_state_and_sls_flow_shapes. Instead: - Rename test to test_compute_emissions_pipeline_wiring. - Add docstring spelling out what it verifies (wiring, algebraic identities, NOx speciation partition) and what it does not (science correctness — covered elsewhere). - Annotate the helper as DELIBERATELY SELF-REFERENTIAL. - Add a mass-conservation identity check no_prop+no2_prop+hono_prop ≈ 1.0 — a genuine correctness signal independent of the helper, catching the "fractions don't sum to a valid probability split" bug class. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Every raise-ValueError branch in PerformanceTable.__post_init__ was previously unreachable from the test suite. Add a parametrized negative test that mutates a single cell (or drops one row) from a per-phase baseline (climb/cruise/descent) to trip each named raise branch, plus a baselines-load smoke and dedicated mass-count rejects. Coverage spans all reachable branches: per-phase ROCD-sign mismatches (descent-not-all-negative, cruise-not-all-zero, climb-some-negative), per-phase mass-count violations, per-phase coverage gaps for climb / cruise, and FL-only dependency violations on cruise.tas, climb.tas, climb.fuel_flow. Descent's coverage and FL-only checks are unreachable once the mass-count == 1 invariant holds (annotated in the code). Adapted to split climb/cruise/descent tables landed in MIT-LAE#139. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Close the last open High headline. PerformanceTable.__post_init__ now has every raise branch exercised, and the orphan bad_performance_model fixtures have been deleted (the sibling Medium COVERAGE-GAP closes alongside). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Pin the contract used by `_read_reproducibility_info` when replaying trajectory-store snapshots: the inner load must succeed under `Config.escape()` and the canonical singleton must survive untouched. Previously the branch was unreachable from the test suite, so a broken escape() would silently corrupt replay. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Cover the property that controls NetCDF axis order: silent reordering or accidental inclusion of `POINT` would corrupt every emissions / trajectory file we write. Three cases pin the contract — POINT-only trajectory, POINT+SPECIES, and the THRUST_MODE branch. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

The sqlite connect was passing a bare 'tmp.sqlite' string, which sqlite3 creates in the current working directory — seeding a stale 0-byte file at the repo root after every run. Move the file (and the three .nc opens it shares scope with) under `tmp_path` so the test is self-contained. Drop the dead `safe_*` wrappers while there: the audit hook fires regardless of whether the open/connect raises, so the try/except added no coverage. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

The bare `except Exception` swallowed any failure — a tmp_path error or import failure in the worker would have set result to 'FAILED' and silently passed the assertion. The guard the test was actually exercising is `TrajectoryStore.active_in_thread` at store.py:378, so capture the exception instance and assert the RuntimeError class and the guard message substring directly. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Capture each mission's `total_emissions` dict before writing and pin the per-species values after reopen, so a serialization bug that silently dropped a species or rounded the totals would be caught. The prior `hasattr(traj, 'total_emissions')` check was satisfied just by the field's presence in the schema. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

`profile_first.mass is profile_second.mass` only proved `functools.cache` returned the same object — a cache that stored an empty placeholder twice would have passed. Add positivity on a TAKEOFF canary (the highest-thrust mode, where mass/number are largest) plus a per-mode finite check covering the full ThrustMode enum. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

MIT-LAE#2) The `total[s] == sum(components)` identity in `test_sum_total_emissions_matches_components` would pass even if a regression silently zeroed one of the four component buckets — the matching zero on the total side would still satisfy the equality. Pin a precondition that CO2 (a species every component must emit under the default fuel config) is positive in each of the trajectory / LTO / APU / GSE buckets before running the identity check. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

`test_intercept_adjustment_uses_second_mode_value` only ever landed in the horizontal-line upper segment because its calibration parameters forced the intercept-adjustment branch in `hcco.py` to clamp `x_intercept` to `log_ff_cal1`. Add a focused `test_branches_split_at_intercept` that: - Uses a strictly-decreasing-with-flow `x_EI` (so the log-log slope is finite and negative, the intercept lands inside the calibration range, and the slanted formula actually runs). - Calls EI_HCCO with a single low-flow point, a single high-flow point, and a mixed array, asserting each lands in the correct segment. - Pins the upper-segment value to the geometric-mean closed form (sqrt(xEI[CLIMB] * xEI[TAKEOFF])) and the lower-segment value as strictly greater than that geometric mean (the decreasing-with-flow signal). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…ase 4 Low MIT-LAE#4) The `@patch('AEIC.emissions.utils.get_thrust_cat_cruise')` in `test_thrust_categorization` was patching the wrong symbol — `ei/nox.py` imports the name into its own module, so the bound reference inside `BFFM2_EINOx` was untouched and the SUT ran unmocked. The test reduced to a finiteness smoke that already had a sibling covering it, so any regression would have slipped through both sides. Drop the mock and pin the real per-point category-to-speciation identity: - pick an evaluation-flow array spanning all three categories; - recompute the categories independently via `get_thrust_cat_cruise(...)`, look up `NOx_speciation()` per category, and assert `BFFM2_EINOx`'s `noProp` / `no2Prop` / `honoProp` arrays match; - guard with `len(set(cats.data)) >= 2` so a regression that collapsed categorization to a constant bucket would still trip the test. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

The TestGetAPUEmissions class fixed a slate of arbitrary APU parameters (fuel_kg_per_s=0.1, NOx_g_per_kg=15.0, apu_time=2854, etc.) with no comment about their provenance. They were chosen to make outputs numerically distinguishable, not lifted from any published APU dataset, but a future reader would have no signal of that. Add a class-level docstring stating the synthetic origin and warning against porting the numbers as authoritative. The notebook-expansion suggestion in the audit is left out of scope — it's already on the notebook-index gap list separately. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…-LAE#6) `test_engine_type_scaling_and_invalid_smoke_numbers` mixed two unrelated branches. The engine-scaling half inline-computed CBC/AFR/kslm/Q against the SUT formula — a tautological copy that would pass even if the SUT regressed and the test recomputed the regression. The invalid-SN half checked only mass-channel zeroing. Split into: - `test_scope11_engine_type_scaling`: pins only the qualitative MTF > TF > 0 invariant at IDLE (the bypass-correction signal). Numerically grounding the formula against the notebook is already done by `test_SCOPE11_unit_test`, so re-sourcing values here would just duplicate that test. - `test_scope11_invalid_smoke_numbers_return_zero`: covers both engine types and pins zeroing on both the mass *and* number channels — a refactor that propagated -1.0 / 0.0 through to produce NaN would otherwise slip through a mass-only check. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…IT-LAE#7) The 3-value expected `NOxEI` array had no provenance comment, so a future reader couldn't tell whether it was a notebook-sourced reference or a self-generated snapshot — and the test name ("consistency") suggested the former. Add a docstring stating it's a regression snapshot, *not* an independent reference, and pointing at the test that does carry the notebook-sourced correctness check (`test_matches_reference_component_values`). Inline comment by the assertion repeats the marker so a casual diff sees it. Switched to `np.testing.assert_allclose` so a failure reports the diff. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…-LAE#8) `test_get_EDB_data_for_engine_raises_when_uid_absent` only covered the case where the UID was missing from both sheets — the SUT then raises about the gaseous sheet first, so the symmetric "present in gaseous, absent from nvPM" branch was unreached. Parametrize across both scenarios with the expected per-sheet error message pinned via a regex match. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

The positive-path EDB test was a single function with ~14 sequential asserts. A regression in (e.g.) the nvPM-num column extraction would fail at the first line and mask whatever else had broken in the same parser change. Convert to a parametrization keyed by `(attr, expected)` tuples so each attribute lands on its own test ID; share the parsed `EDBEntry` via a function-scoped fixture (function scope is required because `Config` is loaded by the autouse `default_config` fixture). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…-LAE#1) The test appended 30 fully-specified points but only checked `len == 30` and the three phase counters. A regression that silently zeroed every per-point array on `fix()` would have passed. Add spot-checks after `fix()`: - `fuel_flow` constant at 1.4 across phases; - `latitude` follows the input pattern (41.0 + 0.02*i within each phase, so check index 0 and a non-trivial cross-phase index); - altitude trace matches climb 0→9000, cruise flat at 10000, descent 10000→1000; - aircraft mass decreases monotonically within each phase (the setup resets the mass formula at each phase so cross-phase monotonicity would not hold). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Two small unit tests for the previously-untested branches in `Trajectory.interpolate_time` and `Trajectory.copy_point`: - `test_interpolate_time_out_of_bounds_yields_nan` pins the `left=nan / right=nan` contract for queries below and above the existing `flight_time` range, with finite values at the boundaries and inside. - `test_copy_point_rejects_out_of_bounds` exercises the two IndexError guards (one each for `from_idx` and `to_idx`) on both the negative and beyond-extent sides. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…-LAE#4) `test_maxiters=1000` made the cap effectively unbounded for missions that converge in a handful of iterations (BOS→LAX sample mission converges in ~7 cycles against the sample performance model). A regression that, e.g., stopped converging quadratically would have run longer rather than failed. Lower to 20 — ~3× headroom over the empirical baseline — so any meaningful slowdown trips the cap and raises `RuntimeError`. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

`iterate_mass=True` was tested only at the loose ends of the `max_mass_iters` axis — generously above (1000) and trivially below (1). The boundary case where the mission converges *exactly* at the empirically-minimal `max_mass_iters` was unreached, leaving an "off-by-one in the loop bound" regression invisible. Add `test_trajectory_mass_iter_boundary` parametrized over the two directions: - `_MIN_ITERS_TO_CONVERGE - 1`: must raise `RuntimeError` with the "Mass iteration failed to converge" message; - `_MIN_ITERS_TO_CONVERGE`: must succeed. The constant `_MIN_ITERS_TO_CONVERGE = 7` is the empirical baseline for the BOS→LAX `example_mission` against the sample performance model at `reltol=1e-6`. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

The `set(['true_airspeed'])` magic constant had no comment about why the final point is skipped. The reason — the legacy MATLAB trajectory carries an extra post-landing point that the new SUT doesn't synthesize identically, so TAS at the tail is a guaranteed false positive — was implicit. Add an explanatory comment and switch to the `{...}` set-literal form (idiomatic, slightly faster than the `set([...])` constructor). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

`process_matlab_csvs` was the only entry into `matlab-output/` generation and had no direct test — the 67 % coverage on `verification/legacy.py` was almost entirely this function's lines. Silent breakage of the (`tdf.t[:-1] != edf.t).any()` consistency check at line 58 or the per-key groupby split would have corrupted every downstream verification run unnoticed. Add three unit tests in `tests/test_matlab_verification.py`: - `test_process_matlab_csvs_per_mission_split`: happy path with two missions — pins the file split (`BOS_LAX_738.csv`, `JFK_ORD_320.csv`), sorted-time merge, dropped key columns, and the NaN-on-tail-point semantics where the trajectory keeps its post-landing row but the emissions side has nothing for it; - `test_process_matlab_csvs_rejects_inconsistent_time_columns`: pins the line-58 raise for mismatched time columns; - `test_process_matlab_csvs_rejects_missing_out_dir`: pins the line-28 raise for an absent output directory. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

If `sample_missions` is extended without rebuilding the golden NetCDF, the existing `comparison_ts[idx]` lookup would raise `IndexError` partway through the loop with a stack trace that didn't make the actual cause obvious. Add the precondition assertion before the loop and have the failure message name the regeneration script (`scripts/make_golden_test_data.py`) so a contributor hitting it knows what to do next. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Original `@contextmanager` over `@staticmethod` was unusual: `staticmethod` produces a descriptor, which `contextmanager` then wraps. Whether that works on every supported Python version is non-obvious — and the existing `test_escape_allows_re_validation_without_overwriting_singleton` test calls `Config.escape()` from a class context where the descriptor protocol is involved, so a future descriptor-handling change could surface here. `@staticmethod` outermost is the conventional order: contextmanager decorates the function first to produce a context-manager factory, and staticmethod wraps that callable for class-binding purposes. The existing escape test (Phase 1 medium) covers this path and still passes. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Two related fixes in pyproject.toml's pytest configuration: 1. **Section name**: pytest reads `[tool.pytest.ini_options]`, not `[tool.pytest]`. The existing `[tool.pytest]` section was being silently ignored — including the numpy `filterwarnings` filter. Move to the canonical name so both that filter and the new `testpaths` directive actually take effect. 2. **testpaths = ["tests"]**: stops collection from descending into sibling worktrees / checkouts (e.g. `pretty-performance-model-toml/`, `trajectories-to-grid/`) whose own `tests/conftest.py` collides with this repo's `tests.conftest` module name and prevented `uv run pytest` from the repo root from collecting at all. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

The function silently `continue`s on `SN == -1` or `SN == 0`, leaving the corresponding thrust mode at 0.0 in both `mass` and `number`. This was undocumented public behavior — the sentinel treatment was discoverable only by reading the loop body or by running the Phase 4 invalid-smoke-numbers test. Spell out the contract in the docstring so callers know to disambiguate "skipped because input was a sentinel" from "computed as 0.0" if they need to. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

`Builder.fly()`'s `except Exception` block called `print(...)` with the failing flight's identity before re-raising. In a production run across millions of missions this would unconditionally dump to stdout per failure with no filtering and no log-level control. Switch to `logger.exception(...)` via a module-level `logging.getLogger(__name__)`, so callers can configure verbosity through Python's logging machinery. The exception is still re-raised so caller-side handling is unchanged. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

`Container.__eq__` and `approx_eq` both used `isinstance(vs, str | None | …)` to fall into the direct-equality arm. PEP 604 unions are accepted here on Python 3.12 (the supported version), but the form is non-obvious and was flagged as fragile on other CPython versions. Replace with concrete `(str, type(None), …)` tuples at both sites for clarity and version-portability. Add `test_container_equality_handles_str_and_none_fields` to tests/test_storage.py to pin the round-trip: `Trajectory.name` is the only `str | None` field in the schema today, so the test stuffs it with both legs (str-vs-str, None-vs-None, mixed) and exercises both `__eq__` and `approx_eq`. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

The legacy table was split into per-phase tables in MIT-LAE#139, so the prior `test_performance_table_subsetting` (which extended the combined-table `PerformanceTable.subset()` API) is no longer the right shape — `subset()` itself is gone. Re-pin the contracts that test was reaching for, directly on the sample model: - cruise (ROCDFilter.ZERO) is non-empty and all rows satisfy `|rocd| <= ZERO_ROCD_TOL` — it backs SimpleFlightRules.CRUISE and is the most-traversed path in trajectory evaluation; - climb has exactly 3 masses; descent has exactly 1 — the BADA-3 shape. The validator's `mass_ok = len(self.mass) in (2, 3)` rule for climb / cruise allows the sample to silently drift to 2 if a TOML edit dropped one mass; this test surfaces that. Folds in the intent of the dropped `Cover ROCDFilter.ZERO in test_performance_table_subsetting` and `Pin BADA mass-count contract in test_performance_table_subsetting` commits, retargeted at the new per-phase API. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Copilot

Pull request overview

This PR is a broad test-suite audit/remediation for AEIC ahead of the V1 release, tightening assertions, removing self-referential expected values, expanding branch coverage, and making a few small SUT fixes discovered during the audit (config/pytest setup, logging, docstrings, and runtime type guards).

Changes:

Strengthens and expands tests across weather, trajectories/storage, missions DB/OAG ingestion, performance models/tables, emissions, and verification; adds deterministic seeding and clearer invariants.
Adds pytest suite hygiene: --run-slow gating, @pytest.mark.forked for NetCDF-touching tests, and fixes pytest configuration in pyproject.toml so collection is constrained to tests/.
Includes small SUT adjustments observed during the audit (logger usage in builders, Config.escape decorator order, docstring fixes, Container isinstance guards, nvPM sentinel documentation).

Reviewed changes

Copilot reviewed 36 out of 37 changed files in this pull request and generated 4 comments.

Show a summary per file

File	Description
tests/test_weather.py	Makes ground-speed test a real-fixture smoke test + adds targeted unit tests for azimuth override and domain errors.
tests/test_trajectory_simulation.py	Tightens trajectory invariants, seeds RNG, verifies NetCDF field round-trip, adds weather-effect assertion, and adds mass-iteration boundary + stub-builder tests.
tests/test_trajectories.py	Adds round-trip spot checks for point fields and pins out-of-bounds behavior for interpolation/copy.
tests/test_thrust_modes.py	Expands coverage of `ThrustModeValues` operations, constructor shapes, immutability, broadcasting, and `ThrustModeArray` validation.
tests/test_storage.py	Strengthens TrajectoryStore argument validation tests, adds numeric equality checks on reopen, gates slow test behind `--run-slow`, improves threading guard test, and improves file access recorder test realism.
tests/test_performance_table.py	Adds thorough `PerformanceTable` validation/negative-path coverage and interpolation behavior checks using the sample model.
tests/test_performance_model.py	Adds deeper content checks for initialized models, selector behavior (including warnings), and selector caching behavior.
tests/test_model_utilities.py	Adds CI model normalization edge-case coverage including top-level list normalization with `RootModel`.
tests/test_mission_db_creation.py	Makes OAG conversion tests assert concrete DB row content and adds warning-category coverage.
tests/test_mission_db.py	Refactors filter tests into parametrized form, pins deterministic sampling SQL, expands query/result coverage, and adds Mission/QueryResult conversion tests.
tests/test_matlab_verification.py	Documents final-point skipping rationale and adds unit tests for `process_matlab_csvs` splitting/consistency checks.
tests/test_legacy_verification.py	Removes legacy script-style verification test file from the suite.
tests/test_golden.py	Clarifies golden snapshot semantics, adds precondition for golden length, and adds unit-envelope tripwires.
tests/test_emissions_storage.py	Verifies emissions totals/indices round-trip through storage.
tests/test_emissions.py	Strengthens species set checks, adds ISA-independent reference helper, and expands coverage for APU disabled path and LTO/climb-descent mode behavior.
tests/test_emission_functions.py	Fixes no-op assertion, improves tolerance handling, removes ineffective patching, adds branch coverage and clearer provenance notes.
tests/test_edb.py	Parameterizes sheet-absence error paths and makes engine-attribute checks granular and fixture-backed.
tests/test_dimensions.py	Adds coverage for netCDF dim-name ordering, removal semantics, and dim-name parsing; splits str/repr into its own test.
tests/test_config.py	Improves clarity around fixture override, pins concrete defaults, adds `Config.escape` behavior test, and pins missing default-data file branch.
tests/data/weather/README.md	Documents the ERA5 weather fixture provenance and coverage.
tests/data/verification/legacy/README.md	Documents the MATLAB legacy verification dataset layout/provenance for the verification tests.
tests/data/oag/README.md	Documents the OAG CSV extract used for conversion tests.
tests/data/missions/oag-2019-test-subset.README.md	Documents the SQLite test DB fixture contents and expected query counts.
tests/conftest.py	Adds `--run-slow` flag, registers `slow` marker, and skips slow tests by default; clarifies AEIC_PATH import-time behavior.
test-quality-notebook-index.md	Adds an index mapping notebook sections to tests and provenance expectations.
src/AEIC/weather.py	Fixes docstring typos for `azimuth` parameter.
src/AEIC/trajectories/builders/base.py	Replaces `print()` on exception path with module logger + stacktrace (`logger.exception`).
src/AEIC/storage/container.py	Replaces PEP604 union usage in `isinstance` checks with tuples for runtime correctness.
src/AEIC/emissions/ei/nvpm.py	Documents -1/0 smoke number sentinel behavior and its effect on outputs.
src/AEIC/data/engines/README.md	Documents provenance/intent for the sample EDB workbook fixture.
src/AEIC/config/core.py	Fixes decorator order on `Config.escape` so `@contextmanager` wraps the underlying function correctly.
pyproject.toml	Moves pytest config to `[tool.pytest.ini_options]` and restricts collection to `tests/`.
notebooks/test-cases.ipynb	Adds “Rounded results for use in test” blocks for multiple sections to improve provenance.
tests/data/performance/bad_performance_model_2.toml	Removes large legacy bad-model fixture.
tests/data/performance/bad_performance_model_3.toml	Removes large legacy bad-model fixture.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

- tests/test_weather.py: fix fixture filename in comment (2024-09-01.nc) - tests/data/weather/README.md: replace stale pytest.approx note with envelope-based assertion strategy description - tests/data/oag/README.md: align provenance and expected-result text with actual CSVEntry.is_row_valid() filtering and warnings behavior Co-authored-by: Claude Opus 4.7 <anthropic-claude-opus-4-7@pi.local> Pi-Model: anthropic/claude-opus-4-7

ian-ross changed the title ~~Add test review notes~~ Tests review Apr 15, 2026

ian-ross modified the milestone: V1 Apr 15, 2026

ian-ross force-pushed the tests-review branch from 58c1f1f to 72a754e Compare April 24, 2026 11:56

ian-ross and others added 26 commits April 27, 2026 13:24

Add test review notes

cdf1207

Use assert not failed in test_trajectory_simulation_golden

a6907a9

Replace the manual if/raise pattern with a plain assert so pytest can introspect the failure properly and display it consistently with other test failures. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Mark headline MIT-LAE#6 DONE in test-quality.md

02c0299

test_atmospheric_state_and_sls_flow_shapes now uses independent ISA formulas; strip the red marker from the headline and Phase 4 detail. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Mark headline MIT-LAE#7 DONE in test-quality.md

1a70665

test_get_gse_emissions_matches_reference_profile now derives the expected dict from cited nominal constants plus inline algebra. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Mark headline MIT-LAE#9 DONE in test-quality.md

834662d

test_oag_conversion now verifies row content, schedules population, and warnings-file silence — not just the flights count. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

ian-ross and others added 22 commits April 27, 2026 13:37

ian-ross force-pushed the tests-review branch from 3619ac7 to 6fa234f Compare April 27, 2026 11:41

ian-ross requested a review from Copilot April 28, 2026 08:32

Copilot started reviewing on behalf of ian-ross April 28, 2026 08:32 View session

Copilot AI reviewed Apr 28, 2026

View reviewed changes

Comment thread tests/test_weather.py Outdated

Comment thread tests/data/weather/README.md Outdated

Comment thread tests/data/oag/README.md Outdated

Comment thread tests/data/oag/README.md Outdated

ian-ross marked this pull request as ready for review April 28, 2026 09:07

ian-ross merged commit 35f634f into MIT-LAE:main Apr 28, 2026
4 checks passed

ian-ross deleted the tests-review branch April 28, 2026 09:18

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Tests review#136

Tests review#136
ian-ross merged 107 commits into
MIT-LAE:mainfrom
ian-ross:tests-review

ian-ross commented Apr 15, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

ian-ross commented Apr 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

What changed, at a glance

Test plan

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

ian-ross commented Apr 15, 2026 •

edited

Loading