Skip to content

Tests review#136

Merged
ian-ross merged 107 commits into
MIT-LAE:mainfrom
ian-ross:tests-review
Apr 28, 2026
Merged

Tests review#136
ian-ross merged 107 commits into
MIT-LAE:mainfrom
ian-ross:tests-review

Conversation

@ian-ross

@ian-ross ian-ross commented Apr 15, 2026

Copy link
Copy Markdown
Member

Clanker test review and fixes. Fixes #115.

Some notes on the process here:

  • I got Claude Code to generate a test review document, classifying issues by severity, type and region of the code.
  • I then used a mixture of models (Claude Sonnet and Opus, OpenAI ChatGPT 5.5) to fix all the issues, in order from most severe to least.
  • I performed spot checks on some of the fixes to check them for sanity.
  • I did cross-model reviews using Claude Opus 4.7, OpenAI ChatGPT 5.5 and GitHub Copilot, fixing issues using agents (again!). No agent found any significant problems with the test changes made by any other agent. There were a few documentation cleanups found by Copilot at the end, but that was all.

Overall, my confidence in this process is much higher than if the work was done by a human (that human being me or anyone else). This work is tedious, boring and fiddly, and the agents did a great job. They don't get bored, they're good at tracing the consequences of decisions through logically, they don't get lost in a long list of issues, etc.

One other thing: I needed to rebase this branch at one point. There were only a very small number of apparent conflicts, but there was a conceptual change to how performance models were represented on the rebase target branch, so the required changes to testing were much more extensive than a naive analysis would reveal. ChatGPT 5.5 discovered the "rebase rot", identified the locations where changes were needed, made an interactive rebase plan to surgically correct the relevant commits while preserving the overall branch history, and performed the rebase without any drama at all. That would have taken me days, or I would have fallen back on a much less principled approach.

Coding agent PR description follows below the line.


Summary

Multi-session test-suite audit of AEIC plus the full remediation sweep that followed. The audit log and per-finding detail live in test-quality.md; this branch lands the resulting fixes.

By the numbers:

  • 115 audit findings closed: 11 High / 52 Medium / 52 Low.
  • Test count: ~225 → 316 passing (+1 skipped slow test gated on --run-slow).
  • ~108 commits, organised one finding per commit so each item is bisectable.
  • Suite still runs cleanly via uv run pytest from the repo root.

What changed, at a glance

  • Logical errors: fixed cases where a test passed without verifying its stated invariant (e.g. a no-op .all truthiness check, a @patch on the wrong target, mocked thrust-categorization that ran the real SUT).
  • Suspicious data: replaced SUT-self-generated expected values with notebook-cited rounded results, ISA-formula derivations, or honest smoke + physical-envelope checks; marked irreducible cases explicitly as regression guards.
  • Coverage gaps: added focused tests for previously unreached branches (PerformanceTable.__post_init__ validators, EDBEntry.get_engine per-sheet absence, Config.escape, default_data_file_location's FileNotFoundError, etc.).
  • Hygiene: parametrized monolithic tests, decorated NetCDF-touching tests with @pytest.mark.forked, gated long-running tests behind a --run-slow flag, seeded previously-unseeded RNGs, and stripped traffic lights from the audit document.
  • Bonus SUT observations (separate from the test-quality findings): swapped Config.escape's decorator order; pinned pytest to testpaths = ["tests"] (and migrated [tool.pytest][tool.pytest.ini_options], which had been silently ignored); replaced print() in Builder.fly()'s exception path with a module logger; replaced isinstance(x, str | None | …) with concrete tuples in Container equality; documented calculate_nvPM_scope11_LTO's SN ∈ {-1, 0} sentinel handling.

See test-quality.md for the full findings list, the rubric used to triage them, the prioritized remediation tiers, and per-finding remediation notes (each annotated *[DONE]*, with *Actual remediation:* clauses where the fix diverged from the original suggestion).

Test plan

  • uv run pytest from repo root: 316 passing, 1 skipped (the gated slow test).
  • uv run pytest --run-slow tests/test_storage.py::test_create_reopen_large collects and runs the previously-permanently-skipped scaling test.
  • No 🟢 / 🟡 / 🔴 traffic lights remain in test-quality.md.
  • Reviewer spot-check: pick any commit whose subject names a finding ID (Phase N Low #M, Headline #N) and cross-reference the matching bullet in test-quality.md for context.

@ian-ross ian-ross changed the title Add test review notes Tests review Apr 15, 2026
@ian-ross ian-ross modified the milestone: V1 Apr 15, 2026
ian-ross and others added 26 commits April 27, 2026 13:24
np.isclose(result, out_result).all evaluated the truthiness of a bound
method object (always True) rather than calling it. Replace with
np.testing.assert_allclose which also reports element-wise diffs on
failure.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The test verifying that a sampled query uses a deterministic random
function checked for 'random()' which matches both SQLite's built-in
random() and the project's det_random(). Tighten to 'det_random()' to
enforce the actual reproducibility contract.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
netCDF4/HDF5 maintains C-level global state that leaks across tests
when run in the same process. Mark every test that opens a netCDF file
with @pytest.mark.forked so each runs in an isolated subprocess:

- test_emissions_storage, test_separate_emissions (test_emissions_storage.py)
- test_trajectory_simulation_basic, test_trajectory_simulation_outside_weather_domain,
  test_trajectory_simulation_weather (test_trajectory_simulation.py)
- test_compute_ground_speed (test_weather.py)
- test_trajectory_simulation_golden (test_golden.py)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The file collected zero pytest items (all logic was inside main() guarded
by if __name__ == '__main__') and contained a hardcoded path to a
different developer's home directory. The proper test successor is
test_matlab_verification.py and the script successor already exists as
scripts/run_legacy_verification.py.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
flight_time is the independent variable used to align the two
trajectories, so comparing it against itself yields MAPE ≈ 0 by
construction — not a meaningful check. Remove it from TRAJ_FIELDS and
its now-orphaned entry in TRAJ_FIELD_UNITS.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The 7-bit literal 0b0000000 was inconsistent with the 7-bit masks used
elsewhere in the same test. Use 0b00000000 to match the 8-bit (7 days +
1 reserved) width of the other constants.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace the manual if/raise pattern with a plain assert so pytest can
introspect the failure properly and display it consistently with other
test failures.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Notebook (notebooks/test-cases.ipynb):
- Sections 2–6: add a "Rounded results for use in test" markdown cell
  and code cell(s) so tests can cite a specific cell rather than citing
  the notebook generically.
- Section 5 (SCOPE 11): fix Q-coefficient typo 0.766 → 0.776; the
  corrected value exactly reproduces the test expected arrays (verified
  analytically).

Data READMEs (new files):
- tests/data/verification/legacy/README.md — MATLAB v2.1 provenance,
  mission list, matlab-output-orig/ → matlab-output/ transformation.
- tests/data/weather/README.md — ERA5 source, valid_time, spatial
  extent, download date, note on self-generated expected value.
- tests/data/oag/README.md — OAG extract provenance, filtering rules,
  expected 8-flight count.
- tests/data/missions/oag-2019-test-subset.README.md — ingestion
  provenance, table row counts, query verification table for all five
  asserted counts (1197 / 99 / 36 / 307 / 13).
- src/AEIC/data/engines/README.md — ICAO EDB source for sample_edb.xlsx
  and APU_data.toml.

Typo fix (src/AEIC/weather.py):
- Docstring: azmiuth / azmith → azimuth.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Previously the test wrote custom fields to a TrajectoryStore, reopened
it, and asserted only that the length matched — the #TODO acknowledged
the real invariant was unverified. Now captures original field values
before write and asserts np.array_equal on reload for every trajectory.
Also seeds np.random.default_rng(0) and generates fields at the dtypes
declared in FieldMetadata (float32 / int32) so round-trip is exact.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds DONE markers to headline findings whose remediation has shipped,
and annotates open findings with traffic-light emojis by severity
(red=High, amber=Medium, green=Low). Closed findings are left unmarked.
Totals preserved: 115 findings — 88 open with traffic lights, 27 DONE.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The test's expected_temp / expected_pressure / expected_mach arrays
were captured from AtmosphericState itself (per notebook cell 11),
making the check self-referential — a science bug that drifted
consistently would be invisible. Now computes expected values in a
new _isa_reference() helper using published ICAO Doc 7488 / US Std
Atmosphere 1976 constants inlined locally, without importing
AEIC.utils.standard_atmosphere (the SUT). Spot-check altitudes
(h=0 sea level, h=11000 tropopause) match ICAO Table A published
values directly. SLS block unchanged — separate sub-concern.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
test_atmospheric_state_and_sls_flow_shapes now uses independent ISA
formulas; strip the red marker from the headline and Phase 4 detail.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The expected dict was a flat 11-species snapshot that mirrored
_gse_nominal_profile() plus the SUT's own NOx-split, sulfur-balance,
and H2O derivations — a regression in any of those branches would
pass because both sides drifted together. Now hoists the 5 primary
per-LTO constants (CO2/NOx/HC/CO/PM10 for WIDE) into named locals
with a provenance comment, and computes the 6 derived species (H2O,
NO/NO2/HONO, SO4/SO2, nvPM) inline from first principles. Shrinks
the self-referential footprint to the 5 nominal constants, which
trace to the AEIC v2 MATLAB legacy; primary external source (Stettler
2011 / AEDT / EDMS) flagged as TODO pending back-trace.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
test_get_gse_emissions_matches_reference_profile now derives the
expected dict from cited nominal constants plus inline algebra.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
test_compute_ground_speed asserted gs == 191.02855126751604 with
rel=1e-4, a value captured from the SUT and re-adjusted when the SUT
changed. Any "independent" recomputation against the real ERA5
fixture duplicates the SUT (ISA pressure lookup + xarray interp +
hypot), so the assertion cannot be made independent in a useful
way. Replaced with a finite + physical envelope (100 < gs < 300 for
TAS=200 m/s). Algorithm correctness is now covered by the synthetic-
fixture tests later in the same file, where wind values are known
and the expected ground speed is derivable on paper. The real-
fixture test stays as an honest smoke check that the on-disk
20240901.nc still parses end-to-end.

test-quality.md: annotated the finding with the actual remediation
approach — neither of the originally-suggested options (a/b) was
viable because both route back through the SUT.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
test_oag_conversion previously checked only SELECT COUNT(*) FROM
flights == 8 on the end-to-end convert_oag_data pipeline. A
regression that inserted 8 malformed rows, failed to populate
schedules, or emitted spurious warnings would pass. Now:

- Full flights-row content for AS 1011 ORD->SEA (row 4 of the CSV,
  first row that survives is_row_valid). Asserts every column sourced
  from the CSV — carrier, flight_number, origin/destination IATA,
  day_of_week_mask, departure/arrival minutes, arrival_day_offset,
  service_type, aircraft_type (inpacft, not genacft), engine_type,
  seat_capacity, effective_from/to as ISO dates, number_of_flights
  (derived: 2 schedules fit the 2019-12-05..11 window at Wed+Thu),
  od_pair, and distance in km (statute miles x 1.609344).
- COUNT(*) FROM schedules > 8 and COUNT(DISTINCT flight_id) == 8 —
  every accepted flight materialised at least one schedule row.
- Warnings file NOT written, because rejected rows in the fixture
  are all filtered silently by is_row_valid (operating=N, service=V,
  stops>0) before distance / airport checks. A regression that
  started emitting spurious warnings on valid rows would fail.

Out of scope (separate Medium COVERAGE-GAP finding): a second
hand-crafted CSV fixture that deliberately triggers each of the four
Warning.Type categories.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
test_oag_conversion now verifies row content, schedules population,
and warnings-file silence — not just the flights count.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The test used a helper (_expected_trajectory_indices) that called
the same BFFM2_EINOx / EI_HCCO / get_SLS_equivalent_fuel_flow /
AtmosphericState functions the SUT uses, then asserted the SUT
outputs matched. A science bug in any of those helpers would be
invisible — both sides inherit it. But the test does legitimately
catch wiring regressions: wrong arg order, swapped EI_HC/EI_CO,
missing NOx speciation, etc.

Full Tier 3 (replace helper with notebook-cited rounded arrays) was
examined and rejected: notebook Sections 2 and 4 use EDB inputs
that don't match DummyPerformanceModel.lto, so citing their arrays
either duplicates TestBFFM2_EINOx (already notebook-grounded) or
requires new notebook cells. Science correctness for BFFM2 / HCCO /
NOx speciation / atmospheric state is already covered independently
by test_emission_functions.py and test_atmospheric_state_and_sls_flow_shapes.

Instead:
- Rename test to test_compute_emissions_pipeline_wiring.
- Add docstring spelling out what it verifies (wiring, algebraic
  identities, NOx speciation partition) and what it does not
  (science correctness — covered elsewhere).
- Annotate the helper as DELIBERATELY SELF-REFERENTIAL.
- Add a mass-conservation identity check no_prop+no2_prop+hono_prop
  ≈ 1.0 — a genuine correctness signal independent of the helper,
  catching the "fractions don't sum to a valid probability split"
  bug class.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Every raise-ValueError branch in PerformanceTable.__post_init__ was
previously unreachable from the test suite. Add a parametrized negative
test that mutates a single cell (or drops one row) from a per-phase
baseline (climb/cruise/descent) to trip each named raise branch, plus
a baselines-load smoke and dedicated mass-count rejects.

Coverage spans all reachable branches: per-phase ROCD-sign mismatches
(descent-not-all-negative, cruise-not-all-zero, climb-some-negative),
per-phase mass-count violations, per-phase coverage gaps for climb /
cruise, and FL-only dependency violations on cruise.tas, climb.tas,
climb.fuel_flow. Descent's coverage and FL-only checks are unreachable
once the mass-count == 1 invariant holds (annotated in the code).

Adapted to split climb/cruise/descent tables landed in MIT-LAE#139.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Close the last open High headline. PerformanceTable.__post_init__ now
has every raise branch exercised, and the orphan bad_performance_model
fixtures have been deleted (the sibling Medium COVERAGE-GAP closes
alongside).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Pin the contract used by `_read_reproducibility_info` when replaying
trajectory-store snapshots: the inner load must succeed under
`Config.escape()` and the canonical singleton must survive untouched.
Previously the branch was unreachable from the test suite, so a broken
escape() would silently corrupt replay.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Cover the property that controls NetCDF axis order: silent reordering
or accidental inclusion of `POINT` would corrupt every emissions /
trajectory file we write. Three cases pin the contract — POINT-only
trajectory, POINT+SPECIES, and the THRUST_MODE branch.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The sqlite connect was passing a bare 'tmp.sqlite' string, which
sqlite3 creates in the current working directory — seeding a stale
0-byte file at the repo root after every run. Move the file (and the
three .nc opens it shares scope with) under `tmp_path` so the test is
self-contained. Drop the dead `safe_*` wrappers while there: the audit
hook fires regardless of whether the open/connect raises, so the
try/except added no coverage.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The bare `except Exception` swallowed any failure — a tmp_path error
or import failure in the worker would have set result to 'FAILED' and
silently passed the assertion. The guard the test was actually
exercising is `TrajectoryStore.active_in_thread` at store.py:378, so
capture the exception instance and assert the RuntimeError class and
the guard message substring directly.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Capture each mission's `total_emissions` dict before writing and pin
the per-species values after reopen, so a serialization bug that
silently dropped a species or rounded the totals would be caught. The
prior `hasattr(traj, 'total_emissions')` check was satisfied just by
the field's presence in the schema.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
ian-ross and others added 22 commits April 27, 2026 13:37
`profile_first.mass is profile_second.mass` only proved
`functools.cache` returned the same object — a cache that stored an
empty placeholder twice would have passed. Add positivity on a
TAKEOFF canary (the highest-thrust mode, where mass/number are
largest) plus a per-mode finite check covering the full ThrustMode
enum.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
MIT-LAE#2)

The `total[s] == sum(components)` identity in
`test_sum_total_emissions_matches_components` would pass even if a
regression silently zeroed one of the four component buckets — the
matching zero on the total side would still satisfy the equality.
Pin a precondition that CO2 (a species every component must emit
under the default fuel config) is positive in each of the
trajectory / LTO / APU / GSE buckets before running the identity
check.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
`test_intercept_adjustment_uses_second_mode_value` only ever landed
in the horizontal-line upper segment because its calibration
parameters forced the intercept-adjustment branch in `hcco.py` to
clamp `x_intercept` to `log_ff_cal1`. Add a focused
`test_branches_split_at_intercept` that:

- Uses a strictly-decreasing-with-flow `x_EI` (so the log-log slope
  is finite and negative, the intercept lands inside the
  calibration range, and the slanted formula actually runs).
- Calls EI_HCCO with a single low-flow point, a single high-flow
  point, and a mixed array, asserting each lands in the correct
  segment.
- Pins the upper-segment value to the geometric-mean
  closed form (sqrt(xEI[CLIMB] * xEI[TAKEOFF])) and the lower-segment
  value as strictly greater than that geometric mean (the
  decreasing-with-flow signal).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ase 4 Low MIT-LAE#4)

The `@patch('AEIC.emissions.utils.get_thrust_cat_cruise')` in
`test_thrust_categorization` was patching the wrong symbol — `ei/nox.py`
imports the name into its own module, so the bound reference
inside `BFFM2_EINOx` was untouched and the SUT ran unmocked. The
test reduced to a finiteness smoke that already had a sibling
covering it, so any regression would have slipped through both
sides.

Drop the mock and pin the real per-point category-to-speciation
identity:
- pick an evaluation-flow array spanning all three categories;
- recompute the categories independently via
  `get_thrust_cat_cruise(...)`, look up `NOx_speciation()` per
  category, and assert `BFFM2_EINOx`'s `noProp` / `no2Prop` /
  `honoProp` arrays match;
- guard with `len(set(cats.data)) >= 2` so a regression that
  collapsed categorization to a constant bucket would still trip
  the test.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The TestGetAPUEmissions class fixed a slate of arbitrary APU
parameters (fuel_kg_per_s=0.1, NOx_g_per_kg=15.0, apu_time=2854,
etc.) with no comment about their provenance. They were chosen to
make outputs numerically distinguishable, not lifted from any
published APU dataset, but a future reader would have no signal of
that. Add a class-level docstring stating the synthetic origin and
warning against porting the numbers as authoritative. The
notebook-expansion suggestion in the audit is left out of scope —
it's already on the notebook-index gap list separately.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…-LAE#6)

`test_engine_type_scaling_and_invalid_smoke_numbers` mixed two
unrelated branches. The engine-scaling half inline-computed
CBC/AFR/kslm/Q against the SUT formula — a tautological copy that
would pass even if the SUT regressed and the test recomputed the
regression. The invalid-SN half checked only mass-channel zeroing.

Split into:

- `test_scope11_engine_type_scaling`: pins only the qualitative
  MTF > TF > 0 invariant at IDLE (the bypass-correction signal).
  Numerically grounding the formula against the notebook is
  already done by `test_SCOPE11_unit_test`, so re-sourcing values
  here would just duplicate that test.
- `test_scope11_invalid_smoke_numbers_return_zero`: covers both
  engine types and pins zeroing on both the mass *and* number
  channels — a refactor that propagated -1.0 / 0.0 through to
  produce NaN would otherwise slip through a mass-only check.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…IT-LAE#7)

The 3-value expected `NOxEI` array had no provenance comment, so a
future reader couldn't tell whether it was a notebook-sourced
reference or a self-generated snapshot — and the test name
("consistency") suggested the former. Add a docstring stating it's
a regression snapshot, *not* an independent reference, and pointing
at the test that does carry the notebook-sourced correctness check
(`test_matches_reference_component_values`). Inline comment by the
assertion repeats the marker so a casual diff sees it. Switched to
`np.testing.assert_allclose` so a failure reports the diff.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…-LAE#8)

`test_get_EDB_data_for_engine_raises_when_uid_absent` only covered
the case where the UID was missing from both sheets — the SUT then
raises about the gaseous sheet first, so the symmetric
"present in gaseous, absent from nvPM" branch was unreached.
Parametrize across both scenarios with the expected per-sheet
error message pinned via a regex match.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The positive-path EDB test was a single function with ~14 sequential
asserts. A regression in (e.g.) the nvPM-num column extraction would
fail at the first line and mask whatever else had broken in the same
parser change. Convert to a parametrization keyed by `(attr, expected)`
tuples so each attribute lands on its own test ID; share the parsed
`EDBEntry` via a function-scoped fixture (function scope is required
because `Config` is loaded by the autouse `default_config` fixture).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…-LAE#1)

The test appended 30 fully-specified points but only checked
`len == 30` and the three phase counters. A regression that
silently zeroed every per-point array on `fix()` would have
passed. Add spot-checks after `fix()`:

- `fuel_flow` constant at 1.4 across phases;
- `latitude` follows the input pattern (41.0 + 0.02*i within each
  phase, so check index 0 and a non-trivial cross-phase index);
- altitude trace matches climb 0→9000, cruise flat at 10000,
  descent 10000→1000;
- aircraft mass decreases monotonically within each phase (the
  setup resets the mass formula at each phase so cross-phase
  monotonicity would not hold).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Two small unit tests for the previously-untested branches in
`Trajectory.interpolate_time` and `Trajectory.copy_point`:

- `test_interpolate_time_out_of_bounds_yields_nan` pins the
  `left=nan / right=nan` contract for queries below and above the
  existing `flight_time` range, with finite values at the
  boundaries and inside.
- `test_copy_point_rejects_out_of_bounds` exercises the two
  IndexError guards (one each for `from_idx` and `to_idx`) on
  both the negative and beyond-extent sides.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…-LAE#4)

`test_maxiters=1000` made the cap effectively unbounded for missions
that converge in a handful of iterations (BOS→LAX sample mission
converges in ~7 cycles against the sample performance model). A
regression that, e.g., stopped converging quadratically would have
run longer rather than failed. Lower to 20 — ~3× headroom over the
empirical baseline — so any meaningful slowdown trips the cap and
raises `RuntimeError`.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
`iterate_mass=True` was tested only at the loose ends of the
`max_mass_iters` axis — generously above (1000) and trivially below
(1). The boundary case where the mission converges *exactly* at the
empirically-minimal `max_mass_iters` was unreached, leaving an
"off-by-one in the loop bound" regression invisible.

Add `test_trajectory_mass_iter_boundary` parametrized over the two
directions:
- `_MIN_ITERS_TO_CONVERGE - 1`: must raise `RuntimeError` with the
  "Mass iteration failed to converge" message;
- `_MIN_ITERS_TO_CONVERGE`: must succeed.

The constant `_MIN_ITERS_TO_CONVERGE = 7` is the empirical baseline
for the BOS→LAX `example_mission` against the sample performance
model at `reltol=1e-6`.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The `set(['true_airspeed'])` magic constant had no comment about why
the final point is skipped. The reason — the legacy MATLAB
trajectory carries an extra post-landing point that the new SUT
doesn't synthesize identically, so TAS at the tail is a guaranteed
false positive — was implicit. Add an explanatory comment and
switch to the `{...}` set-literal form (idiomatic, slightly faster
than the `set([...])` constructor).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
`process_matlab_csvs` was the only entry into `matlab-output/`
generation and had no direct test — the 67 % coverage on
`verification/legacy.py` was almost entirely this function's lines.
Silent breakage of the (`tdf.t[:-1] != edf.t).any()` consistency
check at line 58 or the per-key groupby split would have corrupted
every downstream verification run unnoticed.

Add three unit tests in `tests/test_matlab_verification.py`:

- `test_process_matlab_csvs_per_mission_split`: happy path with two
  missions — pins the file split (`BOS_LAX_738.csv`, `JFK_ORD_320.csv`),
  sorted-time merge, dropped key columns, and the NaN-on-tail-point
  semantics where the trajectory keeps its post-landing row but the
  emissions side has nothing for it;
- `test_process_matlab_csvs_rejects_inconsistent_time_columns`: pins
  the line-58 raise for mismatched time columns;
- `test_process_matlab_csvs_rejects_missing_out_dir`: pins the
  line-28 raise for an absent output directory.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
If `sample_missions` is extended without rebuilding the golden
NetCDF, the existing `comparison_ts[idx]` lookup would raise
`IndexError` partway through the loop with a stack trace that didn't
make the actual cause obvious. Add the precondition assertion before
the loop and have the failure message name the regeneration script
(`scripts/make_golden_test_data.py`) so a contributor hitting it
knows what to do next.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Original `@contextmanager` over `@staticmethod` was unusual:
`staticmethod` produces a descriptor, which `contextmanager` then
wraps. Whether that works on every supported Python version is
non-obvious — and the existing
`test_escape_allows_re_validation_without_overwriting_singleton`
test calls `Config.escape()` from a class context where the
descriptor protocol is involved, so a future descriptor-handling
change could surface here.

`@staticmethod` outermost is the conventional order: contextmanager
decorates the function first to produce a context-manager factory,
and staticmethod wraps that callable for class-binding purposes.

The existing escape test (Phase 1 medium) covers this path and
still passes.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Two related fixes in pyproject.toml's pytest configuration:

1. **Section name**: pytest reads `[tool.pytest.ini_options]`, not
   `[tool.pytest]`. The existing `[tool.pytest]` section was being
   silently ignored — including the numpy `filterwarnings` filter.
   Move to the canonical name so both that filter and the new
   `testpaths` directive actually take effect.
2. **testpaths = ["tests"]**: stops collection from descending into
   sibling worktrees / checkouts (e.g.
   `pretty-performance-model-toml/`, `trajectories-to-grid/`) whose
   own `tests/conftest.py` collides with this repo's
   `tests.conftest` module name and prevented `uv run pytest` from
   the repo root from collecting at all.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The function silently `continue`s on `SN == -1` or `SN == 0`,
leaving the corresponding thrust mode at 0.0 in both `mass` and
`number`. This was undocumented public behavior — the sentinel
treatment was discoverable only by reading the loop body or by
running the Phase 4 invalid-smoke-numbers test. Spell out the
contract in the docstring so callers know to disambiguate
"skipped because input was a sentinel" from "computed as 0.0" if
they need to.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
`Builder.fly()`'s `except Exception` block called `print(...)` with
the failing flight's identity before re-raising. In a production run
across millions of missions this would unconditionally dump to
stdout per failure with no filtering and no log-level control.
Switch to `logger.exception(...)` via a module-level
`logging.getLogger(__name__)`, so callers can configure verbosity
through Python's logging machinery. The exception is still re-raised
so caller-side handling is unchanged.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
`Container.__eq__` and `approx_eq` both used `isinstance(vs, str |
None | …)` to fall into the direct-equality arm. PEP 604 unions are
accepted here on Python 3.12 (the supported version), but the form
is non-obvious and was flagged as fragile on other CPython versions.
Replace with concrete `(str, type(None), …)` tuples at both sites
for clarity and version-portability.

Add `test_container_equality_handles_str_and_none_fields` to
tests/test_storage.py to pin the round-trip: `Trajectory.name` is
the only `str | None` field in the schema today, so the test stuffs
it with both legs (str-vs-str, None-vs-None, mixed) and exercises
both `__eq__` and `approx_eq`.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The legacy table was split into per-phase tables in MIT-LAE#139, so the
prior `test_performance_table_subsetting` (which extended the
combined-table `PerformanceTable.subset()` API) is no longer the
right shape — `subset()` itself is gone. Re-pin the contracts that
test was reaching for, directly on the sample model:

  - cruise (ROCDFilter.ZERO) is non-empty and all rows satisfy
    `|rocd| <= ZERO_ROCD_TOL` — it backs SimpleFlightRules.CRUISE
    and is the most-traversed path in trajectory evaluation;
  - climb has exactly 3 masses; descent has exactly 1 — the BADA-3
    shape. The validator's `mass_ok = len(self.mass) in (2, 3)`
    rule for climb / cruise allows the sample to silently drift to
    2 if a TOML edit dropped one mass; this test surfaces that.

Folds in the intent of the dropped `Cover ROCDFilter.ZERO in
test_performance_table_subsetting` and `Pin BADA mass-count contract
in test_performance_table_subsetting` commits, retargeted at the
new per-phase API.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR is a broad test-suite audit/remediation for AEIC ahead of the V1 release, tightening assertions, removing self-referential expected values, expanding branch coverage, and making a few small SUT fixes discovered during the audit (config/pytest setup, logging, docstrings, and runtime type guards).

Changes:

  • Strengthens and expands tests across weather, trajectories/storage, missions DB/OAG ingestion, performance models/tables, emissions, and verification; adds deterministic seeding and clearer invariants.
  • Adds pytest suite hygiene: --run-slow gating, @pytest.mark.forked for NetCDF-touching tests, and fixes pytest configuration in pyproject.toml so collection is constrained to tests/.
  • Includes small SUT adjustments observed during the audit (logger usage in builders, Config.escape decorator order, docstring fixes, Container isinstance guards, nvPM sentinel documentation).

Reviewed changes

Copilot reviewed 36 out of 37 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
tests/test_weather.py Makes ground-speed test a real-fixture smoke test + adds targeted unit tests for azimuth override and domain errors.
tests/test_trajectory_simulation.py Tightens trajectory invariants, seeds RNG, verifies NetCDF field round-trip, adds weather-effect assertion, and adds mass-iteration boundary + stub-builder tests.
tests/test_trajectories.py Adds round-trip spot checks for point fields and pins out-of-bounds behavior for interpolation/copy.
tests/test_thrust_modes.py Expands coverage of ThrustModeValues operations, constructor shapes, immutability, broadcasting, and ThrustModeArray validation.
tests/test_storage.py Strengthens TrajectoryStore argument validation tests, adds numeric equality checks on reopen, gates slow test behind --run-slow, improves threading guard test, and improves file access recorder test realism.
tests/test_performance_table.py Adds thorough PerformanceTable validation/negative-path coverage and interpolation behavior checks using the sample model.
tests/test_performance_model.py Adds deeper content checks for initialized models, selector behavior (including warnings), and selector caching behavior.
tests/test_model_utilities.py Adds CI model normalization edge-case coverage including top-level list normalization with RootModel.
tests/test_mission_db_creation.py Makes OAG conversion tests assert concrete DB row content and adds warning-category coverage.
tests/test_mission_db.py Refactors filter tests into parametrized form, pins deterministic sampling SQL, expands query/result coverage, and adds Mission/QueryResult conversion tests.
tests/test_matlab_verification.py Documents final-point skipping rationale and adds unit tests for process_matlab_csvs splitting/consistency checks.
tests/test_legacy_verification.py Removes legacy script-style verification test file from the suite.
tests/test_golden.py Clarifies golden snapshot semantics, adds precondition for golden length, and adds unit-envelope tripwires.
tests/test_emissions_storage.py Verifies emissions totals/indices round-trip through storage.
tests/test_emissions.py Strengthens species set checks, adds ISA-independent reference helper, and expands coverage for APU disabled path and LTO/climb-descent mode behavior.
tests/test_emission_functions.py Fixes no-op assertion, improves tolerance handling, removes ineffective patching, adds branch coverage and clearer provenance notes.
tests/test_edb.py Parameterizes sheet-absence error paths and makes engine-attribute checks granular and fixture-backed.
tests/test_dimensions.py Adds coverage for netCDF dim-name ordering, removal semantics, and dim-name parsing; splits str/repr into its own test.
tests/test_config.py Improves clarity around fixture override, pins concrete defaults, adds Config.escape behavior test, and pins missing default-data file branch.
tests/data/weather/README.md Documents the ERA5 weather fixture provenance and coverage.
tests/data/verification/legacy/README.md Documents the MATLAB legacy verification dataset layout/provenance for the verification tests.
tests/data/oag/README.md Documents the OAG CSV extract used for conversion tests.
tests/data/missions/oag-2019-test-subset.README.md Documents the SQLite test DB fixture contents and expected query counts.
tests/conftest.py Adds --run-slow flag, registers slow marker, and skips slow tests by default; clarifies AEIC_PATH import-time behavior.
test-quality-notebook-index.md Adds an index mapping notebook sections to tests and provenance expectations.
src/AEIC/weather.py Fixes docstring typos for azimuth parameter.
src/AEIC/trajectories/builders/base.py Replaces print() on exception path with module logger + stacktrace (logger.exception).
src/AEIC/storage/container.py Replaces PEP604 union usage in isinstance checks with tuples for runtime correctness.
src/AEIC/emissions/ei/nvpm.py Documents -1/0 smoke number sentinel behavior and its effect on outputs.
src/AEIC/data/engines/README.md Documents provenance/intent for the sample EDB workbook fixture.
src/AEIC/config/core.py Fixes decorator order on Config.escape so @contextmanager wraps the underlying function correctly.
pyproject.toml Moves pytest config to [tool.pytest.ini_options] and restricts collection to tests/.
notebooks/test-cases.ipynb Adds “Rounded results for use in test” blocks for multiple sections to improve provenance.
tests/data/performance/bad_performance_model_2.toml Removes large legacy bad-model fixture.
tests/data/performance/bad_performance_model_3.toml Removes large legacy bad-model fixture.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread tests/test_weather.py Outdated
Comment thread tests/data/weather/README.md Outdated
Comment thread tests/data/oag/README.md Outdated
Comment thread tests/data/oag/README.md Outdated
- tests/test_weather.py: fix fixture filename in comment (2024-09-01.nc)
- tests/data/weather/README.md: replace stale pytest.approx note with
  envelope-based assertion strategy description
- tests/data/oag/README.md: align provenance and expected-result text
  with actual CSVEntry.is_row_valid() filtering and warnings behavior

Co-authored-by: Claude Opus 4.7 <anthropic-claude-opus-4-7@pi.local>
Pi-Model: anthropic/claude-opus-4-7
@ian-ross ian-ross marked this pull request as ready for review April 28, 2026 09:07
@ian-ross ian-ross merged commit 35f634f into MIT-LAE:main Apr 28, 2026
4 checks passed
@ian-ross ian-ross deleted the tests-review branch April 28, 2026 09:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Review and improve tests for V1 release

2 participants