[RFR] AIPMDM-888: Simplify OSS Metaflow core tests to be more Pythonic / pytest-friendly / tox-friendly by Tingting-Chang · Pull Request #3151 · Netflix/metaflow

Tingting-Chang · 2026-04-27T02:39:54Z

PR Type

Summary

Issue

Implements AIPMDM-888: refactor test/core/ to be compatible with standard Python tooling (pytest, tox) so contributors can run tests without understanding the custom orchestration layer.

R1: Replace `contexts.json` with tox environments and pytest fixtures

This change removes contexts.json, which duplicated an environment matrix that tox already supports natively.

test/core/tox.ini is added as the source of truth for core test environments. It defines one testenv:core-* per infrastructure backend: local, GCS, Azure, Batch, K8s, Argo, and SFN.
Each tox env now defines its test context through setenv, including Metaflow configuration (datastore, metadata service, credentials) and METAFLOW_CORE_* control variables (marker, top-level options, executors, disabled tests).
Shared settings are deduplicated through {[testenv]setenv} inheritance, with a _disabled section for common disabled-test lists.
test/core/conftest.py now reads all context from os.environ and uses pytest_generate_tests to parametrize (graph, test, executor) combinations. No Python context file is imported anywhere.
Checker selection has moved from the METAFLOW_CORE_CHECKS environment variable into a proper session-scoped core_checks fixture, which can now be overridden per directory without touching tox.
The root tox.ini no longer contains core-* envs and now points users to test/core/tox.ini.

R2: Eliminate the custom test runner

run_tests.py (643 lines) is deleted. Test execution is now handled entirely by pytest.
test/core/conftest.py now defines _iter_graphs() and _iter_tests() directly instead of importing them from run_tests.py.
test/core/test_core_pytest.py now defines _run_flow() directly in place of run_test(). It supports cli, api, and scheduler executors.
This also fixes two existing bugs:
- the api executor now catches RuntimeError from Runner.run() and converts it into a non-zero return code instead of surfacing an unhandled exception
- the resume path now returns early when the resume subprocess fails, avoiding a follow-on FileNotFoundError from open("run-id")

R3: Convert test flows to standard pytest tests

The core test suite now behaves like normal pytest code instead of relying on a subprocess-heavy custom harness.

MetaflowTest has been renamed to FlowDefinition across all 64 test classes and in metaflow_test/__init__.py.
The Test suffix is removed because these classes are flow templates combined with graph topologies by FlowFormatter, not pytest test cases.
A MetaflowTest = FlowDefinition alias is kept for external compatibility.
Verification now runs in-process instead of in a second subprocess. _run_flow() dynamically imports the generated test_flow.py, instantiates the flow class, and calls formatter.test.check_results(flow, checker) directly.
Check failures now surface as normal AssertionErrors with full pytest tracebacks instead of opaque subprocess exit codes.
FlowFormatter._check_lines() and check_code are removed.
MetaflowCheck no longer depends on sys.argv: run_id and cli_options are now explicit constructor parameters.
new_checker now accepts either a checker class or a checker class name.
243 assert_equals(a, b) calls across 54 files are replaced with plain assert a == b, enabling pytest assertion rewriting and better failure output.
10 uses of assert_exception(lambda: f(), E) are replaced with pytest.raises(E) in tag_mutation, merge_artifacts, merge_artifacts_include, and metadata_check.
assert_equals_metadata is removed and replaced with inline assertions in resume_end_step.py.

R4: Simplify test utilities

Test helpers are reduced to standard pytest patterns wherever possible.

assert_equals, assert_exception, and assert_equals_metadata are removed from metaflow_test/__init__.py.
ExpectationFailed, AssertArtifactFailed, AssertLogFailed, and AssertCardFailed now subclass AssertionError, so pytest reports them natively.
assert_artifact, assert_log, and assert_card are rewritten to use plain assert internally rather than manually raising custom exceptions.
artifact(step, name) is added to both CliCheck and MetadataCheck, returning {task_id: value} so tests can make direct assertions such as assert checker.artifact(step, "data") == {"task1": "abc"}.
test/core/pytest.ini is added to centralize pytest configuration, including norecursedirs, timeout = 1800, addopts = -v --tb=short, and the seven backend markers. Tox command lines now only need to pass the marker flag and parallelism settings.

R5: tox is now the orchestration layer

Core test environments can now be run directly with tox, without any custom orchestration layer:

tox -c test/core/tox.ini -e core-local   # local filesystem
tox -c test/core/tox.ini -e core-gcs     # GCS via fake-gcs-server
tox -c test/core/tox.ini -e core-azure   # Azure Blob via Azurite
tox -c test/core/tox.ini -e core-batch   # AWS Batch via localbatch + MinIO

GCS emulator support

This PR also adds first-class support for running against a local GCS emulator.

metaflow/plugins/gcp/gs_storage_client_factory.py now creates an anonymous storage.Client() when STORAGE_EMULATOR_HOST is set, instead of calling google.auth.default(). This allows flows to run against fake-gcs-server without real GCP credentials.
devtools/ now includes fake-gcs-server as a first-class service, with Kubernetes deployment and service definitions, bucket-init job, secret, a dedicated Tilt file, and integration into the main Tiltfile and pick_services.sh.
The emulator can be started with SERVICES_OVERRIDE=fake-gcs-server make up.
core-gcs and core-azure now set METAFLOW_DEFAULT_DATASTORE=gs/azure along with the corresponding sysroot and endpoint variables, so flows actually exercise cloud storage code paths against local emulators. This matches the existing core-batch pattern with MinIO.

Test Plan

tox -c test/core/tox.ini -e core-local — 470 tests collected and passing
tox -c test/core/tox.ini -e core-gcs — requires fake-gcs-server at localhost:4443 (SERVICES_OVERRIDE=fake-gcs-server make up)
tox -c test/core/tox.ini -e core-azure — requires Azurite at localhost:10000
tox -c test/core/tox.ini -e core-batch, core-k8s, core-argo, core-sfn — require the full devtools stack

Runtime:

Commands to run:

# paste exact commands

Where evidence shows up:

Before (error / log snippet)

paste here

After (evidence that fix works)

paste here

Root Cause

Why This Fix Is Correct

Failure Modes Considered

Tests

Unit tests added/updated
Reproduction script provided (required for Core Runtime)
CI passes
If tests are impractical: explain why below and provide manual evidence above

Non-Goals

AI Tool Usage

No AI tools were used in this contribution
[ X ] AI tools were used (describe below)
- Claude Code

greptile-apps · 2026-04-27T02:42:49Z

Greptile Summary

This PR refactors the Metaflow core test suite (90 files) to replace a 643-line custom orchestration layer (run_tests.py, contexts.json) with standard pytest/tox tooling, renames MetaflowTest → FlowDefinition across 64 test classes, and adds first-class GCS emulator support via AnonymousCredentials.

The bulk of the previous review threads have been addressed: import pytest is now added both to tag_mutation.py directly and conditionally to generated flows by the formatter; core-azure/core-gcs now use --datastore=azure/--datastore=gs with the correct non-s3-cloud disabled list (including S3Failure); the _skip_api_executor guard is present in test_flow_triple; the _iter_tests duplicate-class issue is fixed via obj.__module__ == mod.__name__; and the bare except Exception that silenced collection errors is removed. The runner.cleanup() / env-restore ordering issue (raised in a prior thread) is still present but unchanged.

Confidence Score: 4/5

Safe to merge; all previously flagged P1 issues are addressed and the two remaining comments are P2 style/diagnostic suggestions.

No P0 or new P1 findings. The one open P1 from prior threads (env restore skipped if runner.cleanup() raises) is pre-existing and unchanged. Changes are confined to test infrastructure with no runtime behaviour change to the Metaflow library itself, except for the targeted GCS emulator client fix.

test/core/test_core_pytest.py — api executor error-detail capture; test/core/conftest.py — disabled_tests case sensitivity note.

Important Files Changed

Filename	Overview
test/core/conftest.py	Rewrites test parametrization to read context entirely from env; adds `_iter_tests`/`_iter_graphs` discovery with correct `__module__` deduplication guard; drops the bare `except Exception` that previously silenced import errors.
test/core/test_core_pytest.py	New `_run_flow` inlines the old run_tests.py executor logic; fixes the api-executor RuntimeError and resume early-return bugs; runner.cleanup() is in a finally block, but an OSError from cleanup can still prevent env restore (already flagged in prior review).
test/core/tox.ini	New orchestration layer with per-backend envs; core-azure/core-gcs now correctly set --datastore=azure/gs, use the non-s3-cloud disabled list, and configure the Azurite/fake-gcs endpoints.
test/core/metaflow_test/formatter.py	Conditionally emits `import pytest` in the generated flow only when step bodies reference it; correctly detects `pytest` in step source text before yielding the import line.
test/core/metaflow_test/init.py	Renames MetaflowTest → FlowDefinition; removes assert_equals/assert_exception helpers; exception hierarchy subclasses AssertionError; new_checker accepts class name string or class object.
metaflow/plugins/gcp/gs_storage_client_factory.py	Correctly supplies AnonymousCredentials when STORAGE_EMULATOR_HOST is set, preventing google.auth.default() from running in CI environments without ADC.
test/core/pytest.ini	New file centralising pytest config: norecursedirs prevents collecting test flow definitions, --strict-markers enforces the 7 backend marker set, timeout=1800.

_{Reviews (33): Last reviewed commit: "fix precommit" | Re-trigger Greptile}

greptile-apps · 2026-04-28T04:40:12Z

Want your agent to iterate on Greptile's feedback? Try greploops.

romain-intel · 2026-04-28T08:33:46Z

                        for d in addl_spec.submodule_search_locations
                        if os.path.isdir(d)
                    ]
+                    if not new_dirs:


When did this happen? These things seem to keep evolving but I believe the previous case does take care of editable installs.

This changes is related to the modern pip package for CardExtensionsImportTest. Details:

The refactor changed which Python process runs the extension discovery: previously run_tests.py was a subprocess whose PYTHONPATH included only test/core/ (so the card packages were found via their finders); the new code calls run_test() in-process with the same PYTHONPATH. In both cases, the finders' find_spec('metaflow_extensions') returns the same broken result. The bug was always there, it was just masked because CardExtensionsImportTest had never actually been run successfully in this environment before core-local was wired up.

An alternative will be install the card packages non-editably in test/core/tox.ini:

# Remove -e prefix for the card extensions {toxinidir}/../../test/extensions/packages/card_via_extinit

Let me WDYT and I can make changes. Thanks

Hi @talsperre, could you take a look at ^^ and let me know WDYT? Thanks!

- Reset LocalMetadataProvider._INFO and LocalStorage.datastore_root class-level caches in _run_flow() finally block so deleted tempdirs from one test don't cause MetaflowNotFound in the next cli test - Save/restore _LMP._INFO in _isolated_client_globals() for same reason - Add AWS_ENDPOINT_URL_EVENTBRIDGE to core-sfn tox env so boto3 routes DisableRule calls to the local eventbridge_stub instead of real AWS - Add CI wait step for EventBridge stub readiness before sfn tests Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

DynamoDbClient uses get_aws_client("dynamodb") which reads the standard botocore env var AWS_ENDPOINT_URL_DYNAMODB — it does not read the old METAFLOW_SFN_DYNAMO_DB_CLIENT_PARAMS convention. Replace the injected var so Batch containers running foreach tasks (save_foreach_cardinality, save_parent_task_id_for_foreach_join, get_parent_task_ids_for_foreach_join) can reach ddb-local at host.docker.internal:8765 instead of hitting real AWS DynamoDB. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

…duce argo disk BasicForeach (32 parallel pods) exhausts the 6 GB minikube node for argo and the GitHub runner RAM for sfn-batch; MergeArtifactsInclude imports pytest at module level which fails inside python:3.9 containers. Add both to [_disabled]scheduler so core-argo and core-sfn skip them (matching the existing cloud exclusion for core-batch/core-k8s). Reduce core-argo ephemeral disk request from 1024 MB to 50 MB — same value used by core-k8s — to avoid unnecessary ephemeral storage pressure even for small foreach tests. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

CardImport tests card extension packages (editable_import_test_card, non_editable_import_test_card from card_via_extinit/card_via_init) that are installed in the tox venv but not in batch/k8s container images. Without those packages, card generation silently fails and the check_results assertions on card presence would fail. Matches the pattern already in place for CardExtensionsImport (which tests the same class of packages and is already disabled for cloud/scheduler). Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

Argo, k8s TTL+cleanup, dump failures

Bitnami retired both the charts.bitnami.com helm repo index AND the CDN (charts.bitnami.com/bitnami/postgresql-12.5.6.tgz), so the previous curl-based workaround silently fails. The helm_remote call in Tilt then tries helm pull from the live bitnami repo, which no longer lists 12.5.6, causing the Tiltfile to error and the devstack to never become ready. Switch postgresql.tiltfile to the bitnami archive branch on GitHub (archive-full-index) which preserves all historical chart versions. Update all three CI workflows (core-tests, full-stack-test, ux-tests) to: - add the repo under the name 'postgresql' (matching repo_name in the Tiltfile so the Tilt helm cache path aligns) - pre-pull via 'helm pull postgresql/postgresql --version 12.5.6' instead of the broken CDN curl Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

The previous rule accepted from MINIKUBE_GW (192.168.49.1, the host's own IP on the minikube Docker bridge) which never matched. kube-proxy on the minikube node MASQUERADE's pod traffic to the minikube node IP (192.168.49.2 = minikube ip) before it exits the container, so the host always sees the source as 192.168.49.2, not 192.168.49.1 or the pod CIDR. Switch to $(minikube ip) in the ACCEPT rule. Also make the pre-test verification hard-fail (exit 1) if localbatch is unreachable from the minikube container — previously it logged the failure silently and then all 156 tests failed with cryptic errors instead of one clear step failure. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

localbatch runs Batch containers on the CI runner's Docker daemon using python:3.10 as the default image. Without a pre-pull, each container startup hits Docker Hub (30-60s) before pip install (20s) and code-package download (10s) — 60-90s total per container. For graphs with multiple sequential container groups (simple-foreach has ~5, nested-branches has ~4+, simple_switch ~4), the per-test wall-clock time approaches or exceeds the 600 s scheduler timeout, causing ALL tests in those graphs to fail with "scheduler run timed out". Pre-caching python:3.10 in the CI runner's Docker daemon cuts container startup to ~30s, keeping every graph combination well within the 600 s budget. The k8s/argo fix (docker pull + minikube image load) already ran docker pull for those backends; the sfn/batch step does the same without the minikube load. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

greptile-apps Bot reviewed Apr 27, 2026

View reviewed changes

Comment thread test/core/conftest.py Outdated

Comment thread test/core/conftest.py Outdated

Comment thread test/core/conftest.py Outdated

Comment thread test/core/test_core_pytest.py Outdated

Tingting-Chang force-pushed the AIPMDM-888 branch from b4d9457 to 8dc62b3 Compare April 27, 2026 22:50

greptile-apps Bot reviewed Apr 27, 2026

View reviewed changes

Comment thread test/core/test_core_pytest.py Outdated

greptile-apps Bot reviewed Apr 27, 2026

View reviewed changes

Comment thread test/core/tox.ini

greptile-apps Bot reviewed Apr 27, 2026

View reviewed changes

Comment thread test/core/tests/tag_mutation.py

greptile-apps Bot reviewed Apr 28, 2026

View reviewed changes

Comment thread test/core/metaflow_test/formatter.py

Comment thread test/core/conftest.py Outdated

Tingting-Chang added testable ok-to-test labels Apr 28, 2026

Tingting-Chang changed the title ~~[WIP] Simplify OSS Metaflow core tests to be more Pythonic / pytest-friendly / tox-friendly~~ [WIP] AIPMDM-888: Simplify OSS Metaflow core tests to be more Pythonic / pytest-friendly / tox-friendly Apr 28, 2026

greptile-apps Bot reviewed Apr 28, 2026

View reviewed changes

Comment thread metaflow/plugins/gcp/gs_storage_client_factory.py

Tingting-Chang force-pushed the AIPMDM-888 branch from ac6f914 to 94d51ca Compare April 28, 2026 04:32

greptile-apps Bot reviewed Apr 28, 2026

View reviewed changes

Comment thread test/core/test_core_pytest.py

Comment thread test/core/conftest.py

Tingting-Chang assigned npow Apr 28, 2026

romain-intel reviewed Apr 28, 2026

View reviewed changes

greptile-apps Bot reviewed Apr 28, 2026

View reviewed changes

Comment thread test/core/tox.ini

Tingting-Chang force-pushed the AIPMDM-888 branch from fb6a0ff to d171700 Compare April 28, 2026 16:01

Tingting-Chang assigned talsperre and unassigned npow and talsperre Apr 28, 2026

Tingting-Chang requested review from npow and talsperre April 28, 2026 17:43

Tingting-Chang changed the title ~~[WIP] AIPMDM-888: Simplify OSS Metaflow core tests to be more Pythonic / pytest-friendly / tox-friendly~~ [RFR] AIPMDM-888: Simplify OSS Metaflow core tests to be more Pythonic / pytest-friendly / tox-friendly Apr 28, 2026

Tingting-Chang force-pushed the AIPMDM-888 branch 2 times, most recently from 10a94cb to b6c165b Compare May 1, 2026 17:15

greptile-apps Bot reviewed May 1, 2026

View reviewed changes

Comment thread test/core/tox.ini

Comment thread test/core/tox.ini

greptile-apps Bot reviewed May 1, 2026

View reviewed changes

Comment thread test/core/conftest.py

Tingting-Chang added 2 commits May 1, 2026 18:59

reformat

9283e42

tox -e core-local works, fixing core-gcs

a1255ef

Tingting-Chang added 5 commits May 1, 2026 23:54

abserving the error for azura

9628173

try arn:aws

30053c4

fix pre-commit

cd513ae

create container

2b9b406

fix pre-permit errors

9380ad8

github-advanced-security AI found potential problems May 2, 2026

View reviewed changes

Comment thread metaflow/_vendor/yaml/reader.py Fixed

Tingting-Chang added 2 commits May 2, 2026 03:39

fix precommit

b265870

fix precommit

d74cdb1

Tingting-Chang removed request for npow and talsperre May 2, 2026 06:50

Tingting-Chang and others added 20 commits May 2, 2026 06:51

restart azura gcs test

7cef4a3

tests are cancelled, try it again

8f21aff

Ensure minikube IP is routable

6ab7ed7

updated the route in env and extend the timeout limit

52594aa

env issues with aws

22bcf8e

fix MinIO port-forward race and .txt file not in code package for k8s

079a45a

fix SFN iptables, pre-pull python:3.10 for

28c5f79

Argo, k8s TTL+cleanup, dump failures

increase the timeout limit

2b4b1e4

bitnami-archive fix

4ed5cfc

fix precommit

becbbb3

found orphan ci run

af5530d

fix Argo/k8s image pull

e0d5439

trigger CI on current HEAD

ec645a1

Conversation

Tingting-Chang commented Apr 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Type

Summary

Issue

R1: Replace contexts.json with tox environments and pytest fixtures

R2: Eliminate the custom test runner

R3: Convert test flows to standard pytest tests

R4: Simplify test utilities

R5: tox is now the orchestration layer

GCS emulator support

Test Plan

Root Cause

Why This Fix Is Correct

Failure Modes Considered

Tests

Non-Goals

AI Tool Usage

Uh oh!

greptile-apps Bot commented Apr 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 4/5

Important Files Changed

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

greptile-apps Bot commented Apr 28, 2026

Uh oh!

Uh oh!

Uh oh!

romain-intel Apr 28, 2026

Choose a reason for hiding this comment

Uh oh!

Tingting-Chang Apr 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Tingting-Chang May 1, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

Tingting-Chang commented Apr 27, 2026 •

edited

Loading

R1: Replace `contexts.json` with tox environments and pytest fixtures

greptile-apps Bot commented Apr 27, 2026 •

edited

Loading

Tingting-Chang Apr 28, 2026 •

edited

Loading