Skip to content

[RFR] AIPMDM-888: Simplify OSS Metaflow core tests to be more Pythonic / pytest-friendly / tox-friendly#3151

Open
Tingting-Chang wants to merge 59 commits into
masterfrom
AIPMDM-888
Open

[RFR] AIPMDM-888: Simplify OSS Metaflow core tests to be more Pythonic / pytest-friendly / tox-friendly#3151
Tingting-Chang wants to merge 59 commits into
masterfrom
AIPMDM-888

Conversation

@Tingting-Chang
Copy link
Copy Markdown

@Tingting-Chang Tingting-Chang commented Apr 27, 2026

PR Type

  • Bug fix
  • New feature
  • Core Runtime change (higher bar -- see CONTRIBUTING.md)
  • Docs / tooling
  • Refactoring following JIRA

Summary

Issue

Implements AIPMDM-888: refactor test/core/ to be compatible with standard Python tooling (pytest, tox) so contributors can run tests without understanding the custom orchestration layer.

R1: Replace contexts.json with tox environments and pytest fixtures

This change removes contexts.json, which duplicated an environment matrix that tox already supports natively.

  • test/core/tox.ini is added as the source of truth for core test environments. It defines one testenv:core-* per infrastructure backend: local, GCS, Azure, Batch, K8s, Argo, and SFN.
  • Each tox env now defines its test context through setenv, including Metaflow configuration (datastore, metadata service, credentials) and METAFLOW_CORE_* control variables (marker, top-level options, executors, disabled tests).
  • Shared settings are deduplicated through {[testenv]setenv} inheritance, with a _disabled section for common disabled-test lists.
  • test/core/conftest.py now reads all context from os.environ and uses pytest_generate_tests to parametrize (graph, test, executor) combinations. No Python context file is imported anywhere.
  • Checker selection has moved from the METAFLOW_CORE_CHECKS environment variable into a proper session-scoped core_checks fixture, which can now be overridden per directory without touching tox.
  • The root tox.ini no longer contains core-* envs and now points users to test/core/tox.ini.

R2: Eliminate the custom test runner

  • run_tests.py (643 lines) is deleted. Test execution is now handled entirely by pytest.

  • test/core/conftest.py now defines _iter_graphs() and _iter_tests() directly instead of importing them from run_tests.py.

  • test/core/test_core_pytest.py now defines _run_flow() directly in place of run_test(). It supports cli, api, and scheduler executors.

  • This also fixes two existing bugs:

    • the api executor now catches RuntimeError from Runner.run() and converts it into a non-zero return code instead of surfacing an unhandled exception
    • the resume path now returns early when the resume subprocess fails, avoiding a follow-on FileNotFoundError from open("run-id")

R3: Convert test flows to standard pytest tests

The core test suite now behaves like normal pytest code instead of relying on a subprocess-heavy custom harness.

  • MetaflowTest has been renamed to FlowDefinition across all 64 test classes and in metaflow_test/__init__.py.
  • The Test suffix is removed because these classes are flow templates combined with graph topologies by FlowFormatter, not pytest test cases.
  • A MetaflowTest = FlowDefinition alias is kept for external compatibility.
  • Verification now runs in-process instead of in a second subprocess. _run_flow() dynamically imports the generated test_flow.py, instantiates the flow class, and calls formatter.test.check_results(flow, checker) directly.
  • Check failures now surface as normal AssertionErrors with full pytest tracebacks instead of opaque subprocess exit codes.
  • FlowFormatter._check_lines() and check_code are removed.
  • MetaflowCheck no longer depends on sys.argv: run_id and cli_options are now explicit constructor parameters.
  • new_checker now accepts either a checker class or a checker class name.
  • 243 assert_equals(a, b) calls across 54 files are replaced with plain assert a == b, enabling pytest assertion rewriting and better failure output.
  • 10 uses of assert_exception(lambda: f(), E) are replaced with pytest.raises(E) in tag_mutation, merge_artifacts, merge_artifacts_include, and metadata_check.
  • assert_equals_metadata is removed and replaced with inline assertions in resume_end_step.py.

R4: Simplify test utilities

Test helpers are reduced to standard pytest patterns wherever possible.

  • assert_equals, assert_exception, and assert_equals_metadata are removed from metaflow_test/__init__.py.
  • ExpectationFailed, AssertArtifactFailed, AssertLogFailed, and AssertCardFailed now subclass AssertionError, so pytest reports them natively.
  • assert_artifact, assert_log, and assert_card are rewritten to use plain assert internally rather than manually raising custom exceptions.
  • artifact(step, name) is added to both CliCheck and MetadataCheck, returning {task_id: value} so tests can make direct assertions such as assert checker.artifact(step, "data") == {"task1": "abc"}.
  • test/core/pytest.ini is added to centralize pytest configuration, including norecursedirs, timeout = 1800, addopts = -v --tb=short, and the seven backend markers. Tox command lines now only need to pass the marker flag and parallelism settings.

R5: tox is now the orchestration layer

Core test environments can now be run directly with tox, without any custom orchestration layer:

tox -c test/core/tox.ini -e core-local   # local filesystem
tox -c test/core/tox.ini -e core-gcs     # GCS via fake-gcs-server
tox -c test/core/tox.ini -e core-azure   # Azure Blob via Azurite
tox -c test/core/tox.ini -e core-batch   # AWS Batch via localbatch + MinIO

GCS emulator support

This PR also adds first-class support for running against a local GCS emulator.

  • metaflow/plugins/gcp/gs_storage_client_factory.py now creates an anonymous storage.Client() when STORAGE_EMULATOR_HOST is set, instead of calling google.auth.default(). This allows flows to run against fake-gcs-server without real GCP credentials.
  • devtools/ now includes fake-gcs-server as a first-class service, with Kubernetes deployment and service definitions, bucket-init job, secret, a dedicated Tilt file, and integration into the main Tiltfile and pick_services.sh.
  • The emulator can be started with SERVICES_OVERRIDE=fake-gcs-server make up.
  • core-gcs and core-azure now set METAFLOW_DEFAULT_DATASTORE=gs/azure along with the corresponding sysroot and endpoint variables, so flows actually exercise cloud storage code paths against local emulators. This matches the existing core-batch pattern with MinIO.

Test Plan

tox -c test/core/tox.ini -e core-local — 470 tests collected and passing
tox -c test/core/tox.ini -e core-gcs — requires fake-gcs-server at localhost:4443 (SERVICES_OVERRIDE=fake-gcs-server make up)
tox -c test/core/tox.ini -e core-azure — requires Azurite at localhost:10000
tox -c test/core/tox.ini -e core-batch, core-k8s, core-argo, core-sfn — require the full devtools stack

Runtime:

Commands to run:

# paste exact commands

Where evidence shows up:

Before (error / log snippet)
paste here
After (evidence that fix works)
paste here

Root Cause

Why This Fix Is Correct

Failure Modes Considered

Tests

  • Unit tests added/updated
  • Reproduction script provided (required for Core Runtime)
  • CI passes
  • If tests are impractical: explain why below and provide manual evidence above

Non-Goals

AI Tool Usage

  • No AI tools were used in this contribution
  • [ X ] AI tools were used (describe below)
    • Claude Code

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Apr 27, 2026

Greptile Summary

This PR refactors the Metaflow core test suite (90 files) to replace a 643-line custom orchestration layer (run_tests.py, contexts.json) with standard pytest/tox tooling, renames MetaflowTestFlowDefinition across 64 test classes, and adds first-class GCS emulator support via AnonymousCredentials.

The bulk of the previous review threads have been addressed: import pytest is now added both to tag_mutation.py directly and conditionally to generated flows by the formatter; core-azure/core-gcs now use --datastore=azure/--datastore=gs with the correct non-s3-cloud disabled list (including S3Failure); the _skip_api_executor guard is present in test_flow_triple; the _iter_tests duplicate-class issue is fixed via obj.__module__ == mod.__name__; and the bare except Exception that silenced collection errors is removed. The runner.cleanup() / env-restore ordering issue (raised in a prior thread) is still present but unchanged.

Confidence Score: 4/5

Safe to merge; all previously flagged P1 issues are addressed and the two remaining comments are P2 style/diagnostic suggestions.

No P0 or new P1 findings. The one open P1 from prior threads (env restore skipped if runner.cleanup() raises) is pre-existing and unchanged. Changes are confined to test infrastructure with no runtime behaviour change to the Metaflow library itself, except for the targeted GCS emulator client fix.

test/core/test_core_pytest.py — api executor error-detail capture; test/core/conftest.py — disabled_tests case sensitivity note.

Important Files Changed

Filename Overview
test/core/conftest.py Rewrites test parametrization to read context entirely from env; adds _iter_tests/_iter_graphs discovery with correct __module__ deduplication guard; drops the bare except Exception that previously silenced import errors.
test/core/test_core_pytest.py New _run_flow inlines the old run_tests.py executor logic; fixes the api-executor RuntimeError and resume early-return bugs; runner.cleanup() is in a finally block, but an OSError from cleanup can still prevent env restore (already flagged in prior review).
test/core/tox.ini New orchestration layer with per-backend envs; core-azure/core-gcs now correctly set --datastore=azure/gs, use the non-s3-cloud disabled list, and configure the Azurite/fake-gcs endpoints.
test/core/metaflow_test/formatter.py Conditionally emits import pytest in the generated flow only when step bodies reference it; correctly detects pytest in step source text before yielding the import line.
test/core/metaflow_test/init.py Renames MetaflowTest → FlowDefinition; removes assert_equals/assert_exception helpers; exception hierarchy subclasses AssertionError; new_checker accepts class name string or class object.
metaflow/plugins/gcp/gs_storage_client_factory.py Correctly supplies AnonymousCredentials when STORAGE_EMULATOR_HOST is set, preventing google.auth.default() from running in CI environments without ADC.
test/core/pytest.ini New file centralising pytest config: norecursedirs prevents collecting test flow definitions, --strict-markers enforces the 7 backend marker set, timeout=1800.

Reviews (33): Last reviewed commit: "fix precommit" | Re-trigger Greptile

Comment thread test/core/conftest.py Outdated
Comment thread test/core/conftest.py Outdated
Comment thread test/core/conftest.py Outdated
Comment thread test/core/test_core_pytest.py Outdated
Comment thread test/core/test_core_pytest.py Outdated
Comment thread test/core/tox.ini
Comment thread test/core/tests/tag_mutation.py
Comment thread test/core/metaflow_test/formatter.py
Comment thread test/core/conftest.py Outdated
@Tingting-Chang Tingting-Chang changed the title [WIP] Simplify OSS Metaflow core tests to be more Pythonic / pytest-friendly / tox-friendly [WIP] AIPMDM-888: Simplify OSS Metaflow core tests to be more Pythonic / pytest-friendly / tox-friendly Apr 28, 2026
Comment thread metaflow/plugins/gcp/gs_storage_client_factory.py
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Apr 28, 2026

Want your agent to iterate on Greptile's feedback? Try greploops.

Comment thread test/core/test_core_pytest.py
Comment thread test/core/conftest.py
for d in addl_spec.submodule_search_locations
if os.path.isdir(d)
]
if not new_dirs:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When did this happen? These things seem to keep evolving but I believe the previous case does take care of editable installs.

Copy link
Copy Markdown
Author

@Tingting-Chang Tingting-Chang Apr 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This changes is related to the modern pip package for CardExtensionsImportTest. Details:

The refactor changed which Python process runs the extension discovery: previously run_tests.py was a subprocess whose PYTHONPATH included only test/core/ (so the card packages were found via their finders); the new code calls run_test() in-process with the same PYTHONPATH. In both cases, the finders' find_spec('metaflow_extensions') returns the same broken result. The bug was always there, it was just masked because CardExtensionsImportTest had never actually been run successfully in this environment before core-local was wired up.

An alternative will be install the card packages non-editably in test/core/tox.ini:

# Remove -e prefix for the card extensions      
  {toxinidir}/../../test/extensions/packages/card_via_extinit

Let me WDYT and I can make changes. Thanks

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @talsperre, could you take a look at ^^ and let me know WDYT? Thanks!

Comment thread test/core/tox.ini
@Tingting-Chang Tingting-Chang assigned talsperre and unassigned npow and talsperre Apr 28, 2026
@Tingting-Chang Tingting-Chang changed the title [WIP] AIPMDM-888: Simplify OSS Metaflow core tests to be more Pythonic / pytest-friendly / tox-friendly [RFR] AIPMDM-888: Simplify OSS Metaflow core tests to be more Pythonic / pytest-friendly / tox-friendly Apr 28, 2026
@Tingting-Chang Tingting-Chang force-pushed the AIPMDM-888 branch 2 times, most recently from 10a94cb to b6c165b Compare May 1, 2026 17:15
Comment thread test/core/tox.ini
Comment thread test/core/tox.ini
Comment thread test/core/conftest.py
Comment thread metaflow/_vendor/yaml/reader.py Fixed
Tingting-Chang and others added 20 commits May 2, 2026 06:51
- Reset LocalMetadataProvider._INFO and LocalStorage.datastore_root
  class-level caches in _run_flow() finally block so deleted tempdirs
  from one test don't cause MetaflowNotFound in the next cli test
- Save/restore _LMP._INFO in _isolated_client_globals() for same reason
- Add AWS_ENDPOINT_URL_EVENTBRIDGE to core-sfn tox env so boto3 routes
  DisableRule calls to the local eventbridge_stub instead of real AWS
- Add CI wait step for EventBridge stub readiness before sfn tests

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
DynamoDbClient uses get_aws_client("dynamodb") which reads the standard
botocore env var AWS_ENDPOINT_URL_DYNAMODB — it does not read the old
METAFLOW_SFN_DYNAMO_DB_CLIENT_PARAMS convention.

Replace the injected var so Batch containers running foreach tasks
(save_foreach_cardinality, save_parent_task_id_for_foreach_join,
get_parent_task_ids_for_foreach_join) can reach ddb-local at
host.docker.internal:8765 instead of hitting real AWS DynamoDB.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…duce argo disk

BasicForeach (32 parallel pods) exhausts the 6 GB minikube node for argo and
the GitHub runner RAM for sfn-batch; MergeArtifactsInclude imports pytest at
module level which fails inside python:3.9 containers. Add both to
[_disabled]scheduler so core-argo and core-sfn skip them (matching the existing
cloud exclusion for core-batch/core-k8s).

Reduce core-argo ephemeral disk request from 1024 MB to 50 MB — same value
used by core-k8s — to avoid unnecessary ephemeral storage pressure even for
small foreach tests.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
CardImport tests card extension packages (editable_import_test_card,
non_editable_import_test_card from card_via_extinit/card_via_init) that
are installed in the tox venv but not in batch/k8s container images.
Without those packages, card generation silently fails and the
check_results assertions on card presence would fail.

Matches the pattern already in place for CardExtensionsImport (which tests
the same class of packages and is already disabled for cloud/scheduler).

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
   Argo, k8s TTL+cleanup, dump failures
Bitnami retired both the charts.bitnami.com helm repo index AND the
CDN (charts.bitnami.com/bitnami/postgresql-12.5.6.tgz), so the previous
curl-based workaround silently fails. The helm_remote call in Tilt then
tries helm pull from the live bitnami repo, which no longer lists 12.5.6,
causing the Tiltfile to error and the devstack to never become ready.

Switch postgresql.tiltfile to the bitnami archive branch on GitHub
(archive-full-index) which preserves all historical chart versions.
Update all three CI workflows (core-tests, full-stack-test, ux-tests)
to:
  - add the repo under the name 'postgresql' (matching repo_name in the
    Tiltfile so the Tilt helm cache path aligns)
  - pre-pull via 'helm pull postgresql/postgresql --version 12.5.6'
    instead of the broken CDN curl

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
The previous rule accepted from MINIKUBE_GW (192.168.49.1, the host's own
IP on the minikube Docker bridge) which never matched.  kube-proxy on the
minikube node MASQUERADE's pod traffic to the minikube node IP (192.168.49.2
= minikube ip) before it exits the container, so the host always sees the
source as 192.168.49.2, not 192.168.49.1 or the pod CIDR.

Switch to $(minikube ip) in the ACCEPT rule.  Also make the pre-test
verification hard-fail (exit 1) if localbatch is unreachable from the
minikube container — previously it logged the failure silently and then
all 156 tests failed with cryptic errors instead of one clear step failure.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
localbatch runs Batch containers on the CI runner's Docker daemon using
python:3.10 as the default image.  Without a pre-pull, each container
startup hits Docker Hub (30-60s) before pip install (20s) and code-package
download (10s) — 60-90s total per container.

For graphs with multiple sequential container groups (simple-foreach has
~5, nested-branches has ~4+, simple_switch ~4), the per-test wall-clock
time approaches or exceeds the 600 s scheduler timeout, causing ALL tests
in those graphs to fail with "scheduler run timed out".

Pre-caching python:3.10 in the CI runner's Docker daemon cuts container
startup to ~30s, keeping every graph combination well within the 600 s
budget.  The k8s/argo fix (docker pull + minikube image load) already ran
docker pull for those backends; the sfn/batch step does the same without
the minikube load.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants