perf(lineage): viewport virtualization + perf benchmark infra #17600

Draft
acrylJonny wants to merge 5 commits into master from datahub-lineage-improvements-foundation

acrylJonny (Collaborator) commented May 27, 2026

Summary

Adds viewport-based DOM virtualization for the V2 lineage graph, plus an opt-in Playwright benchmark suite for measuring lineage rendering performance at scale and catching regressions.

Why

The V2 lineage graph renders every node and edge regardless of viewport, which makes large graphs (500+ nodes) very expensive to expand, pan, and click. Synthetic measurements on this branch show single-action stalls of 2.4–4.4 s at 500 nodes (expand-fanout, pan-horizontal, click-root, hover-column-root).

What

Backend feature flags (metadata-service/configuration, datahub-graphql-core)

  • lineageGraphPerfVirtEnabled (default true) — server default for auto-applying ReactFlow's onlyRenderVisibleElements above ~50 rendered nodes.
  • lineageGraphPerfOverscanEnabled (default false) — server default for the wider overscan-buffered virt path above ~200 nodes; trades a small pan-cost regression for fewer pop-in artefacts.
  • Both surface through appConfig.featureFlags and AppConfigResolver, and are classified as non-sensitive in PropertiesCollectorConfigurationTest. URL (?lineagePerf=) and localStorage (datahub.lineagePerfFlags) still override per-session for diagnostics.
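
The override behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the contents of `perfFlags.ts`: the comma-separated token syntax (`virt`, `-overscan`), the helper names `parseTokens` and `resolvePerfModes`, and the URL-over-localStorage ordering are all assumptions. The PR only states that URL and localStorage override the server baseline.

```typescript
// Hypothetical shape; the real perfFlags.ts may differ.
type PerfModes = { virt: boolean; overscan: boolean };

// Parse an ASSUMED token syntax such as "virt,-overscan".
// A leading "-" switches a mode off; unknown tokens are ignored.
function parseTokens(raw: string | null): Partial<PerfModes> {
    const out: Partial<PerfModes> = {};
    if (!raw) return out;
    for (const token of raw.split(',')) {
        const trimmed = token.trim();
        const off = trimmed.startsWith('-');
        const name = off ? trimmed.slice(1) : trimmed;
        if (name === 'virt' || name === 'overscan') out[name] = !off;
    }
    return out;
}

// Server defaults are the baseline; localStorage overrides them, and the
// URL overrides both (the URL-vs-localStorage order is an assumption).
function resolvePerfModes(server: PerfModes, search: string, stored: string | null): PerfModes {
    const fromUrl = parseTokens(new URLSearchParams(search).get('lineagePerf'));
    const fromStorage = parseTokens(stored);
    return { ...server, ...fromStorage, ...fromUrl };
}
```

In a browser this might be called as `resolvePerfModes(serverFlags, window.location.search, localStorage.getItem('datahub.lineagePerfFlags'))`, where `serverFlags` is a hypothetical object built from `appConfig.featureFlags`.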

Frontend — lineageV2 (datahub-web-react/src/app/lineageV2)

  • perfFlags.ts resolves modes (virt, overscan) from server defaults, URL, and localStorage with documented precedence; unit-tested in __tests__/perfFlags.test.ts.
  • useOverscanVirt.ts inflates ReactFlow's nodeExtent/viewport bounds by DEFAULT_OVERSCAN_FACTOR so neighbours mount before they scroll into view.
  • LineageVisualization reads server flags via useAppConfig() and threads the resolved modes into ReactFlow.
  • LineageVisualizationContext adds forceMountAll so the screenshot export can temporarily disable virtualization for capture.
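
The overscan idea in `useOverscanVirt.ts` can be illustrated with plain rectangle geometry. The sketch below does not call ReactFlow and is not the hook itself: the `Rect` type, the helper names, the centre-based inflation, and the value `1.5` for `DEFAULT_OVERSCAN_FACTOR` are assumptions made for illustration.

```typescript
interface Rect {
    x: number;
    y: number;
    width: number;
    height: number;
}

// ASSUMED value; the PR does not state the real factor.
const DEFAULT_OVERSCAN_FACTOR = 1.5;

// Grow a rectangle about its centre so each side is `factor` times as long.
function inflateRect(rect: Rect, factor: number = DEFAULT_OVERSCAN_FACTOR): Rect {
    const width = rect.width * factor;
    const height = rect.height * factor;
    return {
        x: rect.x - (width - rect.width) / 2,
        y: rect.y - (height - rect.height) / 2,
        width,
        height,
    };
}

// Axis-aligned overlap test; rectangles that only touch do not count.
function intersects(a: Rect, b: Rect): boolean {
    return a.x < b.x + b.width && b.x < a.x + a.width && a.y < b.y + b.height && b.y < a.y + a.height;
}

// Keep every node that overlaps the inflated viewport, so neighbours just
// outside the visible area are already mounted when the user pans.
function nodesToMount<T extends Rect>(nodes: T[], viewport: Rect, factor: number = DEFAULT_OVERSCAN_FACTOR): T[] {
    const buffered = inflateRect(viewport, factor);
    return nodes.filter((node) => intersects(node, buffered));
}
```

A larger factor mounts more off-screen nodes, which is the trade-off the PR describes: fewer pop-in artefacts in exchange for some extra pan cost.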

Screenshot export (V2 + V3)

  • application.conf CSP: add data: to img-src. html-to-image inlines its serialised SVG as a data: URI before rasterising, so the screenshot button was silently broken under the previous policy. Surfaced by the new screenshot-stress spec.
  • DownloadLineageScreenshotButton (V2 + V3): replace console.error with antd.message.error for user-visible feedback; new focused test for the failure path.

E2E perf benchmark suite (e2e-test/ui/playwright)

  • New tests/lineage-perf/ directory, opt-in via LINEAGE_PERF=1:
    • Journey benchmark across small / chain+columns / filter-hub graphs.
    • Synthetic scaling matrix (100/500/1000 nodes) × {baseline, virt, virt+overscan}.
    • Opt-in screenshot stress (LINEAGE_PERF_SCREENSHOT=1).
    • Opt-in axe-core accessibility audit (LINEAGE_A11Y=1, LINEAGE_A11Y_LARGE=1).
  • LineagePerfRecorder (utils/lineage-perf-collector.ts) records wall time, long tasks, FPS, network request count + bytes per action.
  • lineage-perf-seeder.ts programmatically seeds Dataset, column lineage, DataJob/DataFlow, Chart, and Dashboard graphs at arbitrary scale via ingestProposal.
  • LineagePerfPage (pages/lineage-perf.page.ts) — virt-aware expandFanoutFully helper plus standard navigation steps.
  • LINEAGE_PERF_REPEAT=N runs each scenario N times for variance.
  • scripts/lineage-perf-aggregate.mjs computes p50 / p95 / max bands from the JSON output; wired up as yarn perf:aggregate and yarn perf:aggregate:json. Documented in e2e-test/ui/playwright/README.md.
  • v2-lineage-virt.spec.ts regression test: virt path produces the expected DOM node count for forced-on and forced-off variants.
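
As a rough picture of the per-action bookkeeping a recorder like `LineagePerfRecorder` does, here is a minimal sketch. It is not the class from `utils/lineage-perf-collector.ts`: the names `PerfRecorderSketch`, `ActionSample`, and the `begin`/`end` handle API are hypothetical, FPS sampling is left out, and in a real Playwright run the long-task and network events would have to be fed in from the page.

```typescript
interface ActionSample {
    action: string;
    wallMs: number;
    longTasks: number;
    requestCount: number;
    requestBytes: number;
}

// Hypothetical recorder: collects one sample per named action.
class PerfRecorderSketch {
    readonly samples: ActionSample[] = [];
    private readonly now: () => number;

    // The clock is injectable so the recorder can be tested deterministically.
    constructor(now: () => number = () => Date.now()) {
        this.now = now;
    }

    begin(action: string) {
        const startedAt = this.now();
        const sample: ActionSample = { action, wallMs: 0, longTasks: 0, requestCount: 0, requestBytes: 0 };
        return {
            // Call once per observed long task while the action runs.
            longTask: (): void => {
                sample.longTasks += 1;
            },
            // Call once per completed network request with its size in bytes.
            request: (bytes: number): void => {
                sample.requestCount += 1;
                sample.requestBytes += bytes;
            },
            // Stop the clock, store the sample, and return it.
            end: (): ActionSample => {
                sample.wallMs = this.now() - startedAt;
                this.samples.push(sample);
                return sample;
            },
        };
    }
}
```

Keeping one flat sample per action makes the JSON output easy to aggregate later across repeated runs.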

Risk and rollout

  • Default behaviour for new deployments: virt on (auto-threshold), overscan off. The auto-threshold means small graphs render identically to today.
  • Both flags can be overridden via env var (LINEAGE_GRAPH_PERF_VIRT_ENABLED, LINEAGE_GRAPH_PERF_OVERSCAN_ENABLED) or via URL / localStorage per session.
  • Screenshot CSP fix is independently safe: only widens img-src to also allow data: URIs, which is needed by html-to-image.
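
The auto-threshold behaviour can be sketched as a small decision function. The exact cut-offs, the strict `>` comparison, the rule that overscan only applies when virt is also on, and the name `pickRenderMode` are assumptions; the PR gives only the approximate figures (~50 and ~200 nodes) and the `forceMountAll` escape hatch.

```typescript
type RenderMode = 'baseline' | 'virt' | 'virt+overscan';

// The PR says "~50" and "~200"; the exact values here are assumed.
const VIRT_NODE_THRESHOLD = 50;
const OVERSCAN_NODE_THRESHOLD = 200;

function pickRenderMode(
    nodeCount: number,
    flags: { virt: boolean; overscan: boolean },
    forceMountAll: boolean = false,
): RenderMode {
    // Screenshot capture (forceMountAll), a disabled flag, or a small graph
    // all fall back to rendering every node, as the graph does today.
    if (forceMountAll || !flags.virt || nodeCount <= VIRT_NODE_THRESHOLD) return 'baseline';
    // Assumed: the overscan path is layered on top of virt for larger graphs.
    if (flags.overscan && nodeCount > OVERSCAN_NODE_THRESHOLD) return 'virt+overscan';
    return 'virt';
}
```

With the server defaults (virt on, overscan off), a 30-node graph stays on the baseline path and a 500-node graph takes the virt path.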

Tests

  • New Java unit-test coverage: AppConfigResolverTest, PropertiesCollectorConfigurationTest updated.
  • New frontend unit tests: perfFlags.test.ts, virtualization.sanity.test.tsx, DownloadLineageScreenshotButton.test.tsx.
  • New Playwright suites under tests/lineage-perf/ and tests/lineage-v2/v2-lineage-virt.spec.ts.
  • Verified locally with LINEAGE_PERF=1 yarn perf against a running stack — 12 / 12 active tests pass; a11y + screenshot-stress suites are opt-in.

Checklist

  • PR conforms to the Contributing Guideline (PR title format)
  • Tests added/updated
  • Docs added/updated (e2e-test/ui/playwright/README.md — performance benchmark instructions)
  • No breaking changes (feature flags default to current behaviour at small scale)

Made with Cursor

Adds viewport-based DOM virtualization for the V2 lineage graph, plus
an opt-in Playwright benchmark suite for measuring lineage rendering
performance at scale and catching regressions.

Backend feature flags

- `lineageGraphPerfVirtEnabled` (default `true`) - server default for
  auto-applying `onlyRenderVisibleElements` above ~50 rendered nodes.
- `lineageGraphPerfOverscanEnabled` (default `false`) - server default
  for the wider overscan-buffered virt path above ~200 nodes; trades a
  small pan-cost regression for fewer pop-in artefacts.
- Both are exposed via `appConfig.featureFlags` (`app.graphql`) and
  wired through `AppConfigResolver`. URL (`?lineagePerf=`) and
  localStorage (`datahub.lineagePerfFlags`) still override per-session
  for diagnostics; server values are the baseline.

Frontend (lineageV2)

- `perfFlags.ts` resolves modes (`virt`, `overscan`) from server
  defaults, URL, and localStorage with documented precedence.
- `useOverscanVirt.ts` inflates ReactFlow's `nodeExtent`/`viewport`
  bounds by `DEFAULT_OVERSCAN_FACTOR` so neighbours mount before they
  scroll into view.
- `LineageVisualization` reads server flags via `useAppConfig()` and
  threads the resolved modes into `ReactFlow`.
- `LineageVisualizationContext` adds `forceMountAll` so the screenshot
  export can temporarily disable virtualization for capture.

Screenshot export

- `application.conf` CSP: add `data:` to `img-src`. `html-to-image`
  inlines its serialised SVG as a `data:` URI before rasterising, so
  the screenshot button was silently broken under the previous policy.
  Surfaced by the new screenshot-stress spec.
- `DownloadLineageScreenshotButton` (V2 + V3): replace `console.error`
  with `antd.message.error` for user-visible feedback and add a focused
  test for the failure path.

E2E perf benchmark suite

- New `tests/lineage-perf/` directory (opt-in via `LINEAGE_PERF=1`):
  journey benchmark across small / chain+columns / filter-hub graphs,
  synthetic scaling matrix (100/500/1000 nodes), opt-in screenshot
  stress (`LINEAGE_PERF_SCREENSHOT=1`) and axe-core accessibility
  audit (`LINEAGE_A11Y=1`).
- `LineagePerfRecorder` (`utils/lineage-perf-collector.ts`) records
  wall time, long tasks, FPS, network requests + bytes per action.
- `lineage-perf-seeder.ts` programmatically seeds dataset, column
  lineage, DataJob/DataFlow, Chart, and Dashboard graphs at arbitrary
  scale via `ingestProposal`.
- `LineagePerfPage` (`pages/lineage-perf.page.ts`) - virt-aware
  `expandFanoutFully` helper plus standard navigation steps.
- `LINEAGE_PERF_REPEAT=N` runs each scenario N times for variance.
- `scripts/lineage-perf-aggregate.mjs` computes p50 / p95 / max bands
  from the JSON output; wired up as `yarn perf:aggregate` and
  `yarn perf:aggregate:json`. Documented in playwright README.
- `v2-lineage-virt.spec.ts` regression test: virt path produces the
  expected DOM node count for forced-on and forced-off variants.

Headline numbers (synthetic 500 nodes, forced virt vs baseline):

- expand-fanout     2446 ms -> 65 ms  (-97%)
- pan-horizontal    4367 ms -> 384 ms (-91%)
- click-root        2496 ms -> 39 ms  (-98%)
- hover-column-root 2381 ms -> 328 ms (-86%)

github-actions bot added the product (PR or Issue related to the DataHub UI/UX) and devops (PR or Issue related to DataHub backend & deployment) labels on May 27, 2026

alwaysmeticulous Bot commented May 27, 2026

✅ Meticulous spotted 0 visual differences across 1390 screens tested.

Meticulous evaluated ~10 hours of user flows against your PR.

Last updated for commit 025f636 (ci(perf): build PR-branch images before lineage perf run). This comment will update as new commits are pushed.

ESLint's `rulesdir/no-hardcoded-colors` rejected hex / rgba literals
in two fixtures introduced by the perf benchmark work:

- `DownloadLineageScreenshotButton.test.tsx`: replace the fabricated
  `{ bgSurface: '#fff' }` stub with the real `lightTheme` import, so
  the test inherits the project's semantic token table.
- `stubNodeTypes.tsx`: these stubs render in jsdom without a
  `ThemeProvider`, so swapping the hex literals for theme tokens would
  just resolve to undefined. The colours were pure decoration — the
  virtualisation sanity test asserts on mounted DOM-node counts, not
  visuals — so replace them with neutral CSS keywords (`transparent`,
  `currentColor`, `inherit`). Preserve the prop-driven styled component
  pattern on `Tag` by varying padding instead of background.

Surfaced by CI lint on PR #17600.

codecov Bot commented May 27, 2026

Bundle Report

Changes will increase total bundle size by 3.33kB (0.01%) ⬆️. This is within the configured threshold ✅

Detailed changes

| Bundle name | Size | Change |
| --- | --- | --- |
| datahub-react-web-esm | 23.23MB | 3.33kB (0.01%) ⬆️ |

Affected Assets, Files, and Routes:

Changes for bundle: datahub-react-web-esm

Assets Changed:

| Asset Name | Size Change | Total Size | Change (%) |
| --- | --- | --- | --- |
| assets/index-*.js | 3.33kB | 8.78MB | 0.04% |

Files in assets/index-*.js:

  • ./src/app/lineageV2/useOverscanVirt.ts → Total Size: 2.23kB
  • ./src/app/lineageV2/controls/DownloadLineageScreenshotButton.tsx → Total Size: 2.71kB
  • ./src/app/lineageV3/LineageVisualizationContext.tsx → Total Size: 320 bytes
  • ./src/app/lineageV3/LineageVisualization.tsx → Total Size: 4.28kB
  • ./src/app/lineageV3/controls/DownloadLineageScreenshotButton.tsx → Total Size: 2.07kB
  • ./src/app/lineageV2/LineageVisualization.tsx → Total Size: 5.18kB
  • ./src/appConfigContext.tsx → Total Size: 2.9kB
  • ./src/app/lineageV2/perfFlags.ts → Total Size: 2.25kB
  • ./src/app/lineageV2/LineageVisualizationContext.tsx → Total Size: 249 bytes


codecov Bot commented May 27, 2026

Codecov Report

❌ Patch coverage is 46.84385% with 160 lines in your changes missing coverage. Please review.
✅ All tests successful. No failed tests found.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| ...react/src/app/lineageV3/__perf__/syntheticGraph.ts | 28.67% | 97 Missing ⚠️ |
| ...hub-web-react/src/app/lineageV2/useOverscanVirt.ts | 14.86% | 63 Missing ⚠️ |


Adds three `workflow_dispatch` inputs to `playwright-e2e-tests.yml`:

- `lineage_perf` — scope the run to `tests/lineage-perf/` with
  `LINEAGE_PERF=1`. Forces `shard_count=1` (perf benchmarks need
  stable single-process timing — sharding would split scenarios
  across runners and invalidate cross-variant comparisons) and bumps
  the job timeout from 20 m to 60 m to accommodate the synthetic /
  screenshot / a11y matrices.
- `lineage_perf_screenshot` — opt-in `LINEAGE_PERF_SCREENSHOT=1`
  screenshot stress matrix (100/500/1000 nodes × baseline/virt).
- `lineage_a11y` — opt-in axe-core audit (`LINEAGE_A11Y=1`,
  `LINEAGE_A11Y_LARGE=1`).

When `lineage_perf=true`, also uploads `lineage-perf.json`,
`lineage-screenshot-stress.tsv`, and the a11y JSON artefacts under
the `lineage-perf-results` artifact for later aggregation via
`yarn perf:aggregate`.
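
The aggregation into p50 / p95 / max bands might look roughly like the sketch below. It is not `scripts/lineage-perf-aggregate.mjs`: the nearest-rank percentile method, the `{ action, wallMs }` sample shape, and the function names are assumptions, since the PR does not describe the script's internals or the JSON layout.

```typescript
interface Band {
    p50: number;
    p95: number;
    max: number;
}

// Nearest-rank percentile over an ascending-sorted, non-empty array
// (an ASSUMED method; the real script may interpolate instead).
function percentile(sorted: number[], p: number): number {
    const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
    return sorted[rank - 1];
}

// Group wall-time samples by action name and summarise each group.
function aggregate(samples: { action: string; wallMs: number }[]): Map<string, Band> {
    const byAction = new Map<string, number[]>();
    for (const { action, wallMs } of samples) {
        const list = byAction.get(action) ?? [];
        list.push(wallMs);
        byAction.set(action, list);
    }
    const bands = new Map<string, Band>();
    byAction.forEach((values, action) => {
        const sorted = [...values].sort((a, b) => a - b);
        bands.set(action, {
            p50: percentile(sorted, 50),
            p95: percentile(sorted, 95),
            max: sorted[sorted.length - 1],
        });
    });
    return bands;
}
```

With only a handful of repeats per scenario, p95 under nearest-rank collapses to the maximum, so a larger `LINEAGE_PERF_REPEAT` value is needed before the p95 band says anything distinct.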

The default (non-perf) flow is unchanged — full sharded run of the
standard Playwright suite.

Aikido flagged the new `Run Playwright tests` step as a critical
template-injection risk because it inlined `${{ matrix.shard }}`,
`${{ matrix.shard_count }}`, and `${{ github.event.inputs.lineage_perf }}`
directly inside the shell `run:` block. Even though those values come
from our own setup job (not untrusted external input), the GitHub
Actions security guidance is to always pipe context references through
`env:` so they're never evaluated as part of the shell command.

Move `lineage_perf`, `matrix.shard`, and `matrix.shard_count` into the
step's `env:` and reference them as `$LINEAGE_PERF_INPUT`,
`$MATRIX_SHARD`, `$MATRIX_SHARD_COUNT` from the shell.

No behaviour change.

The lineage_perf workflow_dispatch path was pulling the published
`acryldata/...:quickstart` images, so it benchmarked whatever code happened
to be on master at last release — not the PR. That made the screenshot
stress test fail (missing CSP fix in application.conf) and produced perf
numbers that didn't reflect the virtualization / overscan changes in this
branch.

When lineage_perf=true the job now:
  1. Derives a tag from GITHUB_REF via docker_helpers.sh.
  2. Runs `:docker:buildImagesQuickstart` with the GitHub buildx cache,
     tagging the built images as `acryldata/<image>:<tag>` locally.
  3. Passes the same tag as DATAHUB_VERSION to run-quickstart.sh so
     compose resolves the PR-built images instead of pulling.

Default (non-perf) runs are unchanged — they still pull `:quickstart`.

Job timeout bumped from 60 to 90 minutes to cover the build (~25–30m cold,
faster with cache) on top of the existing 25–30m perf matrix. Build step
has its own 45m timeout so a hung build can't consume the whole job.
Labels

devops (PR or Issue related to DataHub backend & deployment), product (PR or Issue related to the DataHub UI/UX)
