fix: cross-class filelock deadlock in datasets loading #1247

Closed
Luodian wants to merge 7 commits into main from brianli/dev

Luodian commented Mar 10, 2026

Summary

  • Patch FileLockMeta.__call__ with a cross-class singleton cache so filelock.FileLock and datasets.utils._filelock.FileLock return the same instance for the same lock path
  • Fixes RuntimeError: Deadlock during datasets.load_dataset() in distributed eval — filelock 3.25.0's global _registry detects deadlocks across all classes, but is_singleton caches per-class, so two different classes targeting the same .lock file trigger a false deadlock
  • Adds regression test covering both same-class and cross-class singleton scenarios
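
The cross-class singleton idea in the summary can be illustrated with a toy model. This is only a sketch, not the patch itself: `SharedPathSingletonMeta`, `ToyFileLock`, and `ToyDatasetsFileLock` are hypothetical stand-ins for `FileLockMeta`, `filelock.FileLock`, and `datasets.utils._filelock.FileLock`, and the cache is assumed to be a plain dict keyed by absolute path.

```python
import os
import threading


class SharedPathSingletonMeta(type):
    """Toy metaclass: at most one instance per absolute lock path, shared by all classes."""

    # absolute path -> instance (plain dict for brevity; a real patch might hold weak refs)
    _instances = {}
    _guard = threading.Lock()

    def __call__(cls, lock_file, *args, **kwargs):
        key = os.path.abspath(os.fspath(lock_file))
        with SharedPathSingletonMeta._guard:
            instance = SharedPathSingletonMeta._instances.get(key)
            if instance is None:
                # First request for this path: build the object normally and cache it.
                instance = super().__call__(lock_file, *args, **kwargs)
                SharedPathSingletonMeta._instances[key] = instance
            return instance


class ToyFileLock(metaclass=SharedPathSingletonMeta):
    """Hypothetical stand-in for the base lock class."""

    def __init__(self, lock_file):
        self.lock_file = os.fspath(lock_file)


class ToyDatasetsFileLock(ToyFileLock):
    """Hypothetical stand-in for a second lock class defined in another package."""


demo_path = os.path.join("locks", "demo.lock")
same_class_a = ToyFileLock(demo_path)
same_class_b = ToyFileLock(demo_path)
cross_class = ToyDatasetsFileLock(demo_path)
other_path = ToyFileLock(os.path.join("locks", "other.lock"))
```

Because the cache key is the path alone and not the (class, path) pair, both classes resolve to one object for `demo.lock`. A path-keyed holder registry would then see a single lock owner rather than two competing ones. The four instances at the end mirror the same-class and cross-class scenarios that the regression test is described as covering.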

Test plan

  • Cross-class singleton verified inside container (filelock 3.25.0 + datasets 4.x)
  • 31-task loading passes in single-process container test
  • Job 2451 running on 4-node dp32 FluidStack — passed task loading phase (previously deadlocked at this stage)

Luodian and others added 7 commits March 9, 2026 01:27
Merge dummy_video_reader into a single dummy model that serves both use cases:
- Default mode: instant no-op responses for dataset hydration and task smoke tests
- Video-bench mode (read_bytes/decode_num_frames > 0): full IO/decode latency tracking

The old name dummy_video_reader is kept as a MODEL_ALIASES alias for backward compat.
… inputs

SGLang's Engine runs its own Qwen3-VL processor internally. When
lmms-eval pre-tokenized inputs with the HF processor and passed the
expanded input_ids to SGLang, pad tokens were expanded twice, causing
IndexError on image inputs and potential failures on video inputs.

- Image path: pass prompt text directly to Engine.generate() instead of
  pre-tokenized input_ids, letting SGLang handle tokenization end-to-end
- Video path: pass prompt text + video_data to Engine.generate() using
  SGLang's native video support instead of pre-tokenizing and swapping
  video tokens to image tokens
- Fix tools check: use truthy check instead of 'is not None' so empty
  list from disabled MCP does not trigger tool-handling code paths
- Fix tools param: pass tools=None instead of tools=[] to
  apply_chat_template to avoid unexpected preprocessing
- Lazy-import MCP deps: avoid ImportError at module load when mcp
  package is not installed
- Broaden optional metric imports: catch Exception instead of
  ImportError so numpy/spacy binary incompatibilities do not crash
  metric aggregation for unrelated tasks
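
The tools and optional-import fixes listed above can be sketched as follows. The helper names are hypothetical and are not taken from the lmms-eval code; the sketch only shows the difference between an `is not None` check and a truthy check, and between catching `ImportError` and catching `Exception`.

```python
import importlib


def should_handle_tools(tools):
    """Truthy check: None and an empty list both mean 'no tools configured'."""
    return bool(tools)


def tools_for_template(tools):
    """Collapse an empty tool list to None before passing it to a chat template."""
    return tools or None


def optional_import(module_name):
    """Return the imported module, or None if importing it fails for any reason."""
    try:
        return importlib.import_module(module_name)
    except Exception:  # deliberately broader than ImportError
        return None
```

With an `is not None` check, an empty list returned by a disabled MCP setup would still enter the tool-handling path, while the truthy check skips it. The broad `except Exception` trades precision for robustness: it also hides unrelated import-time bugs, so it suits only dependencies that are genuinely optional.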

Luodian commented Mar 15, 2026

Closing in favor of #1253, which is a strict superset of this PR (same 7 commits + additional SGLang refactor, distributed eval fixes, and cache simplification). All changes from this PR are included there.

Luodian closed this Mar 15, 2026
Luodian deleted the brianli/dev branch March 15, 2026 10:18