@lewtun lewtun commented Dec 22, 2025

Co-authored with Claude

This PR removes a spurious `file_path` from document metadata. 

When inference resumes from a checkpoint, the restored documents gain an extra `file_path` metadata field that wasn't present in the original data. Loading the final dataset then fails with a schema mismatch, because some parquet files have a `file_path` column and others don't.

Error example:

```
CastError: Couldn't cast
text: string
id: string
solution: string
schema_0: list<element: struct<desc: string, points: int64, title: string>>
  child 0, element: struct<desc: string, points: int64, title: string>
      child 0, desc: string
      child 1, points: int64
      child 2, title: string
raw_responses: list<element: string>
  child 0, element: string
dataset: string
rollout_results: list<element: struct<finish_reason: string, text: string, usage: struct<completion_tokens: int64, prompt_tokens: int64, prompt_tokens_details: null, total_tokens: int64>>>
  child 0, element: struct<finish_reason: string, text: string, usage: struct<completion_tokens: int64, prompt_tokens: int64, prompt_tokens_details: null, total_tokens: int64>>
      child 0, finish_reason: string
      child 1, text: string
      child 2, usage: struct<completion_tokens: int64, prompt_tokens: int64, prompt_tokens_details: null, total_tokens: int64>
          child 0, completion_tokens: int64
          child 1, prompt_tokens: int64
          child 2, prompt_tokens_details: null
          child 3, total_tokens: int64
file_path: string
to
{'text': Value('string'), 'id': Value('string'), 'solution': Value('string'), 'schema_0': List({'desc': Value('string'), 'points': Value('int64'), 'title': Value('string')}), 'raw_responses': List(Value('string')), 'dataset': Value('string'), 'rollout_results': List({'finish_reason': Value('string'), 'text': Value('string'), 'usage': {'completion_tokens': Value('int64'), 'prompt_tokens': Value('int64'), 'prompt_tokens_details': Value('null'), 'total_tokens': Value('int64')}})}
because column names don't match
```
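
One way to find which shards carry the extra column is to compare column names across files. The sketch below is only an illustration: the helper `find_column_mismatches` and the shard names are hypothetical, and the column lists are hard-coded so the example runs on its own. In practice the names could be read per file with `pyarrow.parquet.read_schema(path).names`.

```python
def find_column_mismatches(shard_columns: dict[str, list[str]]) -> dict[str, set[str]]:
    """Return, for each shard, the columns that are not present in every shard.

    `shard_columns` maps a shard name to its list of column names.
    Shards whose columns are all shared by every other shard are omitted.
    """
    if not shard_columns:
        return {}
    # Columns that every shard has in common.
    common = set.intersection(*(set(cols) for cols in shard_columns.values()))
    mismatches = {}
    for shard, cols in shard_columns.items():
        extra = set(cols) - common
        if extra:
            mismatches[shard] = extra
    return mismatches


# Hypothetical shard names; the column names are taken from the error above.
base_columns = ["text", "id", "solution", "schema_0", "raw_responses",
                "dataset", "rollout_results"]
shards = {
    "00000.parquet": base_columns,                  # written from fresh documents
    "00001.parquet": base_columns + ["file_path"],  # written after a resume
}
mismatched = find_column_mismatches(shards)
print(mismatched)
```

Only the shard written after the resume is reported, which matches the pattern in the `CastError`: the two schemas are identical except for `file_path`.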

The extra field appears because `JsonlReader` extends `BaseDiskReader`, which automatically injects `file_path` into the metadata of the documents it reads:

```python
# base.py line 163
document.metadata.setdefault("file_path", self.data_folder.resolve_paths(source_file))
```
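
To make the `setdefault` behaviour concrete, here is a minimal sketch that uses plain dictionaries as stand-ins for `document.metadata`. The metadata values and the path strings are made up for illustration and do not come from the library.

```python
# Plain dicts stand in for document.metadata; all values are illustrative.
fresh_metadata = {"dataset": "example"}     # never passes through a disk reader
restored_metadata = {"dataset": "example"}  # read back from a checkpoint file

# A reader-style injection: the key is added only when it is missing.
restored_metadata.setdefault("file_path", "checkpoints/00000.jsonl")

# setdefault never overwrites a value that is already there.
already_tagged = {"file_path": "original.jsonl"}
already_tagged.setdefault("file_path", "checkpoints/00000.jsonl")

print(sorted(fresh_metadata))
print(sorted(restored_metadata))
print(already_tagged["file_path"])
```

The two dictionaries start out identical, but only the one that went through the injection step ends up with a `file_path` key. That difference is what later surfaces as a column mismatch.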

This means:
* Documents processed fresh → no `file_path`
* Documents restored from checkpoint → `file_path` added
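
A minimal sketch of how the two cases could be brought back to one schema, assuming it is acceptable to drop the key from every restored document. The `Doc` class and the `drop_injected_file_path` helper are hypothetical stand-ins written for this example, not the library's API and not necessarily the change made in this PR.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Doc:
    """Hypothetical stand-in for a document with text, id and metadata."""
    text: str
    id: str
    metadata: dict[str, Any] = field(default_factory=dict)


def drop_injected_file_path(doc: Doc) -> Doc:
    """Remove the reader-injected key, if present, and return the same document."""
    # pop with a default is a no-op for documents that never had the key.
    doc.metadata.pop("file_path", None)
    return doc


fresh = Doc("a", "1", {"dataset": "example"})
restored = Doc("b", "2", {"dataset": "example", "file_path": "checkpoints/00000.jsonl"})

cleaned = [drop_injected_file_path(d) for d in (fresh, restored)]
print([sorted(d.metadata) for d in cleaned])
```

Note that this approach would also remove a `file_path` value that was legitimately part of the original data, so it only fits datasets where that field is known to be injected.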