Remove 'file_path' from document metadata #412

lewtun · 2025-12-22T07:00:05Z

This PR removes a spurious file_path from document metadata.

When inference is resumed from checkpoints, the restored documents gain an extra file_path metadata field that was not present in the original data. This causes schema mismatches when loading the final dataset, because some parquet files contain a file_path column and others do not.

Error example:

CastError: Couldn't cast
text: string
id: string
solution: string
schema_0: list<element: struct<desc: string, points: int64, title: string>>
  child 0, element: struct<desc: string, points: int64, title: string>
      child 0, desc: string
      child 1, points: int64
      child 2, title: string
raw_responses: list<element: string>
  child 0, element: string
dataset: string
rollout_results: list<element: struct<finish_reason: string, text: string, usage: struct<completion_tokens: int64, prompt_tokens: int64, prompt_tokens_details: null, total_tokens: int64>>>
  child 0, element: struct<finish_reason: string, text: string, usage: struct<completion_tokens: int64, prompt_tokens: int64, prompt_tokens_details: null, total_tokens: int64>>
      child 0, finish_reason: string
      child 1, text: string
      child 2, usage: struct<completion_tokens: int64, prompt_tokens: int64, prompt_tokens_details: null, total_tokens: int64>
          child 0, completion_tokens: int64
          child 1, prompt_tokens: int64
          child 2, prompt_tokens_details: null
          child 3, total_tokens: int64
file_path: string
to
{'text': Value('string'), 'id': Value('string'), 'solution': Value('string'), 'schema_0': List({'desc': Value('string'), 'points': Value('int64'), 'title': Value('string')}), 'raw_responses': List(Value('string')), 'dataset': Value('string'), 'rollout_results': List({'finish_reason': Value('string'), 'text': Value('string'), 'usage': {'completion_tokens': Value('int64'), 'prompt_tokens': Value('int64'), 'prompt_tokens_details': Value('null'), 'total_tokens': Value('int64')}})}
because column names don't match
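
The mismatch can be illustrated without any parquet tooling. The following is a minimal sketch, not code from this repository: find_schema_drift is a hypothetical helper, and the two shard names and their column lists are made-up stand-ins for a file written from fresh documents and a file written after resuming from a checkpoint.

```python
from collections import Counter


def find_schema_drift(shards):
    """Return columns that appear in some shards but not in all of them.

    `shards` maps a shard name to the list of column names in that shard.
    """
    counts = Counter(col for cols in shards.values() for col in set(cols))
    return sorted(col for col, n in counts.items() if n != len(shards))


# Hypothetical shards: one written from fresh documents, one after a resume.
shards = {
    "fresh.parquet": ["text", "id", "solution", "dataset"],
    "resumed.parquet": ["text", "id", "solution", "dataset", "file_path"],
}

drift = find_schema_drift(shards)
print(drift)
```

Any column reported here is one that a loader expecting a single shared schema would likely reject, which matches the "column names don't match" message above.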

This metadata arises because JsonlReader extends BaseDiskReader, which automatically injects file_path into document metadata:

# base.py line 163
document.metadata.setdefault("file_path", self.data_folder.resolve_paths(source_file))

This means:

Documents processed fresh (never read back through JsonlReader) → no file_path
Documents restored from checkpoint (read back through JsonlReader) → file_path added
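
The asymmetry above, and one possible way to remove it, can be sketched with plain dictionaries. This is an illustration under stated assumptions rather than the actual change in this PR: read_back and strip_file_path are hypothetical names, and the path string is a placeholder for whatever resolve_paths returns.

```python
def read_back(metadata, source_path):
    """Mimic the reader: add file_path only if it is not already present."""
    restored = dict(metadata)  # copy so the caller's dict is not mutated
    restored.setdefault("file_path", source_path)
    return restored


def strip_file_path(metadata):
    """Return a copy of the metadata without the injected file_path key."""
    cleaned = dict(metadata)
    cleaned.pop("file_path", None)  # no error if the key is absent
    return cleaned


# A fresh document carries no file_path.
fresh = {"dataset": "example"}

# Restoring it from a checkpoint file injects the key (placeholder path).
restored = read_back(fresh, "checkpoints/00000.jsonl")

# Stripping the key makes restored and fresh documents agree again.
cleaned = strip_file_path(restored)
print(restored)
print(cleaned)
```

Dropping the key unconditionally assumes the original data never carries its own file_path. If it can, a guard that remembers whether the key was present before the read would be needed instead.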

Co-authored with Claude


lewtun requested review from guipenedo and hynky1999 December 22, 2025 07:00
