
Inconsistent New Line Character Reading [Windows VS Linux] (table.to_pandas()) #428

@storesace-jorgelopes

READ_FIRST: This ended up being a long write-up, but the detail turned out to be necessary because it concerns a specific and possibly OS-dependent bug in the algorithm.

Description

I’m encountering an inconsistency when using table.to_pandas() across different operating systems and would like to clarify whether this is expected behavior or a potential bug.

Context

  • Development OS: Windows
  • Deployment OS: Linux
  • table.to_pandas() uses the default TextLinearizationConfig, which depends on os.linesep:
    # Windows:
    os.linesep  # Returns "\r\n"
    
    # Linux:
    os.linesep  # Returns "\n"

Issue

When extracting text from table cells:

  • On Windows, line breaks (\n) are preserved as expected.
  • On Linux, \n characters are replaced with spaces (" ").

Example:

# Windows:
df = table.to_pandas() 

[Image: screenshot of the output on Windows]

# Linux:
df = table.to_pandas()

[Image: screenshot of the output on Linux]

Expected Behavior

Consistent cell text output (with \n preserved) regardless of OS.

Possible Problem

After looking at the default configuration of TextLinearizationConfig, I noticed that one of the values is defined statically:

same_layout_element_separator: str = (
        "\n" #: Separator to use when two elements are in the same layout element
    )

Could this be conflicting with the other class values, given that os.linesep is also \n on Linux?
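
To make the suspected collision concrete, here is a purely hypothetical sketch. It is not Textractor's actual code: it assumes that the lines of a cell are joined with same_layout_element_separator and that a later step replaces the platform separator (os.linesep) with a space. The function name linearize_cell and the platform_linesep parameter are invented for illustration only.

```python
def linearize_cell(lines, platform_linesep, same_layout_element_separator="\n"):
    """Hypothetical model of the suspected behavior (NOT Textractor's code)."""
    # Step 1 (assumed): lines in the same layout element are joined with "\n".
    text = same_layout_element_separator.join(lines)
    # Step 2 (assumed): the platform-dependent separator is flattened to a space.
    return text.replace(platform_linesep, " ")


# Simulate both platforms by passing the separator explicitly.
windows_result = linearize_cell(["first", "second"], platform_linesep="\r\n")
linux_result = linearize_cell(["first", "second"], platform_linesep="\n")

print(repr(windows_result))  # 'first\nsecond'
print(repr(linux_result))    # 'first second'
```

If something like this happens internally, the "\n" survives on Windows only because it never matches "\r\n", which would explain the difference I'm seeing.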

Questions

  1. Is this behavior intentional?
  2. Should TextLinearizationConfig use a platform-agnostic line separator (e.g., hardcode \n)?
  3. Possible workaround (?): Explicitly set os.linesep="\n" in the config?

The Solution I Implemented

In order to ensure consistent behavior across both Windows and Linux environments, I replaced the default "\n" value used in same_layout_element_separator with a custom marker.

custom_marker = "<<~~N~~>>"
table_df = table.to_pandas(config=TextLinearizationConfig(same_layout_element_separator=custom_marker))

[Image: screenshot of the output with the custom marker]

table_df = table_df.map(lambda x: x.replace(custom_marker, "\n") if isinstance(x, str) else x)
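
For reference, the restore step can be written as a small named function, which is easier to test than the inline lambda. This is a minimal sketch using only the standard library; the name restore_newlines and the sample row are made up for illustration, and the list stands in for the cell values of a DataFrame.

```python
CUSTOM_MARKER = "<<~~N~~>>"


def restore_newlines(value, marker=CUSTOM_MARKER):
    """Swap the marker back to a newline; leave non-string cells untouched."""
    if isinstance(value, str):
        return value.replace(marker, "\n")
    return value


# Stand-in for the cell values of one DataFrame row (hypothetical data).
row = ["line one<<~~N~~>>line two", 42, None]
restored = [restore_newlines(cell) for cell in row]

print(restored)  # ['line one\nline two', 42, None]
```

With a real DataFrame the same function can be passed to table_df.map(restore_newlines). Note that DataFrame.map requires pandas 2.1 or newer; on older versions the equivalent is DataFrame.applymap.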

NOTE: During dataframe pre-processing, there's a specific step to handle existing newline characters (\n). This helps correct some specific table misread issues that may have occurred.

Problem Found with My Solution (?)

However, even after applying this solution, I noticed that when accessing the table.children cells from the Textractor table object, the cell.text values still contain spaces (" ") instead of preserving the originally expected "\n" characters.

How to Reproduce

On Windows, I simply forced os.linesep="\n" (simulating Linux behavior) right before the .to_pandas() call.
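
A minimal sketch of that override, wrapped in a context manager so the original value is always restored. The helper name forced_linesep is my own, and the to_pandas() call is left as a comment because it needs a real Textractor table. One caveat: this only affects code that reads os.linesep at call time; any value the library captured at import time (for example a default argument) would not change.

```python
import os
from contextlib import contextmanager


@contextmanager
def forced_linesep(value="\n"):
    """Temporarily override os.linesep, restoring the original afterwards."""
    original = os.linesep
    os.linesep = value
    try:
        yield
    finally:
        # Always put the real platform separator back, even on errors.
        os.linesep = original


original_linesep = os.linesep
with forced_linesep("\n"):
    inside = os.linesep
    # df = table.to_pandas()  # the call under test would go here
after = os.linesep
```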
