Description
READ_FIRST: It ended up being quite a bit of text, but turned out necessary since it's about a specific and possibly inconsistent bug in the algorithm.
I’m encountering an inconsistency when using table.to_pandas() across different operating systems and would like to clarify whether this is expected behavior or a potential bug.
Context
- Development OS: Windows
- Deployment OS: Linux
table.to_pandas() uses the default TextLinearizationConfig, which depends on os.linesep:
# Windows:
os.linesep  # Returns "\r\n"
# Linux:
os.linesep  # Returns "\n"
Issue
When extracting text from table cells:
- On Windows, line breaks (\n) are preserved as expected.
- On Linux, \n characters are replaced with spaces (" ").
Example:
# Windows:
df = table.to_pandas()
# Linux:
df = table.to_pandas()
Expected Behavior
Consistent cell text output (with \n preserved) regardless of OS.
Possible Problem
After looking at the default configuration of TextLinearizationConfig, I noticed that one of the values is defined statically:
same_layout_element_separator: str = (
"\n" #: Separator to use when two elements are in the same layout element
)
Could this be conflicting with the other class values, since on Linux os.linesep is also \n?
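To make the suspected collision concrete, here is a toy model of the hypothesis. It is not Textractor's code: flatten_cell and its cleanup step are hypothetical, invented only to show how a separator that equals os.linesep could be lost on one platform and survive on another.

```python
def flatten_cell(lines, same_layout_sep, line_sep):
    """Hypothetical model of the suspected behavior, NOT Textractor's implementation."""
    # Step 1: join the pieces of one cell with the same-layout separator.
    text = same_layout_sep.join(lines)
    # Step 2 (assumed): a later cleanup treats the platform line separator as whitespace.
    return text.replace(line_sep, " ")


cell = ["first line", "second line"]

# Windows-like: line_sep is "\r\n", so the "\n" joins are left untouched.
windows_like = flatten_cell(cell, same_layout_sep="\n", line_sep="\r\n")

# Linux-like: line_sep is "\n", the same string as the join separator, so it is replaced.
linux_like = flatten_cell(cell, same_layout_sep="\n", line_sep="\n")

print(repr(windows_like))
print(repr(linux_like))
```

If something along these lines happens inside the library, it would explain why the same call yields different cell text on the two systems.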
Questions
- Is this behavior intentional?
- Should TextLinearizationConfig use a different platform-agnostic line separator (e.g., hardcode \n)?
- Possible workaround (?): Explicitly set os.linesep="\n" in the config?
The Solution I Implemented
To ensure consistent behavior across both Windows and Linux environments, I replaced the default "\n" value of same_layout_element_separator with a custom marker:
custom_marker = "<<~~N~~>>"
table_df = table.to_pandas(config=TextLinearizationConfig(same_layout_element_separator=custom_marker))
table_df = table_df.map(lambda x: x.replace(custom_marker, "\n") if isinstance(x, str) else x)
NOTE: During dataframe pre-processing, there's a specific step to handle existing newline characters (\n). This helps correct some specific table misread issues that may have occurred.
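The marker restoration step can be checked in isolation, without Textractor or pandas. This is a minimal sketch: restore_newlines is a hypothetical helper name, and the rows are made-up data standing in for the DataFrame cells.

```python
CUSTOM_MARKER = "<<~~N~~>>"


def restore_newlines(value, marker=CUSTOM_MARKER):
    """Swap the marker back to a newline; leave non-string cells unchanged."""
    if isinstance(value, str):
        return value.replace(marker, "\n")
    return value


# Made-up cell values: one string with the marker, plus non-string and plain cells.
rows = [
    ["Name" + CUSTOM_MARKER + "(full)", 3],
    [None, "plain"],
]

restored = [[restore_newlines(value) for value in row] for row in rows]
print(restored)
```

The isinstance check matters because table cells may hold numbers or missing values, which have no replace method.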
Problem Found with My Solution (?)
However, even after applying this solution, I noticed that when accessing the table.children cells from the Textractor table object, the cell.text values still contain spaces (" ") instead of the expected "\n" characters.
How to Reproduce
On Windows, I simply forced os.linesep="\n" (simulating Linux behavior) right before the .to_pandas() call.
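One way to do that override without leaving os.linesep changed for the rest of the process is a scoped patch. This is a sketch using the standard library's unittest.mock; call_with_linesep is a hypothetical helper, and it only affects code that reads os.linesep while the call is running, not values captured earlier (for example at import time).

```python
import os
from unittest import mock


def call_with_linesep(func, linesep="\n"):
    """Run func() while os.linesep is temporarily set to the given value."""
    with mock.patch.object(os, "linesep", linesep):
        return func()


original = os.linesep

# Stand-in for the real call; in the report this would be table.to_pandas.
seen_inside = call_with_linesep(lambda: os.linesep)

# After the call, the platform value is back in place.
restored_ok = os.linesep == original
print(repr(seen_inside), restored_ok)
```

With a real table object, the call would look like call_with_linesep(table.to_pandas).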