Inconsistent New Line Character Reading [Windows VS Linux] (table.to_pandas())

**_READ_FIRST: This report is long, but the detail is necessary because it describes a specific and possibly OS-dependent inconsistency in how newlines are handled._**


### Description  
I’m encountering an inconsistency when using `table.to_pandas()` across different operating systems and would like to clarify whether this is expected behavior or a potential bug.  

### Context  
- **Development OS:** Windows  
- **Deployment OS:** Linux  
- `table.to_pandas()` uses the default `TextLinearizationConfig`, which depends on `os.linesep`:  
  ```python
  # Windows:
  os.linesep  # Returns "\r\n"
  
  # Linux:
  os.linesep  # Returns "\n"
  ```

### Issue  
When extracting text from table cells:  
- On **Windows**, line breaks (`\n`) are preserved as expected.  
- On **Linux**, `\n` characters are replaced with spaces (`" "`).  

**Example:**  
```python
# Windows:
df = table.to_pandas() 
```
![Image](https://github.com/user-attachments/assets/447978bb-a758-4a69-929a-17bf79e90544)

```python
# Linux:
df = table.to_pandas()
```
![Image](https://github.com/user-attachments/assets/1d206e46-a609-426e-a1e7-b8fbd5819e68)

### Expected Behavior  
Consistent cell text output (with `\n` preserved) regardless of OS.  

### Possible Problem
After looking at the default configuration of `TextLinearizationConfig`, I noticed that one of the values is defined statically:
```python
same_layout_element_separator: str = (
    "\n"  #: Separator to use when two elements are in the same layout element
)
```
Could this conflict with the other class values that depend on `os.linesep`, given that on **Linux** `os.linesep` is also `\n`?
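
To make the suspected collision concrete, here is a toy model. It is not Textractor's code: `join_and_clean` is a hypothetical helper, and the idea that some clean-up step replaces `os.linesep` with a space is only an assumption about what might happen internally.

```python
def join_and_clean(parts, element_separator="\n", linesep="\n"):
    """Toy model (hypothetical, NOT Textractor code).

    Joins text fragments with `element_separator`, then replaces every
    occurrence of `linesep` with a space, as an assumed clean-up step.
    """
    joined = element_separator.join(parts)
    return joined.replace(linesep, " ")


parts = ["first line", "second line"]

# Windows-like: linesep is "\r\n", so the "\n" separator is never matched.
windows_like = join_and_clean(parts, linesep="\r\n")

# Linux-like: linesep is "\n", identical to the separator, so it is replaced.
linux_like = join_and_clean(parts, linesep="\n")

# A custom marker never collides with linesep on either platform.
with_marker = join_and_clean(parts, element_separator="<<~~N~~>>", linesep="\n")

print(repr(windows_like))
print(repr(linux_like))
print(repr(with_marker))
```

If the library behaves anything like this model, it would explain both the Windows/Linux difference and why a separator that cannot match `os.linesep` survives on both systems.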

### Questions  
1. Is this behavior intentional?  
2. Should `TextLinearizationConfig` use a platform-agnostic line separator (e.g., hardcode `\n`) instead of `os.linesep`?  
3. Would a possible workaround be to explicitly set the `os.linesep`-based values to `"\n"` in the config?  

### The Solution I Implemented
To ensure consistent behavior across both Windows and Linux environments, I replaced the default `"\n"` value of `same_layout_element_separator` with a **_custom marker_**, then swapped the marker back to `"\n"` in the resulting DataFrame.

```python
custom_marker = "<<~~N~~>>"
table_df = table.to_pandas(config=TextLinearizationConfig(same_layout_element_separator=custom_marker))
```
![Image](https://github.com/user-attachments/assets/0f3eaad9-638e-4783-b954-b3415b46720a)
```python
table_df = table_df.map(lambda x: x.replace(custom_marker, "\n") if isinstance(x, str) else x)
```
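
The marker swap above can also be written as a small helper that works across pandas versions. This is a minimal sketch: `restore_newlines` and `restore_frame` are hypothetical names of my own, and the only pandas facts assumed are that `DataFrame.map` exists from pandas 2.1 onward and that `DataFrame.applymap` is the element-wise equivalent in older releases.

```python
CUSTOM_MARKER = "<<~~N~~>>"


def restore_newlines(value, marker=CUSTOM_MARKER):
    """Replace the marker with "\n" in strings; return other values unchanged."""
    if isinstance(value, str):
        return value.replace(marker, "\n")
    return value


def restore_frame(df, marker=CUSTOM_MARKER):
    """Apply restore_newlines to every cell of a pandas DataFrame.

    Uses DataFrame.map when the installed pandas provides it (2.1+),
    otherwise falls back to DataFrame.applymap.
    """
    if hasattr(type(df), "map"):
        return df.map(lambda v: restore_newlines(v, marker))
    return df.applymap(lambda v: restore_newlines(v, marker))


# Usage with the DataFrame from this issue:
# table_df = restore_frame(table_df)
```

Keeping the per-cell logic in `restore_newlines` makes it easy to check on plain values, including `None` or numeric cells, without needing a Textract response.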

_NOTE:_ My DataFrame pre-processing has a dedicated step that handles existing newline characters (`\n`). It helps correct certain table misreads that may have occurred.

#### Problem Found with My Solution  (?)
However, even after applying this solution, the cells in `table.children` of the Textractor table object are still affected: their `cell.text` values contain spaces (`" "`) instead of the expected `"\n"` characters.


**How to Reproduce**  
On Windows, I reproduced the Linux behavior by forcing `os.linesep = "\n"` right before the `.to_pandas()` call.
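
For anyone who wants to repeat this without leaving `os.linesep` modified for the rest of the process, here is a minimal sketch of a context manager that overrides it temporarily. `forced_linesep` is a hypothetical helper name, and `table` in the usage comment stands for the Textractor table object from this issue.

```python
import os
from contextlib import contextmanager


@contextmanager
def forced_linesep(value="\n"):
    """Temporarily override os.linesep and restore it afterwards."""
    original = os.linesep
    os.linesep = value
    try:
        yield
    finally:
        # Always restore, even if the wrapped code raises.
        os.linesep = original


# Usage (simulating Linux on Windows):
# with forced_linesep("\n"):
#     df = table.to_pandas()
```

Note that this only affects code that reads `os.linesep` while the block is active. Any value that was copied from `os.linesep` earlier, for example at import time, keeps its original content.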



Inconsistent New Line Character Reading [Windows VS Linux] (table.to_pandas()) #428

Description

Context

Issue

Expected Behavior

Possible Problem

Questions

The Solution I Implemented

Problem Found with My Solution (?)

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Inconsistent New Line Character Reading [Windows VS Linux] (table.to_pandas()) #428

Description

Description

Context

Issue

Expected Behavior

Possible Problem

Questions

The Solution I Implemented

Problem Found with My Solution (?)

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions