
Incorrect table cell word and line order #369

@wessens

Hello, this issue seems very similar to #136, but I can't get it to work: the word and line order inside table cells is not preserved when calling the get_text method.

The attached JSON is the result of running Textract start_document_analysis with the features [TextractFeatures.TABLES, TextractFeatures.LAYOUT].

When running

```python
import json

import textractor
from textractor.entities.document import Document

j = json.load(open('../data/processed/6e2ab4b2a234e0410205db117803203a1be55a3fc766d56083c62512d71e556e.json'))

doc = Document.open(j)
print(doc.tables[1].get_text())

print(textractor.__version__)
```

I get output such as

```
...
of adolescent and girls

6.1.2.4 the Ensure
...
```

But the actual lines are "of adolescent girls and" and "6.1.2.4 Ensure the", and the order of the lines is different as well.

The Blocks seem fine, and the child order in "Relationships" also seems correct.
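
For reference, here is a minimal sketch of how the child order can be checked directly on the raw Textract JSON, without going through textractor. It assumes the response is a single dict with a top-level "Blocks" list (a paginated result may instead be a list of such dicts, in which case each would need to be processed in turn). The function name `cell_texts_in_child_order` and the sample data are made up for illustration; the sample is not taken from the attached file.

```python
from typing import Dict, List


def cell_texts_in_child_order(response: dict) -> Dict[str, List[str]]:
    """Map each CELL block Id to its WORD texts, in CHILD relationship order.

    Hypothetical helper: assumes `response` is one Textract response dict
    with a top-level "Blocks" list.
    """
    blocks_by_id = {block["Id"]: block for block in response.get("Blocks", [])}
    result: Dict[str, List[str]] = {}
    for block in blocks_by_id.values():
        if block.get("BlockType") != "CELL":
            continue
        words: List[str] = []
        for relationship in block.get("Relationships") or []:
            if relationship.get("Type") != "CHILD":
                continue
            # Keep the Ids in exactly the order the response lists them.
            for child_id in relationship.get("Ids", []):
                child = blocks_by_id.get(child_id)
                if child is not None and child.get("BlockType") == "WORD":
                    words.append(child.get("Text", ""))
        result[block["Id"]] = words
    return result


# Tiny synthetic response for illustration only (not the attached file).
sample = {
    "Blocks": [
        {
            "Id": "cell-1",
            "BlockType": "CELL",
            "Relationships": [
                {"Type": "CHILD", "Ids": ["w1", "w2", "w3", "w4"]}
            ],
        },
        {"Id": "w1", "BlockType": "WORD", "Text": "of"},
        {"Id": "w2", "BlockType": "WORD", "Text": "adolescent"},
        {"Id": "w3", "BlockType": "WORD", "Text": "girls"},
        {"Id": "w4", "BlockType": "WORD", "Text": "and"},
    ]
}

for cell_id, words in cell_texts_in_child_order(sample).items():
    print(cell_id, " ".join(words))
```

Comparing this per-cell output with what get_text returns for the same table should show whether the reordering happens in the raw response or somewhere during linearization.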

What am I doing wrong?

6e2ab4b2a234e0410205db117803203a1be55a3fc766d56083c62512d71e556e.json

Labels: bug, enhancement
