Extracting table data? #1
Comments
Hi, @munikarmanish! You're correct. The OCR currently only works on pre-processed images. It does extract text from PDFs that contain tables, but it scans horizontally and doesn't perform any table-based classification of the text yet; I'm still trying to figure out how to make that work. A make-do solution could be to classify the text after extraction based on the length of each column, but that only works if every column has a fixed number of words, which is not the case in most scenarios.
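To make the fixed-length idea concrete, here is a minimal sketch in plain Python. It is not code from this project: the function name `split_by_word_counts` and the `widths` parameter are hypothetical, and it assumes each OCR'd line is a plain string whose columns always contain a known number of words.

```python
def split_by_word_counts(line, widths):
    """Split one OCR'd text line into cells.

    Assumption (hypothetical helper): column i always holds exactly
    widths[i] whitespace-separated words.
    """
    words = line.split()
    if len(words) != sum(widths):
        # The line does not fit the fixed layout, so refuse to guess.
        raise ValueError("line does not match the expected word counts")
    cells, start = [], 0
    for width in widths:
        cells.append(" ".join(words[start:start + width]))
        start += width
    return cells


# Works when the row matches the assumed layout (name: 2 words, age: 1, city: 1).
row = split_by_word_counts("John Smith 42 Kathmandu", [2, 1, 1])
```

A row such as "Ann 37 New York" also has four words, so it passes the length check but is split into the wrong cells. That silent failure is why this only works when every column really has a fixed number of words.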
The way to do this is to use code to detect the table structure (columns and rows) first and then perform the OCR within the table. It's a really hard problem, though.
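One common way to approximate the row and column detection step is a projection profile: count the dark pixels in every row and every column, and treat empty stretches as separators. The sketch below is only an illustration under strong assumptions: the image is already binarized and deskewed, it is passed as a list of lists of 0/1 values (1 = ink), and the table has whitespace gutters rather than ruled lines. The names `ink_runs`, `detect_cells`, and `min_gap` are hypothetical, and the OCR call on each cell is left out.

```python
def ink_runs(profile, min_gap=2):
    """Return (start, end) spans where profile > 0.

    Spans separated by fewer than min_gap empty positions are merged,
    so small gaps between letters do not split a column.
    """
    runs, start = [], None
    for i, value in enumerate(profile):
        if value > 0 and start is None:
            start = i
        elif value == 0 and start is not None:
            runs.append([start, i])
            start = None
    if start is not None:
        runs.append([start, len(profile)])

    merged = []
    for run in runs:
        if merged and run[0] - merged[-1][1] < min_gap:
            merged[-1][1] = run[1]  # gap too small: extend the previous span
        else:
            merged.append(run)
    return [tuple(run) for run in merged]


def detect_cells(image, min_gap=2):
    """Return a grid of (top, bottom, left, right) boxes, one per table cell."""
    row_profile = [sum(row) for row in image]        # ink per pixel row
    col_profile = [sum(col) for col in zip(*image)]  # ink per pixel column
    rows = ink_runs(row_profile, min_gap)
    cols = ink_runs(col_profile, min_gap)
    return [[(top, bottom, left, right) for (left, right) in cols]
            for (top, bottom) in rows]
```

Each returned box could then be cropped and passed to the OCR engine separately, so the recognized text stays attached to its row and column. Real scans usually need deskewing and noise removal before the profiles are clean enough for this to work.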
Hi @munikarmanish, did you find anything from the research you mentioned above?
Yes, I've found a few interesting approaches:
I am also facing the issue above. Has anyone found a good solution after 2 years?
Right now, it only seems to perform OCR, i.e., convert an image to raw text. Is there any table-specific extraction performed? Basically, I'm researching good algorithms for extracting tabular data from scanned documents.
Thanks in advance. :)