This repository was archived by the owner on Mar 27, 2024. It is now read-only.

Commit d78c933

Merge branch 'hotfix-0.36.3'
Hotfix v0.36.3: Henkei: Avoid false linebreaks in plain text extraction

The previous method of plain text extraction from documents was inserting incorrect line breaks into the text.

The cause is that we were extracting plain text by first using Apache Tika's HTML extraction and then converting that HTML to text with Nokogiri. We did this because Tika's own text extraction includes unwanted artifacts, such as `[image: ...]` and `[bookmark: ...]`. See [TIKA-2755](https://jira.apache.org/jira/browse/TIKA-2755).

The solution is to use the Tika server's `/rmeta/text` endpoint (as opposed to `/tika`) and then parse the JSON response. The plain text content is in the "X-TIKA:content" field. This field begins with a run of line breaks, which need to be stripped. As a future improvement, it would be good to strip only those line breaks that are not part of the document.

Resolves [#342](https://github.com/OpenlyOne/openly/issues/342)
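The parsing step described in the commit message can be sketched as follows. This is a minimal illustration, not the project's code: `SAMPLE_RESPONSE` is an invented payload and `plain_text_from_rmeta` is a hypothetical helper name. It assumes the endpoint returns a JSON array of metadata objects with the container document first.

```ruby
require 'json'

# Invented sample of a /rmeta/text response (assumption: a JSON array with
# one metadata object per document, container document first).
SAMPLE_RESPONSE = <<~'JSON'
  [
    {
      "Content-Type": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
      "X-TIKA:content": "\n\n\n\nHello world\nSecond line\n"
    }
  ]
JSON

# Hypothetical helper: returns the stripped plain text, or an empty string
# when the first metadata object has no content field.
def plain_text_from_rmeta(json)
  JSON.parse(json).first['X-TIKA:content']&.strip.to_s
end

puts plain_text_from_rmeta(SAMPLE_RESPONSE).inspect
# prints "Hello world\nSecond line"
```

Note that `strip` removes all leading and trailing whitespace, including any blank lines that belong to the document itself, which is the limitation the commit message mentions.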
2 parents 0cd8192 + a3cdddd commit d78c933

57 files changed: +9977 -9671 lines changed

CHANGELOG.md

+7

@@ -1,5 +1,12 @@
 # CHANGELOG
 
+## v0.36.3 (Mar 4, 2019)
+
+**Fixes:**
+- Improve plain text extraction, so that we no longer have unexpected line
+  breaks in our diffs.
+  ([#342](https://github.com/OpenlyOne/openly/issues/342))
+
 ## v0.36.2 (Feb 18, 2019)
 
 **Fixes:**

lib/extensions/henkei/server.rb

+8 -8

@@ -26,14 +26,14 @@ class Server
     # Extract plain text from the provided file
     def self.extract_text(file)
       # HACK: extracting text via Apache Tika includes unwanted artefacts, such
-      # => as bookmarks and images. Use Nokogiri to get text from HTML
-      # => instead.
-      # put_file(
-      #   file: file,
-      #   path: CONTENT_PATH,
-      #   options: { 'Accept': 'text/plain' }
-      # )
-      Nokogiri::HTML(extract_html(file)).text.gsub(/\n\n\n+/, "\n\n")
+      # => as bookmarks and images. Use the /rmeta/text endpoint instead.
+      # => See TIKA-2755: https://jira.apache.org/jira/browse/TIKA-2755
+      response = put_file(
+        file: file,
+        path: '/rmeta/text'
+      )
+
+      JSON.parse(response).first['X-TIKA:content']&.strip.to_s
     end
 
     # Extract html from the provided file
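
The diff does not show how `put_file` talks to the server, so the following is only a sketch of what a request to `/rmeta/text` might look like using Ruby's standard library. `build_tika_put` is a hypothetical helper, and the host and port assume a local Tika server on its default port 9998. The request is built but not sent.

```ruby
require 'net/http'
require 'uri'

# Assumption: a Tika server running locally on its default port.
TIKA_URI = URI('http://localhost:9998')

# Hypothetical helper: builds (but does not send) a PUT request that asks
# the given endpoint for a JSON response.
def build_tika_put(path, body)
  request = Net::HTTP::Put.new(path)
  request['Accept'] = 'application/json'
  request.body = body
  request
end

request = build_tika_put('/rmeta/text', 'raw document bytes')
puts "#{request.method} #{request.path}"
# prints "PUT /rmeta/text"

# Sending the request needs a running Tika server:
# response = Net::HTTP.start(TIKA_URI.host, TIKA_URI.port) do |http|
#   http.request(request)
# end
# response.body # JSON string to pass to JSON.parse
```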

spec/integrations/contribution_spec.rb

+3 -7

@@ -306,16 +306,12 @@
     end
 
     it 'gives creator of contribution edit access to files' do
-      expect(
-        api_connection.find_file!(
-          contribution.branch.root.remote_file_id,
-          fields: 'capabilities'
-        ).capabilities.can_edit
-      ).to be true
+      # Allow edit access to propagate
+      sleep 3 if VCR.current_cassette.recording?
 
       expect(
         api_connection.find_file!(
-          contribution.branch.files.without_root.first.remote_file_id,
+          contribution.branch.root.remote_file_id,
           fields: 'capabilities'
         ).capabilities.can_edit
       ).to be true

@@ -0,0 +1,32 @@
+# frozen_string_literal: true
+
+RSpec.describe Henkei::Server, type: :model, vcr: true do
+  subject(:server) { described_class }
+
+  describe '#extract_text' do
+    subject(:text) { server.extract_text(file) }
+
+    let(:file_name) { 'file-with-link.docx' }
+    let(:file) { File.open(path_to_file_fixtures.join(file_name)) }
+    let(:path_to_file_fixtures) do
+      Rails.root.join('spec', 'support', 'fixtures', 'files')
+    end
+
+    it 'does not wrap links with new lines' do
+      is_expected.not_to include "\n[email protected]\n"
+    end
+
+    it 'trims newlines at document beginning and end' do
+      is_expected.not_to match(/^\n.*\n$/)
+    end
+
+    context 'when file has no text' do
+      let(:file_name) { 'image.png' }
+
+      it 'returns empty string' do
+        is_expected.not_to be nil
+        is_expected.to be_empty
+      end
+    end
+  end
+end
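
A note on the trimming check in the spec above: in Ruby, `^` and `$` anchor to the start and end of any line, and `.` does not match a newline, so `/^\n.*\n$/` only matches fairly specific shapes of text. The sketch below uses an invented string to show the difference and one possible stricter pattern built on `\A` and `\z`. It is offered as an illustration, not as a change made in this commit.

```ruby
untrimmed = "\nfirst line\nsecond line\n"
trimmed   = untrimmed.strip

# The pattern used in the spec does not flag this multi-line string,
# even though it starts and ends with a newline.
puts untrimmed.match?(/^\n.*\n$/) # prints false

# \A and \z anchor to the start and end of the whole string instead.
LEADING_OR_TRAILING_NEWLINE = /\A\n|\n\z/

puts untrimmed.match?(LEADING_OR_TRAILING_NEWLINE) # prints true
puts trimmed.match?(LEADING_OR_TRAILING_NEWLINE)   # prints false
```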
6.78 KB
Binary file not shown.

spec/support/fixtures/files/image.png

4.27 KB