Skip to content

Commit d6e15f2

Browse files
committed
Fix problems with parsers
- Due to only matching ASCII, any non-ASCII character caused a parse failure. This is annoying. Fixed by matching "not whitespace" instead of alphanumerics. Added tests for non-ASCII characters in query string. - Related to the above: a boolean operator would delimit a clause (no space required between terms) because +/- were not allowed in terms. Using "not whitespace" also fixes this. Added tests. - Added tests for unbalanced quotation marks in phrase query parser. - Added a test to document that balanced quotation marks can delimit clauses. We'll call this a feature. It is possible to eliminate using lookahead. - Fix heuristic parser breaking terms that start with a decade string by using lookahead.
1 parent bbc3270 commit d6e15f2

10 files changed

+90
-14
lines changed

README.md

-1
Original file line numberDiff line numberDiff line change
@@ -69,6 +69,5 @@ The tutorial is under copyright and cannot be republished without my permission.
6969

7070
## TODO
7171

72-
- [] decade fix -- requires lookahead to make sure decade actually is. Need to backport fix
7372
- [] performance?
7473
- [] final copy edits

boolean_term_parser.rb

+2-2
Original file line numberDiff line numberDiff line change
@@ -4,10 +4,10 @@ module BooleanTermParser
44
# This query parser adds an optional operator ("+" or "-") to the simple term
55
# parser. In order to do that, a new "clause" node is added to the parse tree.
66
class QueryParser < Parslet::Parser
7-
rule(:term) { match('[a-zA-Z0-9]').repeat(1).as(:term) }
7+
rule(:term) { match('[^\s]').repeat(1).as(:term) }
88
rule(:operator) { (str('+') | str('-')).as(:operator) }
99
rule(:clause) { (operator.maybe >> term).as(:clause) }
10-
rule(:space) { match('\s').repeat(1) }
10+
rule(:space) { match('\s').repeat(1) }
1111
rule(:query) { (clause >> space.maybe).repeat.as(:query) }
1212
root(:query)
1313
end

heuristic_parser.rb

+4-2
Original file line numberDiff line numberDiff line change
@@ -5,12 +5,14 @@ module HeuristicParser
55
# It adds a new clause type for date ranges. The parser recognizes strings
66
# like "1920s" or "2010" as dates instead of generic terms.
77
class QueryParser < Parslet::Parser
8+
rule(:eof) { any.absent? }
89
rule(:decade) do
910
((str('1') >> str('9') |
1011
str('2') >> str('0')) >>
11-
match('\d') >> str('0')).as(:decade) >> str('s').maybe
12+
match('\d') >> str('0')).as(:decade) >>
13+
str('s').maybe >> (eof | space).present?
1214
end
13-
rule(:term) { match('[a-zA-Z0-9]').repeat(1).as(:term) }
15+
rule(:term) { match('[^\s"]').repeat(1).as(:term) }
1416
rule(:quote) { str('"') }
1517
rule(:operator) { (str('+') | str('-')).as(:operator) }
1618
rule(:phrase) { (quote >> (term >> space.maybe).repeat >> quote).as(:phrase) }

phrase_parser.rb

+1-1
Original file line numberDiff line numberDiff line change
@@ -5,7 +5,7 @@ module PhraseParser
55
# terms. This is done creating multiple types of clauses instead of just one.
66
# A phrase clause generates an Elasticsearch match_phrase query.
77
class QueryParser < Parslet::Parser
8-
rule(:term) { match('[a-zA-Z0-9]').repeat(1).as(:term) }
8+
rule(:term) { match('[^\s"]').repeat(1).as(:term) }
99
rule(:quote) { str('"') }
1010
rule(:operator) { (str('+') | str('-')).as(:operator) }
1111
rule(:phrase) { (quote >> (term >> space.maybe).repeat >> quote).as(:phrase) }

term_parser.rb

+3-3
Original file line numberDiff line numberDiff line change
@@ -1,10 +1,10 @@
11
require 'parslet'
22

33
module TermParser
4-
# This is a simple parser that matches a sequence of alphanumeric characters and
5-
# converts it to an Elasticsearch match query.
4+
# This is a simple parser that matches a sequence of non-whitespace characters
5+
# and converts it to an Elasticsearch match query.
66
class QueryParser < Parslet::Parser
7-
rule(:term) { match('[a-zA-Z0-9]').repeat(1).as(:term) }
7+
rule(:term) { match('[^\s]').repeat(1).as(:term) }
88
rule(:space) { match('\s').repeat(1) }
99
rule(:query) { (term >> space.maybe).repeat.as(:query) }
1010
root(:query)

tests/boolean_term_parser_tests.rb

+24
Original file line numberDiff line numberDiff line change
@@ -1,3 +1,4 @@
1+
# coding: utf-8
12
require 'minitest/autorun'
23
require_relative '../boolean_term_parser'
34

@@ -28,4 +29,27 @@ def test_multiple_terms_with_operators
2829
{:clause => {:operator => '-', :term => 'cat'}}]}
2930
assert_equal(expected, tree)
3031
end
32+
33+
def test_non_ascii_characters
34+
tree = BooleanTermParser::QueryParser.new.parse('+føé -ba∑ ∫åñ')
35+
expected = {:query => [{:clause => {:operator => '+', :term => 'føé'}},
36+
{:clause => {:operator => '-', :term => 'ba∑'}},
37+
{:clause => {:term => '∫åñ'}}]}
38+
assert_equal(expected, tree)
39+
end
40+
41+
def test_operators_in_terms
42+
tree = BooleanTermParser::QueryParser.new.parse('-foo+term +bar-term baz-term')
43+
expected = {:query => [{:clause => {:operator => '-', :term => 'foo+term'}},
44+
{:clause => {:operator => '+', :term => 'bar-term'}},
45+
{:clause => {:term => 'baz-term'}}]}
46+
assert_equal(expected, tree)
47+
end
48+
49+
def test_quotation_marks
50+
tree = BooleanTermParser::QueryParser.new.parse('+fo"o -ba"r')
51+
expected = {:query => [{:clause => {:operator => '+', :term => 'fo"o'}},
52+
{:clause => {:operator => '-', :term => 'ba"r'}}]}
53+
assert_equal(expected, tree)
54+
end
3155
end

tests/heuristic_parser_tests.rb

+20-2
Original file line numberDiff line numberDiff line change
@@ -8,11 +8,29 @@ def test_date_range
88
end
99

1010
def test_complex_query
11-
tree = HeuristicParser::QueryParser.new.parse('awesome "cat videos" -2000s')
12-
expected = {:query => [{:clause => {:term => 'awesome'}},
11+
tree = HeuristicParser::QueryParser.new.parse('+paw-some "cat videos" -2000s')
12+
expected = {:query => [{:clause => {:operator => '+', :term => 'paw-some'}},
1313
{:clause => {:phrase => [{:term => 'cat'}, {:term => 'videos'}]}},
1414
{:clause => {:operator => '-', :decade => '2000'}}]}
1515

1616
assert_equal(expected, tree)
1717
end
18+
19+
def test_term_prefixed_with_decade
20+
tree = HeuristicParser::QueryParser.new.parse('2000st')
21+
expected = {:query => [{:clause => {:term => '2000st'}}]}
22+
assert_equal(expected, tree)
23+
end
24+
25+
def test_term_suffixed_with_decade
26+
tree = HeuristicParser::QueryParser.new.parse('st2000')
27+
expected = {:query => [{:clause => {:term => 'st2000'}}]}
28+
assert_equal(expected, tree)
29+
end
30+
31+
def test_non_decade_parsed_as_term
32+
tree = HeuristicParser::QueryParser.new.parse('2001')
33+
expected = {:query => [{:clause => {:term => '2001'}}]}
34+
assert_equal(expected, tree)
35+
end
1836
end

tests/phrase_parser_tests.rb

+22
Original file line numberDiff line numberDiff line change
@@ -69,4 +69,26 @@ def test_complex_query
6969

7070
assert_equal(expected, tree)
7171
end
72+
73+
def test_mismatched_quotation_marks
74+
assert_raises Parslet::ParseFailed do
75+
PhraseParser::QueryParser.new.parse('"foo')
76+
end
77+
end
78+
79+
def test_quotation_mark_in_term
80+
assert_raises Parslet::ParseFailed do
81+
PhraseParser::QueryParser.new.parse('fo"o')
82+
end
83+
end
84+
85+
def test_mismatched_quotation_mark_delimiter
86+
# We'll call this a "feature" since the quotation marks are balanced.
87+
# If you don't want this, you can use lookahead to ensure end-quote is followed by a space or EOF
88+
tree = PhraseParser::QueryParser.new.parse('"foo"+bar"baz"')
89+
expected = {:query => [{:clause => {:phrase => [{:term => 'foo'}]}},
90+
{:clause => {:operator => '+', :term => 'bar'}},
91+
{:clause => {:phrase => [{:term => 'baz'}]}}]}
92+
assert_equal(expected, tree)
93+
end
7294
end

tests/term_parser_tests.rb

+11
Original file line numberDiff line numberDiff line change
@@ -1,3 +1,4 @@
1+
# coding: utf-8
12
require 'minitest/autorun'
23
require_relative '../term_parser'
34

@@ -16,4 +17,14 @@ def test_multiple_spaces_between_terms
1617
tree = TermParser::QueryParser.new.parse('foo bar')
1718
assert_equal({:query => [{:term => 'foo'}, {:term => 'bar'}]}, tree)
1819
end
20+
21+
def test_non_ascii_characters
22+
tree = TermParser::QueryParser.new.parse('føé ba∑')
23+
assert_equal({:query => [{:term => 'føé'}, {:term => 'ba∑'}]}, tree)
24+
end
25+
26+
def test_quotation_marks
27+
tree = TermParser::QueryParser.new.parse('fo"o')
28+
assert_equal({:query => [{:term => 'fo"o'}]}, tree)
29+
end
1930
end

tutorial/build_a_query_parser.md

+3-3
Original file line numberDiff line numberDiff line change
@@ -324,7 +324,7 @@ Extra input after last repetition at line 1 char 6.
324324

325325
The simple parser above can recognize strings that match the grammar, but can't do anything with it. Using `#as`, we can capture parts of the input that we want to keep and save them as a parse tree. Anything not named with `#as` is discarded.
326326

327-
We need to capture the terms and the overall query.
327+
We need to capture the terms and the overall query. To be more flexible with user input, the `term` rule has been updated to match any non-whitespace character.
328328

329329
{{code="term_parser.rb:6-11"}}
330330

@@ -574,9 +574,9 @@ Where `decade` is defined as:
574574

575575
To implement this, we add the new `decade` rule to the parser and use it in the `clause` rule.
576576

577-
{{code="heuristic_parser.rb:7-21"}}
577+
{{code="heuristic_parser.rb:7-23"}}
578578

579-
A PEG parser always takes the first alternative, so we need to make `decade` match before `term`, because a `decade` is always a valid `term`. If we didn't do this, the `decade` rule would never match.
579+
A PEG parser always takes the first alternative, so we need to make `decade` match before `term`, because a `decade` is always a valid `term`. If we didn't do this, the `decade` rule would never match. We also need to add a lookahead for space or end of input to the `decade` rule. Without this, input like <span class="query-string">1990th</span> would be parsed as a decade `1990` and a term `th`.
580580

581581
For the transformer, we define a `DateRangeClause` class that takes a number and converts it into a start and end date:
582582

0 commit comments

Comments
 (0)