
Commit 648b4e1 (1 parent: 52c83fc)

add exact phrase match and LFU cache to Patricia decompounder, thanks to GBI Genios

File tree

18 files changed: +1214 −24 lines


CREDITS.txt

+5
@@ -26,3 +26,8 @@ The FSA in package org.xbib.elasticsearch.common.fsa which provides the dictionary for
 the baseform tokenizer is a derived version of

 https://github.com/morfologik/morfologik-stemming/tree/master/morfologik-fsa/src/main/java/morfologik/fsa
+
+Thanks to GBI-Genios Deutsche Wirtschaftsdatenbank GmbH for adding the caching functionality and the "exact phrase matches" feature.
+The implementation of an exact phrase match query can ignore/skip decompounded tokens while matching phrases.
+The LFU cache for the Patricia decompounder was inspired by the use of a ConcurrentHashMap cache
+in the original pull request: https://github.com/jprante/elasticsearch-analysis-decompound/pull/54/

README.adoc

+114-6
@@ -32,6 +32,7 @@ The plugin code in each plugin is equivalent to the code in this combined bundle
 [frame="all"]
 |===
 | Plugin version | Elasticsearch version | Release date
+| 6.3.2.2 | 6.3.2 | Oct 2, 2018
 | 5.4.1.0 | 5.4.0 | Jun 1, 2017
 | 5.4.0.1 | 5.4.0 | May 12, 2017
 | 5.4.0.0 | 5.4.0 | May 4, 2017
@@ -614,7 +615,26 @@ GET _analyze
 }
 }
-# Example
+# Decompound
+
+This is an implementation of a word decompounder plugin for link:http://github.com/elasticsearch/elasticsearch[Elasticsearch].
+
+Compounding several words into one word is a property not all languages share.
+Compounding is used in German, the Scandinavian languages, Finnish, and Korean.
+
+This code is a reworked implementation of the
+link:http://wortschatz.uni-leipzig.de/~cbiemann/software/toolbox/Baseforms%20Tool.htm[Baseforms Tool]
+found in the http://wortschatz.uni-leipzig.de/~cbiemann/software/toolbox/index.htm[ASV toolbox]
+of http://asv.informatik.uni-leipzig.de/staff/Chris_Biemann[Chris Biemann],
+Automatische Sprachverarbeitung group of Leipzig University.
+
+Lucene comes with two compound word token filters, a dictionary-based and a hyphenation-based variant.
+Both of them have a disadvantage: they require loading a word list into memory before they run.
+This decompounder does not require word lists; it can process German-language text out of the box.
+The decompounder uses prebuilt _Compact Patricia Tries_ for efficient word segmentation, provided
+by the ASV toolbox.
+
+## Decompound examples

 Try it out
 ----
@@ -630,7 +650,7 @@ GET _analyze
 }
 ----
-In the mapping, us a token filter of type "decompound"::
+In the mapping, use a token filter of type "decompound"::

 {
 "index":{
@@ -678,7 +698,7 @@ Also the Lucene German normalization token filter is provided::
 The input "Ein schöner Tag in Köln im Café an der Straßenecke" will be tokenized into
 "Ein", "schoner", "Tag", "in", "Koln", "im", "Café", "an", "der", "Strassenecke".

-# Threshold
+## Threshold

 The decomposing algorithm uses a threshold to decide whether a word has been decomposed successfully.
 If the threshold is too low, words could silently disappear from being indexed. In this case, you have to adapt the
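
The threshold is set on the decompound token filter in the index settings. A minimal fragment might look like this (a sketch; the filter name `decomp` is illustrative, `threshold` is the parameter described here):

```
{
  "index": {
    "analysis": {
      "filter": {
        "decomp": {
          "type": "decompound",
          "threshold": 0.6
        }
      }
    }
  }
}
```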
@@ -705,7 +725,7 @@ The default threshold value is 0.51. You can modify it in the settings::
 }
 }

-# Subwords
+## Subwords

 Sometimes only the decomposed subwords should be indexed. For this, you can use the parameter `"subwords_only": true`
@@ -729,7 +749,95 @@ Sometimes only the decomposed subwords should be indexed. For this, you can use
 }


-## Langdetect
+## Caching
+
+The time consumed by the decompound computation may increase your overall indexing time drastically when applied
+to billions of tokens. You can configure a least-frequently-used (LFU) cache that maps a token to its decompounded
+tokens with the following settings:
+
+`use_cache: true` - enables caching
+`cache_size` - sets the cache size, default: 100000
+`cache_eviction_factor` - sets the cache eviction factor, valid values are between 0.00 and 1.00, default: 0.90
+
+```
+{
+  "settings": {
+    "index": {
+      "number_of_shards": 1,
+      "number_of_replicas": 0,
+      "analysis": {
+        "filter": {
+          "decomp": {
+            "type": "decompound",
+            "use_payload": true,
+            "use_cache": true
+          }
+        },
+        "analyzer": {
+          "decomp": {
+            "type": "custom",
+            "tokenizer": "standard",
+            "filter": [
+              "decomp",
+              "lowercase"
+            ]
+          },
+          "lowercase": {
+            "type": "custom",
+            "tokenizer": "standard",
+            "filter": [
+              "lowercase"
+            ]
+          }
+        }
+      }
+    }
+  },
+  "mappings": {
+    "_doc": {
+      "properties": {
+        "text": {
+          "type": "text",
+          "analyzer": "decomp",
+          "search_analyzer": "lowercase"
+        }
+      }
+    }
+  }
+}
+```
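
The LFU eviction behavior can be sketched in a few lines (an illustrative Python model, not the plugin's Java implementation; it assumes `cache_eviction_factor` gives the fraction of entries kept after an eviction pass):

```python
from collections import Counter

class LfuCache:
    """Illustrative LFU cache: when full, keeps only the
    eviction_factor * size most frequently used entries."""

    def __init__(self, size=100_000, eviction_factor=0.90):
        self.size = size
        self.keep = max(1, int(size * eviction_factor))  # entries surviving an eviction pass
        self.data = {}          # token -> decompounded tokens
        self.freq = Counter()   # token -> access count

    def get(self, token, compute):
        if token in self.data:
            self.freq[token] += 1
            return self.data[token]
        value = compute(token)  # the expensive decompound computation
        if len(self.data) >= self.size:
            self._evict()
        self.data[token] = value
        self.freq[token] = 1
        return value

    def _evict(self):
        # drop everything except the most frequently used entries
        survivors = dict(self.freq.most_common(self.keep))
        self.data = {k: v for k, v in self.data.items() if k in survivors}
        self.freq = Counter(survivors)
```

Since natural-language corpora repeat most tokens heavily, a cache hit skips the Patricia-trie segmentation entirely, which is where the indexing-time savings come from.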
+
+## Exact phrase matches
+
+The usage of decompounds can lead to undesired results in phrase queries.
+After indexing, decompounded tokens cannot be distinguished from original tokens.
+The outcome of a phrase query "Deutsche Bank" could be `Deutsche Spielbankgesellschaft`,
+which is clearly an unexpected result. To enable "exact" phrase queries, each decompounded token is
+tagged with additional payload data.
+
+To evaluate this payload data, you can use `exact_phrase` as a wrapper around a query
+containing your phrase queries.
+
+`use_payload` - if set to true, enables payload creation. Default: false
+
+```
+{
+  "query": {
+    "exact_phrase": {
+      "query": {
+        "query_string": {
+          "query": "\"deutsche bank\"",
+          "fields": [
+            "message"
+          ]
+        }
+      }
+    }
+  }
+}
+```
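
The effect of the payload tagging can be modeled in a few lines (an illustrative Python model, not the plugin's Lucene implementation; integer positions and an `is_decompound` flag stand in for Lucene token positions and payloads):

```python
def phrase_match(tokens, phrase, use_payload=False):
    """tokens: (term, position, is_decompound) triples in index order;
    phrase: list of terms. With use_payload=True, decompounded tokens
    are ignored, mimicking the exact_phrase wrapper."""
    positions = {}
    for term, pos, is_decompound in tokens:
        if use_payload and is_decompound:
            continue  # skip tokens tagged as decompound products
        positions.setdefault(term, set()).add(pos)
    # anchor the phrase at every position of its first term
    for start in positions.get(phrase[0], ()):
        if all(start + i in positions.get(term, set())
               for i, term in enumerate(phrase)):
            return True
    return False

# "Deutsche Spielbankgesellschaft": the decompounder emits the parts
# at the same position as the original compound token
tokens = [
    ("deutsche", 0, False),
    ("spielbankgesellschaft", 1, False),
    ("spiel", 1, True),
    ("bank", 1, True),
    ("gesellschaft", 1, True),
]
```

Without the payload check, "deutsche" at position 0 followed by the decompounded "bank" at position 1 produces exactly the false positive described above; with it, only original tokens can satisfy the phrase.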
+
+# Langdetect

 curl -XDELETE 'localhost:9200/test'

@@ -797,7 +905,7 @@ Sometimes only the decomposed subwords should be indexed. For this, you can use
 }
 '

-## Standardnumber
+# Standardnumber

 Try it out
 ----

build.gradle

+7-3
@@ -72,9 +72,13 @@ test {
 exceptionFormat = 'full'
 }
 }
-randomizedTest.enabled = false
-esTest.enabled = true
-esTest.dependsOn jar
+randomizedTest {
+  enabled = false
+}
+esTest {
+  dependsOn jar
+  enabled = true
+}

 clean {
 delete fileTree('.') { include '.local*.log' }

gradle.properties

+1-1
@@ -1,6 +1,6 @@
 group = org.xbib.elasticsearch.plugin
 name = elasticsearch-plugin-bundle
-version = 6.3.2.1
+version = 6.3.2.2

 xbib-elasticsearch-test.version = 6.3.2.1
 elasticsearch.version = 6.3.2

gradle/wrapper/gradle-wrapper.jar

1.72 KB
Binary file not shown.
gradle/wrapper/gradle-wrapper.properties

+2 −2

@@ -1,6 +1,6 @@
-#Tue Jul 17 20:18:04 CEST 2018
+#Mon Oct 01 19:05:53 CEST 2018
 distributionBase=GRADLE_USER_HOME
 distributionPath=wrapper/dists
 zipStoreBase=GRADLE_USER_HOME
 zipStorePath=wrapper/dists
-distributionUrl=https\://services.gradle.org/distributions/gradle-4.8.1-all.zip
+distributionUrl=https\://services.gradle.org/distributions/gradle-4.10.2-all.zip

settings.gradle

+2-1
@@ -1 +1,2 @@
-rootProject.name = name
+rootProject.name = name
+enableFeaturePreview('STABLE_PUBLISHING')
