@@ -32,6 +32,7 @@ The plugin code in each plugin is equivalent to the code in this combined bundle
 [frame="all"]
 |===
 | Plugin version | Elasticsearch version | Release date
+| 6.3.2.2 | 6.3.2 | Oct 2, 2018
 | 5.4.1.0 | 5.4.0 | Jun 1, 2017
 | 5.4.0.1 | 5.4.0 | May 12, 2017
 | 5.4.0.0 | 5.4.0 | May 4, 2017
@@ -614,7 +615,26 @@ GET _analyze
   }
 }

-# Example
+# Decompound
+
+This is an implementation of a word decompounder plugin for link:http://github.com/elasticsearch/elasticsearch[Elasticsearch].
+
+Compounding several words into one word is a property not all languages share.
+Compounding is used in German, the Scandinavian languages, Finnish, and Korean.
+
+This code is a reworked implementation of the
+link:http://wortschatz.uni-leipzig.de/~cbiemann/software/toolbox/Baseforms%20Tool.htm[Baseforms Tool]
+found in the http://wortschatz.uni-leipzig.de/~cbiemann/software/toolbox/index.htm[ASV toolbox]
+of http://asv.informatik.uni-leipzig.de/staff/Chris_Biemann[Chris Biemann]
+at the Automatische Sprachverarbeitung (automatic language processing) group of Leipzig University.
+
+Lucene comes with two compound word token filters, a dictionary-based and a hyphenation-based variant.
+Both have a disadvantage: they require loading a word list into memory before they run.
+This decompounder does not require word lists; it can process German text out of the box.
+It uses prebuilt _Compact Patricia Tries_ for efficient word segmentation, provided by the ASV toolbox.
+
+## Decompound examples

 Try it out
 ----
@@ -630,7 +650,7 @@ GET _analyze
 }
 ----

-In the mapping, us a token filter of type "decompound"::
+In the mapping, use a token filter of type "decompound"::

    {
       "index":{
@@ -678,7 +698,7 @@ Also the Lucene german normalization token filter is provided::
 The input "Ein schöner Tag in Köln im Café an der Straßenecke" will be tokenized into
 "Ein", "schoner", "Tag", "in", "Koln", "im", "Café", "an", "der", "Strassenecke".

-# Threshold
+## Threshold

 The decomposing algorithm uses a threshold to decide whether a word has been decomposed successfully.
 If the threshold is too low, words could silently disappear from being indexed. In this case, you have to adapt the
@@ -705,7 +725,7 @@ The default threshold value is 0.51. You can modify it in the settings::
    }
 }

-# Subwords
+## Subwords

 Sometimes only the decomposed subwords should be indexed. For this, you can use the parameter `"subwords_only": true`

@@ -729,7 +749,95 @@ Sometimes only the decomposed subwords should be indexed. For this, you can use
 }


-## Langdetect
+## Caching
+
+The time consumed by the decompound computation may increase your overall indexing time drastically
+when it is applied billions of times. You can configure a least-frequently-used cache that maps a token
+to its decompounded tokens with the following settings:
+
+`use_cache: true` - enables caching
+`cache_size` - sets the cache size, default: 100000
+`cache_eviction_factor` - sets the cache eviction factor, valid values are between 0.00 and 1.00, default: 0.90
+
+```
+{
+    "settings": {
+        "index": {
+            "number_of_shards": 1,
+            "number_of_replicas": 0,
+            "analysis": {
+                "filter": {
+                    "decomp": {
+                        "type": "decompound",
+                        "use_payload": true,
+                        "use_cache": true
+                    }
+                },
+                "analyzer": {
+                    "decomp": {
+                        "type": "custom",
+                        "tokenizer": "standard",
+                        "filter": [
+                            "decomp",
+                            "lowercase"
+                        ]
+                    },
+                    "lowercase": {
+                        "type": "custom",
+                        "tokenizer": "standard",
+                        "filter": [
+                            "lowercase"
+                        ]
+                    }
+                }
+            }
+        }
+    },
+    "mappings": {
+        "_doc": {
+            "properties": {
+                "text": {
+                    "type": "text",
+                    "analyzer": "decomp",
+                    "search_analyzer": "lowercase"
+                }
+            }
+        }
+    }
+}
+```
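+
+As a minimal sketch of combining the cache parameters (the names `cache_size` and
+`cache_eviction_factor` come from the list above; the values here are only illustrative),
+a filter definition with explicit cache sizing could look like this:
+
+```
+"decomp": {
+    "type": "decompound",
+    "use_cache": true,
+    "cache_size": 50000,
+    "cache_eviction_factor": 0.75
+}
+```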
+
+## Exact phrase matches
+
+The usage of decompounded tokens can lead to undesired results in phrase queries.
+After indexing, decompounded tokens cannot be distinguished from original tokens.
+The outcome of a phrase query for "Deutsche Bank" could be `Deutsche Spielbankgesellschaft`,
+which is clearly an unexpected result. To enable "exact" phrase queries, each decompounded token is
+tagged with additional payload data.
+
+To evaluate this payload data, you can use `exact_phrase` as a wrapper around a query
+containing your phrase queries.
+
+`use_payload` - if set to true, enables payload creation, default: false
+
+```
+{
+    "query": {
+        "exact_phrase": {
+            "query": {
+                "query_string": {
+                    "query": "\"deutsche bank\"",
+                    "fields": [
+                        "message"
+                    ]
+                }
+            }
+        }
+    }
+}
+```
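+
+Note that `exact_phrase` can only evaluate payloads that were written at index time,
+so the queried field must have been analyzed with a decompound filter where
+`use_payload` is set to `true`, as in the caching example above.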
+
+# Langdetect

 curl -XDELETE 'localhost:9200/test'

@@ -797,7 +905,7 @@ Sometimes only the decomposed subwords should be indexed. For this, you can use
 }
 '

-## Standardnumber
+# Standardnumber

 Try it out
 ----