@@ -32,6 +32,7 @@ The plugin code in each plugin is equivalent to the code in this combined bundle
[frame="all"]
|===
| Plugin version | Elasticsearch version | Release date
+ | 6.3.2.2 | 6.3.2 | Oct 2, 2018
| 5.4.1.0 | 5.4.0 | Jun 1, 2017
| 5.4.0.1 | 5.4.0 | May 12, 2017
| 5.4.0.0 | 5.4.0 | May 4, 2017
@@ -614,7 +615,26 @@ GET _analyze
}
}

- # Example
+ # Decompound
+
+ This is an implementation of a word decompounder plugin for link:http://github.com/elasticsearch/elasticsearch[Elasticsearch].
+
+ Compounding several words into one word is a property not all languages share.
+ Compounding is used in German, the Scandinavian languages, Finnish, and Korean.
+
+ This code is a reworked implementation of the
+ link:http://wortschatz.uni-leipzig.de/~cbiemann/software/toolbox/Baseforms%20Tool.htm[Baseforms Tool]
+ found in the http://wortschatz.uni-leipzig.de/~cbiemann/software/toolbox/index.htm[ASV toolbox]
+ of http://asv.informatik.uni-leipzig.de/staff/Chris_Biemann[Chris Biemann],
+ Automatische Sprachverarbeitung of Leipzig University.
+
+ Lucene comes with two compound word token filters, a dictionary-based and a hyphenation-based variant.
+ Both of them have a disadvantage: they require loading a word list into memory before they run.
+ This decompounder does not require word lists; it can process German text out of the box.
+ The decompounder uses prebuilt _Compact Patricia Tries_ for efficient word segmentation provided
+ by the ASV toolbox.
+
+ ## Decompound examples

Try it out
----
@@ -630,7 +650,7 @@ GET _analyze
}
----

- In the mapping, us a token filter of type "decompound"::
+ In the mapping, use a token filter of type "decompound"::

{
"index":{
@@ -678,7 +698,7 @@ Also the Lucene german normalization token filter is provided::
The input "Ein schöner Tag in Köln im Café an der Straßenecke" will be tokenized into
"Ein", "schoner", "Tag", "in", "Koln", "im", "Café", "an", "der", "Strassenecke".

- # Threshold
+ ## Threshold

The decomposing algorithm knows about a threshold when to assume words as decomposed successfully or not.
If the threshold is too low, words could silently disappear from being indexed. In this case, you have to adapt the
@@ -705,7 +725,7 @@ The default threshold value is 0.51. You can modify it in the settings::
}
}

- # Subwords
+ ## Subwords

Sometimes only the decomposed subwords should be indexed. For this, you can use the parameter `"subwords_only": true`
@@ -729,7 +749,95 @@ Sometimes only the decomposed subwords should be indexed. For this, you can use
}

- ## Langdetect
+ ## Caching
+
+ The time consumed by the decompound computation may increase your overall indexing time drastically when it is
+ applied billions of times. You can configure a least-frequently-used cache for mapping a token to the decompounded
+ tokens with the following settings:
+
+ `use_cache: true` - enables caching
+ `cache_size` - sets the cache size, default: 100000
+ `cache_eviction_factor` - sets the cache eviction factor, valid values are between 0.00 and 1.00, default: 0.90
+
+ ```
+ {
+   "settings": {
+     "index": {
+       "number_of_shards": 1,
+       "number_of_replicas": 0,
+       "analysis": {
+         "filter": {
+           "decomp": {
+             "type": "decompound",
+             "use_payload": true,
+             "use_cache": true
+           }
+         },
+         "analyzer": {
+           "decomp": {
+             "type": "custom",
+             "tokenizer": "standard",
+             "filter": [
+               "decomp",
+               "lowercase"
+             ]
+           },
+           "lowercase": {
+             "type": "custom",
+             "tokenizer": "standard",
+             "filter": [
+               "lowercase"
+             ]
+           }
+         }
+       }
+     }
+   },
+   "mappings": {
+     "_doc": {
+       "properties": {
+         "text": {
+           "type": "text",
+           "analyzer": "decomp",
+           "search_analyzer": "lowercase"
+         }
+       }
+     }
+   }
+ }
+ ```
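+
+ A minimal sketch of how the cache parameters described above can be combined in the filter definition, assuming
+ they sit alongside `use_cache` (the values shown are simply the documented defaults):
+
+ ```
+ "decomp": {
+   "type": "decompound",
+   "use_cache": true,
+   "cache_size": 100000,
+   "cache_eviction_factor": 0.90
+ }
+ ```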
+
+ ## Exact phrase matches
+
+ The usage of decompounds can lead to undesired results with phrase queries.
+ After indexing, decompounded tokens cannot be distinguished from original tokens.
+ The outcome of a phrase query for "Deutsche Bank" could be `Deutsche Spielbankgesellschaft`,
+ which is clearly an unexpected result. To enable "exact" phrase queries, each decompounded token is
+ tagged with additional payload data.
+
+ To evaluate this payload data, you can use the `exact_phrase` query as a wrapper around a query
+ containing your phrase queries.
+
+ `use_payload` - if set to true, enables payload creation. Default: false
+
+ ```
+ {
+   "query": {
+     "exact_phrase": {
+       "query": {
+         "query_string": {
+           "query": "\"deutsche bank\"",
+           "fields": [
+             "message"
+           ]
+         }
+       }
+     }
+   }
+ }
+ ```
+
+ # Langdetect

curl -XDELETE 'localhost:9200/test'
@@ -797,7 +905,7 @@ Sometimes only the decomposed subwords should be indexed. For this, you can use
}
'

- ## Standardnumber
+ # Standardnumber

Try it out
----