Mistakes in the Dutch stemmer #1

gboer · 2013-03-18T10:40:56Z

I first want to thank everyone on the Snowball project for creating this software. It's great that we can use the software to build more sophisticated search capabilities for our users. However, when I was testing several Dutch words, I noticed there are actually quite a lot of mistakes. I'm not quite sure how to fix the problems in the Dutch stemmer, so I thought I'd mention them here and hope someone picks them up.

Not sure where to start, so I'll mention a couple that are incorrect (the last word is the correct one):

gevaren gevar -> gevaar
gevaar gevar -> gevaar
gevaarlijk gevar -> gevaarlijk
gevaarlijke gevar -> gevaarlijk
gevaarlijker gevaarlijker -> gevaarlijk
gevaarten gevaart -> gevaarte
gevallen gevall -> geval
geven gev -> geef
gevist gevist -> vis
gewasbescherming gewasbescherm -> gewasbescherming
gewassen gewass -> gewas
geweer gewer -> geweer
aanbellen aanbell -> bel aan (yes, Dutch is weird)
aandeel aandel -> aandeel
aaneen aanen -> aaneen (should really be excluded from stemming if possible, since there is no way that this word occurs in any other form)
aalmoezen aalmoez -> aalmoes
gangetje gangetj -> gang
gebaartje gebaartj -> gebaar

These are just a few, but there are quite a lot more. Should you need help verifying or testing the stemmer for the Dutch words, I'm happy to help :)

rboulton · 2013-03-27T12:47:22Z

Sorry not to have responded sooner; I'll try and take a look at this within the next week.

ojwb · 2014-12-09T00:35:34Z

@rboulton Did you manage to take a look at this?

As a general point, the aim of these stemmers is not to map their inputs to words in the same language, but rather to map different forms of the same word to the same string of characters (and forms of different words to different strings of characters). It just happens that in many cases the outputs are words in the same language.

So it isn't necessarily an error to be mapping gevaren and gevaar to gevar even if that isn't a word in Dutch. It is an error if other forms of gevaar get mapped to something else, or if unrelated words get mapped to gevar as well.
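A minimal sketch of that criterion, using five rows from the report above. It treats the reporter's expected form as the identity of the underlying word, which is an assumption (whether gevaarlijk should be kept apart from gevaar is a judgment call), and the function name `conflation_errors` is hypothetical.

```python
from collections import defaultdict

# (input word, stem from the "Dutch" stemmer, form the reporter expected)
REPORTED = [
    ("gevaren", "gevar", "gevaar"),
    ("gevaar", "gevar", "gevaar"),
    ("gevaarlijk", "gevar", "gevaarlijk"),
    ("gevaarlijke", "gevar", "gevaarlijk"),
    ("gevaarlijker", "gevaarlijker", "gevaarlijk"),
]


def conflation_errors(rows):
    """Return (over, under): stems shared by several expected forms,
    and expected forms whose inflections are split over several stems."""
    forms_by_stem = defaultdict(set)
    stems_by_form = defaultdict(set)
    for _word, stem, expected in rows:
        forms_by_stem[stem].add(expected)
        stems_by_form[expected].add(stem)
    over = {s: sorted(f) for s, f in forms_by_stem.items() if len(f) > 1}
    under = {f: sorted(s) for f, s in stems_by_form.items() if len(s) > 1}
    return over, under


over, under = conflation_errors(REPORTED)
print("stems conflating different words:", over)
print("words split across stems:", under)
```

Note that the stem `gevar` not being a Dutch word plays no part in either check; only the grouping matters.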

Sutharsan · 2015-01-12T09:11:33Z

I have the same experience with the 'Dutch' Snowball stemmer. Much better stemming is achieved using the 'Kraaij-Pohlmann' stemming algorithm (language="Kp"). The simplest improvement is to use this algorithm as the default for Dutch stemming.
See https://wiki.apache.org/solr/LanguageAnalysis#dutch

Sutharsan · 2015-01-12T20:20:51Z

I agree with @ojwb that it is not a problem if the stemmer does not map to existing words, as long as it does not map to an existing word with a different meaning. Here are a few examples comparing the "Dutch" language with the "Kp" language.

Using: <filter class="solr.SnowballPorterFilterFactory" language="Dutch" />

'adverteer' > 'adveter' (advertise, 1st person singular) > No stem, not an existing word, but Ok.
'adverteren' > 'adveter' (advertise, 1st person plural) > No stem, not an existing word, but Ok.
'geadverteerd' > 'geadverteerd' (advertised) > Same word, Not Ok.
'artikelen' > 'artikel' (articles, plural) > Stem, singular word, Ok
'artikeltje' > 'artikeltj' (small article) > Not an existing word, Not Ok.
'openbaar' > 'open' (public) > Existing word (open), "stem" is related, not sure if Ok.
'zaken' > 'zak' (business) > Existing word (bag), not related, Not Ok.
'gelezen' > 'gelez' (read, regular verb) > Not an existing word, Not Ok.
'gebroken' > 'gebrok' (broken, irregular verb) > Not an existing word, Not Ok.

Using: <filter class="solr.SnowballPorterFilterFactory" language="Kp" />

'adverteer' > 'adveteer' (advertise, 1st person singular) > Stem, Ok.
'adverteren' > 'adveteer' (advertise, 1st person plural) > Stem, Ok.
'geadverteerd' > 'adveteer' (advertised) > Stem, Ok.
'artikelen' > 'artikel' (articles, plural) > Stem, Ok.
'artikeltje' > 'artikel' (small article) > Stem, Ok.
'openbaar' > 'open' (public) > Existing word (open), "stem" is related, not sure if Ok.
'zaken' > 'zaak' (business) > Stem, Ok.
'gelezen' > 'lees' (read, regular verb) > Stem, Ok.
'gebroken' > 'brook' (broken, irregular verb) > Not an existing word, Not Ok.

Of course I have picked examples where the stemming fails, but I have found only one category where the "Kp" language fails: irregular verbs. Altogether, though, the "Kp" Kraaij-Pohlmann algorithm is a much better stemmer than the obvious choice of the "Dutch" language. Instead of fixing the Dutch stemmer, I recommend replacing it with the "Kp" stemmer.

istepaniuk · 2018-12-14T09:16:12Z

I am not a good Dutch speaker but I can see from the examples that some plural nouns are not stemmed correctly in the example diff for Dutch.

For example, I understand that acties should become actie, but it remains unchanged. The same occurs for all other nouns of that same form such as conclusies, condities, etc.

It would seem that this rule is missing entirely.

ojwb · 2019-10-14T04:38:56Z

I brought this matter up on the list recently:

https://lists.tartarus.org/pipermail/snowball-discuss/2019-October/001658.html

The history here is that Martin implemented dutch.sbl (and devised the algorithm for it) and also implemented kraaij_pohlmann.sbl from a paper and C implementation by Kraaij and Pohlmann (the paper only contains a partial description). When Snowball was quite new, Martin implemented several existing stemming algorithms in it (also Lovins' English stemmer and Schinke's Latin stemmer) as a way to demonstrate that the language was flexible enough to implement any algorithmic stemmer, but at least for kraaij_pohlmann.sbl he didn't worry about matching every detail of the behaviour - essentially it's more of a proof of concept implementation rather than something intended for the wider use it seems to now be getting.

Helpfully Martin managed to find a copy of the original C implementation from Kraaij and Pohlmann, which means we can look at the discrepancies between that and kraaij_pohlmann.sbl and decide which are worth addressing.

https://snowballstem.org/algorithms/kraaij_pohlmann/stemmer.html claims "in the demonstration vocabulary only 32 words out of over 45,000 stem differently" and goes on to list them (and a significant number appear to be non-Dutch words) but if I attempt to repeat that comparison I get 220 differences.

One obvious difference from looking at the sources is that the C version includes vowels with accents whereas the Snowball version only considers unaccented vowels - a quick attempt to copy that in Snowball reduced the differences from 220 to 153.
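The comparison itself is simple to sketch. In the snippet below the two stemmers are stood in for by lookup tables, which is purely an illustration: the `creëren` and `ideële` entries are taken from the difference table later in this thread (before the fixes), the `artikelen` entry is assumed to be the same for both implementations, and `stem_differences` is a hypothetical helper name.

```python
def stem_differences(words, stem_a, stem_b):
    """Return (word, stem_a(word), stem_b(word)) for every word on which
    the two stemming functions disagree."""
    diffs = []
    for word in words:
        a, b = stem_a(word), stem_b(word)
        if a != b:
            diffs.append((word, a, b))
    return diffs


# Stand-ins for the real stemmers (illustrative data only).
C_KP = {"creëren": "creëer", "ideële": "ideëel", "artikelen": "artikel"}
SNOWBALL_KP = {"creëren": "creeer", "ideële": "ideeel", "artikelen": "artikel"}

diffs = stem_differences(sorted(C_KP), C_KP.__getitem__, SNOWBALL_KP.__getitem__)
for word, c_stem, sbl_stem in diffs:
    print(word, c_stem, sbl_stem, sep="\t")
print(len(diffs), "differences")
```

In practice `stem_a` and `stem_b` would wrap the C program and the generated Snowball stemmer, and `words` would be the test vocabulary.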

I'll see if I can usefully summarise the differences so people can easily take a look.

If anyone knows of a good quality Dutch word list, that might be useful - dutch/voc.txt seems to have a significant amount of non-Dutch, which is rather unhelpful in this instance.

ojwb · 2025-02-14T05:03:31Z

> https://snowballstem.org/algorithms/kraaij_pohlmann/stemmer.html claims "in the demonstration vocabulary only 32 words out of over 45,000 stem differently" and goes on to list them (and a significant number appear to be non-Dutch words) but if I attempt to repeat that comparison I get 220 differences.

I think I've finally got to the bottom of this discrepancy.

The C implementation of Kraaij-Pohlmann works in ISO-8859-1 (with accented characters encoded as octal escapes, e.g. \353 for ë). I think Martin was working in DOS codepage 850 back then: if I convert our Dutch word list to that encoding, run it through the C implementation, and then convert the output back to UTF-8, I get exactly the same 32 differences currently listed on the website.

The comparison is invalid due to the encoding confusion. The Snowball kraaij_pohlmann omits all rules involving letters with diacritics, but because of the encoding confusion the corresponding rules in the C implementation didn't fire in the old testing. So there really are 220 differences on our wordlist. That's still not a huge number, but I think it would make sense to try to fully align the Snowball implementation with the C one (or at least close the gap significantly).
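To illustrate why the accented-vowel rules could not fire on codepage 850 input, here is a small sketch using Python's standard codecs. The word `creëren` is just an example taken from this thread, and the variable names are hypothetical.

```python
# The C source spells ë as the octal escape \353, i.e. byte 0xEB in ISO-8859-1.
E_DIAERESIS_LATIN1 = 0o353

word = "creëren"
as_latin1 = word.encode("iso-8859-1")
as_cp850 = word.encode("cp850")

# In ISO-8859-1 the byte the C rules look for is present ...
rule_can_fire_latin1 = E_DIAERESIS_LATIN1 in as_latin1
# ... but in codepage 850 the same letter is a different byte (0x89),
# so a byte-level comparison against \353 never matches.
rule_can_fire_cp850 = E_DIAERESIS_LATIN1 in as_cp850

print(as_latin1, rule_can_fire_latin1)
print(as_cp850, rule_can_fire_cp850)
```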

I'll adjust the text on the website.

See snowballstem/snowball#1

ojwb · 2025-02-14T20:43:04Z

> For example, I understand that acties should become actie, but it remains unchanged. The same occurs for all other nouns of that same form such as conclusies, condities, etc.
>
> It would seem that this rule is missing entirely.

Both the current kraaij_pohlmann.sbl and the original C implementation remove -s from the three examples you give, so switching would resolve these.

ojwb · 2025-02-14T21:10:33Z

> Not sure where to start, so I'll mention a couple that are incorrect (the last word is the correct one):
>
> gevaren gevar -> gevaar
> gevaar gevar -> gevaar
> gevaarlijk gevar -> gevaarlijk
> gevaarlijke gevar -> gevaarlijk
> gevaarlijker gevaarlijker -> gevaarlijk
> gevaarten gevaart -> gevaarte
> gevallen gevall -> geval
> geven gev -> geef
> gevist gevist -> vis
> gewasbescherming gewasbescherm -> gewasbescherming
> gewassen gewass -> gewas
> geweer gewer -> geweer
> aanbellen aanbell -> bel aan (yes, Dutch is weird)
> aandeel aandel -> aandeel
> aaneen aanen -> aaneen (should really be excluded from stemming if possible, since there is no way that this word occurs in any other form)
> aalmoezen aalmoez -> aalmoes
> gangetje gangetj -> gang
> gebaartje gebaartj -> gebaar

Checking these examples with kraaij_pohlmann.sbl and the original C implementation, the two give the same results:

gevaren -> vaar
gevaar -> vaar
gevaarlijk -> vaarlijk
gevaarlijke -> vaarlijk
gevaarlijker -> vaarlijk
gevaarten -> vaar
gevallen -> val
geven -> geef
gevist -> vis
gewasbescherming -> wasbescherm
gewassen -> was
geweer -> weer
aanbellen -> aanbel
aandeel -> aandeel
aaneen -> aaneen
aalmoezen -> aalmoes
gangetje -> gang
gebaartje -> baar

Some of these do look better than dutch.sbl, but there seem to be some problems:

gevaren, gevaar, gevaarten all stem to vaar, but it seems the first two mean "danger" while the third means "huge objects"; also varen ("to sail") stems to vaar as well.
gevallen ("to happen") stems to val but so does vallen ("to fall"/"to tumble")

It does seem the Kraaij-Pohlmann algorithm is too aggressive at removing ge- but it may be hard to avoid without a huge exception list. Perhaps we could restrict it more based on the measure of the word left after removal (but then gevist -> vis would not happen).
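A rough sketch of what such a restriction might look like. This is not how kraaij_pohlmann.sbl is written: it uses a Porter-style measure over the plain vowels a, e, i, o, u only, ignores diacritics and all suffix steps, and the names `measure` and `strip_ge` and the `min_measure` threshold are hypothetical.

```python
VOWELS = set("aeiou")  # simplification: no diacritics, no special case for ij or y


def measure(word):
    """Porter-style measure: the number of vowel-run -> consonant-run
    transitions, i.e. m in [C](VC)^m[V]."""
    m = 0
    prev_is_vowel = False
    for ch in word:
        is_vowel = ch in VOWELS
        if prev_is_vowel and not is_vowel:
            m += 1
        prev_is_vowel = is_vowel
    return m


def strip_ge(word, min_measure):
    """Remove a leading ge- only if what remains has measure >= min_measure."""
    if word.startswith("ge"):
        rest = word[2:]
        if measure(rest) >= min_measure:
            return rest
    return word


for w in ["gevaar", "gevist", "geweer", "gewasbescherming"]:
    print(w, "->", strip_ge(w, min_measure=2))
```

With a threshold of 2, `gevaar` and `geweer` keep their prefix, but so does `gevist`, which is the trade-off mentioned above; longer remainders such as `wasbescherming` are still stripped.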

I don't think this sinks the idea of making Kraaij-Pohlmann the default, but it would be good if we could find a way to adjust it in this area, and it would be better to do that at the same time as fixing it to handle diacritics (more) like the C implementation and making it the default since then all the changes to stemming of "dutch" happen in one go.

The C implementation only removes the -es part but the Snowball implementation was removing the whole of -ares/-eres. This reduces the number of differences in the output from the two implementations when run on our test vocabulary from 220 to 212. See #1

ojwb · 2025-02-15T04:00:49Z

Commit linked just above reduces the number of differences from 220 to 212.

ojwb · 2025-02-17T05:45:18Z

> One obvious difference from looking at the sources is that the C version includes vowels with accents whereas the Snowball version only considers unaccented vowels - a quick attempt to copy that in Snowball reduced the differences from 220 to 153.

Sadly I failed to find that previous attempt but I seem to have managed to recreate it - thanks to the reduction of 8 above that takes us down to 145 differences. Will tidy up and merge tomorrow.

Implement handling of diacritics on vowels to match the C implementation. Reduces the number of words from the test vocabulary which stem differently from 212 to 145. See #1

Reduces the number of words which stem differently from 145 to 138. See #1

Reduces the number of words which stem differently from 138 to 65. See #1

The Snowball implementation tries to identify cases where `y` is a consonant and temporarily changes these to `Y` which is then treated as a consonant during stemming (then `Y` is changed back to `y` before returning). However the original C Kraaij-Pohlmann implementation does not do this (the `Y` handling is taken from the Porter stemmers for English, French, German and Dutch). A quick scan of the stemming differences resulting from this change suggests that this extra handling only helps by conflating `royale` with `royaal`, but possibly there are additional cases and this extra tweak is useful. However it's getting in the way of resolving the differences between the C and Snowball implementations, so remove it at least for now and review later. This reduces the number of words which stem differently from 65 to 45. See #1
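For readers unfamiliar with the `Y` trick, here is a minimal sketch of the idea. The exact condition used in kraaij_pohlmann.sbl is not quoted in this thread, so the rule below (word-initial `y`, or `y` directly after one of a, e, i, o, u) is an assumption, and both function names are hypothetical.

```python
VOWELS = set("aeiou")


def mark_consonant_y(word):
    """Upper-case every y assumed to act as a consonant:
    word-initial y, or y directly after a vowel."""
    out = []
    for i, ch in enumerate(word):
        if ch == "y" and (i == 0 or word[i - 1] in VOWELS):
            out.append("Y")
        else:
            out.append(ch)
    return "".join(out)


def restore_y(stem):
    """Undo the marking before the stem is returned to the caller."""
    return stem.replace("Y", "y")


marked = mark_consonant_y("royale")
print(marked, restore_y(marked))
```

Stemming rules that test for vowels then see `Y` as a consonant, while any `y` left in lower case can be handled as a vowel.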

This reduces the number of words which stem differently from 45 to 8. See #1

ojwb · 2025-02-18T03:36:49Z

Down to just 8 differences now on the test vocabulary:

word	C K-P	Snowball K-P	Notes
algerije	algerije	alrije
creëren	creëer	creeer
edele	edel	edeel
gedijen	gedij	dij
geoff	of	off	Not a Dutch word (English name Geoff?)
ideële	ideëel	ideeel
recreëren	recreëer	recreeer
tyumen	tyuum	tyum	Name of Russian city?

It looks like there are probably some common causes here.

Comparing the stems from the C version to the Snowball version gives:

A total of 8 words changed stem
3 words changed stem but aren't interesting
1 merges of groups of stems:

{ dij dijen } + { gedijen }

2 splits of groups of stems:

{ creëren | gecreëerd }
{ edele | edel }

6 words moving between stem groups:

The "Geoff" case is probably irrelevant, but for the other differences the C version looks better to me.
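The grouping view above can be approximated with a few lines of Python. This is only a sketch of the idea, not the script that produced the output: the stems for `gedijen`, `edele` and `algerije` come from the table, while the stems for `dij`, `dijen` and `edel` are inferred from the merge/split listing and assumed to be identical in both implementations.

```python
from collections import defaultdict


def stem_groups(stems):
    """Map each word to the frozenset of words sharing its stem."""
    by_stem = defaultdict(set)
    for word, stem in stems.items():
        by_stem[stem].add(word)
    return {w: frozenset(g) for g in by_stem.values() for w in g}


def regrouped_words(stems_a, stems_b):
    """Words whose set of stem-mates differs between the two stemmers."""
    groups_a, groups_b = stem_groups(stems_a), stem_groups(stems_b)
    return sorted(w for w in stems_a if groups_a[w] != groups_b[w])


C_KP = {"dij": "dij", "dijen": "dij", "gedijen": "gedij",
        "edele": "edel", "edel": "edel", "algerije": "algerije"}
SNOWBALL_KP = {"dij": "dij", "dijen": "dij", "gedijen": "dij",
               "edele": "edeel", "edel": "edel", "algerije": "alrije"}

changed = regrouped_words(C_KP, SNOWBALL_KP)
print(changed)
```

`algerije` changes stem but stays alone in its group under both stemmers, which is the kind of change reported as not interesting.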

This reduces the number of words which stem differently from 8 to 6. See #1

This reduces the number of words which stem differently from 6 to 5. See #1

This reduces the number of words which stem differently from 5 to 2. See #1

This reduces the number of words which stem differently from 2 to 1. See #1

The Snowball implementation now produces the same stems for all words in our sample vocabulary list. See #1

ojwb · 2025-02-18T22:23:25Z

The Snowball implementation now produces the same stems for all words in our sample vocabulary list. This list is probably on the short side though, so I extracted a much larger list of 2004127 words from Dutch Wikipedia and Wiktionary and tested with that. I had to fix sample.c to use a bigger buffer, and also fix a place in stem.c where the code tries to copy using strcat() with the source and destination overlapping. After doing that I found just one input which stems differently:

word	C K-P	Snowball K-P
lagerweij	lagerweij	larweij

Fixes handling of `lagerweij` to match C implementation. See #1

ojwb · 2025-02-21T02:59:09Z

The Snowball implementation now produces identical stems to the original C implementation (with some undefined behaviour fixed) for a very large range of Dutch words, and also for words from all the other language vocabularies we have.

My plan is to change to use Kraaij-Pohlmann for "dutch"/"nl", add an alias for people who want to select Martin Porter's "dutch" stemmer, and update the website documentation to reflect this. We also should have a more thorough document about the Kraaij-Pohlmann stemmer than we currently do, so I'll work on that. That will address almost all of the issues raised here, so then we can close this ticket.

I've noted a few cases where the Kraaij-Pohlmann stemmer conflates words in ways that seem problematic - e.g. the worst is probably that geënt (grafted) gets conflated with en (and) as both give the stem en. Some of these probably justify exception cases or other tweaks, but I'll open a new ticket for that. Please contribute any such cases you know of (either note them here and I can collate a list, or add them to the new ticket once I open it).

ojwb · 2025-02-25T03:45:11Z

> Please contribute any such cases you know of (either note them here and I can collate a list, or add them to the new ticket once I open it).

See #208.

See snowballstem/snowball#1

ojwb mentioned this issue Jan 31, 2025

Apostrophes #187

Open

ojwb added a commit to snowballstem/snowball-website that referenced this issue Feb 14, 2025

Update text about Snowball kraaij_pohlmann vs C

91c6f63

See snowballstem/snowball#1

ojwb added a commit that referenced this issue Feb 17, 2025

kraaij_pohlmann.sbl: Handle diacritics

6ee964b

Implement handling of diacritics on vowels to match the C implementation. Reduces the number of words from the test vocabulary which stem differently from 212 to 145. See #1

ojwb added a commit that referenced this issue Feb 17, 2025

kraaij_pohlmann.sbl: Add missing rule for -és

31b0fba

Reduces the number of words which stem differently from 145 to 138. See #1

ojwb added a commit that referenced this issue Feb 17, 2025

kraaij_pohlmann.sbl: Adjust trema after -ge/-ge- removal

4d764b4

Reduces the number of words which stem differently from 138 to 65. See #1

ojwb added a commit that referenced this issue Feb 18, 2025

kraaij_pohlmann.sbl: Handle ë specially in lengthen_V

ca60350

This reduces the number of words which stem differently from 45 to 8. See #1

ojwb added a commit that referenced this issue Feb 18, 2025

kraaij_pohlmann.sbl: Treat ij as vowel for ge-/-ge-

fdc254e

This reduces the number of words which stem differently from 8 to 6. See #1

ojwb added a commit that referenced this issue Feb 18, 2025

kraaij_pohlmann.sbl: Do Step_6 if ge- removed

9f52011

This reduces the number of words which stem differently from 6 to 5. See #1

ojwb added a commit that referenced this issue Feb 18, 2025

kraaij_pohlmann.sbl: Fix ë handling in lengthen_V

24d45c6

This reduces the number of words which stem differently from 5 to 2. See #1

ojwb added a commit that referenced this issue Feb 18, 2025

kraaij_pohlmann.sbl: Fix stemming of edele

bb5e168

This reduces the number of words which stem differently from 2 to 1. See #1

ojwb added a commit that referenced this issue Feb 18, 2025

kraaij_pohlmann.sbl: Fix stemming of tyumen

86581aa

The Snowball implementation now produces the same stems for all words in our sample vocabulary list. See #1

ojwb added a commit that referenced this issue Feb 19, 2025

kraaij_pohlmann.sbl: Another ij fix for ge-/-ge-

d19326a

Fixes handling of `lagerweij` to match C implementation. See #1

ojwb mentioned this issue Feb 25, 2025

Kraaij-Pohlmann Dutch stemmer potential improvements #208

Open

ojwb added a commit to snowballstem/snowball-data that referenced this issue Feb 25, 2025

Update for Kraaij-Pohlmann as default for Dutch

7643582

See snowballstem/snowball#1

ojwb closed this as completed in b676baf Feb 25, 2025
