hyphenation rules are not found for words beginning with a capital letter #23

Open
jschoen42 opened this issue Nov 7, 2024 · 1 comment

jschoen42 commented Nov 7, 2024

When testing with German text, I noticed that, despite 69,000 special rules for compound words,
many German words are hyphenated in the wrong place, although rules for them exist.

For example, all of the following are hyphenated incorrectly:

  • Fortschritt -> Forts-chritt
  • Abendstern -> Abends-tern
  • Morgenthau -> Mor-gent-hau
  • Gastherme -> Gasther-me
  • Nennwertherabsetzung -> Nenn-wer-ther-ab-set-zung

PyHyphen has special handling for fully capitalized words (modes 2 and 3), but none for words in which only the first letter is capitalized.

My workaround for 'syllables' ('pairs' has the same problem):

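from hyphen import Hyphenator

# DATA_DIR is the dictionary directory, defined elsewhere
hyphen = Hyphenator("de_DE", directory=DATA_DIR)

def syllables_patch(word):
    # Hyphenate title-cased words via their lowercase form
    titled = word.istitle()
    if titled:
        word = word.lower()

    result = hyphen.syllables(word)
    if result and titled:
        # Restore the capital on the first syllable
        result[0] = result[0].title()

    return result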

With the patch:

  • Fortschritt -> Fort-schritt
  • Abendstern -> Abend-stern
  • Morgenthau -> Mor-gen-thau
  • Gastherme -> Gas-ther-me
  • Nennwertherabsetzung -> Nenn-wert-her-ab-set-zung

Now all hyphens are correct.

The problem affects all rules in all other languages, not just the German compound rules, but the error is most clearly visible here.
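Since 'pairs' shows the same problem, the same idea could be applied there. The following is only a sketch, not tested against PyHyphen itself: `pairs_patch` is a hypothetical helper, and it assumes the function passed in as `pairs` returns a list of two-element `[left, right]` sequences. The function is passed as a parameter so the sketch runs without a dictionary installed.

```python
from typing import Callable, List, Sequence


def pairs_patch(word: str,
                pairs: Callable[[str], Sequence[Sequence[str]]]) -> List[List[str]]:
    """Hyphenate a title-cased word via its lowercase form.

    `pairs` is any callable that maps a word to a list of
    [left, right] hyphenation pairs (assumption, see above).
    """
    if not word.istitle():
        # Leave lowercase and fully capitalized words untouched
        return [list(p) for p in pairs(word)]

    patched = []
    for left, right in pairs(word.lower()):
        # Restore the capital on the first letter of the left part
        patched.append([left[:1].upper() + left[1:], right])
    return patched
```

With a Hyphenator instance like the one above, this would be called as `pairs_patch("Fortschritt", hyphen.pairs)`.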

Jürgen

jschoen42 commented Nov 9, 2024

I have tested the patch with the German word lists in the repo https://github.com/cpos/AlleDeutschenWoerter

Results with the patch:

  • "Substantive" (nouns): 14,609 in total, 1,130 with corrected/improved hyphens
  • "Verben" (verbs): 4,727 in total, 311 with corrected/improved hyphens when the verb is capitalized at the beginning of a sentence
  • "Adjektive" (adjectives): 6,837 in total, 477 with corrected/improved hyphens when the adjective is capitalized at the beginning of a sentence
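A comparison of this kind could be scripted roughly as follows. This is a sketch under assumptions, not the script used for the numbers above: `count_changed` is a hypothetical helper, and it only counts words whose hyphenation differs between the two functions. Whether a difference is actually a correction still has to be checked by hand or against a reference.

```python
from typing import Callable, Iterable, List, Tuple


def count_changed(words: Iterable[str],
                  split_plain: Callable[[str], List[str]],
                  split_patched: Callable[[str], List[str]]) -> Tuple[int, int]:
    """Return (total words, words whose hyphenation differs)."""
    total = 0
    changed = 0
    for word in words:
        total += 1
        # A word counts as changed if the two syllable lists differ
        if split_plain(word) != split_patched(word):
            changed += 1
    return total, changed
```

For the verb and adjective lists, the words would be capitalized first (for example with `word.capitalize()`) to simulate a sentence start, and the two functions would be `hyphen.syllables` and `syllables_patch`.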
