Hyphenation exception list #208
Comments
The method that we used before to enter exception words required that the base table did not contain any patterns with There are of course other ways to implement an exception list, for example by checking for exception words before passing words to libhyphen. That sounds very straightforward, but it is not so great for a number of reasons, so we have to think it through.

By the way, I noticed that you inserted your exception list of non-standard hyphenations directly in hyph_nb_NO.dic, whereas we used to concatenate the two files while building. There is nothing really wrong with that, except that there is duplication now.

One thing that bothers me is that running substrings.pl on your new hyph_nb_NO.dic changes the file. I expected it to add some new patterns because of the added non-standard hyphenation patterns, but I didn't expect it to change the whole file (and not only the ordering). As far as I understand, that means the file was not prepared correctly.

Finally, what is that last line: "Binærfil (standard inndata) samsvarer"? |
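For illustration, the pre-check idea mentioned above (looking up exception words before passing words to libhyphen) could be sketched roughly as follows. This is only a minimal sketch: `EXCEPTIONS`, `apply_breaks`, `hyphenate_with_exceptions` and `naive_fallback` are hypothetical names, the fallback is a toy stand-in for a real libhyphen call, and the sample words are the ones listed in the issue description. It does not address the drawbacks of this approach that the comment alludes to.

```python
from typing import Callable, Dict

# Hypothetical exception list: lowercase word -> fixed hyphenation.
# An entry without "-" means "never hyphenate this word".
EXCEPTIONS: Dict[str, str] = {
    "dette": "dette",
    "denne": "denne",
    "disse": "disse",
    "kjeks": "kjeks",
}


def apply_breaks(word: str, template: str) -> str:
    """Copy the hyphen positions of `template` onto `word`, keeping the case of `word`."""
    out = []
    i = 0
    for ch in template:
        if ch == "-":
            out.append("-")
        else:
            out.append(word[i])
            i += 1
    return "".join(out)


def hyphenate_with_exceptions(
    word: str,
    fallback: Callable[[str], str],
    exceptions: Dict[str, str] = EXCEPTIONS,
) -> str:
    """Return the exception entry if there is one, else defer to `fallback`."""
    template = exceptions.get(word.lower())
    if template is not None:
        return apply_breaks(word, template)
    return fallback(word)


def naive_fallback(word: str) -> str:
    """Toy stand-in for the pattern-based hyphenator (NOT libhyphen)."""
    return word[:2] + "-" + word[2:] if len(word) > 4 else word


print(hyphenate_with_exceptions("Dette", naive_fallback))
print(hyphenate_with_exceptions("bilen", naive_fallback))
```

The lookup is case-insensitive, and the break positions are copied back onto the original word so that capitalisation survives.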
Would it be possible to, for instance, replace all the standard hyphenation 8s with 7, replace all the non-standard hyphenation 9s with 8, and then use 9 for the exception words? Do you think that would break things?

We could go back to how they were concatenated before. I didn't think about that.

Yeah, I have no idea how it was prepared, only that it works better than what we had before.

Oh, whoops. "Binærfil (standard inndata) samsvarer" is from a diff and means "Binary file (standard in) matches" (or however it's formulated in the English locale). I suppose I performed a diff at some point. That should be removed. |
The fact that the non-standard patterns have The problem is that the base table has |
Yeah. I don't know how the implementation works but would it be a problem to move patterns with 9s to 8 even though there already are other patterns using 8? I would expect the behavior to change slightly but maybe not that much? |
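As background for the renumbering question: in the plain Liang algorithm that TeX uses, and on which Libhyphen is based, the highest digit found at a position wins, and an odd digit permits a break while an even digit forbids one. Shifting a level by one therefore flips its meaning. The sketch below is a simplified pattern applier with made-up patterns (`e1t`, `e8t`); it ignores minimum fragment lengths and Libhyphen's non-standard hyphenation syntax, and all function names are hypothetical.

```python
from typing import List, Tuple


def parse_pattern(pattern: str) -> Tuple[str, List[int]]:
    """Split e.g. 'e1t' into the letters 'et' and the gap levels [0, 1, 0]."""
    letters = ""
    levels = [0]
    for ch in pattern:
        if ch.isdigit():
            levels[-1] = int(ch)
        else:
            letters += ch
            levels.append(0)
    return letters, levels


def break_levels(word: str, patterns: List[str]) -> List[int]:
    """For every gap in '.word.', keep the highest level of any matching pattern."""
    padded = "." + word + "."
    levels = [0] * (len(padded) + 1)
    for pattern in patterns:
        letters, pattern_levels = parse_pattern(pattern)
        start = padded.find(letters)
        while start != -1:
            for offset, level in enumerate(pattern_levels):
                levels[start + offset] = max(levels[start + offset], level)
            start = padded.find(letters, start + 1)
    return levels


def hyphenate(word: str, patterns: List[str]) -> str:
    """Insert '-' at every inner gap whose winning level is odd."""
    levels = break_levels(word, patterns)
    out = []
    for i, ch in enumerate(word):
        # levels[i + 1] is the gap before word[i] (shifted by the leading '.').
        if i > 0 and levels[i + 1] % 2 == 1:
            out.append("-")
        out.append(ch)
    return "".join(out)


def renumber(pattern: str) -> str:
    """The proposed shift: every 8 becomes 7 and every 9 becomes 8."""
    return pattern.translate(str.maketrans("89", "78"))


base = ["e1t", "e8t"]  # made-up: 'e1t' allows a break, 'e8t' forbids it
print(hyphenate("dette", base))
print(hyphenate("dette", [renumber(p) for p in base]))
```

With the made-up patterns, the inhibiting `e8t` becomes the permitting `e7t` after renumbering, so the break it used to suppress comes back.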
I think you need to read this. |
As always I think it would be better to start from the source code if possible instead of patching up the resulting table. It was the case when we used the "spell-norwegian" project and it is the case now. Of course there needs to be source code in the first place, and you need to be able to get hold of it. As you know it is not always possible to get in touch with the author. There are a bunch of people mentioned in the header of the No.pm file of the Text-Hyphen-No project, but it is not immediately clear who you would contact. By the way maybe you should include the header in our copy too. In case anyone else stumbles upon our copy it would be helpful for them.
Thinking about this again, this is of course because the patterns in the Text-Hyphen-No project are TeX patterns, not Hyphen patterns. The Text::Hyphen Perl module has nothing to do with the Hyphen library. |
I grep'ed all the files in |
I got a response from Rune Kleveland (translated by me):
I've downloaded the CVS repo for spell-no from sourceforge, converted it to git, and uploaded it here in case we want to make changes to it: https://github.com/nlbdev/spell-no So as I understand it, it's the |
Cool! Yes, this is kind of what I remember from reading the README some years ago. I never tried to build it because the Makefile seemed super complicated. But maybe now that we can ask Rune Kleveland for help we should give it a try. Do you know what the connection is between this project and the TeX file that you found in the Perl project? |
The things we should definitely try to do after we manage to build it are:
|
The Makefile in the spell-norwegian-2.1 project looks a bit newer than the one in spell-no that you got from CVS. |
I've been looking at the code a bit. The part of the build that interests us the most is in the patterns subdir. It says:
In other words, it looks like the basis of the whole thing is an existing patterns file. But the patterns themselves are not used in the final output, rather they are used to hyphenate the word list, and from this new patterns are generated. This means we could add our own exception words to the process. Also we can decide the hyphen levels in such a way that at least one level is available to add the non-standard patterns at the end. (We can't add them sooner in the process because this part is specific to Libhyphen.)

The README suggests two solutions for when hyphenation fails on words (not in the dictionary). The first one I don't fully understand. The second solution basically adds the word to a list of exceptions which are checked at runtime (TeX's This project solves non-standard hyphenation too, but apparently it is done via some TeX configuration file. I don't think we have to reevaluate our approach, but it's something to keep in mind.

At the top of the Makefile it says that you need a patgen with enough capacity. Luckily we can build patgen from source in case we need to make adjustments to some parameters in the code. |
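One way to picture "adding our own exception words to the process" is to override entries in the hyphenated word list before new patterns are generated from it. The following is only a sketch under assumptions: it assumes one word per entry with `-` marking the allowed breaks, which may not match the exact format the Makefile feeds to patgen, and `merge_exceptions` is a hypothetical helper, not part of spell-no.

```python
from typing import Dict, Iterable, List


def merge_exceptions(
    hyphenated_words: Iterable[str], exceptions: Iterable[str]
) -> List[str]:
    """Let exception entries replace the machine-hyphenated entries for the same word."""
    by_word: Dict[str, str] = {}
    for entry in hyphenated_words:
        by_word[entry.replace("-", "")] = entry
    # Exceptions are applied last so that they win over the generated entries.
    for entry in exceptions:
        by_word[entry.replace("-", "")] = entry
    return sorted(by_word.values())


generated = ["de-tte", "bi-len"]  # sample output of the existing patterns (made up)
overrides = ["dette"]             # hand-maintained: never hyphenate "dette"
print(merge_exceptions(generated, overrides))
```

Because the override happens before pattern generation, the correction would end up encoded in the generated patterns rather than in a separate runtime list.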
No, I don't. Maybe we could add a Dockerfile to https://github.com/nlbdev/spell-no with a build environment? |
OK, sure. Although the build prerequisites are almost non-existent. You only need common Unix tools like awk, sed, gzip, etc., and we will probably build patgen ourselves, and the patgen build is also fully self-contained. By the way, I compared the "patterns" directories inside the spell-no repo and the newer spell-norwegian-2.1, and the conclusion is that the differences are negligible, so we can proceed with the spell-no repo. |
The norsk.words file however has become much bigger in spell-norwegian-2.1, so it's a good idea to update it in spell-no. |
I now have a good understanding of how the build works. I had to make a few modifications to get it working on my macOS machine and with my version of patgen. I also had to add a rule to create a patterns file for Libhyphen. Before I proceed with the other planned changes, like adding exception words, support for non-standard hyphenation, etc., we should discuss a few things (see Slack). |
I did a lot of work on this issue in December last year, but there is still some work to do before we can use it in Pipeline, notably:
I also would like to:
Has anyone tried my tool to check for mistakes or missing words in the |
I really like the tool you made. One thing is that we need to be sure that we keep it in sync with our latest build, in case we make changes to the |
Make a solution for hyphenation exceptions, e.g. words that should never be hyphenated. Hyphenations like these should not occur:
de-tte
de-nne
di-sse
kj-eks