AssertionError assert len(raw) == len(sents) with 2018 shared task raw text #4
Comments
Hi. Did you add the parameter |
Of course. (To be precise, I used the long version |
Is there an easy way to fix that in the code?
I worked on it for a few hours and didn't see a quick way forward. Our team then decided to train with one-sentence-per-line raw text derived from the gold parses rather than the original raw text, and the resulting sentence segmentation models worked reasonably well when tested with the original raw text as input. This probably means that the UU segmenter does not use line breaks as an input feature. Otherwise, one would expect the models to learn to rely on line breaks as the only relevant feature for sentence segmentation on the training data, and then to fail on the test data, which usually doesn't use line breaks to delimit sentences. I cannot say whether this setup reproduces the results published by the UU team, as we paused this research in favour of another project. We will probably continue our efforts to get the UU segmenter working as a baseline and implement our own ideas in the first half of 2019.
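For anyone wanting to try the same workaround, here is a minimal sketch of deriving one-sentence-per-line raw text from the gold parses. It is an illustration only, not necessarily the script used above: it assumes the CoNLL-U files carry the standard "# text = ..." sentence comments, and the function names are hypothetical and not part of segmenter.py.

```python
from typing import Iterable, List

# Standard CoNLL-U comment that holds the untokenised sentence text.
TEXT_PREFIX = "# text = "


def sentences_from_conllu(lines: Iterable[str]) -> List[str]:
    """Collect the raw text of each sentence from CoNLL-U comment lines."""
    sentences = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(TEXT_PREFIX):
            sentences.append(line[len(TEXT_PREFIX):])
    return sentences


def write_one_sentence_per_line(conllu_path: str, out_path: str) -> None:
    """Write the sentence texts of a CoNLL-U file, one sentence per line."""
    with open(conllu_path, encoding="utf-8") as src:
        sentences = sentences_from_conllu(src)
    with open(out_path, "w", encoding="utf-8") as dst:
        for sentence in sentences:
            dst.write(sentence + "\n")
```

Calling write_one_sentence_per_line on the training and development CoNLL-U files would give raw text files whose line count equals the sentence count, which is what the assertion in the title appears to require.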
I see. I'd suggest adding a line to the README about not having any files named raw_text.txt, since I ran into trouble because of that.
I've run segmenter.py train successfully with just conllu files in the workspace, but when I include the raw text from the 2018 shared task as raw_train.txt and raw_dev.txt, I get the AssertionError from the title, assert len(raw) == len(sents). (Line numbers may be slightly off as I added some comments here and there.)
It seems that you assume that the raw text has one sentence per line but the shared task raw text does not use line breaks in this way. Did you not use the raw text at training?
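To make the mismatch concrete, a check along the following lines compares the number of non-empty lines in a raw text file with the number of sentences in the matching CoNLL-U file. This is only a sketch: it assumes that sentences in CoNLL-U are separated by blank lines, as the format specifies, and the helper names are hypothetical and not taken from segmenter.py.

```python
from typing import Iterable


def count_conllu_sentences(lines: Iterable[str]) -> int:
    """Count sentence blocks, i.e. blank-line-separated groups with token lines."""
    count = 0
    in_sentence = False
    for line in lines:
        stripped = line.strip()
        if not stripped:
            # A blank line closes the current sentence block.
            in_sentence = False
        elif not stripped.startswith("#") and not in_sentence:
            # First token line of a new sentence block.
            count += 1
            in_sentence = True
    return count


def count_raw_lines(lines: Iterable[str]) -> int:
    """Count the non-empty lines of a raw text file."""
    return sum(1 for line in lines if line.strip())


def line_counts_match(raw_lines: Iterable[str], conllu_lines: Iterable[str]) -> bool:
    """True if the raw text has exactly one non-empty line per CoNLL-U sentence."""
    return count_raw_lines(raw_lines) == count_conllu_sentences(conllu_lines)
```

With the shared task raw text, where one line typically holds several sentences, line_counts_match would be expected to return False.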
Do you use sentences as training instances? Wouldn't the CRF then never see the context to the right of sentence boundaries? In English, for example, the capitalisation of the next letter is a strong cue. In the worst case, wouldn't the CRF simply learn to check whether it is at the end of a sequence to assign T or U?