AssertionError assert len(raw) == len(sents) with 2018 shared task raw text #4

Open
jowagner opened this issue Sep 10, 2018 · 5 comments

Comments

@jowagner
Contributor

jowagner commented Sep 10, 2018

I've run segmenter.py train successfully with just conllu files in the workspace but when I include the raw text from the 2018 shared task as raw_train.txt and raw_dev.txt, I get

Traceback (most recent call last):
  File "segmenter.py", line 155, in <module>
    reset=args.reset, tag_scheme=args.tags, ignore_mwt=args.ignore_mwt)
  File "/.../ud-parsing-2018/uusegmenter/toolbox.py", line 905, in raw2tags
    assert len(raw) == len(sents)
AssertionError

(Line numbers may be slightly off as I added some comments here and there.)

It seems that you assume the raw text has one sentence per line, but the shared task raw text does not use line breaks this way. Did you not use the raw text during training?
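A quick way to test this suspicion before training is to compare the number of non-empty lines in the raw file with the number of sentences in the matching CoNLL-U file. The sketch below is only a diagnostic under stated assumptions: the helper names are hypothetical, and it does not reproduce what raw2tags in toolbox.py actually computes; it only mirrors the one-sentence-per-line assumption that the failing assertion suggests.

```python
def count_conllu_sentences(conllu_text):
    """Count sentences in CoNLL-U text.

    A sentence is a run of non-blank lines (comments plus token lines);
    sentences are separated by blank lines.
    """
    count = 0
    in_sentence = False
    for line in conllu_text.splitlines():
        if line.strip():
            if not in_sentence:
                count += 1
                in_sentence = True
        else:
            in_sentence = False
    return count


def count_raw_lines(raw_text):
    """Count non-empty lines in a raw text file (hypothetical helper)."""
    return sum(1 for line in raw_text.splitlines() if line.strip())


def looks_one_sentence_per_line(raw_text, conllu_text):
    """True if the raw text has exactly one non-empty line per sentence."""
    return count_raw_lines(raw_text) == count_conllu_sentences(conllu_text)
```

If this check returns False for raw_train.txt or raw_dev.txt, the line count and the sentence count differ, which is consistent with the assertion failing.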

Do you use sentences as training instances? If so, wouldn't the CRF never see the context to the right of a sentence boundary? In English, for example, the capitalisation of the next letter is a strong cue. And in the worst case, wouldn't the CRF simply learn to check whether it has reached the end of a sequence to decide between T and U?

@yanshao9798
Owner

Hi. Did you add the parameter -ss when you trained the model?

@jowagner
Contributor Author

jowagner commented Sep 21, 2018

Of course. (To be precise, I used the long version --sent_seg.) I used the exact same command that succeeded without pre-existing raw_train.txt and raw_dev.txt. That run showed improvements over the first 20 epochs, up to around 0.785 F1 for sentence segmentation and 0.994 F1 for tokenisation with treebank en_ewt.

@erickrf

erickrf commented Dec 19, 2018

Is there an easy way to fix that in the code?

@jowagner
Contributor Author

I worked on it for a few hours and didn't see a quick way forward. Our team then decided to train with one-sentence-per-line raw text derived from the gold parses rather than the original raw text, and the resulting sentence segmentation models worked reasonably well when tested with original raw text as input. This probably means that the UU segmenter does not use line breaks as an input feature. Otherwise, one would expect the models to learn line breaks as the only relevant feature for sentence segmentation on the training data and then fail on test data, which usually doesn't use line breaks to delimit sentences.
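One way to derive such one-sentence-per-line text is sketched below. It assumes the treebank files carry the standard CoNLL-U sentence-level comment "# text = ..." and simply copies that text, one sentence per output line. The function names and the input file name in the usage note are hypothetical, and the thread does not say whether the team used these comments or rebuilt the text from the token columns.

```python
TEXT_PREFIX = "# text = "


def conllu_to_sentence_lines(conllu_text):
    """Return the sentence texts found in '# text = ' comments, in order."""
    sentences = []
    for line in conllu_text.splitlines():
        if line.startswith(TEXT_PREFIX):
            sentences.append(line[len(TEXT_PREFIX):].strip())
    return sentences


def write_sentence_per_line(conllu_path, raw_path):
    """Write one sentence per line to raw_path (hypothetical helper)."""
    with open(conllu_path, encoding="utf-8") as src:
        sentences = conllu_to_sentence_lines(src.read())
    with open(raw_path, "w", encoding="utf-8") as dst:
        for sentence in sentences:
            dst.write(sentence + "\n")
    return len(sentences)
```

For example, write_sentence_per_line("train.conllu", "raw_train.txt") would produce a raw_train.txt whose line count equals the number of "# text" comments in the input, which is the shape the failing assertion appears to expect.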

I cannot say whether this setup reproduces the results published by the UU team, as we paused this research in favour of another project. We will probably continue our efforts to get the UU segmenter working as a baseline and implement our own ideas in the first half of 2019.

@erickrf

erickrf commented Dec 22, 2018

I see. I'd suggest adding a line to the README about having no files named raw_text.txt, since I ran into trouble because of that.
