AssertionError assert len(raw) == len(sents) with 2018 shared task raw text #4

jowagner · 2018-09-10T11:26:01Z

I've run segmenter.py train successfully with just the CoNLL-U files in the workspace, but when I include the raw text from the 2018 shared task as raw_train.txt and raw_dev.txt, I get:

Traceback (most recent call last):
  File "segmenter.py", line 155, in <module>
    reset=args.reset, tag_scheme=args.tags, ignore_mwt=args.ignore_mwt)
  File "/.../ud-parsing-2018/uusegmenter/toolbox.py", line 905, in raw2tags
    assert len(raw) == len(sents)
AssertionError

(Line numbers may be slightly off as I added some comments here and there.)

It seems that you assume that the raw text has one sentence per line, but the shared task raw text does not use line breaks in this way. Did you not use the raw text for training?
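To check whether a given pair of files would trip the assertion, one can compare the number of non-empty lines in the raw text with the number of sentences in the CoNLL-U file. The following is only a minimal sketch under the assumption that raw2tags compares exactly these two counts; the function names and the sample data are hypothetical and not part of the UU segmenter.

```python
def count_conllu_sentences(conllu_text: str) -> int:
    """Count sentences in CoNLL-U text.

    A sentence is a block of lines, terminated by a blank line,
    that contains at least one token line (a line not starting with '#').
    """
    count = 0
    has_token = False
    for line in conllu_text.splitlines():
        stripped = line.strip()
        if not stripped:
            # A blank line closes the current block.
            if has_token:
                count += 1
            has_token = False
        elif not stripped.startswith("#"):
            has_token = True
    # Handle a final block that is not followed by a blank line.
    if has_token:
        count += 1
    return count


def count_raw_lines(raw_text: str) -> int:
    """Count the non-empty lines of a raw text file."""
    return sum(1 for line in raw_text.splitlines() if line.strip())


# Hypothetical sample data: two sentences in the treebank,
# but a single paragraph line in the raw text.
SAMPLE_CONLLU = (
    "# sent_id = 1\n"
    "# text = Hello world.\n"
    "1\tHello\t_\t_\t_\t_\t_\t_\t_\t_\n"
    "2\tworld\t_\t_\t_\t_\t_\t_\t_\t_\n"
    "3\t.\t_\t_\t_\t_\t_\t_\t_\t_\n"
    "\n"
    "# sent_id = 2\n"
    "# text = How are you?\n"
    "1\tHow\t_\t_\t_\t_\t_\t_\t_\t_\n"
    "2\tare\t_\t_\t_\t_\t_\t_\t_\t_\n"
    "3\tyou\t_\t_\t_\t_\t_\t_\t_\t_\n"
    "4\t?\t_\t_\t_\t_\t_\t_\t_\t_\n"
    "\n"
)
SAMPLE_RAW = "Hello world. How are you?\n"

if __name__ == "__main__":
    n_sents = count_conllu_sentences(SAMPLE_CONLLU)
    n_lines = count_raw_lines(SAMPLE_RAW)
    print(f"sentences: {n_sents}, raw lines: {n_lines}")
    if n_sents != n_lines:
        print("Counts differ: the raw text is not one sentence per line.")
```

If the two counts differ for the shared task files, that would be consistent with the one-sentence-per-line assumption described above.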

Do you use sentences as training instances? If so, the CRF would never see the context to the right of a sentence boundary (in English, for example, the capitalisation of the next letter is a strong cue). In the worst case, wouldn't it simply learn to check whether it is at the end of a sequence to decide between T and U?

yanshao9798 · 2018-09-21T03:23:51Z

Hi. Did you add the parameter -ss when you trained the model?

jowagner · 2018-09-21T08:45:28Z

Of course. (To be precise, I used the long version, --sent_seg.) I used exactly the same command that succeeded without pre-existing raw_train.txt and raw_dev.txt. That run showed improvements over the first 20 epochs, reaching around 0.785 F1 for sentence segmentation and 0.994 F1 for tokenisation on treebank en_ewt.

erickrf · 2018-12-19T17:57:56Z

Is there an easy way to fix that in the code?

jowagner · 2018-12-21T19:39:06Z

I worked on it for a few hours and didn't see a quick way forward. Our team then decided to train with one-sentence-per-line raw text derived from the gold parses rather than the original raw text, and the resulting sentence segmentation models worked reasonably well when tested with the original raw text as input. This probably means that the UU segmenter does not use line breaks as an input feature. Otherwise, one would expect the models to learn to rely on line breaks as the only relevant feature for sentence segmentation on the training data, and then to fail on test data, which usually doesn't use line breaks to delimit sentences.

I cannot say whether this setup reproduces the results published by the UU team, as we paused this research in favour of another project. We will probably continue our efforts to get the UU segmenter working as a baseline, and to implement our own ideas, in the first half of 2019.
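For anyone wanting to try the same workaround, one way to derive one-sentence-per-line raw text from the gold CoNLL-U files is to collect the sentence-level "# text = ..." comments that UD v2 treebanks carry. This is only a sketch of that idea, not necessarily the script our team used; the function name and the sample data are hypothetical.

```python
import re
from typing import List

# Matches UD v2 sentence-level comments such as "# text = Hello world."
TEXT_COMMENT = re.compile(r"^#\s*text\s*=\s*(.*)$")


def conllu_to_sentence_lines(conllu_text: str) -> List[str]:
    """Return the sentence texts of a CoNLL-U document, one per entry."""
    sentences = []
    for line in conllu_text.splitlines():
        match = TEXT_COMMENT.match(line)
        if match:
            sentences.append(match.group(1).strip())
    return sentences


# Hypothetical sample data for illustration only.
SAMPLE = (
    "# sent_id = 1\n"
    "# text = Hello world.\n"
    "1\tHello\t_\t_\t_\t_\t_\t_\t_\t_\n"
    "2\tworld\t_\t_\t_\t_\t_\t_\t_\t_\n"
    "3\t.\t_\t_\t_\t_\t_\t_\t_\t_\n"
    "\n"
    "# sent_id = 2\n"
    "# text = How are you?\n"
    "1\tHow\t_\t_\t_\t_\t_\t_\t_\t_\n"
    "2\tare\t_\t_\t_\t_\t_\t_\t_\t_\n"
    "3\tyou\t_\t_\t_\t_\t_\t_\t_\t_\n"
    "4\t?\t_\t_\t_\t_\t_\t_\t_\t_\n"
    "\n"
)

if __name__ == "__main__":
    # Each printed line corresponds to one sentence of the treebank;
    # redirecting this output to a file would give a candidate raw_train.txt.
    for sentence in conllu_to_sentence_lines(SAMPLE):
        print(sentence)
```

Because every output line corresponds to exactly one sentence block, the line count should match the sentence count, provided every sentence in the treebank has a text comment.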

erickrf · 2018-12-22T16:58:41Z

I see. I'd suggest adding a line to the README about having no files named raw_text.txt, since I ran into trouble because of that.
