The following is an overview of the sizes of the 17 subcorpora that make up the MultiGEC dataset, measured in number of texts. For readability, we report numbers only for the first two hypothesis sets.
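| Subcorpus | train orig | train hyp1 | train hyp2 | dev orig | dev hyp1 | dev hyp2 | test orig | test hyp1 | test hyp2 | total orig | total hyp1 | total hyp2 |
|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| Czech - NatWebInf | 3620 | 3620 | 0 | 1291 | 1291 | 687 | 1256 | 1256 | 1216 | 6167 | 6167 | 1903 |
| Czech - Romani | 3247 | 3247 | 0 | 179 | 179 | 84 | 173 | 173 | 163 | 3599 | 3599 | 247 |
| Czech - SecLearn | 2057 | 2057 | 183 | 173 | 173 | 97 | 177 | 177 | 170 | 2407 | 2407 | 450 |
| Czech - NatForm | 227 | 227 | 0 | 88 | 88 | 47 | 76 | 76 | 74 | 391 | 391 | 121 |
| English - Write & Improve | 4040 | 4040 | 0 | 506 | 506 | 0 | 504 | 504 | 0 | 5050 | 5050 | 0 |
| Estonian - EIC | 206 | 206 | 206 | 26 | 26 | 26 | 26 | 26 | 26 | 258 | 258 | 258 |
| Estonian - EKIL2 | 1202 | 1202 | 1202 | 150 | 150 | 150 | 151 | 151 | 151 | 1503 | 1503 | 1503 |
| German - Merlin | 827 | 827 | 0 | 103 | 103 | 0 | 103 | 103 | 0 | 1033 | 1033 | 0 |
| Greek - GLCII | 1031 | 1031 | 0 | 129 | 129 | 0 | 129 | 129 | 0 | 1289 | 1289 | 0 |
| Icelandic - IceEC | 140 | 140 | 0 | 18 | 18 | 0 | 18 | 18 | 0 | 176 | 176 | 0 |
| Icelandic - IceL2EC | 155 | 155 | 0 | 19 | 19 | 0 | 19 | 19 | 0 | 193 | 193 | 0 |
| Italian - Merlin | 651 | 651 | 0 | 81 | 81 | 0 | 81 | 81 | 0 | 813 | 813 | 0 |
| Latvian - LaVA | 813 | 813 | 0 | 101 | 101 | 0 | 101 | 101 | 0 | 1015 | 1015 | 0 |
| Russian - RULEC-GEC | 2539 | 2539 | 0 | 1969 | 1969 | 0 | 1535 | 1535 | 1535 | 6043 | 6043 | 1535 |
| Slovene - Solar-Eval | 10 | 10 | 0 | 50 | 50 | 0 | 49 | 49 | 0 | 109 | 109 | 0 |
| Swedish - SweLL-gold | 402 | 402 | 0 | 50 | 50 | 0 | 50 | 50 | 0 | 502 | 502 | 0 |
| Ukrainian - UA-GEC | 1706 | 1706 | 0 | 87 | 87 | 87 | 79 | 79 | 79 | 1872 | 1872 | 166 |
| total | 22873 | 22873 | 1591 | 5020 | 5020 | 1178 | 4527 | 4527 | 3414 | 32420 | 32420 | 6183 |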
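The totals in the overview can be recomputed from the per-subcorpus counts as a quick consistency check. The following is a minimal sketch, not part of the MultiGEC repository: the names `COUNTS`, `REPORTED`, `split_totals` and `check_totals` are illustrative assumptions, and the numbers are copied by hand from the overview above. Only `orig` and `hyp2` are listed, since `hyp1` equals `orig` in every row.

```python
# Text counts copied from the overview above, as (train, dev, test) tuples.
# All names in this sketch are hypothetical and exist only for illustration.
COUNTS = {
    "Czech - NatWebInf": {"orig": (3620, 1291, 1256), "hyp2": (0, 687, 1216)},
    "Czech - Romani": {"orig": (3247, 179, 173), "hyp2": (0, 84, 163)},
    "Czech - SecLearn": {"orig": (2057, 173, 177), "hyp2": (183, 97, 170)},
    "Czech - NatForm": {"orig": (227, 88, 76), "hyp2": (0, 47, 74)},
    "English - Write & Improve": {"orig": (4040, 506, 504), "hyp2": (0, 0, 0)},
    "Estonian - EIC": {"orig": (206, 26, 26), "hyp2": (206, 26, 26)},
    "Estonian - EKIL2": {"orig": (1202, 150, 151), "hyp2": (1202, 150, 151)},
    "German - Merlin": {"orig": (827, 103, 103), "hyp2": (0, 0, 0)},
    "Greek - GLCII": {"orig": (1031, 129, 129), "hyp2": (0, 0, 0)},
    "Icelandic - IceEC": {"orig": (140, 18, 18), "hyp2": (0, 0, 0)},
    "Icelandic - IceL2EC": {"orig": (155, 19, 19), "hyp2": (0, 0, 0)},
    "Italian - Merlin": {"orig": (651, 81, 81), "hyp2": (0, 0, 0)},
    "Latvian - LaVA": {"orig": (813, 101, 101), "hyp2": (0, 0, 0)},
    "Russian - RULEC-GEC": {"orig": (2539, 1969, 1535), "hyp2": (0, 0, 1535)},
    "Slovene - Solar-Eval": {"orig": (10, 50, 49), "hyp2": (0, 0, 0)},
    "Swedish - SweLL-gold": {"orig": (402, 50, 50), "hyp2": (0, 0, 0)},
    "Ukrainian - UA-GEC": {"orig": (1706, 87, 79), "hyp2": (0, 87, 79)},
}

# Totals as reported in the last row of the overview, as (train, dev, test).
REPORTED = {"orig": (22873, 5020, 4527), "hyp2": (1591, 1178, 3414)}


def split_totals(counts, key):
    """Sum the (train, dev, test) counts for `key` over all subcorpora."""
    per_subcorpus = (entry[key] for entry in counts.values())
    return tuple(sum(column) for column in zip(*per_subcorpus))


def check_totals(counts=COUNTS, reported=REPORTED):
    """Return True if every recomputed column sum matches the reported total."""
    return all(split_totals(counts, key) == reported[key] for key in reported)


if __name__ == "__main__":
    for key in REPORTED:
        totals = split_totals(COUNTS, key)
        # Print per-split sums and the grand total across train, dev and test.
        print(key, totals, sum(totals))
    print("consistent with reported totals:", check_totals())
```

Summing the three per-split values for each set gives the grand totals in the rightmost columns of the overview.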
Subcorpus-specific statistics
The following tables contain detailed statistics for the 17 language-specific MultiGEC subcorpora.
The numbers of sentences and tokens were recomputed to ensure cross-language consistency, so they might differ from what is reported in the papers introducing the source corpora.