Dataset statistics

Overall corpus statistics

The table below gives the size of each of the 17 subcorpora that make up the MultiGEC dataset, measured in number of texts. For the sake of readability, we only report numbers for the first two hypothesis sets (hyp1 and hyp2).

	train			dev			test			total
	orig	hyp1	hyp2	orig	hyp1	hyp2	orig	hyp1	hyp2	orig	hyp1	hyp2
Czech - NatWebInf	3620	3620	0	1291	1291	687	1256	1256	1216	6167	6167	1903
Czech - Romani	3247	3247	0	179	179	84	173	173	163	3599	3599	247
Czech - SecLearn	2057	2057	183	173	173	97	177	177	170	2407	2407	450
Czech - NatForm	227	227	0	88	88	47	76	76	74	391	391	121
English - Write & Improve	4040	4040	0	506	506	0	504	504	0	5050	5050	0
Estonian - EIC	206	206	206	26	26	26	26	26	26	258	258	258
Estonian - EKIL2	1202	1202	1202	150	150	150	151	151	151	1503	1503	1503
German - Merlin	827	827	0	103	103	0	103	103	0	1033	1033	0
Greek - GLCII	1031	1031	0	129	129	0	129	129	0	1289	1289	0
Icelandic - IceEC	140	140	0	18	18	0	18	18	0	176	176	0
Icelandic - IceL2EC	155	155	0	19	19	0	19	19	0	193	193	0
Italian - Merlin	651	651	0	81	81	0	81	81	0	813	813	0
Latvian - LaVA	813	813	0	101	101	0	101	101	0	1015	1015	0
Russian - RULEC-GEC	2539	2539	0	1969	1969	0	1535	1535	1535	6043	6043	1535
Slovene - Solar-Eval	10	10	0	50	50	0	49	49	0	109	109	0
Swedish - SweLL-gold	402	402	0	50	50	0	50	50	0	502	502	0
Ukrainian - UA-GEC	1706	1706	0	87	87	87	79	79	79	1872	1872	166
total	22873	22873	1591	5020	5020	1178	4527	4527	3414	32420	32420	6183
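
The totals in the last row can be recomputed from the per-subcorpus rows. The following is a minimal sketch that does this for the `orig` text counts only; the numbers are copied from the table above, while the names `ORIG_TEXTS` and `split_totals` are illustrative and not part of any MultiGEC tooling.

```python
# Number of original texts per split (train, dev, test), copied from the table above.
ORIG_TEXTS = {
    "Czech - NatWebInf": (3620, 1291, 1256),
    "Czech - Romani": (3247, 179, 173),
    "Czech - SecLearn": (2057, 173, 177),
    "Czech - NatForm": (227, 88, 76),
    "English - Write & Improve": (4040, 506, 504),
    "Estonian - EIC": (206, 26, 26),
    "Estonian - EKIL2": (1202, 150, 151),
    "German - Merlin": (827, 103, 103),
    "Greek - GLCII": (1031, 129, 129),
    "Icelandic - IceEC": (140, 18, 18),
    "Icelandic - IceL2EC": (155, 19, 19),
    "Italian - Merlin": (651, 81, 81),
    "Latvian - LaVA": (813, 101, 101),
    "Russian - RULEC-GEC": (2539, 1969, 1535),
    "Slovene - Solar-Eval": (10, 50, 49),
    "Swedish - SweLL-gold": (402, 50, 50),
    "Ukrainian - UA-GEC": (1706, 87, 79),
}


def split_totals(counts):
    """Sum the (train, dev, test) tuples column-wise and add the grand total."""
    train, dev, test = (sum(column) for column in zip(*counts.values()))
    return {"train": train, "dev": dev, "test": test, "total": train + dev + test}


totals = split_totals(ORIG_TEXTS)
print(totals)
# {'train': 22873, 'dev': 5020, 'test': 4527, 'total': 32420}
```

The same check can be repeated for the `hyp1` and `hyp2` columns by swapping in the corresponding counts.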

Subcorpus-specific statistics

The following tables contain detailed statistics for the 17 language-specific MultiGEC subcorpora. The numbers of sentences and tokens were recomputed to ensure cross-language consistency, so they may differ from those reported in the papers introducing the source corpora.
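
To illustrate why recounting with a single procedure makes the figures comparable across languages, here is a rough sketch of a uniform counter. It is not the procedure used to produce the tables below: the source does not say which tokenizer or sentence splitter was applied, so the regex-based rules and the names `count_units` and `corpus_stats` are assumptions made purely for illustration.

```python
import re


def count_units(text):
    """Return (tokens, sentences) for one text using naive, language-agnostic rules."""
    # Assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    # Assumption: a token is a run of word characters or a single punctuation mark.
    tokens = re.findall(r"\w+|[^\w\s]", text)
    return len(tokens), len(sentences)


def corpus_stats(texts):
    """Aggregate token, sentence and text counts over a list of texts."""
    n_tokens = n_sentences = 0
    for text in texts:
        tokens, sentences = count_units(text)
        n_tokens += tokens
        n_sentences += sentences
    return {"tokens": n_tokens, "sentences": n_sentences, "texts": len(texts)}


print(corpus_stats(["Hello world. How are you?", "Fine, thanks!"]))
# {'tokens': 11, 'sentences': 3, 'texts': 2}
```

Applying one such function to every subcorpus, rather than relying on the counts published with each source corpus, is what keeps the columns comparable; a real pipeline would likely use a proper tokenizer instead of these regexes.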

Czech - NatWebInf

	tokens			sentences			texts
	orig	hyp1	hyp2	orig	hyp1	hyp2	orig	hyp1	hyp2
train	83725	86805	63976	6463	7706	5550	3620	3620	0
dev	29827	33142	17954	2270	2895	1565	1291	1291	687
test	25707	29400	28563	2059	2842	2692	1256	1256	1216
total	139259	149347	110493	10792	13443	9807	6167	6167	1903

Czech - Romani

	tokens			sentences			texts
	orig	hyp1	hyp2	orig	hyp1	hyp2	orig	hyp1	hyp2
train	277020	294217	0	18198	21393	0	3247	3247	0
dev	14437	15219	7612	900	1144	550	179	179	84
test	15533	16315	15414	967	1300	1139	173	173	163
total	306990	325751	23026	20065	23837	1689	3599	3599	247

Czech - SecLearn

	tokens			sentences			texts
	orig	hyp1	hyp2	orig	hyp1	hyp2	orig	hyp1	hyp2
train	329894	335339	32331	27741	29433	2706	2057	2057	183
dev	31933	32209	19511	2608	2754	1569	173	173	97
test	35085	35505	33836	2710	2914	2730	177	177	170
total	396912	403053	85678	33059	35101	7005	2407	2407	450

Czech - NatForm

	tokens			sentences			texts
	orig	hyp1	hyp2	orig	hyp1	hyp2	orig	hyp1	hyp2
train	44034	45165	0	3245	3304	0	227	227	0
dev	22118	22468	12172	1537	1555	878	88	88	47
test	19886	20237	19645	1433	1492	1423	76	76	74
total	86038	87870	31817	6215	6351	2301	391	391	121

English - Write & Improve

	tokens		sentences		texts
	orig	hyp1	orig	hyp1	orig	hyp1
train	676366	686379	37341	39074	4040	4040
dev	88628	89877	4307	4669	506	506
test	92915	94276	4911	5324	504	504
total	857909	870532	46559	49067	5050	5050

Estonian - EIC

	tokens				sentences				texts
	orig	hyp1	hyp2	hyp3	orig	hyp1	hyp2	hyp3	orig	hyp1	hyp2	hyp3
train	33718	33799	33783	33817	2849	2928	2906	2931	206	206	206	206
dev	4465	4471	4460	4472	366	373	372	373	26	26	26	26
test	4319	4343	4332	4341	385	391	388	392	26	26	26	26
total	42502	42613	42575	42630	3600	3692	3666	3696	258	258	258	258

Estonian - EKIL2

	tokens			sentences			texts
	orig	hyp1	hyp2	orig	hyp1	hyp2	orig	hyp1	hyp2
train	187960	189527	189437	14400	14779	14740	1202	1202	1202
dev	24396	24460	24450	1853	1896	1890	150	150	150
test	22952	23117	23106	1676	1731	1727	151	151	151
total	235308	237104	236993	17929	18406	18357	1503	1503	1503

German - Merlin

	tokens		sentences		texts
	orig	hyp1	orig	hyp1	orig	hyp1
train	117172	120477	8455	9290	827	827
dev	15739	16144	1102	1206	103	103
test	13343	13755	1029	1121	103	103
total	146254	150376	10586	11617	1033	1033

Greek - GLCII

	tokens		sentences		texts
	orig	hyp1	orig	hyp1	orig	hyp1
train	206577	213666	12167	13066	1031	1031
dev	26257	26884	1538	1663	129	129
test	24512	25456	1525	1658	129	129
total	257346	266006	15230	16387	1289	1289

Icelandic - IceEC

	tokens		sentences		texts
	orig	hyp1	orig	hyp1	orig	hyp1
train	141302	141411	7146	7211	140	140
dev	16011	16033	784	789	18	18
test	19160	19135	905	909	18	18
total	176473	176579	8835	8909	176	176

Icelandic - IceL2EC

	tokens		sentences		texts
	orig	hyp1	orig	hyp1	orig	hyp1
train	124604	124493	5470	5599	155	155
dev	18880	18751	741	789	19	19
test	14310	14288	595	617	19	19
total	157794	157532	6806	7005	193	193

Italian - Merlin

	tokens		sentences		texts
	orig	hyp1	orig	hyp1	orig	hyp1
train	82769	83733	6620	6769	651	651
dev	10624	10713	818	848	81	81
test	10482	10566	845	854	81	81
total	103875	105012	8283	8471	813	813

Latvian - LaVA

	tokens		sentences		texts
	orig	hyp1	orig	hyp1	orig	hyp1
train	147888	151745	17254	18236	813	813
dev	18413	18949	2228	2359	101	101
test	17894	18311	2091	2188	101	101
total	184195	189005	21573	22783	1015	1015

Russian - RULEC-GEC

	tokens				sentences				texts
	orig	hyp1	hyp2	hyp3	orig	hyp1	hyp2	hyp3	orig	hyp1	hyp2	hyp3
train	88173	88363	0	0	5191	5171	0	0	2539	2539	0	0
dev	43521	43661	0	0	2688	2682	0	0	1969	1969	0	0
test	91134	91881	90703	91665	5321	5311	5338	5361	1535	1535	1535	1535
total	222828	223905	90703	91665	13200	13164	5338	5361	6043	6043	1535	1535

Slovene - Solar-Eval

	tokens		sentences		texts
	orig	hyp1	orig	hyp1	orig	hyp1
train	5053	5178	253	296	10	10
dev	31316	31623	1672	1825	50	50
test	33467	33609	1775	1908	49	49
total	69836	70410	3700	4029	109	109

Swedish - SweLL-gold

	tokens		sentences		texts
	orig	hyp1	orig	hyp1	orig	hyp1
train	120035	123372	6294	6860	402	402
dev	13182	13499	724	770	50	50
test	12016	12376	653	704	50	50
total	145233	149247	7671	8334	502	502

Ukrainian - UA-GEC

	tokens					sentences					texts
	orig	hyp1	hyp2	hyp3	hyp4	orig	hyp1	hyp2	hyp3	hyp4	orig	hyp1	hyp2	hyp3	hyp4
train	458693	462431	0	460401	0	29429	30057	0	30078	0	1706	1706	0	1706	0
dev	23866	24106	24168	23954	23949	1318	1338	1370	1337	1380	87	87	87	87	87
test	19951	20121	20158	20023	19995	1089	1143	1193	1143	1192	79	79	79	79	79
total	502510	506658	44326	504378	43944	31836	32538	2563	32558	2572	1872	1872	166	1872	166
