Summary
This issue proposes adding ASSIN2, a standard Brazilian Portuguese benchmark, to the harness. It covers two subtasks:
- ASSIN2-RTE: Recognizing Textual Entailment (6,500 train / 500 validation / 2,448 test sentence pairs, F1-macro metric)
- ASSIN2-STS: Semantic Textual Similarity (same splits, Pearson correlation metric)
Dataset available at nilc-nlp/assin2 on Hugging Face.
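For reference, the two metrics named above can be sketched in plain Python. This is only an illustration of what would be computed, not the harness's own implementation; the function names `macro_f1` and `pearson` are hypothetical, and a real task would more likely reuse existing metric code.

```python
import math
from typing import Sequence


def macro_f1(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Unweighted mean of per-class F1 over every label seen in gold or pred."""
    labels = sorted(set(gold) | set(pred))
    if not labels:
        raise ValueError("macro_f1 needs at least one example")
    per_class = []
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        per_class.append(2 * tp / denom if denom else 0.0)
    return sum(per_class) / len(per_class)


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation between gold and predicted similarity scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("pearson needs two equal-length sequences of length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("pearson is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)
```

Macro averaging weights both RTE classes equally regardless of how often each occurs, which is why it is reported instead of plain accuracy.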
Motivation
ASSIN2 is the successor to ASSIN v1 (already in portuguese_bench) and has been one of the most widely used Brazilian Portuguese NLP benchmarks since 2020. Adding it would extend the existing portuguese_bench task group with a more recent evaluation suite and complement the recently proposed ifeval_pt (#3622).
Implementation plan
- Two task configs: assin2_rte and assin2_sts under portuguese_bench
- Dataset: nilc-nlp/assin2
- RTE: few-shot multiple choice, F1-macro
- STS: generation, Pearson correlation
- Happy to implement if maintainers are interested
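As a rough sketch of the two task styles in the plan, the snippet below builds a multiple-choice RTE prompt and parses a numeric STS score out of free-form generated text. Everything here is an assumption for discussion: the prompt wording, the "Sim"/"Não" answer choices, the helper names `rte_prompt` and `parse_sts_score`, and the clamping to the 1 to 5 relatedness range are not taken from the harness or from an existing config.

```python
import re
from typing import List, Optional, Tuple

# Hypothetical answer choices for the RTE multiple-choice task.
RTE_CHOICES: List[str] = ["Sim", "Não"]

# Matches the first integer or decimal; accepts "," as a decimal separator,
# since Portuguese text commonly writes 4,5 instead of 4.5.
_NUMBER = re.compile(r"[-+]?\d+(?:[.,]\d+)?")


def rte_prompt(premise: str, hypothesis: str) -> Tuple[str, List[str]]:
    """Return (prompt, choices) for one sentence pair. Wording is illustrative."""
    prompt = (
        f"Premissa: {premise}\n"
        f"Hipótese: {hypothesis}\n"
        "A premissa implica a hipótese? Resposta:"
    )
    return prompt, RTE_CHOICES


def parse_sts_score(
    generation: str, low: float = 1.0, high: float = 5.0
) -> Optional[float]:
    """Extract the first number from the model output and clamp it to [low, high].

    Returns None when no number is present, so the caller can decide how to
    score an unparseable answer.
    """
    match = _NUMBER.search(generation)
    if match is None:
        return None
    value = float(match.group(0).replace(",", "."))
    return min(max(value, low), high)
```

How unparseable generations should be scored (dropped, or mapped to a default) affects the Pearson result and would need to be settled during review.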
References
- Dataset: huggingface.co/datasets/nilc-nlp/assin2
- Paper: Real et al., 2020, "The ASSIN 2 Shared Task: A Quick Overview"