For this lab, we'll train models for binary classification on a multilingual dataset of Wikipedia comments labelled for toxicity by humans.
The Conversation AI team, a research initiative founded by Jigsaw and Google, builds technology to protect voices in conversation.
...
This year, we're taking advantage of Kaggle's new TPU support and challenging you to build multilingual models with English-only training data.
...
Jigsaw's API, Perspective, serves toxicity models and others in a growing set of languages (see our documentation for the full list). Over the past year, the field has seen impressive multilingual capabilities from the latest model innovations, including few- and zero-shot learning. We're excited to learn whether these results "translate" (pun intended!) to toxicity classification. Your training data will be the English data provided for our previous two competitions and your test data will be Wikipedia talk page comments in several different languages.
...
Disclaimer: The dataset for this competition contains text that may be considered profane, vulgar, or offensive.
Importantly, this dataset includes human-generated, human-labelled data in multiple languages.
How do our multilingual techniques compare to a scenario where we actually have human-generated, human-labelled data?
Google has provided about half a million English comments as well as data in other languages.
Currently, Perspective API has production TOXICITY and SEVERE_TOXICITY attributes in the following languages:
English (en)
Spanish (es)
French (fr)
German (de)
Portuguese (pt)
Italian (it)
Russian (ru)
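To see what those production attributes return, a minimal request against the Perspective API looks roughly like the sketch below. The request and response shapes follow the public Perspective API documentation; the API key and example comment are placeholders.

```python
import requests

# Minimal sketch: score one comment with the Perspective API.
# "PERSPECTIVE_API_KEY" is a placeholder for a real key.
API_KEY = "PERSPECTIVE_API_KEY"
URL = ("https://commentanalyzer.googleapis.com/v1alpha1/"
       f"comments:analyze?key={API_KEY}")

payload = {
    "comment": {"text": "Esto es un comentario de prueba."},
    "languages": ["es"],  # one of the supported languages listed above
    "requestedAttributes": {"TOXICITY": {}, "SEVERE_TOXICITY": {}},
}

response = requests.post(URL, json=payload, timeout=10)
response.raise_for_status()
scores = response.json()["attributeScores"]
# Print the summary score for each requested attribute.
print({attr: s["summaryScore"]["value"] for attr, s in scores.items()})
```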
We don't actually have very much data for the languages other than English.
How well does a system trained on English only do on other languages?
How does adding other languages affect the accuracy on English?
What is the effect of machine translation quality?
What is the relationship between training data size and accuracy?
What is the relationship between the number of languages covered and the zero-shot accuracy?
How does all of the above vary across tasks?
Try out the provided notebook in an environment like Google Colab.
- BERT-based
- Small amount of training data
https://colab.research.google.com/drive/1d18u9XHLRCB6LxtSR3mPoFrlx3io8wA5?usp=sharing
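The notebook's exact contents aren't reproduced here, but the two bullets above amount to fine-tuning a BERT-family checkpoint on a small labelled sample. A rough, self-contained sketch of that setup follows; the checkpoint name, toy data, and hyperparameters are illustrative assumptions, not the notebook's exact choices.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Assumption: any BERT-family checkpoint works here; this one is multilingual.
MODEL_NAME = "bert-base-multilingual-cased"

# Tiny illustrative dataset (0 = non-toxic, 1 = toxic).
texts = ["You are wonderful", "I will find you and hurt you",
         "Thanks for the edit", "Shut up, idiot"]
labels = [0, 1, 0, 1]

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

enc = tokenizer(texts, truncation=True, padding=True, return_tensors="pt")
dataset = TensorDataset(enc["input_ids"], enc["attention_mask"], torch.tensor(labels))
loader = DataLoader(dataset, batch_size=2, shuffle=True)

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
for _ in range(2):  # a couple of epochs is enough for a smoke test
    for input_ids, attention_mask, y in loader:
        loss = model(input_ids=input_ids, attention_mask=attention_mask, labels=y).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

# Score a held-out (non-English) comment: probability of the "toxic" class.
model.eval()
with torch.no_grad():
    batch = tokenizer(["Que tengas un buen día"], return_tensors="pt")
    prob_toxic = model(**batch).logits.softmax(-1)[0, 1].item()
print(f"P(toxic) = {prob_toxic:.3f}")
```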
System | Description | Training datasets | Test en | Test it | Test es | Test tr |
---|---|---|---|---|---|---|
English | Translate at inference "Lazy" (evaluate on en, it → en, es → en, tr → en) | en | x% | y% | z% | w% |
English | Do nothing "Zero-shot" | en | (x%) | y% | z% | w% |
Italian Real data | Human benchmark | it | | y% | | |
Multilingual Real data | Human benchmark "$$$" | en, it, es, tr | x% | y% | z% | w% |
Multilingual Synthetic | Translate at training "Eager" | en, en → it, en → es, en → tr | x% | y% | z% | w% |
Multilingual Synthetic | Translate and filter at training | en, en → it filtered, en → es filtered, en → tr filtered | ? | ? | ? | ? |
x → y indicates a dataset in language x that was machine-translated to language y.
Translation at inference to English: it → en
Translation at training from English: en → it
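The difference between the "Lazy" and "Eager" systems in the table is where the translation happens. A minimal sketch of the two strategies, assuming a placeholder `translate(text, src, tgt)` function (any MT system or pre-translated corpus would do) and an English-only `classifier`:

```python
def translate(text: str, src: str, tgt: str) -> str:
    """Placeholder for a machine-translation call or a pre-translated corpus lookup."""
    raise NotImplementedError


def score_lazy(classifier, comment: str, lang: str) -> float:
    """Translate at inference ("Lazy"): train on English only, and translate
    each incoming non-English comment to English before scoring it."""
    english = comment if lang == "en" else translate(comment, src=lang, tgt="en")
    return classifier(english)


def build_eager_training_set(english_rows, target_langs=("it", "es", "tr")):
    """Translate at training ("Eager"): keep the English rows and add a
    machine-translated copy of each row for every target language (en → it, ...)."""
    augmented = list(english_rows)
    for text, label in english_rows:
        for lang in target_langs:
            augmented.append((translate(text, src="en", tgt=lang), label))
    return augmented
```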
- Inspect the false positives and false negatives (see the sketch after this list)
- Inspect the greatest differences between systems (see the sketch after this list)
- Ensemble and fine-tune
- Rebalance
- Scale up the number of languages
- Use more data
- Improve the translation
- Change the filtering threshold
- Train in a production environment
- Serve in a production environment
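For the first two follow-ups, here is a small sketch of pulling out the most confident errors and the largest cross-system disagreements. It assumes each system's predictions live in a pandas DataFrame with hypothetical columns `comment_text`, `label` (0/1 human judgement), and `score` (predicted toxicity probability); the column names and threshold are assumptions, not part of the lab data.

```python
import pandas as pd

def worst_errors(df: pd.DataFrame, threshold: float = 0.5, k: int = 20):
    """Return the k most confident false positives and false negatives."""
    pred_toxic = df["score"] >= threshold
    false_pos = df[pred_toxic & (df["label"] == 0)].nlargest(k, "score")
    false_neg = df[~pred_toxic & (df["label"] == 1)].nsmallest(k, "score")
    return false_pos, false_neg

def biggest_disagreements(df_a: pd.DataFrame, df_b: pd.DataFrame, k: int = 20):
    """Comments where two systems' scores differ most (rows assumed aligned by index)."""
    diff = (df_a["score"] - df_b["score"]).abs()
    return df_a.assign(score_b=df_b["score"], diff=diff).nlargest(k, "diff")
```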