For this lab, we'll train models for binary classification on a multilingual dataset of Wikipedia comments labelled for toxicity by humans.
The Conversation AI team, a research initiative founded by Jigsaw and Google, builds technology to protect voices in conversation.
...
This year, we're taking advantage of Kaggle's new TPU support and challenging you to build multilingual models with English-only training data.
...
Jigsaw's API, Perspective, serves toxicity models and others in a growing set of languages (see our documentation for the full list). Over the past year, the field has seen impressive multilingual capabilities from the latest model innovations, including few- and zero-shot learning. We're excited to learn whether these results "translate" (pun intended!) to toxicity classification. Your training data will be the English data provided for our previous two competitions and your test data will be Wikipedia talk page comments in several different languages.
...
Disclaimer: The dataset for this competition contains text that may be considered profane, vulgar, or offensive.
Importantly, this dataset includes human-generated, human-labelled data in multiple languages.
How do our multilingual techniques compare to a scenario where we actually have human-generated, human-labelled data?
Google has provided about half a million English comments as well as data in other languages.
Currently, Perspective API has production TOXICITY and SEVERE_TOXICITY attributes in the following languages:
English (en)
Spanish (es)
French (fr)
German (de)
Portuguese (pt)
Italian (it)
Russian (ru)
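To see what those production attributes return, a minimal request against the Perspective API looks roughly like the sketch below. The request and response shapes follow the public Perspective API documentation; the API key and example comment are placeholders.

```python
import requests

# Minimal sketch: score one comment with the Perspective API.
# "PERSPECTIVE_API_KEY" is a placeholder for a real key.
API_KEY = "PERSPECTIVE_API_KEY"
URL = ("https://commentanalyzer.googleapis.com/v1alpha1/"
       f"comments:analyze?key={API_KEY}")

payload = {
    "comment": {"text": "Esto es un comentario de prueba."},
    "languages": ["es"],  # one of the supported languages listed above
    "requestedAttributes": {"TOXICITY": {}, "SEVERE_TOXICITY": {}},
}

response = requests.post(URL, json=payload, timeout=10)
response.raise_for_status()
scores = response.json()["attributeScores"]
# Print the summary score for each requested attribute.
print({attr: s["summaryScore"]["value"] for attr, s in scores.items()})
```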
We don't actually have very much data for the languages other than English.
How well does a system trained on English only do on other languages?
How does adding other languages affect the accuracy on English?
What is the effect of machine translation quality?
What is the relationship between training data size and accuracy?
What is the relationship between the number of languages covered and the zero-shot accuracy?
How does all of the above vary across tasks?
Try out the provided notebook in an environment like Google Colab.
- BERT-based
- Small amount of training data
https://colab.research.google.com/drive/1d18u9XHLRCB6LxtSR3mPoFrlx3io8wA5?usp=sharing
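The notebook's exact contents aren't reproduced here, but the two bullets above amount to fine-tuning a BERT-family checkpoint on a small labelled sample. A rough, self-contained sketch of that setup follows; the checkpoint name, toy data, and hyperparameters are illustrative assumptions, not the notebook's exact choices.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Assumption: any BERT-family checkpoint works here; this one is multilingual.
MODEL_NAME = "bert-base-multilingual-cased"

# Tiny illustrative dataset (0 = non-toxic, 1 = toxic).
texts = ["You are wonderful", "I will find you and hurt you",
         "Thanks for the edit", "Shut up, idiot"]
labels = [0, 1, 0, 1]

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

enc = tokenizer(texts, truncation=True, padding=True, return_tensors="pt")
dataset = TensorDataset(enc["input_ids"], enc["attention_mask"], torch.tensor(labels))
loader = DataLoader(dataset, batch_size=2, shuffle=True)

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
for _ in range(2):  # a couple of epochs is enough for a smoke test
    for input_ids, attention_mask, y in loader:
        loss = model(input_ids=input_ids, attention_mask=attention_mask, labels=y).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

# Score a held-out (non-English) comment: probability of the "toxic" class.
model.eval()
with torch.no_grad():
    batch = tokenizer(["Que tengas un buen día"], return_tensors="pt")
    prob_toxic = model(**batch).logits.softmax(-1)[0, 1].item()
print(f"P(toxic) = {prob_toxic:.3f}")
```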
System | Description | Training datasets | Test en | Test it | Test es | Test tr |
---|---|---|---|---|---|---|
English | Translate at inference "Lazy" (evaluate on en, it → en, es → en, tr → en) | en | x% | y% | z% | w% |
English | Do nothing "Zero-shot" | en | (x%) | y% | z% | w% |
Italian Real data | Human benchmark | it | | y% | | |
Multilingual Real data | Human benchmark "$$$" | en, it, es, tr | x% | y% | z% | w% |
Multilingual Synthetic | Translate at training "Eager" | en, en → it, en → es, en → tr | x% | y% | z% | w% |
Multilingual Synthetic | Translate and filter at training | en, en → it filtered, en → es filtered, en → tr filtered | ? | ? | ? | ? |
x → y indicates a dataset in language x that was machine-translated to language y.
Translation at inference to English: it → en
Translation at training from English: en → it
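The difference between the "Lazy" and "Eager" systems in the table is where the translation happens. A minimal sketch of the two strategies, assuming a placeholder `translate(text, src, tgt)` function (any MT system or pre-translated corpus would do) and an English-only `classifier`:

```python
def translate(text: str, src: str, tgt: str) -> str:
    """Placeholder for a machine-translation call or a pre-translated corpus lookup."""
    raise NotImplementedError


def score_lazy(classifier, comment: str, lang: str) -> float:
    """Translate at inference ("Lazy"): train on English only, and translate
    each incoming non-English comment to English before scoring it."""
    english = comment if lang == "en" else translate(comment, src=lang, tgt="en")
    return classifier(english)


def build_eager_training_set(english_rows, target_langs=("it", "es", "tr")):
    """Translate at training ("Eager"): keep the English rows and add a
    machine-translated copy of each row for every target language (en → it, ...)."""
    augmented = list(english_rows)
    for text, label in english_rows:
        for lang in target_langs:
            augmented.append((translate(text, src="en", tgt=lang), label))
    return augmented
```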
- Inspect the false positives and false negatives (see the sketch after this list)
- Inspect the greatest differences between systems (see the sketch after this list)
- Ensemble and fine-tune
- Rebalance
- Scale up the number of languages
- Use more data
- Improve the translation
- Change the filtering threshold
- Train in a production environment
- Serve in a production environment
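For the first two follow-ups, here is a small sketch of pulling out the most confident errors and the largest cross-system disagreements. It assumes each system's predictions live in a pandas DataFrame with hypothetical columns `comment_text`, `label` (0/1 human judgement), and `score` (predicted toxicity probability); the column names and threshold are assumptions, not part of the lab data.

```python
import pandas as pd

def worst_errors(df: pd.DataFrame, threshold: float = 0.5, k: int = 20):
    """Return the k most confident false positives and false negatives."""
    pred_toxic = df["score"] >= threshold
    false_pos = df[pred_toxic & (df["label"] == 0)].nlargest(k, "score")
    false_neg = df[~pred_toxic & (df["label"] == 1)].nsmallest(k, "score")
    return false_pos, false_neg

def biggest_disagreements(df_a: pd.DataFrame, df_b: pd.DataFrame, k: int = 20):
    """Comments where two systems' scores differ most (rows assumed aligned by index)."""
    diff = (df_a["score"] - df_b["score"]).abs()
    return df_a.assign(score_b=df_b["score"], diff=diff).nlargest(k, "diff")
```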