Lab

For this lab, we'll train models for binary classification on a multilingual dataset of Wikipedia comments labelled for toxicity by humans.

Use TPUs to identify toxic comments across multiple languages

The Conversation AI team, a research initiative founded by Jigsaw and Google, builds technology to protect voices in conversation.
...
This year, we're taking advantage of Kaggle's new TPU support and challenging you to build multilingual models with English-only training data.
...
Jigsaw's API, Perspective, serves toxicity models and others in a growing set of languages (see our documentation for the full list). Over the past year, the field has seen impressive multilingual capabilities from the latest model innovations, including few- and zero-shot learning. We're excited to learn whether these results "translate" (pun intended!) to toxicity classification. Your training data will be the English data provided for our previous two competitions and your test data will be Wikipedia talk page comments in several different languages.
...
Disclaimer: The dataset for this competition contains text that may be considered profane, vulgar, or offensive.

Importantly, this dataset includes human-generated, human-labelled data in multiple languages.

How do our multilingual techniques compare to a scenario where we actually have human-generated, human-labelled data?


Datasets

Google has provided about half a million English comments as well as data in other languages.

Currently, Perspective API has production TOXICITY and SEVERE_TOXICITY attributes in the following languages:
English (en)
Spanish (es)
French (fr)
German (de)
Portuguese (pt)
Italian (it)
Russian (ru)

We have relatively little labelled data for languages other than English.
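
The sketch below shows one way to load the data and check how little non-English labelled data there is. The file and column names (jigsaw-toxic-comment-train.csv, validation.csv, comment_text, toxic, lang) are assumptions based on the Kaggle competition's usual layout; adjust them to the files you actually download.

```python
import pandas as pd

# Assumed file/column names from the Kaggle "Jigsaw Multilingual Toxic Comment
# Classification" data; adjust to whatever you actually downloaded.
train = pd.read_csv("jigsaw-toxic-comment-train.csv")   # English-only training comments
valid = pd.read_csv("validation.csv")                   # small labelled set in several languages
test = pd.read_csv("test.csv")                          # unlabelled test comments

print(len(train), "English training comments")
print(train[["comment_text", "toxic"]].head())
print(valid["lang"].value_counts())   # how much labelled data each non-English language gets
```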

Questions

How well does a system trained on English only do on other languages?

How does adding other languages affect the accuracy on English?

What is the effect of machine translation quality?

What is the relationship between training data size and accuracy? (A simple size sweep is sketched after this list.)

What is the relationship between the number of languages covered and the zero-shot accuracy?

How does all of the above vary across tasks?
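
One way to probe the data-size question is a quick learning-curve sweep. The sketch below uses a TF-IDF + logistic regression stand-in rather than the real model, and assumes the train DataFrame with comment_text and toxic columns from the loading sketch above; swap in your actual classifier for real numbers.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# train: DataFrame with "comment_text" and "toxic" columns (assumption).
tr, dev = train_test_split(train, test_size=0.1, random_state=0)
dev_y = (dev["toxic"] >= 0.5).astype(int)

for n in [1_000, 10_000, 100_000, len(tr)]:
    sample = tr.sample(n=min(n, len(tr)), random_state=0)
    vec = TfidfVectorizer(max_features=50_000).fit(sample["comment_text"])
    clf = LogisticRegression(max_iter=1000).fit(
        vec.transform(sample["comment_text"]),
        (sample["toxic"] >= 0.5).astype(int),
    )
    probs = clf.predict_proba(vec.transform(dev["comment_text"]))[:, 1]
    print(f"{n:>7} training examples -> dev ROC AUC {roc_auc_score(dev_y, probs):.3f}")
```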


Experiments

Try out the provided notebook in an environment like Google Colab.

  • BERT-based
  • small amount of training data

https://colab.research.google.com/drive/1d18u9XHLRCB6LxtSR3mPoFrlx3io8wA5?usp=sharing
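
If you want to see roughly what such a notebook does without opening Colab, the sketch below fine-tunes a multilingual BERT checkpoint on a small English sample using Hugging Face transformers and plain PyTorch. The model name, sample size, and hyperparameters are assumptions rather than what the notebook itself uses, and it again relies on the train DataFrame from the Datasets section.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# A multilingual encoder so the English-trained model can later be applied
# zero-shot to other languages (assumption: the notebook uses something comparable).
MODEL = "bert-base-multilingual-cased"
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=2)

# Small English training sample (assumed columns: comment_text, toxic).
texts = train["comment_text"].sample(2000, random_state=0)
labels = (train.loc[texts.index, "toxic"] >= 0.5).astype(int).values

enc = tok(list(texts), truncation=True, padding=True, max_length=128, return_tensors="pt")
ds = TensorDataset(enc["input_ids"], enc["attention_mask"], torch.tensor(labels))
loader = DataLoader(ds, batch_size=32, shuffle=True)

optim = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
for epoch in range(2):
    for input_ids, attention_mask, y in loader:
        out = model(input_ids=input_ids, attention_mask=attention_mask, labels=y)
        out.loss.backward()
        optim.step()
        optim.zero_grad()
```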


Evaluation and results

| System | Description | Training datasets | en | it | es | tr |
| --- | --- | --- | --- | --- | --- | --- |
| English | Translate at inference ("Lazy"): it, es, tr test sets machine-translated to English (it→en, es→en, tr→en) | en | x% | y% | z% | w% |
| English | Do nothing ("Zero-shot"): test sets used as-is (en, it, es, tr) | en | (x%) | y% | z% | w% |
| Italian | Real data, human benchmark | it | | y% | | |
| Multilingual | Real data, human benchmark ("$$$") | en, it, es, tr | x% | y% | z% | w% |
| Multilingual | Synthetic, translate at training ("Eager") | en, en→it, en→es, en→tr | x% | y% | z% | w% |
| Multilingual | Synthetic, translate and filter at training | en, en→it filtered, en→es filtered, en→tr filtered | ? | ? | ? | ? |

x→y indicates a dataset in language x that was machine-translated to language y.
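
To fill in the per-language cells, one reasonable metric is ROC AUC computed separately for each language on the labelled validation set. A sketch, assuming the valid DataFrame with comment_text, lang and toxic columns and a predict_proba(texts) callable for whichever system is being scored (both names are assumptions):

```python
from sklearn.metrics import roc_auc_score

def per_language_auc(df, predict_proba):
    """ROC AUC per language; df needs comment_text, lang and toxic columns."""
    scores = {}
    for lang, group in df.groupby("lang"):
        probs = predict_proba(group["comment_text"].tolist())
        scores[lang] = roc_auc_score((group["toxic"] >= 0.5).astype(int), probs)
    return scores

# One table row per system, e.g. (hypothetical system object):
# print(per_language_auc(valid, zero_shot_system.predict_proba))
```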

Translation and filtering

Translation at inference to English:

  • it→en
  • es→en
  • tr→en

Translation at training from English:

  • en→it
  • en→es
  • en→tr
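
A sketch of the training-time ("Eager") direction, including a crude filter: translate each English comment into the target language and drop outputs whose length ratio to the source looks suspicious. The translate helper is a hypothetical placeholder for whatever MT system you use, and the length-ratio check is just one simple stand-in for a real quality score.

```python
import pandas as pd

def translate(text: str, src: str, dest: str) -> str:
    """Hypothetical placeholder: plug in whatever MT system you have access to
    (a cloud translation API, MarianMT, etc.)."""
    raise NotImplementedError

def build_translated_train(train_df, dest_lang, min_ratio=0.5, max_ratio=2.0):
    """Build an en→dest_lang synthetic training set, keeping only translations
    whose length ratio to the source falls in a sane range (crude quality filter)."""
    rows = []
    for _, row in train_df.iterrows():
        source = row["comment_text"]
        output = translate(source, src="en", dest=dest_lang)
        ratio = len(output) / max(len(source), 1)
        if min_ratio <= ratio <= max_ratio:   # drop suspicious translations
            rows.append({"comment_text": output, "toxic": row["toxic"], "lang": dest_lang})
    return pd.DataFrame(rows)

# Filtered en→it / en→es / en→tr training sets for the last two table rows:
# train_it = build_translated_train(train, "it")
```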


Advanced

  • Inspect the false positives and false negatives (see the sketch after this list)
  • Inspect the greatest differences between systems
  • Ensemble and fine-tune
  • Rebalance
  • Scale up the number of languages
  • Use more data
  • Improve the translation
  • Change the filtering threshold
  • Train in a production environment
  • Serve in a production environment
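
A sketch for the first two items, assuming the valid DataFrame plus per-system probability columns such as p_lazy and p_zero_shot (hypothetical column names you would add after scoring each system):

```python
def worst_errors(df, prob_col, threshold=0.5, k=10):
    """Most confident false positives and false negatives for one system."""
    label = (df["toxic"] >= 0.5).astype(int)
    pred = (df[prob_col] >= threshold).astype(int)
    fp = df[(pred == 1) & (label == 0)].nlargest(k, prob_col)    # confidently wrong "toxic" calls
    fn = df[(pred == 0) & (label == 1)].nsmallest(k, prob_col)   # confidently missed toxic comments
    return fp, fn

# Largest score disagreements between two systems (hypothetical columns):
# valid["gap"] = (valid["p_lazy"] - valid["p_zero_shot"]).abs()
# print(valid.nlargest(10, "gap")[["lang", "comment_text", "toxic", "p_lazy", "p_zero_shot"]])
```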