Update notebooks to no longer rely on JW300 #199

Open
cdleong opened this issue Oct 21, 2021 · 2 comments

cdleong commented Oct 21, 2021

Edit: see #200. Maybe we should leave the old JW300 notebooks up and create new ones instead.

The problem

JW300 has been taken down for copyright reasons. At least the following notebooks all rely on it:

https://github.com/masakhane-io/masakhane-mt/blob/master/starter_notebook_from_English_training.ipynb
https://github.com/masakhane-io/masakhane-mt/blob/master/starter_notebook_gdrive_from_English.ipynb
https://github.com/masakhane-io/masakhane-mt/blob/master/starter_notebook_into_English_training.ipynb

A solution (but see #200)

These notebooks need to be updated so they no longer use this dataset. Perhaps we could use Tatoeba or FLORES-101, or one of the other machine translation datasets listed at https://huggingface.co/datasets?task_ids=task_ids:machine-translation&sort=downloads

cdleong commented Oct 21, 2021

Steps that need to be done:

  • (Optional) Assign yourself under "Assignees" on the right.
  • Try running the notebooks in Google Colab.
  • See where they break.
  • Edit the notebook to swap in another dataset, perhaps by loading a HuggingFace dataset and then writing it back out in a format JoeyNMT knows how to use (for example, creating a train.en and a train.xh file).
  • Fork the masakhane-mt repo: https://docs.github.com/en/get-started/quickstart/fork-a-repo
  • Swap in your updated notebook.
  • Open a pull request so that everyone can use the updated notebook.
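
As a rough illustration of the "swap in another dataset" step, here is a minimal sketch of writing a parallel corpus out as train.en and train.xh, one sentence per line. It assumes each example looks like `{"translation": {"en": ..., "xh": ...}}`, which is how many translation datasets on the HuggingFace hub are laid out, but the dataset you pick may differ. The function name `write_parallel_files` and the `data/enxh` output directory are made up for this example; they are not part of the notebooks or of JoeyNMT.

```python
import tempfile
from pathlib import Path


def write_parallel_files(examples, src_lang, tgt_lang, out_dir, split="train"):
    """Write aligned one-sentence-per-line files, e.g. train.en and train.xh.

    `examples` is any iterable of dicts shaped like
    {"translation": {src_lang: "...", tgt_lang: "..."}} (an assumption about
    the dataset layout; adjust the lookup if your dataset is shaped differently).
    Returns (source_path, target_path, number_of_pairs_written).
    """
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    src_file = out_path / f"{split}.{src_lang}"
    tgt_file = out_path / f"{split}.{tgt_lang}"

    written = 0
    with src_file.open("w", encoding="utf-8") as src, \
            tgt_file.open("w", encoding="utf-8") as tgt:
        for example in examples:
            pair = example["translation"]
            # Collapse internal newlines and extra spaces so that each
            # sentence stays on exactly one line and the files stay aligned.
            src_text = " ".join(pair[src_lang].split())
            tgt_text = " ".join(pair[tgt_lang].split())
            # Skip pairs where either side is empty.
            if not src_text or not tgt_text:
                continue
            src.write(src_text + "\n")
            tgt.write(tgt_text + "\n")
            written += 1
    return src_file, tgt_file, written


if __name__ == "__main__":
    # Tiny in-memory stand-in for a real dataset. With the `datasets` library
    # installed, the examples could instead come from something like
    #   from datasets import load_dataset
    #   examples = load_dataset("<dataset_name>", "<config>", split="train")
    # where the dataset name and config are placeholders you need to fill in.
    demo = [{"translation": {"en": "Hello world", "xh": "Molo mhlaba"}}]
    with tempfile.TemporaryDirectory() as tmp:
        src, tgt, n = write_parallel_files(demo, "en", "xh", Path(tmp) / "data" / "enxh")
        print(f"wrote {n} sentence pair(s) to {src.name} and {tgt.name}")
```

The notebook's JoeyNMT config would then need to point at the new files instead of the JW300 ones. The same function can be called again with `split="dev"` and `split="test"` if the chosen dataset provides those splits.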

cdleong commented Oct 21, 2021

So for example, this section breaks because JW300 is no longer downloadable:
[image: the notebook section that fails]
