Small dataset subsampling for speech commands #74

Open
turian opened this issue Sep 9, 2021 · 0 comments

turian commented Sep 9, 2021

After subsampling, we end up with an odd number of samples in each split:

  • test: 96
  • train: 56
  • valid: 132

This is caused by the background_noise subsampling in tasks/sampler.py. In speech commands, all the background noise samples (which are labelled as silence) are delivered as longer audio files that are expected to be sliced up into smaller chunks. When we subsample this dataset, only one background noise file (running_tap.wav) is included, and it happens to be in the validation set. Once that file is sliced, the validation set consists almost exclusively of silence samples.
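
To make the mechanism concrete, here is a minimal sketch of how one long file can dominate a split once it is sliced. It is not the code in tasks/sampler.py: the function name, the tuple layout, the one-second chunk length, and the 60-second duration are all assumptions chosen for illustration.

```python
from collections import Counter


def count_samples_after_slicing(files, chunk_seconds=1.0):
    """Count samples per (split, label) after long clips are sliced.

    `files` is a list of (split, label, duration_seconds) tuples.
    This layout is hypothetical and only serves the illustration.
    """
    counts = Counter()
    for split, label, duration in files:
        # A clip no longer than one chunk counts as a single sample;
        # a longer clip contributes one sample per full chunk.
        n_chunks = max(1, int(duration // chunk_seconds))
        counts[(split, label)] += n_chunks
    return counts


# Hypothetical subsample: a few short command clips, plus a single
# long background-noise file that landed in the validation split.
subsample = [
    ("train", "yes", 1.0),
    ("train", "no", 1.0),
    ("test", "yes", 1.0),
    ("valid", "no", 1.0),
    ("valid", "silence", 60.0),  # assumed duration, for illustration only
]

counts = count_samples_after_slicing(subsample)
for (split, label), n in sorted(counts.items()):
    print(split, label, n)
```

In this toy setup the single background-noise file yields far more samples than all the command clips in the same split combined, which matches the skew described above.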
