
Creating a diarization broadcast corpus

judyfong edited this page Jun 17, 2020 · 10 revisions

Requirements

  • Gecko
  • rttm files
  • corresponding videos of episode

Tips

Label speaker turns which last at least 60 ms. (CHANGED)
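As a rough aid for checking this threshold, the sketch below flags turns in an rttm file that are shorter than 60 ms. It assumes the standard NIST RTTM column order (onset in the fourth field, duration in the fifth, speaker label in the eighth) and speaker labels without spaces. The function name and the sample rows are made up for illustration and are not part of Gecko or this project's scripts.

```python
MIN_TURN_SECONDS = 0.060  # 60 ms, the minimum turn length given in the tip


def short_turns(rttm_lines, min_dur=MIN_TURN_SECONDS):
    """Return (onset, duration, speaker) for SPEAKER rows shorter than min_dur."""
    flagged = []
    for line in rttm_lines:
        fields = line.split()
        # Skip blank lines and anything that is not a SPEAKER row.
        if len(fields) < 8 or fields[0] != "SPEAKER":
            continue
        onset = float(fields[3])
        duration = float(fields[4])
        speaker = fields[7]
        if duration < min_dur:
            flagged.append((onset, duration, speaker))
    return flagged


# Hypothetical rows, only to show the expected layout.
example = [
    "SPEAKER ep1 1 0.000 2.500 <NA> <NA> 1 <NA> <NA>",
    "SPEAKER ep1 1 2.500 0.040 <NA> <NA> 2 <NA> <NA>",
]
flagged = short_turns(example)
print(flagged)
```

Only the second row is flagged, since its duration of 0.040 s is below the threshold.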

Each speaker gets their own speaker number per recording/episode.

Unknown speakers are labelled Unknown 01, Unknown 02, etc.

There are at least two ways to create the csv file.

  1. Follow Aríel's video, My Movie.mp4, in which he uses VSCode with the json2csv extension and does some formatting.
  2. Add all the speakers to one segment in Gecko, copy the resulting list to create the initial list for the csv file, then remove the speakers from the segment again.

Process

  1. Generate the proposed rttm files for 28 episodes that week.
  2. Labelling - Gecko
    1. Open Gecko. If you use the Gecko version linked here, you can save partially corrected files and reload them into the editor to edit later.
    2. Upload the video file and the rttm file.
    3. Adjust the segment start and end times to match speaker turns.
    4. Add missing speaker turns.
    5. Correct speaker labels/numbers. Add new ones if necessary.
    6. Write down the full speaker names which correspond to each speaker number. These go in a csv file.
    7. Label music, foreign language, or noise. They're available as default labels.
    8. Segments which are only silence can be deleted.
    9. Review the segments in case you missed anything or added tiny segments.
    10. Export as json, srt, and rttm.
  3. Turn in the csv, json, srt, and corrected rttm files to the relevant folders. Then get new rttm and video files.
  4. Repeat for a new episode.
  5. Judy reports the new DER with that week's data. When it is under 10%, this project is done.
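For reference, the sketch below shows how the diarization error rate (DER) is commonly defined: missed speech, false-alarm speech, and speaker confusion, summed and divided by the total reference speech time. This page does not say which tool or settings are used for the weekly report, so the function and the numbers are purely illustrative.

```python
def diarization_error_rate(missed, false_alarm, confusion, total_speech):
    """Common DER definition; all arguments are durations in seconds."""
    if total_speech <= 0:
        raise ValueError("total_speech must be positive")
    return (missed + false_alarm + confusion) / total_speech


# Made-up durations, only to illustrate the arithmetic and the 10% threshold.
der = diarization_error_rate(
    missed=30.0, false_alarm=20.0, confusion=40.0, total_speech=1000.0
)
print(f"DER = {der:.1%}, under 10%: {der < 0.10}")
```

With these example durations the error time is 90 s out of 1000 s of speech, so the DER is 9% and falls under the threshold.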

reco2spk_num2spk_name.csv

format

<recording/episode id>, <speaker_number in rttm file>, <speaker name>

example

Fréttirkl1900-5022010T0,1, Bogi Águstsson
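A minimal sketch of reading this file, assuming it has no header row and exactly three comma-separated fields per line as in the format above. The function name is hypothetical. Spaces after commas are tolerated, since the example row has one before the speaker name.

```python
import csv
import io


def load_speaker_names(csv_file):
    """Map (recording id, speaker number) to speaker name."""
    names = {}
    for row in csv.reader(csv_file, skipinitialspace=True):
        if len(row) != 3:
            continue  # skip blank or malformed rows
        recording, number, name = (field.strip() for field in row)
        names[(recording, number)] = name
    return names


# The example row from this page, read from an in-memory file.
sample = io.StringIO("Fréttirkl1900-5022010T0,1, Bogi Águstsson\n")
names = load_speaker_names(sample)
print(names[("Fréttirkl1900-5022010T0", "1")])
```

For a file on disk, open it with `open(path, newline="", encoding="utf-8")` so that the Icelandic characters are read correctly. Speaker numbers are kept as strings so they can be compared directly with the labels in the rttm file.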
