Earnings-22: Full dataset, excluding media files (#19)
pique0822 authored Apr 4, 2022
1 parent 4bbcd88 commit ad755cd
Showing 256 changed files with 1,010,200 additions and 0 deletions.
1 change: 1 addition & 0 deletions README.md
@@ -12,6 +12,7 @@ In each dataset, the most up-to-date version of the dataset will always be in th
| Dataset | Description |
| ------- | ----------- |
|`earnings21` | This dataset contains 44 files totalling roughly 39 hours of earnings calls from the year 2020. This dataset provides the full audios, the transcripts, and accompanying metadata such as speaker labels, punctuation, and entity tags. |
|`earnings22` | This dataset contains 125 files totalling roughly 119 hours of English-language earnings calls collected from global companies. This dataset provides the full audios, transcripts, and accompanying metadata such as ticker symbol, headquarters country, and our defined "Language Region". |

# How to Check Out Only a Single Dataset

4 changes: 4 additions & 0 deletions earnings22/LICENSE.md
@@ -0,0 +1,4 @@
The transcripts and associated text files that are used for alignment in this directory are licensed under a
[Creative Commons Attribution-ShareAlike 4.0 International][cc-by-sa] license.

[cc-by-sa]: https://creativecommons.org/licenses/by-sa/4.0/
66 changes: 66 additions & 0 deletions earnings22/README.md
@@ -0,0 +1,66 @@
[![License: CC BY-SA 4.0](https://img.shields.io/badge/License-CC%20BY--SA%204.0-lightgrey.svg)](LICENSE.md)

# Earnings 22

The Earnings 22 dataset (also referred to as earnings22) is a 119-hour corpus of English-language earnings calls collected from global companies. Its primary purpose is to serve as a benchmark for industrial and academic automatic speech recognition (ASR) models on real-world accented speech.

This work has been submitted for publication at Interspeech 2022.

# Table of Contents

* [File Format Overview](#file-format-overview)
  + [nlp Files](#nlp-files)
    - [Example](#example-nlp-file)
* [Results](#results)
* [WER Calculation](#wer-calculation)
* [Cite this Dataset](#cite-this-dataset)

# File Format Overview
This section gives an overview of the file formats provided with this dataset.

## nlp Files
NLP files are `.csv`-inspired, pipe-separated text files that contain the tokens of a transcript and their metadata. Each line of a file represents a single transcript token and the metadata associated with it.

|Column Title|Description|
|--|--|
| Column 1: `token` | A single token in the transcript. These are typically single words or multiple words joined by hyphens. |
| Column 2: `speaker` | A unique integer that associates this token with a specific speaker in the audio. |
| Column 3: `ts` | A float representing the start time of the token, in seconds. |
| Column 4: `endTs` | A float representing the end time of the token, in seconds. |
| Column 5: `punctuation` | A punctuation character that is appended to the end of the token when reconstructing the transcript. Example punctuation: `",", ";", ".", "!"`. |
| Column 6: `case` | A two-letter code denoting which of four possible casings applies to this token: <ul><li>`UC` - Denotes a token that has the first character in uppercase and every other character lowercase.</li><li>`LC` - Denotes a token that has every character in lowercase.</li><li>`CA` - Denotes a token that has every character in uppercase.</li><li>`MC` - Denotes a token that doesn’t follow the previous rules. This is the case when upper- and lowercase characters are mixed throughout the token.</li></ul> |
| Column 7: `tags` | Displays one of the entity tags listed in `wer_tags`, in the long form `ID:ENTITY_CLASS`. |
| Column 8: `wer_tags` | A list of entity tags that are associated with this token. Only entity IDs should be present in this field. The specific ENTITY_CLASS for each ID can be extracted from an accompanying wer_tags sidecar JSON. |
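
The `punctuation` column is described as being used to rebuild the transcript text. A minimal sketch of that step is shown here. It assumes the punctuation is attached directly to its token and that tokens are separated by single spaces. The function name `reconstruct_transcript` and the sample punctuation values are hypothetical and are not taken from the dataset.

```python
from typing import Iterable, Tuple


def reconstruct_transcript(rows: Iterable[Tuple[str, str]]) -> str:
    """Join (token, punctuation) pairs into one line of text.

    Assumption: punctuation attaches directly to its token and
    tokens are separated by a single space.
    """
    return " ".join(token + punctuation for token, punctuation in rows)


# Hypothetical rows: only the token and punctuation columns are needed here.
rows = [("Good", ""), ("morning", ","), ("and", ""), ("welcome", ".")]
print(reconstruct_transcript(rows))  # Good morning, and welcome.
```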

_**Note that each entity ID is unique to that specific entity. An entity can consist of one or more tokens. Within a file there can be several entities of the same ENTITY_CLASS, but only one entity can be labeled with any given ID.**_
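
Because an entity can span several tokens, one way to recover it is to collect every token whose `wer_tags` list contains a given ID. The sketch below assumes the `wer_tags` column has already been parsed into a list of ID strings. The helper name `tokens_by_entity` is hypothetical, and the sample rows mirror the example file in this README.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def tokens_by_entity(rows: Iterable[Tuple[str, List[str]]]) -> Dict[str, List[str]]:
    """Map each entity ID to the tokens labeled with it, in file order."""
    entities: Dict[str, List[str]] = defaultdict(list)
    for token, entity_ids in rows:
        # A token may belong to several entities at once.
        for entity_id in entity_ids:
            entities[entity_id].append(token)
    return dict(entities)


rows = [
    ("the", ["6"]),
    ("first", ["6"]),
    ("quarter", ["6"]),
    ("2020", ["0", "1", "6"]),
    ("NexGEn", ["7"]),
]
for entity_id, tokens in tokens_by_entity(rows).items():
    print(entity_id, " ".join(tokens))
```

In this sample, ID `6` collects the four tokens "the first quarter 2020", while IDs `0` and `1` each cover only the token "2020".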


### Example nlp File
`example.nlp`

```
token|speaker|ts|endTs|punctuation|case|tags|wer_tags
Good|0||||UC|[]|[]
morning|0||||LC|['5:TIME']|['5']
and|0||||LC|[]|[]
welcome|0||||LC|[]|[]
to|0||||LC|[]|[]
the|0||||LC|['6:DATE']|['6']
first|0||||LC|['6:DATE']|['6']
quarter|0||||LC|['6:DATE']|['6']
2020|0||||CA|['0:YEAR']|['0', '1', '6']
NexGEn|0||||MC|['7:ORG']|['7']
```
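
A file in this format can be read with Python's standard library. The following is a minimal sketch, not an official loader. It assumes that tokens never contain the `|` character and that the `tags` and `wer_tags` columns are always Python-style list literals, as in the example above. The function name `parse_nlp` is hypothetical.

```python
import ast
import csv
import io
from typing import Dict, Iterable, List


def parse_nlp(lines: Iterable[str]) -> List[Dict[str, object]]:
    """Parse pipe-separated nlp lines into one dict per token."""
    # QUOTE_NONE keeps quote characters inside tokens and tag lists untouched.
    reader = csv.DictReader(lines, delimiter="|", quoting=csv.QUOTE_NONE)
    rows: List[Dict[str, object]] = []
    for raw in reader:
        row: Dict[str, object] = dict(raw)
        # Assumption: both tag columns hold list literals such as ['6:DATE'].
        row["tags"] = ast.literal_eval(raw["tags"])
        row["wer_tags"] = ast.literal_eval(raw["wer_tags"])
        rows.append(row)
    return rows


SAMPLE = """token|speaker|ts|endTs|punctuation|case|tags|wer_tags
Good|0||||UC|[]|[]
2020|0||||CA|['0:YEAR']|['0', '1', '6']
"""

rows = parse_nlp(io.StringIO(SAMPLE))
print(rows[1]["token"], rows[1]["tags"], rows[1]["wer_tags"])
```

Empty fields such as `ts` and `endTs` are left as empty strings. Converting them to floats is left to the caller, since the example file does not show populated timestamps.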

# Results
The tables from the paper, along with the WER for every entity class, can be found in the `transcripts` directory.

# WER Calculation
All of our analysis on this dataset is done through the use of our newly released [fstalign](https://github.com/revdotcom/fstalign/tree/master) tool. We strongly recommend the use of this tool to quickly get started using the *Earnings-22* dataset.
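
For readers unfamiliar with the metric, the sketch below illustrates plain word error rate as word-level edit distance divided by the reference length. It is not fstalign and does not reproduce its alignment, normalization, or entity-level scoring. The function name `word_error_rate` is hypothetical, and scores computed this way may differ from those reported with fstalign.

```python
from typing import List


def word_error_rate(reference: List[str], hypothesis: List[str]) -> float:
    """Return (substitutions + deletions + insertions) / len(reference)."""
    if not reference:
        raise ValueError("reference must contain at least one token")
    # previous[j] holds the edit distance between the reference prefix
    # processed so far and hypothesis[:j].
    previous = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        current = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(reference)


ref = "good morning and welcome".split()
hyp = "good morning welcome".split()
print(word_error_rate(ref, hyp))  # 0.25 (one deletion over four reference words)
```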

# Cite this Dataset
This dataset has been submitted to Interspeech 2022.
The paper describing our methods and results can be found on arXiv at <ARXIV LINK>
```
ARXIV PATH
```
Git LFS file not shown
3 changes: 3 additions & 0 deletions earnings22/media/2020-Annual-Results-Call-Recording.mp3
Git LFS file not shown
3 changes: 3 additions & 0 deletions earnings22/media/4329526.mp3
Git LFS file not shown
3 changes: 3 additions & 0 deletions earnings22/media/4351517.mp3
Git LFS file not shown
3 changes: 3 additions & 0 deletions earnings22/media/4372696.mp3
Git LFS file not shown
3 changes: 3 additions & 0 deletions earnings22/media/4420696.mp3
Git LFS file not shown
3 changes: 3 additions & 0 deletions earnings22/media/4423872.mp3
Git LFS file not shown
3 changes: 3 additions & 0 deletions earnings22/media/4426736.mp3
Git LFS file not shown
3 changes: 3 additions & 0 deletions earnings22/media/4430051.mp3
Git LFS file not shown
3 changes: 3 additions & 0 deletions earnings22/media/4432298.mp3
Git LFS file not shown
3 changes: 3 additions & 0 deletions earnings22/media/4443871.mp3
Git LFS file not shown
3 changes: 3 additions & 0 deletions earnings22/media/4443920.mp3
Git LFS file not shown
3 changes: 3 additions & 0 deletions earnings22/media/4446796.mp3
Git LFS file not shown
3 changes: 3 additions & 0 deletions earnings22/media/4448760.mp3
Git LFS file not shown
3 changes: 3 additions & 0 deletions earnings22/media/4449269.mp3
Git LFS file not shown
3 changes: 3 additions & 0 deletions earnings22/media/4450488.mp3
Git LFS file not shown
3 changes: 3 additions & 0 deletions earnings22/media/4450779.mp3
Git LFS file not shown
3 changes: 3 additions & 0 deletions earnings22/media/4452058.mp3
Git LFS file not shown
3 changes: 3 additions & 0 deletions earnings22/media/4453076.mp3
Git LFS file not shown
3 changes: 3 additions & 0 deletions earnings22/media/4453085.mp3
Git LFS file not shown
3 changes: 3 additions & 0 deletions earnings22/media/4453225.mp3
Git LFS file not shown
3 changes: 3 additions & 0 deletions earnings22/media/4461799.mp3
Git LFS file not shown
3 changes: 3 additions & 0 deletions earnings22/media/4462231.mp3
Git LFS file not shown
3 changes: 3 additions & 0 deletions earnings22/media/4463259.mp3
Git LFS file not shown
3 changes: 3 additions & 0 deletions earnings22/media/4463693.mp3
Git LFS file not shown
3 changes: 3 additions & 0 deletions earnings22/media/4464164.mp3
Git LFS file not shown
3 changes: 3 additions & 0 deletions earnings22/media/4466251.mp3
Git LFS file not shown
3 changes: 3 additions & 0 deletions earnings22/media/4466399.mp3
Git LFS file not shown
3 changes: 3 additions & 0 deletions earnings22/media/4466607.mp3
Git LFS file not shown
3 changes: 3 additions & 0 deletions earnings22/media/4466718.mp3
Git LFS file not shown
3 changes: 3 additions & 0 deletions earnings22/media/4466797.mp3
Git LFS file not shown
3 changes: 3 additions & 0 deletions earnings22/media/4467071.mp3
Git LFS file not shown
3 changes: 3 additions & 0 deletions earnings22/media/4467079.mp3
Git LFS file not shown
3 changes: 3 additions & 0 deletions earnings22/media/4467434.mp3
Git LFS file not shown
3 changes: 3 additions & 0 deletions earnings22/media/4467717.mp3
Git LFS file not shown
3 changes: 3 additions & 0 deletions earnings22/media/4468000.mp3
Git LFS file not shown
3 changes: 3 additions & 0 deletions earnings22/media/4468307.mp3
Git LFS file not shown
3 changes: 3 additions & 0 deletions earnings22/media/4468639.mp3
Git LFS file not shown
3 changes: 3 additions & 0 deletions earnings22/media/4468647.mp3
Git LFS file not shown
3 changes: 3 additions & 0 deletions earnings22/media/4468654.mp3
Git LFS file not shown
3 changes: 3 additions & 0 deletions earnings22/media/4468678.mp3
Git LFS file not shown
3 changes: 3 additions & 0 deletions earnings22/media/4468679.mp3
Git LFS file not shown
3 changes: 3 additions & 0 deletions earnings22/media/4468715.mp3
Git LFS file not shown
3 changes: 3 additions & 0 deletions earnings22/media/4468919.mp3
Git LFS file not shown
3 changes: 3 additions & 0 deletions earnings22/media/4469075.mp3
Git LFS file not shown
3 changes: 3 additions & 0 deletions earnings22/media/4469088.mp3
Git LFS file not shown
3 changes: 3 additions & 0 deletions earnings22/media/4469184.mp3
Git LFS file not shown
3 changes: 3 additions & 0 deletions earnings22/media/4469208.mp3
Git LFS file not shown
3 changes: 3 additions & 0 deletions earnings22/media/4469291.mp3
Git LFS file not shown
3 changes: 3 additions & 0 deletions earnings22/media/4469528.mp3
Git LFS file not shown
3 changes: 3 additions & 0 deletions earnings22/media/4469590.mp3
Git LFS file not shown
3 changes: 3 additions & 0 deletions earnings22/media/4469669.mp3
Git LFS file not shown
3 changes: 3 additions & 0 deletions earnings22/media/4469836.mp3
Git LFS file not shown
3 changes: 3 additions & 0 deletions earnings22/media/4470010.mp3
Git LFS file not shown
3 changes: 3 additions & 0 deletions earnings22/media/4470253.mp3
Git LFS file not shown
3 changes: 3 additions & 0 deletions earnings22/media/4470290.mp3
Git LFS file not shown
3 changes: 3 additions & 0 deletions earnings22/media/4470570.mp3
Git LFS file not shown
3 changes: 3 additions & 0 deletions earnings22/media/4470595.mp3
Git LFS file not shown
3 changes: 3 additions & 0 deletions earnings22/media/4470663.mp3
Git LFS file not shown
3 changes: 3 additions & 0 deletions earnings22/media/4470684.mp3
Git LFS file not shown
3 changes: 3 additions & 0 deletions earnings22/media/4471586.mp3
Git LFS file not shown
3 changes: 3 additions & 0 deletions earnings22/media/4471606.mp3
Git LFS file not shown
3 changes: 3 additions & 0 deletions earnings22/media/4471809.mp3
Git LFS file not shown
3 changes: 3 additions & 0 deletions earnings22/media/4471961.mp3
Git LFS file not shown
3 changes: 3 additions & 0 deletions earnings22/media/4472403.mp3
Git LFS file not shown
3 changes: 3 additions & 0 deletions earnings22/media/4472895.mp3
Git LFS file not shown
3 changes: 3 additions & 0 deletions earnings22/media/4473238.mp3
Git LFS file not shown
3 changes: 3 additions & 0 deletions earnings22/media/4473837.mp3
Git LFS file not shown
3 changes: 3 additions & 0 deletions earnings22/media/4474229.mp3
Git LFS file not shown
3 changes: 3 additions & 0 deletions earnings22/media/4474327.mp3
Git LFS file not shown
3 changes: 3 additions & 0 deletions earnings22/media/4474506.mp3
Git LFS file not shown
3 changes: 3 additions & 0 deletions earnings22/media/4474955.mp3
Git LFS file not shown
3 changes: 3 additions & 0 deletions earnings22/media/4475486.mp3
Git LFS file not shown
3 changes: 3 additions & 0 deletions earnings22/media/4475604.mp3
Git LFS file not shown
3 changes: 3 additions & 0 deletions earnings22/media/4479524.mp3
Git LFS file not shown
3 changes: 3 additions & 0 deletions earnings22/media/4479741.mp3
Git LFS file not shown
3 changes: 3 additions & 0 deletions earnings22/media/4479944.mp3
Git LFS file not shown
3 changes: 3 additions & 0 deletions earnings22/media/4480850.mp3
Git LFS file not shown
3 changes: 3 additions & 0 deletions earnings22/media/4481221.mp3
Git LFS file not shown
3 changes: 3 additions & 0 deletions earnings22/media/4481552.mp3
Git LFS file not shown