Releases: veselink1/sponsorml-ytcaptions
distilbert-span-extraction-uncased
A transformer-based span extraction model, built on DistilBERT, that extracts sponsored segments from YouTube captions.
distilbert-classification-uncased + dataset
A transformer-based text classification model, built on DistilBERT, that classifies YouTube transcript segments as sponsored or non-sponsored.
Preprocessed data (v3)
Each archive contains 20,000 transcripts with sponsor segments (except data.16.json.gz, which contains fewer videos).
[
{
"video_id": "---jcia5ufM",
"captions": [
["so you've decided to get a new dog", 2.149, 4.22],
["congratulations that's a huge decision", 4.23, 6.59],
/* ... */
],
"sponsor_times": [
[41.05, 52.56]
]
},
/* ... */
]
Use the data_loader.py from the tagged commit 44e2cdc.
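For a quick look at the v3 layout without the repository code, the following is a minimal sketch that reads one archive and marks each caption as sponsored or not. It is not the repository's data_loader.py. The function names `load_transcripts` and `label_captions` are hypothetical, and the overlap rule (a caption counts as sponsored if its time span intersects any sponsor range) is an assumption made for illustration.

```python
import gzip
import json


def load_transcripts(path):
    """Read one gzip-compressed JSON archive and return its list of records."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return json.load(f)


def label_captions(record):
    """Return (text, is_sponsored) pairs for one v3 record.

    Assumption: a caption is sponsored if its [start, end] interval
    overlaps any [start, end] pair in "sponsor_times".
    """
    labeled = []
    for text, start, end in record["captions"]:
        sponsored = any(
            start < s_end and end > s_start
            for s_start, s_end in record["sponsor_times"]
        )
        labeled.append((text, sponsored))
    return labeled
```

With an archive downloaded locally, `label_captions(load_transcripts("data.16.json.gz")[0])` would return the labeled captions of its first record.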
Preprocessed data (v2)
Dataset created with 11dc381.
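The v2 sample in this section differs from v3 in its key names (`videoID`, `sponsor_ranges`) and stores each caption as an object rather than a `[text, start, end]` triple. The sketch below maps a v2 record onto the v3 layout. The function name `v2_to_v3` is hypothetical, and stripping trailing newlines and dropping empty captions are assumptions inferred from comparing the two samples, not the project's confirmed preprocessing.

```python
def v2_to_v3(record):
    """Convert one v2 record (dict) to the v3 layout.

    Assumptions: caption text is whitespace-stripped and captions that
    become empty are dropped, as the v3 sample suggests.
    """
    captions = []
    for cap in record["captions"]:
        text = cap["text"].strip()
        if not text:
            continue  # skip empty captions (assumed behavior)
        captions.append([text, cap["start"], cap["end"]])
    return {
        "video_id": record["videoID"],
        "captions": captions,
        "sponsor_times": [list(r) for r in record["sponsor_ranges"]],
    }
```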
[
{
"videoID": "---jcia5ufM",
"captions": [
{
"end": 2.139,
"start": 0.03,
"text": ""
},
{
"end": 4.22,
"start": 2.149,
"text": "so you've decided to get a new dog\n"
},
{
"end": 6.59,
"start": 4.23,
"text": "congratulations that's a huge decision\n"
},
/* ... */
],
"sponsor_ranges": [
[
41.05,
52.56
]
]
},
/* ... */
]
Preprocessed data
| videoID | transcript | sponsorText | sponsorTokenRange |
|---|---|---|---|
| ---jcia5ufM | welcome back to ... | sponsor for today's video ... | (406, 589) |
| ... | ... | ... | ... |
Each file contains 10,000 records.