Releases: veselink1/sponsorml-ytcaptions

distilbert-span-extraction-uncased

19 May 07:53

A transformer-based span extraction model, built on DistilBERT, for extracting sponsored segments from YouTube captions.

distilbert-classification-uncased + dataset

12 May 06:25

A transformer-based text classification model, built on DistilBERT, that classifies YouTube transcript segments as sponsored or non-sponsored.

Preprocessed data (v3)

09 May 10:03

Each archive contains 20,000 transcripts with sponsor segments (except data.16.json.gz, which contains fewer videos).

[
  {
    "video_id": "---jcia5ufM",
    "captions": [
      ["so you've decided to get a new dog", 2.149, 4.22],
      ["congratulations that's a huge decision", 4.23, 6.59],
      /* ... */
    ],
    "sponsor_times": [
      [41.05, 52.56]
    ]
  },
  /* ... */
]

Use the data_loader.py from the tagged commit 44e2cdc.
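
The v3 layout above can also be read directly. The following is a minimal sketch, not the repository's data_loader.py: it assumes each archive is a gzip-compressed JSON array in exactly the shape shown, and that the `/* ... */` markers in the sample are elisions that do not appear in the real files. The function names `load_archive` and `label_captions` are hypothetical.

```python
import gzip
import json


def load_archive(path):
    """Read one gzip-compressed JSON archive and return its list of video records."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return json.load(f)


def label_captions(video):
    """Return (text, is_sponsored) pairs for one v3 video record.

    A caption counts as sponsored if its [start, end] interval overlaps
    any [start, end] pair in "sponsor_times". This overlap rule is an
    assumption made for illustration.
    """
    labelled = []
    for text, start, end in video["captions"]:
        sponsored = any(
            start < s_end and end > s_start
            for s_start, s_end in video["sponsor_times"]
        )
        labelled.append((text, sponsored))
    return labelled
```

For example, `label_captions(load_archive("data.16.json.gz")[0])` would label the captions of the first video in that archive, provided the file is in the working directory.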

Preprocessed data (v2)

02 May 07:59

Dataset created with 11dc381.

[
  {
    "videoID": "---jcia5ufM",
    "captions": [
      {
        "end": 2.139,
        "start": 0.03,
        "text": ""
      },
      {
        "end": 4.22,
        "start": 2.149,
        "text": "so you've decided to get a new dog\n"
      },
      {
        "end": 6.59,
        "start": 4.23,
        "text": "congratulations that's a huge decision\n"
      },
      /* ... */
    ],
    "sponsor_ranges": [
      [
        41.05,
        52.56
      ]
    ]
  },
  /* ... */
]
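
Judging only from the two samples on this page, a v2 record differs from a v3 record in its key names (`videoID` vs. `video_id`, `sponsor_ranges` vs. `sponsor_times`) and in how captions are stored (objects vs. `[text, start, end]` lists). The sketch below converts one record under those assumptions. Stripping trailing newlines and dropping empty captions is inferred from the samples, not confirmed by the release notes, and `v2_to_v3` is a hypothetical name.

```python
def v2_to_v3(record):
    """Convert one v2-style record (dict captions) to the v3-style layout."""
    captions = []
    for caption in record["captions"]:
        text = caption["text"].strip()  # v2 text ends with "\n" in the sample
        if not text:
            continue  # assumption: empty captions are dropped in v3
        captions.append([text, caption["start"], caption["end"]])
    return {
        "video_id": record["videoID"],
        "captions": captions,
        "sponsor_times": record["sponsor_ranges"],
    }
```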

Preprocessed data

30 Apr 09:18

Pre-release
videoID     | transcript          | sponsorText                   | sponsorTokenRange
---jcia5ufM | welcome back to ... | sponsor for today's video ... | (406, 589)
...         | ...                 | ...                           | ...

Each file contains 10,000 records.