forked from SeahorseDB/dataset

This repository shares the datasets used for testing SeahorseDB.

moohnam/dataset

 
 


📂 YouTube Embedding Datasets (SeahorseDB)

This repository provides datasets of CLIP-generated embeddings and metadata for YouTube video scenes. Each embedding is extracted at a timestamp that marks a specific scene within a video. The datasets are part of the SeahorseDB collection on Hugging Face.


📊 Dataset Overview

| Dataset Name | Size | Description |
| --- | --- | --- |
| youtube-1M-dataset | 1 Million | Contains CLIP embeddings and metadata for 1M YouTube video scenes. |
| youtube-2M-dataset | 2 Million | Contains CLIP embeddings and metadata for 2M YouTube video scenes. |
| youtube-15M-dataset | 15 Million | Contains CLIP embeddings and metadata for 15M YouTube video scenes. |

All datasets are stored in Parquet format for efficient storage and processing.


🗃️ Dataset Structure

Each dataset includes the following fields:

| Field Name | Data Type | Description |
| --- | --- | --- |
| `feature` | sequence | 1024-dimensional CLIP-generated embedding vector representing the scene. |
| `title` | string | Title of the YouTube video. |
| `channel_name` | string | Name of the YouTube channel hosting the video. |
| `video_id` | string | Unique identifier for the YouTube video. |
| `timestamp` | string | Timestamp indicating the specific scene in the video. |
| `id` | int64 | Unique identifier for each row. |
| `__index_level_0__` | int64 | Dummy field (likely a pandas index saved along with the data); can be ignored. |
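The field table above can be mirrored in code as a small record type. This is only an illustrative sketch: `SceneRecord`, `record_from_row`, and `EMBEDDING_DIM` are hypothetical names that are not part of the dataset or of any library, and the length check simply enforces the 1024 dimensions stated in the table.

```python
from dataclasses import dataclass, fields
from typing import Mapping, Sequence

EMBEDDING_DIM = 1024  # dimensionality stated in the field table


@dataclass(frozen=True)
class SceneRecord:
    """One dataset row, without the dummy __index_level_0__ field."""
    feature: Sequence[float]
    title: str
    channel_name: str
    video_id: str
    timestamp: str
    id: int

    def __post_init__(self):
        # Reject rows whose embedding does not have the documented length.
        if len(self.feature) != EMBEDDING_DIM:
            raise ValueError(
                f"expected {EMBEDDING_DIM} values, got {len(self.feature)}"
            )


FIELD_NAMES = tuple(f.name for f in fields(SceneRecord))


def record_from_row(row: Mapping) -> SceneRecord:
    """Build a SceneRecord from a dict-like row, ignoring any extra keys."""
    return SceneRecord(**{name: row[name] for name in FIELD_NAMES})
```

A row taken from a loaded DataFrame, for instance via `df.iloc[0].to_dict()`, could be passed to `record_from_row` to catch a malformed embedding early.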

📥 How to Access

The datasets can be downloaded directly from Hugging Face.

You can load the Parquet files using libraries like Pandas or PyArrow.

Example in Python:

```python
import pandas as pd

# Load the 1M dataset
df = pd.read_parquet("path/to/youtube-1M-dataset.parquet")
print(df.head())

# Access embeddings and metadata
embeddings = df['feature'].tolist()  # one 1024-dimensional embedding per scene
timestamps = df['timestamp'].tolist()  # scene timestamps, stored as strings
titles = df['title'].tolist()  # video titles
```
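Once the embeddings are in memory, one common use is nearest-neighbour lookup. The sketch below is not part of SeahorseDB or of the dataset tooling: `top_k_similar` is a hypothetical helper that ranks rows by cosine similarity with NumPy, and it assumes every vector has a non-zero norm.

```python
import numpy as np


def top_k_similar(embeddings, query, k=5):
    """Return (indices, scores) of the k rows most cosine-similar to query."""
    matrix = np.asarray(embeddings, dtype=np.float32)
    query = np.asarray(query, dtype=np.float32)
    # Normalise both sides so that a dot product equals cosine similarity.
    # Assumes no vector has zero norm.
    matrix_unit = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    query_unit = query / np.linalg.norm(query)
    scores = matrix_unit @ query_unit
    k = min(k, len(scores))
    order = np.argsort(-scores)[:k]  # highest similarity first
    return order, scores[order]
```

With the lists from the example above, `top_k_similar(embeddings, embeddings[0], k=5)` would return the positions of the five scenes closest to the first one, and those positions can be used to look up the matching titles and timestamps.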
