Releases: huggingface/datasets

4.1.1

18 Sep 13:15
9be15a7

What's Changed

New Contributors

Full Changelog: 4.1.0...4.1.1

4.1.0

15 Sep 16:41
dd280cb

Dataset Features

  • feat: use content defined chunking by @kszucs in #7589

    • internally uses use_content_defined_chunking=True when writing Parquet files
    • this enables fast deduplicated uploads to Hugging Face!
    # Now faster thanks to content defined chunking
    ds.push_to_hub("username/dataset_name")
    • this optimizes Parquet for Xet, Hugging Face's dedupe-based storage backend. It avoids re-uploading data that already exists on HF (in another file or version, for example). Content defined chunking sets Parquet page boundaries based on the content of the data, so duplicate data is easy to detect.
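The idea behind content defined chunking can be sketched in a few lines of plain Python. This is a toy rolling-hash chunker for illustration only; the real logic lives in the Parquet writer, and WINDOW and MASK are made-up parameters:

```python
# Toy content defined chunking: cut a boundary wherever a rolling hash
# of recent bytes matches a fixed bit pattern, so cut points depend on
# the data itself rather than on absolute file offsets.
WINDOW = 16   # minimum chunk size (made-up value)
MASK = 0x3F   # boundary pattern, roughly 64-byte average chunks (made-up value)

def rolling_chunks(data: bytes) -> list:
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        h = ((h << 1) ^ byte) & 0xFFFFFFFF
        if i - start >= WINDOW and (h & MASK) == 0:
            chunks.append(data[start : i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])
    return chunks

a = bytes(range(256)) * 4
b = b"\x01\x02" * 50 + a  # the same content behind an unrelated prefix
shared = set(rolling_chunks(a)) & set(rolling_chunks(b))  # non-empty: shifted data still dedupes
```

Because cut points depend only on nearby bytes, identical content produces identical chunks even when it is shifted by an unrelated prefix, which is what lets a dedupe-based store like Xet skip uploading data it already has.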
  • Concurrent push_to_hub by @lhoestq in #7708

  • Concurrent IterableDataset push_to_hub by @lhoestq in #7710

  • HDF5 support by @klamike in #7690

    • load HDF5 datasets in one line of code
    ds = load_dataset("username/dataset-with-hdf5-files")
    • each (possibly nested) field in the HDF5 file is parsed as a column, with the first dimension used for rows
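As a rough illustration of that mapping, nested groups can be flattened into "/"-joined column names. This is a toy sketch in which plain dicts stand in for h5py groups; it is not the actual parser:

```python
# Toy sketch of the HDF5 -> columns mapping: nested groups become
# "/"-joined column names; each dataset is a list whose first dimension
# is the row axis. Plain dicts stand in for h5py groups here.
def flatten(group: dict, prefix: str = "") -> dict:
    columns = {}
    for name, node in group.items():
        path = prefix + name
        if isinstance(node, dict):  # subgroup: recurse
            columns.update(flatten(node, path + "/"))
        else:  # dataset: rows along the first dimension
            columns[path] = node
    return columns

h5 = {"images": [[0, 1], [2, 3]], "meta": {"label": [0, 1]}}
columns = flatten(h5)  # {"images": [[0, 1], [2, 3]], "meta/label": [0, 1]}
```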

Other improvements and bug fixes

New Contributors

Full Changelog: 4.0.0...4.1.0

4.0.0

09 Jul 14:54
b0de7a8

New Features

  • Add IterableDataset.push_to_hub() by @lhoestq in #7595

    # Build streaming data pipelines in a few lines of code!
    from datasets import load_dataset
    
    ds = load_dataset(..., streaming=True)
    ds = ds.map(...).filter(...)
    ds.push_to_hub(...)
  • Add num_proc= to .push_to_hub() (Dataset and IterableDataset) by @lhoestq in #7606

    # Faster push to Hub! Available for both Dataset and IterableDataset
    ds.push_to_hub(..., num_proc=8)
  • New Column object

    # Syntax:
    ds["column_name"]  # datasets.Column([...]) or datasets.IterableColumn(...)
    
    # Iterate on a column:
    for text in ds["text"]:
        ...
    
    # Load one cell without bringing the full column in memory
    first_text = ds["text"][0]  # equivalent to ds[0]["text"]
  • Torchcodec decoding by @TyTodd in #7616

    • Enables streaming only the ranges you need!
    # Don't download full audios/videos when it's not necessary
    # Now with torchcodec it only streams the required ranges/frames:
    from datasets import load_dataset
    
    ds = load_dataset(..., streaming=True)
    for example in ds:
        video = example["video"]
        frames = video.get_frames_in_range(start=0, stop=6, step=1)  # only stream certain frames
    • Requires torch>=2.7.0 and FFmpeg >= 4
    • Not available on Windows yet, but it is coming soon; in the meantime please use datasets<4.0
    • Load audio data with AudioDecoder:
    audio = dataset[0]["audio"]  # <datasets.features._torchcodec.AudioDecoder object at 0x11642b6a0>
    samples = audio.get_all_samples()  # or use get_samples_played_in_range(...)
    samples.data  # tensor([[ 0.0000e+00,  0.0000e+00,  0.0000e+00,  ...,  2.3447e-06, -1.9127e-04, -5.3330e-05]])
    samples.sample_rate  # 16000
    
    # old syntax is still supported
    array, sr = audio["array"], audio["sampling_rate"]
    • Load video data with VideoDecoder:
    video = dataset[0]["video"]  # <torchcodec.decoders._video_decoder.VideoDecoder object at 0x14a61d5a0>
    first_frame = video.get_frame_at(0)
    first_frame.data.shape  # (3, 240, 320)
    first_frame.pts_seconds  # 0.0
    frames = video.get_frames_in_range(0, 6, 1)
    frames.data.shape  # torch.Size([5, 3, 240, 320])

Breaking changes

  • Remove scripts altogether by @lhoestq in #7592

    • trust_remote_code is no longer supported
  • Torchcodec decoding by @TyTodd in #7616

    • torchcodec replaces soundfile for audio decoding
    • torchcodec replaces decord for video decoding
  • Replace Sequence by List by @lhoestq in #7634

    • Introduction of the List type
    from datasets import Features, List, Value
    
    features = Features({
        "texts": List(Value("string")),
        "four_paragraphs": List(Value("string"), length=4)
    })
    • Sequence was a legacy type from TensorFlow Datasets which converted lists of dicts to dicts of lists. It is no longer a type: it is now a utility that returns a List or a dict, depending on the subfeature
    from datasets import Sequence
    
    Sequence(Value("string"))  # List(Value("string"))
    Sequence({"texts": Value("string")})  # {"texts": List(Value("string"))}
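The new utility behavior can be sketched like this. This is a toy stand-in using tuples instead of the real List/Value feature classes:

```python
# Toy sketch of Sequence as a utility rather than a type: a dict
# subfeature becomes a dict of lists, anything else becomes a list.
# Tuples stand in for the real feature classes here.
def sequence(feature, length=-1):
    if isinstance(feature, dict):
        return {name: ("List", sub, length) for name, sub in feature.items()}
    return ("List", feature, length)

seq1 = sequence("string")             # a plain subfeature becomes a list type
seq2 = sequence({"texts": "string"})  # a dict subfeature becomes a dict of list types
```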

Other improvements and bug fixes

New Contributors

Full Changelog: 3.6.0...4.0.0

3.6.0

07 May 15:17
458f45a

Dataset Features

  • Enable xet in push to hub by @lhoestq in #7552
    • Faster downloads/uploads with Xet storage
    • more info: #7526

Other improvements and bug fixes

New Contributors

Full Changelog: 3.5.1...3.6.0

3.5.1

28 Apr 14:02
2e94045

Bug fixes

  • support pyarrow 20 by @lhoestq in #7540
    • Fix pyarrow error TypeError: ArrayExtensionArray.to_pylist() got an unexpected keyword argument 'maps_as_pydicts'
  • Write pdf in map by @lhoestq in #7487

Other improvements

New Contributors

Full Changelog: 3.5.0...3.5.1

3.5.0

27 Mar 16:38
0b5998a

Dataset Features

  • Pdf support: load datasets of PDF documents, decoded with pdfplumber

>>> from datasets import load_dataset, Pdf
>>> repo = "path/to/pdf/folder"  # or username/dataset_name on Hugging Face
>>> dataset = load_dataset(repo, split="train")
>>> dataset[0]["pdf"]
<pdfplumber.pdf.PDF at 0x1075bc320>
>>> dataset[0]["pdf"].pages[0].extract_text()
...

What's Changed

New Contributors

Full Changelog: 3.4.1...3.5.0

3.4.1

17 Mar 16:00
f742152

Bug Fixes

Full Changelog: 3.4.0...3.4.1

3.4.0

14 Mar 16:46
14fb15a

Dataset Features

  • Faster folder based builder + parquet support + allow repeated media + use torchvideo by @lhoestq in #7424

    • /!\ Breaking change: we replaced decord with torchvision to read videos, since decord is no longer maintained and isn't available for recent Python versions; see the video dataset loading documentation for more details. The Video type is still marked as experimental in this version
    from datasets import load_dataset, Video
    
    dataset = load_dataset("path/to/video/folder", split="train")
    dataset[0]["video"]  # <torchvision.io.video_reader.VideoReader at 0x1652284c0>
    • faster streaming for image/audio/video folders from Hugging Face
    • support for metadata.parquet in addition to metadata.csv or metadata.jsonl for the metadata of image/audio/video files
  • Add IterableDataset.decode with multithreading by @lhoestq in #7450

    • even faster streaming for image/audio/video folders from Hugging Face if you enable multithreading to decode image/audio/video data:
    dataset = dataset.decode(num_threads=num_threads)
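The effect of num_threads can be sketched with a standard thread pool. This is a generic illustration, not the library's implementation; decode_one is a made-up placeholder for the real image/audio/video decoding step:

```python
from concurrent.futures import ThreadPoolExecutor

# Toy sketch of multithreaded decoding: run an I/O-bound decode step
# in a thread pool while preserving example order.
def decode_one(example: dict) -> dict:
    # placeholder for actual image/audio/video decoding
    return {**example, "decoded": example["path"].upper()}

def decode_stream(examples, num_threads=4):
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        yield from pool.map(decode_one, examples)

stream = [{"path": f"img_{i}.jpg"} for i in range(3)]
decoded = list(decode_stream(stream))  # order is preserved
```

Threads help here because decoding is typically I/O bound, so several examples can be fetched and decoded while others are being consumed.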
  • Add with_split to DatasetDict.map by @jp1924 in #7368
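The idea behind with_split can be sketched with plain dicts standing in for a DatasetDict. This is a toy version; the real map operates on Dataset objects and takes many more arguments:

```python
# Toy sketch of with_split: map a function over every split and, when
# requested, pass the split name along with each example.
def dict_map(splits: dict, fn, with_split: bool = False) -> dict:
    return {
        name: [fn(example, name) if with_split else fn(example) for example in rows]
        for name, rows in splits.items()
    }

dd = {"train": [{"x": 1}], "test": [{"x": 2}]}
out = dict_map(dd, lambda ex, split: {**ex, "split": split}, with_split=True)
```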

General improvements and bug fixes

New Contributors

Full Changelog: 3.3.2...3.4.0

3.3.2

20 Feb 17:44
b37230c

Bug fixes

  • Attempt to fix multiprocessing hang by closing and joining the pool before termination by @dakinggg in #7411
  • Gracefully cancel async tasks by @lhoestq in #7414
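The shutdown ordering behind the multiprocessing fix is the standard Pool pattern: close the pool, then join it, rather than terminating it while workers may still be running. A generic sketch (using ThreadPool, which shares the Pool API; this is not the actual patch):

```python
from multiprocessing.pool import ThreadPool

def square(x):
    return x * x

# Shut the pool down cleanly: close() stops accepting new work, and
# join() waits for the workers to finish, so no worker is killed
# mid-task the way terminate() could kill it.
pool = ThreadPool(2)
try:
    results = pool.map(square, range(5))
finally:
    pool.close()
    pool.join()
```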

Other general improvements

New Contributors

Full Changelog: 3.3.1...3.3.2

3.3.1

17 Feb 14:53
4ead6ec

Bug fixes

Full Changelog: 3.3.0...3.3.1