Releases: huggingface/datasets
4.1.1
What's Changed
- fix iterate nested field by @lhoestq in #7775
- Add support for arrow iterable when concatenating or interleaving by @radulescupetru in #7771
- fix empty dataset to_parquet by @lhoestq in #7779
New Contributors
- @radulescupetru made their first contribution in #7771
Full Changelog: 4.1.0...4.1.1
4.1.0
Dataset Features
- feat: use content defined chunking by @kszucs in #7589
  - internally uses `use_content_defined_chunking=True` when writing Parquet files - this enables fast deduped uploads to Hugging Face!
    ```python
    # Now faster thanks to content defined chunking
    ds.push_to_hub("username/dataset_name")
    ```
  - this optimizes Parquet for Xet, the dedupe-based storage backend of Hugging Face. It avoids uploading data that already exists somewhere on HF (in another file or version, for example). Parquet content defined chunking sets Parquet page boundaries based on the content of the data, in order to detect duplicate data easily.
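  - as an illustration, a minimal sketch of the underlying Parquet writer flag (assuming a recent pyarrow that exposes it; the file name is hypothetical):
    ```python
    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.table({"text": ["a", "b", "c"]})
    # Page boundaries are derived from the data content, so unchanged regions
    # can be deduplicated across files and versions on Xet-backed storage.
    pq.write_table(table, "data.parquet", use_content_defined_chunking=True)
    ```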
- HDF5 support by @klamike in #7690
  - load HDF5 datasets in one line of code
    ```python
    ds = load_dataset("username/dataset-with-hdf5-files")
    ```
  - each (possibly nested) field in the HDF5 file is parsed as a column, with the first dimension used for rows (see the sketch below)
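  - a rough sketch of the field-to-column mapping (all names are hypothetical; the `h5py` file is only for illustration):
    ```python
    import h5py
    import numpy as np

    # 100 rows along the first dimension; "meta/label" is a nested field
    with h5py.File("data.h5", "w") as f:
        f.create_dataset("image", data=np.zeros((100, 32, 32), dtype="uint8"))
        f.create_dataset("meta/label", data=np.arange(100))

    # Once the file is uploaded to a dataset repo, each field becomes a column:
    # ds = load_dataset("username/dataset-with-hdf5-files", split="train")
    # ds[0] -> {"image": [[...]], "meta": {"label": 0}}
    ```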
Other improvements and bug fixes
- Convert to string when needed + faster .zstd by @lhoestq in #7683
- fix audio cast storage from array + sampling_rate by @lhoestq in #7684
- Fix misleading add_column() usage example in docstring by @ArjunJagdale in #7648
- Allow dataset row indexing with np.int types (#7423) by @DavidRConnell in #7438
- Update fsspec max version to current release 2025.7.0 by @rootAvish in #7701
- Update dataset_dict push_to_hub by @lhoestq in #7711
- Retry intermediate commits too by @lhoestq in #7712
- num_proc=0 behaves like None, num_proc=1 uses one worker (not the main process) and clarify num_proc documentation by @tanuj-rai in #7702 (see the sketch after this list)
- Update cli.mdx to refer to the new "hf" CLI by @evalstate in #7713
- fix num_proc=1 ci test by @lhoestq in #7714
- Docs: Use Image(mode="F") for PNG/JPEG depth maps by @lhoestq in #7715
- typo by @lhoestq in #7716
- fix largelist repr by @lhoestq in #7735
- Grammar fix: correct "showed" to "shown" in fingerprint.py by @brchristian in #7730
- Fix type hint `train_test_split` by @qgallouedec in #7736
- fix(webdataset): don't .lower() field_name by @YassineYousfi in #7726
- Refactor HDF5 and preserve tree structure by @klamike in #7743
- docs: Add column overwrite example to batch mapping guide by @Sanjaykumar030 in #7737
- Audio: use TorchCodec instead of Soundfile for encoding by @lhoestq in #7761
- Support pathlib.Path for feature input by @Joshua-Chin in #7755
- add support for pyarrow string view in features by @onursatici in #7718
- Fix typo in error message for cache directory deletion by @brchristian in #7749
- update torchcodec in ci by @lhoestq in #7764
- Bump dill to 0.4.0 by @Bomme in #7763
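A minimal sketch of the `num_proc` semantics clarified in #7702 (the function and data are made up for illustration):
```python
from datasets import Dataset

ds = Dataset.from_dict({"text": ["a", "b", "c"]})

def upper(example):
    return {"text": example["text"].upper()}

ds.map(upper)              # num_proc=None: runs in the main process
ds.map(upper, num_proc=0)  # now behaves like num_proc=None
ds.map(upper, num_proc=1)  # now uses one worker process instead of the main process
```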
New Contributors
- @DavidRConnell made their first contribution in #7438
- @rootAvish made their first contribution in #7701
- @tanuj-rai made their first contribution in #7702
- @evalstate made their first contribution in #7713
- @brchristian made their first contribution in #7730
- @klamike made their first contribution in #7690
- @YassineYousfi made their first contribution in #7726
- @Sanjaykumar030 made their first contribution in #7737
- @kszucs made their first contribution in #7589
- @Joshua-Chin made their first contribution in #7755
- @onursatici made their first contribution in #7718
- @Bomme made their first contribution in #7763
Full Changelog: 4.0.0...4.1.0
4.0.0
New Features
- Add `IterableDataset.push_to_hub()` by @lhoestq in #7595
  ```python
  # Build streaming data pipelines in a few lines of code!
  from datasets import load_dataset

  ds = load_dataset(..., streaming=True)
  ds = ds.map(...).filter(...)
  ds.push_to_hub(...)
  ```
- Add `num_proc=` to `.push_to_hub()` (Dataset and IterableDataset) by @lhoestq in #7606
  ```python
  # Faster push to Hub! Available for both Dataset and IterableDataset
  ds.push_to_hub(..., num_proc=8)
  ```
- New `Column` object
  - Implementation of iteration over values of a column in an IterableDataset object by @TopCoder2K in #7564
  - Lazy column by @lhoestq in #7614
  ```python
  # Syntax:
  ds["column_name"]  # datasets.Column([...]) or datasets.IterableColumn(...)

  # Iterate on a column:
  for text in ds["text"]:
      ...

  # Load one cell without bringing the full column in memory
  first_text = ds["text"][0]  # equivalent to ds[0]["text"]
  ```
- Torchcodec decoding by @TyTodd in #7616
  - Enables streaming only the ranges you need!
    ```python
    # Don't download full audios/videos when it's not necessary
    # Now with torchcodec it only streams the required ranges/frames:
    from datasets import load_dataset

    ds = load_dataset(..., streaming=True)
    for example in ds:
        video = example["video"]
        frames = video.get_frames_in_range(start=0, stop=6, step=1)  # only stream certain frames
    ```
  - Requires `torch>=2.7.0` and FFmpeg >= 4
  - Not available for Windows yet but it is coming soon - in the meantime please use `datasets<4.0`
  - Load audio data with `AudioDecoder`:
    ```python
    audio = dataset[0]["audio"]  # <datasets.features._torchcodec.AudioDecoder object at 0x11642b6a0>
    samples = audio.get_all_samples()  # or use get_samples_played_in_range(...)
    samples.data  # tensor([[ 0.0000e+00, 0.0000e+00, 0.0000e+00, ..., 2.3447e-06, -1.9127e-04, -5.3330e-05]])
    samples.sample_rate  # 16000

    # old syntax is still supported
    array, sr = audio["array"], audio["sampling_rate"]
    ```
  - Load video data with `VideoDecoder`:
    ```python
    video = dataset[0]["video"]  # <torchcodec.decoders._video_decoder.VideoDecoder object at 0x14a61d5a0>
    first_frame = video.get_frame_at(0)
    first_frame.data.shape  # (3, 240, 320)
    first_frame.pts_seconds  # 0.0
    frames = video.get_frames_in_range(0, 6, 1)
    frames.data.shape  # torch.Size([5, 3, 240, 320])
    ```
Breaking changes
- Remove scripts altogether by @lhoestq in #7592
  - `trust_remote_code` is no longer supported
- Torchcodec decoding by @TyTodd in #7616
  - torchcodec replaces soundfile for audio decoding
  - torchcodec replaces decord for video decoding
- Replace Sequence by List by @lhoestq in #7634
  - Introduction of the `List` type
    ```python
    from datasets import Features, List, Value

    features = Features({
        "texts": List(Value("string")),
        "four_paragraphs": List(Value("string"), length=4),
    })
    ```
  - `Sequence` was a legacy type from TensorFlow Datasets which converted lists of dicts to dicts of lists. It is no longer a type; it is now a utility that returns a `List` or a `dict` depending on the subfeature:
    ```python
    from datasets import Sequence, Value

    Sequence(Value("string"))  # List(Value("string"))
    Sequence({"texts": Value("string")})  # {"texts": List(Value("string"))}
    ```
Other improvements and bug fixes
- Refactor `Dataset.map` to reuse cache files mapped with different `num_proc` by @ringohoffman in #7434
- fix string_to_dict test by @lhoestq in #7571
- Preserve formatting in concatenated IterableDataset by @francescorubbo in #7522
- Fix typos in PDF and Video documentation by @AndreaFrancis in #7579
- fix: Add embed_storage in Pdf feature by @AndreaFrancis in #7582
- load_dataset splits typing by @lhoestq in #7587
- Fixed typos by @TopCoder2K in #7572
- Fix regex library warnings by @emmanuel-ferdman in #7576
- [MINOR:TYPO] Update save_to_disk docstring by @cakiki in #7575
- Add missing property on `RepeatExamplesIterable` by @SilvanCodes in #7581
- Avoid multiple default config names by @albertvillanova in #7585
- Fix broken link to albumentations by @ternaus in #7593
- fix string_to_dict usage for windows by @lhoestq in #7598
- No TF in win tests by @lhoestq in #7603
- Docs and more methods for IterableDataset: push_to_hub, to_parquet... by @lhoestq in #7604
- Tests typing and fixes for push_to_hub by @lhoestq in #7608
- fix parallel push_to_hub in dataset_dict by @lhoestq in #7613
- remove unused code by @lhoestq in #7615
- Update `_dill.py` to use `co_linetable` for Python 3.10+ in place of `co_lnotab` by @qgallouedec in #7609
- Fixes in docs by @lhoestq in #7620
- Add albumentations to use dataset by @ternaus in #7596
- minor docs data aug by @lhoestq in #7621
- fix: raise error in FolderBasedBuilder when data_dir and data_files are missing by @ArjunJagdale in #7623
- fix save_infos by @lhoestq in #7639
- better features repr by @lhoestq in #7640
- update docs and docstrings by @lhoestq in #7641
- fix length for ci by @lhoestq in #7642
- Backward compat sequence instance by @lhoestq in #7643
- fix sequence ci by @lhoestq in #7644
- Custom metadata filenames by @lhoestq in #7663
- Update the beans dataset link in Preprocess by @HJassar in #7659
- Backward compat list feature by @lhoestq in #7666
- Fix infer list of images by @lhoestq in #7667
- Fix audio bytes by @lhoestq in #7670
- Fix double sequence by @lhoestq in #7672
New Contributors
- @TopCoder2K made their first contribution in #7564
- @francescorubbo made their first contribution in #7522
- @emmanuel-ferdman made their first contribution in #7576
- @SilvanCodes made their first contribution in #7581
- @ternaus made their first contribution in #7593
- @ArjunJagdale made their first contribution in #7623
- @TyTodd made their first contribution in #7616
- @HJassar made their first contribution in #7659
Full Changelog: 3.6.0...4.0.0
3.6.0
Dataset Features
- Enable xet in push to hub by @lhoestq in #7552
  - Faster downloads/uploads with Xet storage
  - more info: #7526
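A minimal usage sketch (the repo name is hypothetical); no code changes are needed to benefit from Xet-backed transfers:
```python
from datasets import load_dataset

ds = load_dataset("username/my_dataset", split="train")
# When the target repo uses Xet storage, only new chunks are uploaded
ds.push_to_hub("username/my_dataset_copy")
```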
Other improvements and bug fixes
- Add try_original_type to DatasetDict.map by @yoshitomo-matsubara in #7544
- Avoid global umask for setting file mode. by @ryan-clancy in #7547
- Rebatch arrow iterables before formatted iterable by @lhoestq in #7553
- Document the HF_DATASETS_CACHE environment variable in the datasets cache documentation by @Harry-Yang0518 in #7532
- fix regression by @lhoestq in #7558
- fix: Image Feature in Datasets Library Fails to Handle bytearray Objects from Spark DataFrames (#7517) by @giraffacarp in #7521
- Remove `aiohttp` from direct dependencies by @akx in #7294
New Contributors
- @ryan-clancy made their first contribution in #7547
- @Harry-Yang0518 made their first contribution in #7532
- @giraffacarp made their first contribution in #7521
- @akx made their first contribution in #7294
Full Changelog: 3.5.1...3.6.0
3.5.1
Bug fixes
- support pyarrow 20 by @lhoestq in #7540
  - Fix pyarrow error `TypeError: ArrayExtensionArray.to_pylist() got an unexpected keyword argument 'maps_as_pydicts'`
- Write pdf in map by @lhoestq in #7487
Other improvements
- update fsspec 2025.3.0 by @peteski22 in #7478
- Support underscore int read instruction by @lhoestq in #7488
- Support skip_trying_type by @yoshitomo-matsubara in #7483
- pdf docs fixes by @lhoestq in #7519
- Remove conditions for Python < 3.9 by @cyyever in #7474
- mention av in video docs by @lhoestq in #7523
- correct use with polars example by @SiQube in #7524
- chore: fix typos by @afuetterer in #7436
New Contributors
- @peteski22 made their first contribution in #7478
- @yoshitomo-matsubara made their first contribution in #7483
- @SiQube made their first contribution in #7524
- @afuetterer made their first contribution in #7436
Full Changelog: 3.5.0...3.5.1
3.5.0
Datasets Features
- Introduce PDF support (#7318) by @yabramuvdi in #7325
  ```python
  >>> from datasets import load_dataset, Pdf
  >>> repo = "path/to/pdf/folder"  # or username/dataset_name on Hugging Face
  >>> dataset = load_dataset(repo, split="train")
  >>> dataset[0]["pdf"]
  <pdfplumber.pdf.PDF at 0x1075bc320>
  >>> dataset[0]["pdf"].pages[0].extract_text()
  ...
  ```
What's Changed
- Fix local pdf loading by @lhoestq in #7466
- Minor fix for metadata files in extension counter by @lhoestq in #7464
- Prioritize json by @lhoestq in #7476
New Contributors
- @yabramuvdi made their first contribution in #7325
Full Changelog: 3.4.1...3.5.0
3.4.1
3.4.0
Dataset Features
- Faster folder based builder + parquet support + allow repeated media + use torchvision by @lhoestq in #7424
  - /!\ Breaking change: we replaced `decord` with `torchvision` to read videos, since `decord` is no longer maintained and isn't available for recent Python versions; see the video dataset loading documentation here for more details. The `Video` type is still marked as experimental in this version.
    ```python
    from datasets import load_dataset, Video

    dataset = load_dataset("path/to/video/folder", split="train")
    dataset[0]["video"]  # <torchvision.io.video_reader.VideoReader at 0x1652284c0>
    ```
  - faster streaming for image/audio/video folders from Hugging Face
  - support for `metadata.parquet` in addition to `metadata.csv` or `metadata.jsonl` for the metadata of the image/audio/video files (see the sketch below)
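  - a rough sketch of a folder with Parquet metadata (paths and column names are hypothetical):
    ```python
    # path/to/folder/metadata.parquet  (columns: file_name, caption, ...)
    # path/to/folder/img_0001.png
    # path/to/folder/img_0002.png
    from datasets import load_dataset

    dataset = load_dataset("imagefolder", data_dir="path/to/folder", split="train")
    dataset[0]["caption"]  # metadata columns are matched to each file via file_name
    ```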
- Add IterableDataset.decode with multithreading by @lhoestq in #7450
  - even faster streaming for image/audio/video folders from Hugging Face if you enable multithreading to decode image/audio/video data (a fuller sketch follows):
    ```python
    dataset = dataset.decode(num_threads=num_threads)
    ```
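  - a minimal end-to-end sketch (the repo name and thread count are hypothetical):
    ```python
    from datasets import load_dataset

    ds = load_dataset("username/image-folder-dataset", split="train", streaming=True)
    ds = ds.decode(num_threads=8)  # media are decoded in a thread pool while you iterate
    for example in ds:
        image = example["image"]
        ...
    ```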
General improvements and bug fixes
- fix: None default with bool type on load creates typing error by @stephantul in #7426
- Use pyupgrade --py39-plus by @cyyever in #7428
- Refactor `string_to_dict` to return `None` if there is no match instead of raising `ValueError` by @ringohoffman in #7435
- Fix small bugs with async map by @lhoestq in #7445
- Fix resuming after `ds.set_epoch(new_epoch)` by @lhoestq in #7451
- minor docs changes by @lhoestq in #7452
New Contributors
- @stephantul made their first contribution in #7426
- @cyyever made their first contribution in #7428
- @jp1924 made their first contribution in #7368
Full Changelog: 3.3.2...3.4.0
3.3.2
Bug fixes
- Attempt to fix multiprocessing hang by closing and joining the pool before termination by @dakinggg in #7411
- Gracefully cancel async tasks by @lhoestq in #7414
Other general improvements
- Update use_with_pandas.mdx: to_pandas() correction in last section by @ibarrien in #7407
- Fix a typo in arrow_dataset.py by @jingedawang in #7402
New Contributors
- @dakinggg made their first contribution in #7411
- @ibarrien made their first contribution in #7407
- @jingedawang made their first contribution in #7402
Full Changelog: 3.3.1...3.3.2