Commit eb511ff

author
Francesco Fiusco
committed
WIP scientific data
1 parent 8c0dd8c commit eb511ff

File tree

1 file changed

+24
-11
lines changed

content/scientific-data.rst

Lines changed: 24 additions & 11 deletions
@@ -220,13 +220,27 @@ An overview of common data formats
220220
- ❌
221221
- ✅
222222

223-
.. important::
223+
.. important:: Legend
224224

225225
- ✅ : Good
226226
- 🟨 : Ok / depends on the case
227227
- ❌ : Bad
228228

229+
Some of these formats (e.g. JSON and CSV) are saved as text files (ASCII), so they are
230+
human-readable. This makes them easy to check visually (e.g. for format errors), and they
231+
are supported out of the box by many tools. However, they tend to be slower during I/O and
232+
are not optimal for storing floating point numbers, as they either require much more
233+
disk space or have to sacrifice precision to curb size.
234+
235+
Most storage-intensive data is saved in binary formats, which usually require specific libraries
236+
(and possibly specific versions) to be read and cannot be inspected visually. However, they tend to
237+
have much better performance during I/O and to save space when storing floating point numbers at full
238+
precision. Moreover, embedding metadata alongside the data is usually easier.
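
The size/precision trade-off between text and binary storage can be illustrated with a minimal sketch that uses only the Python standard library. The sample values and the choice of 6 versus 17 significant digits are assumptions made for illustration, not part of the lesson material.

```python
import struct

# Hypothetical sample values, chosen only for illustration.
values = [0.1, 1.0 / 3.0, 2.0 ** 0.5, 12345.678901234567]

# Text with 6 significant digits: compact, but precision is lost.
short_text = ",".join(f"{v:.6g}" for v in values)
short_roundtrip = [float(s) for s in short_text.split(",")]

# Text with 17 significant digits: float64 values survive the round trip,
# at the cost of a longer string.
full_text = ",".join(f"{v:.17g}" for v in values)
full_roundtrip = [float(s) for s in full_text.split(",")]

# Binary: each float64 takes exactly 8 bytes (little-endian here).
binary = struct.pack(f"<{len(values)}d", *values)
binary_roundtrip = list(struct.unpack(f"<{len(values)}d", binary))

print("6-digit text bytes: ", len(short_text), "exact:", short_roundtrip == values)
print("17-digit text bytes:", len(full_text), "exact:", full_roundtrip == values)
print("binary bytes:       ", len(binary), "exact:", binary_roundtrip == values)
```

The short text form is small but not exact, the long text form is exact but larger than the binary form, and the binary form is both exact and fixed in size.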
229239

240+
Most of the formats in the table are application- and language-agnostic. However, a couple are
241+
Python-native: `pickle <https://docs.python.org/3/library/pickle.html>`__, which is used to serialise
242+
most Python objects, and `npy <https://numpy.org/devdocs/reference/generated/numpy.lib.format.html>`__,
243+
which is used to serialise NumPy arrays. Several NumPy arrays can be bundled in a single *npz* file.
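
As a rough sketch of the Python-native formats mentioned above, the snippet below round-trips data through npy, npz and pickle. It assumes NumPy is installed and writes to in-memory buffers instead of files; the array contents and the ``record`` dictionary are hypothetical.

```python
import io
import pickle

import numpy as np

# Hypothetical example arrays.
a = np.arange(6, dtype=np.float64).reshape(2, 3)
b = np.linspace(0.0, 1.0, 5)

# npy: a single array per file (an in-memory buffer stands in for a file).
npy_buffer = io.BytesIO()
np.save(npy_buffer, a)
npy_buffer.seek(0)
a_loaded = np.load(npy_buffer)

# npz: several named arrays bundled into one archive.
npz_buffer = io.BytesIO()
np.savez(npz_buffer, a=a, b=b)
npz_buffer.seek(0)
with np.load(npz_buffer) as bundle:
    names = sorted(bundle.files)
    b_loaded = bundle["b"]

# pickle: serialises a general Python object, here a dict holding an array.
record = {"label": "run-1", "shape": a.shape, "data": a}
restored = pickle.loads(pickle.dumps(record))

print(names, a_loaded.shape, b_loaded.shape, restored["label"])
```

With real files, a path such as ``"arrays.npz"`` would replace the buffers. Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.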
230244

231245

232246
CSV (comma-separated values)
@@ -246,8 +260,8 @@ CSV (comma-separated values)
246260
- Ease of use: Ok for one or two dimensional data. Bad for anything higher.
247261
- **Best use cases:** Sharing data. Small data. Data that needs to be human-readable.
248262

249-
CSV is by far the most popular file format, as it is human-readable and easily shareable.
250-
However, it is not the best format to use when you're working with big data.
263+
CSV is a very popular file format, as it is human-readable and easily shareable.
264+
However, it is not the best format to use when working with big (numerical) data.
251265

252266
.. important::
253267

@@ -267,7 +281,7 @@ HDF5 (Hierarchical Data Format version 5)
267281
.. admonition:: Key features
268282

269283
- **Type:** Binary format
270-
- **Packages needed:** Pandas, PyTables, h5py
284+
- **Packages needed:** Pandas, PyTables, h5py, pyvista for meshes, domain-specific...
271285
- **Space efficiency:** Good for numeric data.
272286
- **Good for sharing/archival:** Yes, if datasets are named well.
273287
- Tidy data:
@@ -300,14 +314,13 @@ NetCDF4 (Network Common Data Form version 4)
300314
- **Best use cases:** Working with big datasets in array data format. Especially useful if the dataset
301315
contains spatial or temporal dimensions. Archiving or sharing those datasets.
302316

303-
NetCDF4 is a data format that uses HDF5 as its file format, but it has standardized structure of
304-
datasets and metadata related to these datasets. This makes it possible to be read from various different programs.
305-
306-
NetCDF4 is by far the most common format for storing large data from big simulations in physical sciences.
317+
NetCDF4 is a data format built on top of HDF5, but it exposes a simpler API with a more standardised structure.
318+
NetCDF4 is one of the most widely used formats for storing large data from big simulations in physical sciences.
307319

308-
The advantage of NetCDF4 compared to HDF5 is that one can easily add additional metadata, e.g. spatial
309-
dimensions (``x``, ``y``, ``z``) or timestamps (``t``) that tell where the grid-points are situated.
310-
As the format is standardized, many programs can use this metadata for visualization and further analysis.
320+
..
321+
The advantage of NetCDF4 compared to HDF5 is that one can easily add additional metadata, e.g. spatial
322+
dimensions (``x``, ``y``, ``z``) or timestamps (``t``) that tell where the grid-points are situated.
323+
As the format is standardized, many programs can use this metadata for visualization and further analysis.
311324
312325
There's more
313326
~~~~~~~~~~~~
