@@ -220,13 +220,27 @@ An overview of common data formats
220
220
- ❌
221
221
- ✅
222
222
223
- .. important ::
223
+ .. important :: Legend
224
224
225
225
- ✅ : Good
226
226
- 🟨 : Ok / depends on a case
227
227
- ❌ : Bad
228
228
229
+ Some of these formats (e.g. JSON and CSV) are saved as text files (ASCII), thus they are
230
+ human-readable. This makes them easier to check visually (e.g. for format errors), and they
231
+ are supported out of the box by many tools. However, they tend to be slower during I/O and
232
+ are not optimal for storage of floating point numbers, as they either require much larger
233
+ disk space or have to sacrifice precision to curb size.
234
+
235
+ Most storage-intensive data is saved in binary formats, which usually require specific libraries
236
+ (and possibly specific versions) to be read and cannot be inspected visually. However, they tend to
237
+ have much better performance during I/O and to save space when storing floating point numbers at full
238
+ precision. Moreover, embedding metadata is easier.
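
The size and precision trade-off between text and binary storage can be illustrated with a minimal sketch that uses only the Python standard library. The sample values and variable names are hypothetical and chosen purely for illustration; real formats such as CSV or HDF5 add their own overhead and features on top of this.

```python
import struct

# Hypothetical sample data: floats with long decimal expansions.
values = [i / 7 for i in range(1, 1001)]

# Text at full precision: repr() yields the shortest string that
# converts back to exactly the same float.
text_full = ",".join(repr(v) for v in values).encode("ascii")

# Text with precision sacrificed to curb size (3 decimal places).
text_short = ",".join(f"{v:.3f}" for v in values).encode("ascii")

# Binary: exactly 8 bytes per float64, with no loss of precision.
binary = struct.pack(f"<{len(values)}d", *values)

# Read everything back to compare against the original values.
from_text_full = [float(s) for s in text_full.decode("ascii").split(",")]
from_text_short = [float(s) for s in text_short.decode("ascii").split(",")]
from_binary = list(struct.unpack(f"<{len(values)}d", binary))

print("bytes (text, full precision):", len(text_full))
print("bytes (text, 3 decimals):    ", len(text_short))
print("bytes (binary):              ", len(binary))
print("lossless:", from_text_full == values,
      from_text_short == values, from_binary == values)
```

In this sketch the full-precision text needs more bytes than the binary encoding, while the truncated text no longer reproduces the original values.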
229
239
240
+ Most of the formats in the table are application- and language-agnostic. However, a couple are
241
+ Python-native: `Pickle <https://docs.python.org/3/library/pickle.html>`__, which is used to serialise
242
+ most Python objects, and `npy <https://numpy.org/devdocs/reference/generated/numpy.lib.format.html>`__,
243
+ which is used to serialise NumPy arrays. Several NumPy arrays can be bundled in a single *npz* file.
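
As a rough illustration of the Python-native formats mentioned here, the sketch below round-trips a small object through Pickle and a few arrays through npy and npz. It assumes NumPy is installed; the data, the variable names, and the use of in-memory buffers in place of real files are assumptions made for the example.

```python
import io
import pickle

import numpy as np

# Pickle: serialise a (hypothetical) nested Python object to bytes and back.
record = {"name": "run-1", "steps": [1, 2, 3], "converged": True}
restored = pickle.loads(pickle.dumps(record))

# npy: serialise a single NumPy array (a BytesIO buffer stands in for a file).
temperatures = np.linspace(250.0, 300.0, 6)
npy_buffer = io.BytesIO()
np.save(npy_buffer, temperatures)
npy_buffer.seek(0)
temperatures_back = np.load(npy_buffer)

# npz: bundle several named arrays in one archive.
pressures = np.arange(6, dtype=np.float64)
npz_buffer = io.BytesIO()
np.savez(npz_buffer, temperatures=temperatures, pressures=pressures)
npz_buffer.seek(0)
with np.load(npz_buffer) as archive:
    names = sorted(archive.files)          # names of the stored arrays
    pressures_back = archive["pressures"]  # load one array by name

print(restored == record)
print(np.array_equal(temperatures_back, temperatures))
print(names)
```

Note that unpickling can execute arbitrary code, so Pickle files should only be loaded from trusted sources.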
230
244
231
245
232
246
CSV (comma-separated values)
@@ -246,8 +260,8 @@ CSV (comma-separated values)
246
260
- Ease of use: Ok for one or two dimensional data. Bad for anything higher.
247
261
- **Best use cases:** Sharing data. Small data. Data that needs to be human-readable.
248
262
249
- CSV is by far the most popular file format, as it is human-readable and easily shareable.
250
- However, it is not the best format to use when you're working with big data.
263
+ CSV is a very popular file format, as it is human-readable and easily shareable.
264
+ However, it is not the best format to use when working with big (numerical) data.
251
265
252
266
.. important ::
253
267
@@ -267,7 +281,7 @@ HDF5 (Hierarchical Data Format version 5)
267
281
.. admonition :: Key features
268
282
269
283
- **Type:** Binary format
270
- - **Packages needed:** Pandas, PyTables, h5py
284
+ - **Packages needed:** Pandas, PyTables, h5py, pyvista for meshes, domain-specific...
271
285
- **Space efficiency:** Good for numeric data.
272
286
- **Good for sharing/archival:** Yes, if datasets are named well.
273
287
- Tidy data:
@@ -300,14 +314,13 @@ NetCDF4 (Network Common Data Form version 4)
300
314
- **Best use cases:** Working with big datasets in array data format. Especially useful if the dataset
301
315
contains spatial or temporal dimensions. Archiving or sharing those datasets.
302
316
303
- NetCDF4 is a data format that uses HDF5 as its file format, but it has standardized structure of
304
- datasets and metadata related to these datasets. This makes it possible to be read from various different programs.
305
-
306
- NetCDF4 is by far the most common format for storing large data from big simulations in physical sciences.
317
+ NetCDF4 is a data format built on top of HDF5, but it exposes a simpler API with a more standardised structure.
318
+ NetCDF4 is one of the most widely used formats for storing large data from big simulations in physical sciences.
307
319
308
- The advantage of NetCDF4 compared to HDF5 is that one can easily add additional metadata, e.g. spatial
309
- dimensions (``x``, ``y``, ``z``) or timestamps (``t``) that tell where the grid-points are situated.
310
- As the format is standardized, many programs can use this metadata for visualization and further analysis.
320
+ ..
321
+ The advantage of NetCDF4 compared to HDF5 is that one can easily add additional metadata, e.g. spatial
322
+ dimensions (``x``, ``y``, ``z``) or timestamps (``t``) that tell where the grid-points are situated.
323
+ As the format is standardized, many programs can use this metadata for visualization and further analysis.
311
324
312
325
There's more
313
326
~~~~~~~~~~~~