
Commit 12e1bb6

franzpoeschel and ax3l authored

Dataset-specific JSON/TOML configuration (openPMD#1646)

* Add JSONMatcher class
* Embed JSONMatcher into the backends
* First attempt: Dataset-specific configuration
* Seems to work
* Adapt Coretest to new output of myPath()
* Better error messages and some documentation inside example
* Adapt constructors for installation without ADIOS2/HDF5
* CI fixes
* Basic implementation
* Update documentation and tests
* Add a JSON translation of the config for NVHPC compilers; might also be good for documentation purposes as JSON is more widely known
* Use dataset-specific config in tests
* Fix: do_prune parameter for merge()
* Rename merge() -> merge_internal(): having the same name as the public function provoked errors due to conversion from nlohmann::json types
* Don't compute the matchers for all backends
* Add default block to test configs
* Documentation
* Add TOML example
* Add Python binding for openPMD_path
* Fix doxygen
* Read dataset-specific configuration also in ADIOS2::openDataset
* Cleanup
* Fix initialization from Dummy IO Handler
* Fix Doxygen
* Documentation Update
* Fix NVCOMPILER macro in example

Co-authored-by: Axel Huebl <[email protected]>
1 parent c484088 commit 12e1bb6

39 files changed: +1325 −238 lines

CMakeLists.txt (+1)

@@ -406,6 +406,7 @@ set(CORE_SOURCE
     src/auxiliary/Date.cpp
     src/auxiliary/Filesystem.cpp
     src/auxiliary/JSON.cpp
+    src/auxiliary/JSONMatcher.cpp
     src/auxiliary/Mpi.cpp
     src/backend/Attributable.cpp
     src/backend/BaseRecordComponent.cpp

docs/source/details/backendconfig.rst (+75)

@@ -287,3 +287,78 @@ Explanation of the single keys:
  In "template" mode, only the dataset metadata (type, extent and attributes) are stored and no chunks can be written or read (i.e. write/read operations will be skipped).
* ``json.attribute.mode`` / ``toml.attribute.mode``: One of ``"long"`` (default in openPMD 1.*) or ``"short"`` (default in openPMD 2.* and generally in TOML).
  The long format explicitly encodes the attribute type in the dataset on disk, the short format only writes the actual attribute as a JSON/TOML value, requiring readers to recover the type.
Dataset-specific configuration
------------------------------

Sometimes it is beneficial to set configuration options for specific datasets.
Most dataset-specific configuration options supported by the openPMD-api are additionally backend-specific, being format-specific serialization instructions such as compression or chunking.

All dataset-specific and backend-specific configuration is specified under the key path ``<backend>.dataset``.
Without filtering by dataset name (see the ``select`` key below) this looks like:

.. code-block:: json

   {
     "adios2": {
       "dataset": {
         "operators": []
       }
     },
     "hdf5": {
       "dataset": {
         "chunks": "auto"
       }
     }
   }

Dataset-specific configuration options can be configured in multiple ways:

As part of the general JSON/TOML configuration
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In the simplest case, the dataset configuration is specified without any extra steps as part of the JSON/TOML configuration passed to the ``Series`` constructor. This does not allow specifying different configurations per dataset, but sets the default configuration for all datasets.

As a separate JSON/TOML configuration during dataset initialization
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Similarly to the ``Series`` constructor, the ``Dataset`` constructor optionally receives a JSON/TOML configuration, used to set options only for those datasets initialized with this ``Dataset`` specification. The default given in the ``Series`` constructor will be overridden.

This is the preferred way for configuring dataset-specific options that are *not* backend-specific (currently only ``{"resizable": true}``).
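The override described above amounts to a recursive merge in which dataset-level keys win over the Series-level default. A minimal Python sketch of that precedence (illustrative only; ``merge_config`` is a hypothetical helper, not the openPMD-api's internal merge logic):

```python
# Hypothetical sketch of how a per-dataset configuration could override
# the Series-level default. This is NOT the openPMD-api implementation,
# just an illustration of the precedence described above.

def merge_config(default: dict, override: dict) -> dict:
    """Recursively merge `override` into `default`; override wins."""
    merged = dict(default)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = value
    return merged

series_default = {"hdf5": {"dataset": {"chunks": "auto"}}, "resizable": False}
dataset_specific = {"resizable": True}

effective = merge_config(series_default, dataset_specific)
print(effective["resizable"])        # True: the dataset-level value wins
print(effective["hdf5"]["dataset"])  # {'chunks': 'auto'}: defaults are kept
```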
By pattern-matching the dataset names
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The above approach has the disadvantage that it has to be supported explicitly at the level of the downstream application, e.g. a simulation or data reader. As an alternative, the backend-specific dataset configuration under ``<backend>.dataset`` can also be given as a list of alternatives that are matched against the dataset name in sequence, e.g. ``hdf5.dataset = [<pattern_1>, <pattern_2>, ...]``.

Each such pattern ``<pattern_i>`` is a JSON object with key ``cfg`` and optional key ``select``: ``{"select": <regex>, "cfg": <cfg>}``.

Here, ``<regex>`` is a regex or a list of regexes, of type egrep as defined by the `C++ standard library <https://en.cppreference.com/w/cpp/regex/basic_regex/constants>`__.
``<cfg>`` is a configuration that will be forwarded as a "regular" dataset configuration to the backend.

.. note::

   To match lists of regular expressions ``select = [REGEX_1, REGEX_2, ..., REGEX_n]``, the list is internally transformed into a single regular expression ``($^)|(REGEX_1)|(REGEX_2)|...|(REGEX_n)``.

In a configuration such as ``hdf5.dataset = [<pattern_1>, <pattern_2>, ...]``, the single patterns will be processed in a top-down manner, selecting the first matching pattern found in the list.
The specified regexes will be matched against the openPMD dataset path either within the Iteration (e.g. ``meshes/E/x`` or ``particles/.*/position/.*``) or within the Series (e.g. ``/data/1/meshes/E/x`` or ``/data/.*/particles/.*/position/.*``), considering full matches only.

.. note::

   The dataset name is determined by the result of ``attributable.myPath().openPMDPath()`` where ``attributable`` is an object in the openPMD hierarchy.

.. note::

   To match against the path within the containing Iteration or within the containing Series, the specified regular expression is internally transformed into ``(/data/[0-9]+/)?(REGEX)`` where ``REGEX`` is the specified pattern, and then matched against the full dataset path.

The **default configuration** is specified by omitting the ``select`` key.
Specifying more than one default is an error.
If no pattern matches a dataset, the default configuration is chosen if specified, or an empty JSON object ``{}`` otherwise.
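The two transformations in the notes above (fusing regex lists, accepting an optional Iteration prefix) together with the top-down selection can be sketched in a few lines of Python. This is an illustration of the documented rules only, using Python's ``re`` in place of C++ egrep regexes; ``build_matcher`` and ``resolve_config`` are hypothetical names, not the library's API:

```python
import re

def build_matcher(select):
    """Fuse a regex or list of regexes and allow an optional Iteration prefix."""
    if isinstance(select, list):
        # Lists are fused into a single alternation: ($^)|(R_1)|...|(R_n)
        select = "($^)|" + "|".join(f"({r})" for r in select)
    # Paths may also be given relative to the Iteration, so an optional
    # "/data/<iteration>/" prefix is accepted: (/data/[0-9]+/)?(REGEX)
    return re.compile(f"(/data/[0-9]+/)?({select})")

def resolve_config(patterns, dataset_path):
    """Pick the first matching pattern top-down; fall back to the default."""
    default = {}
    for pattern in patterns:
        if "select" not in pattern:
            default = pattern["cfg"]  # entry without "select" is the default
        elif build_matcher(pattern["select"]).fullmatch(dataset_path):
            return pattern["cfg"]
    return default

patterns = [
    {"cfg": {"chunks": "auto"}},  # default entry
    {"select": "particles/e/.*", "cfg": {"chunks": [10]}},
]
print(resolve_config(patterns, "/data/5/particles/e/position/x"))  # {'chunks': [10]}
print(resolve_config(patterns, "meshes/E/x"))                      # {'chunks': 'auto'}
```

Note that this sketch does not enforce the "at most one default" rule that the library checks.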
A full example:

.. literalinclude:: openpmd_extended_config.toml
   :language: toml

.. literalinclude:: openpmd_extended_config.json
   :language: json
docs/source/details/openpmd_extended_config.json (new file, +62)

{
  "adios2": {
    "engine": {
      "parameters": {
        "Profile": "On"
      }
    },
    "dataset": [
      {
        "cfg": {
          "operators": [
            {
              "type": "blosc",
              "parameters": {
                "clevel": "1",
                "doshuffle": "BLOSC_BITSHUFFLE"
              }
            }
          ]
        }
      },
      {
        "select": [
          ".*positionOffset.*",
          ".*particlePatches.*"
        ],
        "cfg": {
          "operators": []
        }
      }
    ]
  },
  "hdf5": {
    "independent_stores": false,
    "dataset": [
      {
        "cfg": {
          "chunks": "auto"
        }
      },
      {
        "select": [
          "/data/1/particles/e/.*",
          "/data/2/particles/e/.*"
        ],
        "cfg": {
          "chunks": [
            5
          ]
        }
      },
      {
        "select": "particles/e/.*",
        "cfg": {
          "chunks": [
            10
          ]
        }
      }
    ]
  }
}
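The ordering of the ``hdf5.dataset`` entries above matters: the explicit full-path rule for iterations 1 and 2 is tried before the relative-path rule, and unmatched datasets fall back to the default. A self-contained Python sketch of that resolution (the ``resolve`` helper is illustrative, not the openPMD-api implementation):

```python
import json
import re

# The hdf5.dataset pattern list from the example configuration above.
hdf5_dataset = json.loads("""
[
  {"cfg": {"chunks": "auto"}},
  {"select": ["/data/1/particles/e/.*", "/data/2/particles/e/.*"],
   "cfg": {"chunks": [5]}},
  {"select": "particles/e/.*", "cfg": {"chunks": [10]}}
]
""")

def resolve(patterns, path):
    """Illustration of the documented matching rules (not the library code)."""
    default = {}
    for p in patterns:
        select = p.get("select")
        if select is None:
            default = p["cfg"]           # entry without "select": the default
            continue
        if isinstance(select, list):     # fuse regex lists into one alternation
            select = "($^)|" + "|".join(f"({r})" for r in select)
        # allow matching relative to the Iteration: (/data/[0-9]+/)?(REGEX)
        if re.fullmatch(f"(/data/[0-9]+/)?({select})", path):
            return p["cfg"]
    return default

# Entries are tried top-down: iteration 1 hits the explicit full-path rule,
# iteration 3 falls through to the relative-path rule, meshes get the default.
print(resolve(hdf5_dataset, "/data/1/particles/e/position/x"))  # {'chunks': [5]}
print(resolve(hdf5_dataset, "/data/3/particles/e/position/x"))  # {'chunks': [10]}
print(resolve(hdf5_dataset, "/data/1/meshes/E/x"))              # {'chunks': 'auto'}
```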
docs/source/details/openpmd_extended_config.toml (new file, +44)

# ADIOS2 config

[adios2.engine.parameters]
Profile = "On"

# default configuration
[[adios2.dataset]]
# nested list as ADIOS2 can add multiple operators to a single dataset
[[adios2.dataset.cfg.operators]]
type = "blosc"
parameters.doshuffle = "BLOSC_BITSHUFFLE"
parameters.clevel = "1"

# dataset-specific configuration to exclude some datasets
# from applying operators.
[[adios2.dataset]]
select = [".*positionOffset.*", ".*particlePatches.*"]
cfg.operators = []

# Now HDF5

[hdf5]
independent_stores = false

# default configuration
# The position of the default configuration does not matter, but there must
# be only one single default configuration.
[[hdf5.dataset]]
cfg.chunks = "auto"

# Dataset-specific configuration that specifies full paths,
# i.e. including the path to the Iteration.
# The non-default configurations are matched in top-down order,
# so the order is relevant.
[[hdf5.dataset]]
select = ["/data/1/particles/e/.*", "/data/2/particles/e/.*"]
cfg.chunks = [5]

# dataset-specific configuration that specifies only the path
# within the Iteration
[[hdf5.dataset]]
select = "particles/e/.*"
cfg.chunks = [10]

examples/13_write_dynamic_configuration.cpp (+101 −11)
@@ -10,13 +10,16 @@ using namespace openPMD;

 int main()
 {
-    if (!getVariants()["adios2"])
+    if (!getVariants()["hdf5"])
     {
         // Example configuration below selects the HDF5 backend
         return 0;
     }

     using position_t = double;
+
+    // see https://github.com/ToruNiina/toml11/issues/205
+#if !defined(__NVCOMPILER_MAJOR__) || __NVCOMPILER_MAJOR__ >= 23
     /*
      * This example demonstrates how to use JSON/TOML-based dynamic
      * configuration for openPMD.
@@ -34,7 +37,7 @@ int main()
 # be passed by adding an at-sign `@` in front of the path
 # The format will then be recognized by filename extension, i.e. .json or .toml

-backend = "adios2"
+backend = "hdf5"
 iteration_encoding = "group_based"
 # The following is only relevant in read mode
 defer_iteration_parsing = true
@@ -57,13 +60,104 @@ parameters.clevel = 5
 # type = "some other parameter"
 # # ...

-[hdf5.dataset]
-chunks = "auto"
+# Sometimes, dataset configurations should not affect all datasets, but only
+# specific ones, e.g. only particle data.
+# Dataset configurations can be given as a list, here at the example of HDF5.
+# In such lists, each entry is an object with two keys:
+#
+# 1. 'cfg': Mandatory key, this is the actual dataset configuration.
+# 2. 'select': A Regex or a list of Regexes to match against the dataset name.
+#
+# This makes it possible to give dataset-specific configurations.
+# The dataset name is the same as returned
+# by `Attributable::myPath().openPMDPath()`.
+# The regex must match against either the full path (e.g. "/data/1/meshes/E/x")
+# or against the path within the iteration (e.g. "meshes/E/x").
+
+# Example:
+# Let HDF5 datasets be automatically chunked by default
+[[hdf5.dataset]]
+cfg.chunks = "auto"
+
+# For particles, we can specify the chunking explicitly
+[[hdf5.dataset]]
+# Multiple selection regexes can be given as a list.
+# They will be fused into a single regex '($^)|(regex1)|(regex2)|(regex3)|...'.
+select = ["/data/1/particles/e/.*", "/data/2/particles/e/.*"]
+cfg.chunks = [5]
+
+# Selecting a match works top-down, the order of list entries is important.
+[[hdf5.dataset]]
+# Specifying only a single regex.
+# The regex can match against the full dataset path
+# or against the path within the Iteration.
+# Capitalization is irrelevant.
+select = "particles/e/.*"
+CFG.CHUNKS = [10]
 )END";
+#else
+    /*
+     * This is the same configuration in JSON. We need this in deprecated
+     * NVHPC-compilers due to problems that those compilers have with the
+     * toruniina::toml11 library.
+     */
+    std::string const defaults = R"(
+{
+  "backend": "hdf5",
+  "defer_iteration_parsing": true,
+  "iteration_encoding": "group_based",
+
+  "adios2": {
+    "engine": {
+      "type": "bp4"
+    },
+    "dataset": {
+      "operators": [
+        {
+          "parameters": {
+            "clevel": 5
+          },
+          "type": "zlib"
+        }
+      ]
+    }
+  },

+  "hdf5": {
+    "dataset": [
+      {
+        "cfg": {
+          "chunks": "auto"
+        }
+      },
+      {
+        "select": [
+          "/data/1/particles/e/.*",
+          "/data/2/particles/e/.*"
+        ],
+        "cfg": {
+          "chunks": [
+            5
+          ]
+        }
+      },
+      {
+        "select": "particles/e/.*",
+        "CFG": {
+          "CHUNKS": [
+            10
+          ]
+        }
+      }
+    ]
+  }
+}
+)";
+#endif

     // open file for writing
     Series series =
-        Series("../samples/dynamicConfig.bp", Access::CREATE, defaults);
+        Series("../samples/dynamicConfig.h5", Access::CREATE, defaults);

     Datatype datatype = determineDatatype<position_t>();
     constexpr unsigned long length = 10ul;
@@ -93,18 +187,14 @@ chunks = "auto"

     /*
      * We want different compression settings for this dataset, so we pass
-     * a dataset-specific configuration.
+     * a dataset-specific configuration. This will override any definition
+     * specified above.
      * Also showcase how to define a resizable dataset.
      * This time in JSON.
      */
     std::string const differentCompressionSettings = R"END(
 {
   "resizable": true,
-  "adios1": {
-    "dataset": {
-      "transform": "blosc:compressor=zlib,shuffle=bit,lvl=1;nometa"
-    }
-  },
   "adios2": {
     "dataset": {
       "operators": [