diff --git a/dataset_metadata/index.html b/dataset_metadata/index.html
index 1871754..a949c80 100755
--- a/dataset_metadata/index.html
+++ b/dataset_metadata/index.html
@@ -670,7 +670,7 @@ <h2 id="metadata">Metadata</h2>
 </tr>
 <tr>
 <td><em>Available samples</em></td>
-<td>141720670</td>
+<td>141392195</td>
 <td>153226273</td>
 <td>507739073</td>
 </tr>
diff --git a/search/search_index.json b/search/search_index.json
index a12d557..76e2eeb 100755
--- a/search/search_index.json
+++ b/search/search_index.json
@@ -1 +1 @@
-{"config":{"lang":["en"],"separator":"[\\s\\-]+","pipeline":["stopWordFilter"]},"docs":[{"location":"","title":"CESNET DataZoo","text":"<p>This is the documentation of the CESNET DataZoo project. </p> <p>The goal of this project is to provide tools for working with large network traffic datasets and to facilitate research in the traffic classification area. The core functions of the <code>cesnet-datazoo</code> package are:</p> <ul> <li>A common API for downloading, configuring, and loading of three public datasets of encrypted network traffic \u2014 CESNET-TLS22, CESNET-QUIC22, and CESNET-TLS-Year22. Details about the available datasets are on the dataset overview page.</li> <li>Provides standard features used for traffic classification, such as sizes, directions, and inter-packet times of the first 30 packets of each flow. More details on the data features page.</li> <li>Extensive configuration options for:<ul> <li>Selection of train, validation, and test periods. The datasets span from two weeks to one year; therefore, it is possible to evaluate classification methods in a time-based fashion that is closer to practical deployment.</li> <li>Selection of application classes and splitting classes between known and unknown. This enables research in the open-world setting, in which classification models need to handle new classes that were not seen during the training process.</li> <li>Data transformations, such as feature scaling. Transforms are implemented in a separate package CESNET Models. See <code>cesnet_models.transforms</code> documentation for details.</li> </ul> </li> <li>Built on suitable data structures for experiments with large datasets. There are several caching mechanisms to make repeated runs faster, for example, when searching for the best model configuration.</li> <li>Datasets are offered in multiple sizes to give users an option to start experiments at a smaller scale (also faster dataset download, disk space, etc.). 
The default is the <code>S</code> size containing 25 million samples. </li> </ul>"},{"location":"#papers","title":"Papers","text":"<ul> <li>DataZoo: Streamlining Traffic Classification Experiments  Jan Luxemburk and Karel Hynek  CoNEXT Workshop on Explainable and Safety Bounded, Fidelitous, Machine Learning for Networking (SAFE), 2023</li> </ul>"},{"location":"dataloaders/","title":"Using dataloaders","text":"<p>Apart from loading data into dataframes, the <code>cesnet-datazoo</code> package provides dataloaders for processing data in smaller batches.</p> <p>An example of how dataloaders can be used is in <code>cesnet_datazoo.datasets.loaders</code> or in the following snippet:</p> <pre><code>def load_from_dataloader(dataloader: DataLoader):\n    other_fields = []\n    data_ppi = []\n    data_flowstats = []\n    labels = []\n    for batch_other_fields, batch_ppi, batch_flowstats, batch_labels in dataloader:\n        other_fields.append(batch_other_fields)\n        data_ppi.append(batch_ppi)\n        data_flowstats.append(batch_flowstats)\n        labels.append(batch_labels)\n    df_other_fields = pd.concat(other_fields, ignore_index=True)\n    data_ppi = np.concatenate(data_ppi)\n    data_flowstats = np.concatenate(data_flowstats)\n    labels = np.concatenate(labels)\n    return df_other_fields, data_ppi, data_flowstats, labels\n</code></pre> <p>When a dataloader is iterated, the returned data are in the format <code>tuple(batch_other_fields,  batch_ppi, batch_flowstats, batch_labels)</code>. Batch size B is configured with <code>batch_size</code> and <code>test_batch_size</code> config options. The shapes are:</p> <ul> <li>batch_other_fields <code>pd.DataFrame (B, C)</code> - a Pandas DataFrame with auxiliary fields, such as communicating hosts, flow times, and more fields extracted from the ClientHello message. If the <code>return_other_fields</code> config option is false, this will be an empty DataFrame. 
Columns C depend on the used dataset and are available at <code>dataset_config.other_fields</code>.</li> <li>batch_ppi - <code>np.ndarray (B, [3, 4], 30)</code> - the middle dimension is either 4 when TCP push flags are used (<code>use_push_flags</code>) or 3 otherwise.</li> <li>batch_flowstats <code>np.ndarray (B, F)</code> - where F is the number of flowstats features computed with DatasetConfig.get_flowstats_features_len. To get the order and names of flowstats features, call DatasetConfig.get_flowstats_feature_names_expanded. The batch_flowstats array includes flow statistics, TCP features (if available and configured), and bins of packet histograms (if available and configured). See the data features page for more information about features.</li> <li>batch_labels <code>np.ndarray (B)</code> - integer labels encoded with a <code>LabelEncoder</code> instance available at <code>dataset.class_info.encoder</code>.</li> </ul> <p>PPI and flow statistics features returned from dataloaders are transformed depending on the selected configuration. 
See the transforms page for more information.</p>"},{"location":"dataset_metadata/","title":"DatasetMetadata","text":"<p>Each dataset class has its metadata available as a <code>DatasetMetadata</code> instance in the <code>metadata</code> attribute.</p>"},{"location":"dataset_metadata/#metadata","title":"Metadata","text":"Name CESNET-TLS22 CESNET-QUIC22 CESNET-TLS-Year22 Protocol TLS QUIC TLS Published in 2022 2023 2023 Collected in 2021 2022 2022 Collection duration 2 weeks 4 weeks 1 year Available samples 141720670 153226273 507739073 Available dataset sizes XS, S, M, L XS, S, M, L XS, S, M, L Collection period 4.10.2021 - 17.10.2021 31.10.2022 - 27.11.2022 1.1.2022 - 31.12.2022 Missing dates in collection period 20220128, 20220129, 20220130, 20221212, 20221213, 20221229, 20221230, 20221231 Application count 191 102 180 Background traffic classes default-background, google-background, facebook-background PPI features IPT, DIR, SIZE IPT, DIR, SIZE IPT, DIR, SIZE, PUSH_FLAG Flowstats features BYTES, BYTES_REV, PACKETS, PACKETS_REV, DURATION, PPI_LEN, PPI_ROUNDTRIPS, PPI_DURATION BYTES, BYTES_REV, PACKETS, PACKETS_REV, DURATION, PPI_LEN, PPI_ROUNDTRIPS, PPI_DURATION BYTES, BYTES_REV, PACKETS, PACKETS_REV, DURATION, PPI_LEN, PPI_ROUNDTRIPS, PPI_DURATION Flowstats features boolean FLOW_ENDREASON_IDLE, FLOW_ENDREASON_ACTIVE, FLOW_ENDREASON_OTHER FLOW_ENDREASON_IDLE, FLOW_ENDREASON_ACTIVE, FLOW_ENDREASON_END, FLOW_ENDREASON_OTHER Packet histograms PHIST_SRC_SIZES, PHIST_DST_SIZES, PHIST_SRC_IPT, PHIST_DST_IPT PHIST_SRC_SIZES, PHIST_DST_SIZES, PHIST_SRC_IPT, PHIST_DST_IPT TCP features FLAG_CWR, FLAG_CWR_REV, FLAG_ECE, FLAG_ECE_REV, FLAG_URG, FLAG_URG_REV, FLAG_ACK, FLAG_ACK_REV, FLAG_PSH, FLAG_PSH_REV, FLAG_RST, FLAG_RST_REV, FLAG_SYN, FLAG_SYN_REV, FLAG_FIN, FLAG_FIN_REV FLAG_CWR, FLAG_CWR_REV, FLAG_ECE, FLAG_ECE_REV, FLAG_URG, FLAG_URG_REV, FLAG_ACK, FLAG_ACK_REV, FLAG_PSH, FLAG_PSH_REV, FLAG_RST, FLAG_RST_REV, FLAG_SYN, FLAG_SYN_REV, FLAG_FIN, FLAG_FIN_REV Other 
fields ID ID, SRC_IP, DST_IP, DST_ASN, SRC_PORT, DST_PORT, PROTOCOL, QUIC_VERSION, QUIC_SNI, QUIC_USERAGENT, TIME_FIRST, TIME_LAST ID, SRC_IP, DST_IP, DST_ASN, DST_PORT, PROTOCOL, TLS_SNI, TLS_JA3, TIME_FIRST, TIME_LAST Cite https://doi.org/10.1016/j.comnet.2022.109467 https://doi.org/10.1016/j.dib.2023.108888 Zenodo URL https://zenodo.org/record/7965515 https://zenodo.org/record/7963302 Related papers https://doi.org/10.23919/TMA58422.2023.10199052"},{"location":"datasets_overview/","title":"Overview of datasets","text":""},{"location":"datasets_overview/#cesnet-tls22","title":"CESNET-TLS22","text":"<p>CESNET-TLS22</p> <ul> <li>TLS protocol</li> <li>Collected in 2021</li> <li>Spans two weeks</li> <li>Contains 141 million samples</li> <li>Has 191 application classes</li> </ul> <p>This dataset was published in \"Fine-grained TLS services classification with reject option\" (DOI, arXiv). It was built from live traffic collected using high-speed monitoring probes at the perimeter of the CESNET2 network.</p> <p>For detailed information about the dataset, see the linked paper and the dataset metadata page.</p>"},{"location":"datasets_overview/#cesnet-quic22","title":"CESNET-QUIC22","text":"<p>CESNET-QUIC22</p> <ul> <li>QUIC protocol</li> <li>Collected in 2022</li> <li>Spans four weeks</li> <li>Contains 153 million samples</li> <li>Has 102 application classes and three background traffic classes</li> </ul> <p>This dataset was published in \"CESNET-QUIC22: A large one-month QUIC network traffic dataset from backbone lines\" (DOI). The QUIC protocol has the potential to replace TLS over TCP as the standard protocol for reliable and secure Internet communication. Due to its design that makes the inspection of connection handshakes challenging and its usage in HTTP/3, there is an increasing demand for QUIC traffic classification methods.</p> <p>For detailed information about the dataset, see the linked paper and the dataset metadata page. 
Experiments based on this dataset were published in \"Encrypted traffic classification: the QUIC case\" (DOI).</p>"},{"location":"datasets_overview/#cesnet-tls-year22","title":"CESNET-TLS-Year22","text":"<p>CESNET-TLS-Year22</p> <ul> <li>TLS protocol</li> <li>Collected in 2022</li> <li>Spans one year</li> <li>Contains 507 million samples</li> <li>Has 180 application classes</li> </ul> <p>This dataset is similar to CESNET-TLS22; however, it spans the entire year 2022. It will be published in the near future.</p>"},{"location":"features/","title":"Features","text":"<p>This page provides a description of individual data features in the datasets. Features available in each dataset are listed on the dataset metadata page.</p>"},{"location":"features/#ppi-sequence","title":"PPI sequence","text":"<p>A per-packet information (PPI) sequence is a 2D matrix describing the first 30 packets of a flow. For flows shorter than 30 packets, the PPI sequence is padded with zeros. Set <code>use_push_flags</code> for using PUSH flags in PPI sequences, if available in the used dataset.</p> Name Description SIZE Size of the transport payload IPT Inter-packet time in milliseconds. The IPT of the first packet is set to zero DIR Direction of the packet encoded as \u00b11 PUSH_FLAG Whether the push flag was set in the TCP packet"},{"location":"features/#flow-statistics","title":"Flow statistics","text":"<p>Flow statistics are standard features describing the entire flow (with exceptions of PPI_ features that relate to the PPI sequence of the given flow). 
_REV features correspond to the reverse (server to client) direction.</p> Name Description DURATION Duration of the flow in seconds BYTES Number of transmitted bytes from client to server BYTES_REV Number of transmitted bytes from server to client PACKETS Number of packets transmitted from client to server PACKETS_REV Number of packets transmitted from server to client PPI_LEN Number of packets in the PPI sequence PPI_DURATION Duration of the PPI sequence in seconds PPI_ROUNDTRIPS Number of roundtrips in the PPI sequence FLOW_ENDREASON_IDLE Flow was terminated because it was idle FLOW_ENDREASON_ACTIVE Flow was terminated because it reached the active timeout FLOW_ENDREASON_OTHER Flow was terminated for other reasons"},{"location":"features/#packet-histograms","title":"Packet histograms","text":"<p>Packet histograms include binned counts of packet sizes and inter-packet times of the entire flow. There are 8 bins with a logarithmic scale; the intervals are 0\u201315, 16\u201331, 32\u201363, 64\u2013127, 128\u2013255, 256\u2013511, 512\u20131024, &gt;1024 [ms or B]. The units are milliseconds for inter-packet times and bytes for packet sizes. The histograms are built from all packets of the entire flow, unlike PPI sequences that describe the first 30 packets. Set <code>use_packet_histograms</code> for using packet histograms features, if available in the dataset.</p> Name Description PSIZE_BIN{x} Packet sizes histogram x-th bin for the forward direction PSIZE_BIN{x}_REV Packet sizes histogram x-th bin for the reverse direction IPT_BIN{x} Inter-packet times histogram x-th bin for the forward direction IPT_BIN{x}_REV Inter-packet times histogram x-th bin for the reverse direction <p>On the dataset metadata page, packet histogram features are called <code>PHIST_SRC_SIZES</code>, <code>PHIST_DST_SIZES</code>, <code>PHIST_SRC_IPT</code>, <code>PHIST_DST_IPT</code>. 
Those are the names of database columns that are flattened to the _BIN{x} features.</p>"},{"location":"features/#tcp-features","title":"TCP features","text":"<p>Datasets with TLS over TCP traffic contain features indicating the presence of individual TCP flags in the flow. Set <code>use_tcp_features</code> for using a subset of flags defined in <code>cesnet_datazoo.constants.SELECTED_TCP_FLAGS</code>.</p> Name Description FLAG_{F} Whether F flag was present in the forward (client to server) direction FLAG_{F}_REV Whether F flag was present in the reverse (server to client) direction"},{"location":"features/#other-fields","title":"Other fields","text":"<p>Datasets contain auxiliary information about samples, such as communicating hosts, flow times, and more fields extracted from the ClientHello message. The dataset metadata page lists available fields in individual datasets.  Set <code>return_other_fields</code> to include those fields in returned dataframes. See using dataloaders for how other fields are handled in dataloaders.</p> Name Description ID Per-dataset unique flow identifier TIME_FIRST Timestamp of the first packet TIME_LAST Timestamp of the last packet SRC_IP Source IP address DST_IP Destination IP address DST_ASN Destination Autonomous System number SRC_PORT Source port DST_PORT Destination port PROTOCOL Transport protocol TLS_SNI / QUIC_SNI Server Name Indication domain TLS_JA3 JA3 fingerprint QUIC_VERSION QUIC protocol version QUIC_USER_AGENT User agent string if available in the QUIC Initial Packet"},{"location":"features/#details-about-packet-histograms-and-ppi","title":"Details about packet histograms and PPI","text":"<p>Due to differences in implementation between packet sequences (pstats.cpp) and packet histogram (phist.cpp) plugins of the ipfixprobe exporter, the number of packets in PPI and histograms can differ (even for flows shorter than 30 packets). The differences are summarized in the following table. 
Note that this is related to TLS over TCP datasets.</p> TLS over TCP datasets Packet histograms PPI sequence PACKETS and PACKETS_REV Zero-length packets(without L4 payload, e.g. ACKs) Not included Not included Included Retransmissions(and out-of-order packets) Included Not included* Included Computed from Entire flow First 30 packets Entire flow <p>*The implementation for the detection of TCP retransmissions and out-of-order packets is far from perfect. Packets with a non-increasing SEQ number are skipped.</p> <p>For QUIC, there is no detection of retransmissions or out-of-order packets, and QUIC acknowledgment packets are included in both packet sequences and packet histograms.</p>"},{"location":"getting_started/","title":"Getting started","text":""},{"location":"getting_started/#jupyter-notebooks","title":"Jupyter notebooks","text":"<p>Example Jupyter notebooks are provided at https://github.com/CESNET/cesnet-tcexamples. Start with:</p> <ul> <li>Initialize the CESNET-QUIC22 dataset and explore its data features - explore_data.ipynb</li> <li>Training of a LightGBM classifier and its evaluation on a per-week and per-day basis - example_evaluation.ipynb</li> </ul>"},{"location":"getting_started/#code-snippets","title":"Code snippets","text":""},{"location":"getting_started/#download-a-dataset-and-compute-statistics","title":"Download a dataset and compute statistics","text":"<p><pre><code>from cesnet_datazoo.datasets import CESNET_QUIC22\ndataset = CESNET_QUIC22(\"/datasets/CESNET-QUIC22/\", size=\"XS\")\ndataset.compute_dataset_statistics(num_samples=100_000, num_workers=0)\n</code></pre> This will download the dataset, compute dataset statistics, and save them into <code>/datasets/CESNET-QUIC22/statistics</code>.</p>"},{"location":"getting_started/#enable-logging-and-set-the-spawn-method-on-windows","title":"Enable logging and set the spawn method on Windows","text":"<p><pre><code>import logging\nimport multiprocessing as mp\n\nmp.set_start_method(\"spawn\") 
\nlogging.basicConfig(\n    level=logging.INFO,\n    format=\"[%(asctime)s][%(name)s][%(levelname)s] - %(message)s\")\n</code></pre> For running on Windows, we recommend using the <code>spawn</code> method for creating dataloader worker processes. Set up logging to get more information from the package.</p>"},{"location":"getting_started/#initialize-dataset-to-create-train-validation-and-test-dataframes","title":"Initialize dataset to create train, validation, and test dataframes","text":"<pre><code>from cesnet_datazoo.datasets import CESNET_QUIC22\nfrom cesnet_datazoo.config import DatasetConfig, AppSelection\n\ndataset = CESNET_QUIC22(\"/datasets/CESNET-QUIC22/\", size=\"XS\")\ndataset_config = DatasetConfig(\n    dataset=dataset,\n    apps_selection=AppSelection.ALL_KNOWN,\n    train_period_name=\"W-2022-44\",\n    test_period_name=\"W-2022-45\",\n)\ndataset.set_dataset_config_and_initialize(dataset_config)\ntrain_dataframe = dataset.get_train_df()\nval_dataframe = dataset.get_val_df()\ntest_dataframe = dataset.get_test_df()\n</code></pre> <p>The <code>DatasetConfig</code> class handles the configuration of datasets, and calling <code>set_dataset_config_and_initialize</code> initializes train, validation, and test sets with the desired configuration. Data can be read into Pandas DataFrames as shown here or via PyTorch DataLoaders. 
See <code>CesnetDataset</code> reference.</p>"},{"location":"installation/","title":"Installation","text":"<p>Install the package from pip with:</p> <pre><code>pip install cesnet-datazoo\n</code></pre> <p>or for editable install with:</p> <pre><code>pip install -e git+https://github.com/CESNET/cesnet-datazoo\n</code></pre>"},{"location":"installation/#requirements","title":"Requirements","text":"<p>The <code>cesnet-datazoo</code> package requires Python &gt;=3.10.</p>"},{"location":"installation/#dependencies","title":"Dependencies","text":"Name Version matplotlib numpy pandas pydantic &gt;=2.0 PyYAML requests scikit-learn seaborn tables &gt;=3.8.0 torch &gt;=1.10 tqdm"},{"location":"reference_cesnet_dataset/","title":"Base dataset class","text":""},{"location":"reference_cesnet_dataset/#datasets.cesnet_dataset.CesnetDataset","title":"datasets.cesnet_dataset.CesnetDataset","text":"<p>The main class for accessing CESNET datasets. It handles downloading, train/validation/test splitting, and class selection. Access to data is provided through:</p> <ul> <li>Iterable PyTorch DataLoader for batch processing. See using dataloaders for more details.</li> <li>Pandas DataFrame for loading the entire train, validation, or test set at once.</li> </ul> <p>The dataset is stored in a PyTables database. The internal <code>PyTablesDataset</code> class is used as a wrapper that implements the PyTorch <code>Dataset</code> interface and is compatible with <code>DataLoader</code>, which provides efficient parallel loading of the data. The dataset configuration is done through the <code>DatasetConfig</code> class.</p> <p>Intended usage:</p> <ol> <li>Create an instance of the dataset class with the desired size and data root. This will download the dataset if it has not already been downloaded.</li> <li>Create an instance of <code>DatasetConfig</code> and set it with <code>set_dataset_config_and_initialize</code>. 
This will initialize the dataset \u2014 select classes, split data into train/validation/test sets, and fit data scalers if needed. All is done according to the provided configuration and is cached for later use.</li> <li>Use <code>get_train_dataloader</code> or <code>get_train_df</code> to get training data for a classification model.</li> <li>Validate the model and perform the hyperparameter optimalization on <code>get_val_dataloader</code> or <code>get_val_df</code>.</li> <li>Evaluate the model on <code>get_test_dataloader</code> or <code>get_test_df</code>.</li> </ol> <p>Parameters:</p> Name Type Description Default <code>data_root</code> <code>str</code> <p>Path to the folder where the dataset will be stored. Each dataset size has its own subfolder <code>data_root/size</code></p> required <code>size</code> <code>str</code> <p>Size of the dataset. Options are <code>XS</code>, <code>S</code>, <code>M</code>, <code>L</code>, <code>ORIG</code>.</p> <code>'S'</code> <code>silent</code> <code>bool</code> <p>Whether to suppress print and tqdm output.</p> <code>False</code> <p>Attributes:</p> Name Type Description <code>name</code> <code>str</code> <p>Name of the dataset.</p> <code>database_filename</code> <code>str</code> <p>Name of the database file.</p> <code>database_path</code> <code>str</code> <p>Path to the database file.</p> <code>servicemap_path</code> <code>str</code> <p>Path to the servicemap file.</p> <code>statistics_path</code> <code>str</code> <p>Path to the dataset statistics folder.</p> <code>bucket_url</code> <code>str</code> <p>URL of the bucket where the database is stored.</p> <code>metadata</code> <code>DatasetMetadata</code> <p>Additional dataset metadata.</p> <code>available_classes</code> <code>list[str]</code> <p>List of all available classes in the dataset.</p> <code>available_dates</code> <code>list[str]</code> <p>List of all available dates in the dataset.</p> <code>time_periods</code> <code>dict[str, list[str]]</code> <p>Predefined time 
periods. Each time period is a list of dates.</p> <code>default_train_period_name</code> <code>str</code> <p>Default time period for training.</p> <code>default_test_period_name</code> <code>str</code> <p>Default time period for testing.</p> <p>The following attributes are initialized when <code>set_dataset_config_and_initialize</code> is called.</p> <p>Attributes:</p> Name Type Description <code>dataset_config</code> <code>Optional[DatasetConfig]</code> <p>Configuration of the dataset.</p> <code>class_info</code> <code>Optional[ClassInfo]</code> <p>Structured information about the classes.</p> <code>dataset_indices</code> <code>Optional[IndicesTuple]</code> <p>Named tuple containing <code>train_indices</code>, <code>val_known_indices</code>, <code>val_unknown_indices</code>, <code>test_known_indices</code>, <code>test_unknown_indices</code>. These are the indices into PyTables database that define train, validation, and test sets.</p> <code>train_dataset</code> <code>Optional[PyTablesDataset]</code> <p>Train set in the form of <code>PyTablesDataset</code> instance wrapping the PyTables database.</p> <code>val_dataset</code> <code>Optional[PyTablesDataset]</code> <p>Validation set in the form of <code>PyTablesDataset</code> instance wrapping the PyTables database.</p> <code>test_dataset</code> <code>Optional[PyTablesDataset]</code> <p>Test set in the form of <code>PyTablesDataset</code> instance wrapping the PyTables database.</p> <code>known_app_counts</code> <code>Optional[DataFrame]</code> <p>Known application counts in the train, validation, and test sets.</p> <code>unknown_app_counts</code> <code>Optional[DataFrame]</code> <p>Unknown application counts in the validation and test sets.</p> <code>train_dataloader</code> <code>Optional[DataLoader]</code> <p>Iterable PyTorch <code>DataLoader</code> for training.</p> <code>train_dataloader_sampler</code> <code>Optional[Sampler]</code> <p>Sampler used for iterating the training dataloader. 
Either <code>RandomSampler</code> or <code>SequentialSampler</code>.</p> <code>train_dataloader_drop_last</code> <code>bool</code> <p>Whether to drop the last incomplete batch when iterating the training dataloader.</p> <code>val_dataloader</code> <code>Optional[DataLoader]</code> <p>Iterable PyTorch <code>DataLoader</code> for validation.</p> <code>test_dataloader</code> <code>Optional[DataLoader]</code> <p>Iterable PyTorch <code>DataLoader</code> for testing.</p> Source code in <code>cesnet_datazoo\\datasets\\cesnet_dataset.py</code> <pre><code>class CesnetDataset():\n    \"\"\"\n    The main class for accessing CESNET datasets. It handles downloading, train/validation/test splitting, and class selection. Access to data is provided through:\n\n    - Iterable PyTorch DataLoader for batch processing. See [using dataloaders][using-dataloaders] for more details.\n    - Pandas DataFrame for loading the entire train, validation, or test set at once.\n\n    The dataset is stored in a [PyTables](https://www.pytables.org/) database. The internal `PyTablesDataset` class is used as a wrapper\n    that implements the PyTorch [`Dataset`](https://pytorch.org/docs/stable/data.html#torch.utils.data.Dataset) interface\n    and is compatible with [`DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader),\n    which provides efficient parallel loading of the data. The dataset configuration is done through the [`DatasetConfig`][config.DatasetConfig] class.\n\n    **Intended usage:**\n\n    1. Create an instance of the [dataset class][dataset-classes] with the desired size and data root. This will download the dataset if it has not already been downloaded.\n    2. 
Create an instance of [`DatasetConfig`][config.DatasetConfig] and set it with [`set_dataset_config_and_initialize`][datasets.cesnet_dataset.CesnetDataset.set_dataset_config_and_initialize].\n    This will initialize the dataset \u2014 select classes, split data into train/validation/test sets, and fit data scalers if needed. All is done according to the provided configuration and is cached for later use.\n    3. Use [`get_train_dataloader`][datasets.cesnet_dataset.CesnetDataset.get_train_dataloader] or [`get_train_df`][datasets.cesnet_dataset.CesnetDataset.get_train_df] to get training data for a classification model.\n    4. Validate the model and perform the hyperparameter optimalization on [`get_val_dataloader`][datasets.cesnet_dataset.CesnetDataset.get_val_dataloader] or [`get_val_df`][datasets.cesnet_dataset.CesnetDataset.get_val_df].\n    5. Evaluate the model on [`get_test_dataloader`][datasets.cesnet_dataset.CesnetDataset.get_test_dataloader] or [`get_test_df`][datasets.cesnet_dataset.CesnetDataset.get_test_df].\n\n    Parameters:\n        data_root: Path to the folder where the dataset will be stored. Each dataset size has its own subfolder `data_root/size`\n        size: Size of the dataset. Options are `XS`, `S`, `M`, `L`, `ORIG`.\n        silent: Whether to suppress print and tqdm output.\n\n    Attributes:\n        name: Name of the dataset.\n        database_filename: Name of the database file.\n        database_path: Path to the database file.\n        servicemap_path: Path to the servicemap file.\n        statistics_path: Path to the dataset statistics folder.\n        bucket_url: URL of the bucket where the database is stored.\n        metadata: Additional [dataset metadata][metadata].\n        available_classes: List of all available classes in the dataset.\n        available_dates: List of all available dates in the dataset.\n        time_periods: Predefined time periods. 
Each time period is a list of dates.\n        default_train_period_name: Default time period for training.\n        default_test_period_name: Default time period for testing.\n\n    The following attributes are initialized when [`set_dataset_config_and_initialize`][datasets.cesnet_dataset.CesnetDataset.set_dataset_config_and_initialize] is called.\n\n    Attributes:\n        dataset_config: Configuration of the dataset.\n        class_info: Structured information about the classes.\n        dataset_indices: Named tuple containing `train_indices`, `val_known_indices`, `val_unknown_indices`, `test_known_indices`, `test_unknown_indices`. These are the indices into PyTables database that define train, validation, and test sets.\n        train_dataset: Train set in the form of `PyTablesDataset` instance wrapping the PyTables database.\n        val_dataset: Validation set in the form of `PyTablesDataset` instance wrapping the PyTables database.\n        test_dataset: Test set in the form of `PyTablesDataset` instance wrapping the PyTables database.\n        known_app_counts: Known application counts in the train, validation, and test sets.\n        unknown_app_counts: Unknown application counts in the validation and test sets.\n        train_dataloader: Iterable PyTorch [`DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader) for training.\n        train_dataloader_sampler: Sampler used for iterating the training dataloader. 
Either [`RandomSampler`](https://pytorch.org/docs/stable/data.html#torch.utils.data.RandomSampler) or [`SequentialSampler`](https://pytorch.org/docs/stable/data.html#torch.utils.data.SequentialSampler).\n        train_dataloader_drop_last: Whether to drop the last incomplete batch when iterating the training dataloader.\n        val_dataloader: Iterable PyTorch [`DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader) for validation.\n        test_dataloader: Iterable PyTorch [`DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader) for testing.\n    \"\"\"\n    data_root: str\n    size: str\n    silent: bool = False\n\n    name: str\n    database_filename: str\n    database_path: str\n    servicemap_path: str\n    statistics_path: str\n    bucket_url: str\n    metadata: DatasetMetadata\n    available_classes: list[str]\n    available_dates: list[str]\n    time_periods: dict[str, list[str]]\n    default_train_period_name: str\n    default_test_period_name: str\n\n    dataset_config: Optional[DatasetConfig] = None\n    class_info: Optional[ClassInfo] = None\n    dataset_indices: Optional[IndicesTuple] = None\n    train_dataset: Optional[PyTablesDataset] = None\n    val_dataset: Optional[PyTablesDataset] = None\n    test_dataset: Optional[PyTablesDataset] = None\n    known_app_counts: Optional[pd.DataFrame] = None\n    unknown_app_counts: Optional[pd.DataFrame] = None\n    train_dataloader: Optional[DataLoader] = None\n    train_dataloader_sampler: Optional[Sampler] = None\n    train_dataloader_drop_last: bool = True\n    val_dataloader: Optional[DataLoader] = None\n    test_dataloader: Optional[DataLoader] = None\n\n    _collate_fn: Optional[Callable] = None\n    _tables_app_enum: dict[int, str]\n    _tables_cat_enum: dict[int, str]\n\n    def __init__(self, data_root: str, size: str = \"S\", database_checks_at_init: bool = False, silent: bool = False) -&gt; None:\n        self.silent = silent\n        
self.metadata = load_metadata(self.name)\n        self.size = size\n        if self.size != \"ORIG\":\n            if size not in self.metadata.available_dataset_sizes:\n                raise ValueError(f\"Unknown dataset size {self.size}\")\n            self.name = f\"{self.name}-{self.size}\"\n            filename, ext = os.path.splitext(self.database_filename)\n            self.database_filename = f\"{filename}-{self.size}{ext}\"\n        self.data_root = os.path.normpath(os.path.expanduser(os.path.join(data_root, self.size)))\n        self.database_path = os.path.join(self.data_root, self.database_filename)\n        self.servicemap_path = os.path.join(self.data_root, SERVICEMAP_FILE)\n        self.statistics_path = os.path.join(self.data_root, \"statistics\")\n        if not os.path.exists(self.data_root):\n            os.makedirs(self.data_root)\n        if not self._is_downloaded():\n            self._download()\n        if database_checks_at_init:\n            with tb.open_file(self.database_path, mode=\"r\") as database:\n                tables_paths = list(map(lambda x: x._v_pathname, iter(database.get_node(f\"/flows\"))))\n                num_samples = 0\n                for p in tables_paths:\n                    table = database.get_node(p)\n                    assert isinstance(table, tb.Table)\n                    if self._tables_app_enum != {v: k for k, v in dict(table.get_enum(APP_COLUMN)).items()}:\n                        raise ValueError(f\"Found mismatch between _tables_app_enum and the PyTables database enum in table {p}. Please report this issue.\")\n                    if self._tables_cat_enum != {v: k for k, v in dict(table.get_enum(CATEGORY_COLUMN)).items()}:\n                        raise ValueError(f\"Found mismatch between _tables_cat_enum and the PyTables database enum in table {p}. 
Please report this issue.\")\n                    num_samples += len(table)\n                if self.size == \"ORIG\" and num_samples != self.metadata.available_samples:\n                    raise ValueError(f\"Expected {self.metadata.available_samples} samples, but got {num_samples} in the database. Please delete the data root folder, update cesnet-datazoo, and redownload the dataset.\")\n                if self.size != \"ORIG\" and num_samples != DATASET_SIZES[self.size]:\n                    raise ValueError(f\"Expected {DATASET_SIZES[self.size]} samples, but got {num_samples} in the database. Please delete the data root folder, update cesnet-datazoo, and redownload the dataset.\")\n                if self.available_dates != list(map(lambda x: x.removeprefix(\"/flows/D\"), tables_paths)):\n                    raise ValueError(f\"Found mismatch between available_dates and the dates available in the PyTables database. Please report this issue.\")\n        # Add all available dates as single date time periods\n        for d in self.available_dates:\n            self.time_periods[d] = [d]\n        available_applications = sorted([app for app in pd.read_csv(self.servicemap_path, index_col=\"Tag\").index if not is_background_app(app)])\n        if len(available_applications) != self.metadata.application_count:\n            raise ValueError(f\"Found {len(available_applications)} applications in the servicemap (omitting background traffic classes), but expected {self.metadata.application_count}. Please report this issue.\")\n        self.available_classes = available_applications + self.metadata.background_traffic_classes\n\n    def set_dataset_config_and_initialize(self, dataset_config: DatasetConfig, disable_indices_cache: bool = False) -&gt; None:\n        \"\"\"\n        Initialize train, validation, and test sets. 
Data cannot be accessed before calling this method.\n\n        Parameters:\n            dataset_config: Desired configuration of the dataset.\n            disable_indices_cache: Whether to disable caching of the dataset indices. This is useful when the dataset is used in many different configurations and you want to save disk space.\n        \"\"\"\n        self.dataset_config = dataset_config\n        self._clear()\n        self._initialize_train_val_test(disable_indices_cache=disable_indices_cache)\n\n    def get_train_dataloader(self) -&gt; DataLoader:\n        \"\"\"\n        Provides a PyTorch [`DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader) for training. The dataloader is created on the first call and then cached.\n        When the dataloader is iterated in random order, the last incomplete batch is dropped.\n        The dataloader is configured with the following config attributes:\n\n        | Dataset config               | Description                                                                                |\n        | ---------------------------- | ------------------------------------------------------------------------------------------ |\n        | `batch_size`                 | Number of samples per batch.                                                               |\n        | `train_workers`              | Number of workers for loading train data.                                                  |\n        | `train_dataloader_order`     | Whether to load train data in sequential or random order. See [config.DataLoaderOrder][].  |\n        | `train_dataloader_seed`      | Seed for loading train data in random order.                                               |\n\n        Returns:\n            Train data as an iterable dataloader. 
See [using dataloaders][using-dataloaders] for more details.\n        \"\"\"\n        if self.dataset_config is None:\n            raise ValueError(\"Dataset is not initialized, use set_dataset_config_and_initialize() before getting train dataloader\")\n        if not self.dataset_config.need_train_set:\n            raise ValueError(\"Train dataloader is not available when need_train_set is false\")\n        assert self.train_dataset\n        if self.train_dataloader:\n            return self.train_dataloader\n        # Create sampler according to the selected order\n        if self.dataset_config.train_dataloader_order == DataLoaderOrder.RANDOM:\n            if self.dataset_config.train_dataloader_seed is not None:\n                generator = torch.Generator()\n                generator.manual_seed(self.dataset_config.train_dataloader_seed)\n            else:\n                generator = None\n            self.train_dataloader_sampler = RandomSampler(self.train_dataset, generator=generator)\n            self.train_dataloader_drop_last = True\n        elif self.dataset_config.train_dataloader_order == DataLoaderOrder.SEQUENTIAL:\n            self.train_dataloader_sampler = SequentialSampler(self.train_dataset)\n            self.train_dataloader_drop_last = False\n        else: assert_never(self.dataset_config.train_dataloader_order)\n        # Create dataloader\n        batch_sampler = BatchSampler(sampler=self.train_dataloader_sampler, batch_size=self.dataset_config.batch_size, drop_last=self.train_dataloader_drop_last)\n        train_dataloader = DataLoader(\n            self.train_dataset,\n            num_workers=self.dataset_config.train_workers,\n            worker_init_fn=worker_init_fn,\n            collate_fn=self._collate_fn,\n            persistent_workers=self.dataset_config.train_workers &gt; 0,\n            batch_size=None,\n            sampler=batch_sampler,)\n        if self.dataset_config.train_workers == 0:\n            
self.train_dataset.pytables_worker_init()\n        self.train_dataloader = train_dataloader\n        return train_dataloader\n\n    def get_val_dataloader(self) -&gt; DataLoader:\n        \"\"\"\n        Provides a PyTorch [`DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader) for validation.\n        The dataloader is created on the first call and then cached.\n        The dataloader is configured with the following config attributes:\n\n        | Dataset config    | Description                                                       |\n        | ------------------| ------------------------------------------------------------------|\n        | `test_batch_size` | Number of samples per batch for loading validation and test data. |\n        | `val_workers`     | Number of workers for loading validation data.                    |\n\n        Returns:\n            Validation data as an iterable dataloader. See [using dataloaders][using-dataloaders] for more details.\n        \"\"\"\n        if self.dataset_config is None:\n            raise ValueError(\"Dataset is not initialized, use set_dataset_config_and_initialize() before getting validation dataloader\")\n        if not self.dataset_config.need_val_set:\n            raise ValueError(\"Validation dataloader is not available when need_val_set is false\")\n        assert self.val_dataset is not None\n        if self.val_dataloader:\n            return self.val_dataloader\n        batch_sampler = BatchSampler(sampler=SequentialSampler(self.val_dataset), batch_size=self.dataset_config.test_batch_size, drop_last=False)\n        val_dataloader = DataLoader(\n            self.val_dataset,\n            num_workers=self.dataset_config.val_workers,\n            worker_init_fn=worker_init_fn,\n            collate_fn=self._collate_fn,\n            persistent_workers=self.dataset_config.val_workers &gt; 0,\n            batch_size=None,\n            sampler=batch_sampler,)\n        if 
self.dataset_config.val_workers == 0:\n            self.val_dataset.pytables_worker_init()\n        self.val_dataloader = val_dataloader\n        return val_dataloader\n\n    def get_test_dataloader(self) -&gt; DataLoader:\n        \"\"\"\n        Provides a PyTorch [`DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader) for testing.\n        The dataloader is created on the first call and then cached.\n\n        When the dataset is used in the open-world setting, and unknown classes are defined,\n        the test dataloader returns `test_known_size` samples of known classes followed by `test_unknown_size` samples of unknown classes.\n\n        The dataloader is configured with the following config attributes:\n\n        | Dataset config    | Description                                                       |\n        | ------------------| ------------------------------------------------------------------|\n        | `test_batch_size` | Number of samples per batch for loading validation and test data. |\n        | `test_workers`    | Number of workers for loading test data.                          |\n\n        Returns:\n            Test data as an iterable dataloader. 
See [using dataloaders][using-dataloaders] for more details.\n        \"\"\"\n        if self.dataset_config is None:\n            raise ValueError(\"Dataset is not initialized, use set_dataset_config_and_initialize() before getting test dataloader\")\n        if not self.dataset_config.need_test_set:\n            raise ValueError(\"Test dataloader is not available when need_test_set is false\")\n        assert self.test_dataset is not None\n        if self.test_dataloader:\n            return self.test_dataloader\n        batch_sampler = BatchSampler(sampler=SequentialSampler(self.test_dataset), batch_size=self.dataset_config.test_batch_size, drop_last=False)\n        test_dataloader = DataLoader(\n            self.test_dataset,\n            num_workers=self.dataset_config.test_workers,\n            worker_init_fn=worker_init_fn,\n            collate_fn=self._collate_fn,\n            persistent_workers=False,\n            batch_size=None,\n            sampler=batch_sampler,)\n        if self.dataset_config.test_workers == 0:\n            self.test_dataset.pytables_worker_init()\n        self.test_dataloader = test_dataloader\n        return test_dataloader\n\n    def get_dataloaders(self) -&gt; tuple[DataLoader, DataLoader, DataLoader]:\n        \"\"\"Gets train, validation, and test dataloaders in one call.\"\"\"\n        if self.dataset_config is None:\n            raise ValueError(\"Dataset is not initialized, use set_dataset_config_and_initialize() before getting dataloaders\")\n        train_dataloader = self.get_train_dataloader()\n        val_dataloader = self.get_val_dataloader()\n        test_dataloader = self.get_test_dataloader()\n        return train_dataloader, val_dataloader, test_dataloader\n\n    def get_train_df(self, flatten_ppi: bool = False) -&gt; pd.DataFrame:\n        \"\"\"\n        Creates a train Pandas [`DataFrame`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html). The dataframe is in sequential (datetime) order. 
Consider shuffling the dataframe if needed.\n\n        !!! warning \"Memory usage\"\n\n            The whole train set is loaded into memory. If the dataset size is larger than `'S'`, consider using `get_train_dataloader` instead.\n\n        Parameters:\n            flatten_ppi: Whether to flatten the PPI sequence into individual columns (named `IPT_X`, `DIR_X`, `SIZE_X`, `PUSH_X`, *X* being the index of the packet) or keep one `PPI` column with 2D data.\n\n        Returns:\n            Train data as a dataframe.\n        \"\"\"\n        self._check_before_dataframe(check_train=True)\n        assert self.dataset_config is not None and self.train_dataset is not None\n        if len(self.train_dataset) &gt; DATAFRAME_SAMPLES_WARNING_THRESHOLD:\n            warnings.warn(f\"Train set has ({len(self.train_dataset)} samples), consider using get_train_dataloader() instead\")\n        train_dataloader = self.get_train_dataloader()\n        assert isinstance(train_dataloader.sampler, BatchSampler) and self.train_dataloader_sampler is not None\n        # Read dataloader in sequential order\n        train_dataloader.sampler.sampler = SequentialSampler(self.train_dataset)\n        train_dataloader.sampler.drop_last = False\n        feature_names = self.dataset_config.get_feature_names(flatten_ppi=flatten_ppi)\n        df = create_df_from_dataloader(dataloader=train_dataloader,\n                                       feature_names=feature_names,\n                                       flatten_ppi=flatten_ppi,\n                                       silent=self.silent)\n        # Restore the original dataloader sampler and drop_last\n        train_dataloader.sampler.sampler = self.train_dataloader_sampler\n        train_dataloader.sampler.drop_last = self.train_dataloader_drop_last\n        return df\n\n    def get_val_df(self, flatten_ppi: bool = False) -&gt; pd.DataFrame:\n        \"\"\"\n        Creates validation Pandas 
[`DataFrame`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html). The dataframe is in sequential (datetime) order.\n\n        !!! warning \"Memory usage\"\n\n            The whole validation set is loaded into memory. If the dataset size is larger than `'S'`, consider using `get_val_dataloader` instead.\n\n        Parameters:\n            flatten_ppi: Whether to flatten the PPI sequence into individual columns (named `IPT_X`, `DIR_X`, `SIZE_X`, `PUSH_X`, *X* being the index of the packet) or keep one `PPI` column with 2D data.\n\n        Returns:\n            Validation data as a dataframe.\n        \"\"\"\n        self._check_before_dataframe(check_val=True)\n        assert self.dataset_config is not None and self.val_dataset is not None\n        if len(self.val_dataset) &gt; DATAFRAME_SAMPLES_WARNING_THRESHOLD:\n            warnings.warn(f\"Validation set has ({len(self.val_dataset)} samples), consider using get_val_dataloader() instead\")\n        feature_names = self.dataset_config.get_feature_names(flatten_ppi=flatten_ppi)\n        return create_df_from_dataloader(dataloader=self.get_val_dataloader(),\n                                         feature_names=feature_names,\n                                         flatten_ppi=flatten_ppi,\n                                         silent=self.silent)\n\n    def get_test_df(self, flatten_ppi: bool = False) -&gt; pd.DataFrame:\n        \"\"\"\n        Creates test Pandas [`DataFrame`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html). The dataframe is in sequential (datetime) order.\n\n\n        When the dataset is used in the open-world setting, and unknown classes are defined,\n        the returned test dataframe is composed of `test_known_size` samples of known classes followed by `test_unknown_size` samples of unknown classes.\n\n\n        !!! warning \"Memory usage\"\n\n            The whole test set is loaded into memory. 
If the dataset size is larger than `'S'`, consider using `get_test_dataloader` instead.\n\n        Parameters:\n            flatten_ppi: Whether to flatten the PPI sequence into individual columns (named `IPT_X`, `DIR_X`, `SIZE_X`, `PUSH_X`, *X* being the index of the packet) or keep one `PPI` column with 2D data.\n\n        Returns:\n            Test data as a dataframe.\n        \"\"\"\n        self._check_before_dataframe(check_test=True)\n        assert self.dataset_config is not None and self.test_dataset is not None\n        if len(self.test_dataset) &gt; DATAFRAME_SAMPLES_WARNING_THRESHOLD:\n            warnings.warn(f\"Test set has ({len(self.test_dataset)} samples), consider using get_test_dataloader() instead\")\n        feature_names = self.dataset_config.get_feature_names(flatten_ppi=flatten_ppi)\n        return create_df_from_dataloader(dataloader=self.get_test_dataloader(),\n                                         feature_names=feature_names,\n                                         flatten_ppi=flatten_ppi,\n                                         silent=self.silent)\n\n    def get_num_classes(self) -&gt; int:\n        \"\"\"Returns the number of classes in the current configuration of the dataset.\"\"\"\n        if self.class_info is None:\n            raise ValueError(\"Dataset is not initialized, use set_dataset_config_and_initialize() before getting the number of classes\")\n        return self.class_info.num_classes\n\n    def get_known_apps(self) -&gt; list[str]:\n        \"\"\"Returns the list of known applications in the current configuration of the dataset.\"\"\"\n        if self.class_info is None:\n            raise ValueError(\"Dataset is not initialized, use set_dataset_config_and_initialize() before getting known apps\")\n        return self.class_info.known_apps\n\n    def get_unknown_apps(self) -&gt; list[str]:\n        \"\"\"Returns the list of unknown applications in the current configuration of the dataset.\"\"\"\n        if 
self.class_info is None:\n            raise ValueError(\"Dataset is not initialized, use set_dataset_config_and_initialize() before getting unknown apps\")\n        return self.class_info.unknown_apps\n\n    def compute_dataset_statistics(self, num_samples: int | Literal[\"all\"] = 10_000_000, num_workers: int = 4, batch_size: int = 16384, disabled_apps: Optional[list[str]] = None) -&gt; None:\n        \"\"\"\n        Computes dataset statistics and saves them to the `statistics_path` folder.\n\n        Parameters:\n            num_samples: Number of samples to use for computing the statistics.\n            num_workers: Number of workers for loading data.\n            batch_size: Number of samples per batch for loading data.\n            disabled_apps: List of applications to exclude from the statistics.\n        \"\"\"\n        if disabled_apps:\n            bad_disabled_apps = [a for a in disabled_apps if a not in self.available_classes]\n            if len(bad_disabled_apps) &gt; 0:\n                raise ValueError(f\"Bad applications in disabled_apps {bad_disabled_apps}. 
Use applications available in dataset.available_classes\")\n        if not os.path.exists(self.statistics_path):\n            os.mkdir(self.statistics_path)\n        compute_dataset_statistics(database_path=self.database_path,\n                                   tables_app_enum=self._tables_app_enum,\n                                   tables_cat_enum=self._tables_cat_enum,\n                                   output_dir=self.statistics_path,\n                                   packet_histograms=self.metadata.packet_histograms,\n                                   flowstats_features_boolean=self.metadata.flowstats_features_boolean,\n                                   protocol=self.metadata.protocol,\n                                   extra_fields=not self.name.startswith(\"CESNET-TLS22\"),\n                                   disabled_apps=disabled_apps if disabled_apps is not None else [],\n                                   num_samples=num_samples,\n                                   num_workers=num_workers,\n                                   batch_size=batch_size,\n                                   silent=self.silent)\n\n    def _generate_time_periods(self) -&gt; None:\n        time_periods = {}\n        for period in self.time_periods:\n            time_periods[period] = []\n            if period.startswith(\"W\"):\n                split = period.split(\"-\")\n                collection_year, week = int(split[1]), int(split[2])\n                for d in range(1, 8):\n                    s = datetime.date.fromisocalendar(collection_year, week, d).strftime(\"%Y%m%d\")\n                    # last week of a year can span into the following year\n                    if s not in self.metadata.missing_dates_in_collection_period and s.startswith(str(collection_year)):\n                        time_periods[period].append(s)\n            elif period.startswith(\"M\"):\n                split = period.split(\"-\")\n                collection_year, month = int(split[1]), 
int(split[2])\n                # monthrange()[1] is the number of days in the month; add 1 so the last day is included\n                for d in range(1, calendar.monthrange(collection_year, month)[1] + 1):\n                    s = datetime.date(collection_year, month, d).strftime(\"%Y%m%d\")\n                    if s not in self.metadata.missing_dates_in_collection_period:\n                        time_periods[period].append(s)\n        self.time_periods = time_periods\n\n    def _is_downloaded(self) -&gt; bool:\n        \"\"\"Servicemap is downloaded after the database; thus if it exists, the database is also downloaded\"\"\"\n        return os.path.exists(self.servicemap_path) and os.path.exists(self.database_path)\n\n    def _download(self) -&gt; None:\n        if not self.silent:\n            print(f\"Downloading {self.name} dataset\")\n        database_url = f\"{self.bucket_url}&amp;file={self.database_filename}\"\n        servicemap_url = f\"{self.bucket_url}&amp;file={SERVICEMAP_FILE}\"\n        resumable_download(url=database_url, file_path=self.database_path, silent=self.silent)\n        simple_download(url=servicemap_url, file_path=self.servicemap_path)\n\n    def _clear(self) -&gt; None:\n        self.class_info = None\n        self.dataset_indices = None\n        self.train_dataset = None\n        self.val_dataset = None\n        self.test_dataset = None\n        self.known_app_counts = None\n        self.unknown_app_counts = None\n        self.train_dataloader = None\n        self.train_dataloader_sampler = None\n        self.train_dataloader_drop_last = True\n        self.val_dataloader = None\n        self.test_dataloader = None\n        self._collate_fn = None\n\n    def _check_before_dataframe(self, check_train: bool = False, check_val: bool = False, check_test: bool = False) -&gt; None:\n        if self.dataset_config is None:\n            raise ValueError(\"Dataset is not initialized, use set_dataset_config_and_initialize() before getting a dataframe\")\n        if self.dataset_config.return_tensors:\n            raise ValueError(\"Dataframes are not 
available when return_tensors is set. Use a dataloader instead.\")\n        if check_train and not self.dataset_config.need_train_set:\n            raise ValueError(\"Train dataframe is not available when need_train_set is false\")\n        if check_val and not self.dataset_config.need_val_set:\n            raise ValueError(\"Validation dataframe is not available when need_val_set is false\")\n        if check_test and not self.dataset_config.need_test_set:\n            raise ValueError(\"Test dataframe is not available when need_test_set is false\")\n\n    def _initialize_train_val_test(self, disable_indices_cache: bool = False) -&gt; None:\n        assert self.dataset_config is not None\n        dataset_config = self.dataset_config\n        servicemap = pd.read_csv(dataset_config.servicemap_path, index_col=\"Tag\")\n        # Initialize train set\n        if dataset_config.need_train_set:\n            train_indices, train_unknown_indices, known_apps, unknown_apps = init_or_load_train_indices(dataset_config=dataset_config,\n                                                                                                        tables_app_enum=self._tables_app_enum,\n                                                                                                        servicemap=servicemap,\n                                                                                                        disable_indices_cache=disable_indices_cache,)\n            # Date weight sampling of train indices\n            if dataset_config.train_dates_weigths is not None:\n                assert dataset_config.train_size != \"all\"\n                if dataset_config.val_approach == ValidationApproach.SPLIT_FROM_TRAIN:\n                    # requested number of samples is train_size + val_known_size when using the split-from-train validation approach\n                    assert dataset_config.val_known_size != \"all\"\n                    num_samples = dataset_config.train_size + 
dataset_config.val_known_size\n                else:\n                    num_samples = dataset_config.train_size\n                if num_samples &gt; len(train_indices):\n                    raise ValueError(f\"Requested number of samples for weight sampling ({num_samples}) is larger than the number of available train samples ({len(train_indices)})\")\n                train_indices = date_weight_sample_train_indices(dataset_config=dataset_config, train_indices=train_indices, num_samples=num_samples)\n        elif dataset_config.apps_selection == AppSelection.FIXED:\n            known_apps = dataset_config.apps_selection_fixed_known\n            unknown_apps = dataset_config.apps_selection_fixed_unknown\n            train_indices = np.zeros((0,3), dtype=np.int64)\n            train_unknown_indices = np.zeros((0,3), dtype=np.int64)\n        else:\n            raise ValueError(\"Either need train set or the fixed application selection\")\n        # Initialize validation set\n        if dataset_config.need_val_set:\n            if dataset_config.val_approach == ValidationApproach.VALIDATION_DATES:\n                val_known_indices, val_unknown_indices, val_data_path = init_or_load_val_indices(dataset_config=dataset_config,\n                                                                                                 known_apps=known_apps,\n                                                                                                 unknown_apps=unknown_apps,\n                                                                                                 tables_app_enum=self._tables_app_enum,\n                                                                                                 disable_indices_cache=disable_indices_cache,)\n            elif dataset_config.val_approach == ValidationApproach.SPLIT_FROM_TRAIN:\n                train_val_rng = get_fresh_random_generator(dataset_config=dataset_config, section=RandomizedSection.TRAIN_VAL_SPLIT)\n           
     val_data_path = dataset_config._get_train_data_path()\n                val_unknown_indices = train_unknown_indices\n                train_labels = train_indices[:, INDICES_LABEL_POS]\n                if dataset_config.train_dates_weigths is not None:\n                    assert dataset_config.val_known_size != \"all\"\n                    # When weight sampling is used, val_known_size is kept, but the resulting train size can be smaller because some train dates do not have enough samples\n                    if dataset_config.val_known_size &gt; len(train_indices):\n                        raise ValueError(f\"Requested validation size ({dataset_config.val_known_size}) is larger than the number of available train samples after weight sampling ({len(train_indices)})\")\n                    train_indices, val_known_indices = train_test_split(train_indices, test_size=dataset_config.val_known_size, stratify=train_labels, shuffle=True, random_state=train_val_rng)\n                    dataset_config.train_size = len(train_indices)\n                elif dataset_config.train_size == \"all\" and dataset_config.val_known_size == \"all\":\n                    train_indices, val_known_indices = train_test_split(train_indices, test_size=dataset_config.train_val_split_fraction, stratify=train_labels, shuffle=True, random_state=train_val_rng)\n                else:\n                    if dataset_config.val_known_size != \"all\" and dataset_config.train_size != \"all\" and dataset_config.train_size + dataset_config.val_known_size &gt; len(train_indices):\n                        raise ValueError(f\"Requested train size + validation size ({dataset_config.train_size + dataset_config.val_known_size}) is larger than the number of available train samples ({len(train_indices)})\")\n                    if dataset_config.train_size != \"all\" and dataset_config.train_size &gt; len(train_indices):\n                        raise ValueError(f\"Requested train size ({dataset_config.train_size}) 
is larger than the number of available train samples ({len(train_indices)})\")\n                    if dataset_config.val_known_size != \"all\" and dataset_config.val_known_size &gt; len(train_indices):\n                        raise ValueError(f\"Requested validation size ({dataset_config.val_known_size}) is larger than the number of available train samples ({len(train_indices)})\")\n                    train_indices, val_known_indices = train_test_split(train_indices,\n                                                                        train_size=dataset_config.train_size if dataset_config.train_size != \"all\" else None,\n                                                                        test_size=dataset_config.val_known_size if dataset_config.val_known_size != \"all\" else None,\n                                                                        stratify=train_labels, shuffle=True, random_state=train_val_rng)\n        else:\n            val_known_indices = np.zeros((0,3), dtype=np.int64)\n            val_unknown_indices = np.zeros((0,3), dtype=np.int64)\n            val_data_path = None\n        # Initialize test set\n        if dataset_config.need_test_set:\n            test_known_indices, test_unknown_indices, test_data_path = init_or_load_test_indices(dataset_config=dataset_config,\n                                                                                                 known_apps=known_apps,\n                                                                                                 unknown_apps=unknown_apps,\n                                                                                                 tables_app_enum=self._tables_app_enum,\n                                                                                                 disable_indices_cache=disable_indices_cache,)\n        else:\n            test_known_indices = np.zeros((0,3), dtype=np.int64)\n            test_unknown_indices = np.zeros((0,3), 
dtype=np.int64)\n            test_data_path = None\n        # Fit scalers if needed\n        if (dataset_config.ppi_transform is not None and dataset_config.ppi_transform.needs_fitting or\n            dataset_config.flowstats_transform is not None and dataset_config.flowstats_transform.needs_fitting):\n            if not dataset_config.need_train_set:\n                raise ValueError(\"Train set is needed to fit the scalers. Provide pre-fitted scalers.\")\n            fit_scalers(dataset_config=dataset_config, train_indices=train_indices)\n        # Subset dataset indices based on the selected sizes and compute application counts\n        dataset_indices = IndicesTuple(train_indices=train_indices, val_known_indices=val_known_indices, val_unknown_indices=val_unknown_indices, test_known_indices=test_known_indices, test_unknown_indices=test_unknown_indices)\n        dataset_indices = subset_and_sort_indices(dataset_config=dataset_config, dataset_indices=dataset_indices)\n        known_app_counts = compute_known_app_counts(dataset_indices=dataset_indices, tables_app_enum=self._tables_app_enum)\n        unknown_app_counts = compute_unknown_app_counts(dataset_indices=dataset_indices, tables_app_enum=self._tables_app_enum)\n        # Combine known and unknown test indices to create a single dataloader\n        assert isinstance(dataset_config.test_unknown_size, int)\n        if dataset_config.test_unknown_size &gt; 0 and len(unknown_apps) &gt; 0:\n            test_combined_indices = np.concatenate((dataset_indices.test_known_indices, dataset_indices.test_unknown_indices))\n        else:\n            test_combined_indices = dataset_indices.test_known_indices\n        # Create the label encoder and the class info structure\n        encoder = LabelEncoder().fit(known_apps)\n        encoder.classes_ = np.append(encoder.classes_, UNKNOWN_STR_LABEL)\n        class_info = create_class_info(servicemap=servicemap, encoder=encoder, known_apps=known_apps, unknown_apps=unknown_apps)\n        
encode_labels_with_unknown_fn = partial(_encode_labels_with_unknown, encoder=encoder, class_info=class_info)\n        # Create train, validation, and test datasets\n        train_dataset = val_dataset = test_dataset = None\n        if dataset_config.need_train_set:\n            train_dataset = PyTablesDataset(\n                database_path=dataset_config.database_path,\n                tables_paths=dataset_config._get_train_tables_paths(),\n                indices=dataset_indices.train_indices,\n                tables_app_enum=self._tables_app_enum,\n                tables_cat_enum=self._tables_cat_enum,\n                flowstats_features=dataset_config.flowstats_features,\n                flowstats_features_boolean=dataset_config.flowstats_features_boolean,\n                flowstats_features_phist=dataset_config.flowstats_features_phist,\n                other_fields=self.dataset_config.other_fields,\n                ppi_channels=dataset_config.get_ppi_channels(),\n                ppi_transform=dataset_config.ppi_transform,\n                flowstats_transform=dataset_config.flowstats_transform,\n                flowstats_phist_transform=dataset_config.flowstats_phist_transform,\n                target_transform=encode_labels_with_unknown_fn,\n                return_tensors=dataset_config.return_tensors,)\n        if dataset_config.need_val_set:\n            assert val_data_path is not None\n            val_dataset = PyTablesDataset(\n                database_path=dataset_config.database_path,\n                tables_paths=dataset_config._get_train_tables_paths(),\n                indices=dataset_indices.val_known_indices,\n                tables_app_enum=self._tables_app_enum,\n                tables_cat_enum=self._tables_cat_enum,\n                flowstats_features=dataset_config.flowstats_features,\n                flowstats_features_boolean=dataset_config.flowstats_features_boolean,\n                
flowstats_features_phist=dataset_config.flowstats_features_phist,\n                other_fields=self.dataset_config.other_fields,\n                ppi_channels=dataset_config.get_ppi_channels(),\n                ppi_transform=dataset_config.ppi_transform,\n                flowstats_transform=dataset_config.flowstats_transform,\n                flowstats_phist_transform=dataset_config.flowstats_phist_transform,\n                target_transform=encode_labels_with_unknown_fn,\n                return_tensors=dataset_config.return_tensors,\n                preload=dataset_config.preload_val,\n                preload_blob=os.path.join(val_data_path, \"preload\", f\"val_dataset-{dataset_config.val_known_size}.npz\"),)\n        if dataset_config.need_test_set:\n            assert test_data_path is not None\n            test_dataset = PyTablesDataset(\n                database_path=dataset_config.database_path,\n                tables_paths=dataset_config._get_test_tables_paths(),\n                indices=test_combined_indices,\n                tables_app_enum=self._tables_app_enum,\n                tables_cat_enum=self._tables_cat_enum,\n                flowstats_features=dataset_config.flowstats_features,\n                flowstats_features_boolean=dataset_config.flowstats_features_boolean,\n                flowstats_features_phist=dataset_config.flowstats_features_phist,\n                other_fields=self.dataset_config.other_fields,\n                ppi_channels=dataset_config.get_ppi_channels(),\n                ppi_transform=dataset_config.ppi_transform,\n                flowstats_transform=dataset_config.flowstats_transform,\n                flowstats_phist_transform=dataset_config.flowstats_phist_transform,\n                target_transform=encode_labels_with_unknown_fn,\n                return_tensors=dataset_config.return_tensors,\n                preload=dataset_config.preload_test,\n                preload_blob=os.path.join(test_data_path, \"preload\", 
f\"test_dataset-{dataset_config.test_known_size}-{dataset_config.test_unknown_size}.npz\"),)\n        self.class_info = class_info\n        self.dataset_indices = dataset_indices\n        self.train_dataset = train_dataset\n        self.val_dataset = val_dataset\n        self.test_dataset = test_dataset\n        self.known_app_counts = known_app_counts\n        self.unknown_app_counts = unknown_app_counts\n        self._collate_fn = collate_fn_simple\n</code></pre>"},{"location":"reference_cesnet_dataset/#datasets.cesnet_dataset.CesnetDataset.set_dataset_config_and_initialize","title":"set_dataset_config_and_initialize","text":"<pre><code>set_dataset_config_and_initialize(\n    dataset_config: DatasetConfig,\n    disable_indices_cache: bool = False,\n) -&gt; None\n</code></pre> <p>Initialize train, validation, and test sets. Data cannot be accessed before calling this method.</p> <p>Parameters:</p> Name Type Description Default <code>dataset_config</code> <code>DatasetConfig</code> <p>Desired configuration of the dataset.</p> required <code>disable_indices_cache</code> <code>bool</code> <p>Whether to disable caching of the dataset indices. This is useful when the dataset is used in many different configurations and you want to save disk space.</p> <code>False</code> Source code in <code>cesnet_datazoo\\datasets\\cesnet_dataset.py</code> <pre><code>def set_dataset_config_and_initialize(self, dataset_config: DatasetConfig, disable_indices_cache: bool = False) -&gt; None:\n    \"\"\"\n    Initialize train, validation, and test sets. Data cannot be accessed before calling this method.\n\n    Parameters:\n        dataset_config: Desired configuration of the dataset.\n        disable_indices_cache: Whether to disable caching of the dataset indices. 
This is useful when the dataset is used in many different configurations and you want to save disk space.\n    \"\"\"\n    self.dataset_config = dataset_config\n    self._clear()\n    self._initialize_train_val_test(disable_indices_cache=disable_indices_cache)\n</code></pre>"},{"location":"reference_cesnet_dataset/#datasets.cesnet_dataset.CesnetDataset.get_train_dataloader","title":"get_train_dataloader","text":"<pre><code>get_train_dataloader() -&gt; DataLoader\n</code></pre> <p>Provides a PyTorch <code>DataLoader</code> for training. The dataloader is created on the first call and then cached. When the dataloader is iterated in random order, the last incomplete batch is dropped. The dataloader is configured with the following config attributes:</p> Dataset config Description <code>batch_size</code> Number of samples per batch. <code>train_workers</code> Number of workers for loading train data. <code>train_dataloader_order</code> Whether to load train data in sequential or random order. See config.DataLoaderOrder. <code>train_dataloader_seed</code> Seed for loading train data in random order. <p>Returns:</p> Type Description <code>DataLoader</code> <p>Train data as an iterable dataloader. See using dataloaders for more details.</p> Source code in <code>cesnet_datazoo\\datasets\\cesnet_dataset.py</code> <pre><code>def get_train_dataloader(self) -&gt; DataLoader:\n    \"\"\"\n    Provides a PyTorch [`DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader) for training. 
The dataloader is created on the first call and then cached.\n    When the dataloader is iterated in random order, the last incomplete batch is dropped.\n    The dataloader is configured with the following config attributes:\n\n    | Dataset config               | Description                                                                                |\n    | ---------------------------- | ------------------------------------------------------------------------------------------ |\n    | `batch_size`                 | Number of samples per batch.                                                               |\n    | `train_workers`              | Number of workers for loading train data.                                                  |\n    | `train_dataloader_order`     | Whether to load train data in sequential or random order. See [config.DataLoaderOrder][].  |\n    | `train_dataloader_seed`      | Seed for loading train data in random order.                                               |\n\n    Returns:\n        Train data as an iterable dataloader. 
See [using dataloaders][using-dataloaders] for more details.\n    \"\"\"\n    if self.dataset_config is None:\n        raise ValueError(\"Dataset is not initialized, use set_dataset_config_and_initialize() before getting train dataloader\")\n    if not self.dataset_config.need_train_set:\n        raise ValueError(\"Train dataloader is not available when need_train_set is false\")\n    assert self.train_dataset\n    if self.train_dataloader:\n        return self.train_dataloader\n    # Create sampler according to the selected order\n    if self.dataset_config.train_dataloader_order == DataLoaderOrder.RANDOM:\n        if self.dataset_config.train_dataloader_seed is not None:\n            generator = torch.Generator()\n            generator.manual_seed(self.dataset_config.train_dataloader_seed)\n        else:\n            generator = None\n        self.train_dataloader_sampler = RandomSampler(self.train_dataset, generator=generator)\n        self.train_dataloader_drop_last = True\n    elif self.dataset_config.train_dataloader_order == DataLoaderOrder.SEQUENTIAL:\n        self.train_dataloader_sampler = SequentialSampler(self.train_dataset)\n        self.train_dataloader_drop_last = False\n    else: assert_never(self.dataset_config.train_dataloader_order)\n    # Create dataloader\n    batch_sampler = BatchSampler(sampler=self.train_dataloader_sampler, batch_size=self.dataset_config.batch_size, drop_last=self.train_dataloader_drop_last)\n    train_dataloader = DataLoader(\n        self.train_dataset,\n        num_workers=self.dataset_config.train_workers,\n        worker_init_fn=worker_init_fn,\n        collate_fn=self._collate_fn,\n        persistent_workers=self.dataset_config.train_workers &gt; 0,\n        batch_size=None,\n        sampler=batch_sampler,)\n    if self.dataset_config.train_workers == 0:\n        self.train_dataset.pytables_worker_init()\n    self.train_dataloader = train_dataloader\n    return 
train_dataloader\n</code></pre>"},{"location":"reference_cesnet_dataset/#datasets.cesnet_dataset.CesnetDataset.get_val_dataloader","title":"get_val_dataloader","text":"<pre><code>get_val_dataloader() -&gt; DataLoader\n</code></pre> <p>Provides a PyTorch <code>DataLoader</code> for validation. The dataloader is created on the first call and then cached. The dataloader is configured with the following config attributes:</p> Dataset config Description <code>test_batch_size</code> Number of samples per batch for loading validation and test data. <code>val_workers</code> Number of workers for loading validation data. <p>Returns:</p> Type Description <code>DataLoader</code> <p>Validation data as an iterable dataloader. See using dataloaders for more details.</p> Source code in <code>cesnet_datazoo\\datasets\\cesnet_dataset.py</code> <pre><code>def get_val_dataloader(self) -&gt; DataLoader:\n    \"\"\"\n    Provides a PyTorch [`DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader) for validation.\n    The dataloader is created on the first call and then cached.\n    The dataloader is configured with the following config attributes:\n\n    | Dataset config    | Description                                                       |\n    | ------------------| ------------------------------------------------------------------|\n    | `test_batch_size` | Number of samples per batch for loading validation and test data. |\n    | `val_workers`     | Number of workers for loading validation data.                    |\n\n    Returns:\n        Validation data as an iterable dataloader. 
See [using dataloaders][using-dataloaders] for more details.\n    \"\"\"\n    if self.dataset_config is None:\n        raise ValueError(\"Dataset is not initialized, use set_dataset_config_and_initialize() before getting validation dataloader\")\n    if not self.dataset_config.need_val_set:\n        raise ValueError(\"Validation dataloader is not available when need_val_set is false\")\n    assert self.val_dataset is not None\n    if self.val_dataloader:\n        return self.val_dataloader\n    batch_sampler = BatchSampler(sampler=SequentialSampler(self.val_dataset), batch_size=self.dataset_config.test_batch_size, drop_last=False)\n    val_dataloader = DataLoader(\n        self.val_dataset,\n        num_workers=self.dataset_config.val_workers,\n        worker_init_fn=worker_init_fn,\n        collate_fn=self._collate_fn,\n        persistent_workers=self.dataset_config.val_workers &gt; 0,\n        batch_size=None,\n        sampler=batch_sampler,)\n    if self.dataset_config.val_workers == 0:\n        self.val_dataset.pytables_worker_init()\n    self.val_dataloader = val_dataloader\n    return val_dataloader\n</code></pre>"},{"location":"reference_cesnet_dataset/#datasets.cesnet_dataset.CesnetDataset.get_test_dataloader","title":"get_test_dataloader","text":"<pre><code>get_test_dataloader() -&gt; DataLoader\n</code></pre> <p>Provides a PyTorch <code>DataLoader</code> for testing. The dataloader is created on the first call and then cached.</p> <p>When the dataset is used in the open-world setting, and unknown classes are defined, the test dataloader returns <code>test_known_size</code> samples of known classes followed by <code>test_unknown_size</code> samples of unknown classes.</p> <p>The dataloader is configured with the following config attributes:</p> Dataset config Description <code>test_batch_size</code> Number of samples per batch for loading validation and test data. <code>test_workers</code> Number of workers for loading test data. 
<p>Returns:</p> Type Description <code>DataLoader</code> <p>Test data as an iterable dataloader. See using dataloaders for more details.</p> Source code in <code>cesnet_datazoo\\datasets\\cesnet_dataset.py</code> <pre><code>def get_test_dataloader(self) -&gt; DataLoader:\n    \"\"\"\n    Provides a PyTorch [`DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader) for testing.\n    The dataloader is created on the first call and then cached.\n\n    When the dataset is used in the open-world setting, and unknown classes are defined,\n    the test dataloader returns `test_known_size` samples of known classes followed by `test_unknown_size` samples of unknown classes.\n\n    The dataloader is configured with the following config attributes:\n\n    | Dataset config    | Description                                                       |\n    | ------------------| ------------------------------------------------------------------|\n    | `test_batch_size` | Number of samples per batch for loading validation and test data. |\n    | `test_workers`    | Number of workers for loading test data.                          |\n\n    Returns:\n        Test data as an iterable dataloader. 
See [using dataloaders][using-dataloaders] for more details.\n    \"\"\"\n    if self.dataset_config is None:\n        raise ValueError(\"Dataset is not initialized, use set_dataset_config_and_initialize() before getting test dataloader\")\n    if not self.dataset_config.need_test_set:\n        raise ValueError(\"Test dataloader is not available when need_test_set is false\")\n    assert self.test_dataset is not None\n    if self.test_dataloader:\n        return self.test_dataloader\n    batch_sampler = BatchSampler(sampler=SequentialSampler(self.test_dataset), batch_size=self.dataset_config.test_batch_size, drop_last=False)\n    test_dataloader = DataLoader(\n        self.test_dataset,\n        num_workers=self.dataset_config.test_workers,\n        worker_init_fn=worker_init_fn,\n        collate_fn=self._collate_fn,\n        persistent_workers=False,\n        batch_size=None,\n        sampler=batch_sampler,)\n    if self.dataset_config.test_workers == 0:\n        self.test_dataset.pytables_worker_init()\n    self.test_dataloader = test_dataloader\n    return test_dataloader\n</code></pre>"},{"location":"reference_cesnet_dataset/#datasets.cesnet_dataset.CesnetDataset.get_dataloaders","title":"get_dataloaders","text":"<pre><code>get_dataloaders() -&gt; (\n    tuple[DataLoader, DataLoader, DataLoader]\n)\n</code></pre> <p>Gets train, validation, and test dataloaders in one call.</p> Source code in <code>cesnet_datazoo\\datasets\\cesnet_dataset.py</code> <pre><code>def get_dataloaders(self) -&gt; tuple[DataLoader, DataLoader, DataLoader]:\n    \"\"\"Gets train, validation, and test dataloaders in one call.\"\"\"\n    if self.dataset_config is None:\n        raise ValueError(\"Dataset is not initialized, use set_dataset_config_and_initialize() before getting dataloaders\")\n    train_dataloader = self.get_train_dataloader()\n    val_dataloader = self.get_val_dataloader()\n    test_dataloader = self.get_test_dataloader()\n    return train_dataloader, val_dataloader, 
test_dataloader\n</code></pre>"},{"location":"reference_cesnet_dataset/#datasets.cesnet_dataset.CesnetDataset.get_train_df","title":"get_train_df","text":"<pre><code>get_train_df(flatten_ppi: bool = False) -&gt; pd.DataFrame\n</code></pre> <p>Creates a train Pandas <code>DataFrame</code>. The dataframe is in sequential (datetime) order. Consider shuffling the dataframe if needed.</p> <p>Memory usage</p> <p>The whole train set is loaded into memory. If the dataset size is larger than <code>'S'</code>, consider using <code>get_train_dataloader</code> instead.</p> <p>Parameters:</p> Name Type Description Default <code>flatten_ppi</code> <code>bool</code> <p>Whether to flatten the PPI sequence into individual columns (named <code>IPT_X</code>, <code>DIR_X</code>, <code>SIZE_X</code>, <code>PUSH_X</code>, X being the index of the packet) or keep one <code>PPI</code> column with 2D data.</p> <code>False</code> <p>Returns:</p> Type Description <code>DataFrame</code> <p>Train data as a dataframe.</p> Source code in <code>cesnet_datazoo\\datasets\\cesnet_dataset.py</code> <pre><code>def get_train_df(self, flatten_ppi: bool = False) -&gt; pd.DataFrame:\n    \"\"\"\n    Creates a train Pandas [`DataFrame`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html). The dataframe is in sequential (datetime) order. Consider shuffling the dataframe if needed.\n\n    !!! warning \"Memory usage\"\n\n        The whole train set is loaded into memory. 
If the dataset size is larger than `'S'`, consider using `get_train_dataloader` instead.\n\n    Parameters:\n        flatten_ppi: Whether to flatten the PPI sequence into individual columns (named `IPT_X`, `DIR_X`, `SIZE_X`, `PUSH_X`, *X* being the index of the packet) or keep one `PPI` column with 2D data.\n\n    Returns:\n        Train data as a dataframe.\n    \"\"\"\n    self._check_before_dataframe(check_train=True)\n    assert self.dataset_config is not None and self.train_dataset is not None\n    if len(self.train_dataset) &gt; DATAFRAME_SAMPLES_WARNING_THRESHOLD:\n        warnings.warn(f\"Train set has ({len(self.train_dataset)} samples), consider using get_train_dataloader() instead\")\n    train_dataloader = self.get_train_dataloader()\n    assert isinstance(train_dataloader.sampler, BatchSampler) and self.train_dataloader_sampler is not None\n    # Read dataloader in sequential order\n    train_dataloader.sampler.sampler = SequentialSampler(self.train_dataset)\n    train_dataloader.sampler.drop_last = False\n    feature_names = self.dataset_config.get_feature_names(flatten_ppi=flatten_ppi)\n    df = create_df_from_dataloader(dataloader=train_dataloader,\n                                   feature_names=feature_names,\n                                   flatten_ppi=flatten_ppi,\n                                   silent=self.silent)\n    # Restore the original dataloader sampler and drop_last\n    train_dataloader.sampler.sampler = self.train_dataloader_sampler\n    train_dataloader.sampler.drop_last = self.train_dataloader_drop_last\n    return df\n</code></pre>"},{"location":"reference_cesnet_dataset/#datasets.cesnet_dataset.CesnetDataset.get_val_df","title":"get_val_df","text":"<pre><code>get_val_df(flatten_ppi: bool = False) -&gt; pd.DataFrame\n</code></pre> <p>Creates validation Pandas <code>DataFrame</code>. The dataframe is in sequential (datetime) order.</p> <p>Memory usage</p> <p>The whole validation set is loaded into memory. 
If the dataset size is larger than <code>'S'</code>, consider using <code>get_val_dataloader</code> instead.</p> <p>Parameters:</p> Name Type Description Default <code>flatten_ppi</code> <code>bool</code> <p>Whether to flatten the PPI sequence into individual columns (named <code>IPT_X</code>, <code>DIR_X</code>, <code>SIZE_X</code>, <code>PUSH_X</code>, X being the index of the packet) or keep one <code>PPI</code> column with 2D data.</p> <code>False</code> <p>Returns:</p> Type Description <code>DataFrame</code> <p>Validation data as a dataframe.</p> Source code in <code>cesnet_datazoo\\datasets\\cesnet_dataset.py</code> <pre><code>def get_val_df(self, flatten_ppi: bool = False) -&gt; pd.DataFrame:\n    \"\"\"\n    Creates validation Pandas [`DataFrame`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html). The dataframe is in sequential (datetime) order.\n\n    !!! warning \"Memory usage\"\n\n        The whole validation set is loaded into memory. If the dataset size is larger than `'S'`, consider using `get_val_dataloader` instead.\n\n    Parameters:\n        flatten_ppi: Whether to flatten the PPI sequence into individual columns (named `IPT_X`, `DIR_X`, `SIZE_X`, `PUSH_X`, *X* being the index of the packet) or keep one `PPI` column with 2D data.\n\n    Returns:\n        Validation data as a dataframe.\n    \"\"\"\n    self._check_before_dataframe(check_val=True)\n    assert self.dataset_config is not None and self.val_dataset is not None\n    if len(self.val_dataset) &gt; DATAFRAME_SAMPLES_WARNING_THRESHOLD:\n        warnings.warn(f\"Validation set has ({len(self.val_dataset)} samples), consider using get_val_dataloader() instead\")\n    feature_names = self.dataset_config.get_feature_names(flatten_ppi=flatten_ppi)\n    return create_df_from_dataloader(dataloader=self.get_val_dataloader(),\n                                     feature_names=feature_names,\n                                     flatten_ppi=flatten_ppi,\n                            
         silent=self.silent)\n</code></pre>"},{"location":"reference_cesnet_dataset/#datasets.cesnet_dataset.CesnetDataset.get_test_df","title":"get_test_df","text":"<pre><code>get_test_df(flatten_ppi: bool = False) -&gt; pd.DataFrame\n</code></pre> <p>Creates test Pandas <code>DataFrame</code>. The dataframe is in sequential (datetime) order.</p> <p>When the dataset is used in the open-world setting, and unknown classes are defined, the returned test dataframe is composed of <code>test_known_size</code> samples of known classes followed by <code>test_unknown_size</code> samples of unknown classes.</p> <p>Memory usage</p> <p>The whole test set is loaded into memory. If the dataset size is larger than <code>'S'</code>, consider using <code>get_test_dataloader</code> instead.</p> <p>Parameters:</p> Name Type Description Default <code>flatten_ppi</code> <code>bool</code> <p>Whether to flatten the PPI sequence into individual columns (named <code>IPT_X</code>, <code>DIR_X</code>, <code>SIZE_X</code>, <code>PUSH_X</code>, X being the index of the packet) or keep one <code>PPI</code> column with 2D data.</p> <code>False</code> <p>Returns:</p> Type Description <code>DataFrame</code> <p>Test data as a dataframe.</p> Source code in <code>cesnet_datazoo\\datasets\\cesnet_dataset.py</code> <pre><code>def get_test_df(self, flatten_ppi: bool = False) -&gt; pd.DataFrame:\n    \"\"\"\n    Creates test Pandas [`DataFrame`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html). The dataframe is in sequential (datetime) order.\n\n\n    When the dataset is used in the open-world setting, and unknown classes are defined,\n    the returned test dataframe is composed of `test_known_size` samples of known classes followed by `test_unknown_size` samples of unknown classes.\n\n\n    !!! warning \"Memory usage\"\n\n        The whole test set is loaded into memory. 
If the dataset size is larger than `'S'`, consider using `get_test_dataloader` instead.\n\n    Parameters:\n        flatten_ppi: Whether to flatten the PPI sequence into individual columns (named `IPT_X`, `DIR_X`, `SIZE_X`, `PUSH_X`, *X* being the index of the packet) or keep one `PPI` column with 2D data.\n\n    Returns:\n        Test data as a dataframe.\n    \"\"\"\n    self._check_before_dataframe(check_test=True)\n    assert self.dataset_config is not None and self.test_dataset is not None\n    if len(self.test_dataset) &gt; DATAFRAME_SAMPLES_WARNING_THRESHOLD:\n        warnings.warn(f\"Test set has ({len(self.test_dataset)} samples), consider using get_test_dataloader() instead\")\n    feature_names = self.dataset_config.get_feature_names(flatten_ppi=flatten_ppi)\n    return create_df_from_dataloader(dataloader=self.get_test_dataloader(),\n                                     feature_names=feature_names,\n                                     flatten_ppi=flatten_ppi,\n                                     silent=self.silent)\n</code></pre>"},{"location":"reference_cesnet_dataset/#datasets.cesnet_dataset.CesnetDataset.get_num_classes","title":"get_num_classes","text":"<pre><code>get_num_classes() -&gt; int\n</code></pre> <p>Returns the number of classes in the current configuration of the dataset.</p> Source code in <code>cesnet_datazoo\\datasets\\cesnet_dataset.py</code> <pre><code>def get_num_classes(self) -&gt; int:\n    \"\"\"Returns the number of classes in the current configuration of the dataset.\"\"\"\n    if self.class_info is None:\n        raise ValueError(\"Dataset is not initialized, use set_dataset_config_and_initialize() before getting the number of classes\")\n    return self.class_info.num_classes\n</code></pre>"},{"location":"reference_cesnet_dataset/#datasets.cesnet_dataset.CesnetDataset.get_known_apps","title":"get_known_apps","text":"<pre><code>get_known_apps() -&gt; list[str]\n</code></pre> <p>Returns the list of known applications in the 
current configuration of the dataset.</p> Source code in <code>cesnet_datazoo\\datasets\\cesnet_dataset.py</code> <pre><code>def get_known_apps(self) -&gt; list[str]:\n    \"\"\"Returns the list of known applications in the current configuration of the dataset.\"\"\"\n    if self.class_info is None:\n        raise ValueError(\"Dataset is not initialized, use set_dataset_config_and_initialize() before getting known apps\")\n    return self.class_info.known_apps\n</code></pre>"},{"location":"reference_cesnet_dataset/#datasets.cesnet_dataset.CesnetDataset.get_unknown_apps","title":"get_unknown_apps","text":"<pre><code>get_unknown_apps() -&gt; list[str]\n</code></pre> <p>Returns the list of unknown applications in the current configuration of the dataset.</p> Source code in <code>cesnet_datazoo\\datasets\\cesnet_dataset.py</code> <pre><code>def get_unknown_apps(self) -&gt; list[str]:\n    \"\"\"Returns the list of unknown applications in the current configuration of the dataset.\"\"\"\n    if self.class_info is None:\n        raise ValueError(\"Dataset is not initialized, use set_dataset_config_and_initialize() before getting unknown apps\")\n    return self.class_info.unknown_apps\n</code></pre>"},{"location":"reference_cesnet_dataset/#datasets.cesnet_dataset.CesnetDataset.compute_dataset_statistics","title":"compute_dataset_statistics","text":"<pre><code>compute_dataset_statistics(\n    num_samples: int | Literal[\"all\"] = 10000000,\n    num_workers: int = 4,\n    batch_size: int = 16384,\n    disabled_apps: Optional[list[str]] = None,\n) -&gt; None\n</code></pre> <p>Computes dataset statistics and saves them to the <code>statistics_path</code> folder.</p> <p>Parameters:</p> Name Type Description Default <code>num_samples</code> <code>int | Literal['all']</code> <p>Number of samples to use for computing the statistics.</p> <code>10000000</code> <code>num_workers</code> <code>int</code> <p>Number of workers for loading data.</p> <code>4</code> <code>batch_size</code> 
<code>int</code> <p>Number of samples per batch for loading data.</p> <code>16384</code> <code>disabled_apps</code> <code>Optional[list[str]]</code> <p>List of applications to exclude from the statistics.</p> <code>None</code> Source code in <code>cesnet_datazoo\\datasets\\cesnet_dataset.py</code> <pre><code>def compute_dataset_statistics(self, num_samples: int | Literal[\"all\"] = 10_000_000, num_workers: int = 4, batch_size: int = 16384, disabled_apps: Optional[list[str]] = None) -&gt; None:\n    \"\"\"\n    Computes dataset statistics and saves them to the `statistics_path` folder.\n\n    Parameters:\n        num_samples: Number of samples to use for computing the statistics.\n        num_workers: Number of workers for loading data.\n        batch_size: Number of samples per batch for loading data.\n        disabled_apps: List of applications to exclude from the statistics.\n    \"\"\"\n    if disabled_apps:\n        bad_disabled_apps = [a for a in disabled_apps if a not in self.available_classes]\n        if len(bad_disabled_apps) &gt; 0:\n            raise ValueError(f\"Bad applications in disabled_apps {bad_disabled_apps}. 
Use applications available in dataset.available_classes\")\n    if not os.path.exists(self.statistics_path):\n        os.mkdir(self.statistics_path)\n    compute_dataset_statistics(database_path=self.database_path,\n                               tables_app_enum=self._tables_app_enum,\n                               tables_cat_enum=self._tables_cat_enum,\n                               output_dir=self.statistics_path,\n                               packet_histograms=self.metadata.packet_histograms,\n                               flowstats_features_boolean=self.metadata.flowstats_features_boolean,\n                               protocol=self.metadata.protocol,\n                               extra_fields=not self.name.startswith(\"CESNET-TLS22\"),\n                               disabled_apps=disabled_apps if disabled_apps is not None else [],\n                               num_samples=num_samples,\n                               num_workers=num_workers,\n                               batch_size=batch_size,\n                               silent=self.silent)\n</code></pre>"},{"location":"reference_dataset_config/","title":"Config class","text":""},{"location":"reference_dataset_config/#config.DatasetConfig","title":"config.DatasetConfig","text":"<p>The main class for the configuration of:</p> <ul> <li>Train, validation, test sets (dates, sizes, validation approach).</li> <li>Application selection \u2014 either the standard closed-world setting (only known classes) or the open-world setting (known and unknown classes).</li> <li>Data transformations. See the transforms page for more information.</li> <li>Dataloader options like batch sizes, order of loading, or number of workers.</li> </ul> <p>When initializing this class, pass a <code>CesnetDataset</code> instance to be configured and the desired configuration. 
Available options are here.</p> <p>Attributes:</p> Name Type Description <code>dataset</code> <code>InitVar[CesnetDataset]</code> <p>The dataset instance to be configured.</p> <code>data_root</code> <code>str</code> <p>Taken from the dataset instance.</p> <code>database_filename</code> <code>str</code> <p>Taken from the dataset instance.</p> <code>database_path</code> <code>str</code> <p>Taken from the dataset instance.</p> <code>servicemap_path</code> <code>str</code> <p>Taken from the dataset instance.</p> <code>flowstats_features</code> <code>list[str]</code> <p>Taken from <code>dataset.metadata.flowstats_features</code>.</p> <code>flowstats_features_boolean</code> <code>list[str]</code> <p>Taken from <code>dataset.metadata.flowstats_features_boolean</code>.</p> <code>flowstats_features_phist</code> <code>list[str]</code> <p>Taken from <code>dataset.metadata.packet_histograms</code> if <code>use_packet_histograms</code> is true, otherwise an empty list.</p> <code>other_fields</code> <code>list[str]</code> <p>Taken from <code>dataset.metadata.other_fields</code> if <code>return_other_fields</code> is true, otherwise an empty list.</p>"},{"location":"reference_dataset_config/#config.DatasetConfig--configuration-options","title":"Configuration options","text":"<p>Attributes:</p> Name Type Description <code>need_train_set</code> <code>bool</code> <p>Use to disable the train set. <code>Default: True</code></p> <code>need_val_set</code> <code>bool</code> <p>Use to disable the validation set. When <code>need_train_set</code> is false, the validation set will also be disabled. <code>Default: True</code></p> <code>need_test_set</code> <code>bool</code> <p>Use to disable the test set. <code>Default: True</code></p> <code>train_period_name</code> <code>str</code> <p>Name of the train period. 
See instructions.</p> <code>train_dates</code> <code>list[str]</code> <p>Dates used for creating a train set.</p> <code>train_dates_weigths</code> <code>Optional[list[int]]</code> <p>To use a non-uniform distribution of samples across train dates.</p> <code>val_approach</code> <code>ValidationApproach</code> <p>How a validation set should be created. Either split train data into train and validation or have a separate validation period. <code>Default: SPLIT_FROM_TRAIN</code></p> <code>train_val_split_fraction</code> <code>float</code> <p>The fraction of validation samples when splitting from the train set. <code>Default: 0.2</code></p> <code>val_period_name</code> <code>str</code> <p>Name of the validation period. See instructions.</p> <code>val_dates</code> <code>list[str]</code> <p>Dates used for creating a validation set.</p> <code>test_period_name</code> <code>str</code> <p>Name of the test period. See instructions.</p> <code>test_dates</code> <code>list[str]</code> <p>Dates used for creating a test set.</p> <code>apps_selection</code> <code>AppSelection</code> <p>How to select application classes. <code>Default: ALL_KNOWN</code></p> <code>apps_selection_topx</code> <code>int</code> <p>Take top X as known.</p> <code>apps_selection_background_unknown</code> <code>list[str]</code> <p>Provide a list of background traffic classes to be used as unknown.</p> <code>apps_selection_fixed_known</code> <code>list[str]</code> <p>Provide a list of manually selected known applications.</p> <code>apps_selection_fixed_unknown</code> <code>list[str]</code> <p>Provide a list of manually selected unknown applications.</p> <code>disabled_apps</code> <code>list[str]</code> <p>List of applications to be disabled and not used at all.</p> <code>min_train_samples_check</code> <code>MinTrainSamplesCheck</code> <p>How to handle applications with not enough training samples. 
<code>Default: DISABLE_APPS</code></p> <code>min_train_samples_per_app</code> <code>int</code> <p>Defines the threshold for not enough. <code>Default: 100</code></p> <code>random_state</code> <code>int</code> <p>Fix all random processes performed during dataset initialization. <code>Default: 420</code></p> <code>fold_id</code> <code>int</code> <p>To perform N-fold cross-validation, set this to <code>1..N</code>. Each fold will use the same configuration but a different random seed. <code>Default: 0</code></p> <code>train_workers</code> <code>int</code> <p>Number of workers for loading train data. <code>0</code> means that the data will be loaded in the main process. <code>Default: 4</code></p> <code>test_workers</code> <code>int</code> <p>Number of workers for loading test data. <code>0</code> means that the data will be loaded in the main process. <code>Default: 1</code></p> <code>val_workers</code> <code>int</code> <p>Number of workers for loading validation data. <code>0</code> means that the data will be loaded in the main process. <code>Default: 1</code></p> <code>batch_size</code> <code>int</code> <p>Number of samples per batch. <code>Default: 192</code></p> <code>test_batch_size</code> <code>int</code> <p>Number of samples per batch for loading validation and test data. <code>Default: 2048</code></p> <code>preload_val</code> <code>bool</code> <p>Whether to dump the validation set with <code>numpy.savez_compressed</code> and preload it in future runs. Useful when running a lot of experiments with the same dataset configuration. <code>Default: True</code></p> <code>preload_test</code> <code>bool</code> <p>Whether to dump the test set with <code>numpy.savez_compressed</code> and preload it in future runs. <code>Default: False</code></p> <code>train_size</code> <code>int | Literal['all']</code> <p>Size of the train set. See instructions. <code>Default: all</code></p> <code>val_known_size</code> <code>int | Literal['all']</code> <p>Size of the validation set. 
See instructions. <code>Default: all</code></p> <code>test_known_size</code> <code>int | Literal['all']</code> <p>Size of the test set. See instructions. <code>Default: all</code></p> <code>val_unknown_size</code> <code>int | Literal['all']</code> <p>Size of the unknown classes validation set. Use for evaluation in the open-world setting. <code>Default: 0</code></p> <code>test_unknown_size</code> <code>int | Literal['all']</code> <p>Size of the unknown classes test set. Use for evaluation in the open-world setting. <code>Default: 0</code></p> <code>train_dataloader_order</code> <code>DataLoaderOrder</code> <p>Whether to load train data in sequential or random order. <code>Default: RANDOM</code></p> <code>train_dataloader_seed</code> <code>Optional[int]</code> <p>Seed for loading train data in random order. <code>Default: None</code></p> <code>return_other_fields</code> <code>bool</code> <p>Whether to return auxiliary fields, such as communicating hosts, flow times, and more fields extracted from the ClientHello message. <code>Default: False</code></p> <code>return_tensors</code> <code>bool</code> <p>Use for returning <code>torch.Tensor</code> from dataloaders. Dataframes are not available when this option is used. <code>Default: False</code></p> <code>use_packet_histograms</code> <code>bool</code> <p>Whether to use packet histogram features, if available in the dataset. <code>Default: False</code></p> <code>use_tcp_features</code> <code>bool</code> <p>Whether to use TCP features, if available in the dataset. <code>Default: False</code></p> <code>use_push_flags</code> <code>bool</code> <p>Whether to use push flags in packet sequences, if available in the dataset. <code>Default: False</code></p> <code>fit_scalers_samples</code> <code>int | float</code> <p>Used when scaling transformation is configured and requires fitting. Fraction of train samples used for fitting, if float. The absolute number of samples otherwise. 
<code>Default: 0.25</code></p> <code>ppi_transform</code> <code>Optional[Callable]</code> <p>Transform function for PPI sequences. See the transforms page for more information. <code>Default: None</code></p> <code>flowstats_transform</code> <code>Optional[Callable]</code> <p>Transform function for flow statistics. See the transforms page for more information. <code>Default: None</code></p> <code>flowstats_phist_transform</code> <code>Optional[Callable]</code> <p>Transform function for packet histograms. See the transforms page for more information. <code>Default: None</code></p>"},{"location":"reference_dataset_config/#config.DatasetConfig--how-to-configure-train-validation-and-test-sets","title":"How to configure train, validation, and test sets","text":"<p>There are three options for how to define train/validation/test dates.</p> <ol> <li>Choose a predefined time period (<code>train_period_name</code>, <code>val_period_name</code>, or <code>test_period_name</code>) available in <code>dataset.time_periods</code> and leave the list of dates (<code>train_dates</code>, <code>val_dates</code>, or <code>test_dates</code>) empty.</li> <li>Provide a list of dates and a name for the time period. The dates are checked against <code>dataset.available_dates</code>.</li> <li>Do not specify anything and use the dataset's defaults <code>dataset.default_train_period_name</code> and <code>dataset.default_test_period_name</code>.</li> </ol> <p>There are two options for configuring sizes of train/validation/test sets.</p> <ol> <li>Select an appropriate dataset size (default is <code>S</code>) when creating the <code>CesnetDataset</code> instance and leave <code>train_size</code>, <code>val_known_size</code>, and <code>test_known_size</code> with their default <code>all</code> value. 
This will create train/validation/test sets with all samples available in the selected dataset size (of course, depending on the selected dates and validation approach).</li> <li>Provide exact sizes in <code>train_size</code>, <code>val_known_size</code>, and <code>test_known_size</code>. This will create train/validation/test sets of the given sizes by drawing a random subset. This is especially useful when you use the <code>ORIG</code> dataset size and want to control the size of experiments.</li> </ol> <p>Tip</p> <p>The default approach for creating a validation set is to randomly split the train data into train and validation. The second approach is to define separate validation dates. See ValidationApproach.</p> Source code in <code>cesnet_datazoo\\config.py</code> <pre><code>@dataclass(config=C)\nclass DatasetConfig():\n    \"\"\"\n    The main class for the configuration of:\n\n    - Train, validation, test sets (dates, sizes, validation approach).\n    - Application selection \u2014 either the standard closed-world setting (only *known* classes) or the open-world setting (*known* and *unknown* classes).\n    - Data transformations. See the [transforms][transforms] page for more information.\n    - Dataloader options like batch sizes, order of loading, or number of workers.\n\n    When initializing this class, pass a [`CesnetDataset`][datasets.cesnet_dataset.CesnetDataset] instance to be configured and the desired configuration. 
Available options are [here][config.DatasetConfig--configuration-options].\n\n    Attributes:\n        dataset: The dataset instance to be configured.\n        data_root: Taken from the dataset instance.\n        database_filename: Taken from the dataset instance.\n        database_path: Taken from the dataset instance.\n        servicemap_path: Taken from the dataset instance.\n        flowstats_features: Taken from `dataset.metadata.flowstats_features`.\n        flowstats_features_boolean: Taken from `dataset.metadata.flowstats_features_boolean`.\n        flowstats_features_phist: Taken from `dataset.metadata.packet_histograms` if `use_packet_histograms` is true, otherwise an empty list.\n        other_fields: Taken from `dataset.metadata.other_fields` if `return_other_fields` is true, otherwise an empty list.\n\n    # Configuration options\n\n    Attributes:\n        need_train_set: Use to disable the train set. `Default: True`\n        need_val_set: Use to disable the validation set. When `need_train_set` is false, the validation set will also be disabled. `Default: True`\n        need_test_set: Use to disable the test set. `Default: True`\n        train_period_name: Name of the train period. See [instructions][config.DatasetConfig--how-to-configure-train-validation-and-test-sets].\n        train_dates: Dates used for creating a train set.\n        train_dates_weigths: To use a non-uniform distribution of samples across train dates.\n        val_approach: How a validation set should be created. Either split train data into train and validation or have a separate validation period. `Default: SPLIT_FROM_TRAIN`\n        train_val_split_fraction: The fraction of validation samples when splitting from the train set. `Default: 0.2`\n        val_period_name: Name of the validation period. 
See [instructions][config.DatasetConfig--how-to-configure-train-validation-and-test-sets].\n        val_dates: Dates used for creating a validation set.\n        test_period_name: Name of the test period. See [instructions][config.DatasetConfig--how-to-configure-train-validation-and-test-sets].\n        test_dates: Dates used for creating a test set.\n\n        apps_selection: How to select application classes. `Default: ALL_KNOWN`\n        apps_selection_topx: Take top X as known.\n        apps_selection_background_unknown: Provide a list of background traffic classes to be used as unknown.\n        apps_selection_fixed_known: Provide a list of manually selected known applications.\n        apps_selection_fixed_unknown: Provide a list of manually selected unknown applications.\n        disabled_apps: List of applications to be disabled and not used at all.\n        min_train_samples_check: How to handle applications with *not enough* training samples. `Default: DISABLE_APPS`\n        min_train_samples_per_app: Defines the threshold for *not enough*. `Default: 100`\n\n        random_state: Fix all random processes performed during dataset initialization. `Default: 420`\n        fold_id: To perform N-fold cross-validation, set this to `1..N`. Each fold will use the same configuration but a different random seed. `Default: 0`\n        train_workers: Number of workers for loading train data. `0` means that the data will be loaded in the main process. `Default: 4`\n        test_workers: Number of workers for loading test data. `0` means that the data will be loaded in the main process. `Default: 1`\n        val_workers: Number of workers for loading validation data. `0` means that the data will be loaded in the main process. `Default: 1`\n        batch_size: Number of samples per batch. `Default: 192`\n        test_batch_size: Number of samples per batch for loading validation and test data. 
`Default: 2048`\n        preload_val: Whether to dump the validation set with `numpy.savez_compressed` and preload it in future runs. Useful when running a lot of experiments with the same dataset configuration. `Default: True`\n        preload_test: Whether to dump the test set with `numpy.savez_compressed` and preload it in future runs. `Default: False`\n        train_size: Size of the train set. See [instructions][config.DatasetConfig--how-to-configure-train-validation-and-test-sets]. `Default: all`\n        val_known_size: Size of the validation set. See [instructions][config.DatasetConfig--how-to-configure-train-validation-and-test-sets]. `Default: all`\n        test_known_size: Size of the test set. See [instructions][config.DatasetConfig--how-to-configure-train-validation-and-test-sets]. `Default: all`\n        val_unknown_size: Size of the unknown classes validation set. Use for evaluation in the open-world setting. `Default: 0`\n        test_unknown_size: Size of the unknown classes test set. Use for evaluation in the open-world setting. `Default: 0`\n        train_dataloader_order: Whether to load train data in sequential or random order. `Default: RANDOM`\n        train_dataloader_seed: Seed for loading train data in random order. `Default: None`\n\n        return_other_fields: Whether to return [auxiliary fields][other-fields], such as communicating hosts, flow times, and more fields extracted from the ClientHello message. `Default: False`\n        return_tensors: Use for returning `torch.Tensor` from dataloaders. Dataframes are not available when this option is used. `Default: False`\n        use_packet_histograms: Whether to use packet histogram features, if available in the dataset. `Default: False`\n        use_tcp_features: Whether to use TCP features, if available in the dataset. `Default: False`\n        use_push_flags: Whether to use push flags in packet sequences, if available in the dataset. 
`Default: False`\n        fit_scalers_samples: Used when scaling transformation is configured and requires fitting. Fraction of train samples used for fitting, if float. The absolute number of samples otherwise. `Default: 0.25`\n        ppi_transform: Transform function for PPI sequences. See the [transforms][transforms] page for more information. `Default: None`\n        flowstats_transform: Transform function for flow statistics. See the [transforms][transforms] page for more information. `Default: None`\n        flowstats_phist_transform: Transform function for packet histograms. See the [transforms][transforms] page for more information. `Default: None`\n\n    # How to configure train, validation, and test sets\n    There are three options for how to define train/validation/test dates.\n\n    1. Choose a predefined time period (`train_period_name`, `val_period_name`, or `test_period_name`) available in `dataset.time_periods` and leave the list of dates (`train_dates`, `val_dates`, or `test_dates`) empty.\n    2. Provide a list of dates and a name for the time period. The dates are checked against `dataset.available_dates`.\n    3. Do not specify anything and use the dataset's defaults `dataset.default_train_period_name` and `dataset.default_test_period_name`.\n\n    There are two options for configuring sizes of train/validation/test sets.\n\n    1. Select an appropriate dataset size (default is `S`) when creating the [`CesnetDataset`][datasets.cesnet_dataset.CesnetDataset] instance and leave `train_size`, `val_known_size`, and `test_known_size` with their default `all` value.\n    This will create train/validation/test sets with all samples available in the selected dataset size (of course, depending on the selected dates and validation approach).\n    2. Provide exact sizes in `train_size`, `val_known_size`, and `test_known_size`. 
This will create train/validation/test sets of the given sizes by doing a random subset.\n    This is especially useful when using the `ORIG` dataset size and want to control the size of experiments.\n\n    !!! tip Validation set\n        The default approach for creating a validation set is to randomly split the train data into train and validation. The second approach is to define separate validation dates. See [ValidationApproach][config.ValidationApproach].\n\n    \"\"\"\n    dataset: InitVar[CesnetDataset]\n    data_root: str = field(init=False)\n    database_filename: str =  field(init=False)\n    database_path: str =  field(init=False)\n    servicemap_path: str = field(init=False)\n    flowstats_features: list[str] = field(init=False)\n    flowstats_features_boolean: list[str] = field(init=False)\n    flowstats_features_phist: list[str] = field(init=False)\n    other_fields: list[str] = field(init=False)\n\n    need_train_set: bool = True\n    need_val_set: bool = True\n    need_test_set: bool = True\n    train_period_name: str = \"\"\n    train_dates: list[str] = field(default_factory=list)\n    train_dates_weigths: Optional[list[int]] = None\n    val_approach: ValidationApproach = ValidationApproach.SPLIT_FROM_TRAIN\n    train_val_split_fraction: float = 0.2\n    val_period_name: str = \"\"\n    val_dates: list[str] = field(default_factory=list)\n    test_period_name: str = \"\"\n    test_dates: list[str] = field(default_factory=list)\n\n    apps_selection: AppSelection = AppSelection.ALL_KNOWN\n    apps_selection_topx: int = 0\n    apps_selection_background_unknown: list[str] = field(default_factory=list)\n    apps_selection_fixed_known: list[str] = field(default_factory=list)\n    apps_selection_fixed_unknown: list[str] = field(default_factory=list)\n    disabled_apps: list[str] = field(default_factory=list)\n    min_train_samples_check: MinTrainSamplesCheck = MinTrainSamplesCheck.DISABLE_APPS\n    min_train_samples_per_app: int = 100\n\n    
random_state: int = 420\n    fold_id: int = 0\n    train_workers: int = 4\n    test_workers: int = 1\n    val_workers: int = 1\n    batch_size: int = 192\n    test_batch_size: int = 2048\n    preload_val: bool = True\n    preload_test: bool = False\n    train_size: int | Literal[\"all\"] = \"all\"\n    val_known_size: int | Literal[\"all\"] = \"all\"\n    test_known_size: int | Literal[\"all\"] = \"all\"\n    val_unknown_size: int | Literal[\"all\"] = 0\n    test_unknown_size: int | Literal[\"all\"] = 0\n    train_dataloader_order: DataLoaderOrder = DataLoaderOrder.RANDOM\n    train_dataloader_seed: Optional[int] = None\n\n    return_other_fields: bool = False\n    return_tensors: bool = False\n    use_packet_histograms: bool = False\n    use_tcp_features: bool = False\n    use_push_flags: bool = False\n    fit_scalers_samples: int | float = 0.25\n    ppi_transform: Optional[Callable] = None\n    flowstats_transform: Optional[Callable] = None\n    flowstats_phist_transform: Optional[Callable] = None\n\n    def __post_init__(self, dataset: CesnetDataset):\n        \"\"\"\n        Ensures valid configuration. 
Catches all incompatible options and raises exceptions as soon as possible.\n        \"\"\"\n        self.data_root = dataset.data_root\n        self.servicemap_path = dataset.servicemap_path\n        self.database_filename = dataset.database_filename\n        self.database_path = dataset.database_path\n\n        if not self.need_train_set:\n            self.need_val_set = False\n            if self.apps_selection != AppSelection.FIXED:\n                raise ValueError(\"Application selection has to be fixed when need_train_set is false\")\n            if (len(self.train_dates) &gt; 0 or self.train_period_name != \"\"):\n                raise ValueError(\"train_dates and train_period_name cannot be specified when need_train_set is false\")\n        else:\n            # Configure train dates\n            if len(self.train_dates) &gt; 0 and self.train_period_name == \"\":\n                raise ValueError(\"train_period_name has to be specified when train_dates are set\")\n            if len(self.train_dates) == 0 and self.train_period_name != \"\":\n                if self.train_period_name not in dataset.time_periods:\n                    raise ValueError(f\"Unknown train_period_name {self.train_period_name}. 
Use time period available in dataset.time_periods\")\n                self.train_dates = dataset.time_periods[self.train_period_name]\n            if len(self.train_dates) == 0 and self.train_period_name == \"\":\n                self.train_period_name = dataset.default_train_period_name\n                self.train_dates = dataset.time_periods[dataset.default_train_period_name]\n        # Configure test dates\n        if not self.need_test_set:\n            if (len(self.test_dates) &gt; 0 or self.test_period_name != \"\"):\n                raise ValueError(\"test_dates and test_period_name cannot be specified when need_test_set is false\")\n        else:\n            if len(self.test_dates) &gt; 0 and self.test_period_name == \"\":\n                raise ValueError(\"test_period_name has to be specified when test_dates are set\")\n            if len(self.test_dates) == 0 and self.test_period_name != \"\":\n                if self.test_period_name not in dataset.time_periods:\n                    raise ValueError(f\"Unknown test_period_name {self.test_period_name}. 
Use time period available in dataset.time_periods\")\n                self.test_dates = dataset.time_periods[self.test_period_name]\n            if len(self.test_dates) == 0 and self.test_period_name == \"\":\n                self.test_period_name = dataset.default_test_period_name\n                self.test_dates = dataset.time_periods[dataset.default_test_period_name]\n        # Configure val dates\n        if (not self.need_val_set or self.val_approach == ValidationApproach.SPLIT_FROM_TRAIN) and (len(self.val_dates) &gt; 0 or self.val_period_name != \"\"):\n            raise ValueError(\"val_dates and val_period_name cannot be specified when need_val_set is false or the validation approach is split-from-train\")\n        if self.val_approach == ValidationApproach.VALIDATION_DATES:\n            if len(self.val_dates) &gt; 0 and self.val_period_name == \"\":\n                raise ValueError(\"val_period_name has to be specified when val_dates are set\")\n            if len(self.val_dates) == 0 and self.val_period_name != \"\":\n                if self.val_period_name not in dataset.time_periods:\n                    raise ValueError(f\"Unknown val_period_name {self.val_period_name}. 
Use time period available in dataset.time_periods\")\n                self.val_dates = dataset.time_periods[self.val_period_name]\n            if len(self.val_dates) == 0 and self.val_period_name == \"\":\n                raise ValueError(\"val_period_name and val_dates (or val_period_name from dataset.time_periods) have to be specified when the validation approach is validation-dates\")\n        # Check if train, val, and test dates are available in the dataset\n        bad_train_dates = [t for t in self.train_dates if t not in dataset.available_dates]\n        bad_val_dates = [t for t in self.val_dates if t not in dataset.available_dates]\n        bad_test_dates = [t for t in self.test_dates if t not in dataset.available_dates]\n        if len(bad_train_dates) &gt; 0:\n            raise ValueError(f\"Bad train dates {bad_train_dates}. Use dates available in dataset.available_dates (collection period {dataset.metadata.collection_period})\" \\\n                            + (f\". These dates are missing from the dataset collection period {dataset.metadata.missing_dates_in_collection_period}\" if dataset.metadata.missing_dates_in_collection_period else \"\"))\n        if len(bad_val_dates) &gt; 0:\n            raise ValueError(f\"Bad validation dates {bad_val_dates}. Use dates available in dataset.available_dates (collection period {dataset.metadata.collection_period})\" \\\n                            + (f\". These dates are missing from the dataset collection period {dataset.metadata.missing_dates_in_collection_period}\" if dataset.metadata.missing_dates_in_collection_period else \"\"))\n        if len(bad_test_dates) &gt; 0:\n            raise ValueError(f\"Bad test dates {bad_test_dates}. Use dates available in dataset.available_dates (collection period {dataset.metadata.collection_period})\" \\\n                            + (f\". 
These dates are missing from the dataset collection period {dataset.metadata.missing_dates_in_collection_period}\" if dataset.metadata.missing_dates_in_collection_period else \"\"))\n        # Check time order of train, val, and test periods\n        train_dates = [datetime.strptime(date_str, \"%Y%m%d\").date() for date_str in self.train_dates]\n        test_dates = [datetime.strptime(date_str, \"%Y%m%d\").date() for date_str in self.test_dates]\n        if len(train_dates) &gt; 0 and len(test_dates) &gt; 0  and min(test_dates) &lt;= max(train_dates):\n            warnings.warn(f\"Some test dates ({min(test_dates).strftime('%Y%m%d')}) are before or equal to the last train date ({max(train_dates).strftime('%Y%m%d')}). This might lead to improper evaluation and should be avoided.\")\n        if self.val_approach == ValidationApproach.VALIDATION_DATES:\n            # Train dates are guaranteed to be set\n            val_dates = [datetime.strptime(date_str, \"%Y%m%d\").date() for date_str in self.val_dates]\n            if min(val_dates) &lt;= max(train_dates):\n                warnings.warn(f\"Some validation dates ({min(val_dates).strftime('%Y%m%d')}) are before or equal to the last train date ({max(train_dates).strftime('%Y%m%d')}). This might lead to improper evaluation and should be avoided.\")\n            if len(test_dates) &gt; 0 and min(test_dates) &lt;= max(val_dates):\n                warnings.warn(f\"Some test dates ({min(test_dates).strftime('%Y%m%d')}) are before or equal to the last validation date ({max(val_dates).strftime('%Y%m%d')}). 
This might lead to improper evaluation and should be avoided.\")\n        # Configure features\n        self.flowstats_features = dataset.metadata.flowstats_features\n        self.flowstats_features_boolean = dataset.metadata.flowstats_features_boolean\n        self.other_fields = dataset.metadata.other_fields if self.return_other_fields else []\n        if self.use_packet_histograms:\n            if len(dataset.metadata.packet_histograms) == 0:\n                raise ValueError(\"This dataset does not support use_packet_histograms\")\n            self.flowstats_features_phist = dataset.metadata.packet_histograms\n        else:\n            self.flowstats_features_phist = []\n            if self.flowstats_phist_transform is not None:\n                raise ValueError(\"flowstats_phist_transform cannot be specified when use_packet_histograms is false\")\n        if dataset.metadata.protocol == Protocol.TLS:\n            if self.use_tcp_features:\n                self.flowstats_features_boolean = self.flowstats_features_boolean + SELECTED_TCP_FLAGS\n            if self.use_push_flags and \"PUSH_FLAG\" not in dataset.metadata.ppi_features:\n                raise ValueError(\"This TLS dataset does not support use_push_flags\")\n        if dataset.metadata.protocol == Protocol.QUIC:\n            if self.use_tcp_features:\n                raise ValueError(\"QUIC datasets do not support use_tcp_features\")\n            if self.use_push_flags:\n                raise ValueError(\"QUIC datasets do not support use_push_flags\")\n        # When train_dates_weigths are used, train_size and val_known_size have to be specified\n        if self.train_dates_weigths is not None:\n            if not self.need_train_set:\n                raise ValueError(\"train_dates_weigths cannot be specified when need_train_set is false\")\n            if len(self.train_dates_weigths) != len(self.train_dates):\n                raise ValueError(\"train_dates_weigths has to have the same length as 
train_dates\")\n            if self.train_size == \"all\":\n                raise ValueError(\"train_size cannot be 'all' when train_dates_weigths are speficied\")\n            if self.val_approach == ValidationApproach.SPLIT_FROM_TRAIN and self.val_known_size == \"all\":\n                raise ValueError(\"val_known_size cannot be 'all' when train_dates_weigths are speficied and validation_approach is split-from-train\")\n        # App selection\n        if self.apps_selection == AppSelection.ALL_KNOWN:\n            self.val_unknown_size = 0\n            self.test_unknown_size = 0\n            if self.apps_selection_topx != 0 or len(self.apps_selection_background_unknown) &gt; 0 or len(self.apps_selection_fixed_known) &gt; 0 or len(self.apps_selection_fixed_unknown) &gt; 0:\n                raise ValueError(\"apps_selection_topx, apps_selection_background_unknown, apps_selection_fixed_known, and apps_selection_fixed_unknown cannot be specified when application selection is all-known\")\n        if self.apps_selection == AppSelection.TOPX_KNOWN:\n            if self.apps_selection_topx == 0:\n                raise ValueError(\"apps_selection_topx has to be greater than 0 when application selection is top-x-known\")\n            if len(self.apps_selection_background_unknown) &gt; 0 or len(self.apps_selection_fixed_known) &gt; 0 or len(self.apps_selection_fixed_unknown) &gt; 0:\n                raise ValueError(\"apps_selection_background_unknown, apps_selection_fixed_known, and apps_selection_fixed_unknown cannot be specified when application selection is top-x-known\")\n        if self.apps_selection == AppSelection.BACKGROUND_UNKNOWN:\n            if len(self.apps_selection_background_unknown) == 0:\n                raise ValueError(\"apps_selection_background_unknown has to be specified when application selection is background-unknown\")\n            bad_apps = [a for a in self.apps_selection_background_unknown if a not in dataset.available_classes]\n            
if len(bad_apps) &gt; 0:\n                raise ValueError(f\"Bad applications in apps_selection_background_unknown {bad_apps}. Use applications available in dataset.available_classes\")\n            if self.apps_selection_topx != 0 or len(self.apps_selection_fixed_known) &gt; 0 or len(self.apps_selection_fixed_unknown) &gt; 0:\n                raise ValueError(\"apps_selection_topx, apps_selection_fixed_known, and apps_selection_fixed_unknown cannot be specified when application selection is background-unknown\")\n        if self.apps_selection == AppSelection.FIXED:\n            if len(self.apps_selection_fixed_known) == 0:\n                raise ValueError(\"apps_selection_fixed_known has to be specified when application selection is fixed\")\n            bad_apps = [a for a in self.apps_selection_fixed_known + self.apps_selection_fixed_unknown if a not in dataset.available_classes]\n            if len(bad_apps) &gt; 0:\n                raise ValueError(f\"Bad applications in apps_selection_fixed_known or apps_selection_fixed_unknown {bad_apps}. Use applications available in dataset.available_classes\")\n            if len(self.disabled_apps) &gt; 0:\n                raise ValueError(\"disabled_apps cannot be specified when application selection is fixed\")\n            if self.min_train_samples_per_app != 0 and self.min_train_samples_per_app != 100:\n                warnings.warn(\"min_train_samples_per_app is not used when application selection is fixed\")\n            if self.apps_selection_topx != 0 or len(self.apps_selection_background_unknown) &gt; 0:\n                raise ValueError(\"apps_selection_topx and apps_selection_background_unknown cannot be specified when application selection is fixed\")\n        # More asserts\n        bad_disabled_apps = [a for a in self.disabled_apps if a not in dataset.available_classes]\n        if len(bad_disabled_apps) &gt; 0:\n            raise ValueError(f\"Bad applications in disabled_apps {bad_disabled_apps}. 
Use applications available in dataset.available_classes\")\n        if isinstance(self.fit_scalers_samples, float) and (self.fit_scalers_samples &lt;= 0 or self.fit_scalers_samples &gt; 1):\n            raise ValueError(\"fit_scalers_samples has to be either float between 0 and 1 (giving the fraction of training samples used for fitting scalers) or an integer\")\n\n    def get_flowstats_features_len(self) -&gt; int:\n        \"\"\"Gets the number of flow statistics features.\"\"\"\n        return len(self.flowstats_features) + len(self.flowstats_features_boolean) + PHIST_BIN_COUNT * len(self.flowstats_features_phist)\n\n    def get_flowstats_feature_names_expanded(self, shorter_names: bool = False) -&gt; list[str]:\n        \"\"\"Gets names of flow statistics features. Packet histograms are expanded into bin features.\"\"\"\n        phist_mapping = {\n            \"PHIST_SRC_SIZES\": [f\"PSIZE_BIN{i}\" for i in range(1, PHIST_BIN_COUNT + 1)],\n            \"PHIST_DST_SIZES\": [f\"PSIZE_BIN{i}_REV\" for i in range(1, PHIST_BIN_COUNT + 1)],\n            \"PHIST_SRC_IPT\": [f\"IPT_BIN{i}\" for i in range(1, PHIST_BIN_COUNT + 1)],\n            \"PHIST_DST_IPT\": [f\"IPT_BIN{i}_REV\" for i in range(1, PHIST_BIN_COUNT + 1)],\n        }\n        short_names_mapping = {\n            \"FLOW_ENDREASON_IDLE\": \"FEND_IDLE\",\n            \"FLOW_ENDREASON_ACTIVE\": \"FEND_ACTIVE\",\n            \"FLOW_ENDREASON_END\": \"FEND_END\",\n            \"FLOW_ENDREASON_OTHER\": \"FEND_OTHER\",\n            \"FLAG_CWR\": \"F_CWR\",\n            \"FLAG_CWR_REV\": \"F_CWR_REV\",\n            \"FLAG_ECE\": \"F_ECE\",\n            \"FLAG_ECE_REV\": \"F_ECE_REV\",\n            \"FLAG_PSH_REV\": \"F_PSH_REV\",\n            \"FLAG_RST\": \"F_RST\",\n            \"FLAG_RST_REV\": \"F_RST_REV\",\n            \"FLAG_FIN\": \"F_FIN\",\n            \"FLAG_FIN_REV\": \"F_FIN_REV\",\n        }\n        feature_names = self.flowstats_features[:]\n        for f in self.flowstats_features_boolean:\n    
        if shorter_names and f in short_names_mapping:\n                feature_names.append(short_names_mapping[f])\n            else:\n                feature_names.append(f)\n        for f in self.flowstats_features_phist:\n            feature_names.extend(phist_mapping[f])\n        assert len(feature_names) == self.get_flowstats_features_len()\n        return feature_names\n\n    def get_ppi_feature_names(self) -&gt; list[str]:\n        \"\"\"Gets the names of flattened PPI features.\"\"\"\n        ppi_feature_names = [f\"IPT_{i}\" for i in range(1, PPI_MAX_LEN + 1)] + \\\n                               [f\"DIR_{i}\" for i in range(1, PPI_MAX_LEN + 1)] + \\\n                               [f\"SIZE_{i}\" for i in range(1, PPI_MAX_LEN + 1)]\n        if self.use_push_flags:\n            ppi_feature_names += [f\"PUSH_{i}\" for i in range(1, PPI_MAX_LEN + 1)]\n        return ppi_feature_names\n\n    def get_ppi_channels(self) -&gt; list[int]:\n        \"\"\"Gets the available features (channels) in PPI sequences.\"\"\"\n        if self.use_push_flags:\n            return TCP_PPI_CHANNELS\n        else:\n            return UDP_PPI_CHANNELS\n\n    def get_feature_names(self, flatten_ppi: bool = False, shorter_names: bool = False) -&gt; list[str]:\n        \"\"\"\n        Gets feature names.\n\n        Parameters:\n            flatten_ppi: Whether to flatten PPI into individual feature names or keep one `PPI` column.\n        \"\"\"\n        feature_names = self.get_ppi_feature_names() if flatten_ppi else [\"PPI\"]\n        feature_names += self.get_flowstats_feature_names_expanded(shorter_names=shorter_names)\n        return feature_names\n\n    def _get_train_tables_paths(self) -&gt; list[str]:\n        return list(map(lambda t: f\"/flows/D{t}\", self.train_dates))\n\n    def _get_val_tables_paths(self) -&gt; list[str]:\n        if self.val_approach == ValidationApproach.SPLIT_FROM_TRAIN:\n            return list(map(lambda t: f\"/flows/D{t}\", self.train_dates))\n   
     return list(map(lambda t: f\"/flows/D{t}\", self.val_dates))\n\n    def _get_test_tables_paths(self) -&gt; list[str]:\n        return list(map(lambda t: f\"/flows/D{t}\", self.test_dates))\n\n    def _get_train_data_hash(self) -&gt; str:\n        train_data_params = self._get_train_data_params()\n        params_hash = hashlib.sha256(json.dumps(dataclasses.asdict(train_data_params), sort_keys=True, default=str).encode()).hexdigest()\n        params_hash = params_hash[:10]\n        return params_hash\n\n    def _get_train_data_path(self) -&gt; str:\n        if self.need_train_set:\n            params_hash = self._get_train_data_hash()\n            return os.path.join(self.data_root, \"train-data\", f\"{params_hash}_{self.random_state}\", f\"fold_{self.fold_id}\")\n        else:\n            return os.path.join(self.data_root, \"train-data\", \"default\")\n\n    def _get_train_data_params(self) -&gt; TrainDataParams:\n        return TrainDataParams(\n            database_filename=self.database_filename,\n            train_period_name=self.train_period_name,\n            train_tables_paths=self._get_train_tables_paths(),\n            apps_selection=self.apps_selection,\n            apps_selection_topx=self.apps_selection_topx,\n            apps_selection_background_unknown=self.apps_selection_background_unknown,\n            apps_selection_fixed_known=self.apps_selection_fixed_known,\n            apps_selection_fixed_unknown=self.apps_selection_fixed_unknown,\n            disabled_apps=self.disabled_apps,\n            min_train_samples_per_app=self.min_train_samples_per_app,\n            min_train_samples_check=self.min_train_samples_check,)\n\n    def _get_val_data_params_and_path(self, known_apps: list[str], unknown_apps: list[str]) -&gt; tuple[TestDataParams, str]:\n        assert self.val_approach == ValidationApproach.VALIDATION_DATES\n        val_data_params = TestDataParams(\n            database_filename=self.database_filename,\n            
test_period_name=self.val_period_name,\n            test_tables_paths=self._get_val_tables_paths(),\n            known_apps=known_apps,\n            unknown_apps=unknown_apps,)\n        params_hash = hashlib.sha256(json.dumps(dataclasses.asdict(val_data_params), sort_keys=True).encode()).hexdigest()\n        params_hash = params_hash[:10]\n        val_data_path = os.path.join(self.data_root, \"val-data\", f\"{params_hash}_{self.random_state}\")\n        return val_data_params, val_data_path\n\n    def _get_test_data_params_and_path(self, known_apps: list[str], unknown_apps: list[str]) -&gt; tuple[TestDataParams, str]:\n        test_data_params = TestDataParams(\n            database_filename=self.database_filename,\n            test_period_name=self.test_period_name,\n            test_tables_paths=self._get_test_tables_paths(),\n            known_apps=known_apps,\n            unknown_apps=unknown_apps,)\n        params_hash = hashlib.sha256(json.dumps(dataclasses.asdict(test_data_params), sort_keys=True).encode()).hexdigest()\n        params_hash = params_hash[:10]\n        test_data_path = os.path.join(self.data_root, \"test-data\", f\"{params_hash}_{self.random_state}\")\n        return test_data_params, test_data_path\n\n    @model_validator(mode=\"before\") # type: ignore\n    @classmethod\n    def check_deprecated_args(cls, values):\n        kwargs = values.kwargs\n        if \"train_period\" in kwargs:\n            warnings.warn(\"train_period is deprecated. Use train_period_name instead.\")\n            kwargs[\"train_period_name\"] = kwargs[\"train_period\"]\n        if \"val_period\" in kwargs:\n            warnings.warn(\"val_period is deprecated. Use val_period_name instead.\")\n            kwargs[\"val_period_name\"] = kwargs[\"val_period\"]\n        if \"test_period\" in kwargs:\n            warnings.warn(\"test_period is deprecated. 
Use test_period_name instead.\")\n            kwargs[\"test_period_name\"] = kwargs[\"test_period\"]\n        return values\n\n    def __str__(self):\n        _process_tag = yaml.emitter.Emitter.process_tag\n        _ignore_aliases = yaml.Dumper.ignore_aliases\n        yaml.emitter.Emitter.process_tag = lambda self, *args, **kw: None\n        yaml.Dumper.ignore_aliases = lambda self, *args, **kw: True\n        s = yaml.dump(dataclasses.asdict(self), sort_keys=False)\n        yaml.emitter.Emitter.process_tag = _process_tag\n        yaml.Dumper.ignore_aliases = _ignore_aliases\n        return s\n</code></pre>"},{"location":"reference_dataset_config/#config.DatasetConfig-functions","title":"Functions","text":""},{"location":"reference_dataset_config/#config.DatasetConfig.get_flowstats_features_len","title":"get_flowstats_features_len","text":"<pre><code>get_flowstats_features_len() -&gt; int\n</code></pre> <p>Gets the number of flow statistics features.</p> Source code in <code>cesnet_datazoo\\config.py</code> <pre><code>def get_flowstats_features_len(self) -&gt; int:\n    \"\"\"Gets the number of flow statistics features.\"\"\"\n    return len(self.flowstats_features) + len(self.flowstats_features_boolean) + PHIST_BIN_COUNT * len(self.flowstats_features_phist)\n</code></pre>"},{"location":"reference_dataset_config/#config.DatasetConfig.get_flowstats_feature_names_expanded","title":"get_flowstats_feature_names_expanded","text":"<pre><code>get_flowstats_feature_names_expanded(\n    shorter_names: bool = False,\n) -&gt; list[str]\n</code></pre> <p>Gets names of flow statistics features. Packet histograms are expanded into bin features.</p> Source code in <code>cesnet_datazoo\\config.py</code> <pre><code>def get_flowstats_feature_names_expanded(self, shorter_names: bool = False) -&gt; list[str]:\n    \"\"\"Gets names of flow statistics features. 
Packet histograms are expanded into bin features.\"\"\"\n    phist_mapping = {\n        \"PHIST_SRC_SIZES\": [f\"PSIZE_BIN{i}\" for i in range(1, PHIST_BIN_COUNT + 1)],\n        \"PHIST_DST_SIZES\": [f\"PSIZE_BIN{i}_REV\" for i in range(1, PHIST_BIN_COUNT + 1)],\n        \"PHIST_SRC_IPT\": [f\"IPT_BIN{i}\" for i in range(1, PHIST_BIN_COUNT + 1)],\n        \"PHIST_DST_IPT\": [f\"IPT_BIN{i}_REV\" for i in range(1, PHIST_BIN_COUNT + 1)],\n    }\n    short_names_mapping = {\n        \"FLOW_ENDREASON_IDLE\": \"FEND_IDLE\",\n        \"FLOW_ENDREASON_ACTIVE\": \"FEND_ACTIVE\",\n        \"FLOW_ENDREASON_END\": \"FEND_END\",\n        \"FLOW_ENDREASON_OTHER\": \"FEND_OTHER\",\n        \"FLAG_CWR\": \"F_CWR\",\n        \"FLAG_CWR_REV\": \"F_CWR_REV\",\n        \"FLAG_ECE\": \"F_ECE\",\n        \"FLAG_ECE_REV\": \"F_ECE_REV\",\n        \"FLAG_PSH_REV\": \"F_PSH_REV\",\n        \"FLAG_RST\": \"F_RST\",\n        \"FLAG_RST_REV\": \"F_RST_REV\",\n        \"FLAG_FIN\": \"F_FIN\",\n        \"FLAG_FIN_REV\": \"F_FIN_REV\",\n    }\n    feature_names = self.flowstats_features[:]\n    for f in self.flowstats_features_boolean:\n        if shorter_names and f in short_names_mapping:\n            feature_names.append(short_names_mapping[f])\n        else:\n            feature_names.append(f)\n    for f in self.flowstats_features_phist:\n        feature_names.extend(phist_mapping[f])\n    assert len(feature_names) == self.get_flowstats_features_len()\n    return feature_names\n</code></pre>"},{"location":"reference_dataset_config/#config.DatasetConfig.get_ppi_feature_names","title":"get_ppi_feature_names","text":"<pre><code>get_ppi_feature_names() -&gt; list[str]\n</code></pre> <p>Gets the names of flattened PPI features.</p> Source code in <code>cesnet_datazoo\\config.py</code> <pre><code>def get_ppi_feature_names(self) -&gt; list[str]:\n    \"\"\"Gets the names of flattened PPI features.\"\"\"\n    ppi_feature_names = [f\"IPT_{i}\" for i in range(1, PPI_MAX_LEN + 1)] + \\\n               
            [f\"DIR_{i}\" for i in range(1, PPI_MAX_LEN + 1)] + \\\n                           [f\"SIZE_{i}\" for i in range(1, PPI_MAX_LEN + 1)]\n    if self.use_push_flags:\n        ppi_feature_names += [f\"PUSH_{i}\" for i in range(1, PPI_MAX_LEN + 1)]\n    return ppi_feature_names\n</code></pre>"},{"location":"reference_dataset_config/#config.DatasetConfig.get_ppi_channels","title":"get_ppi_channels","text":"<pre><code>get_ppi_channels() -&gt; list[int]\n</code></pre> <p>Gets the available features (channels) in PPI sequences.</p> Source code in <code>cesnet_datazoo\\config.py</code> <pre><code>def get_ppi_channels(self) -&gt; list[int]:\n    \"\"\"Gets the available features (channels) in PPI sequences.\"\"\"\n    if self.use_push_flags:\n        return TCP_PPI_CHANNELS\n    else:\n        return UDP_PPI_CHANNELS\n</code></pre>"},{"location":"reference_dataset_config/#config.DatasetConfig.get_feature_names","title":"get_feature_names","text":"<pre><code>get_feature_names(\n    flatten_ppi: bool = False, shorter_names: bool = False\n) -&gt; list[str]\n</code></pre> <p>Gets feature names.</p> <p>Parameters:</p> Name Type Description Default <code>flatten_ppi</code> <code>bool</code> <p>Whether to flatten PPI into individual feature names or keep one <code>PPI</code> column.</p> <code>False</code> Source code in <code>cesnet_datazoo\\config.py</code> <pre><code>def get_feature_names(self, flatten_ppi: bool = False, shorter_names: bool = False) -&gt; list[str]:\n    \"\"\"\n    Gets feature names.\n\n    Parameters:\n        flatten_ppi: Whether to flatten PPI into individual feature names or keep one `PPI` column.\n    \"\"\"\n    feature_names = self.get_ppi_feature_names() if flatten_ppi else [\"PPI\"]\n    feature_names += self.get_flowstats_feature_names_expanded(shorter_names=shorter_names)\n    return feature_names\n</code></pre>"},{"location":"reference_dataset_config/#enums-for-configuration","title":"Enums for configuration","text":"<p>The following 
enums are used for dataset configuration.</p>"},{"location":"reference_dataset_config/#config.ValidationApproach","title":"config.ValidationApproach","text":"<p>The validation approach defines which samples should be used for creating a validation set.</p> SPLIT_FROM_TRAIN <code>class-attribute</code> <code>instance-attribute</code> <pre><code>SPLIT_FROM_TRAIN = 'split-from-train'\n</code></pre> <p>Split train data into train and validation. Scikit-learn <code>train_test_split</code> is used to create a random stratified validation set. The fraction of validation samples is defined in <code>train_val_split_fraction</code>.</p> VALIDATION_DATES <code>class-attribute</code> <code>instance-attribute</code> <pre><code>VALIDATION_DATES = 'validation-dates'\n</code></pre> <p>Use separate validation dates to create a validation set. Validation dates need to be specified in <code>val_dates</code>, and the name of the validation period in <code>val_period_name</code>.</p>"},{"location":"reference_dataset_config/#config.AppSelection","title":"config.AppSelection","text":"<p>Applications can be divided into known and unknown classes. To use a dataset in the standard closed-world setting, use <code>ALL_KNOWN</code> to select all the applications as known. Use <code>TOPX_KNOWN</code> or <code>BACKGROUND_UNKNOWN</code> for the open-world setting and evaluation of out-of-distribution or open-set recognition methods. The <code>FIXED</code> is for manual selection of known and unknown applications.</p> ALL_KNOWN <code>class-attribute</code> <code>instance-attribute</code> <pre><code>ALL_KNOWN = 'all-known'\n</code></pre> <p>Use all applications as known.</p> TOPX_KNOWN <code>class-attribute</code> <code>instance-attribute</code> <pre><code>TOPX_KNOWN = 'topx-known'\n</code></pre> <p>Use the first X (<code>apps_selection_topx</code>) most frequent (with the most samples) applications as known, and the rest as unknown. 
Applications with the same provider are never separated, i.e., all applications of a given provider are either known or unknown.</p> BACKGROUND_UNKNOWN <code>class-attribute</code> <code>instance-attribute</code> <pre><code>BACKGROUND_UNKNOWN = 'background-unknown'\n</code></pre> <p>Use the list of background traffic classes (<code>apps_selection_background_unknown</code>) as unknown, and the rest as known.</p> FIXED <code>class-attribute</code> <code>instance-attribute</code> <pre><code>FIXED = 'fixed'\n</code></pre> <p>Manual application selection. Provide lists of known applications (<code>apps_selection_fixed_known</code>) and unknown applications (<code>apps_selection_fixed_unknown</code>).</p>"},{"location":"reference_dataset_config/#config.MinTrainSamplesCheck","title":"config.MinTrainSamplesCheck","text":"<p>Depending on the selected train dates, there might be applications with not enough samples for training (what is not enough will depend on the selected classification model). The threshold for the minimum number of samples can be set with <code>min_train_samples_per_app</code>, and its default value is 100. With the <code>DISABLE_APPS</code> approach, these applications will be disabled and not used for training or testing. With the <code>WARN_AND_EXIT</code> approach, the script will print a warning and exit if applications with not enough samples are encountered. To disable this check, set <code>min_train_samples_per_app</code> to 0.</p> WARN_AND_EXIT <code>class-attribute</code> <code>instance-attribute</code> <pre><code>WARN_AND_EXIT = 'warn-and-exit'\n</code></pre> <p>Warn and exit if there are not enough training samples for some applications. 
It is up to the user to manually add these applications to <code>disabled_apps</code>.</p> DISABLE_APPS <code>class-attribute</code> <code>instance-attribute</code> <pre><code>DISABLE_APPS = 'disable-apps'\n</code></pre> <p>Disable applications with not enough training samples.</p>"},{"location":"reference_dataset_config/#config.DataLoaderOrder","title":"config.DataLoaderOrder","text":"<p>Validation and test sets are always loaded in sequential order \u2014 sequential meaning in the order of dates and time. However, for the train set, it is sometimes required to iterate it in random order (for example, for training a neural network). Thus, use <code>RANDOM</code> if your classification model requires it; <code>SEQUENTIAL</code> otherwise. This setting affects only train_dataloader. Dataframe get_train_df is always created in sequential order.</p> RANDOM <code>class-attribute</code> <code>instance-attribute</code> <pre><code>RANDOM = 'random'\n</code></pre> <p>Iterate train data in random order.</p> SEQUENTIAL <code>class-attribute</code> <code>instance-attribute</code> <pre><code>SEQUENTIAL = 'sequential'\n</code></pre> <p>Iterate train data in sequential (datetime) order.</p>"},{"location":"reference_datasets/","title":"Dataset classes","text":"<p>These are subclasses of <code>CesnetDataset</code> representing individual datasets available in <code>cesnet-datazoo</code>.</p>"},{"location":"reference_datasets/#datasets.datasets.CESNET_TLS22","title":"datasets.datasets.CESNET_TLS22","text":"<p>             Bases: <code>CesnetDataset</code></p> <p>Dataset class for CESNET-TLS22.</p> Source code in <code>cesnet_datazoo\\datasets\\datasets.py</code> <pre><code>class CESNET_TLS22(CesnetDataset):\n    \"\"\"Dataset class for [CESNET-TLS22][cesnet-tls22].\"\"\"\n    name = \"CESNET-TLS22\"\n    database_filename = \"CESNET-TLS22.h5\"\n    bucket_url = \"https://liberouter.org/datazoo/download?bucket=cesnet-tls22\"\n    available_dates = _CESNET_TLS22_AVAILABLE_DATES\n    
time_periods = {\n        \"W-2021-40\": [\"20211004\", \"20211005\", \"20211006\", \"20211007\", \"20211008\", \"20211009\", \"20211010\"],\n        \"W-2021-41\": [\"20211011\", \"20211012\", \"20211013\", \"20211014\", \"20211015\", \"20211016\", \"20211017\"],\n    }\n    default_train_period_name = \"W-2021-40\"\n    default_test_period_name = \"W-2021-41\"\n    _tables_app_enum = _CESNET_TLS22_TABLES_APP_ENUM\n    _tables_cat_enum = _CESNET_TLS22_TABLES_CATEGORY_ENUM\n</code></pre>"},{"location":"reference_datasets/#datasets.datasets.CESNET_QUIC22","title":"datasets.datasets.CESNET_QUIC22","text":"<p>             Bases: <code>CesnetDataset</code></p> <p>Dataset class for CESNET-QUIC22.</p> Source code in <code>cesnet_datazoo\\datasets\\datasets.py</code> <pre><code>class CESNET_QUIC22(CesnetDataset):\n    \"\"\"Dataset class for [CESNET-QUIC22][cesnet-quic22].\"\"\"\n    name = \"CESNET-QUIC22\"\n    database_filename = \"CESNET-QUIC22.h5\"\n    bucket_url = \"https://liberouter.org/datazoo/download?bucket=cesnet-quic22\"\n    available_dates = _CESNET_QUIC22_AVAILABLE_DATES\n    time_periods = {\n        \"W-2022-44\": [\"20221031\", \"20221101\", \"20221102\", \"20221103\", \"20221104\", \"20221105\", \"20221106\"],\n        \"W-2022-45\": [\"20221107\", \"20221108\", \"20221109\", \"20221110\", \"20221111\", \"20221112\", \"20221113\"],\n        \"W-2022-46\": [\"20221114\", \"20221115\", \"20221116\", \"20221117\", \"20221118\", \"20221119\", \"20221120\"],\n        \"W-2022-47\": [\"20221121\", \"20221122\", \"20221123\", \"20221124\", \"20221125\", \"20221126\", \"20221127\"],\n        \"W45-47\": [\"20221107\", \"20221108\", \"20221109\", \"20221110\", \"20221111\", \"20221112\", \"20221113\",\n                   \"20221114\", \"20221115\", \"20221116\", \"20221117\", \"20221118\", \"20221119\", \"20221120\",\n                   \"20221121\", \"20221122\", \"20221123\", \"20221124\", \"20221125\", \"20221126\", \"20221127\"],\n    }\n    
default_train_period_name = \"W-2022-44\"\n    default_test_period_name = \"W-2022-45\"\n    _tables_app_enum = _CESNET_QUIC22_TABLES_APP_ENUM\n    _tables_cat_enum = _CESNET_QUIC22_TABLES_CATEGORY_ENUM\n</code></pre>"},{"location":"reference_datasets/#datasets.datasets.CESNET_TLS_Year22","title":"datasets.datasets.CESNET_TLS_Year22","text":"<p>             Bases: <code>CesnetDataset</code></p> <p>Dataset class for CESNET-TLS-Year22.</p> Source code in <code>cesnet_datazoo\\datasets\\datasets.py</code> <pre><code>class CESNET_TLS_Year22(CesnetDataset):\n    \"\"\"Dataset class for [CESNET-TLS-Year22][cesnet-tls-year22].\"\"\"\n    name = \"CESNET-TLS-Year22\"\n    database_filename = \"CESNET-TLS-Year22.h5\"\n    bucket_url = \"https://liberouter.org/datazoo/download?bucket=cesnet-tls-year22\"\n    available_dates = _CESNET_TLS_YEAR22_AVAILABLE_DATES\n    time_periods = _CESNET_TLS_YEAR22_TIME_PERIODS\n    default_train_period_name = \"M-2022-9\"\n    default_test_period_name = \"M-2022-10\"\n    _tables_app_enum = _CESNET_TLS_YEAR22_TABLES_APP_ENUM\n    _tables_cat_enum = _CESNET_TLS_YEAR22_TABLES_CATEGORY_ENUM\n</code></pre>"},{"location":"transforms/","title":"Transforms","text":"<p>The <code>cesnet_datazoo</code> package supports configurable transforms of input data in a similar fashion to what torchvision is doing for the computer vision field. Input features are split into three groups, each having its own transformation. Those groups are PPI sequences, flow statistics, and packet histograms.</p> <ul> <li>Transformation configured in <code>ppi_transform</code> of <code>DatasetConfig</code> is applied to PPI sequences.</li> <li><code>flowstats_transform</code> is applied to flow statistics (excluding boolean features, such as flow end reasons or TCP flags).</li> <li><code>flowstats_phist_transform</code> is applied to packet histograms.</li> </ul> <p>Transforms are implemented in a separate package CESNET Models. 
See <code>cesnet_models.transforms</code> documentation for details.</p> <p>Limitations</p> <p>The current implementation does not support the composing of transformations.</p>"},{"location":"transforms/#available-transformations","title":"Available transformations","text":"<p>PPI sequences</p> <ul> <li>ClipAndScalePPI</li> </ul> <p>Flow statistics</p> <ul> <li>ClipAndScaleFlowstats</li> </ul> <p>Packet histograms</p> <ul> <li>NormalizeHistograms</li> </ul> <p>More transformations will be implemented in future versions.</p>"},{"location":"transforms/#data-scaling","title":"Data scaling","text":"<p>Transformations implementing data scaling will be fitted, if needed, on a subset of training data during dataset initialization.</p>"}]}
\ No newline at end of file
+{"config":{"lang":["en"],"separator":"[\\s\\-]+","pipeline":["stopWordFilter"]},"docs":[{"location":"","title":"CESNET DataZoo","text":"<p>This is the documentation of the CESNET DataZoo project. </p> <p>The goal of this project is to provide tools for working with large network traffic datasets and to facilitate research in the traffic classification area. The core functions of the <code>cesnet-datazoo</code> package are:</p> <ul> <li>A common API for downloading, configuring, and loading of three public datasets of encrypted network traffic \u2014 CESNET-TLS22, CESNET-QUIC22, and CESNET-TLS-Year22. Details about the available datasets are on the dataset overview page.</li> <li>Provides standard features used for traffic classification, such as sizes, directions, and inter-packet times of the first 30 packets of each flow. More details on the data features page.</li> <li>Extensive configuration options for:<ul> <li>Selection of train, validation, and test periods. The datasets span from two weeks to one year; therefore, it is possible to evaluate classification methods in a time-based fashion that is closer to practical deployment.</li> <li>Selection of application classes and splitting classes between known and unknown. This enables research in the open-world setting, in which classification models need to handle new classes that were not seen during the training process.</li> <li>Data transformations, such as feature scaling. Transforms are implemented in a separate package CESNET Models. See <code>cesnet_models.transforms</code> documentation for details.</li> </ul> </li> <li>Built on suitable data structures for experiments with large datasets. There are several caching mechanisms to make repeated runs faster, for example, when searching for the best model configuration.</li> <li>Datasets are offered in multiple sizes to give users an option to start experiments at a smaller scale (also faster dataset download, disk space, etc.). 
The default is the <code>S</code> size containing 25 million samples. </li> </ul>"},{"location":"#papers","title":"Papers","text":"<ul> <li>DataZoo: Streamlining Traffic Classification Experiments  Jan Luxemburk and Karel Hynek  CoNEXT Workshop on Explainable and Safety Bounded, Fidelitous, Machine Learning for Networking (SAFE), 2023</li> </ul>"},{"location":"dataloaders/","title":"Using dataloaders","text":"<p>Apart from loading data into dataframes, the <code>cesnet-datazoo</code> package provides dataloaders for processing data in smaller batches.</p> <p>An example of how dataloaders can be used is in <code>cesnet_datazoo.datasets.loaders</code> or in the following snippet:</p> <pre><code>def load_from_dataloader(dataloader: DataLoader):\n    other_fields = []\n    data_ppi = []\n    data_flowstats = []\n    labels = []\n    for batch_other_fields, batch_ppi, batch_flowstats, batch_labels in dataloader:\n        other_fields.append(batch_other_fields)\n        data_ppi.append(batch_ppi)\n        data_flowstats.append(batch_flowstats)\n        labels.append(batch_labels)\n    df_other_fields = pd.concat(other_fields, ignore_index=True)\n    data_ppi = np.concatenate(data_ppi)\n    data_flowstats = np.concatenate(data_flowstats)\n    labels = np.concatenate(labels)\n    return df_other_fields, data_ppi, data_flowstats, labels\n</code></pre> <p>When a dataloader is iterated, the returned data are in the format <code>tuple(batch_other_fields,  batch_ppi, batch_flowstats, batch_labels)</code>. Batch size B is configured with <code>batch_size</code> and <code>test_batch_size</code> config options. The shapes are:</p> <ul> <li>batch_other_fields <code>pd.DataFrame (B, C)</code> - a Pandas DataFrame with auxiliary fields, such as communicating hosts, flow times, and more fields extracted from the ClientHello message. If the <code>return_other_fields</code> config option is false, this will be an empty DataFrame. 
Columns C depend on the used dataset and are available at <code>dataset_config.other_fields</code>.</li> <li>batch_ppi - <code>np.ndarray (B, [3, 4], 30)</code> - the middle dimension is either 4 when TCP push flags are used (<code>use_push_flags</code>) or 3 otherwise.</li> <li>batch_flowstats <code>np.ndarray (B, F)</code> - where F is the number of flowstats features computed with DatasetConfig.get_flowstats_features_len. To get the order and names of flowstats features, call DatasetConfig.get_flowstats_feature_names_expanded. The batch_flowstats array includes flow statistics, TCP features (if available and configured), and bins of packet histograms (if available and configured). See the data features page for more information about features.</li> <li>batch_labels <code>np.ndarray (B)</code> - integer labels encoded with a <code>LabelEncoder</code> instance available at <code>dataset.class_info.encoder</code>.</li> </ul> <p>PPI and flow statistics features returned from dataloaders are transformed depending on the selected configuration. 
See the transforms page for more information.</p>"},{"location":"dataset_metadata/","title":"DatasetMetadata","text":"<p>Each dataset class has its metadata available as a <code>DatasetMetadata</code> instance in the <code>metadata</code> attribute.</p>"},{"location":"dataset_metadata/#metadata","title":"Metadata","text":"Name CESNET-TLS22 CESNET-QUIC22 CESNET-TLS-Year22 Protocol TLS QUIC TLS Published in 2022 2023 2023 Collected in 2021 2022 2022 Collection duration 2 weeks 4 weeks 1 year Available samples 141392195 153226273 507739073 Available dataset sizes XS, S, M, L XS, S, M, L XS, S, M, L Collection period 4.10.2021 - 17.10.2021 31.10.2022 - 27.11.2022 1.1.2022 - 31.12.2022 Missing dates in collection period 20220128, 20220129, 20220130, 20221212, 20221213, 20221229, 20221230, 20221231 Application count 191 102 180 Background traffic classes default-background, google-background, facebook-background PPI features IPT, DIR, SIZE IPT, DIR, SIZE IPT, DIR, SIZE, PUSH_FLAG Flowstats features BYTES, BYTES_REV, PACKETS, PACKETS_REV, DURATION, PPI_LEN, PPI_ROUNDTRIPS, PPI_DURATION BYTES, BYTES_REV, PACKETS, PACKETS_REV, DURATION, PPI_LEN, PPI_ROUNDTRIPS, PPI_DURATION BYTES, BYTES_REV, PACKETS, PACKETS_REV, DURATION, PPI_LEN, PPI_ROUNDTRIPS, PPI_DURATION Flowstats features boolean FLOW_ENDREASON_IDLE, FLOW_ENDREASON_ACTIVE, FLOW_ENDREASON_OTHER FLOW_ENDREASON_IDLE, FLOW_ENDREASON_ACTIVE, FLOW_ENDREASON_END, FLOW_ENDREASON_OTHER Packet histograms PHIST_SRC_SIZES, PHIST_DST_SIZES, PHIST_SRC_IPT, PHIST_DST_IPT PHIST_SRC_SIZES, PHIST_DST_SIZES, PHIST_SRC_IPT, PHIST_DST_IPT TCP features FLAG_CWR, FLAG_CWR_REV, FLAG_ECE, FLAG_ECE_REV, FLAG_URG, FLAG_URG_REV, FLAG_ACK, FLAG_ACK_REV, FLAG_PSH, FLAG_PSH_REV, FLAG_RST, FLAG_RST_REV, FLAG_SYN, FLAG_SYN_REV, FLAG_FIN, FLAG_FIN_REV FLAG_CWR, FLAG_CWR_REV, FLAG_ECE, FLAG_ECE_REV, FLAG_URG, FLAG_URG_REV, FLAG_ACK, FLAG_ACK_REV, FLAG_PSH, FLAG_PSH_REV, FLAG_RST, FLAG_RST_REV, FLAG_SYN, FLAG_SYN_REV, FLAG_FIN, FLAG_FIN_REV Other 
fields ID ID, SRC_IP, DST_IP, DST_ASN, SRC_PORT, DST_PORT, PROTOCOL, QUIC_VERSION, QUIC_SNI, QUIC_USERAGENT, TIME_FIRST, TIME_LAST ID, SRC_IP, DST_IP, DST_ASN, DST_PORT, PROTOCOL, TLS_SNI, TLS_JA3, TIME_FIRST, TIME_LAST Cite https://doi.org/10.1016/j.comnet.2022.109467 https://doi.org/10.1016/j.dib.2023.108888 Zenodo URL https://zenodo.org/record/7965515 https://zenodo.org/record/7963302 Related papers https://doi.org/10.23919/TMA58422.2023.10199052"},{"location":"datasets_overview/","title":"Overview of datasets","text":""},{"location":"datasets_overview/#cesnet-tls22","title":"CESNET-TLS22","text":"<p>CESNET-TLS22</p> <ul> <li>TLS protocol</li> <li>Collected in 2021</li> <li>Spans two weeks</li> <li>Contains 141 million samples</li> <li>Has 191 application classes</li> </ul> <p>This dataset was published in \"Fine-grained TLS services classification with reject option\" (DOI, arXiv). It was built from live traffic collected using high-speed monitoring probes at the perimeter of the CESNET2 network.</p> <p>For detailed information about the dataset, see the linked paper and the dataset metadata page.</p>"},{"location":"datasets_overview/#cesnet-quic22","title":"CESNET-QUIC22","text":"<p>CESNET-QUIC22</p> <ul> <li>QUIC protocol</li> <li>Collected in 2022</li> <li>Spans four weeks</li> <li>Contains 153 million samples</li> <li>Has 102 application classes and three background traffic classes</li> </ul> <p>This dataset was published in \"CESNET-QUIC22: A large one-month QUIC network traffic dataset from backbone lines\" (DOI). The QUIC protocol has the potential to replace TLS over TCP as the standard protocol for reliable and secure Internet communication. Due to its design that makes the inspection of connection handshakes challenging and its usage in HTTP/3, there is an increasing demand for QUIC traffic classification methods.</p> <p>For detailed information about the dataset, see the linked paper and the dataset metadata page. 
Experiments based on this dataset were published in \"Encrypted traffic classification: the QUIC case\" (DOI).</p>"},{"location":"datasets_overview/#cesnet-tls-year22","title":"CESNET-TLS-Year22","text":"<p>CESNET-TLS-Year22</p> <ul> <li>TLS protocol</li> <li>Collected in 2022</li> <li>Spans one year</li> <li>Contains 507 million samples</li> <li>Has 180 application classes</li> </ul> <p>This dataset is similar to CESNET-TLS22; however, it spans the entire year 2022. It will be published in the near future.</p>"},{"location":"features/","title":"Features","text":"<p>This page provides a description of individual data features in the datasets. Features available in each dataset are listed on the dataset metadata page.</p>"},{"location":"features/#ppi-sequence","title":"PPI sequence","text":"<p>A per-packet information (PPI) sequence is a 2D matrix describing the first 30 packets of a flow. For flows shorter than 30 packets, the PPI sequence is padded with zeros. Set <code>use_push_flags</code> for using PUSH flags in PPI sequences, if available in the used dataset.</p> Name Description SIZE Size of the transport payload IPT Inter-packet time in milliseconds. The IPT of the first packet is set to zero DIR Direction of the packet encoded as \u00b11 PUSH_FLAG Whether the push flag was set in the TCP packet"},{"location":"features/#flow-statistics","title":"Flow statistics","text":"<p>Flow statistics are standard features describing the entire flow (with exceptions of PPI_ features that relate to the PPI sequence of the given flow). 
_REV features correspond to the reverse (server to client) direction.</p> Name Description DURATION Duration of the flow in seconds BYTES Number of transmitted bytes from client to server BYTES_REV Number of transmitted bytes from server to client PACKETS Number of packets transmitted from client to server PACKETS_REV Number of packets transmitted from server to client PPI_LEN Number of packets in the PPI sequence PPI_DURATION Duration of the PPI sequence in seconds PPI_ROUNDTRIPS Number of roundtrips in the PPI sequence FLOW_ENDREASON_IDLE Flow was terminated because it was idle FLOW_ENDREASON_ACTIVE Flow was terminated because it reached the active timeout FLOW_ENDREASON_OTHER Flow was terminated for other reasons"},{"location":"features/#packet-histograms","title":"Packet histograms","text":"<p>Packet histograms include binned counts of packet sizes and inter-packet times of the entire flow. There are 8 bins with a logarithmic scale; the intervals are 0\u201315, 16\u201331, 32\u201363, 64\u2013127, 128\u2013255, 256\u2013511, 512\u20131024, &gt;1024 [ms or B]. The units are milliseconds for inter-packet times and bytes for packet sizes. The histograms are built from all packets of the entire flow, unlike PPI sequences that describe the first 30 packets. Set <code>use_packet_histograms</code> for using packet histograms features, if available in the dataset.</p> Name Description PSIZE_BIN{x} Packet sizes histogram x-th bin for the forward direction PSIZE_BIN{x}_REV Packet sizes histogram x-th bin for the reverse direction IPT_BIN{x} Inter-packet times histogram x-th bin for the forward direction IPT_BIN{x}_REV Inter-packet times histogram x-th bin for the reverse direction <p>On the dataset metadata page, packet histogram features are called <code>PHIST_SRC_SIZES</code>, <code>PHIST_DST_SIZES</code>, <code>PHIST_SRC_IPT</code>, <code>PHIST_DST_IPT</code>. 
Those are the names of database columns that are flattened to the _BIN{x} features.</p>"},{"location":"features/#tcp-features","title":"TCP features","text":"<p>Datasets with TLS over TCP traffic contain features indicating the presence of individual TCP flags in the flow. Set <code>use_tcp_features</code> for using a subset of flags defined in <code>cesnet_datazoo.constants.SELECTED_TCP_FLAGS</code>.</p> Name Description FLAG_{F} Whether F flag was present in the forward (client to server) direction FLAG_{F}_REV Whether F flag was present in the reverse (server to client) direction"},{"location":"features/#other-fields","title":"Other fields","text":"<p>Datasets contain auxiliary information about samples, such as communicating hosts, flow times, and more fields extracted from the ClientHello message. The dataset metadata page lists available fields in individual datasets.  Set <code>return_other_fields</code> to include those fields in returned dataframes. See using dataloaders for how other fields are handled in dataloaders.</p> Name Description ID Per-dataset unique flow identifier TIME_FIRST Timestamp of the first packet TIME_LAST Timestamp of the last packet SRC_IP Source IP address DST_IP Destination IP address DST_ASN Destination Autonomous System number SRC_PORT Source port DST_PORT Destination port PROTOCOL Transport protocol TLS_SNI / QUIC_SNI Server Name Indication domain TLS_JA3 JA3 fingerprint QUIC_VERSION QUIC protocol version QUIC_USER_AGENT User agent string if available in the QUIC Initial Packet"},{"location":"features/#details-about-packet-histograms-and-ppi","title":"Details about packet histograms and PPI","text":"<p>Due to differences in implementation between packet sequences (pstats.cpp) and packet histogram (phist.cpp) plugins of the ipfixprobe exporter, the number of packets in PPI and histograms can differ (even for flows shorter than 30 packets). The differences are summarized in the following table. 
Note that this applies to TLS over TCP datasets.</p> TLS over TCP datasets Packet histograms PPI sequence PACKETS and PACKETS_REV Zero-length packets (without L4 payload, e.g. ACKs) Not included Not included Included Retransmissions (and out-of-order packets) Included Not included* Included Computed from Entire flow First 30 packets Entire flow <p>*The implementation for the detection of TCP retransmissions and out-of-order packets is far from perfect. Packets with a non-increasing SEQ number are skipped.</p> <p>For QUIC, there is no detection of retransmissions or out-of-order packets, and QUIC acknowledgment packets are included in both packet sequences and packet histograms.</p>"},{"location":"getting_started/","title":"Getting started","text":""},{"location":"getting_started/#jupyter-notebooks","title":"Jupyter notebooks","text":"<p>Example Jupyter notebooks are provided at https://github.com/CESNET/cesnet-tcexamples. Start with:</p> <ul> <li>Initialize the CESNET-QUIC22 dataset and explore its data features - explore_data.ipynb</li> <li>Training of a LightGBM classifier and its evaluation on a per-week and per-day basis - example_evaluation.ipynb</li> </ul>"},{"location":"getting_started/#code-snippets","title":"Code snippets","text":""},{"location":"getting_started/#download-a-dataset-and-compute-statistics","title":"Download a dataset and compute statistics","text":"<p><pre><code>from cesnet_datazoo.datasets import CESNET_QUIC22\ndataset = CESNET_QUIC22(\"/datasets/CESNET-QUIC22/\", size=\"XS\")\ndataset.compute_dataset_statistics(num_samples=100_000, num_workers=0)\n</code></pre> This will download the dataset, compute dataset statistics, and save them into <code>/datasets/CESNET-QUIC22/XS/statistics</code>.</p>"},{"location":"getting_started/#enable-logging-and-set-the-spawn-method-on-windows","title":"Enable logging and set the spawn method on Windows","text":"<p><pre><code>import logging\nimport multiprocessing as mp\n\nmp.set_start_method(\"spawn\") 
\nlogging.basicConfig(\n    level=logging.INFO,\n    format=\"[%(asctime)s][%(name)s][%(levelname)s] - %(message)s\")\n</code></pre> For running on Windows, we recommend using the <code>spawn</code> method for creating dataloader worker processes. Set up logging to get more information from the package.</p>"},{"location":"getting_started/#initialize-dataset-to-create-train-validation-and-test-dataframes","title":"Initialize dataset to create train, validation, and test dataframes","text":"<pre><code>from cesnet_datazoo.datasets import CESNET_QUIC22\nfrom cesnet_datazoo.config import DatasetConfig, AppSelection\n\ndataset = CESNET_QUIC22(\"/datasets/CESNET-QUIC22/\", size=\"XS\")\ndataset_config = DatasetConfig(\n    dataset=dataset,\n    apps_selection=AppSelection.ALL_KNOWN,\n    train_period_name=\"W-2022-44\",\n    test_period_name=\"W-2022-45\",\n)\ndataset.set_dataset_config_and_initialize(dataset_config)\ntrain_dataframe = dataset.get_train_df()\nval_dataframe = dataset.get_val_df()\ntest_dataframe = dataset.get_test_df()\n</code></pre> <p>The <code>DatasetConfig</code> class handles the configuration of datasets, and calling <code>set_dataset_config_and_initialize</code> initializes train, validation, and test sets with the desired configuration. Data can be read into Pandas DataFrames as shown here or via PyTorch DataLoaders. 
See <code>CesnetDataset</code> reference.</p>"},{"location":"installation/","title":"Installation","text":"<p>Install the package from pip with:</p> <pre><code>pip install cesnet-datazoo\n</code></pre> <p>or for editable install with:</p> <pre><code>pip install -e git+https://github.com/CESNET/cesnet-datazoo\n</code></pre>"},{"location":"installation/#requirements","title":"Requirements","text":"<p>The <code>cesnet-datazoo</code> package requires Python &gt;=3.10.</p>"},{"location":"installation/#dependencies","title":"Dependencies","text":"Name Version matplotlib numpy pandas pydantic &gt;=2.0 PyYAML requests scikit-learn seaborn tables &gt;=3.8.0 torch &gt;=1.10 tqdm"},{"location":"reference_cesnet_dataset/","title":"Base dataset class","text":""},{"location":"reference_cesnet_dataset/#datasets.cesnet_dataset.CesnetDataset","title":"datasets.cesnet_dataset.CesnetDataset","text":"<p>The main class for accessing CESNET datasets. It handles downloading, train/validation/test splitting, and class selection. Access to data is provided through:</p> <ul> <li>Iterable PyTorch DataLoader for batch processing. See using dataloaders for more details.</li> <li>Pandas DataFrame for loading the entire train, validation, or test set at once.</li> </ul> <p>The dataset is stored in a PyTables database. The internal <code>PyTablesDataset</code> class is used as a wrapper that implements the PyTorch <code>Dataset</code> interface and is compatible with <code>DataLoader</code>, which provides efficient parallel loading of the data. The dataset configuration is done through the <code>DatasetConfig</code> class.</p> <p>Intended usage:</p> <ol> <li>Create an instance of the dataset class with the desired size and data root. This will download the dataset if it has not already been downloaded.</li> <li>Create an instance of <code>DatasetConfig</code> and set it with <code>set_dataset_config_and_initialize</code>. 
This will initialize the dataset \u2014 select classes, split data into train/validation/test sets, and fit data scalers if needed. All is done according to the provided configuration and is cached for later use.</li> <li>Use <code>get_train_dataloader</code> or <code>get_train_df</code> to get training data for a classification model.</li> <li>Validate the model and perform the hyperparameter optimization on <code>get_val_dataloader</code> or <code>get_val_df</code>.</li> <li>Evaluate the model on <code>get_test_dataloader</code> or <code>get_test_df</code>.</li> </ol> <p>Parameters:</p> Name Type Description Default <code>data_root</code> <code>str</code> <p>Path to the folder where the dataset will be stored. Each dataset size has its own subfolder <code>data_root/size</code></p> required <code>size</code> <code>str</code> <p>Size of the dataset. Options are <code>XS</code>, <code>S</code>, <code>M</code>, <code>L</code>, <code>ORIG</code>.</p> <code>'S'</code> <code>silent</code> <code>bool</code> <p>Whether to suppress print and tqdm output.</p> <code>False</code> <p>Attributes:</p> Name Type Description <code>name</code> <code>str</code> <p>Name of the dataset.</p> <code>database_filename</code> <code>str</code> <p>Name of the database file.</p> <code>database_path</code> <code>str</code> <p>Path to the database file.</p> <code>servicemap_path</code> <code>str</code> <p>Path to the servicemap file.</p> <code>statistics_path</code> <code>str</code> <p>Path to the dataset statistics folder.</p> <code>bucket_url</code> <code>str</code> <p>URL of the bucket where the database is stored.</p> <code>metadata</code> <code>DatasetMetadata</code> <p>Additional dataset metadata.</p> <code>available_classes</code> <code>list[str]</code> <p>List of all available classes in the dataset.</p> <code>available_dates</code> <code>list[str]</code> <p>List of all available dates in the dataset.</p> <code>time_periods</code> <code>dict[str, list[str]]</code> <p>Predefined time 
periods. Each time period is a list of dates.</p> <code>default_train_period_name</code> <code>str</code> <p>Default time period for training.</p> <code>default_test_period_name</code> <code>str</code> <p>Default time period for testing.</p> <p>The following attributes are initialized when <code>set_dataset_config_and_initialize</code> is called.</p> <p>Attributes:</p> Name Type Description <code>dataset_config</code> <code>Optional[DatasetConfig]</code> <p>Configuration of the dataset.</p> <code>class_info</code> <code>Optional[ClassInfo]</code> <p>Structured information about the classes.</p> <code>dataset_indices</code> <code>Optional[IndicesTuple]</code> <p>Named tuple containing <code>train_indices</code>, <code>val_known_indices</code>, <code>val_unknown_indices</code>, <code>test_known_indices</code>, <code>test_unknown_indices</code>. These are the indices into PyTables database that define train, validation, and test sets.</p> <code>train_dataset</code> <code>Optional[PyTablesDataset]</code> <p>Train set in the form of <code>PyTablesDataset</code> instance wrapping the PyTables database.</p> <code>val_dataset</code> <code>Optional[PyTablesDataset]</code> <p>Validation set in the form of <code>PyTablesDataset</code> instance wrapping the PyTables database.</p> <code>test_dataset</code> <code>Optional[PyTablesDataset]</code> <p>Test set in the form of <code>PyTablesDataset</code> instance wrapping the PyTables database.</p> <code>known_app_counts</code> <code>Optional[DataFrame]</code> <p>Known application counts in the train, validation, and test sets.</p> <code>unknown_app_counts</code> <code>Optional[DataFrame]</code> <p>Unknown application counts in the validation and test sets.</p> <code>train_dataloader</code> <code>Optional[DataLoader]</code> <p>Iterable PyTorch <code>DataLoader</code> for training.</p> <code>train_dataloader_sampler</code> <code>Optional[Sampler]</code> <p>Sampler used for iterating the training dataloader. 
Either <code>RandomSampler</code> or <code>SequentialSampler</code>.</p> <code>train_dataloader_drop_last</code> <code>bool</code> <p>Whether to drop the last incomplete batch when iterating the training dataloader.</p> <code>val_dataloader</code> <code>Optional[DataLoader]</code> <p>Iterable PyTorch <code>DataLoader</code> for validation.</p> <code>test_dataloader</code> <code>Optional[DataLoader]</code> <p>Iterable PyTorch <code>DataLoader</code> for testing.</p> Source code in <code>cesnet_datazoo\\datasets\\cesnet_dataset.py</code> <pre><code>class CesnetDataset():\n    \"\"\"\n    The main class for accessing CESNET datasets. It handles downloading, train/validation/test splitting, and class selection. Access to data is provided through:\n\n    - Iterable PyTorch DataLoader for batch processing. See [using dataloaders][using-dataloaders] for more details.\n    - Pandas DataFrame for loading the entire train, validation, or test set at once.\n\n    The dataset is stored in a [PyTables](https://www.pytables.org/) database. The internal `PyTablesDataset` class is used as a wrapper\n    that implements the PyTorch [`Dataset`](https://pytorch.org/docs/stable/data.html#torch.utils.data.Dataset) interface\n    and is compatible with [`DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader),\n    which provides efficient parallel loading of the data. The dataset configuration is done through the [`DatasetConfig`][config.DatasetConfig] class.\n\n    **Intended usage:**\n\n    1. Create an instance of the [dataset class][dataset-classes] with the desired size and data root. This will download the dataset if it has not already been downloaded.\n    2. 
Create an instance of [`DatasetConfig`][config.DatasetConfig] and set it with [`set_dataset_config_and_initialize`][datasets.cesnet_dataset.CesnetDataset.set_dataset_config_and_initialize].\n    This will initialize the dataset \u2014 select classes, split data into train/validation/test sets, and fit data scalers if needed. All is done according to the provided configuration and is cached for later use.\n    3. Use [`get_train_dataloader`][datasets.cesnet_dataset.CesnetDataset.get_train_dataloader] or [`get_train_df`][datasets.cesnet_dataset.CesnetDataset.get_train_df] to get training data for a classification model.\n    4. Validate the model and perform the hyperparameter optimization on [`get_val_dataloader`][datasets.cesnet_dataset.CesnetDataset.get_val_dataloader] or [`get_val_df`][datasets.cesnet_dataset.CesnetDataset.get_val_df].\n    5. Evaluate the model on [`get_test_dataloader`][datasets.cesnet_dataset.CesnetDataset.get_test_dataloader] or [`get_test_df`][datasets.cesnet_dataset.CesnetDataset.get_test_df].\n\n    Parameters:\n        data_root: Path to the folder where the dataset will be stored. Each dataset size has its own subfolder `data_root/size`\n        size: Size of the dataset. Options are `XS`, `S`, `M`, `L`, `ORIG`.\n        silent: Whether to suppress print and tqdm output.\n\n    Attributes:\n        name: Name of the dataset.\n        database_filename: Name of the database file.\n        database_path: Path to the database file.\n        servicemap_path: Path to the servicemap file.\n        statistics_path: Path to the dataset statistics folder.\n        bucket_url: URL of the bucket where the database is stored.\n        metadata: Additional [dataset metadata][metadata].\n        available_classes: List of all available classes in the dataset.\n        available_dates: List of all available dates in the dataset.\n        time_periods: Predefined time periods. 
Each time period is a list of dates.\n        default_train_period_name: Default time period for training.\n        default_test_period_name: Default time period for testing.\n\n    The following attributes are initialized when [`set_dataset_config_and_initialize`][datasets.cesnet_dataset.CesnetDataset.set_dataset_config_and_initialize] is called.\n\n    Attributes:\n        dataset_config: Configuration of the dataset.\n        class_info: Structured information about the classes.\n        dataset_indices: Named tuple containing `train_indices`, `val_known_indices`, `val_unknown_indices`, `test_known_indices`, `test_unknown_indices`. These are the indices into PyTables database that define train, validation, and test sets.\n        train_dataset: Train set in the form of `PyTablesDataset` instance wrapping the PyTables database.\n        val_dataset: Validation set in the form of `PyTablesDataset` instance wrapping the PyTables database.\n        test_dataset: Test set in the form of `PyTablesDataset` instance wrapping the PyTables database.\n        known_app_counts: Known application counts in the train, validation, and test sets.\n        unknown_app_counts: Unknown application counts in the validation and test sets.\n        train_dataloader: Iterable PyTorch [`DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader) for training.\n        train_dataloader_sampler: Sampler used for iterating the training dataloader. 
Either [`RandomSampler`](https://pytorch.org/docs/stable/data.html#torch.utils.data.RandomSampler) or [`SequentialSampler`](https://pytorch.org/docs/stable/data.html#torch.utils.data.SequentialSampler).\n        train_dataloader_drop_last: Whether to drop the last incomplete batch when iterating the training dataloader.\n        val_dataloader: Iterable PyTorch [`DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader) for validation.\n        test_dataloader: Iterable PyTorch [`DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader) for testing.\n    \"\"\"\n    data_root: str\n    size: str\n    silent: bool = False\n\n    name: str\n    database_filename: str\n    database_path: str\n    servicemap_path: str\n    statistics_path: str\n    bucket_url: str\n    metadata: DatasetMetadata\n    available_classes: list[str]\n    available_dates: list[str]\n    time_periods: dict[str, list[str]]\n    default_train_period_name: str\n    default_test_period_name: str\n\n    dataset_config: Optional[DatasetConfig] = None\n    class_info: Optional[ClassInfo] = None\n    dataset_indices: Optional[IndicesTuple] = None\n    train_dataset: Optional[PyTablesDataset] = None\n    val_dataset: Optional[PyTablesDataset] = None\n    test_dataset: Optional[PyTablesDataset] = None\n    known_app_counts: Optional[pd.DataFrame] = None\n    unknown_app_counts: Optional[pd.DataFrame] = None\n    train_dataloader: Optional[DataLoader] = None\n    train_dataloader_sampler: Optional[Sampler] = None\n    train_dataloader_drop_last: bool = True\n    val_dataloader: Optional[DataLoader] = None\n    test_dataloader: Optional[DataLoader] = None\n\n    _collate_fn: Optional[Callable] = None\n    _tables_app_enum: dict[int, str]\n    _tables_cat_enum: dict[int, str]\n\n    def __init__(self, data_root: str, size: str = \"S\", database_checks_at_init: bool = False, silent: bool = False) -&gt; None:\n        self.silent = silent\n        
self.metadata = load_metadata(self.name)\n        self.size = size\n        if self.size != \"ORIG\":\n            if size not in self.metadata.available_dataset_sizes:\n                raise ValueError(f\"Unknown dataset size {self.size}\")\n            self.name = f\"{self.name}-{self.size}\"\n            filename, ext = os.path.splitext(self.database_filename)\n            self.database_filename = f\"{filename}-{self.size}{ext}\"\n        self.data_root = os.path.normpath(os.path.expanduser(os.path.join(data_root, self.size)))\n        self.database_path = os.path.join(self.data_root, self.database_filename)\n        self.servicemap_path = os.path.join(self.data_root, SERVICEMAP_FILE)\n        self.statistics_path = os.path.join(self.data_root, \"statistics\")\n        if not os.path.exists(self.data_root):\n            os.makedirs(self.data_root)\n        if not self._is_downloaded():\n            self._download()\n        if database_checks_at_init:\n            with tb.open_file(self.database_path, mode=\"r\") as database:\n                tables_paths = list(map(lambda x: x._v_pathname, iter(database.get_node(f\"/flows\"))))\n                num_samples = 0\n                for p in tables_paths:\n                    table = database.get_node(p)\n                    assert isinstance(table, tb.Table)\n                    if self._tables_app_enum != {v: k for k, v in dict(table.get_enum(APP_COLUMN)).items()}:\n                        raise ValueError(f\"Found mismatch between _tables_app_enum and the PyTables database enum in table {p}. Please report this issue.\")\n                    if self._tables_cat_enum != {v: k for k, v in dict(table.get_enum(CATEGORY_COLUMN)).items()}:\n                        raise ValueError(f\"Found mismatch between _tables_cat_enum and the PyTables database enum in table {p}. 
Please report this issue.\")\n                    num_samples += len(table)\n                if self.size == \"ORIG\" and num_samples != self.metadata.available_samples:\n                    raise ValueError(f\"Expected {self.metadata.available_samples} samples, but got {num_samples} in the database. Please delete the data root folder, update cesnet-datazoo, and redownload the dataset.\")\n                if self.size != \"ORIG\" and num_samples != DATASET_SIZES[self.size]:\n                    raise ValueError(f\"Expected {DATASET_SIZES[self.size]} samples, but got {num_samples} in the database. Please delete the data root folder, update cesnet-datazoo, and redownload the dataset.\")\n                if self.available_dates != list(map(lambda x: x.removeprefix(\"/flows/D\"), tables_paths)):\n                    raise ValueError(f\"Found mismatch between available_dates and the dates available in the PyTables database. Please report this issue.\")\n        # Add all available dates as single date time periods\n        for d in self.available_dates:\n            self.time_periods[d] = [d]\n        available_applications = sorted([app for app in pd.read_csv(self.servicemap_path, index_col=\"Tag\").index if not is_background_app(app)])\n        if len(available_applications) != self.metadata.application_count:\n            raise ValueError(f\"Found {len(available_applications)} applications in the servicemap (omitting background traffic classes), but expected {self.metadata.application_count}. Please report this issue.\")\n        self.available_classes = available_applications + self.metadata.background_traffic_classes\n\n    def set_dataset_config_and_initialize(self, dataset_config: DatasetConfig, disable_indices_cache: bool = False) -&gt; None:\n        \"\"\"\n        Initialize train, validation, and test sets. 
Data cannot be accessed before calling this method.\n\n        Parameters:\n            dataset_config: Desired configuration of the dataset.\n            disable_indices_cache: Whether to disable caching of the dataset indices. This is useful when the dataset is used in many different configurations and you want to save disk space.\n        \"\"\"\n        self.dataset_config = dataset_config\n        self._clear()\n        self._initialize_train_val_test(disable_indices_cache=disable_indices_cache)\n\n    def get_train_dataloader(self) -&gt; DataLoader:\n        \"\"\"\n        Provides a PyTorch [`DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader) for training. The dataloader is created on the first call and then cached.\n        When the dataloader is iterated in random order, the last incomplete batch is dropped.\n        The dataloader is configured with the following config attributes:\n\n        | Dataset config               | Description                                                                                |\n        | ---------------------------- | ------------------------------------------------------------------------------------------ |\n        | `batch_size`                 | Number of samples per batch.                                                               |\n        | `train_workers`              | Number of workers for loading train data.                                                  |\n        | `train_dataloader_order`     | Whether to load train data in sequential or random order. See [config.DataLoaderOrder][].  |\n        | `train_dataloader_seed`      | Seed for loading train data in random order.                                               |\n\n        Returns:\n            Train data as an iterable dataloader. 
See [using dataloaders][using-dataloaders] for more details.\n        \"\"\"\n        if self.dataset_config is None:\n            raise ValueError(\"Dataset is not initialized, use set_dataset_config_and_initialize() before getting train dataloader\")\n        if not self.dataset_config.need_train_set:\n            raise ValueError(\"Train dataloader is not available when need_train_set is false\")\n        assert self.train_dataset\n        if self.train_dataloader:\n            return self.train_dataloader\n        # Create sampler according to the selected order\n        if self.dataset_config.train_dataloader_order == DataLoaderOrder.RANDOM:\n            if self.dataset_config.train_dataloader_seed is not None:\n                generator = torch.Generator()\n                generator.manual_seed(self.dataset_config.train_dataloader_seed)\n            else:\n                generator = None\n            self.train_dataloader_sampler = RandomSampler(self.train_dataset, generator=generator)\n            self.train_dataloader_drop_last = True\n        elif self.dataset_config.train_dataloader_order == DataLoaderOrder.SEQUENTIAL:\n            self.train_dataloader_sampler = SequentialSampler(self.train_dataset)\n            self.train_dataloader_drop_last = False\n        else: assert_never(self.dataset_config.train_dataloader_order)\n        # Create dataloader\n        batch_sampler = BatchSampler(sampler=self.train_dataloader_sampler, batch_size=self.dataset_config.batch_size, drop_last=self.train_dataloader_drop_last)\n        train_dataloader = DataLoader(\n            self.train_dataset,\n            num_workers=self.dataset_config.train_workers,\n            worker_init_fn=worker_init_fn,\n            collate_fn=self._collate_fn,\n            persistent_workers=self.dataset_config.train_workers &gt; 0,\n            batch_size=None,\n            sampler=batch_sampler,)\n        if self.dataset_config.train_workers == 0:\n            
self.train_dataset.pytables_worker_init()\n        self.train_dataloader = train_dataloader\n        return train_dataloader\n\n    def get_val_dataloader(self) -&gt; DataLoader:\n        \"\"\"\n        Provides a PyTorch [`DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader) for validation.\n        The dataloader is created on the first call and then cached.\n        The dataloader is configured with the following config attributes:\n\n        | Dataset config    | Description                                                       |\n        | ------------------| ------------------------------------------------------------------|\n        | `test_batch_size` | Number of samples per batch for loading validation and test data. |\n        | `val_workers`     | Number of workers for loading validation data.                    |\n\n        Returns:\n            Validation data as an iterable dataloader. See [using dataloaders][using-dataloaders] for more details.\n        \"\"\"\n        if self.dataset_config is None:\n            raise ValueError(\"Dataset is not initialized, use set_dataset_config_and_initialize() before getting validation dataloader\")\n        if not self.dataset_config.need_val_set:\n            raise ValueError(\"Validation dataloader is not available when need_val_set is false\")\n        assert self.val_dataset is not None\n        if self.val_dataloader:\n            return self.val_dataloader\n        batch_sampler = BatchSampler(sampler=SequentialSampler(self.val_dataset), batch_size=self.dataset_config.test_batch_size, drop_last=False)\n        val_dataloader = DataLoader(\n            self.val_dataset,\n            num_workers=self.dataset_config.val_workers,\n            worker_init_fn=worker_init_fn,\n            collate_fn=self._collate_fn,\n            persistent_workers=self.dataset_config.val_workers &gt; 0,\n            batch_size=None,\n            sampler=batch_sampler,)\n        if 
self.dataset_config.val_workers == 0:\n            self.val_dataset.pytables_worker_init()\n        self.val_dataloader = val_dataloader\n        return val_dataloader\n\n    def get_test_dataloader(self) -&gt; DataLoader:\n        \"\"\"\n        Provides a PyTorch [`DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader) for testing.\n        The dataloader is created on the first call and then cached.\n\n        When the dataset is used in the open-world setting, and unknown classes are defined,\n        the test dataloader returns `test_known_size` samples of known classes followed by `test_unknown_size` samples of unknown classes.\n\n        The dataloader is configured with the following config attributes:\n\n        | Dataset config    | Description                                                       |\n        | ------------------| ------------------------------------------------------------------|\n        | `test_batch_size` | Number of samples per batch for loading validation and test data. |\n        | `test_workers`    | Number of workers for loading test data.                          |\n\n        Returns:\n            Test data as an iterable dataloader. 
See [using dataloaders][using-dataloaders] for more details.\n        \"\"\"\n        if self.dataset_config is None:\n            raise ValueError(\"Dataset is not initialized, use set_dataset_config_and_initialize() before getting test dataloader\")\n        if not self.dataset_config.need_test_set:\n            raise ValueError(\"Test dataloader is not available when need_test_set is false\")\n        assert self.test_dataset is not None\n        if self.test_dataloader:\n            return self.test_dataloader\n        batch_sampler = BatchSampler(sampler=SequentialSampler(self.test_dataset), batch_size=self.dataset_config.test_batch_size, drop_last=False)\n        test_dataloader = DataLoader(\n            self.test_dataset,\n            num_workers=self.dataset_config.test_workers,\n            worker_init_fn=worker_init_fn,\n            collate_fn=self._collate_fn,\n            persistent_workers=False,\n            batch_size=None,\n            sampler=batch_sampler,)\n        if self.dataset_config.test_workers == 0:\n            self.test_dataset.pytables_worker_init()\n        self.test_dataloader = test_dataloader\n        return test_dataloader\n\n    def get_dataloaders(self) -&gt; tuple[DataLoader, DataLoader, DataLoader]:\n        \"\"\"Gets train, validation, and test dataloaders in one call.\"\"\"\n        if self.dataset_config is None:\n            raise ValueError(\"Dataset is not initialized, use set_dataset_config_and_initialize() before getting dataloaders\")\n        train_dataloader = self.get_train_dataloader()\n        val_dataloader = self.get_val_dataloader()\n        test_dataloader = self.get_test_dataloader()\n        return train_dataloader, val_dataloader, test_dataloader\n\n    def get_train_df(self, flatten_ppi: bool = False) -&gt; pd.DataFrame:\n        \"\"\"\n        Creates a train Pandas [`DataFrame`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html). The dataframe is in sequential (datetime) order. 
Consider shuffling the dataframe if needed.\n\n        !!! warning \"Memory usage\"\n\n            The whole train set is loaded into memory. If the dataset size is larger than `'S'`, consider using `get_train_dataloader` instead.\n\n        Parameters:\n            flatten_ppi: Whether to flatten the PPI sequence into individual columns (named `IPT_X`, `DIR_X`, `SIZE_X`, `PUSH_X`, *X* being the index of the packet) or keep one `PPI` column with 2D data.\n\n        Returns:\n            Train data as a dataframe.\n        \"\"\"\n        self._check_before_dataframe(check_train=True)\n        assert self.dataset_config is not None and self.train_dataset is not None\n        if len(self.train_dataset) &gt; DATAFRAME_SAMPLES_WARNING_THRESHOLD:\n            warnings.warn(f\"Train set has ({len(self.train_dataset)} samples), consider using get_train_dataloader() instead\")\n        train_dataloader = self.get_train_dataloader()\n        assert isinstance(train_dataloader.sampler, BatchSampler) and self.train_dataloader_sampler is not None\n        # Read dataloader in sequential order\n        train_dataloader.sampler.sampler = SequentialSampler(self.train_dataset)\n        train_dataloader.sampler.drop_last = False\n        feature_names = self.dataset_config.get_feature_names(flatten_ppi=flatten_ppi)\n        df = create_df_from_dataloader(dataloader=train_dataloader,\n                                       feature_names=feature_names,\n                                       flatten_ppi=flatten_ppi,\n                                       silent=self.silent)\n        # Restore the original dataloader sampler and drop_last\n        train_dataloader.sampler.sampler = self.train_dataloader_sampler\n        train_dataloader.sampler.drop_last = self.train_dataloader_drop_last\n        return df\n\n    def get_val_df(self, flatten_ppi: bool = False) -&gt; pd.DataFrame:\n        \"\"\"\n        Creates validation Pandas 
[`DataFrame`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html). The dataframe is in sequential (datetime) order.\n\n        !!! warning \"Memory usage\"\n\n            The whole validation set is loaded into memory. If the dataset size is larger than `'S'`, consider using `get_val_dataloader` instead.\n\n        Parameters:\n            flatten_ppi: Whether to flatten the PPI sequence into individual columns (named `IPT_X`, `DIR_X`, `SIZE_X`, `PUSH_X`, *X* being the index of the packet) or keep one `PPI` column with 2D data.\n\n        Returns:\n            Validation data as a dataframe.\n        \"\"\"\n        self._check_before_dataframe(check_val=True)\n        assert self.dataset_config is not None and self.val_dataset is not None\n        if len(self.val_dataset) &gt; DATAFRAME_SAMPLES_WARNING_THRESHOLD:\n            warnings.warn(f\"Validation set has ({len(self.val_dataset)} samples), consider using get_val_dataloader() instead\")\n        feature_names = self.dataset_config.get_feature_names(flatten_ppi=flatten_ppi)\n        return create_df_from_dataloader(dataloader=self.get_val_dataloader(),\n                                         feature_names=feature_names,\n                                         flatten_ppi=flatten_ppi,\n                                         silent=self.silent)\n\n    def get_test_df(self, flatten_ppi: bool = False) -&gt; pd.DataFrame:\n        \"\"\"\n        Creates test Pandas [`DataFrame`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html). The dataframe is in sequential (datetime) order.\n\n\n        When the dataset is used in the open-world setting, and unknown classes are defined,\n        the returned test dataframe is composed of `test_known_size` samples of known classes followed by `test_unknown_size` samples of unknown classes.\n\n\n        !!! warning \"Memory usage\"\n\n            The whole test set is loaded into memory. 
If the dataset size is larger than `'S'`, consider using `get_test_dataloader` instead.\n\n        Parameters:\n            flatten_ppi: Whether to flatten the PPI sequence into individual columns (named `IPT_X`, `DIR_X`, `SIZE_X`, `PUSH_X`, *X* being the index of the packet) or keep one `PPI` column with 2D data.\n\n        Returns:\n            Test data as a dataframe.\n        \"\"\"\n        self._check_before_dataframe(check_test=True)\n        assert self.dataset_config is not None and self.test_dataset is not None\n        if len(self.test_dataset) &gt; DATAFRAME_SAMPLES_WARNING_THRESHOLD:\n            warnings.warn(f\"Test set has ({len(self.test_dataset)} samples), consider using get_test_dataloader() instead\")\n        feature_names = self.dataset_config.get_feature_names(flatten_ppi=flatten_ppi)\n        return create_df_from_dataloader(dataloader=self.get_test_dataloader(),\n                                         feature_names=feature_names,\n                                         flatten_ppi=flatten_ppi,\n                                         silent=self.silent)\n\n    def get_num_classes(self) -&gt; int:\n        \"\"\"Returns the number of classes in the current configuration of the dataset.\"\"\"\n        if self.class_info is None:\n            raise ValueError(\"Dataset is not initialized, use set_dataset_config_and_initialize() before getting the number of classes\")\n        return self.class_info.num_classes\n\n    def get_known_apps(self) -&gt; list[str]:\n        \"\"\"Returns the list of known applications in the current configuration of the dataset.\"\"\"\n        if self.class_info is None:\n            raise ValueError(\"Dataset is not initialized, use set_dataset_config_and_initialize() before getting known apps\")\n        return self.class_info.known_apps\n\n    def get_unknown_apps(self) -&gt; list[str]:\n        \"\"\"Returns the list of unknown applications in the current configuration of the dataset.\"\"\"\n        if 
self.class_info is None:\n            raise ValueError(\"Dataset is not initialized, use set_dataset_config_and_initialize() before getting unknown apps\")\n        return self.class_info.unknown_apps\n\n    def compute_dataset_statistics(self, num_samples: int | Literal[\"all\"] = 10_000_000, num_workers: int = 4, batch_size: int = 16384, disabled_apps: Optional[list[str]] = None) -&gt; None:\n        \"\"\"\n        Computes dataset statistics and saves them to the `statistics_path` folder.\n\n        Parameters:\n            num_samples: Number of samples to use for computing the statistics.\n            num_workers: Number of workers for loading data.\n            batch_size: Number of samples per batch for loading data.\n            disabled_apps: List of applications to exclude from the statistics.\n        \"\"\"\n        if disabled_apps:\n            bad_disabled_apps = [a for a in disabled_apps if a not in self.available_classes]\n            if len(bad_disabled_apps) &gt; 0:\n                raise ValueError(f\"Bad applications in disabled_apps {bad_disabled_apps}. 
Use applications available in dataset.available_classes\")\n        if not os.path.exists(self.statistics_path):\n            os.mkdir(self.statistics_path)\n        compute_dataset_statistics(database_path=self.database_path,\n                                   tables_app_enum=self._tables_app_enum,\n                                   tables_cat_enum=self._tables_cat_enum,\n                                   output_dir=self.statistics_path,\n                                   packet_histograms=self.metadata.packet_histograms,\n                                   flowstats_features_boolean=self.metadata.flowstats_features_boolean,\n                                   protocol=self.metadata.protocol,\n                                   extra_fields=not self.name.startswith(\"CESNET-TLS22\"),\n                                   disabled_apps=disabled_apps if disabled_apps is not None else [],\n                                   num_samples=num_samples,\n                                   num_workers=num_workers,\n                                   batch_size=batch_size,\n                                   silent=self.silent)\n\n    def _generate_time_periods(self) -&gt; None:\n        time_periods = {}\n        for period in self.time_periods:\n            time_periods[period] = []\n            if period.startswith(\"W\"):\n                split = period.split(\"-\")\n                collection_year, week = int(split[1]), int(split[2])\n                for d in range(1, 8):\n                    s = datetime.date.fromisocalendar(collection_year, week, d).strftime(\"%Y%m%d\")\n                    # last week of a year can span into the following year\n                    if s not in self.metadata.missing_dates_in_collection_period and s.startswith(str(collection_year)):\n                        time_periods[period].append(s)\n            elif period.startswith(\"M\"):\n                split = period.split(\"-\")\n                collection_year, month = int(split[1]), 
int(split[2])\n                # monthrange()[1] is the number of days in the month; +1 makes the range include the last day\n                for d in range(1, calendar.monthrange(collection_year, month)[1] + 1):\n                    s = datetime.date(collection_year, month, d).strftime(\"%Y%m%d\")\n                    if s not in self.metadata.missing_dates_in_collection_period:\n                        time_periods[period].append(s)\n        self.time_periods = time_periods\n\n    def _is_downloaded(self) -&gt; bool:\n        \"\"\"Servicemap is downloaded after the database; thus if it exists, the database is also downloaded\"\"\"\n        return os.path.exists(self.servicemap_path) and os.path.exists(self.database_path)\n\n    def _download(self) -&gt; None:\n        if not self.silent:\n            print(f\"Downloading {self.name} dataset\")\n        database_url = f\"{self.bucket_url}&amp;file={self.database_filename}\"\n        servicemap_url = f\"{self.bucket_url}&amp;file={SERVICEMAP_FILE}\"\n        resumable_download(url=database_url, file_path=self.database_path, silent=self.silent)\n        simple_download(url=servicemap_url, file_path=self.servicemap_path)\n\n    def _clear(self) -&gt; None:\n        self.class_info = None\n        self.dataset_indices = None\n        self.train_dataset = None\n        self.val_dataset = None\n        self.test_dataset = None\n        self.known_app_counts = None\n        self.unknown_app_counts = None\n        self.train_dataloader = None\n        self.train_dataloader_sampler = None\n        self.train_dataloader_drop_last = True\n        self.val_dataloader = None\n        self.test_dataloader = None\n        self._collate_fn = None\n\n    def _check_before_dataframe(self, check_train: bool = False, check_val: bool = False, check_test: bool = False) -&gt; None:\n        if self.dataset_config is None:\n            raise ValueError(\"Dataset is not initialized, use set_dataset_config_and_initialize() before getting a dataframe\")\n        if self.dataset_config.return_tensors:\n            raise ValueError(\"Dataframes are not 
available when return_tensors is set. Use a dataloader instead.\")\n        if check_train and not self.dataset_config.need_train_set:\n            raise ValueError(\"Train dataframe is not available when need_train_set is false\")\n        if check_val and not self.dataset_config.need_val_set:\n            raise ValueError(\"Validation dataframe is not available when need_val_set is false\")\n        if check_test and not self.dataset_config.need_test_set:\n            raise ValueError(\"Test dataframe is not available when need_test_set is false\")\n\n    def _initialize_train_val_test(self, disable_indices_cache: bool = False) -&gt; None:\n        assert self.dataset_config is not None\n        dataset_config = self.dataset_config\n        servicemap = pd.read_csv(dataset_config.servicemap_path, index_col=\"Tag\")\n        # Initialize train set\n        if dataset_config.need_train_set:\n            train_indices, train_unknown_indices, known_apps, unknown_apps = init_or_load_train_indices(dataset_config=dataset_config,\n                                                                                                        tables_app_enum=self._tables_app_enum,\n                                                                                                        servicemap=servicemap,\n                                                                                                        disable_indices_cache=disable_indices_cache,)\n            # Date weight sampling of train indices\n            if dataset_config.train_dates_weigths is not None:\n                assert dataset_config.train_size != \"all\"\n                if dataset_config.val_approach == ValidationApproach.SPLIT_FROM_TRAIN:\n                    # requested number of samples is train_size + val_known_size when using the split-from-train validation approach\n                    assert dataset_config.val_known_size != \"all\"\n                    num_samples = dataset_config.train_size + 
dataset_config.val_known_size\n                else:\n                    num_samples = dataset_config.train_size\n                if num_samples &gt; len(train_indices):\n                    raise ValueError(f\"Requested number of samples for weight sampling ({num_samples}) is larger than the number of available train samples ({len(train_indices)})\")\n                train_indices = date_weight_sample_train_indices(dataset_config=dataset_config, train_indices=train_indices, num_samples=num_samples)\n        elif dataset_config.apps_selection == AppSelection.FIXED:\n            known_apps = dataset_config.apps_selection_fixed_known\n            unknown_apps = dataset_config.apps_selection_fixed_unknown\n            train_indices = np.zeros((0,3), dtype=np.int64)\n            train_unknown_indices = np.zeros((0,3), dtype=np.int64)\n        else:\n            raise ValueError(\"Either need train set or the fixed application selection\")\n        # Initialize validation set\n        if dataset_config.need_val_set:\n            if dataset_config.val_approach == ValidationApproach.VALIDATION_DATES:\n                val_known_indices, val_unknown_indices, val_data_path = init_or_load_val_indices(dataset_config=dataset_config,\n                                                                                                 known_apps=known_apps,\n                                                                                                 unknown_apps=unknown_apps,\n                                                                                                 tables_app_enum=self._tables_app_enum,\n                                                                                                 disable_indices_cache=disable_indices_cache,)\n            elif dataset_config.val_approach == ValidationApproach.SPLIT_FROM_TRAIN:\n                train_val_rng = get_fresh_random_generator(dataset_config=dataset_config, section=RandomizedSection.TRAIN_VAL_SPLIT)\n           
     val_data_path = dataset_config._get_train_data_path()\n                val_unknown_indices = train_unknown_indices\n                train_labels = train_indices[:, INDICES_LABEL_POS]\n                if dataset_config.train_dates_weigths is not None:\n                    assert dataset_config.val_known_size != \"all\"\n                    # When weight sampling is used, val_known_size is kept but the resulting train size can be smaller due to no enough samples in some train dates\n                    if dataset_config.val_known_size &gt; len(train_indices):\n                        raise ValueError(f\"Requested validation size ({dataset_config.val_known_size}) is larger than the number of available train samples after weight sampling ({len(train_indices)})\")\n                    train_indices, val_known_indices = train_test_split(train_indices, test_size=dataset_config.val_known_size, stratify=train_labels, shuffle=True, random_state=train_val_rng)\n                    dataset_config.train_size = len(train_indices)\n                elif dataset_config.train_size == \"all\" and dataset_config.val_known_size == \"all\":\n                    train_indices, val_known_indices = train_test_split(train_indices, test_size=dataset_config.train_val_split_fraction, stratify=train_labels, shuffle=True, random_state=train_val_rng)\n                else:\n                    if dataset_config.val_known_size != \"all\" and  dataset_config.train_size != \"all\" and dataset_config.train_size + dataset_config.val_known_size &gt; len(train_indices):\n                        raise ValueError(f\"Requested train size + validation size ({dataset_config.train_size + dataset_config.val_known_size}) is larger than the number of available train samples ({len(train_indices)})\")\n                    if dataset_config.train_size != \"all\" and dataset_config.train_size &gt; len(train_indices):\n                        raise ValueError(f\"Requested train size ({dataset_config.train_size}) 
is larger than the number of available train samples ({len(train_indices)})\")\n                    if dataset_config.val_known_size != \"all\" and dataset_config.val_known_size &gt; len(train_indices):\n                        raise ValueError(f\"Requested validation size ({dataset_config.val_known_size}) is larger than the number of available train samples ({len(train_indices)})\")\n                    train_indices, val_known_indices = train_test_split(train_indices,\n                                                                        train_size=dataset_config.train_size if dataset_config.train_size != \"all\" else None,\n                                                                        test_size=dataset_config.val_known_size if dataset_config.val_known_size != \"all\" else None,\n                                                                        stratify=train_labels, shuffle=True, random_state=train_val_rng)\n        else:\n            val_known_indices = np.zeros((0,3), dtype=np.int64)\n            val_unknown_indices = np.zeros((0,3), dtype=np.int64)\n            val_data_path = None\n        # Initialize test set\n        if dataset_config.need_test_set:\n            test_known_indices, test_unknown_indices, test_data_path = init_or_load_test_indices(dataset_config=dataset_config,\n                                                                                                 known_apps=known_apps,\n                                                                                                 unknown_apps=unknown_apps,\n                                                                                                 tables_app_enum=self._tables_app_enum,\n                                                                                                 disable_indices_cache=disable_indices_cache,)\n        else:\n            test_known_indices = np.zeros((0,3), dtype=np.int64)\n            test_unknown_indices = np.zeros((0,3), 
dtype=np.int64)\n            test_data_path = None\n        # Fit scalers if needed\n        if (dataset_config.ppi_transform is not None and dataset_config.ppi_transform.needs_fitting or\n            dataset_config.flowstats_transform is not None and dataset_config.flowstats_transform.needs_fitting):\n            if not dataset_config.need_train_set:\n                raise ValueError(\"Train set is needed to fit the scalers. Provide pre-fitted scalers.\")\n            fit_scalers(dataset_config=dataset_config, train_indices=train_indices)\n        # Subset dataset indices based on the selected sizes and compute application counts\n        dataset_indices = IndicesTuple(train_indices=train_indices, val_known_indices=val_known_indices, val_unknown_indices=val_unknown_indices, test_known_indices=test_known_indices, test_unknown_indices=test_unknown_indices)\n        dataset_indices = subset_and_sort_indices(dataset_config=dataset_config, dataset_indices=dataset_indices)\n        known_app_counts = compute_known_app_counts(dataset_indices=dataset_indices, tables_app_enum=self._tables_app_enum)\n        unknown_app_counts = compute_unknown_app_counts(dataset_indices=dataset_indices, tables_app_enum=self._tables_app_enum)\n        # Combine known and unknown test indices to create a single dataloader\n        assert isinstance(dataset_config.test_unknown_size, int)\n        if dataset_config.test_unknown_size &gt; 0 and len(unknown_apps) &gt; 0:\n            test_combined_indices = np.concatenate((dataset_indices.test_known_indices, dataset_indices.test_unknown_indices))\n        else:\n            test_combined_indices = dataset_indices.test_known_indices\n        # Create the label encoder and the class info structure\n        encoder = LabelEncoder().fit(known_apps)\n        encoder.classes_ = np.append(encoder.classes_, UNKNOWN_STR_LABEL)\n        class_info = create_class_info(servicemap=servicemap, encoder=encoder, known_apps=known_apps, unknown_apps=unknown_apps)\n        
encode_labels_with_unknown_fn = partial(_encode_labels_with_unknown, encoder=encoder, class_info=class_info)\n        # Create train, validation, and test datasets\n        train_dataset = val_dataset = test_dataset = None\n        if dataset_config.need_train_set:\n            train_dataset = PyTablesDataset(\n                database_path=dataset_config.database_path,\n                tables_paths=dataset_config._get_train_tables_paths(),\n                indices=dataset_indices.train_indices,\n                tables_app_enum=self._tables_app_enum,\n                tables_cat_enum=self._tables_cat_enum,\n                flowstats_features=dataset_config.flowstats_features,\n                flowstats_features_boolean=dataset_config.flowstats_features_boolean,\n                flowstats_features_phist=dataset_config.flowstats_features_phist,\n                other_fields=self.dataset_config.other_fields,\n                ppi_channels=dataset_config.get_ppi_channels(),\n                ppi_transform=dataset_config.ppi_transform,\n                flowstats_transform=dataset_config.flowstats_transform,\n                flowstats_phist_transform=dataset_config.flowstats_phist_transform,\n                target_transform=encode_labels_with_unknown_fn,\n                return_tensors=dataset_config.return_tensors,)\n        if dataset_config.need_val_set:\n            assert val_data_path is not None\n            val_dataset = PyTablesDataset(\n                database_path=dataset_config.database_path,\n                tables_paths=dataset_config._get_train_tables_paths(),\n                indices=dataset_indices.val_known_indices,\n                tables_app_enum=self._tables_app_enum,\n                tables_cat_enum=self._tables_cat_enum,\n                flowstats_features=dataset_config.flowstats_features,\n                flowstats_features_boolean=dataset_config.flowstats_features_boolean,\n                
flowstats_features_phist=dataset_config.flowstats_features_phist,\n                other_fields=self.dataset_config.other_fields,\n                ppi_channels=dataset_config.get_ppi_channels(),\n                ppi_transform=dataset_config.ppi_transform,\n                flowstats_transform=dataset_config.flowstats_transform,\n                flowstats_phist_transform=dataset_config.flowstats_phist_transform,\n                target_transform=encode_labels_with_unknown_fn,\n                return_tensors=dataset_config.return_tensors,\n                preload=dataset_config.preload_val,\n                preload_blob=os.path.join(val_data_path, \"preload\", f\"val_dataset-{dataset_config.val_known_size}.npz\"),)\n        if dataset_config.need_test_set:\n            assert test_data_path is not None\n            test_dataset = PyTablesDataset(\n                database_path=dataset_config.database_path,\n                tables_paths=dataset_config._get_test_tables_paths(),\n                indices=test_combined_indices,\n                tables_app_enum=self._tables_app_enum,\n                tables_cat_enum=self._tables_cat_enum,\n                flowstats_features=dataset_config.flowstats_features,\n                flowstats_features_boolean=dataset_config.flowstats_features_boolean,\n                flowstats_features_phist=dataset_config.flowstats_features_phist,\n                other_fields=self.dataset_config.other_fields,\n                ppi_channels=dataset_config.get_ppi_channels(),\n                ppi_transform=dataset_config.ppi_transform,\n                flowstats_transform=dataset_config.flowstats_transform,\n                flowstats_phist_transform=dataset_config.flowstats_phist_transform,\n                target_transform=encode_labels_with_unknown_fn,\n                return_tensors=dataset_config.return_tensors,\n                preload=dataset_config.preload_test,\n                preload_blob=os.path.join(test_data_path, \"preload\", 
f\"test_dataset-{dataset_config.test_known_size}-{dataset_config.test_unknown_size}.npz\"),)\n        self.class_info = class_info\n        self.dataset_indices = dataset_indices\n        self.train_dataset = train_dataset\n        self.val_dataset = val_dataset\n        self.test_dataset = test_dataset\n        self.known_app_counts = known_app_counts\n        self.unknown_app_counts = unknown_app_counts\n        self._collate_fn = collate_fn_simple\n</code></pre>"},{"location":"reference_cesnet_dataset/#datasets.cesnet_dataset.CesnetDataset.set_dataset_config_and_initialize","title":"set_dataset_config_and_initialize","text":"<pre><code>set_dataset_config_and_initialize(\n    dataset_config: DatasetConfig,\n    disable_indices_cache: bool = False,\n) -&gt; None\n</code></pre> <p>Initialize train, validation, and test sets. Data cannot be accessed before calling this method.</p> <p>Parameters:</p> Name Type Description Default <code>dataset_config</code> <code>DatasetConfig</code> <p>Desired configuration of the dataset.</p> required <code>disable_indices_cache</code> <code>bool</code> <p>Whether to disable caching of the dataset indices. This is useful when the dataset is used in many different configurations and you want to save disk space.</p> <code>False</code> Source code in <code>cesnet_datazoo\\datasets\\cesnet_dataset.py</code> <pre><code>def set_dataset_config_and_initialize(self, dataset_config: DatasetConfig, disable_indices_cache: bool = False) -&gt; None:\n    \"\"\"\n    Initialize train, validation, and test sets. Data cannot be accessed before calling this method.\n\n    Parameters:\n        dataset_config: Desired configuration of the dataset.\n        disable_indices_cache: Whether to disable caching of the dataset indices. 
This is useful when the dataset is used in many different configurations and you want to save disk space.\n    \"\"\"\n    self.dataset_config = dataset_config\n    self._clear()\n    self._initialize_train_val_test(disable_indices_cache=disable_indices_cache)\n</code></pre>"},{"location":"reference_cesnet_dataset/#datasets.cesnet_dataset.CesnetDataset.get_train_dataloader","title":"get_train_dataloader","text":"<pre><code>get_train_dataloader() -&gt; DataLoader\n</code></pre> <p>Provides a PyTorch <code>DataLoader</code> for training. The dataloader is created on the first call and then cached. When the dataloader is iterated in random order, the last incomplete batch is dropped. The dataloader is configured with the following config attributes:</p> Dataset config Description <code>batch_size</code> Number of samples per batch. <code>train_workers</code> Number of workers for loading train data. <code>train_dataloader_order</code> Whether to load train data in sequential or random order. See config.DataLoaderOrder. <code>train_dataloader_seed</code> Seed for loading train data in random order. <p>Returns:</p> Type Description <code>DataLoader</code> <p>Train data as an iterable dataloader. See using dataloaders for more details.</p> Source code in <code>cesnet_datazoo\\datasets\\cesnet_dataset.py</code> <pre><code>def get_train_dataloader(self) -&gt; DataLoader:\n    \"\"\"\n    Provides a PyTorch [`DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader) for training. 
The dataloader is created on the first call and then cached.\n    When the dataloader is iterated in random order, the last incomplete batch is dropped.\n    The dataloader is configured with the following config attributes:\n\n    | Dataset config               | Description                                                                                |\n    | ---------------------------- | ------------------------------------------------------------------------------------------ |\n    | `batch_size`                 | Number of samples per batch.                                                               |\n    | `train_workers`              | Number of workers for loading train data.                                                  |\n    | `train_dataloader_order`     | Whether to load train data in sequential or random order. See [config.DataLoaderOrder][].  |\n    | `train_dataloader_seed`      | Seed for loading train data in random order.                                               |\n\n    Returns:\n        Train data as an iterable dataloader. 
See [using dataloaders][using-dataloaders] for more details.\n    \"\"\"\n    if self.dataset_config is None:\n        raise ValueError(\"Dataset is not initialized, use set_dataset_config_and_initialize() before getting train dataloader\")\n    if not self.dataset_config.need_train_set:\n        raise ValueError(\"Train dataloader is not available when need_train_set is false\")\n    assert self.train_dataset\n    if self.train_dataloader:\n        return self.train_dataloader\n    # Create sampler according to the selected order\n    if self.dataset_config.train_dataloader_order == DataLoaderOrder.RANDOM:\n        if self.dataset_config.train_dataloader_seed is not None:\n            generator = torch.Generator()\n            generator.manual_seed(self.dataset_config.train_dataloader_seed)\n        else:\n            generator = None\n        self.train_dataloader_sampler = RandomSampler(self.train_dataset, generator=generator)\n        self.train_dataloader_drop_last = True\n    elif self.dataset_config.train_dataloader_order == DataLoaderOrder.SEQUENTIAL:\n        self.train_dataloader_sampler = SequentialSampler(self.train_dataset)\n        self.train_dataloader_drop_last = False\n    else: assert_never(self.dataset_config.train_dataloader_order)\n    # Create dataloader\n    batch_sampler = BatchSampler(sampler=self.train_dataloader_sampler, batch_size=self.dataset_config.batch_size, drop_last=self.train_dataloader_drop_last)\n    train_dataloader = DataLoader(\n        self.train_dataset,\n        num_workers=self.dataset_config.train_workers,\n        worker_init_fn=worker_init_fn,\n        collate_fn=self._collate_fn,\n        persistent_workers=self.dataset_config.train_workers &gt; 0,\n        batch_size=None,\n        sampler=batch_sampler,)\n    if self.dataset_config.train_workers == 0:\n        self.train_dataset.pytables_worker_init()\n    self.train_dataloader = train_dataloader\n    return 
train_dataloader\n</code></pre>"},{"location":"reference_cesnet_dataset/#datasets.cesnet_dataset.CesnetDataset.get_val_dataloader","title":"get_val_dataloader","text":"<pre><code>get_val_dataloader() -&gt; DataLoader\n</code></pre> <p>Provides a PyTorch <code>DataLoader</code> for validation. The dataloader is created on the first call and then cached. The dataloader is configured with the following config attributes:</p> Dataset config Description <code>test_batch_size</code> Number of samples per batch for loading validation and test data. <code>val_workers</code> Number of workers for loading validation data. <p>Returns:</p> Type Description <code>DataLoader</code> <p>Validation data as an iterable dataloader. See using dataloaders for more details.</p> Source code in <code>cesnet_datazoo\\datasets\\cesnet_dataset.py</code> <pre><code>def get_val_dataloader(self) -&gt; DataLoader:\n    \"\"\"\n    Provides a PyTorch [`DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader) for validation.\n    The dataloader is created on the first call and then cached.\n    The dataloader is configured with the following config attributes:\n\n    | Dataset config    | Description                                                       |\n    | ------------------| ------------------------------------------------------------------|\n    | `test_batch_size` | Number of samples per batch for loading validation and test data. |\n    | `val_workers`     | Number of workers for loading validation data.                    |\n\n    Returns:\n        Validation data as an iterable dataloader. 
See [using dataloaders][using-dataloaders] for more details.\n    \"\"\"\n    if self.dataset_config is None:\n        raise ValueError(\"Dataset is not initialized, use set_dataset_config_and_initialize() before getting validation dataloader\")\n    if not self.dataset_config.need_val_set:\n        raise ValueError(\"Validation dataloader is not available when need_val_set is false\")\n    assert self.val_dataset is not None\n    if self.val_dataloader:\n        return self.val_dataloader\n    batch_sampler = BatchSampler(sampler=SequentialSampler(self.val_dataset), batch_size=self.dataset_config.test_batch_size, drop_last=False)\n    val_dataloader = DataLoader(\n        self.val_dataset,\n        num_workers=self.dataset_config.val_workers,\n        worker_init_fn=worker_init_fn,\n        collate_fn=self._collate_fn,\n        persistent_workers=self.dataset_config.val_workers &gt; 0,\n        batch_size=None,\n        sampler=batch_sampler,)\n    if self.dataset_config.val_workers == 0:\n        self.val_dataset.pytables_worker_init()\n    self.val_dataloader = val_dataloader\n    return val_dataloader\n</code></pre>"},{"location":"reference_cesnet_dataset/#datasets.cesnet_dataset.CesnetDataset.get_test_dataloader","title":"get_test_dataloader","text":"<pre><code>get_test_dataloader() -&gt; DataLoader\n</code></pre> <p>Provides a PyTorch <code>DataLoader</code> for testing. The dataloader is created on the first call and then cached.</p> <p>When the dataset is used in the open-world setting, and unknown classes are defined, the test dataloader returns <code>test_known_size</code> samples of known classes followed by <code>test_unknown_size</code> samples of unknown classes.</p> <p>The dataloader is configured with the following config attributes:</p> Dataset config Description <code>test_batch_size</code> Number of samples per batch for loading validation and test data. <code>test_workers</code> Number of workers for loading test data. 
<p>Returns:</p> Type Description <code>DataLoader</code> <p>Test data as an iterable dataloader. See using dataloaders for more details.</p> Source code in <code>cesnet_datazoo\\datasets\\cesnet_dataset.py</code> <pre><code>def get_test_dataloader(self) -&gt; DataLoader:\n    \"\"\"\n    Provides a PyTorch [`DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader) for testing.\n    The dataloader is created on the first call and then cached.\n\n    When the dataset is used in the open-world setting, and unknown classes are defined,\n    the test dataloader returns `test_known_size` samples of known classes followed by `test_unknown_size` samples of unknown classes.\n\n    The dataloader is configured with the following config attributes:\n\n    | Dataset config    | Description                                                       |\n    | ------------------| ------------------------------------------------------------------|\n    | `test_batch_size` | Number of samples per batch for loading validation and test data. |\n    | `test_workers`    | Number of workers for loading test data.                          |\n\n    Returns:\n        Test data as an iterable dataloader. 
See [using dataloaders][using-dataloaders] for more details.\n    \"\"\"\n    if self.dataset_config is None:\n        raise ValueError(\"Dataset is not initialized, use set_dataset_config_and_initialize() before getting test dataloader\")\n    if not self.dataset_config.need_test_set:\n        raise ValueError(\"Test dataloader is not available when need_test_set is false\")\n    assert self.test_dataset is not None\n    if self.test_dataloader:\n        return self.test_dataloader\n    batch_sampler = BatchSampler(sampler=SequentialSampler(self.test_dataset), batch_size=self.dataset_config.test_batch_size, drop_last=False)\n    test_dataloader = DataLoader(\n        self.test_dataset,\n        num_workers=self.dataset_config.test_workers,\n        worker_init_fn=worker_init_fn,\n        collate_fn=self._collate_fn,\n        persistent_workers=False,\n        batch_size=None,\n        sampler=batch_sampler,)\n    if self.dataset_config.test_workers == 0:\n        self.test_dataset.pytables_worker_init()\n    self.test_dataloader = test_dataloader\n    return test_dataloader\n</code></pre>"},{"location":"reference_cesnet_dataset/#datasets.cesnet_dataset.CesnetDataset.get_dataloaders","title":"get_dataloaders","text":"<pre><code>get_dataloaders() -&gt; (\n    tuple[DataLoader, DataLoader, DataLoader]\n)\n</code></pre> <p>Gets train, validation, and test dataloaders in one call.</p> Source code in <code>cesnet_datazoo\\datasets\\cesnet_dataset.py</code> <pre><code>def get_dataloaders(self) -&gt; tuple[DataLoader, DataLoader, DataLoader]:\n    \"\"\"Gets train, validation, and test dataloaders in one call.\"\"\"\n    if self.dataset_config is None:\n        raise ValueError(\"Dataset is not initialized, use set_dataset_config_and_initialize() before getting dataloaders\")\n    train_dataloader = self.get_train_dataloader()\n    val_dataloader = self.get_val_dataloader()\n    test_dataloader = self.get_test_dataloader()\n    return train_dataloader, val_dataloader, 
test_dataloader\n</code></pre>"},{"location":"reference_cesnet_dataset/#datasets.cesnet_dataset.CesnetDataset.get_train_df","title":"get_train_df","text":"<pre><code>get_train_df(flatten_ppi: bool = False) -&gt; pd.DataFrame\n</code></pre> <p>Creates a train Pandas <code>DataFrame</code>. The dataframe is in sequential (datetime) order. Consider shuffling the dataframe if needed.</p> <p>Memory usage</p> <p>The whole train set is loaded into memory. If the dataset size is larger than <code>'S'</code>, consider using <code>get_train_dataloader</code> instead.</p> <p>Parameters:</p> Name Type Description Default <code>flatten_ppi</code> <code>bool</code> <p>Whether to flatten the PPI sequence into individual columns (named <code>IPT_X</code>, <code>DIR_X</code>, <code>SIZE_X</code>, <code>PUSH_X</code>, X being the index of the packet) or keep one <code>PPI</code> column with 2D data.</p> <code>False</code> <p>Returns:</p> Type Description <code>DataFrame</code> <p>Train data as a dataframe.</p> Source code in <code>cesnet_datazoo\\datasets\\cesnet_dataset.py</code> <pre><code>def get_train_df(self, flatten_ppi: bool = False) -&gt; pd.DataFrame:\n    \"\"\"\n    Creates a train Pandas [`DataFrame`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html). The dataframe is in sequential (datetime) order. Consider shuffling the dataframe if needed.\n\n    !!! warning \"Memory usage\"\n\n        The whole train set is loaded into memory. 
If the dataset size is larger than `'S'`, consider using `get_train_dataloader` instead.\n\n    Parameters:\n        flatten_ppi: Whether to flatten the PPI sequence into individual columns (named `IPT_X`, `DIR_X`, `SIZE_X`, `PUSH_X`, *X* being the index of the packet) or keep one `PPI` column with 2D data.\n\n    Returns:\n        Train data as a dataframe.\n    \"\"\"\n    self._check_before_dataframe(check_train=True)\n    assert self.dataset_config is not None and self.train_dataset is not None\n    if len(self.train_dataset) &gt; DATAFRAME_SAMPLES_WARNING_THRESHOLD:\n        warnings.warn(f\"Train set has ({len(self.train_dataset)} samples), consider using get_train_dataloader() instead\")\n    train_dataloader = self.get_train_dataloader()\n    assert isinstance(train_dataloader.sampler, BatchSampler) and self.train_dataloader_sampler is not None\n    # Read dataloader in sequential order\n    train_dataloader.sampler.sampler = SequentialSampler(self.train_dataset)\n    train_dataloader.sampler.drop_last = False\n    feature_names = self.dataset_config.get_feature_names(flatten_ppi=flatten_ppi)\n    df = create_df_from_dataloader(dataloader=train_dataloader,\n                                   feature_names=feature_names,\n                                   flatten_ppi=flatten_ppi,\n                                   silent=self.silent)\n    # Restore the original dataloader sampler and drop_last\n    train_dataloader.sampler.sampler = self.train_dataloader_sampler\n    train_dataloader.sampler.drop_last = self.train_dataloader_drop_last\n    return df\n</code></pre>"},{"location":"reference_cesnet_dataset/#datasets.cesnet_dataset.CesnetDataset.get_val_df","title":"get_val_df","text":"<pre><code>get_val_df(flatten_ppi: bool = False) -&gt; pd.DataFrame\n</code></pre> <p>Creates validation Pandas <code>DataFrame</code>. The dataframe is in sequential (datetime) order.</p> <p>Memory usage</p> <p>The whole validation set is loaded into memory. 
If the dataset size is larger than <code>'S'</code>, consider using <code>get_val_dataloader</code> instead.</p> <p>Parameters:</p> Name Type Description Default <code>flatten_ppi</code> <code>bool</code> <p>Whether to flatten the PPI sequence into individual columns (named <code>IPT_X</code>, <code>DIR_X</code>, <code>SIZE_X</code>, <code>PUSH_X</code>, X being the index of the packet) or keep one <code>PPI</code> column with 2D data.</p> <code>False</code> <p>Returns:</p> Type Description <code>DataFrame</code> <p>Validation data as a dataframe.</p> Source code in <code>cesnet_datazoo\\datasets\\cesnet_dataset.py</code> <pre><code>def get_val_df(self, flatten_ppi: bool = False) -&gt; pd.DataFrame:\n    \"\"\"\n    Creates validation Pandas [`DataFrame`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html). The dataframe is in sequential (datetime) order.\n\n    !!! warning \"Memory usage\"\n\n        The whole validation set is loaded into memory. If the dataset size is larger than `'S'`, consider using `get_val_dataloader` instead.\n\n    Parameters:\n        flatten_ppi: Whether to flatten the PPI sequence into individual columns (named `IPT_X`, `DIR_X`, `SIZE_X`, `PUSH_X`, *X* being the index of the packet) or keep one `PPI` column with 2D data.\n\n    Returns:\n        Validation data as a dataframe.\n    \"\"\"\n    self._check_before_dataframe(check_val=True)\n    assert self.dataset_config is not None and self.val_dataset is not None\n    if len(self.val_dataset) &gt; DATAFRAME_SAMPLES_WARNING_THRESHOLD:\n        warnings.warn(f\"Validation set has ({len(self.val_dataset)} samples), consider using get_val_dataloader() instead\")\n    feature_names = self.dataset_config.get_feature_names(flatten_ppi=flatten_ppi)\n    return create_df_from_dataloader(dataloader=self.get_val_dataloader(),\n                                     feature_names=feature_names,\n                                     flatten_ppi=flatten_ppi,\n                            
         silent=self.silent)\n</code></pre>"},{"location":"reference_cesnet_dataset/#datasets.cesnet_dataset.CesnetDataset.get_test_df","title":"get_test_df","text":"<pre><code>get_test_df(flatten_ppi: bool = False) -&gt; pd.DataFrame\n</code></pre> <p>Creates test Pandas <code>DataFrame</code>. The dataframe is in sequential (datetime) order.</p> <p>When the dataset is used in the open-world setting, and unknown classes are defined, the returned test dataframe is composed of <code>test_known_size</code> samples of known classes followed by <code>test_unknown_size</code> samples of unknown classes.</p> <p>Memory usage</p> <p>The whole test set is loaded into memory. If the dataset size is larger than <code>'S'</code>, consider using <code>get_test_dataloader</code> instead.</p> <p>Parameters:</p> Name Type Description Default <code>flatten_ppi</code> <code>bool</code> <p>Whether to flatten the PPI sequence into individual columns (named <code>IPT_X</code>, <code>DIR_X</code>, <code>SIZE_X</code>, <code>PUSH_X</code>, X being the index of the packet) or keep one <code>PPI</code> column with 2D data.</p> <code>False</code> <p>Returns:</p> Type Description <code>DataFrame</code> <p>Test data as a dataframe.</p> Source code in <code>cesnet_datazoo\\datasets\\cesnet_dataset.py</code> <pre><code>def get_test_df(self, flatten_ppi: bool = False) -&gt; pd.DataFrame:\n    \"\"\"\n    Creates test Pandas [`DataFrame`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html). The dataframe is in sequential (datetime) order.\n\n\n    When the dataset is used in the open-world setting, and unknown classes are defined,\n    the returned test dataframe is composed of `test_known_size` samples of known classes followed by `test_unknown_size` samples of unknown classes.\n\n\n    !!! warning \"Memory usage\"\n\n        The whole test set is loaded into memory. 
If the dataset size is larger than `'S'`, consider using `get_test_dataloader` instead.\n\n    Parameters:\n        flatten_ppi: Whether to flatten the PPI sequence into individual columns (named `IPT_X`, `DIR_X`, `SIZE_X`, `PUSH_X`, *X* being the index of the packet) or keep one `PPI` column with 2D data.\n\n    Returns:\n        Test data as a dataframe.\n    \"\"\"\n    self._check_before_dataframe(check_test=True)\n    assert self.dataset_config is not None and self.test_dataset is not None\n    if len(self.test_dataset) &gt; DATAFRAME_SAMPLES_WARNING_THRESHOLD:\n        warnings.warn(f\"Test set has ({len(self.test_dataset)} samples), consider using get_test_dataloader() instead\")\n    feature_names = self.dataset_config.get_feature_names(flatten_ppi=flatten_ppi)\n    return create_df_from_dataloader(dataloader=self.get_test_dataloader(),\n                                     feature_names=feature_names,\n                                     flatten_ppi=flatten_ppi,\n                                     silent=self.silent)\n</code></pre>"},{"location":"reference_cesnet_dataset/#datasets.cesnet_dataset.CesnetDataset.get_num_classes","title":"get_num_classes","text":"<pre><code>get_num_classes() -&gt; int\n</code></pre> <p>Returns the number of classes in the current configuration of the dataset.</p> Source code in <code>cesnet_datazoo\\datasets\\cesnet_dataset.py</code> <pre><code>def get_num_classes(self) -&gt; int:\n    \"\"\"Returns the number of classes in the current configuration of the dataset.\"\"\"\n    if self.class_info is None:\n        raise ValueError(\"Dataset is not initialized, use set_dataset_config_and_initialize() before getting the number of classes\")\n    return self.class_info.num_classes\n</code></pre>"},{"location":"reference_cesnet_dataset/#datasets.cesnet_dataset.CesnetDataset.get_known_apps","title":"get_known_apps","text":"<pre><code>get_known_apps() -&gt; list[str]\n</code></pre> <p>Returns the list of known applications in the 
current configuration of the dataset.</p> Source code in <code>cesnet_datazoo\\datasets\\cesnet_dataset.py</code> <pre><code>def get_known_apps(self) -&gt; list[str]:\n    \"\"\"Returns the list of known applications in the current configuration of the dataset.\"\"\"\n    if self.class_info is None:\n        raise ValueError(\"Dataset is not initialized, use set_dataset_config_and_initialize() before getting known apps\")\n    return self.class_info.known_apps\n</code></pre>"},{"location":"reference_cesnet_dataset/#datasets.cesnet_dataset.CesnetDataset.get_unknown_apps","title":"get_unknown_apps","text":"<pre><code>get_unknown_apps() -&gt; list[str]\n</code></pre> <p>Returns the list of unknown applications in the current configuration of the dataset.</p> Source code in <code>cesnet_datazoo\\datasets\\cesnet_dataset.py</code> <pre><code>def get_unknown_apps(self) -&gt; list[str]:\n    \"\"\"Returns the list of unknown applications in the current configuration of the dataset.\"\"\"\n    if self.class_info is None:\n        raise ValueError(\"Dataset is not initialized, use set_dataset_config_and_initialize() before getting unknown apps\")\n    return self.class_info.unknown_apps\n</code></pre>"},{"location":"reference_cesnet_dataset/#datasets.cesnet_dataset.CesnetDataset.compute_dataset_statistics","title":"compute_dataset_statistics","text":"<pre><code>compute_dataset_statistics(\n    num_samples: int | Literal[\"all\"] = 10000000,\n    num_workers: int = 4,\n    batch_size: int = 16384,\n    disabled_apps: Optional[list[str]] = None,\n) -&gt; None\n</code></pre> <p>Computes dataset statistics and saves them to the <code>statistics_path</code> folder.</p> <p>Parameters:</p> Name Type Description Default <code>num_samples</code> <code>int | Literal['all']</code> <p>Number of samples to use for computing the statistics.</p> <code>10000000</code> <code>num_workers</code> <code>int</code> <p>Number of workers for loading data.</p> <code>4</code> <code>batch_size</code> 
<code>int</code> <p>Number of samples per batch for loading data.</p> <code>16384</code> <code>disabled_apps</code> <code>Optional[list[str]]</code> <p>List of applications to exclude from the statistics.</p> <code>None</code> Source code in <code>cesnet_datazoo\\datasets\\cesnet_dataset.py</code> <pre><code>def compute_dataset_statistics(self, num_samples: int | Literal[\"all\"] = 10_000_000, num_workers: int = 4, batch_size: int = 16384, disabled_apps: Optional[list[str]] = None) -&gt; None:\n    \"\"\"\n    Computes dataset statistics and saves them to the `statistics_path` folder.\n\n    Parameters:\n        num_samples: Number of samples to use for computing the statistics.\n        num_workers: Number of workers for loading data.\n        batch_size: Number of samples per batch for loading data.\n        disabled_apps: List of applications to exclude from the statistics.\n    \"\"\"\n    if disabled_apps:\n        bad_disabled_apps = [a for a in disabled_apps if a not in self.available_classes]\n        if len(bad_disabled_apps) &gt; 0:\n            raise ValueError(f\"Bad applications in disabled_apps {bad_disabled_apps}. 
Use applications available in dataset.available_classes\")\n    if not os.path.exists(self.statistics_path):\n        os.mkdir(self.statistics_path)\n    compute_dataset_statistics(database_path=self.database_path,\n                               tables_app_enum=self._tables_app_enum,\n                               tables_cat_enum=self._tables_cat_enum,\n                               output_dir=self.statistics_path,\n                               packet_histograms=self.metadata.packet_histograms,\n                               flowstats_features_boolean=self.metadata.flowstats_features_boolean,\n                               protocol=self.metadata.protocol,\n                               extra_fields=not self.name.startswith(\"CESNET-TLS22\"),\n                               disabled_apps=disabled_apps if disabled_apps is not None else [],\n                               num_samples=num_samples,\n                               num_workers=num_workers,\n                               batch_size=batch_size,\n                               silent=self.silent)\n</code></pre>"},{"location":"reference_dataset_config/","title":"Config class","text":""},{"location":"reference_dataset_config/#config.DatasetConfig","title":"config.DatasetConfig","text":"<p>The main class for the configuration of:</p> <ul> <li>Train, validation, test sets (dates, sizes, validation approach).</li> <li>Application selection \u2014 either the standard closed-world setting (only known classes) or the open-world setting (known and unknown classes).</li> <li>Data transformations. See the transforms page for more information.</li> <li>Dataloader options like batch sizes, order of loading, or number of workers.</li> </ul> <p>When initializing this class, pass a <code>CesnetDataset</code> instance to be configured and the desired configuration. 
Available options are here.</p> <p>Attributes:</p> Name Type Description <code>dataset</code> <code>InitVar[CesnetDataset]</code> <p>The dataset instance to be configured.</p> <code>data_root</code> <code>str</code> <p>Taken from the dataset instance.</p> <code>database_filename</code> <code>str</code> <p>Taken from the dataset instance.</p> <code>database_path</code> <code>str</code> <p>Taken from the dataset instance.</p> <code>servicemap_path</code> <code>str</code> <p>Taken from the dataset instance.</p> <code>flowstats_features</code> <code>list[str]</code> <p>Taken from <code>dataset.metadata.flowstats_features</code>.</p> <code>flowstats_features_boolean</code> <code>list[str]</code> <p>Taken from <code>dataset.metadata.flowstats_features_boolean</code>.</p> <code>flowstats_features_phist</code> <code>list[str]</code> <p>Taken from <code>dataset.metadata.packet_histograms</code> if <code>use_packet_histograms</code> is true, otherwise an empty list.</p> <code>other_fields</code> <code>list[str]</code> <p>Taken from <code>dataset.metadata.other_fields</code> if <code>return_other_fields</code> is true, otherwise an empty list.</p>"},{"location":"reference_dataset_config/#config.DatasetConfig--configuration-options","title":"Configuration options","text":"<p>Attributes:</p> Name Type Description <code>need_train_set</code> <code>bool</code> <p>Use to disable the train set. <code>Default: True</code></p> <code>need_val_set</code> <code>bool</code> <p>Use to disable the validation set. When <code>need_train_set</code> is false, the validation set will also be disabled. <code>Default: True</code></p> <code>need_test_set</code> <code>bool</code> <p>Use to disable the test set. <code>Default: True</code></p> <code>train_period_name</code> <code>str</code> <p>Name of the train period. 
See instructions.</p> <code>train_dates</code> <code>list[str]</code> <p>Dates used for creating a train set.</p> <code>train_dates_weigths</code> <code>Optional[list[int]]</code> <p>To use a non-uniform distribution of samples across train dates.</p> <code>val_approach</code> <code>ValidationApproach</code> <p>How a validation set should be created. Either split train data into train and validation or have a separate validation period. <code>Default: SPLIT_FROM_TRAIN</code></p> <code>train_val_split_fraction</code> <code>float</code> <p>The fraction of validation samples when splitting from the train set. <code>Default: 0.2</code></p> <code>val_period_name</code> <code>str</code> <p>Name of the validation period. See instructions.</p> <code>val_dates</code> <code>list[str]</code> <p>Dates used for creating a validation set.</p> <code>test_period_name</code> <code>str</code> <p>Name of the test period. See instructions.</p> <code>test_dates</code> <code>list[str]</code> <p>Dates used for creating a test set.</p> <code>apps_selection</code> <code>AppSelection</code> <p>How to select application classes. <code>Default: ALL_KNOWN</code></p> <code>apps_selection_topx</code> <code>int</code> <p>Take top X as known.</p> <code>apps_selection_background_unknown</code> <code>list[str]</code> <p>Provide a list of background traffic classes to be used as unknown.</p> <code>apps_selection_fixed_known</code> <code>list[str]</code> <p>Provide a list of manually selected known applications.</p> <code>apps_selection_fixed_unknown</code> <code>list[str]</code> <p>Provide a list of manually selected unknown applications.</p> <code>disabled_apps</code> <code>list[str]</code> <p>List of applications to be disabled and not used at all.</p> <code>min_train_samples_check</code> <code>MinTrainSamplesCheck</code> <p>How to handle applications with not enough training samples. 
<code>Default: DISABLE_APPS</code></p> <code>min_train_samples_per_app</code> <code>int</code> <p>Defines the threshold for not enough. <code>Default: 100</code></p> <code>random_state</code> <code>int</code> <p>Fix all random processes performed during dataset initialization. <code>Default: 420</code></p> <code>fold_id</code> <code>int</code> <p>To perform N-fold cross-validation, set this to <code>1..N</code>. Each fold will use the same configuration but a different random seed. <code>Default: 0</code></p> <code>train_workers</code> <code>int</code> <p>Number of workers for loading train data. <code>0</code> means that the data will be loaded in the main process. <code>Default: 4</code></p> <code>test_workers</code> <code>int</code> <p>Number of workers for loading test data. <code>0</code> means that the data will be loaded in the main process. <code>Default: 1</code></p> <code>val_workers</code> <code>int</code> <p>Number of workers for loading validation data. <code>0</code> means that the data will be loaded in the main process. <code>Default: 1</code></p> <code>batch_size</code> <code>int</code> <p>Number of samples per batch. <code>Default: 192</code></p> <code>test_batch_size</code> <code>int</code> <p>Number of samples per batch for loading validation and test data. <code>Default: 2048</code></p> <code>preload_val</code> <code>bool</code> <p>Whether to dump the validation set with <code>numpy.savez_compressed</code> and preload it in future runs. Useful when running a lot of experiments with the same dataset configuration. <code>Default: True</code></p> <code>preload_test</code> <code>bool</code> <p>Whether to dump the test set with <code>numpy.savez_compressed</code> and preload it in future runs. <code>Default: False</code></p> <code>train_size</code> <code>int | Literal['all']</code> <p>Size of the train set. See instructions. <code>Default: all</code></p> <code>val_known_size</code> <code>int | Literal['all']</code> <p>Size of the validation set. 
See instructions. <code>Default: all</code></p> <code>test_known_size</code> <code>int | Literal['all']</code> <p>Size of the test set. See instructions. <code>Default: all</code></p> <code>val_unknown_size</code> <code>int | Literal['all']</code> <p>Size of the unknown classes validation set. Use for evaluation in the open-world setting. <code>Default: 0</code></p> <code>test_unknown_size</code> <code>int | Literal['all']</code> <p>Size of the unknown classes test set. Use for evaluation in the open-world setting. <code>Default: 0</code></p> <code>train_dataloader_order</code> <code>DataLoaderOrder</code> <p>Whether to load train data in sequential or random order. <code>Default: RANDOM</code></p> <code>train_dataloader_seed</code> <code>Optional[int]</code> <p>Seed for loading train data in random order. <code>Default: None</code></p> <code>return_other_fields</code> <code>bool</code> <p>Whether to return auxiliary fields, such as communicating hosts, flow times, and more fields extracted from the ClientHello message. <code>Default: False</code></p> <code>return_tensors</code> <code>bool</code> <p>Use for returning <code>torch.Tensor</code> from dataloaders. Dataframes are not available when this option is used. <code>Default: False</code></p> <code>use_packet_histograms</code> <code>bool</code> <p>Whether to use packet histogram features, if available in the dataset. <code>Default: True</code></p> <code>use_tcp_features</code> <code>bool</code> <p>Whether to use TCP features, if available in the dataset. <code>Default: True</code></p> <code>use_push_flags</code> <code>bool</code> <p>Whether to use push flags in packet sequences, if available in the dataset. <code>Default: False</code></p> <code>fit_scalers_samples</code> <code>int | float</code> <p>Used when scaling transformation is configured and requires fitting. Fraction of train samples used for fitting, if float. The absolute number of samples otherwise. 
<code>Default: 0.25</code></p> <code>ppi_transform</code> <code>Optional[Callable]</code> <p>Transform function for PPI sequences. See the transforms page for more information. <code>Default: None</code></p> <code>flowstats_transform</code> <code>Optional[Callable]</code> <p>Transform function for flow statistics. See the transforms page for more information. <code>Default: None</code></p> <code>flowstats_phist_transform</code> <code>Optional[Callable]</code> <p>Transform function for packet histograms. See the transforms page for more information. <code>Default: None</code></p>"},{"location":"reference_dataset_config/#config.DatasetConfig--how-to-configure-train-validation-and-test-sets","title":"How to configure train, validation, and test sets","text":"<p>There are three options for how to define train/validation/test dates.</p> <ol> <li>Choose a predefined time period (<code>train_period_name</code>, <code>val_period_name</code>, or <code>test_period_name</code>) available in <code>dataset.time_periods</code> and leave the list of dates (<code>train_dates</code>, <code>val_dates</code>, or <code>test_dates</code>) empty.</li> <li>Provide a list of dates and a name for the time period. The dates are checked against <code>dataset.available_dates</code>.</li> <li>Do not specify anything and use the dataset's defaults <code>dataset.default_train_period_name</code> and <code>dataset.default_test_period_name</code>.</li> </ol> <p>There are two options for configuring sizes of train/validation/test sets.</p> <ol> <li>Select an appropriate dataset size (default is <code>S</code>) when creating the <code>CesnetDataset</code> instance and leave <code>train_size</code>, <code>val_known_size</code>, and <code>test_known_size</code> with their default <code>all</code> value. 
This will create train/validation/test sets with all samples available in the selected dataset size (of course, depending on the selected dates and validation approach).</li> <li>Provide exact sizes in <code>train_size</code>, <code>val_known_size</code>, and <code>test_known_size</code>. This will create train/validation/test sets of the given sizes by taking a random subset. This is especially useful when you use the <code>ORIG</code> dataset size and want to control the size of experiments.</li> </ol> <p>Tip</p> <p>The default approach for creating a validation set is to randomly split the train data into train and validation. The second approach is to define separate validation dates. See ValidationApproach.</p> Source code in <code>cesnet_datazoo\\config.py</code> <pre><code>@dataclass(config=C)\nclass DatasetConfig():\n    \"\"\"\n    The main class for the configuration of:\n\n    - Train, validation, test sets (dates, sizes, validation approach).\n    - Application selection \u2014 either the standard closed-world setting (only *known* classes) or the open-world setting (*known* and *unknown* classes).\n    - Data transformations. See the [transforms][transforms] page for more information.\n    - Dataloader options like batch sizes, order of loading, or number of workers.\n\n    When initializing this class, pass a [`CesnetDataset`][datasets.cesnet_dataset.CesnetDataset] instance to be configured and the desired configuration. 
Available options are [here][config.DatasetConfig--configuration-options].\n\n    Attributes:\n        dataset: The dataset instance to be configured.\n        data_root: Taken from the dataset instance.\n        database_filename: Taken from the dataset instance.\n        database_path: Taken from the dataset instance.\n        servicemap_path: Taken from the dataset instance.\n        flowstats_features: Taken from `dataset.metadata.flowstats_features`.\n        flowstats_features_boolean: Taken from `dataset.metadata.flowstats_features_boolean`.\n        flowstats_features_phist: Taken from `dataset.metadata.packet_histograms` if `use_packet_histograms` is true, otherwise an empty list.\n        other_fields: Taken from `dataset.metadata.other_fields` if `return_other_fields` is true, otherwise an empty list.\n\n    # Configuration options\n\n    Attributes:\n        need_train_set: Use to disable the train set. `Default: True`\n        need_val_set: Use to disable the validation set. When `need_train_set` is false, the validation set will also be disabled. `Default: True`\n        need_test_set: Use to disable the test set. `Default: True`\n        train_period_name: Name of the train period. See [instructions][config.DatasetConfig--how-to-configure-train-validation-and-test-sets].\n        train_dates: Dates used for creating a train set.\n        train_dates_weigths: To use a non-uniform distribution of samples across train dates.\n        val_approach: How a validation set should be created. Either split train data into train and validation or have a separate validation period. `Default: SPLIT_FROM_TRAIN`\n        train_val_split_fraction: The fraction of validation samples when splitting from the train set. `Default: 0.2`\n        val_period_name: Name of the validation period. 
See [instructions][config.DatasetConfig--how-to-configure-train-validation-and-test-sets].\n        val_dates: Dates used for creating a validation set.\n        test_period_name: Name of the test period. See [instructions][config.DatasetConfig--how-to-configure-train-validation-and-test-sets].\n        test_dates: Dates used for creating a test set.\n\n        apps_selection: How to select application classes. `Default: ALL_KNOWN`\n        apps_selection_topx: Take top X as known.\n        apps_selection_background_unknown: Provide a list of background traffic classes to be used as unknown.\n        apps_selection_fixed_known: Provide a list of manually selected known applications.\n        apps_selection_fixed_unknown: Provide a list of manually selected unknown applications.\n        disabled_apps: List of applications to be disabled and not used at all.\n        min_train_samples_check: How to handle applications with *not enough* training samples. `Default: DISABLE_APPS`\n        min_train_samples_per_app: Defines the threshold for *not enough*. `Default: 100`\n\n        random_state: Fix all random processes performed during dataset initialization. `Default: 420`\n        fold_id: To perform N-fold cross-validation, set this to `1..N`. Each fold will use the same configuration but a different random seed. `Default: 0`\n        train_workers: Number of workers for loading train data. `0` means that the data will be loaded in the main process. `Default: 4`\n        test_workers: Number of workers for loading test data. `0` means that the data will be loaded in the main process. `Default: 1`\n        val_workers: Number of workers for loading validation data. `0` means that the data will be loaded in the main process. `Default: 1`\n        batch_size: Number of samples per batch. `Default: 192`\n        test_batch_size: Number of samples per batch for loading validation and test data. 
`Default: 2048`\n        preload_val: Whether to dump the validation set with `numpy.savez_compressed` and preload it in future runs. Useful when running a lot of experiments with the same dataset configuration. `Default: True`\n        preload_test: Whether to dump the test set with `numpy.savez_compressed` and preload it in future runs. `Default: False`\n        train_size: Size of the train set. See [instructions][config.DatasetConfig--how-to-configure-train-validation-and-test-sets]. `Default: all`\n        val_known_size: Size of the validation set. See [instructions][config.DatasetConfig--how-to-configure-train-validation-and-test-sets]. `Default: all`\n        test_known_size: Size of the test set. See [instructions][config.DatasetConfig--how-to-configure-train-validation-and-test-sets]. `Default: all`\n        val_unknown_size: Size of the unknown classes validation set. Use for evaluation in the open-world setting. `Default: 0`\n        test_unknown_size: Size of the unknown classes test set. Use for evaluation in the open-world setting. `Default: 0`\n        train_dataloader_order: Whether to load train data in sequential or random order. `Default: RANDOM`\n        train_dataloader_seed: Seed for loading train data in random order. `Default: None`\n\n        return_other_fields: Whether to return [auxiliary fields][other-fields], such as communicating hosts, flow times, and more fields extracted from the ClientHello message. `Default: False`\n        return_tensors: Use for returning `torch.Tensor` from dataloaders. Dataframes are not available when this option is used. `Default: False`\n        use_packet_histograms: Whether to use packet histogram features, if available in the dataset. `Default: True`\n        use_tcp_features: Whether to use TCP features, if available in the dataset. `Default: True`\n        use_push_flags: Whether to use push flags in packet sequences, if available in the dataset. 
`Default: False`\n        fit_scalers_samples: Used when scaling transformation is configured and requires fitting. Fraction of train samples used for fitting, if float. The absolute number of samples otherwise. `Default: 0.25`\n        ppi_transform: Transform function for PPI sequences. See the [transforms][transforms] page for more information. `Default: None`\n        flowstats_transform: Transform function for flow statistics. See the [transforms][transforms] page for more information. `Default: None`\n        flowstats_phist_transform: Transform function for packet histograms. See the [transforms][transforms] page for more information. `Default: None`\n\n    # How to configure train, validation, and test sets\n    There are three options for how to define train/validation/test dates.\n\n    1. Choose a predefined time period (`train_period_name`, `val_period_name`, or `test_period_name`) available in `dataset.time_periods` and leave the list of dates (`train_dates`, `val_dates`, or `test_dates`) empty.\n    2. Provide a list of dates and a name for the time period. The dates are checked against `dataset.available_dates`.\n    3. Do not specify anything and use the dataset's defaults `dataset.default_train_period_name` and `dataset.default_test_period_name`.\n\n    There are two options for configuring sizes of train/validation/test sets.\n\n    1. Select an appropriate dataset size (default is `S`) when creating the [`CesnetDataset`][datasets.cesnet_dataset.CesnetDataset] instance and leave `train_size`, `val_known_size`, and `test_known_size` with their default `all` value.\n    This will create train/validation/test sets with all samples available in the selected dataset size (of course, depending on the selected dates and validation approach).\n    2. Provide exact sizes in `train_size`, `val_known_size`, and `test_known_size`. 
This will create train/validation/test sets of the given sizes by doing a random subset.\n    This is especially useful when using the `ORIG` dataset size and want to control the size of experiments.\n\n    !!! tip Validation set\n        The default approach for creating a validation set is to randomly split the train data into train and validation. The second approach is to define separate validation dates. See [ValidationApproach][config.ValidationApproach].\n\n    \"\"\"\n    dataset: InitVar[CesnetDataset]\n    data_root: str = field(init=False)\n    database_filename: str =  field(init=False)\n    database_path: str =  field(init=False)\n    servicemap_path: str = field(init=False)\n    flowstats_features: list[str] = field(init=False)\n    flowstats_features_boolean: list[str] = field(init=False)\n    flowstats_features_phist: list[str] = field(init=False)\n    other_fields: list[str] = field(init=False)\n\n    need_train_set: bool = True\n    need_val_set: bool = True\n    need_test_set: bool = True\n    train_period_name: str = \"\"\n    train_dates: list[str] = field(default_factory=list)\n    train_dates_weigths: Optional[list[int]] = None\n    val_approach: ValidationApproach = ValidationApproach.SPLIT_FROM_TRAIN\n    train_val_split_fraction: float = 0.2\n    val_period_name: str = \"\"\n    val_dates: list[str] = field(default_factory=list)\n    test_period_name: str = \"\"\n    test_dates: list[str] = field(default_factory=list)\n\n    apps_selection: AppSelection = AppSelection.ALL_KNOWN\n    apps_selection_topx: int = 0\n    apps_selection_background_unknown: list[str] = field(default_factory=list)\n    apps_selection_fixed_known: list[str] = field(default_factory=list)\n    apps_selection_fixed_unknown: list[str] = field(default_factory=list)\n    disabled_apps: list[str] = field(default_factory=list)\n    min_train_samples_check: MinTrainSamplesCheck = MinTrainSamplesCheck.DISABLE_APPS\n    min_train_samples_per_app: int = 100\n\n    
random_state: int = 420\n    fold_id: int = 0\n    train_workers: int = 4\n    test_workers: int = 1\n    val_workers: int = 1\n    batch_size: int = 192\n    test_batch_size: int = 2048\n    preload_val: bool = True\n    preload_test: bool = False\n    train_size: int | Literal[\"all\"] = \"all\"\n    val_known_size: int | Literal[\"all\"] = \"all\"\n    test_known_size: int | Literal[\"all\"] = \"all\"\n    val_unknown_size: int | Literal[\"all\"] = 0\n    test_unknown_size: int | Literal[\"all\"] = 0\n    train_dataloader_order: DataLoaderOrder = DataLoaderOrder.RANDOM\n    train_dataloader_seed: Optional[int] = None\n\n    return_other_fields: bool = False\n    return_tensors: bool = False\n    use_packet_histograms: bool = False\n    use_tcp_features: bool = False\n    use_push_flags: bool = False\n    fit_scalers_samples: int | float = 0.25\n    ppi_transform: Optional[Callable] = None\n    flowstats_transform: Optional[Callable] = None\n    flowstats_phist_transform: Optional[Callable] = None\n\n    def __post_init__(self, dataset: CesnetDataset):\n        \"\"\"\n        Ensures valid configuration. 
Catches all incompatible options and raises exceptions as soon as possible.\n        \"\"\"\n        self.data_root = dataset.data_root\n        self.servicemap_path = dataset.servicemap_path\n        self.database_filename = dataset.database_filename\n        self.database_path = dataset.database_path\n\n        if not self.need_train_set:\n            self.need_val_set = False\n            if self.apps_selection != AppSelection.FIXED:\n                raise ValueError(\"Application selection has to be fixed when need_train_set is false\")\n            if (len(self.train_dates) &gt; 0 or self.train_period_name != \"\"):\n                raise ValueError(\"train_dates and train_period_name cannot be specified when need_train_set is false\")\n        else:\n            # Configure train dates\n            if len(self.train_dates) &gt; 0 and self.train_period_name == \"\":\n                raise ValueError(\"train_period_name has to be specified when train_dates are set\")\n            if len(self.train_dates) == 0 and self.train_period_name != \"\":\n                if self.train_period_name not in dataset.time_periods:\n                    raise ValueError(f\"Unknown train_period_name {self.train_period_name}. 
Use time period available in dataset.time_periods\")\n                self.train_dates = dataset.time_periods[self.train_period_name]\n            if len(self.train_dates) == 0 and self.train_period_name == \"\":\n                self.train_period_name = dataset.default_train_period_name\n                self.train_dates = dataset.time_periods[dataset.default_train_period_name]\n        # Configure test dates\n        if not self.need_test_set:\n            if (len(self.test_dates) &gt; 0 or self.test_period_name != \"\"):\n                raise ValueError(\"test_dates and test_period_name cannot be specified when need_test_set is false\")\n        else:\n            if len(self.test_dates) &gt; 0 and self.test_period_name == \"\":\n                raise ValueError(\"test_period_name has to be specified when test_dates are set\")\n            if len(self.test_dates) == 0 and self.test_period_name != \"\":\n                if self.test_period_name not in dataset.time_periods:\n                    raise ValueError(f\"Unknown test_period_name {self.test_period_name}. 
Use time period available in dataset.time_periods\")\n                self.test_dates = dataset.time_periods[self.test_period_name]\n            if len(self.test_dates) == 0 and self.test_period_name == \"\":\n                self.test_period_name = dataset.default_test_period_name\n                self.test_dates = dataset.time_periods[dataset.default_test_period_name]\n        # Configure val dates\n        if (not self.need_val_set or self.val_approach == ValidationApproach.SPLIT_FROM_TRAIN) and (len(self.val_dates) &gt; 0 or self.val_period_name != \"\"):\n            raise ValueError(\"val_dates and val_period_name cannot be specified when need_val_set is false or the validation approach is split-from-train\")\n        if self.val_approach == ValidationApproach.VALIDATION_DATES:\n            if len(self.val_dates) &gt; 0 and self.val_period_name == \"\":\n                raise ValueError(\"val_period_name has to be specified when val_dates are set\")\n            if len(self.val_dates) == 0 and self.val_period_name != \"\":\n                if self.val_period_name not in dataset.time_periods:\n                    raise ValueError(f\"Unknown val_period_name {self.val_period_name}. 
Use time period available in dataset.time_periods\")\n                self.val_dates = dataset.time_periods[self.val_period_name]\n            if len(self.val_dates) == 0 and self.val_period_name == \"\":\n                raise ValueError(\"val_period_name and val_dates (or val_period_name from dataset.time_periods) have to be specified when the validation approach is validation-dates\")\n        # Check if train, val, and test dates are available in the dataset\n        bad_train_dates = [t for t in self.train_dates if t not in dataset.available_dates]\n        bad_val_dates = [t for t in self.val_dates if t not in dataset.available_dates]\n        bad_test_dates = [t for t in self.test_dates if t not in dataset.available_dates]\n        if len(bad_train_dates) &gt; 0:\n            raise ValueError(f\"Bad train dates {bad_train_dates}. Use dates available in dataset.available_dates (collection period {dataset.metadata.collection_period})\" \\\n                            + (f\". These dates are missing from the dataset collection period {dataset.metadata.missing_dates_in_collection_period}\" if dataset.metadata.missing_dates_in_collection_period else \"\"))\n        if len(bad_val_dates) &gt; 0:\n            raise ValueError(f\"Bad validation dates {bad_val_dates}. Use dates available in dataset.available_dates (collection period {dataset.metadata.collection_period})\" \\\n                            + (f\". These dates are missing from the dataset collection period {dataset.metadata.missing_dates_in_collection_period}\" if dataset.metadata.missing_dates_in_collection_period else \"\"))\n        if len(bad_test_dates) &gt; 0:\n            raise ValueError(f\"Bad test dates {bad_test_dates}. Use dates available in dataset.available_dates (collection period {dataset.metadata.collection_period})\" \\\n                            + (f\". 
These dates are missing from the dataset collection period {dataset.metadata.missing_dates_in_collection_period}\" if dataset.metadata.missing_dates_in_collection_period else \"\"))\n        # Check time order of train, val, and test periods\n        train_dates = [datetime.strptime(date_str, \"%Y%m%d\").date() for date_str in self.train_dates]\n        test_dates = [datetime.strptime(date_str, \"%Y%m%d\").date() for date_str in self.test_dates]\n        if len(train_dates) &gt; 0 and len(test_dates) &gt; 0  and min(test_dates) &lt;= max(train_dates):\n            warnings.warn(f\"Some test dates ({min(test_dates).strftime('%Y%m%d')}) are before or equal to the last train date ({max(train_dates).strftime('%Y%m%d')}). This might lead to improper evaluation and should be avoided.\")\n        if self.val_approach == ValidationApproach.VALIDATION_DATES:\n            # Train dates are guaranteed to be set\n            val_dates = [datetime.strptime(date_str, \"%Y%m%d\").date() for date_str in self.val_dates]\n            if min(val_dates) &lt;= max(train_dates):\n                warnings.warn(f\"Some validation dates ({min(val_dates).strftime('%Y%m%d')}) are before or equal to the last train date ({max(train_dates).strftime('%Y%m%d')}). This might lead to improper evaluation and should be avoided.\")\n            if len(test_dates) &gt; 0 and min(test_dates) &lt;= max(val_dates):\n                warnings.warn(f\"Some test dates ({min(test_dates).strftime('%Y%m%d')}) are before or equal to the last validation date ({max(val_dates).strftime('%Y%m%d')}). 
This might lead to improper evaluation and should be avoided.\")\n        # Configure features\n        self.flowstats_features = dataset.metadata.flowstats_features\n        self.flowstats_features_boolean = dataset.metadata.flowstats_features_boolean\n        self.other_fields = dataset.metadata.other_fields if self.return_other_fields else []\n        if self.use_packet_histograms:\n            if len(dataset.metadata.packet_histograms) == 0:\n                raise ValueError(\"This dataset does not support use_packet_histograms\")\n            self.flowstats_features_phist = dataset.metadata.packet_histograms\n        else:\n            self.flowstats_features_phist = []\n            if self.flowstats_phist_transform is not None:\n                raise ValueError(\"flowstats_phist_transform cannot be specified when use_packet_histograms is false\")\n        if dataset.metadata.protocol == Protocol.TLS:\n            if self.use_tcp_features:\n                self.flowstats_features_boolean = self.flowstats_features_boolean + SELECTED_TCP_FLAGS\n            if self.use_push_flags and \"PUSH_FLAG\" not in dataset.metadata.ppi_features:\n                raise ValueError(\"This TLS dataset does not support use_push_flags\")\n        if dataset.metadata.protocol == Protocol.QUIC:\n            if self.use_tcp_features:\n                raise ValueError(\"QUIC datasets do not support use_tcp_features\")\n            if self.use_push_flags:\n                raise ValueError(\"QUIC datasets do not support use_push_flags\")\n        # When train_dates_weigths are used, train_size and val_known_size have to be specified\n        if self.train_dates_weigths is not None:\n            if not self.need_train_set:\n                raise ValueError(\"train_dates_weigths cannot be specified when need_train_set is false\")\n            if len(self.train_dates_weigths) != len(self.train_dates):\n                raise ValueError(\"train_dates_weigths has to have the same length as 
train_dates\")\n            if self.train_size == \"all\":\n                raise ValueError(\"train_size cannot be 'all' when train_dates_weigths are speficied\")\n            if self.val_approach == ValidationApproach.SPLIT_FROM_TRAIN and self.val_known_size == \"all\":\n                raise ValueError(\"val_known_size cannot be 'all' when train_dates_weigths are speficied and validation_approach is split-from-train\")\n        # App selection\n        if self.apps_selection == AppSelection.ALL_KNOWN:\n            self.val_unknown_size = 0\n            self.test_unknown_size = 0\n            if self.apps_selection_topx != 0 or len(self.apps_selection_background_unknown) &gt; 0 or len(self.apps_selection_fixed_known) &gt; 0 or len(self.apps_selection_fixed_unknown) &gt; 0:\n                raise ValueError(\"apps_selection_topx, apps_selection_background_unknown, apps_selection_fixed_known, and apps_selection_fixed_unknown cannot be specified when application selection is all-known\")\n        if self.apps_selection == AppSelection.TOPX_KNOWN:\n            if self.apps_selection_topx == 0:\n                raise ValueError(\"apps_selection_topx has to be greater than 0 when application selection is top-x-known\")\n            if len(self.apps_selection_background_unknown) &gt; 0 or len(self.apps_selection_fixed_known) &gt; 0 or len(self.apps_selection_fixed_unknown) &gt; 0:\n                raise ValueError(\"apps_selection_background_unknown, apps_selection_fixed_known, and apps_selection_fixed_unknown cannot be specified when application selection is top-x-known\")\n        if self.apps_selection == AppSelection.BACKGROUND_UNKNOWN:\n            if len(self.apps_selection_background_unknown) == 0:\n                raise ValueError(\"apps_selection_background_unknown has to be specified when application selection is background-unknown\")\n            bad_apps = [a for a in self.apps_selection_background_unknown if a not in dataset.available_classes]\n            
if len(bad_apps) &gt; 0:\n                raise ValueError(f\"Bad applications in apps_selection_background_unknown {bad_apps}. Use applications available in dataset.available_classes\")\n            if self.apps_selection_topx != 0 or len(self.apps_selection_fixed_known) &gt; 0 or len(self.apps_selection_fixed_unknown) &gt; 0:\n                raise ValueError(\"apps_selection_topx, apps_selection_fixed_known, and apps_selection_fixed_unknown cannot be specified when application selection is background-unknown\")\n        if self.apps_selection == AppSelection.FIXED:\n            if len(self.apps_selection_fixed_known) == 0:\n                raise ValueError(\"apps_selection_fixed_known has to be specified when application selection is fixed\")\n            bad_apps = [a for a in self.apps_selection_fixed_known + self.apps_selection_fixed_unknown if a not in dataset.available_classes]\n            if len(bad_apps) &gt; 0:\n                raise ValueError(f\"Bad applications in apps_selection_fixed_known or apps_selection_fixed_unknown {bad_apps}. Use applications available in dataset.available_classes\")\n            if len(self.disabled_apps) &gt; 0:\n                raise ValueError(\"disabled_apps cannot be specified when application selection is fixed\")\n            if self.min_train_samples_per_app != 0 and self.min_train_samples_per_app != 100:\n                warnings.warn(\"min_train_samples_per_app is not used when application selection is fixed\")\n            if self.apps_selection_topx != 0 or len(self.apps_selection_background_unknown) &gt; 0:\n                raise ValueError(\"apps_selection_topx and apps_selection_background_unknown cannot be specified when application selection is fixed\")\n        # More asserts\n        bad_disabled_apps = [a for a in self.disabled_apps if a not in dataset.available_classes]\n        if len(bad_disabled_apps) &gt; 0:\n            raise ValueError(f\"Bad applications in disabled_apps {bad_disabled_apps}. 
Use applications available in dataset.available_classes\")\n        if isinstance(self.fit_scalers_samples, float) and (self.fit_scalers_samples &lt;= 0 or self.fit_scalers_samples &gt; 1):\n            raise ValueError(\"fit_scalers_samples has to be either float between 0 and 1 (giving the fraction of training samples used for fitting scalers) or an integer\")\n\n    def get_flowstats_features_len(self) -&gt; int:\n        \"\"\"Gets the number of flow statistics features.\"\"\"\n        return len(self.flowstats_features) + len(self.flowstats_features_boolean) + PHIST_BIN_COUNT * len(self.flowstats_features_phist)\n\n    def get_flowstats_feature_names_expanded(self, shorter_names: bool = False) -&gt; list[str]:\n        \"\"\"Gets names of flow statistics features. Packet histograms are expanded into bin features.\"\"\"\n        phist_mapping = {\n            \"PHIST_SRC_SIZES\": [f\"PSIZE_BIN{i}\" for i in range(1, PHIST_BIN_COUNT + 1)],\n            \"PHIST_DST_SIZES\": [f\"PSIZE_BIN{i}_REV\" for i in range(1, PHIST_BIN_COUNT + 1)],\n            \"PHIST_SRC_IPT\": [f\"IPT_BIN{i}\" for i in range(1, PHIST_BIN_COUNT + 1)],\n            \"PHIST_DST_IPT\": [f\"IPT_BIN{i}_REV\" for i in range(1, PHIST_BIN_COUNT + 1)],\n        }\n        short_names_mapping = {\n            \"FLOW_ENDREASON_IDLE\": \"FEND_IDLE\",\n            \"FLOW_ENDREASON_ACTIVE\": \"FEND_ACTIVE\",\n            \"FLOW_ENDREASON_END\": \"FEND_END\",\n            \"FLOW_ENDREASON_OTHER\": \"FEND_OTHER\",\n            \"FLAG_CWR\": \"F_CWR\",\n            \"FLAG_CWR_REV\": \"F_CWR_REV\",\n            \"FLAG_ECE\": \"F_ECE\",\n            \"FLAG_ECE_REV\": \"F_ECE_REV\",\n            \"FLAG_PSH_REV\": \"F_PSH_REV\",\n            \"FLAG_RST\": \"F_RST\",\n            \"FLAG_RST_REV\": \"F_RST_REV\",\n            \"FLAG_FIN\": \"F_FIN\",\n            \"FLAG_FIN_REV\": \"F_FIN_REV\",\n        }\n        feature_names = self.flowstats_features[:]\n        for f in self.flowstats_features_boolean:\n    
        if shorter_names and f in short_names_mapping:\n                feature_names.append(short_names_mapping[f])\n            else:\n                feature_names.append(f)\n        for f in self.flowstats_features_phist:\n            feature_names.extend(phist_mapping[f])\n        assert len(feature_names) == self.get_flowstats_features_len()\n        return feature_names\n\n    def get_ppi_feature_names(self) -&gt; list[str]:\n        \"\"\"Gets the names of flattened PPI features.\"\"\"\n        ppi_feature_names = [f\"IPT_{i}\" for i in range(1, PPI_MAX_LEN + 1)] + \\\n                               [f\"DIR_{i}\" for i in range(1, PPI_MAX_LEN + 1)] + \\\n                               [f\"SIZE_{i}\" for i in range(1, PPI_MAX_LEN + 1)]\n        if self.use_push_flags:\n            ppi_feature_names += [f\"PUSH_{i}\" for i in range(1, PPI_MAX_LEN + 1)]\n        return ppi_feature_names\n\n    def get_ppi_channels(self) -&gt; list[int]:\n        \"\"\"Gets the available features (channels) in PPI sequences.\"\"\"\n        if self.use_push_flags:\n            return TCP_PPI_CHANNELS\n        else:\n            return UDP_PPI_CHANNELS\n\n    def get_feature_names(self, flatten_ppi: bool = False, shorter_names: bool = False) -&gt; list[str]:\n        \"\"\"\n        Gets feature names.\n\n        Parameters:\n            flatten_ppi: Whether to flatten PPI into individual feature names or keep one `PPI` column.\n        \"\"\"\n        feature_names = self.get_ppi_feature_names() if flatten_ppi else [\"PPI\"]\n        feature_names += self.get_flowstats_feature_names_expanded(shorter_names=shorter_names)\n        return feature_names\n\n    def _get_train_tables_paths(self) -&gt; list[str]:\n        return list(map(lambda t: f\"/flows/D{t}\", self.train_dates))\n\n    def _get_val_tables_paths(self) -&gt; list[str]:\n        if self.val_approach == ValidationApproach.SPLIT_FROM_TRAIN:\n            return list(map(lambda t: f\"/flows/D{t}\", self.train_dates))\n   
     return list(map(lambda t: f\"/flows/D{t}\", self.val_dates))\n\n    def _get_test_tables_paths(self) -&gt; list[str]:\n        return list(map(lambda t: f\"/flows/D{t}\", self.test_dates))\n\n    def _get_train_data_hash(self) -&gt; str:\n        train_data_params = self._get_train_data_params()\n        params_hash = hashlib.sha256(json.dumps(dataclasses.asdict(train_data_params), sort_keys=True, default=str).encode()).hexdigest()\n        params_hash = params_hash[:10]\n        return params_hash\n\n    def _get_train_data_path(self) -&gt; str:\n        if self.need_train_set:\n            params_hash = self._get_train_data_hash()\n            return os.path.join(self.data_root, \"train-data\", f\"{params_hash}_{self.random_state}\", f\"fold_{self.fold_id}\")\n        else:\n            return os.path.join(self.data_root, \"train-data\", \"default\")\n\n    def _get_train_data_params(self) -&gt; TrainDataParams:\n        return TrainDataParams(\n            database_filename=self.database_filename,\n            train_period_name=self.train_period_name,\n            train_tables_paths=self._get_train_tables_paths(),\n            apps_selection=self.apps_selection,\n            apps_selection_topx=self.apps_selection_topx,\n            apps_selection_background_unknown=self.apps_selection_background_unknown,\n            apps_selection_fixed_known=self.apps_selection_fixed_known,\n            apps_selection_fixed_unknown=self.apps_selection_fixed_unknown,\n            disabled_apps=self.disabled_apps,\n            min_train_samples_per_app=self.min_train_samples_per_app,\n            min_train_samples_check=self.min_train_samples_check,)\n\n    def _get_val_data_params_and_path(self, known_apps: list[str], unknown_apps: list[str]) -&gt; tuple[TestDataParams, str]:\n        assert self.val_approach == ValidationApproach.VALIDATION_DATES\n        val_data_params = TestDataParams(\n            database_filename=self.database_filename,\n            
test_period_name=self.val_period_name,\n            test_tables_paths=self._get_val_tables_paths(),\n            known_apps=known_apps,\n            unknown_apps=unknown_apps,)\n        params_hash = hashlib.sha256(json.dumps(dataclasses.asdict(val_data_params), sort_keys=True).encode()).hexdigest()\n        params_hash = params_hash[:10]\n        val_data_path = os.path.join(self.data_root, \"val-data\", f\"{params_hash}_{self.random_state}\")\n        return val_data_params, val_data_path\n\n    def _get_test_data_params_and_path(self, known_apps: list[str], unknown_apps: list[str]) -&gt; tuple[TestDataParams, str]:\n        test_data_params = TestDataParams(\n            database_filename=self.database_filename,\n            test_period_name=self.test_period_name,\n            test_tables_paths=self._get_test_tables_paths(),\n            known_apps=known_apps,\n            unknown_apps=unknown_apps,)\n        params_hash = hashlib.sha256(json.dumps(dataclasses.asdict(test_data_params), sort_keys=True).encode()).hexdigest()\n        params_hash = params_hash[:10]\n        test_data_path = os.path.join(self.data_root, \"test-data\", f\"{params_hash}_{self.random_state}\")\n        return test_data_params, test_data_path\n\n    @model_validator(mode=\"before\") # type: ignore\n    @classmethod\n    def check_deprecated_args(cls, values):\n        kwargs = values.kwargs\n        if \"train_period\" in kwargs:\n            warnings.warn(\"train_period is deprecated. Use train_period_name instead.\")\n            kwargs[\"train_period_name\"] = kwargs[\"train_period\"]\n        if \"val_period\" in kwargs:\n            warnings.warn(\"val_period is deprecated. Use val_period_name instead.\")\n            kwargs[\"val_period_name\"] = kwargs[\"val_period\"]\n        if \"test_period\" in kwargs:\n            warnings.warn(\"test_period is deprecated. 
Use test_period_name instead.\")\n            kwargs[\"test_period_name\"] = kwargs[\"test_period\"]\n        return values\n\n    def __str__(self):\n        _process_tag = yaml.emitter.Emitter.process_tag\n        _ignore_aliases = yaml.Dumper.ignore_aliases\n        yaml.emitter.Emitter.process_tag = lambda self, *args, **kw: None\n        yaml.Dumper.ignore_aliases = lambda self, *args, **kw: True\n        s = yaml.dump(dataclasses.asdict(self), sort_keys=False)\n        yaml.emitter.Emitter.process_tag = _process_tag\n        yaml.Dumper.ignore_aliases = _ignore_aliases\n        return s\n</code></pre>"},{"location":"reference_dataset_config/#config.DatasetConfig-functions","title":"Functions","text":""},{"location":"reference_dataset_config/#config.DatasetConfig.get_flowstats_features_len","title":"get_flowstats_features_len","text":"<pre><code>get_flowstats_features_len() -&gt; int\n</code></pre> <p>Gets the number of flow statistics features.</p> Source code in <code>cesnet_datazoo\\config.py</code> <pre><code>def get_flowstats_features_len(self) -&gt; int:\n    \"\"\"Gets the number of flow statistics features.\"\"\"\n    return len(self.flowstats_features) + len(self.flowstats_features_boolean) + PHIST_BIN_COUNT * len(self.flowstats_features_phist)\n</code></pre>"},{"location":"reference_dataset_config/#config.DatasetConfig.get_flowstats_feature_names_expanded","title":"get_flowstats_feature_names_expanded","text":"<pre><code>get_flowstats_feature_names_expanded(\n    shorter_names: bool = False,\n) -&gt; list[str]\n</code></pre> <p>Gets names of flow statistics features. Packet histograms are expanded into bin features.</p> Source code in <code>cesnet_datazoo\\config.py</code> <pre><code>def get_flowstats_feature_names_expanded(self, shorter_names: bool = False) -&gt; list[str]:\n    \"\"\"Gets names of flow statistics features. 
Packet histograms are expanded into bin features.\"\"\"\n    phist_mapping = {\n        \"PHIST_SRC_SIZES\": [f\"PSIZE_BIN{i}\" for i in range(1, PHIST_BIN_COUNT + 1)],\n        \"PHIST_DST_SIZES\": [f\"PSIZE_BIN{i}_REV\" for i in range(1, PHIST_BIN_COUNT + 1)],\n        \"PHIST_SRC_IPT\": [f\"IPT_BIN{i}\" for i in range(1, PHIST_BIN_COUNT + 1)],\n        \"PHIST_DST_IPT\": [f\"IPT_BIN{i}_REV\" for i in range(1, PHIST_BIN_COUNT + 1)],\n    }\n    short_names_mapping = {\n        \"FLOW_ENDREASON_IDLE\": \"FEND_IDLE\",\n        \"FLOW_ENDREASON_ACTIVE\": \"FEND_ACTIVE\",\n        \"FLOW_ENDREASON_END\": \"FEND_END\",\n        \"FLOW_ENDREASON_OTHER\": \"FEND_OTHER\",\n        \"FLAG_CWR\": \"F_CWR\",\n        \"FLAG_CWR_REV\": \"F_CWR_REV\",\n        \"FLAG_ECE\": \"F_ECE\",\n        \"FLAG_ECE_REV\": \"F_ECE_REV\",\n        \"FLAG_PSH_REV\": \"F_PSH_REV\",\n        \"FLAG_RST\": \"F_RST\",\n        \"FLAG_RST_REV\": \"F_RST_REV\",\n        \"FLAG_FIN\": \"F_FIN\",\n        \"FLAG_FIN_REV\": \"F_FIN_REV\",\n    }\n    feature_names = self.flowstats_features[:]\n    for f in self.flowstats_features_boolean:\n        if shorter_names and f in short_names_mapping:\n            feature_names.append(short_names_mapping[f])\n        else:\n            feature_names.append(f)\n    for f in self.flowstats_features_phist:\n        feature_names.extend(phist_mapping[f])\n    assert len(feature_names) == self.get_flowstats_features_len()\n    return feature_names\n</code></pre>"},{"location":"reference_dataset_config/#config.DatasetConfig.get_ppi_feature_names","title":"get_ppi_feature_names","text":"<pre><code>get_ppi_feature_names() -&gt; list[str]\n</code></pre> <p>Gets the names of flattened PPI features.</p> Source code in <code>cesnet_datazoo\\config.py</code> <pre><code>def get_ppi_feature_names(self) -&gt; list[str]:\n    \"\"\"Gets the names of flattened PPI features.\"\"\"\n    ppi_feature_names = [f\"IPT_{i}\" for i in range(1, PPI_MAX_LEN + 1)] + \\\n               
            [f\"DIR_{i}\" for i in range(1, PPI_MAX_LEN + 1)] + \\\n                           [f\"SIZE_{i}\" for i in range(1, PPI_MAX_LEN + 1)]\n    if self.use_push_flags:\n        ppi_feature_names += [f\"PUSH_{i}\" for i in range(1, PPI_MAX_LEN + 1)]\n    return ppi_feature_names\n</code></pre>"},{"location":"reference_dataset_config/#config.DatasetConfig.get_ppi_channels","title":"get_ppi_channels","text":"<pre><code>get_ppi_channels() -&gt; list[int]\n</code></pre> <p>Gets the available features (channels) in PPI sequences.</p> Source code in <code>cesnet_datazoo\\config.py</code> <pre><code>def get_ppi_channels(self) -&gt; list[int]:\n    \"\"\"Gets the available features (channels) in PPI sequences.\"\"\"\n    if self.use_push_flags:\n        return TCP_PPI_CHANNELS\n    else:\n        return UDP_PPI_CHANNELS\n</code></pre>"},{"location":"reference_dataset_config/#config.DatasetConfig.get_feature_names","title":"get_feature_names","text":"<pre><code>get_feature_names(\n    flatten_ppi: bool = False, shorter_names: bool = False\n) -&gt; list[str]\n</code></pre> <p>Gets feature names.</p> <p>Parameters:</p> Name Type Description Default <code>flatten_ppi</code> <code>bool</code> <p>Whether to flatten PPI into individual feature names or keep one <code>PPI</code> column.</p> <code>False</code> Source code in <code>cesnet_datazoo\\config.py</code> <pre><code>def get_feature_names(self, flatten_ppi: bool = False, shorter_names: bool = False) -&gt; list[str]:\n    \"\"\"\n    Gets feature names.\n\n    Parameters:\n        flatten_ppi: Whether to flatten PPI into individual feature names or keep one `PPI` column.\n    \"\"\"\n    feature_names = self.get_ppi_feature_names() if flatten_ppi else [\"PPI\"]\n    feature_names += self.get_flowstats_feature_names_expanded(shorter_names=shorter_names)\n    return feature_names\n</code></pre>"},{"location":"reference_dataset_config/#enums-for-configuration","title":"Enums for configuration","text":"<p>The following 
enums are used for dataset configuration.</p>"},{"location":"reference_dataset_config/#config.ValidationApproach","title":"config.ValidationApproach","text":"<p>The validation approach defines which samples should be used for creating a validation set.</p> SPLIT_FROM_TRAIN <code>class-attribute</code> <code>instance-attribute</code> <pre><code>SPLIT_FROM_TRAIN = 'split-from-train'\n</code></pre> <p>Split train data into train and validation. Scikit-learn <code>train_test_split</code> is used to create a random stratified validation set. The fraction of validation samples is defined in <code>train_val_split_fraction</code>.</p> VALIDATION_DATES <code>class-attribute</code> <code>instance-attribute</code> <pre><code>VALIDATION_DATES = 'validation-dates'\n</code></pre> <p>Use separate validation dates to create a validation set. Validation dates need to be specified in <code>val_dates</code>, and the name of the validation period in <code>val_period_name</code>.</p>"},{"location":"reference_dataset_config/#config.AppSelection","title":"config.AppSelection","text":"<p>Applications can be divided into known and unknown classes. To use a dataset in the standard closed-world setting, use <code>ALL_KNOWN</code> to select all the applications as known. Use <code>TOPX_KNOWN</code> or <code>BACKGROUND_UNKNOWN</code> for the open-world setting and evaluation of out-of-distribution or open-set recognition methods. The <code>FIXED</code> is for manual selection of known and unknown applications.</p> ALL_KNOWN <code>class-attribute</code> <code>instance-attribute</code> <pre><code>ALL_KNOWN = 'all-known'\n</code></pre> <p>Use all applications as known.</p> TOPX_KNOWN <code>class-attribute</code> <code>instance-attribute</code> <pre><code>TOPX_KNOWN = 'topx-known'\n</code></pre> <p>Use the first X (<code>apps_selection_topx</code>) most frequent (with the most samples) applications as known, and the rest as unknown. 
Applications with the same provider are never separated, i.e., all applications of a given provider are either known or unknown.</p> BACKGROUND_UNKNOWN <code>class-attribute</code> <code>instance-attribute</code> <pre><code>BACKGROUND_UNKNOWN = 'background-unknown'\n</code></pre> <p>Use the list of background traffic classes (<code>apps_selection_background_unknown</code>) as unknown, and the rest as known.</p> FIXED <code>class-attribute</code> <code>instance-attribute</code> <pre><code>FIXED = 'fixed'\n</code></pre> <p>Manual application selection. Provide lists of known applications (<code>apps_selection_fixed_known</code>) and unknown applications (<code>apps_selection_fixed_unknown</code>).</p>"},{"location":"reference_dataset_config/#config.MinTrainSamplesCheck","title":"config.MinTrainSamplesCheck","text":"<p>Depending on the selected train dates, there might be applications with not enough samples for training (what is not enough will depend on the selected classification model). The threshold for the minimum number of samples can be set with <code>min_train_samples_per_app</code>, and its default value is 100. With the <code>DISABLE_APPS</code> approach, these applications will be disabled and not used for training or testing. With the <code>WARN_AND_EXIT</code> approach, the script will print a warning and exit if applications with not enough samples are encountered. To disable this check, set <code>min_train_samples_per_app</code> to 0.</p> WARN_AND_EXIT <code>class-attribute</code> <code>instance-attribute</code> <pre><code>WARN_AND_EXIT = 'warn-and-exit'\n</code></pre> <p>Warn and exit if there are not enough training samples for some applications. 
It is up to the user to manually add these applications to <code>disabled_apps</code>.</p> DISABLE_APPS <code>class-attribute</code> <code>instance-attribute</code> <pre><code>DISABLE_APPS = 'disable-apps'\n</code></pre> <p>Disable applications with not enough training samples.</p>"},{"location":"reference_dataset_config/#config.DataLoaderOrder","title":"config.DataLoaderOrder","text":"<p>Validation and test sets are always loaded in sequential order \u2014 sequential meaning in the order of dates and time. However, for the train set, it is sometimes required to iterate it in random order (for example, for training a neural network). Thus, use <code>RANDOM</code> if your classification model requires it; <code>SEQUENTIAL</code> otherwise. This setting affects only train_dataloader. Dataframe get_train_df is always created in sequential order.</p> RANDOM <code>class-attribute</code> <code>instance-attribute</code> <pre><code>RANDOM = 'random'\n</code></pre> <p>Iterate train data in random order.</p> SEQUENTIAL <code>class-attribute</code> <code>instance-attribute</code> <pre><code>SEQUENTIAL = 'sequential'\n</code></pre> <p>Iterate train data in sequential (datetime) order.</p>"},{"location":"reference_datasets/","title":"Dataset classes","text":"<p>These are subclasses of <code>CesnetDataset</code> representing individual datasets available in <code>cesnet-datazoo</code>.</p>"},{"location":"reference_datasets/#datasets.datasets.CESNET_TLS22","title":"datasets.datasets.CESNET_TLS22","text":"<p>             Bases: <code>CesnetDataset</code></p> <p>Dataset class for CESNET-TLS22.</p> Source code in <code>cesnet_datazoo\\datasets\\datasets.py</code> <pre><code>class CESNET_TLS22(CesnetDataset):\n    \"\"\"Dataset class for [CESNET-TLS22][cesnet-tls22].\"\"\"\n    name = \"CESNET-TLS22\"\n    database_filename = \"CESNET-TLS22.h5\"\n    bucket_url = \"https://liberouter.org/datazoo/download?bucket=cesnet-tls22\"\n    available_dates = _CESNET_TLS22_AVAILABLE_DATES\n    
time_periods = {\n        \"W-2021-40\": [\"20211004\", \"20211005\", \"20211006\", \"20211007\", \"20211008\", \"20211009\", \"20211010\"],\n        \"W-2021-41\": [\"20211011\", \"20211012\", \"20211013\", \"20211014\", \"20211015\", \"20211016\", \"20211017\"],\n    }\n    default_train_period_name = \"W-2021-40\"\n    default_test_period_name = \"W-2021-41\"\n    _tables_app_enum = _CESNET_TLS22_TABLES_APP_ENUM\n    _tables_cat_enum = _CESNET_TLS22_TABLES_CATEGORY_ENUM\n</code></pre>"},{"location":"reference_datasets/#datasets.datasets.CESNET_QUIC22","title":"datasets.datasets.CESNET_QUIC22","text":"<p>             Bases: <code>CesnetDataset</code></p> <p>Dataset class for CESNET-QUIC22.</p> Source code in <code>cesnet_datazoo\\datasets\\datasets.py</code> <pre><code>class CESNET_QUIC22(CesnetDataset):\n    \"\"\"Dataset class for [CESNET-QUIC22][cesnet-quic22].\"\"\"\n    name = \"CESNET-QUIC22\"\n    database_filename = \"CESNET-QUIC22.h5\"\n    bucket_url = \"https://liberouter.org/datazoo/download?bucket=cesnet-quic22\"\n    available_dates = _CESNET_QUIC22_AVAILABLE_DATES\n    time_periods = {\n        \"W-2022-44\": [\"20221031\", \"20221101\", \"20221102\", \"20221103\", \"20221104\", \"20221105\", \"20221106\"],\n        \"W-2022-45\": [\"20221107\", \"20221108\", \"20221109\", \"20221110\", \"20221111\", \"20221112\", \"20221113\"],\n        \"W-2022-46\": [\"20221114\", \"20221115\", \"20221116\", \"20221117\", \"20221118\", \"20221119\", \"20221120\"],\n        \"W-2022-47\": [\"20221121\", \"20221122\", \"20221123\", \"20221124\", \"20221125\", \"20221126\", \"20221127\"],\n        \"W45-47\": [\"20221107\", \"20221108\", \"20221109\", \"20221110\", \"20221111\", \"20221112\", \"20221113\",\n                   \"20221114\", \"20221115\", \"20221116\", \"20221117\", \"20221118\", \"20221119\", \"20221120\",\n                   \"20221121\", \"20221122\", \"20221123\", \"20221124\", \"20221125\", \"20221126\", \"20221127\"],\n    }\n    
default_train_period_name = \"W-2022-44\"\n    default_test_period_name = \"W-2022-45\"\n    _tables_app_enum = _CESNET_QUIC22_TABLES_APP_ENUM\n    _tables_cat_enum = _CESNET_QUIC22_TABLES_CATEGORY_ENUM\n</code></pre>"},{"location":"reference_datasets/#datasets.datasets.CESNET_TLS_Year22","title":"datasets.datasets.CESNET_TLS_Year22","text":"<p>             Bases: <code>CesnetDataset</code></p> <p>Dataset class for CESNET-TLS-Year22.</p> Source code in <code>cesnet_datazoo\\datasets\\datasets.py</code> <pre><code>class CESNET_TLS_Year22(CesnetDataset):\n    \"\"\"Dataset class for [CESNET-TLS-Year22][cesnet-tls-year22].\"\"\"\n    name = \"CESNET-TLS-Year22\"\n    database_filename = \"CESNET-TLS-Year22.h5\"\n    bucket_url = \"https://liberouter.org/datazoo/download?bucket=cesnet-tls-year22\"\n    available_dates = _CESNET_TLS_YEAR22_AVAILABLE_DATES\n    time_periods = _CESNET_TLS_YEAR22_TIME_PERIODS\n    default_train_period_name = \"M-2022-9\"\n    default_test_period_name = \"M-2022-10\"\n    _tables_app_enum = _CESNET_TLS_YEAR22_TABLES_APP_ENUM\n    _tables_cat_enum = _CESNET_TLS_YEAR22_TABLES_CATEGORY_ENUM\n</code></pre>"},{"location":"transforms/","title":"Transforms","text":"<p>The <code>cesnet_datazoo</code> package supports configurable transforms of input data in a similar fashion to what torchvision is doing for the computer vision field. Input features are split into three groups, each having its own transformation. Those groups are PPI sequences, flow statistics, and packet histograms.</p> <ul> <li>Transformation configured in <code>ppi_transform</code> of <code>DatasetConfig</code> is applied to PPI sequences.</li> <li><code>flowstats_transform</code> is applied to flow statistics (excluding boolean features, such as flow end reasons or TCP flags).</li> <li><code>flowstats_phist_transform</code> is applied to packet histograms.</li> </ul> <p>Transforms are implemented in a separate package CESNET Models. 
See <code>cesnet_models.transforms</code> documentation for details.</p> <p>Limitations</p> <p>The current implementation does not support the composing of transformations.</p>"},{"location":"transforms/#available-transformations","title":"Available transformations","text":"<p>PPI sequences</p> <ul> <li>ClipAndScalePPI</li> </ul> <p>Flow statistics</p> <ul> <li>ClipAndScaleFlowstats</li> </ul> <p>Packet histograms</p> <ul> <li>NormalizeHistograms</li> </ul> <p>More transformations will be implemented in future versions.</p>"},{"location":"transforms/#data-scaling","title":"Data scaling","text":"<p>Transformations implementing data scaling will be fitted, if needed, on a subset of training data during dataset initialization.</p>"}]}
\ No newline at end of file
diff --git a/sitemap.xml.gz b/sitemap.xml.gz
index b0f62ab..98b3e81 100755
Binary files a/sitemap.xml.gz and b/sitemap.xml.gz differ