diff --git a/dataset_metadata/index.html b/dataset_metadata/index.html index 1871754..a949c80 100755 --- a/dataset_metadata/index.html +++ b/dataset_metadata/index.html @@ -670,7 +670,7 @@
This is the documentation of the CESNET DataZoo project.
The goal of this project is to provide tools for working with large network traffic datasets and to facilitate research in the traffic classification area. The core functions of the cesnet-datazoo
package are:
cesnet_models.transforms
documentation for details.S
size containing 25 million samples. Apart from loading data into dataframes, the cesnet-datazoo
package provides dataloaders for processing data in smaller batches.
An example of how dataloaders can be used is in cesnet_datazoo.datasets.loaders
or in the following snippet:
import numpy as np\nimport pandas as pd\nfrom torch.utils.data import DataLoader\n\ndef load_from_dataloader(dataloader: DataLoader):\n    # Collect the four parts of every batch separately\n    other_fields = []\n    data_ppi = []\n    data_flowstats = []\n    labels = []\n    for batch_other_fields, batch_ppi, batch_flowstats, batch_labels in dataloader:\n        other_fields.append(batch_other_fields)\n        data_ppi.append(batch_ppi)\n        data_flowstats.append(batch_flowstats)\n        labels.append(batch_labels)\n    # Join the per-batch parts into one DataFrame and three arrays\n    df_other_fields = pd.concat(other_fields, ignore_index=True)\n    data_ppi = np.concatenate(data_ppi)\n    data_flowstats = np.concatenate(data_flowstats)\n    labels = np.concatenate(labels)\n    return df_other_fields, data_ppi, data_flowstats, labels\n
When a dataloader is iterated, each batch is returned as a tuple (batch_other_fields, batch_ppi, batch_flowstats, batch_labels). The batch size B is configured with the batch_size and test_batch_size config options. The shapes are:
- batch_other_fields: pd.DataFrame (B, C) - a Pandas DataFrame with auxiliary fields, such as communicating hosts, flow times, and more fields extracted from the ClientHello message. If the return_other_fields config option is false, this will be an empty DataFrame. Columns C depend on the used dataset and are available at dataset_config.other_fields.
- batch_ppi: np.ndarray (B, [3, 4], 30) - the middle dimension is either 4 when TCP push flags are used (use_push_flags) or 3 otherwise.
- batch_flowstats: np.ndarray (B, F) - where F is the number of flowstats features computed with DatasetConfig.get_flowstats_features_len. To get the order and names of flowstats features, call DatasetConfig.get_flowstats_feature_names_expanded. The batch_flowstats array includes flow statistics, TCP features (if available and configured), and bins of packet histograms (if available and configured). See the data features page for more information about features.
- batch_labels: np.ndarray (B) - integer labels encoded with a LabelEncoder instance available at dataset.class_info.encoder.

PPI and flow statistics features returned from dataloaders are transformed depending on the selected configuration. See the transforms page for more information.
"},{"location":"dataset_metadata/","title":"DatasetMetadata","text":"Each dataset class has its metadata available as a DatasetMetadata
instance in the metadata
attribute.
CESNET-TLS22
This dataset was published in \"Fine-grained TLS services classification with reject option\" (DOI, arXiv). It was built from live traffic collected using high-speed monitoring probes at the perimeter of the CESNET2 network.
For detailed information about the dataset, see the linked paper and the dataset metadata page.
"},{"location":"datasets_overview/#cesnet-quic22","title":"CESNET-QUIC22","text":"CESNET-QUIC22
This dataset was published in \"CESNET-QUIC22: A large one-month QUIC network traffic dataset from backbone lines\" (DOI). The QUIC protocol has the potential to replace TLS over TCP as the standard protocol for reliable and secure Internet communication. Due to its design that makes the inspection of connection handshakes challenging and its usage in HTTP/3, there is an increasing demand for QUIC traffic classification methods.
For detailed information about the dataset, see the linked paper and the dataset metadata page. Experiments based on this dataset were published in \"Encrypted traffic classification: the QUIC case\" (DOI).
"},{"location":"datasets_overview/#cesnet-tls-year22","title":"CESNET-TLS-Year22","text":"CESNET-TLS-Year22
This dataset is similar to CESNET-TLS22; however, it spans the entire year 2022. It will be published in the near future.
"},{"location":"features/","title":"Features","text":"This page provides a description of individual data features in the datasets. Features available in each dataset are listed on the dataset metadata page.
"},{"location":"features/#ppi-sequence","title":"PPI sequence","text":"A per-packet information (PPI) sequence is a 2D matrix describing the first 30 packets of a flow. For flows shorter than 30 packets, the PPI sequence is padded with zeros. Set use_push_flags
to include PUSH flags in PPI sequences, if available in the dataset.
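The zero-padding described above can be sketched as follows. This is a minimal illustration in plain Python, not the package's implementation; the `pad_ppi` helper and the row order (inter-packet times, directions, sizes) are assumptions made for the example.

```python
PPI_MAX_LEN = 30  # PPI sequences describe the first 30 packets of a flow


def pad_ppi(ppi: list[list[int]], max_len: int = PPI_MAX_LEN) -> list[list[int]]:
    """Truncate or zero-pad every row of a PPI matrix to max_len packets.

    Hypothetical helper for illustration; each row holds one per-packet
    quantity (assumed here: inter-packet times, directions, sizes).
    """
    return [row[:max_len] + [0] * max(0, max_len - len(row)) for row in ppi]


# A flow with only four packets: three rows, four columns.
short_flow = [
    [0, 12, 3, 40],        # inter-packet times
    [1, -1, 1, -1],        # directions
    [517, 1400, 80, 250],  # packet sizes
]
padded = pad_ppi(short_flow)
print(len(padded), len(padded[0]))  # 3 30
```

With push flags enabled, the same idea would apply to a fourth row.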
Flow statistics are standard features describing the entire flow (with the exception of PPI_ features, which relate to the PPI sequence of the given flow). _REV features correspond to the reverse (server to client) direction.
| Name | Description |
| --- | --- |
| DURATION | Duration of the flow in seconds |
| BYTES | Number of transmitted bytes from client to server |
| BYTES_REV | Number of transmitted bytes from server to client |
| PACKETS | Number of packets transmitted from client to server |
| PACKETS_REV | Number of packets transmitted from server to client |
| PPI_LEN | Number of packets in the PPI sequence |
| PPI_DURATION | Duration of the PPI sequence in seconds |
| PPI_ROUNDTRIPS | Number of roundtrips in the PPI sequence |
| FLOW_ENDREASON_IDLE | Flow was terminated because it was idle |
| FLOW_ENDREASON_ACTIVE | Flow was terminated because it reached the active timeout |
| FLOW_ENDREASON_OTHER | Flow was terminated for other reasons |
"},{"location":"features/#packet-histograms","title":"Packet histograms","text":"Packet histograms include binned counts of packet sizes and inter-packet times of the entire flow. There are 8 bins with a logarithmic scale; the intervals are 0\u201315, 16\u201331, 32\u201363, 64\u2013127, 128\u2013255, 256\u2013511, 512\u20131024, >1024 [ms or B]. The units are milliseconds for inter-packet times and bytes for packet sizes. The histograms are built from all packets of the entire flow, unlike PPI sequences that describe the first 30 packets. Set use_packet_histograms
to use packet histogram features, if available in the dataset.
On the dataset metadata page, packet histogram features are called PHIST_SRC_SIZES
, PHIST_DST_SIZES
, PHIST_SRC_IPT
, PHIST_DST_IPT
. Those are the names of database columns that are flattened to the _BIN{x} features.
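To make the bin intervals concrete, here is a minimal sketch of how a value could be assigned to one of the 8 bins. It is not the ipfixprobe or cesnet-datazoo implementation; the function names, the 0-based bin index, and the handling of fractional values are assumptions for illustration.

```python
from bisect import bisect_left

# Inclusive upper edges of the first seven bins; larger values go to the last bin.
BIN_UPPER_EDGES = [15, 31, 63, 127, 255, 511, 1024]


def histogram_bin(value: float) -> int:
    """Return the 0-based bin index (0..7) for a packet size in bytes
    or an inter-packet time in milliseconds."""
    return bisect_left(BIN_UPPER_EDGES, value)


def build_histogram(values: list[float]) -> list[int]:
    """Count how many values fall into each of the 8 bins."""
    counts = [0] * (len(BIN_UPPER_EDGES) + 1)
    for value in values:
        counts[histogram_bin(value)] += 1
    return counts


sizes = [0, 15, 16, 100, 512, 1024, 1025, 1500]
print(build_histogram(sizes))  # [2, 1, 0, 1, 0, 0, 2, 2]
```

Note that the seventh bin covers 512 to 1024 inclusive, so only values above 1024 land in the last bin.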
Datasets with TLS over TCP traffic contain features indicating the presence of individual TCP flags in the flow. Set use_tcp_features
to use the subset of flags defined in cesnet_datazoo.constants.SELECTED_TCP_FLAGS
.
Datasets contain auxiliary information about samples, such as communicating hosts, flow times, and more fields extracted from the ClientHello message. The dataset metadata page lists available fields in individual datasets. Set return_other_fields
to include those fields in returned dataframes. See using dataloaders for how other fields are handled in dataloaders.
Due to differences in implementation between packet sequences (pstats.cpp) and packet histogram (phist.cpp) plugins of the ipfixprobe exporter, the number of packets in PPI and histograms can differ (even for flows shorter than 30 packets). The differences are summarized in the following table. Note that this is related to TLS over TCP datasets.
| TLS over TCP datasets | Packet histograms | PPI sequence | PACKETS and PACKETS_REV |
| --- | --- | --- | --- |
| Zero-length packets (without L4 payload, e.g. ACKs) | Not included | Not included | Included |
| Retransmissions (and out-of-order packets) | Included | Not included* | Included |
| Computed from | Entire flow | First 30 packets | Entire flow |

*The implementation for the detection of TCP retransmissions and out-of-order packets is far from perfect. Packets with a non-increasing SEQ number are skipped.
For QUIC, there is no detection of retransmissions or out-of-order packets, and QUIC acknowledgment packets are included in both packet sequences and packet histograms.
"},{"location":"getting_started/","title":"Getting started","text":""},{"location":"getting_started/#jupyter-notebooks","title":"Jupyter notebooks","text":"Example Jupyter notebooks are provided at https://github.com/CESNET/cesnet-tcexamples. Start with:
from cesnet_datazoo.datasets import CESNET_QUIC22\ndataset = CESNET_QUIC22(\"/datasets/CESNET-QUIC22/\", size=\"XS\")\ndataset.compute_dataset_statistics(num_samples=100_000, num_workers=0)\n
This will download the dataset, compute dataset statistics, and save them into /datasets/CESNET-QUIC22/statistics
."},{"location":"getting_started/#enable-logging-and-set-the-spawn-method-on-windows","title":"Enable logging and set the spawn method on Windows","text":"import logging\nimport multiprocessing as mp\n\nmp.set_start_method(\"spawn\") \nlogging.basicConfig(\n level=logging.INFO,\n format=\"[%(asctime)s][%(name)s][%(levelname)s] - %(message)s\")\n
For running on Windows, we recommend using the spawn
method for creating dataloader worker processes. Set up logging to get more information from the package."},{"location":"getting_started/#initialize-dataset-to-create-train-validation-and-test-dataframes","title":"Initialize dataset to create train, validation, and test dataframes","text":"from cesnet_datazoo.datasets import CESNET_QUIC22\nfrom cesnet_datazoo.config import DatasetConfig, AppSelection\n\ndataset = CESNET_QUIC22(\"/datasets/CESNET-QUIC22/\", size=\"XS\")\ndataset_config = DatasetConfig(\n dataset=dataset,\n apps_selection=AppSelection.ALL_KNOWN,\n train_period_name=\"W-2022-44\",\n test_period_name=\"W-2022-45\",\n)\ndataset.set_dataset_config_and_initialize(dataset_config)\ntrain_dataframe = dataset.get_train_df()\nval_dataframe = dataset.get_val_df()\ntest_dataframe = dataset.get_test_df()\n
The DatasetConfig
class handles the configuration of datasets, and calling set_dataset_config_and_initialize
initializes train, validation, and test sets with the desired configuration. Data can be read into Pandas DataFrames as shown here or via PyTorch DataLoaders. See CesnetDataset
reference.
Install the package from pip with:
pip install cesnet-datazoo\n
or, for an editable install, with:
pip install -e git+https://github.com/CESNET/cesnet-datazoo\n
"},{"location":"installation/#requirements","title":"Requirements","text":"The cesnet-datazoo
package requires Python >=3.10.
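The main class for accessing CESNET datasets. It handles downloading, train/validation/test splitting, and class selection. Access to data is provided through:
- Iterable PyTorch DataLoader for batch processing. See using dataloaders for more details.
- Pandas DataFrame for loading the entire train, validation, or test set at once.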
The dataset is stored in a PyTables database. The internal PyTablesDataset
class is used as a wrapper that implements the PyTorch Dataset
interface and is compatible with DataLoader
, which provides efficient parallel loading of the data. The dataset configuration is done through the DatasetConfig
class.
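Intended usage:
1. Create an instance of the dataset class with the desired size and data root. This will download the dataset if it has not already been downloaded.
2. Create an instance of DatasetConfig and set it with set_dataset_config_and_initialize. This will initialize the dataset: select classes, split data into train/validation/test sets, and fit data scalers if needed. Everything is done according to the provided configuration and is cached for later use.
3. Use get_train_dataloader or get_train_df to get training data for a classification model.
4. Validate the model and perform hyperparameter optimization on get_val_dataloader or get_val_df.
5. Evaluate the model on get_test_dataloader or get_test_df.

Parameters: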
| Name | Type | Description | Default |
| --- | --- | --- | --- |
| data_root | str | Path to the folder where the dataset will be stored. Each dataset size has its own subfolder data_root/size. | required |
| size | str | Size of the dataset. Options are XS, S, M, L, ORIG. | 'S' |
| silent | bool | Whether to suppress print and tqdm output. | False |
Attributes:
| Name | Type | Description |
| --- | --- | --- |
| name | str | Name of the dataset. |
| database_filename | str | Name of the database file. |
| database_path | str | Path to the database file. |
| servicemap_path | str | Path to the servicemap file. |
| statistics_path | str | Path to the dataset statistics folder. |
| bucket_url | str | URL of the bucket where the database is stored. |
| metadata | DatasetMetadata | Additional dataset metadata. |
| available_classes | list[str] | List of all available classes in the dataset. |
| available_dates | list[str] | List of all available dates in the dataset. |
| time_periods | dict[str, list[str]] | Predefined time periods. Each time period is a list of dates. |
| default_train_period_name | str | Default time period for training. |
| default_test_period_name | str | Default time period for testing. |
The following attributes are initialized when set_dataset_config_and_initialize
is called.
Attributes:
| Name | Type | Description |
| --- | --- | --- |
| dataset_config | Optional[DatasetConfig] | Configuration of the dataset. |
| class_info | Optional[ClassInfo] | Structured information about the classes. |
| dataset_indices | Optional[IndicesTuple] | Named tuple containing train_indices, val_known_indices, val_unknown_indices, test_known_indices, test_unknown_indices. These are the indices into the PyTables database that define the train, validation, and test sets. |
| train_dataset | Optional[PyTablesDataset] | Train set in the form of a PyTablesDataset instance wrapping the PyTables database. |
| val_dataset | Optional[PyTablesDataset] | Validation set in the form of a PyTablesDataset instance wrapping the PyTables database. |
| test_dataset | Optional[PyTablesDataset] | Test set in the form of a PyTablesDataset instance wrapping the PyTables database. |
| known_app_counts | Optional[DataFrame] | Known application counts in the train, validation, and test sets. |
| unknown_app_counts | Optional[DataFrame] | Unknown application counts in the validation and test sets. |
| train_dataloader | Optional[DataLoader] | Iterable PyTorch DataLoader for training. |
| train_dataloader_sampler | Optional[Sampler] | Sampler used for iterating the training dataloader. Either RandomSampler or SequentialSampler. |
| train_dataloader_drop_last | bool | Whether to drop the last incomplete batch when iterating the training dataloader. |
| val_dataloader | Optional[DataLoader] | Iterable PyTorch DataLoader for validation. |
| test_dataloader | Optional[DataLoader] | Iterable PyTorch DataLoader for testing. |
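The effect of train_dataloader_drop_last on the number of batches per epoch can be sketched with a few lines of arithmetic. The `num_batches` helper below is hypothetical and only mirrors the usual batching rule (drop the trailing incomplete batch or keep it); it is not part of cesnet-datazoo.

```python
import math


def num_batches(num_samples: int, batch_size: int, drop_last: bool) -> int:
    """Number of batches produced in one pass over num_samples samples."""
    if drop_last:
        # The trailing incomplete batch is discarded.
        return num_samples // batch_size
    # The trailing incomplete batch is kept as a smaller batch.
    return math.ceil(num_samples / batch_size)


# 1000 samples with a batch size of 192: five full batches plus 40 leftover samples.
print(num_batches(1000, 192, drop_last=True))   # 5
print(num_batches(1000, 192, drop_last=False))  # 6
```

When the sample count is an exact multiple of the batch size, both settings give the same number of batches.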
cesnet_datazoo\\datasets\\cesnet_dataset.py
class CesnetDataset():\n \"\"\"\n The main class for accessing CESNET datasets. It handles downloading, train/validation/test splitting, and class selection. Access to data is provided through:\n\n - Iterable PyTorch DataLoader for batch processing. See [using dataloaders][using-dataloaders] for more details.\n - Pandas DataFrame for loading the entire train, validation, or test set at once.\n\n The dataset is stored in a [PyTables](https://www.pytables.org/) database. The internal `PyTablesDataset` class is used as a wrapper\n that implements the PyTorch [`Dataset`](https://pytorch.org/docs/stable/data.html#torch.utils.data.Dataset) interface\n and is compatible with [`DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader),\n which provides efficient parallel loading of the data. The dataset configuration is done through the [`DatasetConfig`][config.DatasetConfig] class.\n\n **Intended usage:**\n\n 1. Create an instance of the [dataset class][dataset-classes] with the desired size and data root. This will download the dataset if it has not already been downloaded.\n 2. Create an instance of [`DatasetConfig`][config.DatasetConfig] and set it with [`set_dataset_config_and_initialize`][datasets.cesnet_dataset.CesnetDataset.set_dataset_config_and_initialize].\n This will initialize the dataset \u2014 select classes, split data into train/validation/test sets, and fit data scalers if needed. All is done according to the provided configuration and is cached for later use.\n 3. Use [`get_train_dataloader`][datasets.cesnet_dataset.CesnetDataset.get_train_dataloader] or [`get_train_df`][datasets.cesnet_dataset.CesnetDataset.get_train_df] to get training data for a classification model.\n 4. Validate the model and perform the hyperparameter optimization on [`get_val_dataloader`][datasets.cesnet_dataset.CesnetDataset.get_val_dataloader] or [`get_val_df`][datasets.cesnet_dataset.CesnetDataset.get_val_df].\n 5. 
Evaluate the model on [`get_test_dataloader`][datasets.cesnet_dataset.CesnetDataset.get_test_dataloader] or [`get_test_df`][datasets.cesnet_dataset.CesnetDataset.get_test_df].\n\n Parameters:\n data_root: Path to the folder where the dataset will be stored. Each dataset size has its own subfolder `data_root/size`\n size: Size of the dataset. Options are `XS`, `S`, `M`, `L`, `ORIG`.\n silent: Whether to suppress print and tqdm output.\n\n Attributes:\n name: Name of the dataset.\n database_filename: Name of the database file.\n database_path: Path to the database file.\n servicemap_path: Path to the servicemap file.\n statistics_path: Path to the dataset statistics folder.\n bucket_url: URL of the bucket where the database is stored.\n metadata: Additional [dataset metadata][metadata].\n available_classes: List of all available classes in the dataset.\n available_dates: List of all available dates in the dataset.\n time_periods: Predefined time periods. Each time period is a list of dates.\n default_train_period_name: Default time period for training.\n default_test_period_name: Default time period for testing.\n\n The following attributes are initialized when [`set_dataset_config_and_initialize`][datasets.cesnet_dataset.CesnetDataset.set_dataset_config_and_initialize] is called.\n\n Attributes:\n dataset_config: Configuration of the dataset.\n class_info: Structured information about the classes.\n dataset_indices: Named tuple containing `train_indices`, `val_known_indices`, `val_unknown_indices`, `test_known_indices`, `test_unknown_indices`. 
These are the indices into PyTables database that define train, validation, and test sets.\n train_dataset: Train set in the form of `PyTablesDataset` instance wrapping the PyTables database.\n val_dataset: Validation set in the form of `PyTablesDataset` instance wrapping the PyTables database.\n test_dataset: Test set in the form of `PyTablesDataset` instance wrapping the PyTables database.\n known_app_counts: Known application counts in the train, validation, and test sets.\n unknown_app_counts: Unknown application counts in the validation and test sets.\n train_dataloader: Iterable PyTorch [`DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader) for training.\n train_dataloader_sampler: Sampler used for iterating the training dataloader. Either [`RandomSampler`](https://pytorch.org/docs/stable/data.html#torch.utils.data.RandomSampler) or [`SequentialSampler`](https://pytorch.org/docs/stable/data.html#torch.utils.data.SequentialSampler).\n train_dataloader_drop_last: Whether to drop the last incomplete batch when iterating the training dataloader.\n val_dataloader: Iterable PyTorch [`DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader) for validation.\n test_dataloader: Iterable PyTorch [`DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader) for testing.\n \"\"\"\n data_root: str\n size: str\n silent: bool = False\n\n name: str\n database_filename: str\n database_path: str\n servicemap_path: str\n statistics_path: str\n bucket_url: str\n metadata: DatasetMetadata\n available_classes: list[str]\n available_dates: list[str]\n time_periods: dict[str, list[str]]\n default_train_period_name: str\n default_test_period_name: str\n\n dataset_config: Optional[DatasetConfig] = None\n class_info: Optional[ClassInfo] = None\n dataset_indices: Optional[IndicesTuple] = None\n train_dataset: Optional[PyTablesDataset] = None\n val_dataset: Optional[PyTablesDataset] = None\n test_dataset: 
Optional[PyTablesDataset] = None\n known_app_counts: Optional[pd.DataFrame] = None\n unknown_app_counts: Optional[pd.DataFrame] = None\n train_dataloader: Optional[DataLoader] = None\n train_dataloader_sampler: Optional[Sampler] = None\n train_dataloader_drop_last: bool = True\n val_dataloader: Optional[DataLoader] = None\n test_dataloader: Optional[DataLoader] = None\n\n _collate_fn: Optional[Callable] = None\n _tables_app_enum: dict[int, str]\n _tables_cat_enum: dict[int, str]\n\n def __init__(self, data_root: str, size: str = \"S\", database_checks_at_init: bool = False, silent: bool = False) -> None:\n self.silent = silent\n self.metadata = load_metadata(self.name)\n self.size = size\n if self.size != \"ORIG\":\n if size not in self.metadata.available_dataset_sizes:\n raise ValueError(f\"Unknown dataset size {self.size}\")\n self.name = f\"{self.name}-{self.size}\"\n filename, ext = os.path.splitext(self.database_filename)\n self.database_filename = f\"{filename}-{self.size}{ext}\"\n self.data_root = os.path.normpath(os.path.expanduser(os.path.join(data_root, self.size)))\n self.database_path = os.path.join(self.data_root, self.database_filename)\n self.servicemap_path = os.path.join(self.data_root, SERVICEMAP_FILE)\n self.statistics_path = os.path.join(self.data_root, \"statistics\")\n if not os.path.exists(self.data_root):\n os.makedirs(self.data_root)\n if not self._is_downloaded():\n self._download()\n if database_checks_at_init:\n with tb.open_file(self.database_path, mode=\"r\") as database:\n tables_paths = list(map(lambda x: x._v_pathname, iter(database.get_node(f\"/flows\"))))\n num_samples = 0\n for p in tables_paths:\n table = database.get_node(p)\n assert isinstance(table, tb.Table)\n if self._tables_app_enum != {v: k for k, v in dict(table.get_enum(APP_COLUMN)).items()}:\n raise ValueError(f\"Found mismatch between _tables_app_enum and the PyTables database enum in table {p}. 
Please report this issue.\")\n if self._tables_cat_enum != {v: k for k, v in dict(table.get_enum(CATEGORY_COLUMN)).items()}:\n raise ValueError(f\"Found mismatch between _tables_cat_enum and the PyTables database enum in table {p}. Please report this issue.\")\n num_samples += len(table)\n if self.size == \"ORIG\" and num_samples != self.metadata.available_samples:\n raise ValueError(f\"Expected {self.metadata.available_samples} samples, but got {num_samples} in the database. Please delete the data root folder, update cesnet-datazoo, and redownload the dataset.\")\n if self.size != \"ORIG\" and num_samples != DATASET_SIZES[self.size]:\n raise ValueError(f\"Expected {DATASET_SIZES[self.size]} samples, but got {num_samples} in the database. Please delete the data root folder, update cesnet-datazoo, and redownload the dataset.\")\n if self.available_dates != list(map(lambda x: x.removeprefix(\"/flows/D\"), tables_paths)):\n raise ValueError(f\"Found mismatch between available_dates and the dates available in the PyTables database. Please report this issue.\")\n # Add all available dates as single date time periods\n for d in self.available_dates:\n self.time_periods[d] = [d]\n available_applications = sorted([app for app in pd.read_csv(self.servicemap_path, index_col=\"Tag\").index if not is_background_app(app)])\n if len(available_applications) != self.metadata.application_count:\n raise ValueError(f\"Found {len(available_applications)} applications in the servicemap (omitting background traffic classes), but expected {self.metadata.application_count}. Please report this issue.\")\n self.available_classes = available_applications + self.metadata.background_traffic_classes\n\n def set_dataset_config_and_initialize(self, dataset_config: DatasetConfig, disable_indices_cache: bool = False) -> None:\n \"\"\"\n Initialize train, validation, and test sets. 
Data cannot be accessed before calling this method.\n\n Parameters:\n dataset_config: Desired configuration of the dataset.\n disable_indices_cache: Whether to disable caching of the dataset indices. This is useful when the dataset is used in many different configurations and you want to save disk space.\n \"\"\"\n self.dataset_config = dataset_config\n self._clear()\n self._initialize_train_val_test(disable_indices_cache=disable_indices_cache)\n\n def get_train_dataloader(self) -> DataLoader:\n \"\"\"\n Provides a PyTorch [`DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader) for training. The dataloader is created on the first call and then cached.\n When the dataloader is iterated in random order, the last incomplete batch is dropped.\n The dataloader is configured with the following config attributes:\n\n | Dataset config | Description |\n | ---------------------------- | ------------------------------------------------------------------------------------------ |\n | `batch_size` | Number of samples per batch. |\n | `train_workers` | Number of workers for loading train data. |\n | `train_dataloader_order` | Whether to load train data in sequential or random order. See [config.DataLoaderOrder][]. |\n | `train_dataloader_seed` | Seed for loading train data in random order. |\n\n Returns:\n Train data as an iterable dataloader. 
See [using dataloaders][using-dataloaders] for more details.\n \"\"\"\n if self.dataset_config is None:\n raise ValueError(\"Dataset is not initialized, use set_dataset_config_and_initialize() before getting train dataloader\")\n if not self.dataset_config.need_train_set:\n raise ValueError(\"Train dataloader is not available when need_train_set is false\")\n assert self.train_dataset\n if self.train_dataloader:\n return self.train_dataloader\n # Create sampler according to the selected order\n if self.dataset_config.train_dataloader_order == DataLoaderOrder.RANDOM:\n if self.dataset_config.train_dataloader_seed is not None:\n generator = torch.Generator()\n generator.manual_seed(self.dataset_config.train_dataloader_seed)\n else:\n generator = None\n self.train_dataloader_sampler = RandomSampler(self.train_dataset, generator=generator)\n self.train_dataloader_drop_last = True\n elif self.dataset_config.train_dataloader_order == DataLoaderOrder.SEQUENTIAL:\n self.train_dataloader_sampler = SequentialSampler(self.train_dataset)\n self.train_dataloader_drop_last = False\n else: assert_never(self.dataset_config.train_dataloader_order)\n # Create dataloader\n batch_sampler = BatchSampler(sampler=self.train_dataloader_sampler, batch_size=self.dataset_config.batch_size, drop_last=self.train_dataloader_drop_last)\n train_dataloader = DataLoader(\n self.train_dataset,\n num_workers=self.dataset_config.train_workers,\n worker_init_fn=worker_init_fn,\n collate_fn=self._collate_fn,\n persistent_workers=self.dataset_config.train_workers > 0,\n batch_size=None,\n sampler=batch_sampler,)\n if self.dataset_config.train_workers == 0:\n self.train_dataset.pytables_worker_init()\n self.train_dataloader = train_dataloader\n return train_dataloader\n\n def get_val_dataloader(self) -> DataLoader:\n \"\"\"\n Provides a PyTorch [`DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader) for validation.\n The dataloader is created on the first call and then 
cached.\n The dataloader is configured with the following config attributes:\n\n | Dataset config | Description |\n | ------------------| ------------------------------------------------------------------|\n | `test_batch_size` | Number of samples per batch for loading validation and test data. |\n | `val_workers` | Number of workers for loading validation data. |\n\n Returns:\n Validation data as an iterable dataloader. See [using dataloaders][using-dataloaders] for more details.\n \"\"\"\n if self.dataset_config is None:\n raise ValueError(\"Dataset is not initialized, use set_dataset_config_and_initialize() before getting validation dataloader\")\n if not self.dataset_config.need_val_set:\n raise ValueError(\"Validation dataloader is not available when need_val_set is false\")\n assert self.val_dataset is not None\n if self.val_dataloader:\n return self.val_dataloader\n batch_sampler = BatchSampler(sampler=SequentialSampler(self.val_dataset), batch_size=self.dataset_config.test_batch_size, drop_last=False)\n val_dataloader = DataLoader(\n self.val_dataset,\n num_workers=self.dataset_config.val_workers,\n worker_init_fn=worker_init_fn,\n collate_fn=self._collate_fn,\n persistent_workers=self.dataset_config.val_workers > 0,\n batch_size=None,\n sampler=batch_sampler,)\n if self.dataset_config.val_workers == 0:\n self.val_dataset.pytables_worker_init()\n self.val_dataloader = val_dataloader\n return val_dataloader\n\n def get_test_dataloader(self) -> DataLoader:\n \"\"\"\n Provides a PyTorch [`DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader) for testing.\n The dataloader is created on the first call and then cached.\n\n When the dataset is used in the open-world setting, and unknown classes are defined,\n the test dataloader returns `test_known_size` samples of known classes followed by `test_unknown_size` samples of unknown classes.\n\n The dataloader is configured with the following config attributes:\n\n | Dataset config | 
Description |\n | ------------------| ------------------------------------------------------------------|\n | `test_batch_size` | Number of samples per batch for loading validation and test data. |\n | `test_workers` | Number of workers for loading test data. |\n\n Returns:\n Test data as an iterable dataloader. See [using dataloaders][using-dataloaders] for more details.\n \"\"\"\n if self.dataset_config is None:\n raise ValueError(\"Dataset is not initialized, use set_dataset_config_and_initialize() before getting test dataloader\")\n if not self.dataset_config.need_test_set:\n raise ValueError(\"Test dataloader is not available when need_test_set is false\")\n assert self.test_dataset is not None\n if self.test_dataloader:\n return self.test_dataloader\n batch_sampler = BatchSampler(sampler=SequentialSampler(self.test_dataset), batch_size=self.dataset_config.test_batch_size, drop_last=False)\n test_dataloader = DataLoader(\n self.test_dataset,\n num_workers=self.dataset_config.test_workers,\n worker_init_fn=worker_init_fn,\n collate_fn=self._collate_fn,\n persistent_workers=False,\n batch_size=None,\n sampler=batch_sampler,)\n if self.dataset_config.test_workers == 0:\n self.test_dataset.pytables_worker_init()\n self.test_dataloader = test_dataloader\n return test_dataloader\n\n def get_dataloaders(self) -> tuple[DataLoader, DataLoader, DataLoader]:\n \"\"\"Gets train, validation, and test dataloaders in one call.\"\"\"\n if self.dataset_config is None:\n raise ValueError(\"Dataset is not initialized, use set_dataset_config_and_initialize() before getting dataloaders\")\n train_dataloader = self.get_train_dataloader()\n val_dataloader = self.get_val_dataloader()\n test_dataloader = self.get_test_dataloader()\n return train_dataloader, val_dataloader, test_dataloader\n\n def get_train_df(self, flatten_ppi: bool = False) -> pd.DataFrame:\n \"\"\"\n Creates a train Pandas [`DataFrame`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html). 
The dataframe is in sequential (datetime) order. Consider shuffling the dataframe if needed.\n\n !!! warning \"Memory usage\"\n\n The whole train set is loaded into memory. If the dataset size is larger than `'S'`, consider using `get_train_dataloader` instead.\n\n Parameters:\n flatten_ppi: Whether to flatten the PPI sequence into individual columns (named `IPT_X`, `DIR_X`, `SIZE_X`, `PUSH_X`, *X* being the index of the packet) or keep one `PPI` column with 2D data.\n\n Returns:\n Train data as a dataframe.\n \"\"\"\n self._check_before_dataframe(check_train=True)\n assert self.dataset_config is not None and self.train_dataset is not None\n if len(self.train_dataset) > DATAFRAME_SAMPLES_WARNING_THRESHOLD:\n warnings.warn(f\"Train set has ({len(self.train_dataset)} samples), consider using get_train_dataloader() instead\")\n train_dataloader = self.get_train_dataloader()\n assert isinstance(train_dataloader.sampler, BatchSampler) and self.train_dataloader_sampler is not None\n # Read dataloader in sequential order\n train_dataloader.sampler.sampler = SequentialSampler(self.train_dataset)\n train_dataloader.sampler.drop_last = False\n feature_names = self.dataset_config.get_feature_names(flatten_ppi=flatten_ppi)\n df = create_df_from_dataloader(dataloader=train_dataloader,\n feature_names=feature_names,\n flatten_ppi=flatten_ppi,\n silent=self.silent)\n # Restore the original dataloader sampler and drop_last\n train_dataloader.sampler.sampler = self.train_dataloader_sampler\n train_dataloader.sampler.drop_last = self.train_dataloader_drop_last\n return df\n\n def get_val_df(self, flatten_ppi: bool = False) -> pd.DataFrame:\n \"\"\"\n Creates validation Pandas [`DataFrame`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html). The dataframe is in sequential (datetime) order.\n\n !!! warning \"Memory usage\"\n\n The whole validation set is loaded into memory. 
If the dataset size is larger than `'S'`, consider using `get_val_dataloader` instead.\n\n Parameters:\n flatten_ppi: Whether to flatten the PPI sequence into individual columns (named `IPT_X`, `DIR_X`, `SIZE_X`, `PUSH_X`, *X* being the index of the packet) or keep one `PPI` column with 2D data.\n\n Returns:\n Validation data as a dataframe.\n \"\"\"\n self._check_before_dataframe(check_val=True)\n assert self.dataset_config is not None and self.val_dataset is not None\n if len(self.val_dataset) > DATAFRAME_SAMPLES_WARNING_THRESHOLD:\n warnings.warn(f\"Validation set has ({len(self.val_dataset)} samples), consider using get_val_dataloader() instead\")\n feature_names = self.dataset_config.get_feature_names(flatten_ppi=flatten_ppi)\n return create_df_from_dataloader(dataloader=self.get_val_dataloader(),\n feature_names=feature_names,\n flatten_ppi=flatten_ppi,\n silent=self.silent)\n\n def get_test_df(self, flatten_ppi: bool = False) -> pd.DataFrame:\n \"\"\"\n Creates test Pandas [`DataFrame`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html). The dataframe is in sequential (datetime) order.\n\n\n When the dataset is used in the open-world setting, and unknown classes are defined,\n the returned test dataframe is composed of `test_known_size` samples of known classes followed by `test_unknown_size` samples of unknown classes.\n\n\n !!! warning \"Memory usage\"\n\n The whole test set is loaded into memory. 
If the dataset size is larger than `'S'`, consider using `get_test_dataloader` instead.\n\n Parameters:\n flatten_ppi: Whether to flatten the PPI sequence into individual columns (named `IPT_X`, `DIR_X`, `SIZE_X`, `PUSH_X`, *X* being the index of the packet) or keep one `PPI` column with 2D data.\n\n Returns:\n Test data as a dataframe.\n \"\"\"\n self._check_before_dataframe(check_test=True)\n assert self.dataset_config is not None and self.test_dataset is not None\n if len(self.test_dataset) > DATAFRAME_SAMPLES_WARNING_THRESHOLD:\n warnings.warn(f\"Test set has ({len(self.test_dataset)} samples), consider using get_test_dataloader() instead\")\n feature_names = self.dataset_config.get_feature_names(flatten_ppi=flatten_ppi)\n return create_df_from_dataloader(dataloader=self.get_test_dataloader(),\n feature_names=feature_names,\n flatten_ppi=flatten_ppi,\n silent=self.silent)\n\n def get_num_classes(self) -> int:\n \"\"\"Returns the number of classes in the current configuration of the dataset.\"\"\"\n if self.class_info is None:\n raise ValueError(\"Dataset is not initialized, use set_dataset_config_and_initialize() before getting the number of classes\")\n return self.class_info.num_classes\n\n def get_known_apps(self) -> list[str]:\n \"\"\"Returns the list of known applications in the current configuration of the dataset.\"\"\"\n if self.class_info is None:\n raise ValueError(\"Dataset is not initialized, use set_dataset_config_and_initialize() before getting known apps\")\n return self.class_info.known_apps\n\n def get_unknown_apps(self) -> list[str]:\n \"\"\"Returns the list of unknown applications in the current configuration of the dataset.\"\"\"\n if self.class_info is None:\n raise ValueError(\"Dataset is not initialized, use set_dataset_config_and_initialize() before getting unknown apps\")\n return self.class_info.unknown_apps\n\n def compute_dataset_statistics(self, num_samples: int | Literal[\"all\"] = 10_000_000, num_workers: int = 4, batch_size: int 
= 16384, disabled_apps: Optional[list[str]] = None) -> None:\n \"\"\"\n Computes dataset statistics and saves them to the `statistics_path` folder.\n\n Parameters:\n num_samples: Number of samples to use for computing the statistics.\n num_workers: Number of workers for loading data.\n batch_size: Number of samples per batch for loading data.\n disabled_apps: List of applications to exclude from the statistics.\n \"\"\"\n if disabled_apps:\n bad_disabled_apps = [a for a in disabled_apps if a not in self.available_classes]\n if len(bad_disabled_apps) > 0:\n raise ValueError(f\"Bad applications in disabled_apps {bad_disabled_apps}. Use applications available in dataset.available_classes\")\n if not os.path.exists(self.statistics_path):\n os.mkdir(self.statistics_path)\n compute_dataset_statistics(database_path=self.database_path,\n tables_app_enum=self._tables_app_enum,\n tables_cat_enum=self._tables_cat_enum,\n output_dir=self.statistics_path,\n packet_histograms=self.metadata.packet_histograms,\n flowstats_features_boolean=self.metadata.flowstats_features_boolean,\n protocol=self.metadata.protocol,\n extra_fields=not self.name.startswith(\"CESNET-TLS22\"),\n disabled_apps=disabled_apps if disabled_apps is not None else [],\n num_samples=num_samples,\n num_workers=num_workers,\n batch_size=batch_size,\n silent=self.silent)\n\n def _generate_time_periods(self) -> None:\n time_periods = {}\n for period in self.time_periods:\n time_periods[period] = []\n if period.startswith(\"W\"):\n split = period.split(\"-\")\n collection_year, week = int(split[1]), int(split[2])\n for d in range(1, 8):\n s = datetime.date.fromisocalendar(collection_year, week, d).strftime(\"%Y%m%d\")\n # last week of a year can span into the following year\n if s not in self.metadata.missing_dates_in_collection_period and s.startswith(str(collection_year)):\n time_periods[period].append(s)\n elif period.startswith(\"M\"):\n split = period.split(\"-\")\n collection_year, month = int(split[1]), 
int(split[2])\n # monthrange()[1] is the number of days in the month; +1 so the last day is included\n for d in range(1, calendar.monthrange(collection_year, month)[1] + 1):\n s = datetime.date(collection_year, month, d).strftime(\"%Y%m%d\")\n if s not in self.metadata.missing_dates_in_collection_period:\n time_periods[period].append(s)\n self.time_periods = time_periods\n\n def _is_downloaded(self) -> bool:\n \"\"\"Servicemap is downloaded after the database; thus if it exists, the database is also downloaded\"\"\"\n return os.path.exists(self.servicemap_path) and os.path.exists(self.database_path)\n\n def _download(self) -> None:\n if not self.silent:\n print(f\"Downloading {self.name} dataset\")\n database_url = f\"{self.bucket_url}&file={self.database_filename}\"\n servicemap_url = f\"{self.bucket_url}&file={SERVICEMAP_FILE}\"\n resumable_download(url=database_url, file_path=self.database_path, silent=self.silent)\n simple_download(url=servicemap_url, file_path=self.servicemap_path)\n\n def _clear(self) -> None:\n self.class_info = None\n self.dataset_indices = None\n self.train_dataset = None\n self.val_dataset = None\n self.test_dataset = None\n self.known_app_counts = None\n self.unknown_app_counts = None\n self.train_dataloader = None\n self.train_dataloader_sampler = None\n self.train_dataloader_drop_last = True\n self.val_dataloader = None\n self.test_dataloader = None\n self._collate_fn = None\n\n def _check_before_dataframe(self, check_train: bool = False, check_val: bool = False, check_test: bool = False) -> None:\n if self.dataset_config is None:\n raise ValueError(\"Dataset is not initialized, use set_dataset_config_and_initialize() before getting a dataframe\")\n if self.dataset_config.return_tensors:\n raise ValueError(\"Dataframes are not available when return_tensors is set. 
Use a dataloader instead.\")\n if check_train and not self.dataset_config.need_train_set:\n raise ValueError(\"Train dataframe is not available when need_train_set is false\")\n if check_val and not self.dataset_config.need_val_set:\n raise ValueError(\"Validation dataframe is not available when need_val_set is false\")\n if check_test and not self.dataset_config.need_test_set:\n raise ValueError(\"Test dataframe is not available when need_test_set is false\")\n\n def _initialize_train_val_test(self, disable_indices_cache: bool = False) -> None:\n assert self.dataset_config is not None\n dataset_config = self.dataset_config\n servicemap = pd.read_csv(dataset_config.servicemap_path, index_col=\"Tag\")\n # Initialize train set\n if dataset_config.need_train_set:\n train_indices, train_unknown_indices, known_apps, unknown_apps = init_or_load_train_indices(dataset_config=dataset_config,\n tables_app_enum=self._tables_app_enum,\n servicemap=servicemap,\n disable_indices_cache=disable_indices_cache,)\n # Date weight sampling of train indices\n if dataset_config.train_dates_weigths is not None:\n assert dataset_config.train_size != \"all\"\n if dataset_config.val_approach == ValidationApproach.SPLIT_FROM_TRAIN:\n # requested number of samples is train_size + val_known_size when using the split-from-train validation approach\n assert dataset_config.val_known_size != \"all\"\n num_samples = dataset_config.train_size + dataset_config.val_known_size\n else:\n num_samples = dataset_config.train_size\n if num_samples > len(train_indices):\n raise ValueError(f\"Requested number of samples for weight sampling ({num_samples}) is larger than the number of available train samples ({len(train_indices)})\")\n train_indices = date_weight_sample_train_indices(dataset_config=dataset_config, train_indices=train_indices, num_samples=num_samples)\n elif dataset_config.apps_selection == AppSelection.FIXED:\n known_apps = dataset_config.apps_selection_fixed_known\n unknown_apps = 
dataset_config.apps_selection_fixed_unknown\n train_indices = np.zeros((0,3), dtype=np.int64)\n train_unknown_indices = np.zeros((0,3), dtype=np.int64)\n else:\n raise ValueError(\"Either need train set or the fixed application selection\")\n # Initialize validation set\n if dataset_config.need_val_set:\n if dataset_config.val_approach == ValidationApproach.VALIDATION_DATES:\n val_known_indices, val_unknown_indices, val_data_path = init_or_load_val_indices(dataset_config=dataset_config,\n known_apps=known_apps,\n unknown_apps=unknown_apps,\n tables_app_enum=self._tables_app_enum,\n disable_indices_cache=disable_indices_cache,)\n elif dataset_config.val_approach == ValidationApproach.SPLIT_FROM_TRAIN:\n train_val_rng = get_fresh_random_generator(dataset_config=dataset_config, section=RandomizedSection.TRAIN_VAL_SPLIT)\n val_data_path = dataset_config._get_train_data_path()\n val_unknown_indices = train_unknown_indices\n train_labels = train_indices[:, INDICES_LABEL_POS]\n if dataset_config.train_dates_weigths is not None:\n assert dataset_config.val_known_size != \"all\"\n # When weight sampling is used, val_known_size is kept, but the resulting train size can be smaller because some train dates do not have enough samples\n if dataset_config.val_known_size > len(train_indices):\n raise ValueError(f\"Requested validation size ({dataset_config.val_known_size}) is larger than the number of available train samples after weight sampling ({len(train_indices)})\")\n train_indices, val_known_indices = train_test_split(train_indices, test_size=dataset_config.val_known_size, stratify=train_labels, shuffle=True, random_state=train_val_rng)\n dataset_config.train_size = len(train_indices)\n elif dataset_config.train_size == \"all\" and dataset_config.val_known_size == \"all\":\n train_indices, val_known_indices = train_test_split(train_indices, test_size=dataset_config.train_val_split_fraction, stratify=train_labels, shuffle=True, random_state=train_val_rng)\n else:\n if 
dataset_config.val_known_size != \"all\" and dataset_config.train_size != \"all\" and dataset_config.train_size + dataset_config.val_known_size > len(train_indices):\n raise ValueError(f\"Requested train size + validation size ({dataset_config.train_size + dataset_config.val_known_size}) is larger than the number of available train samples ({len(train_indices)})\")\n if dataset_config.train_size != \"all\" and dataset_config.train_size > len(train_indices):\n raise ValueError(f\"Requested train size ({dataset_config.train_size}) is larger than the number of available train samples ({len(train_indices)})\")\n if dataset_config.val_known_size != \"all\" and dataset_config.val_known_size > len(train_indices):\n raise ValueError(f\"Requested validation size ({dataset_config.val_known_size}) is larger than the number of available train samples ({len(train_indices)})\")\n train_indices, val_known_indices = train_test_split(train_indices,\n train_size=dataset_config.train_size if dataset_config.train_size != \"all\" else None,\n test_size=dataset_config.val_known_size if dataset_config.val_known_size != \"all\" else None,\n stratify=train_labels, shuffle=True, random_state=train_val_rng)\n else:\n val_known_indices = np.zeros((0,3), dtype=np.int64)\n val_unknown_indices = np.zeros((0,3), dtype=np.int64)\n val_data_path = None\n # Initialize test set\n if dataset_config.need_test_set:\n test_known_indices, test_unknown_indices, test_data_path = init_or_load_test_indices(dataset_config=dataset_config,\n known_apps=known_apps,\n unknown_apps=unknown_apps,\n tables_app_enum=self._tables_app_enum,\n disable_indices_cache=disable_indices_cache,)\n else:\n test_known_indices = np.zeros((0,3), dtype=np.int64)\n test_unknown_indices = np.zeros((0,3), dtype=np.int64)\n test_data_path = None\n # Fit scalers if needed\n if (dataset_config.ppi_transform is not None and dataset_config.ppi_transform.needs_fitting or\n dataset_config.flowstats_transform is not None and 
dataset_config.flowstats_transform.needs_fitting):\n if not dataset_config.need_train_set:\n raise ValueError(\"Train set is needed to fit the scalers. Provide pre-fitted scalers.\")\n fit_scalers(dataset_config=dataset_config, train_indices=train_indices)\n # Subset dataset indices based on the selected sizes and compute application counts\n dataset_indices = IndicesTuple(train_indices=train_indices, val_known_indices=val_known_indices, val_unknown_indices=val_unknown_indices, test_known_indices=test_known_indices, test_unknown_indices=test_unknown_indices)\n dataset_indices = subset_and_sort_indices(dataset_config=dataset_config, dataset_indices=dataset_indices)\n known_app_counts = compute_known_app_counts(dataset_indices=dataset_indices, tables_app_enum=self._tables_app_enum)\n unknown_app_counts = compute_unknown_app_counts(dataset_indices=dataset_indices, tables_app_enum=self._tables_app_enum)\n # Combine known and unknown test indices to create a single dataloader\n assert isinstance(dataset_config.test_unknown_size, int)\n if dataset_config.test_unknown_size > 0 and len(unknown_apps) > 0:\n test_combined_indices = np.concatenate((dataset_indices.test_known_indices, dataset_indices.test_unknown_indices))\n else:\n test_combined_indices = dataset_indices.test_known_indices\n # Create the encoder and the class info structure\n encoder = LabelEncoder().fit(known_apps)\n encoder.classes_ = np.append(encoder.classes_, UNKNOWN_STR_LABEL)\n class_info = create_class_info(servicemap=servicemap, encoder=encoder, known_apps=known_apps, unknown_apps=unknown_apps)\n encode_labels_with_unknown_fn = partial(_encode_labels_with_unknown, encoder=encoder, class_info=class_info)\n # Create train, validation, and test datasets\n train_dataset = val_dataset = test_dataset = None\n if dataset_config.need_train_set:\n train_dataset = PyTablesDataset(\n database_path=dataset_config.database_path,\n tables_paths=dataset_config._get_train_tables_paths(),\n 
indices=dataset_indices.train_indices,\n tables_app_enum=self._tables_app_enum,\n tables_cat_enum=self._tables_cat_enum,\n flowstats_features=dataset_config.flowstats_features,\n flowstats_features_boolean=dataset_config.flowstats_features_boolean,\n flowstats_features_phist=dataset_config.flowstats_features_phist,\n other_fields=self.dataset_config.other_fields,\n ppi_channels=dataset_config.get_ppi_channels(),\n ppi_transform=dataset_config.ppi_transform,\n flowstats_transform=dataset_config.flowstats_transform,\n flowstats_phist_transform=dataset_config.flowstats_phist_transform,\n target_transform=encode_labels_with_unknown_fn,\n return_tensors=dataset_config.return_tensors,)\n if dataset_config.need_val_set:\n assert val_data_path is not None\n val_dataset = PyTablesDataset(\n database_path=dataset_config.database_path,\n tables_paths=dataset_config._get_train_tables_paths(),\n indices=dataset_indices.val_known_indices,\n tables_app_enum=self._tables_app_enum,\n tables_cat_enum=self._tables_cat_enum,\n flowstats_features=dataset_config.flowstats_features,\n flowstats_features_boolean=dataset_config.flowstats_features_boolean,\n flowstats_features_phist=dataset_config.flowstats_features_phist,\n other_fields=self.dataset_config.other_fields,\n ppi_channels=dataset_config.get_ppi_channels(),\n ppi_transform=dataset_config.ppi_transform,\n flowstats_transform=dataset_config.flowstats_transform,\n flowstats_phist_transform=dataset_config.flowstats_phist_transform,\n target_transform=encode_labels_with_unknown_fn,\n return_tensors=dataset_config.return_tensors,\n preload=dataset_config.preload_val,\n preload_blob=os.path.join(val_data_path, \"preload\", f\"val_dataset-{dataset_config.val_known_size}.npz\"),)\n if dataset_config.need_test_set:\n assert test_data_path is not None\n test_dataset = PyTablesDataset(\n database_path=dataset_config.database_path,\n tables_paths=dataset_config._get_test_tables_paths(),\n indices=test_combined_indices,\n 
tables_app_enum=self._tables_app_enum,\n tables_cat_enum=self._tables_cat_enum,\n flowstats_features=dataset_config.flowstats_features,\n flowstats_features_boolean=dataset_config.flowstats_features_boolean,\n flowstats_features_phist=dataset_config.flowstats_features_phist,\n other_fields=self.dataset_config.other_fields,\n ppi_channels=dataset_config.get_ppi_channels(),\n ppi_transform=dataset_config.ppi_transform,\n flowstats_transform=dataset_config.flowstats_transform,\n flowstats_phist_transform=dataset_config.flowstats_phist_transform,\n target_transform=encode_labels_with_unknown_fn,\n return_tensors=dataset_config.return_tensors,\n preload=dataset_config.preload_test,\n preload_blob=os.path.join(test_data_path, \"preload\", f\"test_dataset-{dataset_config.test_known_size}-{dataset_config.test_unknown_size}.npz\"),)\n self.class_info = class_info\n self.dataset_indices = dataset_indices\n self.train_dataset = train_dataset\n self.val_dataset = val_dataset\n self.test_dataset = test_dataset\n self.known_app_counts = known_app_counts\n self.unknown_app_counts = unknown_app_counts\n self._collate_fn = collate_fn_simple\n
"},{"location":"reference_cesnet_dataset/#datasets.cesnet_dataset.CesnetDataset.set_dataset_config_and_initialize","title":"set_dataset_config_and_initialize","text":"set_dataset_config_and_initialize(\n dataset_config: DatasetConfig,\n disable_indices_cache: bool = False,\n) -> None\n
Initialize train, validation, and test sets. Data cannot be accessed before calling this method.
Parameters:
| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `dataset_config` | DatasetConfig | Desired configuration of the dataset. | required |
| `disable_indices_cache` | bool | Whether to disable caching of the dataset indices. This is useful when the dataset is used in many different configurations and you want to save disk space. | False |
Source code in cesnet_datazoo\\datasets\\cesnet_dataset.py
def set_dataset_config_and_initialize(self, dataset_config: DatasetConfig, disable_indices_cache: bool = False) -> None:\n \"\"\"\n Initialize train, validation, and test sets. Data cannot be accessed before calling this method.\n\n Parameters:\n dataset_config: Desired configuration of the dataset.\n disable_indices_cache: Whether to disable caching of the dataset indices. This is useful when the dataset is used in many different configurations and you want to save disk space.\n \"\"\"\n self.dataset_config = dataset_config\n self._clear()\n self._initialize_train_val_test(disable_indices_cache=disable_indices_cache)\n
"},{"location":"reference_cesnet_dataset/#datasets.cesnet_dataset.CesnetDataset.get_train_dataloader","title":"get_train_dataloader","text":"get_train_dataloader() -> DataLoader\n
Provides a PyTorch `DataLoader` for training. The dataloader is created on the first call and then cached. When the dataloader is iterated in random order, the last incomplete batch is dropped. The dataloader is configured with the following config attributes:
| Dataset config | Description |
| --- | --- |
| `batch_size` | Number of samples per batch. |
| `train_workers` | Number of workers for loading train data. |
| `train_dataloader_order` | Whether to load train data in sequential or random order. See config.DataLoaderOrder. |
| `train_dataloader_seed` | Seed for loading train data in random order. |
Returns:
| Type | Description |
| --- | --- |
| DataLoader | Train data as an iterable dataloader. See using dataloaders for more details. |
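The drop-last rule described above determines how many batches one pass over the train set yields. The following is a minimal sketch of that arithmetic, not code from the package: `num_batches` is a hypothetical helper and the sample counts are made-up numbers.

```python
def num_batches(num_samples: int, batch_size: int, drop_last: bool) -> int:
    '''Number of batches yielded in one pass over the data.'''
    if batch_size <= 0:
        raise ValueError('batch_size must be positive')
    if drop_last:
        # Random order: the trailing incomplete batch is discarded.
        return num_samples // batch_size
    # Sequential order: the trailing incomplete batch is kept.
    return (num_samples + batch_size - 1) // batch_size


# Illustrative numbers only: 1000 samples in batches of 192.
random_order_batches = num_batches(1000, 192, drop_last=True)
sequential_order_batches = num_batches(1000, 192, drop_last=False)
```

With these numbers the random-order pass skips the final 40 samples of each epoch, which is usually acceptable for training because a different subset is skipped after every reshuffle.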
Source code in cesnet_datazoo\\datasets\\cesnet_dataset.py
def get_train_dataloader(self) -> DataLoader:\n \"\"\"\n Provides a PyTorch [`DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader) for training. The dataloader is created on the first call and then cached.\n When the dataloader is iterated in random order, the last incomplete batch is dropped.\n The dataloader is configured with the following config attributes:\n\n | Dataset config | Description |\n | ---------------------------- | ------------------------------------------------------------------------------------------ |\n | `batch_size` | Number of samples per batch. |\n | `train_workers` | Number of workers for loading train data. |\n | `train_dataloader_order` | Whether to load train data in sequential or random order. See [config.DataLoaderOrder][]. |\n | `train_dataloader_seed` | Seed for loading train data in random order. |\n\n Returns:\n Train data as an iterable dataloader. See [using dataloaders][using-dataloaders] for more details.\n \"\"\"\n if self.dataset_config is None:\n raise ValueError(\"Dataset is not initialized, use set_dataset_config_and_initialize() before getting train dataloader\")\n if not self.dataset_config.need_train_set:\n raise ValueError(\"Train dataloader is not available when need_train_set is false\")\n assert self.train_dataset\n if self.train_dataloader:\n return self.train_dataloader\n # Create sampler according to the selected order\n if self.dataset_config.train_dataloader_order == DataLoaderOrder.RANDOM:\n if self.dataset_config.train_dataloader_seed is not None:\n generator = torch.Generator()\n generator.manual_seed(self.dataset_config.train_dataloader_seed)\n else:\n generator = None\n self.train_dataloader_sampler = RandomSampler(self.train_dataset, generator=generator)\n self.train_dataloader_drop_last = True\n elif self.dataset_config.train_dataloader_order == DataLoaderOrder.SEQUENTIAL:\n self.train_dataloader_sampler = SequentialSampler(self.train_dataset)\n 
self.train_dataloader_drop_last = False\n else: assert_never(self.dataset_config.train_dataloader_order)\n # Create dataloader\n batch_sampler = BatchSampler(sampler=self.train_dataloader_sampler, batch_size=self.dataset_config.batch_size, drop_last=self.train_dataloader_drop_last)\n train_dataloader = DataLoader(\n self.train_dataset,\n num_workers=self.dataset_config.train_workers,\n worker_init_fn=worker_init_fn,\n collate_fn=self._collate_fn,\n persistent_workers=self.dataset_config.train_workers > 0,\n batch_size=None,\n sampler=batch_sampler,)\n if self.dataset_config.train_workers == 0:\n self.train_dataset.pytables_worker_init()\n self.train_dataloader = train_dataloader\n return train_dataloader\n
"},{"location":"reference_cesnet_dataset/#datasets.cesnet_dataset.CesnetDataset.get_val_dataloader","title":"get_val_dataloader","text":"get_val_dataloader() -> DataLoader\n
Provides a PyTorch `DataLoader` for validation. The dataloader is created on the first call and then cached. The dataloader is configured with the following config attributes:
| Dataset config | Description |
| --- | --- |
| `test_batch_size` | Number of samples per batch for loading validation and test data. |
| `val_workers` | Number of workers for loading validation data. |
Returns:
| Type | Description |
| --- | --- |
| DataLoader | Validation data as an iterable dataloader. See using dataloaders for more details. |
Source code in cesnet_datazoo\\datasets\\cesnet_dataset.py
def get_val_dataloader(self) -> DataLoader:\n \"\"\"\n Provides a PyTorch [`DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader) for validation.\n The dataloader is created on the first call and then cached.\n The dataloader is configured with the following config attributes:\n\n | Dataset config | Description |\n | ------------------| ------------------------------------------------------------------|\n | `test_batch_size` | Number of samples per batch for loading validation and test data. |\n | `val_workers` | Number of workers for loading validation data. |\n\n Returns:\n Validation data as an iterable dataloader. See [using dataloaders][using-dataloaders] for more details.\n \"\"\"\n if self.dataset_config is None:\n raise ValueError(\"Dataset is not initialized, use set_dataset_config_and_initialize() before getting validation dataloader\")\n if not self.dataset_config.need_val_set:\n raise ValueError(\"Validation dataloader is not available when need_val_set is false\")\n assert self.val_dataset is not None\n if self.val_dataloader:\n return self.val_dataloader\n batch_sampler = BatchSampler(sampler=SequentialSampler(self.val_dataset), batch_size=self.dataset_config.test_batch_size, drop_last=False)\n val_dataloader = DataLoader(\n self.val_dataset,\n num_workers=self.dataset_config.val_workers,\n worker_init_fn=worker_init_fn,\n collate_fn=self._collate_fn,\n persistent_workers=self.dataset_config.val_workers > 0,\n batch_size=None,\n sampler=batch_sampler,)\n if self.dataset_config.val_workers == 0:\n self.val_dataset.pytables_worker_init()\n self.val_dataloader = val_dataloader\n return val_dataloader\n
"},{"location":"reference_cesnet_dataset/#datasets.cesnet_dataset.CesnetDataset.get_test_dataloader","title":"get_test_dataloader","text":"get_test_dataloader() -> DataLoader\n
Provides a PyTorch `DataLoader` for testing. The dataloader is created on the first call and then cached.
When the dataset is used in the open-world setting, and unknown classes are defined, the test dataloader returns `test_known_size` samples of known classes followed by `test_unknown_size` samples of unknown classes.
The dataloader is configured with the following config attributes:
| Dataset config | Description |
| --- | --- |
| `test_batch_size` | Number of samples per batch for loading validation and test data. |
| `test_workers` | Number of workers for loading test data. |
Returns:
| Type | Description |
| --- | --- |
| DataLoader | Test data as an iterable dataloader. See using dataloaders for more details. |
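Because known-class samples come first and unknown-class samples follow, labels collected from the test dataloader can be separated by position. The sketch below illustrates this under two assumptions: that all batch labels have already been concatenated into one sequence, and that `test_known_size` is an integer equal to the number of known samples actually returned. `split_known_unknown` is a hypothetical helper, not part of the package.

```python
from typing import List, Sequence, Tuple


def split_known_unknown(labels: Sequence[int], test_known_size: int) -> Tuple[List[int], List[int]]:
    '''Split test labels by position into known-class and unknown-class parts.'''
    if not 0 <= test_known_size <= len(labels):
        raise ValueError('test_known_size must be between 0 and len(labels)')
    # Assumption: the first test_known_size entries belong to known classes.
    known = list(labels[:test_known_size])
    unknown = list(labels[test_known_size:])
    return known, unknown


# Made-up labels: three known-class samples, then two samples with a hypothetical unknown label 3.
known_labels, unknown_labels = split_known_unknown([0, 2, 1, 3, 3], test_known_size=3)
```

A positional split like this is only a convenience; comparing labels against the encoded unknown class is the more robust check when the exact sizes are uncertain.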
Source code in cesnet_datazoo\\datasets\\cesnet_dataset.py
def get_test_dataloader(self) -> DataLoader:\n \"\"\"\n Provides a PyTorch [`DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader) for testing.\n The dataloader is created on the first call and then cached.\n\n When the dataset is used in the open-world setting, and unknown classes are defined,\n the test dataloader returns `test_known_size` samples of known classes followed by `test_unknown_size` samples of unknown classes.\n\n The dataloader is configured with the following config attributes:\n\n | Dataset config | Description |\n | ------------------| ------------------------------------------------------------------|\n | `test_batch_size` | Number of samples per batch for loading validation and test data. |\n | `test_workers` | Number of workers for loading test data. |\n\n Returns:\n Test data as an iterable dataloader. See [using dataloaders][using-dataloaders] for more details.\n \"\"\"\n if self.dataset_config is None:\n raise ValueError(\"Dataset is not initialized, use set_dataset_config_and_initialize() before getting test dataloader\")\n if not self.dataset_config.need_test_set:\n raise ValueError(\"Test dataloader is not available when need_test_set is false\")\n assert self.test_dataset is not None\n if self.test_dataloader:\n return self.test_dataloader\n batch_sampler = BatchSampler(sampler=SequentialSampler(self.test_dataset), batch_size=self.dataset_config.test_batch_size, drop_last=False)\n test_dataloader = DataLoader(\n self.test_dataset,\n num_workers=self.dataset_config.test_workers,\n worker_init_fn=worker_init_fn,\n collate_fn=self._collate_fn,\n persistent_workers=False,\n batch_size=None,\n sampler=batch_sampler,)\n if self.dataset_config.test_workers == 0:\n self.test_dataset.pytables_worker_init()\n self.test_dataloader = test_dataloader\n return test_dataloader\n
"},{"location":"reference_cesnet_dataset/#datasets.cesnet_dataset.CesnetDataset.get_dataloaders","title":"get_dataloaders","text":"get_dataloaders() -> (\n tuple[DataLoader, DataLoader, DataLoader]\n)\n
Gets train, validation, and test dataloaders in one call.
Source code in cesnet_datazoo\\datasets\\cesnet_dataset.py
def get_dataloaders(self) -> tuple[DataLoader, DataLoader, DataLoader]:\n \"\"\"Gets train, validation, and test dataloaders in one call.\"\"\"\n if self.dataset_config is None:\n raise ValueError(\"Dataset is not initialized, use set_dataset_config_and_initialize() before getting dataloaders\")\n train_dataloader = self.get_train_dataloader()\n val_dataloader = self.get_val_dataloader()\n test_dataloader = self.get_test_dataloader()\n return train_dataloader, val_dataloader, test_dataloader\n
"},{"location":"reference_cesnet_dataset/#datasets.cesnet_dataset.CesnetDataset.get_train_df","title":"get_train_df","text":"get_train_df(flatten_ppi: bool = False) -> pd.DataFrame\n
Creates a train Pandas `DataFrame`. The dataframe is in sequential (datetime) order. Consider shuffling the dataframe if needed.
Memory usage
The whole train set is loaded into memory. If the dataset size is larger than `'S'`, consider using `get_train_dataloader` instead.
Parameters:
| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `flatten_ppi` | bool | Whether to flatten the PPI sequence into individual columns (named `IPT_X`, `DIR_X`, `SIZE_X`, `PUSH_X`, X being the index of the packet) or keep one `PPI` column with 2D data. | False |
Returns:
| Type | Description |
| --- | --- |
| DataFrame | Train data as a dataframe. |
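Since the train dataframe is returned in datetime order, shuffling is left to the caller. One common way to do it with pandas is sketched below. The small dataframe is only a stand-in for the result of `get_train_df()`, and its column names are made up for the illustration.

```python
import pandas as pd

# Stand-in for the dataframe returned by get_train_df(); the columns are hypothetical.
train_df = pd.DataFrame({'feature': range(10), 'label': [0, 1] * 5})

# sample(frac=1.0) returns every row in random order; a fixed random_state
# makes the shuffle reproducible, and reset_index gives a clean 0..n-1 index.
shuffled_df = train_df.sample(frac=1.0, random_state=42).reset_index(drop=True)
```

The original `train_df` is left untouched, so the sequential order remains available for time-based analysis.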
Source code in cesnet_datazoo\\datasets\\cesnet_dataset.py
def get_train_df(self, flatten_ppi: bool = False) -> pd.DataFrame:\n \"\"\"\n Creates a train Pandas [`DataFrame`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html). The dataframe is in sequential (datetime) order. Consider shuffling the dataframe if needed.\n\n !!! warning \"Memory usage\"\n\n The whole train set is loaded into memory. If the dataset size is larger than `'S'`, consider using `get_train_dataloader` instead.\n\n Parameters:\n flatten_ppi: Whether to flatten the PPI sequence into individual columns (named `IPT_X`, `DIR_X`, `SIZE_X`, `PUSH_X`, *X* being the index of the packet) or keep one `PPI` column with 2D data.\n\n Returns:\n Train data as a dataframe.\n \"\"\"\n self._check_before_dataframe(check_train=True)\n assert self.dataset_config is not None and self.train_dataset is not None\n if len(self.train_dataset) > DATAFRAME_SAMPLES_WARNING_THRESHOLD:\n warnings.warn(f\"Train set has ({len(self.train_dataset)} samples), consider using get_train_dataloader() instead\")\n train_dataloader = self.get_train_dataloader()\n assert isinstance(train_dataloader.sampler, BatchSampler) and self.train_dataloader_sampler is not None\n # Read dataloader in sequential order\n train_dataloader.sampler.sampler = SequentialSampler(self.train_dataset)\n train_dataloader.sampler.drop_last = False\n feature_names = self.dataset_config.get_feature_names(flatten_ppi=flatten_ppi)\n df = create_df_from_dataloader(dataloader=train_dataloader,\n feature_names=feature_names,\n flatten_ppi=flatten_ppi,\n silent=self.silent)\n # Restore the original dataloader sampler and drop_last\n train_dataloader.sampler.sampler = self.train_dataloader_sampler\n train_dataloader.sampler.drop_last = self.train_dataloader_drop_last\n return df\n
"},{"location":"reference_cesnet_dataset/#datasets.cesnet_dataset.CesnetDataset.get_val_df","title":"get_val_df","text":"get_val_df(flatten_ppi: bool = False) -> pd.DataFrame\n
Creates a validation Pandas DataFrame. The dataframe is in sequential (datetime) order.
Memory usage
The whole validation set is loaded into memory. If the dataset size is larger than `'S'`, consider using `get_val_dataloader` instead.
Parameters:
- `flatten_ppi` (`bool`, default `False`): Whether to flatten the PPI sequence into individual columns (named `IPT_X`, `DIR_X`, `SIZE_X`, `PUSH_X`, X being the index of the packet) or keep one `PPI` column with 2D data.
Returns:
- `DataFrame`: Validation data as a dataframe.
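To make the `flatten_ppi` option more concrete, here is a minimal sketch that flattens a toy PPI array into per-packet columns with plain NumPy and pandas. It does not call cesnet-datazoo. The helper name, the channel order (IPT, DIR, SIZE), the column ordering, and the zero-based packet index are assumptions made for illustration; the actual column names are the ones returned by `dataset_config.get_feature_names(flatten_ppi=True)`.

```python
import numpy as np
import pandas as pd

# Assumed layout for illustration: (samples, channels, packets).
PPI_CHANNELS = ['IPT', 'DIR', 'SIZE']  # assumed channel order, push flags not used


# Hypothetical helper: turns a (N, C, P) PPI array into a DataFrame with C * P columns.
def flatten_ppi_sketch(ppi: np.ndarray) -> pd.DataFrame:
    n_samples, n_channels, n_packets = ppi.shape
    columns = [f'{channel}_{i}'
               for channel in PPI_CHANNELS[:n_channels]
               for i in range(n_packets)]
    # A row-major reshape keeps all packets of one channel next to each other,
    # which matches the order in which the column names are generated above.
    return pd.DataFrame(ppi.reshape(n_samples, n_channels * n_packets), columns=columns)


toy_ppi = np.arange(2 * 3 * 4).reshape(2, 3, 4)  # 2 flows, 3 channels, 4 packets
flat_df = flatten_ppi_sketch(toy_ppi)
```

Each row of `flat_df` then holds one flow, and a value such as `toy_ppi[1, 2, 3]` ends up in column `SIZE_3` of row 1.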
Source code in cesnet_datazoo\\datasets\\cesnet_dataset.py
def get_val_df(self, flatten_ppi: bool = False) -> pd.DataFrame:\n \"\"\"\n Creates validation Pandas [`DataFrame`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html). The dataframe is in sequential (datetime) order.\n\n !!! warning \"Memory usage\"\n\n The whole validation set is loaded into memory. If the dataset size is larger than `'S'`, consider using `get_val_dataloader` instead.\n\n Parameters:\n flatten_ppi: Whether to flatten the PPI sequence into individual columns (named `IPT_X`, `DIR_X`, `SIZE_X`, `PUSH_X`, *X* being the index of the packet) or keep one `PPI` column with 2D data.\n\n Returns:\n Validation data as a dataframe.\n \"\"\"\n self._check_before_dataframe(check_val=True)\n assert self.dataset_config is not None and self.val_dataset is not None\n if len(self.val_dataset) > DATAFRAME_SAMPLES_WARNING_THRESHOLD:\n warnings.warn(f\"Validation set has ({len(self.val_dataset)} samples), consider using get_val_dataloader() instead\")\n feature_names = self.dataset_config.get_feature_names(flatten_ppi=flatten_ppi)\n return create_df_from_dataloader(dataloader=self.get_val_dataloader(),\n feature_names=feature_names,\n flatten_ppi=flatten_ppi,\n silent=self.silent)\n
"},{"location":"reference_cesnet_dataset/#datasets.cesnet_dataset.CesnetDataset.get_test_df","title":"get_test_df","text":"get_test_df(flatten_ppi: bool = False) -> pd.DataFrame\n
Creates a test Pandas DataFrame. The dataframe is in sequential (datetime) order.
When the dataset is used in the open-world setting and unknown classes are defined, the returned test dataframe is composed of `test_known_size` samples of known classes followed by `test_unknown_size` samples of unknown classes.
Memory usage
The whole test set is loaded into memory. If the dataset size is larger than `'S'`, consider using `get_test_dataloader` instead.
Parameters:
- `flatten_ppi` (`bool`, default `False`): Whether to flatten the PPI sequence into individual columns (named `IPT_X`, `DIR_X`, `SIZE_X`, `PUSH_X`, X being the index of the packet) or keep one `PPI` column with 2D data.
Returns:
- `DataFrame`: Test data as a dataframe.
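Because known-class samples come first in the open-world test dataframe, the two parts can be separated by position. The sketch below uses a toy dataframe instead of a real dataset. It assumes that `test_known_size` and `test_unknown_size` are integers equal to the number of rows actually returned, which may not hold when a size is `all` or when fewer samples are available. The `APP` column name is hypothetical.

```python
import pandas as pd

test_known_size = 3    # assumed to equal the number of known-class rows
test_unknown_size = 2  # assumed to equal the number of unknown-class rows

# Toy stand-in for the dataframe returned by get_test_df().
test_df = pd.DataFrame({'APP': ['a', 'b', 'a', 'unknown', 'unknown']})

# Known classes occupy the first rows, unknown classes follow.
known_df = test_df.iloc[:test_known_size]
unknown_df = test_df.iloc[test_known_size:test_known_size + test_unknown_size]
```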
Source code in cesnet_datazoo\\datasets\\cesnet_dataset.py
def get_test_df(self, flatten_ppi: bool = False) -> pd.DataFrame:\n \"\"\"\n Creates test Pandas [`DataFrame`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html). The dataframe is in sequential (datetime) order.\n\n\n When the dataset is used in the open-world setting, and unknown classes are defined,\n the returned test dataframe is composed of `test_known_size` samples of known classes followed by `test_unknown_size` samples of unknown classes.\n\n\n !!! warning \"Memory usage\"\n\n The whole test set is loaded into memory. If the dataset size is larger than `'S'`, consider using `get_test_dataloader` instead.\n\n Parameters:\n flatten_ppi: Whether to flatten the PPI sequence into individual columns (named `IPT_X`, `DIR_X`, `SIZE_X`, `PUSH_X`, *X* being the index of the packet) or keep one `PPI` column with 2D data.\n\n Returns:\n Test data as a dataframe.\n \"\"\"\n self._check_before_dataframe(check_test=True)\n assert self.dataset_config is not None and self.test_dataset is not None\n if len(self.test_dataset) > DATAFRAME_SAMPLES_WARNING_THRESHOLD:\n warnings.warn(f\"Test set has ({len(self.test_dataset)} samples), consider using get_test_dataloader() instead\")\n feature_names = self.dataset_config.get_feature_names(flatten_ppi=flatten_ppi)\n return create_df_from_dataloader(dataloader=self.get_test_dataloader(),\n feature_names=feature_names,\n flatten_ppi=flatten_ppi,\n silent=self.silent)\n
"},{"location":"reference_cesnet_dataset/#datasets.cesnet_dataset.CesnetDataset.get_num_classes","title":"get_num_classes","text":"get_num_classes() -> int\n
Returns the number of classes in the current configuration of the dataset.
Source code in cesnet_datazoo\\datasets\\cesnet_dataset.py
def get_num_classes(self) -> int:\n \"\"\"Returns the number of classes in the current configuration of the dataset.\"\"\"\n if self.class_info is None:\n raise ValueError(\"Dataset is not initialized, use set_dataset_config_and_initialize() before getting the number of classes\")\n return self.class_info.num_classes\n
"},{"location":"reference_cesnet_dataset/#datasets.cesnet_dataset.CesnetDataset.get_known_apps","title":"get_known_apps","text":"get_known_apps() -> list[str]\n
Returns the list of known applications in the current configuration of the dataset.
Source code in cesnet_datazoo\\datasets\\cesnet_dataset.py
def get_known_apps(self) -> list[str]:\n \"\"\"Returns the list of known applications in the current configuration of the dataset.\"\"\"\n if self.class_info is None:\n raise ValueError(\"Dataset is not initialized, use set_dataset_config_and_initialize() before getting known apps\")\n return self.class_info.known_apps\n
"},{"location":"reference_cesnet_dataset/#datasets.cesnet_dataset.CesnetDataset.get_unknown_apps","title":"get_unknown_apps","text":"get_unknown_apps() -> list[str]\n
Returns the list of unknown applications in the current configuration of the dataset.
Source code in cesnet_datazoo\\datasets\\cesnet_dataset.py
def get_unknown_apps(self) -> list[str]:\n \"\"\"Returns the list of unknown applications in the current configuration of the dataset.\"\"\"\n if self.class_info is None:\n raise ValueError(\"Dataset is not initialized, use set_dataset_config_and_initialize() before getting unknown apps\")\n return self.class_info.unknown_apps\n
"},{"location":"reference_cesnet_dataset/#datasets.cesnet_dataset.CesnetDataset.compute_dataset_statistics","title":"compute_dataset_statistics","text":"compute_dataset_statistics(\n num_samples: int | Literal[\"all\"] = 10000000,\n num_workers: int = 4,\n batch_size: int = 16384,\n disabled_apps: Optional[list[str]] = None,\n) -> None\n
Computes dataset statistics and saves them to the `statistics_path` folder.
Parameters:
- `num_samples` (`int | Literal['all']`, default `10000000`): Number of samples to use for computing the statistics.
- `num_workers` (`int`, default `4`): Number of workers for loading data.
- `batch_size` (`int`, default `16384`): Number of samples per batch for loading data.
- `disabled_apps` (`Optional[list[str]]`, default `None`): List of applications to exclude from the statistics.
Source code in cesnet_datazoo\\datasets\\cesnet_dataset.py
def compute_dataset_statistics(self, num_samples: int | Literal[\"all\"] = 10_000_000, num_workers: int = 4, batch_size: int = 16384, disabled_apps: Optional[list[str]] = None) -> None:\n \"\"\"\n Computes dataset statistics and saves them to the `statistics_path` folder.\n\n Parameters:\n num_samples: Number of samples to use for computing the statistics.\n num_workers: Number of workers for loading data.\n batch_size: Number of samples per batch for loading data.\n disabled_apps: List of applications to exclude from the statistics.\n \"\"\"\n if disabled_apps:\n bad_disabled_apps = [a for a in disabled_apps if a not in self.available_classes]\n if len(bad_disabled_apps) > 0:\n raise ValueError(f\"Bad applications in disabled_apps {bad_disabled_apps}. Use applications available in dataset.available_classes\")\n if not os.path.exists(self.statistics_path):\n os.mkdir(self.statistics_path)\n compute_dataset_statistics(database_path=self.database_path,\n tables_app_enum=self._tables_app_enum,\n tables_cat_enum=self._tables_cat_enum,\n output_dir=self.statistics_path,\n packet_histograms=self.metadata.packet_histograms,\n flowstats_features_boolean=self.metadata.flowstats_features_boolean,\n protocol=self.metadata.protocol,\n extra_fields=not self.name.startswith(\"CESNET-TLS22\"),\n disabled_apps=disabled_apps if disabled_apps is not None else [],\n num_samples=num_samples,\n num_workers=num_workers,\n batch_size=batch_size,\n silent=self.silent)\n
"},{"location":"reference_dataset_config/","title":"Config class","text":""},{"location":"reference_dataset_config/#config.DatasetConfig","title":"config.DatasetConfig","text":"The main class for the configuration of:
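- Train, validation, and test sets (dates, sizes, validation approach).
- Application selection: either the standard closed-world setting (only known classes) or the open-world setting (known and unknown classes).
- Data transformations. See the transforms page for more information.
- Dataloader options such as batch sizes, order of loading, or number of workers.
When initializing this class, pass a `CesnetDataset` instance to be configured and the desired configuration. The available options are the configuration attributes listed below.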
Attributes:
- `dataset` (`InitVar[CesnetDataset]`): The dataset instance to be configured.
- `data_root` (`str`): Taken from the dataset instance.
- `database_filename` (`str`): Taken from the dataset instance.
- `database_path` (`str`): Taken from the dataset instance.
- `servicemap_path` (`str`): Taken from the dataset instance.
- `flowstats_features` (`list[str]`): Taken from `dataset.metadata.flowstats_features`.
- `flowstats_features_boolean` (`list[str]`): Taken from `dataset.metadata.flowstats_features_boolean`.
- `flowstats_features_phist` (`list[str]`): Taken from `dataset.metadata.packet_histograms` if `use_packet_histograms` is true, otherwise an empty list.
- `other_fields` (`list[str]`): Taken from `dataset.metadata.other_fields` if `return_other_fields` is true, otherwise an empty list.
Attributes:
- `need_train_set` (`bool`): Use to disable the train set. Default: `True`
- `need_val_set` (`bool`): Use to disable the validation set. When `need_train_set` is false, the validation set will also be disabled. Default: `True`
- `need_test_set` (`bool`): Use to disable the test set. Default: `True`
- `train_period_name` (`str`): Name of the train period. See the instructions below.
- `train_dates` (`list[str]`): Dates used for creating a train set.
- `train_dates_weigths` (`Optional[list[int]]`): Weights for a non-uniform distribution of samples across train dates. Must have the same length as `train_dates`, and `train_size` cannot be `all` when weights are used.
- `val_approach` (`ValidationApproach`): How a validation set should be created. Either split train data into train and validation or have a separate validation period. Default: `SPLIT_FROM_TRAIN`
- `train_val_split_fraction` (`float`): The fraction of validation samples when splitting from the train set. Default: `0.2`
- `val_period_name` (`str`): Name of the validation period. See the instructions below.
- `val_dates` (`list[str]`): Dates used for creating a validation set.
- `test_period_name` (`str`): Name of the test period. See the instructions below.
- `test_dates` (`list[str]`): Dates used for creating a test set.
- `apps_selection` (`AppSelection`): How to select application classes. Default: `ALL_KNOWN`
- `apps_selection_topx` (`int`): Take the top X applications as known.
- `apps_selection_background_unknown` (`list[str]`): Provide a list of background traffic classes to be used as unknown.
- `apps_selection_fixed_known` (`list[str]`): Provide a list of manually selected known applications.
- `apps_selection_fixed_unknown` (`list[str]`): Provide a list of manually selected unknown applications.
- `disabled_apps` (`list[str]`): List of applications to be disabled and not used at all.
- `min_train_samples_check` (`MinTrainSamplesCheck`): How to handle applications with not enough training samples. Default: `DISABLE_APPS`
- `min_train_samples_per_app` (`int`): The number of training samples below which an application counts as having not enough. Default: `100`
- `random_state` (`int`): Fix all random processes performed during dataset initialization. Default: `420`
- `fold_id` (`int`): To perform N-fold cross-validation, set this to `1..N`. Each fold will use the same configuration but a different random seed. Default: `0`
- `train_workers` (`int`): Number of workers for loading train data. `0` means that the data will be loaded in the main process. Default: `4`
- `test_workers` (`int`): Number of workers for loading test data. `0` means that the data will be loaded in the main process. Default: `1`
- `val_workers` (`int`): Number of workers for loading validation data. `0` means that the data will be loaded in the main process. Default: `1`
- `batch_size` (`int`): Number of samples per batch. Default: `192`
- `test_batch_size` (`int`): Number of samples per batch for loading validation and test data. Default: `2048`
- `preload_val` (`bool`): Whether to dump the validation set with `numpy.savez_compressed` and preload it in future runs. Useful when running a lot of experiments with the same dataset configuration. Default: `True`
- `preload_test` (`bool`): Whether to dump the test set with `numpy.savez_compressed` and preload it in future runs. Default: `False`
- `train_size` (`int | Literal['all']`): Size of the train set. See the instructions below. Default: `all`
- `val_known_size` (`int | Literal['all']`): Size of the validation set. See the instructions below. Default: `all`
- `test_known_size` (`int | Literal['all']`): Size of the test set. See the instructions below. Default: `all`
- `val_unknown_size` (`int | Literal['all']`): Size of the unknown classes validation set. Use for evaluation in the open-world setting. Default: `0`
- `test_unknown_size` (`int | Literal['all']`): Size of the unknown classes test set. Use for evaluation in the open-world setting. Default: `0`
- `train_dataloader_order` (`DataLoaderOrder`): Whether to load train data in sequential or random order. Default: `RANDOM`
- `train_dataloader_seed` (`Optional[int]`): Seed for loading train data in random order. Default: `None`
- `return_other_fields` (`bool`): Whether to return auxiliary fields, such as communicating hosts, flow times, and more fields extracted from the ClientHello message. Default: `False`
- `return_tensors` (`bool`): Whether dataloaders return `torch.Tensor`. Dataframes are not available when this option is used. Default: `False`
- `use_packet_histograms` (`bool`): Whether to use packet histogram features, if available in the dataset. Default: `False`
- `use_tcp_features` (`bool`): Whether to use TCP features, if available in the dataset. Default: `False`
- `use_push_flags` (`bool`): Whether to use push flags in packet sequences, if available in the dataset. Default: `False`
- `fit_scalers_samples` (`int | float`): Used when a scaling transformation is configured and requires fitting. A float is the fraction of train samples used for fitting; an integer is the absolute number of samples. Default: `0.25`
- `ppi_transform` (`Optional[Callable]`): Transform function for PPI sequences. See the transforms page for more information. Default: `None`
- `flowstats_transform` (`Optional[Callable]`): Transform function for flow statistics. See the transforms page for more information. Default: `None`
- `flowstats_phist_transform` (`Optional[Callable]`): Transform function for packet histograms. See the transforms page for more information. Default: `None`
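The `fit_scalers_samples` rule (a float is a fraction of the train set, an integer is an absolute count) can be sketched as follows. The helper name and the clipping to the train set size are assumptions for illustration, not the library's implementation.

```python
from __future__ import annotations


# Hypothetical helper: resolves fit_scalers_samples into an absolute number of samples.
def num_fit_samples(fit_scalers_samples: int | float, train_size: int) -> int:
    if isinstance(fit_scalers_samples, float):
        # A float is interpreted as a fraction of the train set.
        return int(fit_scalers_samples * train_size)
    # An integer is interpreted as an absolute count.
    # Clipping to the train set size is an assumption of this sketch.
    return min(fit_scalers_samples, train_size)
```

With the default `0.25` and a train set of 1000 samples, this sketch would fit the scalers on 250 samples.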
There are three options for how to define train/validation/test dates.
1. Choose a predefined time period (`train_period_name`, `val_period_name`, or `test_period_name`) available in `dataset.time_periods` and leave the list of dates (`train_dates`, `val_dates`, or `test_dates`) empty.
2. Provide a list of dates and a name for the time period. The dates are checked against `dataset.available_dates`.
3. Do not specify anything and use the dataset's defaults, `dataset.default_train_period_name` and `dataset.default_test_period_name`.
There are two options for configuring the sizes of train/validation/test sets.
1. Select an appropriate dataset size (default is `S`) when creating the `CesnetDataset` instance and leave `train_size`, `val_known_size`, and `test_known_size` with their default `all` value. This will create train/validation/test sets with all samples available in the selected dataset size (depending on the selected dates and validation approach).
2. Provide exact sizes in `train_size`, `val_known_size`, and `test_known_size`. This will create train/validation/test sets of the given sizes by taking a random subset. This is especially useful when you use the `ORIG` dataset size and want to control the size of experiments.
Tip
The default approach for creating a validation set is to randomly split the train data into train and validation. The second approach is to define separate validation dates. See ValidationApproach.
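The three options for defining dates can be sketched as a standalone function. It follows the checks visible in `DatasetConfig.__post_init__` for train and test dates, but the function name, its signature, and the toy period names are illustrative assumptions. The check against `dataset.available_dates` is left out for brevity.

```python
# Hypothetical helper: returns the (period_name, dates) pair a configuration resolves to.
def resolve_dates(period_name: str, dates: list[str],
                  time_periods: dict[str, list[str]],
                  default_period_name: str) -> tuple[str, list[str]]:
    if dates and not period_name:
        raise ValueError('A period name has to be specified when dates are set')
    if not dates and period_name:
        # Option 1: a predefined time period, dates are looked up by name.
        if period_name not in time_periods:
            raise ValueError(f'Unknown period name {period_name}')
        return period_name, time_periods[period_name]
    if not dates and not period_name:
        # Option 3: nothing specified, fall back to the dataset default.
        return default_period_name, time_periods[default_period_name]
    # Option 2: a custom list of dates together with a name for the period.
    return period_name, dates


# Toy data, not taken from any real dataset.
toy_time_periods = {'WEEK-A': ['20220101', '20220102'], 'WEEK-B': ['20220108']}
predefined = resolve_dates('WEEK-B', [], toy_time_periods, 'WEEK-A')
custom = resolve_dates('MY-DAYS', ['20220102'], toy_time_periods, 'WEEK-A')
default = resolve_dates('', [], toy_time_periods, 'WEEK-A')
```

Passing dates without a period name raises a `ValueError` in this sketch, in line with the error the configuration class raises for train and test dates.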
Source code in cesnet_datazoo\\config.py
@dataclass(config=C)\nclass DatasetConfig():\n \"\"\"\n The main class for the configuration of:\n\n - Train, validation, test sets (dates, sizes, validation approach).\n - Application selection \u2014 either the standard closed-world setting (only *known* classes) or the open-world setting (*known* and *unknown* classes).\n - Data transformations. See the [transforms][transforms] page for more information.\n - Dataloader options like batch sizes, order of loading, or number of workers.\n\n When initializing this class, pass a [`CesnetDataset`][datasets.cesnet_dataset.CesnetDataset] instance to be configured and the desired configuration. Available options are [here][config.DatasetConfig--configuration-options].\n\n Attributes:\n dataset: The dataset instance to be configured.\n data_root: Taken from the dataset instance.\n database_filename: Taken from the dataset instance.\n database_path: Taken from the dataset instance.\n servicemap_path: Taken from the dataset instance.\n flowstats_features: Taken from `dataset.metadata.flowstats_features`.\n flowstats_features_boolean: Taken from `dataset.metadata.flowstats_features_boolean`.\n flowstats_features_phist: Taken from `dataset.metadata.packet_histograms` if `use_packet_histograms` is true, otherwise an empty list.\n other_fields: Taken from `dataset.metadata.other_fields` if `return_other_fields` is true, otherwise an empty list.\n\n # Configuration options\n\n Attributes:\n need_train_set: Use to disable the train set. `Default: True`\n need_val_set: Use to disable the validation set. When `need_train_set` is false, the validation set will also be disabled. `Default: True`\n need_test_set: Use to disable the test set. `Default: True`\n train_period_name: Name of the train period. 
See [instructions][config.DatasetConfig--how-to-configure-train-validation-and-test-sets].\n train_dates: Dates used for creating a train set.\n train_dates_weigths: To use a non-uniform distribution of samples across train dates.\n val_approach: How a validation set should be created. Either split train data into train and validation or have a separate validation period. `Default: SPLIT_FROM_TRAIN`\n train_val_split_fraction: The fraction of validation samples when splitting from the train set. `Default: 0.2`\n val_period_name: Name of the validation period. See [instructions][config.DatasetConfig--how-to-configure-train-validation-and-test-sets].\n val_dates: Dates used for creating a validation set.\n test_period_name: Name of the test period. See [instructions][config.DatasetConfig--how-to-configure-train-validation-and-test-sets].\n test_dates: Dates used for creating a test set.\n\n apps_selection: How to select application classes. `Default: ALL_KNOWN`\n apps_selection_topx: Take top X as known.\n apps_selection_background_unknown: Provide a list of background traffic classes to be used as unknown.\n apps_selection_fixed_known: Provide a list of manually selected known applications.\n apps_selection_fixed_unknown: Provide a list of manually selected unknown applications.\n disabled_apps: List of applications to be disabled and not used at all.\n min_train_samples_check: How to handle applications with *not enough* training samples. `Default: DISABLE_APPS`\n min_train_samples_per_app: Defines the threshold for *not enough*. `Default: 100`\n\n random_state: Fix all random processes performed during dataset initialization. `Default: 420`\n fold_id: To perform N-fold cross-validation, set this to `1..N`. Each fold will use the same configuration but a different random seed. `Default: 0`\n train_workers: Number of workers for loading train data. `0` means that the data will be loaded in the main process. 
`Default: 4`\n test_workers: Number of workers for loading test data. `0` means that the data will be loaded in the main process. `Default: 1`\n val_workers: Number of workers for loading validation data. `0` means that the data will be loaded in the main process. `Default: 1`\n batch_size: Number of samples per batch. `Default: 192`\n test_batch_size: Number of samples per batch for loading validation and test data. `Default: 2048`\n preload_val: Whether to dump the validation set with `numpy.savez_compressed` and preload it in future runs. Useful when running a lot of experiments with the same dataset configuration. `Default: True`\n preload_test: Whether to dump the test set with `numpy.savez_compressed` and preload it in future runs. `Default: False`\n train_size: Size of the train set. See [instructions][config.DatasetConfig--how-to-configure-train-validation-and-test-sets]. `Default: all`\n val_known_size: Size of the validation set. See [instructions][config.DatasetConfig--how-to-configure-train-validation-and-test-sets]. `Default: all`\n test_known_size: Size of the test set. See [instructions][config.DatasetConfig--how-to-configure-train-validation-and-test-sets]. `Default: all`\n val_unknown_size: Size of the unknown classes validation set. Use for evaluation in the open-world setting. `Default: 0`\n test_unknown_size: Size of the unknown classes test set. Use for evaluation in the open-world setting. `Default: 0`\n train_dataloader_order: Whether to load train data in sequential or random order. `Default: RANDOM`\n train_dataloader_seed: Seed for loading train data in random order. `Default: None`\n\n return_other_fields: Whether to return [auxiliary fields][other-fields], such as communicating hosts, flow times, and more fields extracted from the ClientHello message. `Default: False`\n return_tensors: Use for returning `torch.Tensor` from dataloaders. Dataframes are not available when this option is used. 
`Default: False`\n use_packet_histograms: Whether to use packet histogram features, if available in the dataset. `Default: True`\n use_tcp_features: Whether to use TCP features, if available in the dataset. `Default: True`\n use_push_flags: Whether to use push flags in packet sequences, if available in the dataset. `Default: False`\n fit_scalers_samples: Used when scaling transformation is configured and requires fitting. Fraction of train samples used for fitting, if float. The absolute number of samples otherwise. `Default: 0.25`\n ppi_transform: Transform function for PPI sequences. See the [transforms][transforms] page for more information. `Default: None`\n flowstats_transform: Transform function for flow statistics. See the [transforms][transforms] page for more information. `Default: None`\n flowstats_phist_transform: Transform function for packet histograms. See the [transforms][transforms] page for more information. `Default: None`\n\n # How to configure train, validation, and test sets\n There are three options for how to define train/validation/test dates.\n\n 1. Choose a predefined time period (`train_period_name`, `val_period_name`, or `test_period_name`) available in `dataset.time_periods` and leave the list of dates (`train_dates`, `val_dates`, or `test_dates`) empty.\n 2. Provide a list of dates and a name for the time period. The dates are checked against `dataset.available_dates`.\n 3. Do not specify anything and use the dataset's defaults `dataset.default_train_period_name` and `dataset.default_test_period_name`.\n\n There are two options for configuring sizes of train/validation/test sets.\n\n 1. 
Select an appropriate dataset size (default is `S`) when creating the [`CesnetDataset`][datasets.cesnet_dataset.CesnetDataset] instance and leave `train_size`, `val_known_size`, and `test_known_size` with their default `all` value.\n This will create train/validation/test sets with all samples available in the selected dataset size (of course, depending on the selected dates and validation approach).\n 2. Provide exact sizes in `train_size`, `val_known_size`, and `test_known_size`. This will create train/validation/test sets of the given sizes by doing a random subset.\n This is especially useful when using the `ORIG` dataset size and want to control the size of experiments.\n\n !!! tip Validation set\n The default approach for creating a validation set is to randomly split the train data into train and validation. The second approach is to define separate validation dates. See [ValidationApproach][config.ValidationApproach].\n\n \"\"\"\n dataset: InitVar[CesnetDataset]\n data_root: str = field(init=False)\n database_filename: str = field(init=False)\n database_path: str = field(init=False)\n servicemap_path: str = field(init=False)\n flowstats_features: list[str] = field(init=False)\n flowstats_features_boolean: list[str] = field(init=False)\n flowstats_features_phist: list[str] = field(init=False)\n other_fields: list[str] = field(init=False)\n\n need_train_set: bool = True\n need_val_set: bool = True\n need_test_set: bool = True\n train_period_name: str = \"\"\n train_dates: list[str] = field(default_factory=list)\n train_dates_weigths: Optional[list[int]] = None\n val_approach: ValidationApproach = ValidationApproach.SPLIT_FROM_TRAIN\n train_val_split_fraction: float = 0.2\n val_period_name: str = \"\"\n val_dates: list[str] = field(default_factory=list)\n test_period_name: str = \"\"\n test_dates: list[str] = field(default_factory=list)\n\n apps_selection: AppSelection = AppSelection.ALL_KNOWN\n apps_selection_topx: int = 0\n apps_selection_background_unknown: 
list[str] = field(default_factory=list)\n apps_selection_fixed_known: list[str] = field(default_factory=list)\n apps_selection_fixed_unknown: list[str] = field(default_factory=list)\n disabled_apps: list[str] = field(default_factory=list)\n min_train_samples_check: MinTrainSamplesCheck = MinTrainSamplesCheck.DISABLE_APPS\n min_train_samples_per_app: int = 100\n\n random_state: int = 420\n fold_id: int = 0\n train_workers: int = 4\n test_workers: int = 1\n val_workers: int = 1\n batch_size: int = 192\n test_batch_size: int = 2048\n preload_val: bool = True\n preload_test: bool = False\n train_size: int | Literal[\"all\"] = \"all\"\n val_known_size: int | Literal[\"all\"] = \"all\"\n test_known_size: int | Literal[\"all\"] = \"all\"\n val_unknown_size: int | Literal[\"all\"] = 0\n test_unknown_size: int | Literal[\"all\"] = 0\n train_dataloader_order: DataLoaderOrder = DataLoaderOrder.RANDOM\n train_dataloader_seed: Optional[int] = None\n\n return_other_fields: bool = False\n return_tensors: bool = False\n use_packet_histograms: bool = False\n use_tcp_features: bool = False\n use_push_flags: bool = False\n fit_scalers_samples: int | float = 0.25\n ppi_transform: Optional[Callable] = None\n flowstats_transform: Optional[Callable] = None\n flowstats_phist_transform: Optional[Callable] = None\n\n def __post_init__(self, dataset: CesnetDataset):\n \"\"\"\n Ensures valid configuration. 
Catches all incompatible options and raise exceptions as soon as possible.\n \"\"\"\n self.data_root = dataset.data_root\n self.servicemap_path = dataset.servicemap_path\n self.database_filename = dataset.database_filename\n self.database_path = dataset.database_path\n\n if not self.need_train_set:\n self.need_val_set = False\n if self.apps_selection != AppSelection.FIXED:\n raise ValueError(\"Application selection has to be fixed when need_train_set is false\")\n if (len(self.train_dates) > 0 or self.train_period_name != \"\"):\n raise ValueError(\"train_dates and train_period_name cannot be specified when need_train_set is false\")\n else:\n # Configure train dates\n if len(self.train_dates) > 0 and self.train_period_name == \"\":\n raise ValueError(\"train_period_name has to be specified when train_dates are set\")\n if len(self.train_dates) == 0 and self.train_period_name != \"\":\n if self.train_period_name not in dataset.time_periods:\n raise ValueError(f\"Unknown train_period_name {self.train_period_name}. Use time period available in dataset.time_periods\")\n self.train_dates = dataset.time_periods[self.train_period_name]\n if len(self.train_dates) == 0 and self.train_period_name == \"\":\n self.train_period_name = dataset.default_train_period_name\n self.train_dates = dataset.time_periods[dataset.default_train_period_name]\n # Configure test dates\n if not self.need_test_set:\n if (len(self.test_dates) > 0 or self.test_period_name != \"\"):\n raise ValueError(\"test_dates and test_period_name cannot be specified when need_test_set is false\")\n else:\n if len(self.test_dates) > 0 and self.test_period_name == \"\":\n raise ValueError(\"test_period_name has to be specified when test_dates are set\")\n if len(self.test_dates) == 0 and self.test_period_name != \"\":\n if self.test_period_name not in dataset.time_periods:\n raise ValueError(f\"Unknown test_period_name {self.test_period_name}. 
Use time period available in dataset.time_periods\")\n self.test_dates = dataset.time_periods[self.test_period_name]\n if len(self.test_dates) == 0 and self.test_period_name == \"\":\n self.test_period_name = dataset.default_test_period_name\n self.test_dates = dataset.time_periods[dataset.default_test_period_name]\n # Configure val dates\n if (not self.need_val_set or self.val_approach == ValidationApproach.SPLIT_FROM_TRAIN) and (len(self.val_dates) > 0 or self.val_period_name != \"\"):\n raise ValueError(\"val_dates and val_period_name cannot be specified when need_val_set is false or the validation approach is split-from-train\")\n if self.val_approach == ValidationApproach.VALIDATION_DATES:\n if len(self.val_dates) > 0 and self.val_period_name == \"\":\n raise ValueError(\"val_period_name has to be specified when val_dates are set\")\n if len(self.val_dates) == 0 and self.val_period_name != \"\":\n if self.val_period_name not in dataset.time_periods:\n raise ValueError(f\"Unknown val_period_name {self.val_period_name}. Use time period available in dataset.time_periods\")\n self.val_dates = dataset.time_periods[self.val_period_name]\n if len(self.val_dates) == 0 and self.val_period_name == \"\":\n raise ValueError(\"val_period_name and val_dates (or val_period_name from dataset.time_periods) have to be specified when the validation approach is validation-dates\")\n # Check if train, val, and test dates are available in the dataset\n bad_train_dates = [t for t in self.train_dates if t not in dataset.available_dates]\n bad_val_dates = [t for t in self.val_dates if t not in dataset.available_dates]\n bad_test_dates = [t for t in self.test_dates if t not in dataset.available_dates]\n if len(bad_train_dates) > 0:\n raise ValueError(f\"Bad train dates {bad_train_dates}. Use dates available in dataset.available_dates (collection period {dataset.metadata.collection_period})\" \\\n + (f\". 
These dates are missing from the dataset collection period {dataset.metadata.missing_dates_in_collection_period}\" if dataset.metadata.missing_dates_in_collection_period else \"\"))\n if len(bad_val_dates) > 0:\n raise ValueError(f\"Bad validation dates {bad_val_dates}. Use dates available in dataset.available_dates (collection period {dataset.metadata.collection_period})\" \\\n + (f\". These dates are missing from the dataset collection period {dataset.metadata.missing_dates_in_collection_period}\" if dataset.metadata.missing_dates_in_collection_period else \"\"))\n if len(bad_test_dates) > 0:\n raise ValueError(f\"Bad test dates {bad_test_dates}. Use dates available in dataset.available_dates (collection period {dataset.metadata.collection_period})\" \\\n + (f\". These dates are missing from the dataset collection period {dataset.metadata.missing_dates_in_collection_period}\" if dataset.metadata.missing_dates_in_collection_period else \"\"))\n # Check time order of train, val, and test periods\n train_dates = [datetime.strptime(date_str, \"%Y%m%d\").date() for date_str in self.train_dates]\n test_dates = [datetime.strptime(date_str, \"%Y%m%d\").date() for date_str in self.test_dates]\n if len(train_dates) > 0 and len(test_dates) > 0 and min(test_dates) <= max(train_dates):\n warnings.warn(f\"Some test dates ({min(test_dates).strftime('%Y%m%d')}) are before or equal to the last train date ({max(train_dates).strftime('%Y%m%d')}). This might lead to improper evaluation and should be avoided.\")\n if self.val_approach == ValidationApproach.VALIDATION_DATES:\n # Train dates are guaranteed to be set\n val_dates = [datetime.strptime(date_str, \"%Y%m%d\").date() for date_str in self.val_dates]\n if min(val_dates) <= max(train_dates):\n warnings.warn(f\"Some validation dates ({min(val_dates).strftime('%Y%m%d')}) are before or equal to the last train date ({max(train_dates).strftime('%Y%m%d')}). 
This might lead to improper evaluation and should be avoided.\")\n if len(test_dates) > 0 and min(test_dates) <= max(val_dates):\n warnings.warn(f\"Some test dates ({min(test_dates).strftime('%Y%m%d')}) are before or equal to the last validation date ({max(val_dates).strftime('%Y%m%d')}). This might lead to improper evaluation and should be avoided.\")\n # Configure features\n self.flowstats_features = dataset.metadata.flowstats_features\n self.flowstats_features_boolean = dataset.metadata.flowstats_features_boolean\n self.other_fields = dataset.metadata.other_fields if self.return_other_fields else []\n if self.use_packet_histograms:\n if len(dataset.metadata.packet_histograms) == 0:\n raise ValueError(\"This dataset does not support use_packet_histograms\")\n self.flowstats_features_phist = dataset.metadata.packet_histograms\n else:\n self.flowstats_features_phist = []\n if self.flowstats_phist_transform is not None:\n raise ValueError(\"flowstats_phist_transform cannot be specified when use_packet_histograms is false\")\n if dataset.metadata.protocol == Protocol.TLS:\n if self.use_tcp_features:\n self.flowstats_features_boolean = self.flowstats_features_boolean + SELECTED_TCP_FLAGS\n if self.use_push_flags and \"PUSH_FLAG\" not in dataset.metadata.ppi_features:\n raise ValueError(\"This TLS dataset does not support use_push_flags\")\n if dataset.metadata.protocol == Protocol.QUIC:\n if self.use_tcp_features:\n raise ValueError(\"QUIC datasets do not support use_tcp_features\")\n if self.use_push_flags:\n raise ValueError(\"QUIC datasets do not support use_push_flags\")\n # When train_dates_weigths are used, train_size and val_known_size have to be specified\n if self.train_dates_weigths is not None:\n if not self.need_train_set:\n raise ValueError(\"train_dates_weigths cannot be specified when need_train_set is false\")\n if len(self.train_dates_weigths) != len(self.train_dates):\n raise ValueError(\"train_dates_weigths has to have the same length as 
train_dates\")\n if self.train_size == \"all\":\n raise ValueError(\"train_size cannot be 'all' when train_dates_weigths are speficied\")\n if self.val_approach == ValidationApproach.SPLIT_FROM_TRAIN and self.val_known_size == \"all\":\n raise ValueError(\"val_known_size cannot be 'all' when train_dates_weigths are speficied and validation_approach is split-from-train\")\n # App selection\n if self.apps_selection == AppSelection.ALL_KNOWN:\n self.val_unknown_size = 0\n self.test_unknown_size = 0\n if self.apps_selection_topx != 0 or len(self.apps_selection_background_unknown) > 0 or len(self.apps_selection_fixed_known) > 0 or len(self.apps_selection_fixed_unknown) > 0:\n raise ValueError(\"apps_selection_topx, apps_selection_background_unknown, apps_selection_fixed_known, and apps_selection_fixed_unknown cannot be specified when application selection is all-known\")\n if self.apps_selection == AppSelection.TOPX_KNOWN:\n if self.apps_selection_topx == 0:\n raise ValueError(\"apps_selection_topx has to be greater than 0 when application selection is top-x-known\")\n if len(self.apps_selection_background_unknown) > 0 or len(self.apps_selection_fixed_known) > 0 or len(self.apps_selection_fixed_unknown) > 0:\n raise ValueError(\"apps_selection_background_unknown, apps_selection_fixed_known, and apps_selection_fixed_unknown cannot be specified when application selection is top-x-known\")\n if self.apps_selection == AppSelection.BACKGROUND_UNKNOWN:\n if len(self.apps_selection_background_unknown) == 0:\n raise ValueError(\"apps_selection_background_unknown has to be specified when application selection is background-unknown\")\n bad_apps = [a for a in self.apps_selection_background_unknown if a not in dataset.available_classes]\n if len(bad_apps) > 0:\n raise ValueError(f\"Bad applications in apps_selection_background_unknown {bad_apps}. 
Use applications available in dataset.available_classes\")\n if self.apps_selection_topx != 0 or len(self.apps_selection_fixed_known) > 0 or len(self.apps_selection_fixed_unknown) > 0:\n raise ValueError(\"apps_selection_topx, apps_selection_fixed_known, and apps_selection_fixed_unknown cannot be specified when application selection is background-unknown\")\n if self.apps_selection == AppSelection.FIXED:\n if len(self.apps_selection_fixed_known) == 0:\n raise ValueError(\"apps_selection_fixed_known has to be specified when application selection is fixed\")\n bad_apps = [a for a in self.apps_selection_fixed_known + self.apps_selection_fixed_unknown if a not in dataset.available_classes]\n if len(bad_apps) > 0:\n raise ValueError(f\"Bad applications in apps_selection_fixed_known or apps_selection_fixed_unknown {bad_apps}. Use applications available in dataset.available_classes\")\n if len(self.disabled_apps) > 0:\n raise ValueError(\"disabled_apps cannot be specified when application selection is fixed\")\n if self.min_train_samples_per_app != 0 and self.min_train_samples_per_app != 100:\n warnings.warn(\"min_train_samples_per_app is not used when application selection is fixed\")\n if self.apps_selection_topx != 0 or len(self.apps_selection_background_unknown) > 0:\n raise ValueError(\"apps_selection_topx and apps_selection_background_unknown cannot be specified when application selection is fixed\")\n # More asserts\n bad_disabled_apps = [a for a in self.disabled_apps if a not in dataset.available_classes]\n if len(bad_disabled_apps) > 0:\n raise ValueError(f\"Bad applications in disabled_apps {bad_disabled_apps}. 
Use applications available in dataset.available_classes\")\n if isinstance(self.fit_scalers_samples, float) and (self.fit_scalers_samples <= 0 or self.fit_scalers_samples > 1):\n raise ValueError(\"fit_scalers_samples has to be either float between 0 and 1 (giving the fraction of training samples used for fitting scalers) or an integer\")\n\n def get_flowstats_features_len(self) -> int:\n \"\"\"Gets the number of flow statistics features.\"\"\"\n return len(self.flowstats_features) + len(self.flowstats_features_boolean) + PHIST_BIN_COUNT * len(self.flowstats_features_phist)\n\n def get_flowstats_feature_names_expanded(self, shorter_names: bool = False) -> list[str]:\n \"\"\"Gets names of flow statistics features. Packet histograms are expanded into bin features.\"\"\"\n phist_mapping = {\n \"PHIST_SRC_SIZES\": [f\"PSIZE_BIN{i}\" for i in range(1, PHIST_BIN_COUNT + 1)],\n \"PHIST_DST_SIZES\": [f\"PSIZE_BIN{i}_REV\" for i in range(1, PHIST_BIN_COUNT + 1)],\n \"PHIST_SRC_IPT\": [f\"IPT_BIN{i}\" for i in range(1, PHIST_BIN_COUNT + 1)],\n \"PHIST_DST_IPT\": [f\"IPT_BIN{i}_REV\" for i in range(1, PHIST_BIN_COUNT + 1)],\n }\n short_names_mapping = {\n \"FLOW_ENDREASON_IDLE\": \"FEND_IDLE\",\n \"FLOW_ENDREASON_ACTIVE\": \"FEND_ACTIVE\",\n \"FLOW_ENDREASON_END\": \"FEND_END\",\n \"FLOW_ENDREASON_OTHER\": \"FEND_OTHER\",\n \"FLAG_CWR\": \"F_CWR\",\n \"FLAG_CWR_REV\": \"F_CWR_REV\",\n \"FLAG_ECE\": \"F_ECE\",\n \"FLAG_ECE_REV\": \"F_ECE_REV\",\n \"FLAG_PSH_REV\": \"F_PSH_REV\",\n \"FLAG_RST\": \"F_RST\",\n \"FLAG_RST_REV\": \"F_RST_REV\",\n \"FLAG_FIN\": \"F_FIN\",\n \"FLAG_FIN_REV\": \"F_FIN_REV\",\n }\n feature_names = self.flowstats_features[:]\n for f in self.flowstats_features_boolean:\n if shorter_names and f in short_names_mapping:\n feature_names.append(short_names_mapping[f])\n else:\n feature_names.append(f)\n for f in self.flowstats_features_phist:\n feature_names.extend(phist_mapping[f])\n assert len(feature_names) == self.get_flowstats_features_len()\n return 
feature_names\n\n def get_ppi_feature_names(self) -> list[str]:\n \"\"\"Gets the names of flattened PPI features.\"\"\"\n ppi_feature_names = [f\"IPT_{i}\" for i in range(1, PPI_MAX_LEN + 1)] + \\\n [f\"DIR_{i}\" for i in range(1, PPI_MAX_LEN + 1)] + \\\n [f\"SIZE_{i}\" for i in range(1, PPI_MAX_LEN + 1)]\n if self.use_push_flags:\n ppi_feature_names += [f\"PUSH_{i}\" for i in range(1, PPI_MAX_LEN + 1)]\n return ppi_feature_names\n\n def get_ppi_channels(self) -> list[int]:\n \"\"\"Gets the available features (channels) in PPI sequences.\"\"\"\n if self.use_push_flags:\n return TCP_PPI_CHANNELS\n else:\n return UDP_PPI_CHANNELS\n\n def get_feature_names(self, flatten_ppi: bool = False, shorter_names: bool = False) -> list[str]:\n \"\"\"\n Gets feature names.\n\n Parameters:\n flatten_ppi: Whether to flatten PPI into individual feature names or keep one `PPI` column.\n \"\"\"\n feature_names = self.get_ppi_feature_names() if flatten_ppi else [\"PPI\"]\n feature_names += self.get_flowstats_feature_names_expanded(shorter_names=shorter_names)\n return feature_names\n\n def _get_train_tables_paths(self) -> list[str]:\n return list(map(lambda t: f\"/flows/D{t}\", self.train_dates))\n\n def _get_val_tables_paths(self) -> list[str]:\n if self.val_approach == ValidationApproach.SPLIT_FROM_TRAIN:\n return list(map(lambda t: f\"/flows/D{t}\", self.train_dates))\n return list(map(lambda t: f\"/flows/D{t}\", self.val_dates))\n\n def _get_test_tables_paths(self) -> list[str]:\n return list(map(lambda t: f\"/flows/D{t}\", self.test_dates))\n\n def _get_train_data_hash(self) -> str:\n train_data_params = self._get_train_data_params()\n params_hash = hashlib.sha256(json.dumps(dataclasses.asdict(train_data_params), sort_keys=True, default=str).encode()).hexdigest()\n params_hash = params_hash[:10]\n return params_hash\n\n def _get_train_data_path(self) -> str:\n if self.need_train_set:\n params_hash = self._get_train_data_hash()\n return os.path.join(self.data_root, \"train-data\", 
f\"{params_hash}_{self.random_state}\", f\"fold_{self.fold_id}\")\n else:\n return os.path.join(self.data_root, \"train-data\", \"default\")\n\n def _get_train_data_params(self) -> TrainDataParams:\n return TrainDataParams(\n database_filename=self.database_filename,\n train_period_name=self.train_period_name,\n train_tables_paths=self._get_train_tables_paths(),\n apps_selection=self.apps_selection,\n apps_selection_topx=self.apps_selection_topx,\n apps_selection_background_unknown=self.apps_selection_background_unknown,\n apps_selection_fixed_known=self.apps_selection_fixed_known,\n apps_selection_fixed_unknown=self.apps_selection_fixed_unknown,\n disabled_apps=self.disabled_apps,\n min_train_samples_per_app=self.min_train_samples_per_app,\n min_train_samples_check=self.min_train_samples_check,)\n\n def _get_val_data_params_and_path(self, known_apps: list[str], unknown_apps: list[str]) -> tuple[TestDataParams, str]:\n assert self.val_approach == ValidationApproach.VALIDATION_DATES\n val_data_params = TestDataParams(\n database_filename=self.database_filename,\n test_period_name=self.val_period_name,\n test_tables_paths=self._get_val_tables_paths(),\n known_apps=known_apps,\n unknown_apps=unknown_apps,)\n params_hash = hashlib.sha256(json.dumps(dataclasses.asdict(val_data_params), sort_keys=True).encode()).hexdigest()\n params_hash = params_hash[:10]\n val_data_path = os.path.join(self.data_root, \"val-data\", f\"{params_hash}_{self.random_state}\")\n return val_data_params, val_data_path\n\n def _get_test_data_params_and_path(self, known_apps: list[str], unknown_apps: list[str]) -> tuple[TestDataParams, str]:\n test_data_params = TestDataParams(\n database_filename=self.database_filename,\n test_period_name=self.test_period_name,\n test_tables_paths=self._get_test_tables_paths(),\n known_apps=known_apps,\n unknown_apps=unknown_apps,)\n params_hash = hashlib.sha256(json.dumps(dataclasses.asdict(test_data_params), sort_keys=True).encode()).hexdigest()\n params_hash 
= params_hash[:10]\n test_data_path = os.path.join(self.data_root, \"test-data\", f\"{params_hash}_{self.random_state}\")\n return test_data_params, test_data_path\n\n @model_validator(mode=\"before\") # type: ignore\n @classmethod\n def check_deprecated_args(cls, values):\n kwargs = values.kwargs\n if \"train_period\" in kwargs:\n warnings.warn(\"train_period is deprecated. Use train_period_name instead.\")\n kwargs[\"train_period_name\"] = kwargs[\"train_period\"]\n if \"val_period\" in kwargs:\n warnings.warn(\"val_period is deprecated. Use val_period_name instead.\")\n kwargs[\"val_period_name\"] = kwargs[\"val_period\"]\n if \"test_period\" in kwargs:\n warnings.warn(\"test_period is deprecated. Use test_period_name instead.\")\n kwargs[\"test_period_name\"] = kwargs[\"test_period\"]\n return values\n\n def __str__(self):\n _process_tag = yaml.emitter.Emitter.process_tag\n _ignore_aliases = yaml.Dumper.ignore_aliases\n yaml.emitter.Emitter.process_tag = lambda self, *args, **kw: None\n yaml.Dumper.ignore_aliases = lambda self, *args, **kw: True\n s = yaml.dump(dataclasses.asdict(self), sort_keys=False)\n yaml.emitter.Emitter.process_tag = _process_tag\n yaml.Dumper.ignore_aliases = _ignore_aliases\n return s\n
"},{"location":"reference_dataset_config/#config.DatasetConfig-functions","title":"Functions","text":""},{"location":"reference_dataset_config/#config.DatasetConfig.get_flowstats_features_len","title":"get_flowstats_features_len","text":"get_flowstats_features_len() -> int\n
Gets the number of flow statistics features.
Source code in cesnet_datazoo\\config.py
def get_flowstats_features_len(self) -> int:\n \"\"\"Gets the number of flow statistics features.\"\"\"\n return len(self.flowstats_features) + len(self.flowstats_features_boolean) + PHIST_BIN_COUNT * len(self.flowstats_features_phist)\n
"},{"location":"reference_dataset_config/#config.DatasetConfig.get_flowstats_feature_names_expanded","title":"get_flowstats_feature_names_expanded","text":"get_flowstats_feature_names_expanded(\n shorter_names: bool = False,\n) -> list[str]\n
Gets names of flow statistics features. Packet histograms are expanded into bin features.
Source code in cesnet_datazoo\\config.py
def get_flowstats_feature_names_expanded(self, shorter_names: bool = False) -> list[str]:\n \"\"\"Gets names of flow statistics features. Packet histograms are expanded into bin features.\"\"\"\n phist_mapping = {\n \"PHIST_SRC_SIZES\": [f\"PSIZE_BIN{i}\" for i in range(1, PHIST_BIN_COUNT + 1)],\n \"PHIST_DST_SIZES\": [f\"PSIZE_BIN{i}_REV\" for i in range(1, PHIST_BIN_COUNT + 1)],\n \"PHIST_SRC_IPT\": [f\"IPT_BIN{i}\" for i in range(1, PHIST_BIN_COUNT + 1)],\n \"PHIST_DST_IPT\": [f\"IPT_BIN{i}_REV\" for i in range(1, PHIST_BIN_COUNT + 1)],\n }\n short_names_mapping = {\n \"FLOW_ENDREASON_IDLE\": \"FEND_IDLE\",\n \"FLOW_ENDREASON_ACTIVE\": \"FEND_ACTIVE\",\n \"FLOW_ENDREASON_END\": \"FEND_END\",\n \"FLOW_ENDREASON_OTHER\": \"FEND_OTHER\",\n \"FLAG_CWR\": \"F_CWR\",\n \"FLAG_CWR_REV\": \"F_CWR_REV\",\n \"FLAG_ECE\": \"F_ECE\",\n \"FLAG_ECE_REV\": \"F_ECE_REV\",\n \"FLAG_PSH_REV\": \"F_PSH_REV\",\n \"FLAG_RST\": \"F_RST\",\n \"FLAG_RST_REV\": \"F_RST_REV\",\n \"FLAG_FIN\": \"F_FIN\",\n \"FLAG_FIN_REV\": \"F_FIN_REV\",\n }\n feature_names = self.flowstats_features[:]\n for f in self.flowstats_features_boolean:\n if shorter_names and f in short_names_mapping:\n feature_names.append(short_names_mapping[f])\n else:\n feature_names.append(f)\n for f in self.flowstats_features_phist:\n feature_names.extend(phist_mapping[f])\n assert len(feature_names) == self.get_flowstats_features_len()\n return feature_names\n
"},{"location":"reference_dataset_config/#config.DatasetConfig.get_ppi_feature_names","title":"get_ppi_feature_names","text":"get_ppi_feature_names() -> list[str]\n
Gets the names of flattened PPI features.
Source code in cesnet_datazoo\\config.py
def get_ppi_feature_names(self) -> list[str]:\n \"\"\"Gets the names of flattened PPI features.\"\"\"\n ppi_feature_names = [f\"IPT_{i}\" for i in range(1, PPI_MAX_LEN + 1)] + \\\n [f\"DIR_{i}\" for i in range(1, PPI_MAX_LEN + 1)] + \\\n [f\"SIZE_{i}\" for i in range(1, PPI_MAX_LEN + 1)]\n if self.use_push_flags:\n ppi_feature_names += [f\"PUSH_{i}\" for i in range(1, PPI_MAX_LEN + 1)]\n return ppi_feature_names\n
"},{"location":"reference_dataset_config/#config.DatasetConfig.get_ppi_channels","title":"get_ppi_channels","text":"get_ppi_channels() -> list[int]\n
Gets the available features (channels) in PPI sequences.
Source code in cesnet_datazoo\\config.py
def get_ppi_channels(self) -> list[int]:\n \"\"\"Gets the available features (channels) in PPI sequences.\"\"\"\n if self.use_push_flags:\n return TCP_PPI_CHANNELS\n else:\n return UDP_PPI_CHANNELS\n
"},{"location":"reference_dataset_config/#config.DatasetConfig.get_feature_names","title":"get_feature_names","text":"get_feature_names(\n flatten_ppi: bool = False, shorter_names: bool = False\n) -> list[str]\n
Gets feature names.
Parameters:
Name | Type | Description | Default
flatten_ppi | bool | Whether to flatten PPI into individual feature names or keep one PPI column. | False
Source code in cesnet_datazoo\\config.py
def get_feature_names(self, flatten_ppi: bool = False, shorter_names: bool = False) -> list[str]:\n \"\"\"\n Gets feature names.\n\n Parameters:\n flatten_ppi: Whether to flatten PPI into individual feature names or keep one `PPI` column.\n \"\"\"\n feature_names = self.get_ppi_feature_names() if flatten_ppi else [\"PPI\"]\n feature_names += self.get_flowstats_feature_names_expanded(shorter_names=shorter_names)\n return feature_names\n
"},{"location":"reference_dataset_config/#enums-for-configuration","title":"Enums for configuration","text":"The following enums are used for dataset configuration.
"},{"location":"reference_dataset_config/#config.ValidationApproach","title":"config.ValidationApproach","text":"The validation approach defines which samples should be used for creating a validation set.
SPLIT_FROM_TRAIN
SPLIT_FROM_TRAIN = 'split-from-train'\n
Split train data into train and validation. Scikit-learn train_test_split is used to create a random stratified validation set. The fraction of validation samples is defined in train_val_split_fraction.
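The stratified split described above can be illustrated without scikit-learn. The following is a minimal standard-library sketch, not the library's implementation: the function name `stratified_split` and its parameters are hypothetical, and the per-class rounding is an assumption that may differ from what `train_test_split` does internally.

```python
import random
from collections import defaultdict


def stratified_split(labels, val_fraction, seed=0):
    """Split sample indices into train and validation, class by class.

    Hypothetical helper: it only mimics the idea of a random stratified
    split, so each class keeps roughly the same share in both sets.
    """
    rng = random.Random(seed)
    indices_by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        indices_by_class[label].append(idx)

    train_idx, val_idx = [], []
    for indices in indices_by_class.values():
        rng.shuffle(indices)
        # Assumption: round the per-class validation count to the nearest integer
        n_val = round(len(indices) * val_fraction)
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return sorted(train_idx), sorted(val_idx)


labels = ["app-a"] * 10 + ["app-b"] * 20
train_idx, val_idx = stratified_split(labels, val_fraction=0.2)
print(len(train_idx), len(val_idx))
```

In this sketch, the validation set takes the requested fraction from every class separately, which is what makes the split stratified.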
VALIDATION_DATES
VALIDATION_DATES = 'validation-dates'\n
Use separate validation dates to create a validation set. Validation dates need to be specified in val_dates, and the name of the validation period in val_period_name.
Applications can be divided into known and unknown classes. To use a dataset in the standard closed-world setting, use ALL_KNOWN to select all the applications as known. Use TOPX_KNOWN or BACKGROUND_UNKNOWN for the open-world setting and the evaluation of out-of-distribution or open-set recognition methods. FIXED is for manual selection of known and unknown applications.
ALL_KNOWN
ALL_KNOWN = 'all-known'\n
Use all applications as known.
TOPX_KNOWN
TOPX_KNOWN = 'topx-known'\n
Use the first X (apps_selection_topx) most frequent (with the most samples) applications as known, and the rest as unknown. Applications with the same provider are never separated, i.e., all applications of a given provider are either known or unknown.
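The top-X selection can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the package's code: `split_topx_known` and the `providers` mapping are hypothetical names, and the sketch assumes that siblings of a known application's provider are added to the known set (so the known set can end up larger than X).

```python
from collections import Counter


def split_topx_known(labels, providers, topx):
    """Return (known, unknown) application lists.

    labels:    one application label per sample
    providers: hypothetical mapping app -> provider (apps may be missing)
    topx:      number of most frequent applications to start from
    """
    counts = Counter(labels)
    top_apps = [app for app, _ in counts.most_common(topx)]
    # Assumption: a provider becomes known as soon as one of its apps is in the top X
    known_providers = {providers[app] for app in top_apps if app in providers}
    known = [
        app for app in counts
        if app in top_apps or providers.get(app) in known_providers
    ]
    unknown = [app for app in counts if app not in known]
    return known, unknown


labels = ["a"] * 5 + ["b"] * 3 + ["c"] * 2 + ["d"] * 1
providers = {"a": "provider-1", "d": "provider-1"}
print(split_topx_known(labels, providers, topx=2))
```

Here `d` follows `a` into the known set because both belong to the same (made-up) provider, while `c` stays unknown.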
BACKGROUND_UNKNOWN
BACKGROUND_UNKNOWN = 'background-unknown'\n
Use the list of background traffic classes (apps_selection_background_unknown) as unknown, and the rest as known.
FIXED
FIXED = 'fixed'\n
Manual application selection. Provide lists of known applications (apps_selection_fixed_known) and unknown applications (apps_selection_fixed_unknown).
Depending on the selected train dates, some applications might not have enough samples for training (what counts as enough depends on the selected classification model). The threshold for the minimum number of samples is set with min_train_samples_per_app and defaults to 100. With the DISABLE_APPS approach, applications below the threshold are disabled and not used for training or testing. With the WARN_AND_EXIT approach, the script prints a warning and exits if such applications are encountered. To disable this check, set min_train_samples_per_app to 0.
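The two approaches can be contrasted with a small sketch. It is only an illustration: the function `apps_to_disable` is hypothetical, the string values mirror the enum values listed on this page, and treating "not enough" as strictly fewer than the threshold is an assumption.

```python
import warnings
from collections import Counter


def apps_to_disable(train_labels, min_train_samples_per_app=100, check="disable-apps"):
    """Return applications with too few training samples.

    Hypothetical helper. A threshold of 0 turns the check off.
    """
    if min_train_samples_per_app == 0:
        return []
    counts = Counter(train_labels)
    # Assumption: an application fails the check when it has fewer samples than the threshold
    too_small = sorted(app for app, n in counts.items() if n < min_train_samples_per_app)
    if too_small and check == "warn-and-exit":
        warnings.warn(f"Not enough training samples for {too_small}")
        raise SystemExit(1)
    return too_small


train_labels = ["web"] * 150 + ["mail"] * 99 + ["dns"] * 5
print(apps_to_disable(train_labels))  # the 'disable-apps' behaviour
```

With `check="warn-and-exit"`, the same call would warn and stop instead of returning the list, leaving it to the user to add the applications to disabled_apps.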
WARN_AND_EXIT
WARN_AND_EXIT = 'warn-and-exit'\n
Warn and exit if there are not enough training samples for some applications. It is up to the user to manually add these applications to disabled_apps.
DISABLE_APPS
DISABLE_APPS = 'disable-apps'\n
Disable applications with not enough training samples.
"},{"location":"reference_dataset_config/#config.DataLoaderOrder","title":"config.DataLoaderOrder","text":"Validation and test sets are always loaded in sequential order \u2014 sequential meaning in the order of dates and time. However, for the train set, it is sometimes required to iterate it in random order (for example, for training a neural network). Thus, use RANDOM
if your classification model requires it; SEQUENTIAL otherwise. This setting affects only train_dataloader. The dataframe returned by get_train_df is always created in sequential order.
RANDOM
RANDOM = 'random'\n
Iterate train data in random order.
SEQUENTIAL
SEQUENTIAL = 'sequential'\n
Iterate train data in sequential (datetime) order.
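The difference between the two orders can be shown with a toy batch iterator. This is a sketch only: `iterate_batches` is a hypothetical function, and the real train dataloader may shuffle at a different granularity than individual samples.

```python
import random


def iterate_batches(samples, batch_size, order="sequential", seed=0):
    """Yield batches of samples in sequential or random order (illustrative only)."""
    indices = list(range(len(samples)))
    if order == "random":
        # Assumption: shuffle individual sample indices before batching
        random.Random(seed).shuffle(indices)
    for start in range(0, len(indices), batch_size):
        yield [samples[i] for i in indices[start:start + batch_size]]


samples = list(range(7))  # stand-ins for flows sorted by time
print(list(iterate_batches(samples, batch_size=3)))
print(list(iterate_batches(samples, batch_size=3, order="random")))
```

Sequential iteration preserves the time order of the samples, whereas random iteration visits every sample once per pass in a shuffled order.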
"},{"location":"reference_datasets/","title":"Dataset classes","text":"These are subclasses of CesnetDataset
representing individual datasets available in cesnet-datazoo
.
Bases: CesnetDataset
Dataset class for CESNET-TLS22.
Source code in cesnet_datazoo\\datasets\\datasets.py
class CESNET_TLS22(CesnetDataset):\n \"\"\"Dataset class for [CESNET-TLS22][cesnet-tls22].\"\"\"\n name = \"CESNET-TLS22\"\n database_filename = \"CESNET-TLS22.h5\"\n bucket_url = \"https://liberouter.org/datazoo/download?bucket=cesnet-tls22\"\n available_dates = _CESNET_TLS22_AVAILABLE_DATES\n time_periods = {\n \"W-2021-40\": [\"20211004\", \"20211005\", \"20211006\", \"20211007\", \"20211008\", \"20211009\", \"20211010\"],\n \"W-2021-41\": [\"20211011\", \"20211012\", \"20211013\", \"20211014\", \"20211015\", \"20211016\", \"20211017\"],\n }\n default_train_period_name = \"W-2021-40\"\n default_test_period_name = \"W-2021-41\"\n _tables_app_enum = _CESNET_TLS22_TABLES_APP_ENUM\n _tables_cat_enum = _CESNET_TLS22_TABLES_CATEGORY_ENUM\n
"},{"location":"reference_datasets/#datasets.datasets.CESNET_QUIC22","title":"datasets.datasets.CESNET_QUIC22","text":" Bases: CesnetDataset
Dataset class for CESNET-QUIC22.
Source code in cesnet_datazoo\\datasets\\datasets.py
class CESNET_QUIC22(CesnetDataset):\n \"\"\"Dataset class for [CESNET-QUIC22][cesnet-quic22].\"\"\"\n name = \"CESNET-QUIC22\"\n database_filename = \"CESNET-QUIC22.h5\"\n bucket_url = \"https://liberouter.org/datazoo/download?bucket=cesnet-quic22\"\n available_dates = _CESNET_QUIC22_AVAILABLE_DATES\n time_periods = {\n \"W-2022-44\": [\"20221031\", \"20221101\", \"20221102\", \"20221103\", \"20221104\", \"20221105\", \"20221106\"],\n \"W-2022-45\": [\"20221107\", \"20221108\", \"20221109\", \"20221110\", \"20221111\", \"20221112\", \"20221113\"],\n \"W-2022-46\": [\"20221114\", \"20221115\", \"20221116\", \"20221117\", \"20221118\", \"20221119\", \"20221120\"],\n \"W-2022-47\": [\"20221121\", \"20221122\", \"20221123\", \"20221124\", \"20221125\", \"20221126\", \"20221127\"],\n \"W45-47\": [\"20221107\", \"20221108\", \"20221109\", \"20221110\", \"20221111\", \"20221112\", \"20221113\",\n \"20221114\", \"20221115\", \"20221116\", \"20221117\", \"20221118\", \"20221119\", \"20221120\",\n \"20221121\", \"20221122\", \"20221123\", \"20221124\", \"20221125\", \"20221126\", \"20221127\"],\n }\n default_train_period_name = \"W-2022-44\"\n default_test_period_name = \"W-2022-45\"\n _tables_app_enum = _CESNET_QUIC22_TABLES_APP_ENUM\n _tables_cat_enum = _CESNET_QUIC22_TABLES_CATEGORY_ENUM\n
"},{"location":"reference_datasets/#datasets.datasets.CESNET_TLS_Year22","title":"datasets.datasets.CESNET_TLS_Year22","text":" Bases: CesnetDataset
Dataset class for CESNET-TLS-Year22.
Source code in cesnet_datazoo\\datasets\\datasets.py
class CESNET_TLS_Year22(CesnetDataset):\n \"\"\"Dataset class for [CESNET-TLS-Year22][cesnet-tls-year22].\"\"\"\n name = \"CESNET-TLS-Year22\"\n database_filename = \"CESNET-TLS-Year22.h5\"\n bucket_url = \"https://liberouter.org/datazoo/download?bucket=cesnet-tls-year22\"\n available_dates = _CESNET_TLS_YEAR22_AVAILABLE_DATES\n time_periods = _CESNET_TLS_YEAR22_TIME_PERIODS\n default_train_period_name = \"M-2022-9\"\n default_test_period_name = \"M-2022-10\"\n _tables_app_enum = _CESNET_TLS_YEAR22_TABLES_APP_ENUM\n _tables_cat_enum = _CESNET_TLS_YEAR22_TABLES_CATEGORY_ENUM\n
"},{"location":"transforms/","title":"Transforms","text":"The cesnet_datazoo
package supports configurable transforms of input data, similar to what torchvision provides for computer vision. Input features are split into three groups, each with its own transformation: PPI sequences, flow statistics, and packet histograms.
ppi_transform of DatasetConfig is applied to PPI sequences.
flowstats_transform is applied to flow statistics (excluding boolean features, such as flow end reasons or TCP flags).
flowstats_phist_transform is applied to packet histograms.
Transforms are implemented in a separate package CESNET Models. See cesnet_models.transforms documentation for details.
Limitations
The current implementation does not support the composing of transformations.
"},{"location":"transforms/#available-transformations","title":"Available transformations","text":"PPI sequences
Flow statistics
Packet histograms
More transformations will be implemented in future versions.
"},{"location":"transforms/#data-scaling","title":"Data scaling","text":"Transformations implementing data scaling will be fitted, if needed, on a subset of training data during dataset initialization.
"}]} \ No newline at end of file +{"config":{"lang":["en"],"separator":"[\\s\\-]+","pipeline":["stopWordFilter"]},"docs":[{"location":"","title":"CESNET DataZoo","text":"This is the documentation of the CESNET DataZoo project.
The goal of this project is to provide tools for working with large network traffic datasets and to facilitate research in the traffic classification area. The core functions of the cesnet-datazoo
package are:
cesnet_models.transforms
documentation for details.S
size containing 25 million samples. Apart from loading data into dataframes, the cesnet-datazoo
package provides dataloaders for processing data in smaller batches.
An example of how dataloaders can be used is in cesnet_datazoo.datasets.loaders
or in the following snippet:
import numpy as np\nimport pandas as pd\nfrom torch.utils.data import DataLoader\n\n\ndef load_from_dataloader(dataloader: DataLoader):\n    # Collect each part of the batch tuple separately\n    other_fields = []\n    data_ppi = []\n    data_flowstats = []\n    labels = []\n    for batch_other_fields, batch_ppi, batch_flowstats, batch_labels in dataloader:\n        other_fields.append(batch_other_fields)\n        data_ppi.append(batch_ppi)\n        data_flowstats.append(batch_flowstats)\n        labels.append(batch_labels)\n    # Merge the per-batch parts into one DataFrame and three arrays\n    df_other_fields = pd.concat(other_fields, ignore_index=True)\n    data_ppi = np.concatenate(data_ppi)\n    data_flowstats = np.concatenate(data_flowstats)\n    labels = np.concatenate(labels)\n    return df_other_fields, data_ppi, data_flowstats, labels\n
When a dataloader is iterated, the returned data are in the format tuple(batch_other_fields, batch_ppi, batch_flowstats, batch_labels). The batch size B is configured with the batch_size and test_batch_size config options. The shapes are:
batch_other_fields: pd.DataFrame (B, C) - a Pandas DataFrame with auxiliary fields, such as communicating hosts, flow times, and more fields extracted from the ClientHello message. If the return_other_fields config option is false, this will be an empty DataFrame. Columns C depend on the used dataset and are available at dataset_config.other_fields.
batch_ppi: np.ndarray (B, [3, 4], 30) - the middle dimension is 4 when TCP push flags are used (use_push_flags) and 3 otherwise.
batch_flowstats: np.ndarray (B, F) - where F is the number of flowstats features computed with DatasetConfig.get_flowstats_features_len. To get the order and names of flowstats features, call DatasetConfig.get_flowstats_feature_names_expanded. The batch_flowstats array includes flow statistics, TCP features (if available and configured), and bins of packet histograms (if available and configured). See the data features page for more information about features.
batch_labels: np.ndarray (B) - integer labels encoded with a LabelEncoder instance available at dataset.class_info.encoder.
PPI and flow statistics features returned from dataloaders are transformed depending on the selected configuration. See the transforms page for more information.
"},{"location":"dataset_metadata/","title":"DatasetMetadata","text":"Each dataset class has its metadata available as a DatasetMetadata
instance in the metadata
attribute.
CESNET-TLS22
This dataset was published in \"Fine-grained TLS services classification with reject option\" (DOI, arXiv). It was built from live traffic collected using high-speed monitoring probes at the perimeter of the CESNET2 network.
For detailed information about the dataset, see the linked paper and the dataset metadata page.
"},{"location":"datasets_overview/#cesnet-quic22","title":"CESNET-QUIC22","text":"CESNET-QUIC22
This dataset was published in \"CESNET-QUIC22: A large one-month QUIC network traffic dataset from backbone lines\" (DOI). The QUIC protocol has the potential to replace TLS over TCP as the standard protocol for reliable and secure Internet communication. Due to its design that makes the inspection of connection handshakes challenging and its usage in HTTP/3, there is an increasing demand for QUIC traffic classification methods.
For detailed information about the dataset, see the linked paper and the dataset metadata page. Experiments based on this dataset were published in \"Encrypted traffic classification: the QUIC case\" (DOI).
"},{"location":"datasets_overview/#cesnet-tls-year22","title":"CESNET-TLS-Year22","text":"CESNET-TLS-Year22
This dataset is similar to CESNET-TLS22; however, it spans the entire year 2022. It will be published in the near future.
"},{"location":"features/","title":"Features","text":"This page provides a description of individual data features in the datasets. Features available in each dataset are listed on the dataset metadata page.
"},{"location":"features/#ppi-sequence","title":"PPI sequence","text":"A per-packet information (PPI) sequence is a 2D matrix describing the first 30 packets of a flow. For flows shorter than 30 packets, the PPI sequence is padded with zeros. Set use_push_flags
to include PUSH flags in PPI sequences, if they are available in the dataset.
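The zero padding of short flows can be illustrated with plain lists. This is a sketch under assumptions: `build_ppi` is a hypothetical helper, and the channel order (inter-packet times, directions, sizes, then optional PUSH flags) is taken from the order of names in get_ppi_feature_names shown in this documentation, not from a statement about the stored data layout.

```python
PPI_MAX_LEN = 30  # PPI sequences describe the first 30 packets of a flow


def build_ppi(ipts, dirs, sizes, push_flags=None):
    """Build a channels x 30 PPI matrix, padded with zeros (illustrative only)."""
    channels = [ipts, dirs, sizes]
    if push_flags is not None:
        channels.append(push_flags)  # fourth channel when PUSH flags are used
    ppi = []
    for channel in channels:
        truncated = list(channel[:PPI_MAX_LEN])  # keep at most the first 30 packets
        ppi.append(truncated + [0] * (PPI_MAX_LEN - len(truncated)))
    return ppi


# A three-packet flow: inter-packet times, directions (+1/-1), and sizes
ppi = build_ppi([0, 10, 5], [1, -1, 1], [100, 200, 50])
print(len(ppi), len(ppi[0]), ppi[2][:5])
```

Flows longer than 30 packets are cut to the first 30 in this sketch, so every flow yields a matrix of the same shape.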
Flow statistics are standard features describing the entire flow (with the exception of PPI_ features, which relate to the PPI sequence of the given flow). _REV features correspond to the reverse (server to client) direction.
Name | Description
DURATION | Duration of the flow in seconds
BYTES | Number of transmitted bytes from client to server
BYTES_REV | Number of transmitted bytes from server to client
PACKETS | Number of packets transmitted from client to server
PACKETS_REV | Number of packets transmitted from server to client
PPI_LEN | Number of packets in the PPI sequence
PPI_DURATION | Duration of the PPI sequence in seconds
PPI_ROUNDTRIPS | Number of roundtrips in the PPI sequence
FLOW_ENDREASON_IDLE | Flow was terminated because it was idle
FLOW_ENDREASON_ACTIVE | Flow was terminated because it reached the active timeout
FLOW_ENDREASON_OTHER | Flow was terminated for other reasons
"},{"location":"features/#packet-histograms","title":"Packet histograms","text":"Packet histograms include binned counts of packet sizes and inter-packet times of the entire flow. There are 8 bins with a logarithmic scale; the intervals are 0\u201315, 16\u201331, 32\u201363, 64\u2013127, 128\u2013255, 256\u2013511, 512\u20131024, >1024 [ms or B]. The units are milliseconds for inter-packet times and bytes for packet sizes. The histograms are built from all packets of the entire flow, unlike PPI sequences that describe the first 30 packets. Set use_packet_histograms to use packet histogram features, if available in the dataset.
On the dataset metadata page, packet histogram features are called PHIST_SRC_SIZES
, PHIST_DST_SIZES
, PHIST_SRC_IPT
, PHIST_DST_IPT
. Those are the names of database columns that are flattened to the _BIN{x} features.
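The binning into eight logarithmic intervals can be sketched as follows. This is an illustration, not the exporter's code: `packet_histogram` and `PHIST_BIN_UPPER_BOUNDS` are hypothetical names, and placing the value 1024 into the seventh bin follows the interval list on this page (512–1024, then >1024).

```python
from bisect import bisect_left

# Inclusive upper bounds of the first seven bins; the eighth bin holds values > 1024
PHIST_BIN_UPPER_BOUNDS = [15, 31, 63, 127, 255, 511, 1024]


def packet_histogram(values):
    """Count values (packet sizes in bytes or inter-packet times in ms) per bin."""
    histogram = [0] * (len(PHIST_BIN_UPPER_BOUNDS) + 1)
    for value in values:
        # bisect_left returns the index of the first upper bound >= value
        histogram[bisect_left(PHIST_BIN_UPPER_BOUNDS, value)] += 1
    return histogram


sizes = [0, 15, 16, 100, 1024, 1025, 1500]
bin_names = [f"PSIZE_BIN{i}" for i in range(1, 9)]
print(dict(zip(bin_names, packet_histogram(sizes))))
```

The resulting eight counts per histogram correspond to the _BIN{x} features mentioned above, which is why each packet histogram contributes eight columns to the flow statistics.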
Datasets with TLS over TCP traffic contain features indicating the presence of individual TCP flags in the flow. Set use_tcp_features to use the subset of flags defined in cesnet_datazoo.constants.SELECTED_TCP_FLAGS.
Datasets contain auxiliary information about samples, such as communicating hosts, flow times, and more fields extracted from the ClientHello message. The dataset metadata page lists available fields in individual datasets. Set return_other_fields
to include those fields in returned dataframes. See using dataloaders for how other fields are handled in dataloaders.
Due to differences in implementation between the packet sequences (pstats.cpp) and packet histogram (phist.cpp) plugins of the ipfixprobe exporter, the number of packets in PPI sequences and histograms can differ (even for flows shorter than 30 packets). The differences are summarized in the following table. Note that this applies to TLS over TCP datasets.
TLS over TCP datasets | Packet histograms | PPI sequence | PACKETS and PACKETS_REV
Zero-length packets (without L4 payload, e.g. ACKs) | Not included | Not included | Included
Retransmissions (and out-of-order packets) | Included | Not included* | Included
Computed from | Entire flow | First 30 packets | Entire flow
*The implementation for the detection of TCP retransmissions and out-of-order packets is far from perfect. Packets with a non-increasing SEQ number are skipped.
For QUIC, there is no detection of retransmissions or out-of-order packets, and QUIC acknowledgment packets are included in both packet sequences and packet histograms.
"},{"location":"getting_started/","title":"Getting started","text":""},{"location":"getting_started/#jupyter-notebooks","title":"Jupyter notebooks","text":"Example Jupyter notebooks are provided at https://github.com/CESNET/cesnet-tcexamples. Start with:
from cesnet_datazoo.datasets import CESNET_QUIC22\ndataset = CESNET_QUIC22(\"/datasets/CESNET-QUIC22/\", size=\"XS\")\ndataset.compute_dataset_statistics(num_samples=100_000, num_workers=0)\n
This will download the dataset, compute dataset statistics, and save them into /datasets/CESNET-QUIC22/XS/statistics
."},{"location":"getting_started/#enable-logging-and-set-the-spawn-method-on-windows","title":"Enable logging and set the spawn method on Windows","text":"import logging\nimport multiprocessing as mp\n\nmp.set_start_method(\"spawn\") \nlogging.basicConfig(\n level=logging.INFO,\n format=\"[%(asctime)s][%(name)s][%(levelname)s] - %(message)s\")\n
For running on Windows, we recommend using the spawn
method for creating dataloader worker processes. Set up logging to get more information from the package."},{"location":"getting_started/#initialize-dataset-to-create-train-validation-and-test-dataframes","title":"Initialize dataset to create train, validation, and test dataframes","text":"from cesnet_datazoo.datasets import CESNET_QUIC22\nfrom cesnet_datazoo.config import DatasetConfig, AppSelection\n\ndataset = CESNET_QUIC22(\"/datasets/CESNET-QUIC22/\", size=\"XS\")\ndataset_config = DatasetConfig(\n dataset=dataset,\n apps_selection=AppSelection.ALL_KNOWN,\n train_period_name=\"W-2022-44\",\n test_period_name=\"W-2022-45\",\n)\ndataset.set_dataset_config_and_initialize(dataset_config)\ntrain_dataframe = dataset.get_train_df()\nval_dataframe = dataset.get_val_df()\ntest_dataframe = dataset.get_test_df()\n
The DatasetConfig
class handles the configuration of datasets, and calling set_dataset_config_and_initialize
initializes train, validation, and test sets with the desired configuration. Data can be read into Pandas DataFrames as shown here or via PyTorch DataLoaders. See CesnetDataset
reference.
Install the package from pip with:
pip install cesnet-datazoo\n
or, for an editable install, with:
pip install -e git+https://github.com/CESNET/cesnet-datazoo#egg=cesnet-datazoo\n
"},{"location":"installation/#requirements","title":"Requirements","text":"The cesnet-datazoo
package requires Python >=3.10.
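The main class for accessing CESNET datasets. It handles downloading, train/validation/test splitting, and class selection. Access to data is provided through:
- Iterable PyTorch DataLoader for batch processing. See using dataloaders for more details.
- Pandas DataFrame for loading the entire train, validation, or test set at once.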
The dataset is stored in a PyTables database. The internal PyTablesDataset
class is used as a wrapper that implements the PyTorch Dataset
interface and is compatible with DataLoader
, which provides efficient parallel loading of the data. The dataset configuration is done through the DatasetConfig
class.
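The source listing further below creates dataloaders with `batch_size=None` and a batch sampler, so the wrapped dataset is asked for a whole batch of indices at once. The following sketch shows that access pattern with a pure-Python stand-in. `InMemoryBatchDataset` is a hypothetical class for illustration; the internals of `PyTablesDataset` are not shown in this document and may differ.

```python
from typing import Sequence


class InMemoryBatchDataset:
    """Map-style dataset stand-in: supports len() and indexing by a batch of indices."""

    def __init__(self, features: Sequence[list[float]], labels: Sequence[int]) -> None:
        if len(features) != len(labels):
            raise ValueError("features and labels must have the same length")
        self.features = features
        self.labels = labels

    def __len__(self) -> int:
        return len(self.labels)

    def __getitem__(self, batch_idx: list[int]) -> tuple[list[list[float]], list[int]]:
        # One call returns the whole batch, so a storage backend could
        # serve it with a single read instead of one read per sample.
        batch_features = [self.features[i] for i in batch_idx]
        batch_labels = [self.labels[i] for i in batch_idx]
        return batch_features, batch_labels


dataset = InMemoryBatchDataset(features=[[0.1], [0.2], [0.3]], labels=[0, 1, 0])
batch_features, batch_labels = dataset[[0, 2]]
```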
Intended usage:
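1. Create an instance of the dataset class with the desired size and data root. This will download the dataset if it has not already been downloaded.
2. Create an instance of DatasetConfig and set it with set_dataset_config_and_initialize. This will initialize the dataset: select classes, split data into train/validation/test sets, and fit data scalers if needed. All is done according to the provided configuration and is cached for later use.
3. Use get_train_dataloader or get_train_df to get training data for a classification model.
4. Validate the model and perform hyperparameter optimization on get_val_dataloader or get_val_df.
5. Evaluate the model on get_test_dataloader or get_test_df.

Parameters: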
| Name | Type | Description | Default |
| --- | --- | --- | --- |
| data_root | str | Path to the folder where the dataset will be stored. Each dataset size has its own subfolder data_root/size. | (required) |
| size | str | Size of the dataset. Options are XS, S, M, L, ORIG. | 'S' |
| silent | bool | Whether to suppress print and tqdm output. | False |
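How `data_root` and `size` combine into on-disk locations can be sketched as follows, loosely following the `__init__` method in the source listing below. The function name `resolve_dataset_paths` and the filename `example.h5` are hypothetical and used only for illustration.

```python
import os


def resolve_dataset_paths(data_root: str, size: str, database_filename: str) -> dict[str, str]:
    """Derive the size-specific folder and file paths for a dataset."""
    # Each dataset size lives in its own subfolder of the data root.
    root = os.path.normpath(os.path.expanduser(os.path.join(data_root, size)))
    # Subsampled sizes carry the size as a filename suffix; ORIG does not.
    if size != "ORIG":
        stem, ext = os.path.splitext(database_filename)
        database_filename = f"{stem}-{size}{ext}"
    return {
        "data_root": root,
        "database_path": os.path.join(root, database_filename),
        "statistics_path": os.path.join(root, "statistics"),
    }


paths = resolve_dataset_paths("/datasets/CESNET-QUIC22/", "XS", "example.h5")
```

With these inputs, the statistics folder ends up under the `XS` subfolder rather than directly under the data root.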
Attributes:
| Name | Type | Description |
| --- | --- | --- |
| name | str | Name of the dataset. |
| database_filename | str | Name of the database file. |
| database_path | str | Path to the database file. |
| servicemap_path | str | Path to the servicemap file. |
| statistics_path | str | Path to the dataset statistics folder. |
| bucket_url | str | URL of the bucket where the database is stored. |
| metadata | DatasetMetadata | Additional dataset metadata. |
| available_classes | list[str] | List of all available classes in the dataset. |
| available_dates | list[str] | List of all available dates in the dataset. |
| time_periods | dict[str, list[str]] | Predefined time periods. Each time period is a list of dates. |
| default_train_period_name | str | Default time period for training. |
| default_test_period_name | str | Default time period for testing. |
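A time period name such as `W-2022-44` can be expanded into a list of dates. The sketch below assumes names of the form `W-YYYY-WW` (ISO week) or `M-YYYY-MM` (calendar month) and dates formatted as `YYYYMMDD`; the function `expand_time_period` and its `missing_dates` parameter are hypothetical names, and the month branch here covers every day of the month.

```python
import calendar
import datetime


def expand_time_period(name: str, missing_dates: frozenset[str] = frozenset()) -> list[str]:
    """Expand a week or month period name into YYYYMMDD date strings."""
    kind, year_str, number_str = name.split("-")
    year, number = int(year_str), int(number_str)
    if kind == "W":
        days = [datetime.date.fromisocalendar(year, number, d) for d in range(1, 8)]
        # An ISO week at a year boundary can contain days of another year;
        # keep only the days that belong to the requested year.
        days = [d for d in days if d.year == year]
    elif kind == "M":
        last_day = calendar.monthrange(year, number)[1]
        days = [datetime.date(year, number, d) for d in range(1, last_day + 1)]
    else:
        raise ValueError(f"Unknown time period name {name}")
    dates = [d.strftime("%Y%m%d") for d in days]
    # Dates without collected data are left out of the period.
    return [s for s in dates if s not in missing_dates]


week_44 = expand_time_period("W-2022-44")
```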
The following attributes are initialized when set_dataset_config_and_initialize
is called.
Attributes:
| Name | Type | Description |
| --- | --- | --- |
| dataset_config | Optional[DatasetConfig] | Configuration of the dataset. |
| class_info | Optional[ClassInfo] | Structured information about the classes. |
| dataset_indices | Optional[IndicesTuple] | Named tuple containing train_indices, val_known_indices, val_unknown_indices, test_known_indices, test_unknown_indices. These are the indices into the PyTables database that define train, validation, and test sets. |
| train_dataset | Optional[PyTablesDataset] | Train set in the form of a PyTablesDataset instance wrapping the PyTables database. |
| val_dataset | Optional[PyTablesDataset] | Validation set in the form of a PyTablesDataset instance wrapping the PyTables database. |
| test_dataset | Optional[PyTablesDataset] | Test set in the form of a PyTablesDataset instance wrapping the PyTables database. |
| known_app_counts | Optional[DataFrame] | Known application counts in the train, validation, and test sets. |
| unknown_app_counts | Optional[DataFrame] | Unknown application counts in the validation and test sets. |
| train_dataloader | Optional[DataLoader] | Iterable PyTorch DataLoader for training. |
| train_dataloader_sampler | Optional[Sampler] | Sampler used for iterating the training dataloader. Either RandomSampler or SequentialSampler. |
| train_dataloader_drop_last | bool | Whether to drop the last incomplete batch when iterating the training dataloader. |
| val_dataloader | Optional[DataLoader] | Iterable PyTorch DataLoader for validation. |
| test_dataloader | Optional[DataLoader] | Iterable PyTorch DataLoader for testing. |
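The interplay of sampler order and `train_dataloader_drop_last` can be sketched without PyTorch. In the source listing below, random order drops the last incomplete batch while sequential order keeps it; this pure-Python stand-in mimics that rule. The function `iter_batches` is a hypothetical helper, and its shuffling does not reproduce the exact order of a seeded PyTorch `RandomSampler`.

```python
import random
from typing import Iterator, Optional


def iter_batches(num_samples: int, batch_size: int, shuffle: bool,
                 seed: Optional[int] = None) -> Iterator[list[int]]:
    """Yield batches of sample indices in sequential or random order."""
    indices = list(range(num_samples))
    if shuffle:
        # A fixed seed makes the random order reproducible.
        random.Random(seed).shuffle(indices)
    # Random order drops the incomplete tail batch; sequential order keeps it.
    drop_last = shuffle
    for start in range(0, num_samples, batch_size):
        batch = indices[start:start + batch_size]
        if len(batch) < batch_size and drop_last:
            return
        yield batch


sequential = list(iter_batches(10, 4, shuffle=False))
shuffled = list(iter_batches(10, 4, shuffle=True, seed=0))
```

Keeping the tail batch in sequential order matters when the dataloader is read into a dataframe, because every sample must appear exactly once.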
cesnet_datazoo\\datasets\\cesnet_dataset.py
class CesnetDataset():\n \"\"\"\n The main class for accessing CESNET datasets. It handles downloading, train/validation/test splitting, and class selection. Access to data is provided through:\n\n - Iterable PyTorch DataLoader for batch processing. See [using dataloaders][using-dataloaders] for more details.\n - Pandas DataFrame for loading the entire train, validation, or test set at once.\n\n The dataset is stored in a [PyTables](https://www.pytables.org/) database. The internal `PyTablesDataset` class is used as a wrapper\n that implements the PyTorch [`Dataset`](https://pytorch.org/docs/stable/data.html#torch.utils.data.Dataset) interface\n and is compatible with [`DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader),\n which provides efficient parallel loading of the data. The dataset configuration is done through the [`DatasetConfig`][config.DatasetConfig] class.\n\n **Intended usage:**\n\n 1. Create an instance of the [dataset class][dataset-classes] with the desired size and data root. This will download the dataset if it has not already been downloaded.\n 2. Create an instance of [`DatasetConfig`][config.DatasetConfig] and set it with [`set_dataset_config_and_initialize`][datasets.cesnet_dataset.CesnetDataset.set_dataset_config_and_initialize].\n This will initialize the dataset \u2014 select classes, split data into train/validation/test sets, and fit data scalers if needed. All is done according to the provided configuration and is cached for later use.\n 3. Use [`get_train_dataloader`][datasets.cesnet_dataset.CesnetDataset.get_train_dataloader] or [`get_train_df`][datasets.cesnet_dataset.CesnetDataset.get_train_df] to get training data for a classification model.\n 4. Validate the model and perform hyperparameter optimization on [`get_val_dataloader`][datasets.cesnet_dataset.CesnetDataset.get_val_dataloader] or [`get_val_df`][datasets.cesnet_dataset.CesnetDataset.get_val_df].\n 5. 
Evaluate the model on [`get_test_dataloader`][datasets.cesnet_dataset.CesnetDataset.get_test_dataloader] or [`get_test_df`][datasets.cesnet_dataset.CesnetDataset.get_test_df].\n\n Parameters:\n data_root: Path to the folder where the dataset will be stored. Each dataset size has its own subfolder `data_root/size`\n size: Size of the dataset. Options are `XS`, `S`, `M`, `L`, `ORIG`.\n silent: Whether to suppress print and tqdm output.\n\n Attributes:\n name: Name of the dataset.\n database_filename: Name of the database file.\n database_path: Path to the database file.\n servicemap_path: Path to the servicemap file.\n statistics_path: Path to the dataset statistics folder.\n bucket_url: URL of the bucket where the database is stored.\n metadata: Additional [dataset metadata][metadata].\n available_classes: List of all available classes in the dataset.\n available_dates: List of all available dates in the dataset.\n time_periods: Predefined time periods. Each time period is a list of dates.\n default_train_period_name: Default time period for training.\n default_test_period_name: Default time period for testing.\n\n The following attributes are initialized when [`set_dataset_config_and_initialize`][datasets.cesnet_dataset.CesnetDataset.set_dataset_config_and_initialize] is called.\n\n Attributes:\n dataset_config: Configuration of the dataset.\n class_info: Structured information about the classes.\n dataset_indices: Named tuple containing `train_indices`, `val_known_indices`, `val_unknown_indices`, `test_known_indices`, `test_unknown_indices`. 
These are the indices into PyTables database that define train, validation, and test sets.\n train_dataset: Train set in the form of `PyTablesDataset` instance wrapping the PyTables database.\n val_dataset: Validation set in the form of `PyTablesDataset` instance wrapping the PyTables database.\n test_dataset: Test set in the form of `PyTablesDataset` instance wrapping the PyTables database.\n known_app_counts: Known application counts in the train, validation, and test sets.\n unknown_app_counts: Unknown application counts in the validation and test sets.\n train_dataloader: Iterable PyTorch [`DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader) for training.\n train_dataloader_sampler: Sampler used for iterating the training dataloader. Either [`RandomSampler`](https://pytorch.org/docs/stable/data.html#torch.utils.data.RandomSampler) or [`SequentialSampler`](https://pytorch.org/docs/stable/data.html#torch.utils.data.SequentialSampler).\n train_dataloader_drop_last: Whether to drop the last incomplete batch when iterating the training dataloader.\n val_dataloader: Iterable PyTorch [`DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader) for validation.\n test_dataloader: Iterable PyTorch [`DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader) for testing.\n \"\"\"\n data_root: str\n size: str\n silent: bool = False\n\n name: str\n database_filename: str\n database_path: str\n servicemap_path: str\n statistics_path: str\n bucket_url: str\n metadata: DatasetMetadata\n available_classes: list[str]\n available_dates: list[str]\n time_periods: dict[str, list[str]]\n default_train_period_name: str\n default_test_period_name: str\n\n dataset_config: Optional[DatasetConfig] = None\n class_info: Optional[ClassInfo] = None\n dataset_indices: Optional[IndicesTuple] = None\n train_dataset: Optional[PyTablesDataset] = None\n val_dataset: Optional[PyTablesDataset] = None\n test_dataset: 
Optional[PyTablesDataset] = None\n known_app_counts: Optional[pd.DataFrame] = None\n unknown_app_counts: Optional[pd.DataFrame] = None\n train_dataloader: Optional[DataLoader] = None\n train_dataloader_sampler: Optional[Sampler] = None\n train_dataloader_drop_last: bool = True\n val_dataloader: Optional[DataLoader] = None\n test_dataloader: Optional[DataLoader] = None\n\n _collate_fn: Optional[Callable] = None\n _tables_app_enum: dict[int, str]\n _tables_cat_enum: dict[int, str]\n\n def __init__(self, data_root: str, size: str = \"S\", database_checks_at_init: bool = False, silent: bool = False) -> None:\n self.silent = silent\n self.metadata = load_metadata(self.name)\n self.size = size\n if self.size != \"ORIG\":\n if size not in self.metadata.available_dataset_sizes:\n raise ValueError(f\"Unknown dataset size {self.size}\")\n self.name = f\"{self.name}-{self.size}\"\n filename, ext = os.path.splitext(self.database_filename)\n self.database_filename = f\"{filename}-{self.size}{ext}\"\n self.data_root = os.path.normpath(os.path.expanduser(os.path.join(data_root, self.size)))\n self.database_path = os.path.join(self.data_root, self.database_filename)\n self.servicemap_path = os.path.join(self.data_root, SERVICEMAP_FILE)\n self.statistics_path = os.path.join(self.data_root, \"statistics\")\n if not os.path.exists(self.data_root):\n os.makedirs(self.data_root)\n if not self._is_downloaded():\n self._download()\n if database_checks_at_init:\n with tb.open_file(self.database_path, mode=\"r\") as database:\n tables_paths = list(map(lambda x: x._v_pathname, iter(database.get_node(f\"/flows\"))))\n num_samples = 0\n for p in tables_paths:\n table = database.get_node(p)\n assert isinstance(table, tb.Table)\n if self._tables_app_enum != {v: k for k, v in dict(table.get_enum(APP_COLUMN)).items()}:\n raise ValueError(f\"Found mismatch between _tables_app_enum and the PyTables database enum in table {p}. 
Please report this issue.\")\n if self._tables_cat_enum != {v: k for k, v in dict(table.get_enum(CATEGORY_COLUMN)).items()}:\n raise ValueError(f\"Found mismatch between _tables_cat_enum and the PyTables database enum in table {p}. Please report this issue.\")\n num_samples += len(table)\n if self.size == \"ORIG\" and num_samples != self.metadata.available_samples:\n raise ValueError(f\"Expected {self.metadata.available_samples} samples, but got {num_samples} in the database. Please delete the data root folder, update cesnet-datazoo, and redownload the dataset.\")\n if self.size != \"ORIG\" and num_samples != DATASET_SIZES[self.size]:\n raise ValueError(f\"Expected {DATASET_SIZES[self.size]} samples, but got {num_samples} in the database. Please delete the data root folder, update cesnet-datazoo, and redownload the dataset.\")\n if self.available_dates != list(map(lambda x: x.removeprefix(\"/flows/D\"), tables_paths)):\n raise ValueError(f\"Found mismatch between available_dates and the dates available in the PyTables database. Please report this issue.\")\n # Add all available dates as single date time periods\n for d in self.available_dates:\n self.time_periods[d] = [d]\n available_applications = sorted([app for app in pd.read_csv(self.servicemap_path, index_col=\"Tag\").index if not is_background_app(app)])\n if len(available_applications) != self.metadata.application_count:\n raise ValueError(f\"Found {len(available_applications)} applications in the servicemap (omitting background traffic classes), but expected {self.metadata.application_count}. Please report this issue.\")\n self.available_classes = available_applications + self.metadata.background_traffic_classes\n\n def set_dataset_config_and_initialize(self, dataset_config: DatasetConfig, disable_indices_cache: bool = False) -> None:\n \"\"\"\n Initialize train, validation, and test sets. 
Data cannot be accessed before calling this method.\n\n Parameters:\n dataset_config: Desired configuration of the dataset.\n disable_indices_cache: Whether to disable caching of the dataset indices. This is useful when the dataset is used in many different configurations and you want to save disk space.\n \"\"\"\n self.dataset_config = dataset_config\n self._clear()\n self._initialize_train_val_test(disable_indices_cache=disable_indices_cache)\n\n def get_train_dataloader(self) -> DataLoader:\n \"\"\"\n Provides a PyTorch [`DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader) for training. The dataloader is created on the first call and then cached.\n When the dataloader is iterated in random order, the last incomplete batch is dropped.\n The dataloader is configured with the following config attributes:\n\n | Dataset config | Description |\n | ---------------------------- | ------------------------------------------------------------------------------------------ |\n | `batch_size` | Number of samples per batch. |\n | `train_workers` | Number of workers for loading train data. |\n | `train_dataloader_order` | Whether to load train data in sequential or random order. See [config.DataLoaderOrder][]. |\n | `train_dataloader_seed` | Seed for loading train data in random order. |\n\n Returns:\n Train data as an iterable dataloader. 
See [using dataloaders][using-dataloaders] for more details.\n \"\"\"\n if self.dataset_config is None:\n raise ValueError(\"Dataset is not initialized, use set_dataset_config_and_initialize() before getting train dataloader\")\n if not self.dataset_config.need_train_set:\n raise ValueError(\"Train dataloader is not available when need_train_set is false\")\n assert self.train_dataset\n if self.train_dataloader:\n return self.train_dataloader\n # Create sampler according to the selected order\n if self.dataset_config.train_dataloader_order == DataLoaderOrder.RANDOM:\n if self.dataset_config.train_dataloader_seed is not None:\n generator = torch.Generator()\n generator.manual_seed(self.dataset_config.train_dataloader_seed)\n else:\n generator = None\n self.train_dataloader_sampler = RandomSampler(self.train_dataset, generator=generator)\n self.train_dataloader_drop_last = True\n elif self.dataset_config.train_dataloader_order == DataLoaderOrder.SEQUENTIAL:\n self.train_dataloader_sampler = SequentialSampler(self.train_dataset)\n self.train_dataloader_drop_last = False\n else: assert_never(self.dataset_config.train_dataloader_order)\n # Create dataloader\n batch_sampler = BatchSampler(sampler=self.train_dataloader_sampler, batch_size=self.dataset_config.batch_size, drop_last=self.train_dataloader_drop_last)\n train_dataloader = DataLoader(\n self.train_dataset,\n num_workers=self.dataset_config.train_workers,\n worker_init_fn=worker_init_fn,\n collate_fn=self._collate_fn,\n persistent_workers=self.dataset_config.train_workers > 0,\n batch_size=None,\n sampler=batch_sampler,)\n if self.dataset_config.train_workers == 0:\n self.train_dataset.pytables_worker_init()\n self.train_dataloader = train_dataloader\n return train_dataloader\n\n def get_val_dataloader(self) -> DataLoader:\n \"\"\"\n Provides a PyTorch [`DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader) for validation.\n The dataloader is created on the first call and then 
cached.\n The dataloader is configured with the following config attributes:\n\n | Dataset config | Description |\n | ------------------| ------------------------------------------------------------------|\n | `test_batch_size` | Number of samples per batch for loading validation and test data. |\n | `val_workers` | Number of workers for loading validation data. |\n\n Returns:\n Validation data as an iterable dataloader. See [using dataloaders][using-dataloaders] for more details.\n \"\"\"\n if self.dataset_config is None:\n raise ValueError(\"Dataset is not initialized, use set_dataset_config_and_initialize() before getting validation dataloader\")\n if not self.dataset_config.need_val_set:\n raise ValueError(\"Validation dataloader is not available when need_val_set is false\")\n assert self.val_dataset is not None\n if self.val_dataloader:\n return self.val_dataloader\n batch_sampler = BatchSampler(sampler=SequentialSampler(self.val_dataset), batch_size=self.dataset_config.test_batch_size, drop_last=False)\n val_dataloader = DataLoader(\n self.val_dataset,\n num_workers=self.dataset_config.val_workers,\n worker_init_fn=worker_init_fn,\n collate_fn=self._collate_fn,\n persistent_workers=self.dataset_config.val_workers > 0,\n batch_size=None,\n sampler=batch_sampler,)\n if self.dataset_config.val_workers == 0:\n self.val_dataset.pytables_worker_init()\n self.val_dataloader = val_dataloader\n return val_dataloader\n\n def get_test_dataloader(self) -> DataLoader:\n \"\"\"\n Provides a PyTorch [`DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader) for testing.\n The dataloader is created on the first call and then cached.\n\n When the dataset is used in the open-world setting, and unknown classes are defined,\n the test dataloader returns `test_known_size` samples of known classes followed by `test_unknown_size` samples of unknown classes.\n\n The dataloader is configured with the following config attributes:\n\n | Dataset config | 
Description |\n | ------------------| ------------------------------------------------------------------|\n | `test_batch_size` | Number of samples per batch for loading validation and test data. |\n | `test_workers` | Number of workers for loading test data. |\n\n Returns:\n Test data as an iterable dataloader. See [using dataloaders][using-dataloaders] for more details.\n \"\"\"\n if self.dataset_config is None:\n raise ValueError(\"Dataset is not initialized, use set_dataset_config_and_initialize() before getting test dataloader\")\n if not self.dataset_config.need_test_set:\n raise ValueError(\"Test dataloader is not available when need_test_set is false\")\n assert self.test_dataset is not None\n if self.test_dataloader:\n return self.test_dataloader\n batch_sampler = BatchSampler(sampler=SequentialSampler(self.test_dataset), batch_size=self.dataset_config.test_batch_size, drop_last=False)\n test_dataloader = DataLoader(\n self.test_dataset,\n num_workers=self.dataset_config.test_workers,\n worker_init_fn=worker_init_fn,\n collate_fn=self._collate_fn,\n persistent_workers=False,\n batch_size=None,\n sampler=batch_sampler,)\n if self.dataset_config.test_workers == 0:\n self.test_dataset.pytables_worker_init()\n self.test_dataloader = test_dataloader\n return test_dataloader\n\n def get_dataloaders(self) -> tuple[DataLoader, DataLoader, DataLoader]:\n \"\"\"Gets train, validation, and test dataloaders in one call.\"\"\"\n if self.dataset_config is None:\n raise ValueError(\"Dataset is not initialized, use set_dataset_config_and_initialize() before getting dataloaders\")\n train_dataloader = self.get_train_dataloader()\n val_dataloader = self.get_val_dataloader()\n test_dataloader = self.get_test_dataloader()\n return train_dataloader, val_dataloader, test_dataloader\n\n def get_train_df(self, flatten_ppi: bool = False) -> pd.DataFrame:\n \"\"\"\n Creates a train Pandas [`DataFrame`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html). 
The dataframe is in sequential (datetime) order. Consider shuffling the dataframe if needed.\n\n !!! warning \"Memory usage\"\n\n The whole train set is loaded into memory. If the dataset size is larger than `'S'`, consider using `get_train_dataloader` instead.\n\n Parameters:\n flatten_ppi: Whether to flatten the PPI sequence into individual columns (named `IPT_X`, `DIR_X`, `SIZE_X`, `PUSH_X`, *X* being the index of the packet) or keep one `PPI` column with 2D data.\n\n Returns:\n Train data as a dataframe.\n \"\"\"\n self._check_before_dataframe(check_train=True)\n assert self.dataset_config is not None and self.train_dataset is not None\n if len(self.train_dataset) > DATAFRAME_SAMPLES_WARNING_THRESHOLD:\n warnings.warn(f\"Train set has ({len(self.train_dataset)} samples), consider using get_train_dataloader() instead\")\n train_dataloader = self.get_train_dataloader()\n assert isinstance(train_dataloader.sampler, BatchSampler) and self.train_dataloader_sampler is not None\n # Read dataloader in sequential order\n train_dataloader.sampler.sampler = SequentialSampler(self.train_dataset)\n train_dataloader.sampler.drop_last = False\n feature_names = self.dataset_config.get_feature_names(flatten_ppi=flatten_ppi)\n df = create_df_from_dataloader(dataloader=train_dataloader,\n feature_names=feature_names,\n flatten_ppi=flatten_ppi,\n silent=self.silent)\n # Restore the original dataloader sampler and drop_last\n train_dataloader.sampler.sampler = self.train_dataloader_sampler\n train_dataloader.sampler.drop_last = self.train_dataloader_drop_last\n return df\n\n def get_val_df(self, flatten_ppi: bool = False) -> pd.DataFrame:\n \"\"\"\n Creates validation Pandas [`DataFrame`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html). The dataframe is in sequential (datetime) order.\n\n !!! warning \"Memory usage\"\n\n The whole validation set is loaded into memory. 
If the dataset size is larger than `'S'`, consider using `get_val_dataloader` instead.\n\n Parameters:\n flatten_ppi: Whether to flatten the PPI sequence into individual columns (named `IPT_X`, `DIR_X`, `SIZE_X`, `PUSH_X`, *X* being the index of the packet) or keep one `PPI` column with 2D data.\n\n Returns:\n Validation data as a dataframe.\n \"\"\"\n self._check_before_dataframe(check_val=True)\n assert self.dataset_config is not None and self.val_dataset is not None\n if len(self.val_dataset) > DATAFRAME_SAMPLES_WARNING_THRESHOLD:\n warnings.warn(f\"Validation set has ({len(self.val_dataset)} samples), consider using get_val_dataloader() instead\")\n feature_names = self.dataset_config.get_feature_names(flatten_ppi=flatten_ppi)\n return create_df_from_dataloader(dataloader=self.get_val_dataloader(),\n feature_names=feature_names,\n flatten_ppi=flatten_ppi,\n silent=self.silent)\n\n def get_test_df(self, flatten_ppi: bool = False) -> pd.DataFrame:\n \"\"\"\n Creates test Pandas [`DataFrame`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html). The dataframe is in sequential (datetime) order.\n\n\n When the dataset is used in the open-world setting, and unknown classes are defined,\n the returned test dataframe is composed of `test_known_size` samples of known classes followed by `test_unknown_size` samples of unknown classes.\n\n\n !!! warning \"Memory usage\"\n\n The whole test set is loaded into memory. 
If the dataset size is larger than `'S'`, consider using `get_test_dataloader` instead.\n\n Parameters:\n flatten_ppi: Whether to flatten the PPI sequence into individual columns (named `IPT_X`, `DIR_X`, `SIZE_X`, `PUSH_X`, *X* being the index of the packet) or keep one `PPI` column with 2D data.\n\n Returns:\n Test data as a dataframe.\n \"\"\"\n self._check_before_dataframe(check_test=True)\n assert self.dataset_config is not None and self.test_dataset is not None\n if len(self.test_dataset) > DATAFRAME_SAMPLES_WARNING_THRESHOLD:\n warnings.warn(f\"Test set has ({len(self.test_dataset)} samples), consider using get_test_dataloader() instead\")\n feature_names = self.dataset_config.get_feature_names(flatten_ppi=flatten_ppi)\n return create_df_from_dataloader(dataloader=self.get_test_dataloader(),\n feature_names=feature_names,\n flatten_ppi=flatten_ppi,\n silent=self.silent)\n\n def get_num_classes(self) -> int:\n \"\"\"Returns the number of classes in the current configuration of the dataset.\"\"\"\n if self.class_info is None:\n raise ValueError(\"Dataset is not initialized, use set_dataset_config_and_initialize() before getting the number of classes\")\n return self.class_info.num_classes\n\n def get_known_apps(self) -> list[str]:\n \"\"\"Returns the list of known applications in the current configuration of the dataset.\"\"\"\n if self.class_info is None:\n raise ValueError(\"Dataset is not initialized, use set_dataset_config_and_initialize() before getting known apps\")\n return self.class_info.known_apps\n\n def get_unknown_apps(self) -> list[str]:\n \"\"\"Returns the list of unknown applications in the current configuration of the dataset.\"\"\"\n if self.class_info is None:\n raise ValueError(\"Dataset is not initialized, use set_dataset_config_and_initialize() before getting unknown apps\")\n return self.class_info.unknown_apps\n\n def compute_dataset_statistics(self, num_samples: int | Literal[\"all\"] = 10_000_000, num_workers: int = 4, batch_size: int 
= 16384, disabled_apps: Optional[list[str]] = None) -> None:\n \"\"\"\n Computes dataset statistics and saves them to the `statistics_path` folder.\n\n Parameters:\n num_samples: Number of samples to use for computing the statistics.\n num_workers: Number of workers for loading data.\n batch_size: Number of samples per batch for loading data.\n disabled_apps: List of applications to exclude from the statistics.\n \"\"\"\n if disabled_apps:\n bad_disabled_apps = [a for a in disabled_apps if a not in self.available_classes]\n if len(bad_disabled_apps) > 0:\n raise ValueError(f\"Bad applications in disabled_apps {bad_disabled_apps}. Use applications available in dataset.available_classes\")\n if not os.path.exists(self.statistics_path):\n os.mkdir(self.statistics_path)\n compute_dataset_statistics(database_path=self.database_path,\n tables_app_enum=self._tables_app_enum,\n tables_cat_enum=self._tables_cat_enum,\n output_dir=self.statistics_path,\n packet_histograms=self.metadata.packet_histograms,\n flowstats_features_boolean=self.metadata.flowstats_features_boolean,\n protocol=self.metadata.protocol,\n extra_fields=not self.name.startswith(\"CESNET-TLS22\"),\n disabled_apps=disabled_apps if disabled_apps is not None else [],\n num_samples=num_samples,\n num_workers=num_workers,\n batch_size=batch_size,\n silent=self.silent)\n\n def _generate_time_periods(self) -> None:\n time_periods = {}\n for period in self.time_periods:\n time_periods[period] = []\n if period.startswith(\"W\"):\n split = period.split(\"-\")\n collection_year, week = int(split[1]), int(split[2])\n for d in range(1, 8):\n s = datetime.date.fromisocalendar(collection_year, week, d).strftime(\"%Y%m%d\")\n # last week of a year can span into the following year\n if s not in self.metadata.missing_dates_in_collection_period and s.startswith(str(collection_year)):\n time_periods[period].append(s)\n elif period.startswith(\"M\"):\n split = period.split(\"-\")\n collection_year, month = int(split[1]), 
int(split[2])\n for d in range(1, calendar.monthrange(collection_year, month)[1] + 1): # +1 so that the last day of the month is included\n s = datetime.date(collection_year, month, d).strftime(\"%Y%m%d\")\n if s not in self.metadata.missing_dates_in_collection_period:\n time_periods[period].append(s)\n self.time_periods = time_periods\n\n def _is_downloaded(self) -> bool:\n \"\"\"Servicemap is downloaded after the database; thus if it exists, the database is also downloaded\"\"\"\n return os.path.exists(self.servicemap_path) and os.path.exists(self.database_path)\n\n def _download(self) -> None:\n if not self.silent:\n print(f\"Downloading {self.name} dataset\")\n database_url = f\"{self.bucket_url}&file={self.database_filename}\"\n servicemap_url = f\"{self.bucket_url}&file={SERVICEMAP_FILE}\"\n resumable_download(url=database_url, file_path=self.database_path, silent=self.silent)\n simple_download(url=servicemap_url, file_path=self.servicemap_path)\n\n def _clear(self) -> None:\n self.class_info = None\n self.dataset_indices = None\n self.train_dataset = None\n self.val_dataset = None\n self.test_dataset = None\n self.known_app_counts = None\n self.unknown_app_counts = None\n self.train_dataloader = None\n self.train_dataloader_sampler = None\n self.train_dataloader_drop_last = True\n self.val_dataloader = None\n self.test_dataloader = None\n self._collate_fn = None\n\n def _check_before_dataframe(self, check_train: bool = False, check_val: bool = False, check_test: bool = False) -> None:\n if self.dataset_config is None:\n raise ValueError(\"Dataset is not initialized, use set_dataset_config_and_initialize() before getting a dataframe\")\n if self.dataset_config.return_tensors:\n raise ValueError(\"Dataframes are not available when return_tensors is set. 
Use a dataloader instead.\")\n if check_train and not self.dataset_config.need_train_set:\n raise ValueError(\"Train dataframe is not available when need_train_set is false\")\n if check_val and not self.dataset_config.need_val_set:\n raise ValueError(\"Validation dataframe is not available when need_val_set is false\")\n if check_test and not self.dataset_config.need_test_set:\n raise ValueError(\"Test dataframe is not available when need_test_set is false\")\n\n def _initialize_train_val_test(self, disable_indices_cache: bool = False) -> None:\n assert self.dataset_config is not None\n dataset_config = self.dataset_config\n servicemap = pd.read_csv(dataset_config.servicemap_path, index_col=\"Tag\")\n # Initialize train set\n if dataset_config.need_train_set:\n train_indices, train_unknown_indices, known_apps, unknown_apps = init_or_load_train_indices(dataset_config=dataset_config,\n tables_app_enum=self._tables_app_enum,\n servicemap=servicemap,\n disable_indices_cache=disable_indices_cache,)\n # Date weight sampling of train indices\n if dataset_config.train_dates_weigths is not None:\n assert dataset_config.train_size != \"all\"\n if dataset_config.val_approach == ValidationApproach.SPLIT_FROM_TRAIN:\n # requested number of samples is train_size + val_known_size when using the split-from-train validation approach\n assert dataset_config.val_known_size != \"all\"\n num_samples = dataset_config.train_size + dataset_config.val_known_size\n else:\n num_samples = dataset_config.train_size\n if num_samples > len(train_indices):\n raise ValueError(f\"Requested number of samples for weight sampling ({num_samples}) is larger than the number of available train samples ({len(train_indices)})\")\n train_indices = date_weight_sample_train_indices(dataset_config=dataset_config, train_indices=train_indices, num_samples=num_samples)\n elif dataset_config.apps_selection == AppSelection.FIXED:\n known_apps = dataset_config.apps_selection_fixed_known\n unknown_apps = 
dataset_config.apps_selection_fixed_unknown\n train_indices = np.zeros((0,3), dtype=np.int64)\n train_unknown_indices = np.zeros((0,3), dtype=np.int64)\n else:\n raise ValueError(\"Either need train set or the fixed application selection\")\n # Initialize validation set\n if dataset_config.need_val_set:\n if dataset_config.val_approach == ValidationApproach.VALIDATION_DATES:\n val_known_indices, val_unknown_indices, val_data_path = init_or_load_val_indices(dataset_config=dataset_config,\n known_apps=known_apps,\n unknown_apps=unknown_apps,\n tables_app_enum=self._tables_app_enum,\n disable_indices_cache=disable_indices_cache,)\n elif dataset_config.val_approach == ValidationApproach.SPLIT_FROM_TRAIN:\n train_val_rng = get_fresh_random_generator(dataset_config=dataset_config, section=RandomizedSection.TRAIN_VAL_SPLIT)\n val_data_path = dataset_config._get_train_data_path()\n val_unknown_indices = train_unknown_indices\n train_labels = train_indices[:, INDICES_LABEL_POS]\n if dataset_config.train_dates_weigths is not None:\n assert dataset_config.val_known_size != \"all\"\n # When weight sampling is used, val_known_size is kept but the resulting train size can be smaller due to no enough samples in some train dates\n if dataset_config.val_known_size > len(train_indices):\n raise ValueError(f\"Requested validation size ({dataset_config.val_known_size}) is larger than the number of available train samples after weight sampling ({len(train_indices)})\")\n train_indices, val_known_indices = train_test_split(train_indices, test_size=dataset_config.val_known_size, stratify=train_labels, shuffle=True, random_state=train_val_rng)\n dataset_config.train_size = len(train_indices)\n elif dataset_config.train_size == \"all\" and dataset_config.val_known_size == \"all\":\n train_indices, val_known_indices = train_test_split(train_indices, test_size=dataset_config.train_val_split_fraction, stratify=train_labels, shuffle=True, random_state=train_val_rng)\n else:\n if 
dataset_config.val_known_size != \"all\" and dataset_config.train_size != \"all\" and dataset_config.train_size + dataset_config.val_known_size > len(train_indices):\n raise ValueError(f\"Requested train size + validation size ({dataset_config.train_size + dataset_config.val_known_size}) is larger than the number of available train samples ({len(train_indices)})\")\n if dataset_config.train_size != \"all\" and dataset_config.train_size > len(train_indices):\n raise ValueError(f\"Requested train size ({dataset_config.train_size}) is larger than the number of available train samples ({len(train_indices)})\")\n if dataset_config.val_known_size != \"all\" and dataset_config.val_known_size > len(train_indices):\n raise ValueError(f\"Requested validation size ({dataset_config.val_known_size}) is larger than the number of available train samples ({len(train_indices)})\")\n train_indices, val_known_indices = train_test_split(train_indices,\n train_size=dataset_config.train_size if dataset_config.train_size != \"all\" else None,\n test_size=dataset_config.val_known_size if dataset_config.val_known_size != \"all\" else None,\n stratify=train_labels, shuffle=True, random_state=train_val_rng)\n else:\n val_known_indices = np.zeros((0,3), dtype=np.int64)\n val_unknown_indices = np.zeros((0,3), dtype=np.int64)\n val_data_path = None\n # Initialize test set\n if dataset_config.need_test_set:\n test_known_indices, test_unknown_indices, test_data_path = init_or_load_test_indices(dataset_config=dataset_config,\n known_apps=known_apps,\n unknown_apps=unknown_apps,\n tables_app_enum=self._tables_app_enum,\n disable_indices_cache=disable_indices_cache,)\n else:\n test_known_indices = np.zeros((0,3), dtype=np.int64)\n test_unknown_indices = np.zeros((0,3), dtype=np.int64)\n test_data_path = None\n # Fit scalers if needed\n if (dataset_config.ppi_transform is not None and dataset_config.ppi_transform.needs_fitting or\n dataset_config.flowstats_transform is not None and 
dataset_config.flowstats_transform.needs_fitting):\n if not dataset_config.need_train_set:\n raise ValueError(\"Train set is needed to fit the scalers. Provide pre-fitted scalers.\")\n fit_scalers(dataset_config=dataset_config, train_indices=train_indices)\n # Subset dataset indices based on the selected sizes and compute application counts\n dataset_indices = IndicesTuple(train_indices=train_indices, val_known_indices=val_known_indices, val_unknown_indices=val_unknown_indices, test_known_indices=test_known_indices, test_unknown_indices=test_unknown_indices)\n dataset_indices = subset_and_sort_indices(dataset_config=dataset_config, dataset_indices=dataset_indices)\n known_app_counts = compute_known_app_counts(dataset_indices=dataset_indices, tables_app_enum=self._tables_app_enum)\n unknown_app_counts = compute_unknown_app_counts(dataset_indices=dataset_indices, tables_app_enum=self._tables_app_enum)\n # Combine known and unknown test indicies to create a single dataloader\n assert isinstance(dataset_config.test_unknown_size, int)\n if dataset_config.test_unknown_size > 0 and len(unknown_apps) > 0:\n test_combined_indices = np.concatenate((dataset_indices.test_known_indices, dataset_indices.test_unknown_indices))\n else:\n test_combined_indices = dataset_indices.test_known_indices\n # Create encoder the class info structure\n encoder = LabelEncoder().fit(known_apps)\n encoder.classes_ = np.append(encoder.classes_, UNKNOWN_STR_LABEL)\n class_info = create_class_info(servicemap=servicemap, encoder=encoder, known_apps=known_apps, unknown_apps=unknown_apps)\n encode_labels_with_unknown_fn = partial(_encode_labels_with_unknown, encoder=encoder, class_info=class_info)\n # Create train, validation, and test datasets\n train_dataset = val_dataset = test_dataset = None\n if dataset_config.need_train_set:\n train_dataset = PyTablesDataset(\n database_path=dataset_config.database_path,\n tables_paths=dataset_config._get_train_tables_paths(),\n 
indices=dataset_indices.train_indices,\n tables_app_enum=self._tables_app_enum,\n tables_cat_enum=self._tables_cat_enum,\n flowstats_features=dataset_config.flowstats_features,\n flowstats_features_boolean=dataset_config.flowstats_features_boolean,\n flowstats_features_phist=dataset_config.flowstats_features_phist,\n other_fields=self.dataset_config.other_fields,\n ppi_channels=dataset_config.get_ppi_channels(),\n ppi_transform=dataset_config.ppi_transform,\n flowstats_transform=dataset_config.flowstats_transform,\n flowstats_phist_transform=dataset_config.flowstats_phist_transform,\n target_transform=encode_labels_with_unknown_fn,\n return_tensors=dataset_config.return_tensors,)\n if dataset_config.need_val_set:\n assert val_data_path is not None\n val_dataset = PyTablesDataset(\n database_path=dataset_config.database_path,\n tables_paths=dataset_config._get_train_tables_paths(),\n indices=dataset_indices.val_known_indices,\n tables_app_enum=self._tables_app_enum,\n tables_cat_enum=self._tables_cat_enum,\n flowstats_features=dataset_config.flowstats_features,\n flowstats_features_boolean=dataset_config.flowstats_features_boolean,\n flowstats_features_phist=dataset_config.flowstats_features_phist,\n other_fields=self.dataset_config.other_fields,\n ppi_channels=dataset_config.get_ppi_channels(),\n ppi_transform=dataset_config.ppi_transform,\n flowstats_transform=dataset_config.flowstats_transform,\n flowstats_phist_transform=dataset_config.flowstats_phist_transform,\n target_transform=encode_labels_with_unknown_fn,\n return_tensors=dataset_config.return_tensors,\n preload=dataset_config.preload_val,\n preload_blob=os.path.join(val_data_path, \"preload\", f\"val_dataset-{dataset_config.val_known_size}.npz\"),)\n if dataset_config.need_test_set:\n assert test_data_path is not None\n test_dataset = PyTablesDataset(\n database_path=dataset_config.database_path,\n tables_paths=dataset_config._get_test_tables_paths(),\n indices=test_combined_indices,\n 
tables_app_enum=self._tables_app_enum,\n tables_cat_enum=self._tables_cat_enum,\n flowstats_features=dataset_config.flowstats_features,\n flowstats_features_boolean=dataset_config.flowstats_features_boolean,\n flowstats_features_phist=dataset_config.flowstats_features_phist,\n other_fields=self.dataset_config.other_fields,\n ppi_channels=dataset_config.get_ppi_channels(),\n ppi_transform=dataset_config.ppi_transform,\n flowstats_transform=dataset_config.flowstats_transform,\n flowstats_phist_transform=dataset_config.flowstats_phist_transform,\n target_transform=encode_labels_with_unknown_fn,\n return_tensors=dataset_config.return_tensors,\n preload=dataset_config.preload_test,\n preload_blob=os.path.join(test_data_path, \"preload\", f\"test_dataset-{dataset_config.test_known_size}-{dataset_config.test_unknown_size}.npz\"),)\n self.class_info = class_info\n self.dataset_indices = dataset_indices\n self.train_dataset = train_dataset\n self.val_dataset = val_dataset\n self.test_dataset = test_dataset\n self.known_app_counts = known_app_counts\n self.unknown_app_counts = unknown_app_counts\n self._collate_fn = collate_fn_simple\n
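The size checks performed for the split-from-train validation approach can be summarized in a small standalone sketch. The helper `check_split_sizes` is hypothetical and not part of cesnet-datazoo; it only mirrors the three comparisons visible in the source above, under the assumption that sizes are either an integer or the string `'all'`.

```python
from typing import Literal, Union

Size = Union[int, Literal['all']]


def check_split_sizes(train_size: Size, val_known_size: Size, available: int) -> None:
    # Hypothetical helper: raises ValueError when the requested sizes
    # cannot be satisfied by the available train samples.
    if train_size != 'all' and val_known_size != 'all':
        requested = train_size + val_known_size
        if requested > available:
            raise ValueError(f'train + validation size ({requested}) exceeds available samples ({available})')
    if train_size != 'all' and train_size > available:
        raise ValueError(f'train size ({train_size}) exceeds available samples ({available})')
    if val_known_size != 'all' and val_known_size > available:
        raise ValueError(f'validation size ({val_known_size}) exceeds available samples ({available})')


# Both sizes fit into 10 available samples, so no exception is raised
check_split_sizes(6, 4, 10)
```

When both sizes are `'all'`, none of the checks apply and the split is governed by `train_val_split_fraction` instead.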
"},{"location":"reference_cesnet_dataset/#datasets.cesnet_dataset.CesnetDataset.set_dataset_config_and_initialize","title":"set_dataset_config_and_initialize","text":"set_dataset_config_and_initialize(\n dataset_config: DatasetConfig,\n disable_indices_cache: bool = False,\n) -> None\n
Initialize train, validation, and test sets. Data cannot be accessed before calling this method.
Parameters:
| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `dataset_config` | `DatasetConfig` | Desired configuration of the dataset. | required |
| `disable_indices_cache` | `bool` | Whether to disable caching of the dataset indices. This is useful when the dataset is used in many different configurations and you want to save disk space. | `False` |
Source code in cesnet_datazoo\\datasets\\cesnet_dataset.py
def set_dataset_config_and_initialize(self, dataset_config: DatasetConfig, disable_indices_cache: bool = False) -> None:\n \"\"\"\n Initialize train, validation, and test sets. Data cannot be accessed before calling this method.\n\n Parameters:\n dataset_config: Desired configuration of the dataset.\n disable_indices_cache: Whether to disable caching of the dataset indices. This is useful when the dataset is used in many different configurations and you want to save disk space.\n \"\"\"\n self.dataset_config = dataset_config\n self._clear()\n self._initialize_train_val_test(disable_indices_cache=disable_indices_cache)\n
"},{"location":"reference_cesnet_dataset/#datasets.cesnet_dataset.CesnetDataset.get_train_dataloader","title":"get_train_dataloader","text":"get_train_dataloader() -> DataLoader\n
Provides a PyTorch `DataLoader` for training. The dataloader is created on the first call and then cached. When the dataloader is iterated in random order, the last incomplete batch is dropped. The dataloader is configured with the following config attributes:

| Dataset config | Description |
| --- | --- |
| `batch_size` | Number of samples per batch. |
| `train_workers` | Number of workers for loading train data. |
| `train_dataloader_order` | Whether to load train data in sequential or random order. See config.DataLoaderOrder. |
| `train_dataloader_seed` | Seed for loading train data in random order. |

Returns:

| Type | Description |
| --- | --- |
| `DataLoader` | Train data as an iterable dataloader. See using dataloaders for more details. |

Source code in cesnet_datazoo\\datasets\\cesnet_dataset.py
def get_train_dataloader(self) -> DataLoader:\n \"\"\"\n Provides a PyTorch [`DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader) for training. The dataloader is created on the first call and then cached.\n When the dataloader is iterated in random order, the last incomplete batch is dropped.\n The dataloader is configured with the following config attributes:\n\n | Dataset config | Description |\n | ---------------------------- | ------------------------------------------------------------------------------------------ |\n | `batch_size` | Number of samples per batch. |\n | `train_workers` | Number of workers for loading train data. |\n | `train_dataloader_order` | Whether to load train data in sequential or random order. See [config.DataLoaderOrder][]. |\n | `train_dataloader_seed` | Seed for loading train data in random order. |\n\n Returns:\n Train data as an iterable dataloader. See [using dataloaders][using-dataloaders] for more details.\n \"\"\"\n if self.dataset_config is None:\n raise ValueError(\"Dataset is not initialized, use set_dataset_config_and_initialize() before getting train dataloader\")\n if not self.dataset_config.need_train_set:\n raise ValueError(\"Train dataloader is not available when need_train_set is false\")\n assert self.train_dataset\n if self.train_dataloader:\n return self.train_dataloader\n # Create sampler according to the selected order\n if self.dataset_config.train_dataloader_order == DataLoaderOrder.RANDOM:\n if self.dataset_config.train_dataloader_seed is not None:\n generator = torch.Generator()\n generator.manual_seed(self.dataset_config.train_dataloader_seed)\n else:\n generator = None\n self.train_dataloader_sampler = RandomSampler(self.train_dataset, generator=generator)\n self.train_dataloader_drop_last = True\n elif self.dataset_config.train_dataloader_order == DataLoaderOrder.SEQUENTIAL:\n self.train_dataloader_sampler = SequentialSampler(self.train_dataset)\n 
self.train_dataloader_drop_last = False\n else: assert_never(self.dataset_config.train_dataloader_order)\n # Create dataloader\n batch_sampler = BatchSampler(sampler=self.train_dataloader_sampler, batch_size=self.dataset_config.batch_size, drop_last=self.train_dataloader_drop_last)\n train_dataloader = DataLoader(\n self.train_dataset,\n num_workers=self.dataset_config.train_workers,\n worker_init_fn=worker_init_fn,\n collate_fn=self._collate_fn,\n persistent_workers=self.dataset_config.train_workers > 0,\n batch_size=None,\n sampler=batch_sampler,)\n if self.dataset_config.train_workers == 0:\n self.train_dataset.pytables_worker_init()\n self.train_dataloader = train_dataloader\n return train_dataloader\n
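To make the effect of dropping the last incomplete batch concrete, here is a minimal sketch in plain Python. The helper `num_batches` is hypothetical and not part of cesnet-datazoo; it only reproduces the batch-count arithmetic of a batch sampler with and without `drop_last`.

```python
import math


def num_batches(num_samples: int, batch_size: int, drop_last: bool) -> int:
    # Hypothetical helper: number of batches yielded per epoch.
    if drop_last:
        # Random order in the source above: the incomplete tail batch is skipped
        return num_samples // batch_size
    # Sequential order in the source above: the incomplete tail batch is kept
    return math.ceil(num_samples / batch_size)


random_order_batches = num_batches(1000, 192, drop_last=True)
sequential_order_batches = num_batches(1000, 192, drop_last=False)
```

With 1000 train samples and a batch size of 192, random order yields 5 full batches (40 samples are skipped in that epoch), while sequential order yields 6 batches, the last one holding 40 samples.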
"},{"location":"reference_cesnet_dataset/#datasets.cesnet_dataset.CesnetDataset.get_val_dataloader","title":"get_val_dataloader","text":"get_val_dataloader() -> DataLoader\n
Provides a PyTorch `DataLoader` for validation. The dataloader is created on the first call and then cached. The dataloader is configured with the following config attributes:

| Dataset config | Description |
| --- | --- |
| `test_batch_size` | Number of samples per batch for loading validation and test data. |
| `val_workers` | Number of workers for loading validation data. |

Returns:

| Type | Description |
| --- | --- |
| `DataLoader` | Validation data as an iterable dataloader. See using dataloaders for more details. |

Source code in cesnet_datazoo\\datasets\\cesnet_dataset.py
def get_val_dataloader(self) -> DataLoader:\n \"\"\"\n Provides a PyTorch [`DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader) for validation.\n The dataloader is created on the first call and then cached.\n The dataloader is configured with the following config attributes:\n\n | Dataset config | Description |\n | ------------------| ------------------------------------------------------------------|\n | `test_batch_size` | Number of samples per batch for loading validation and test data. |\n | `val_workers` | Number of workers for loading validation data. |\n\n Returns:\n Validation data as an iterable dataloader. See [using dataloaders][using-dataloaders] for more details.\n \"\"\"\n if self.dataset_config is None:\n raise ValueError(\"Dataset is not initialized, use set_dataset_config_and_initialize() before getting validation dataloader\")\n if not self.dataset_config.need_val_set:\n raise ValueError(\"Validation dataloader is not available when need_val_set is false\")\n assert self.val_dataset is not None\n if self.val_dataloader:\n return self.val_dataloader\n batch_sampler = BatchSampler(sampler=SequentialSampler(self.val_dataset), batch_size=self.dataset_config.test_batch_size, drop_last=False)\n val_dataloader = DataLoader(\n self.val_dataset,\n num_workers=self.dataset_config.val_workers,\n worker_init_fn=worker_init_fn,\n collate_fn=self._collate_fn,\n persistent_workers=self.dataset_config.val_workers > 0,\n batch_size=None,\n sampler=batch_sampler,)\n if self.dataset_config.val_workers == 0:\n self.val_dataset.pytables_worker_init()\n self.val_dataloader = val_dataloader\n return val_dataloader\n
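The create-on-first-call-then-cache behavior can be illustrated without PyTorch. The class `CachedLoaderHolder` below is a hypothetical stand-in, not part of cesnet-datazoo; it assumes only that an expensive factory should run once and that later calls return the same object.

```python
from typing import Any, Callable, Optional


class CachedLoaderHolder:
    # Hypothetical stand-in for the caching done by the get_*_dataloader methods.
    def __init__(self, factory: Callable[[], Any]) -> None:
        self._factory = factory
        self._loader: Optional[Any] = None
        self.times_created = 0

    def get(self) -> Any:
        # Build the loader only on the first call, then reuse it
        if self._loader is None:
            self._loader = self._factory()
            self.times_created += 1
        return self._loader


holder = CachedLoaderHolder(factory=lambda: ['batch-0', 'batch-1'])
first = holder.get()
second = holder.get()
```

Both calls return the same object, which is why changing the config after the first call does not affect an already created dataloader in this sketch.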
"},{"location":"reference_cesnet_dataset/#datasets.cesnet_dataset.CesnetDataset.get_test_dataloader","title":"get_test_dataloader","text":"get_test_dataloader() -> DataLoader\n
Provides a PyTorch `DataLoader` for testing. The dataloader is created on the first call and then cached.
When the dataset is used in the open-world setting and unknown classes are defined, the test dataloader returns `test_known_size` samples of known classes followed by `test_unknown_size` samples of unknown classes.
The dataloader is configured with the following config attributes:

| Dataset config | Description |
| --- | --- |
| `test_batch_size` | Number of samples per batch for loading validation and test data. |
| `test_workers` | Number of workers for loading test data. |

Returns:

| Type | Description |
| --- | --- |
| `DataLoader` | Test data as an iterable dataloader. See using dataloaders for more details. |

Source code in cesnet_datazoo\\datasets\\cesnet_dataset.py
def get_test_dataloader(self) -> DataLoader:\n \"\"\"\n Provides a PyTorch [`DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader) for testing.\n The dataloader is created on the first call and then cached.\n\n When the dataset is used in the open-world setting, and unknown classes are defined,\n the test dataloader returns `test_known_size` samples of known classes followed by `test_unknown_size` samples of unknown classes.\n\n The dataloader is configured with the following config attributes:\n\n | Dataset config | Description |\n | ------------------| ------------------------------------------------------------------|\n | `test_batch_size` | Number of samples per batch for loading validation and test data. |\n | `test_workers` | Number of workers for loading test data. |\n\n Returns:\n Test data as an iterable dataloader. See [using dataloaders][using-dataloaders] for more details.\n \"\"\"\n if self.dataset_config is None:\n raise ValueError(\"Dataset is not initialized, use set_dataset_config_and_initialize() before getting test dataloader\")\n if not self.dataset_config.need_test_set:\n raise ValueError(\"Test dataloader is not available when need_test_set is false\")\n assert self.test_dataset is not None\n if self.test_dataloader:\n return self.test_dataloader\n batch_sampler = BatchSampler(sampler=SequentialSampler(self.test_dataset), batch_size=self.dataset_config.test_batch_size, drop_last=False)\n test_dataloader = DataLoader(\n self.test_dataset,\n num_workers=self.dataset_config.test_workers,\n worker_init_fn=worker_init_fn,\n collate_fn=self._collate_fn,\n persistent_workers=False,\n batch_size=None,\n sampler=batch_sampler,)\n if self.dataset_config.test_workers == 0:\n self.test_dataset.pytables_worker_init()\n self.test_dataloader = test_dataloader\n return test_dataloader\n
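In the open-world setting, the label encoder is fitted on the known applications and one extra class for unknown traffic is appended at the end. The sketch below imitates that layout in plain Python. The function names and the placeholder string `'unknown'` are assumptions made for illustration (the actual value of `UNKNOWN_STR_LABEL` and the body of `_encode_labels_with_unknown` are not shown here), as is the choice to map every non-known label to the last index.

```python
from typing import Dict, List


def build_label_mapping(known_apps: List[str], unknown_label: str = 'unknown') -> Dict[str, int]:
    # Known classes are sorted (as scikit-learn's LabelEncoder does) and get
    # indices 0..K-1; the unknown class is appended last with index K.
    classes = sorted(set(known_apps)) + [unknown_label]
    return {name: idx for idx, name in enumerate(classes)}


def encode_labels(labels: List[str], mapping: Dict[str, int], unknown_label: str = 'unknown') -> List[int]:
    # Assumption: any application outside the known set maps to the unknown index
    unknown_idx = mapping[unknown_label]
    return [mapping.get(label, unknown_idx) for label in labels]


mapping = build_label_mapping(['youtube', 'dns'])
# Known samples first, unknown samples after, as in the test dataloader
encoded = encode_labels(['dns', 'youtube', 'skype'], mapping)
```

Here `encoded` is `[0, 1, 2]`: the two known applications keep their own indices and the application that is not in the known set falls into the final class.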
"},{"location":"reference_cesnet_dataset/#datasets.cesnet_dataset.CesnetDataset.get_dataloaders","title":"get_dataloaders","text":"get_dataloaders() -> (\n tuple[DataLoader, DataLoader, DataLoader]\n)\n
Gets train, validation, and test dataloaders in one call.
Source code in cesnet_datazoo\\datasets\\cesnet_dataset.py
def get_dataloaders(self) -> tuple[DataLoader, DataLoader, DataLoader]:\n \"\"\"Gets train, validation, and test dataloaders in one call.\"\"\"\n if self.dataset_config is None:\n raise ValueError(\"Dataset is not initialized, use set_dataset_config_and_initialize() before getting dataloaders\")\n train_dataloader = self.get_train_dataloader()\n val_dataloader = self.get_val_dataloader()\n test_dataloader = self.get_test_dataloader()\n return train_dataloader, val_dataloader, test_dataloader\n
"},{"location":"reference_cesnet_dataset/#datasets.cesnet_dataset.CesnetDataset.get_train_df","title":"get_train_df","text":"get_train_df(flatten_ppi: bool = False) -> pd.DataFrame\n
Creates a train Pandas `DataFrame`. The dataframe is in sequential (datetime) order. Consider shuffling the dataframe if needed.
Memory usage
The whole train set is loaded into memory. If the dataset size is larger than `'S'`, consider using `get_train_dataloader` instead.
Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `flatten_ppi` | `bool` | Whether to flatten the PPI sequence into individual columns (named `IPT_X`, `DIR_X`, `SIZE_X`, `PUSH_X`, X being the index of the packet) or keep one `PPI` column with 2D data. | `False` |

Returns:

| Type | Description |
| --- | --- |
| `DataFrame` | Train data as a dataframe. |

Source code in cesnet_datazoo\\datasets\\cesnet_dataset.py
def get_train_df(self, flatten_ppi: bool = False) -> pd.DataFrame:\n \"\"\"\n Creates a train Pandas [`DataFrame`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html). The dataframe is in sequential (datetime) order. Consider shuffling the dataframe if needed.\n\n !!! warning \"Memory usage\"\n\n The whole train set is loaded into memory. If the dataset size is larger than `'S'`, consider using `get_train_dataloader` instead.\n\n Parameters:\n flatten_ppi: Whether to flatten the PPI sequence into individual columns (named `IPT_X`, `DIR_X`, `SIZE_X`, `PUSH_X`, *X* being the index of the packet) or keep one `PPI` column with 2D data.\n\n Returns:\n Train data as a dataframe.\n \"\"\"\n self._check_before_dataframe(check_train=True)\n assert self.dataset_config is not None and self.train_dataset is not None\n if len(self.train_dataset) > DATAFRAME_SAMPLES_WARNING_THRESHOLD:\n warnings.warn(f\"Train set has ({len(self.train_dataset)} samples), consider using get_train_dataloader() instead\")\n train_dataloader = self.get_train_dataloader()\n assert isinstance(train_dataloader.sampler, BatchSampler) and self.train_dataloader_sampler is not None\n # Read dataloader in sequential order\n train_dataloader.sampler.sampler = SequentialSampler(self.train_dataset)\n train_dataloader.sampler.drop_last = False\n feature_names = self.dataset_config.get_feature_names(flatten_ppi=flatten_ppi)\n df = create_df_from_dataloader(dataloader=train_dataloader,\n feature_names=feature_names,\n flatten_ppi=flatten_ppi,\n silent=self.silent)\n # Restore the original dataloader sampler and drop_last\n train_dataloader.sampler.sampler = self.train_dataloader_sampler\n train_dataloader.sampler.drop_last = self.train_dataloader_drop_last\n return df\n
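As an illustration of what `flatten_ppi=True` implies for the column layout, the following sketch generates flattened PPI column names. The helper `flattened_ppi_columns` is hypothetical; the sequence length of 30 packets follows the PPI shape described in the dataloader documentation, while the 1-based packet index and the channel-by-channel column order are assumptions, not confirmed library behavior.

```python
from typing import List


def flattened_ppi_columns(num_packets: int = 30, use_push_flags: bool = False) -> List[str]:
    # Three channels by default; a fourth PUSH channel when push flags are used
    prefixes = ['IPT', 'DIR', 'SIZE']
    if use_push_flags:
        prefixes.append('PUSH')
    # Assumption: packets are numbered from 1 and columns are grouped by channel
    return [f'{prefix}_{i}' for prefix in prefixes for i in range(1, num_packets + 1)]


columns = flattened_ppi_columns()
columns_with_push = flattened_ppi_columns(use_push_flags=True)
```

Under these assumptions, a flattened dataframe carries 90 PPI columns, or 120 when TCP push flags are enabled, instead of a single `PPI` column with 2D data.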
"},{"location":"reference_cesnet_dataset/#datasets.cesnet_dataset.CesnetDataset.get_val_df","title":"get_val_df","text":"get_val_df(flatten_ppi: bool = False) -> pd.DataFrame\n
Creates a validation Pandas `DataFrame`. The dataframe is in sequential (datetime) order.
Memory usage
The whole validation set is loaded into memory. If the dataset size is larger than `'S'`, consider using `get_val_dataloader` instead.
Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `flatten_ppi` | `bool` | Whether to flatten the PPI sequence into individual columns (named `IPT_X`, `DIR_X`, `SIZE_X`, `PUSH_X`, X being the index of the packet) or keep one `PPI` column with 2D data. | `False` |

Returns:

| Type | Description |
| --- | --- |
| `DataFrame` | Validation data as a dataframe. |

Source code in cesnet_datazoo\\datasets\\cesnet_dataset.py
def get_val_df(self, flatten_ppi: bool = False) -> pd.DataFrame:\n \"\"\"\n Creates validation Pandas [`DataFrame`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html). The dataframe is in sequential (datetime) order.\n\n !!! warning \"Memory usage\"\n\n The whole validation set is loaded into memory. If the dataset size is larger than `'S'`, consider using `get_val_dataloader` instead.\n\n Parameters:\n flatten_ppi: Whether to flatten the PPI sequence into individual columns (named `IPT_X`, `DIR_X`, `SIZE_X`, `PUSH_X`, *X* being the index of the packet) or keep one `PPI` column with 2D data.\n\n Returns:\n Validation data as a dataframe.\n \"\"\"\n self._check_before_dataframe(check_val=True)\n assert self.dataset_config is not None and self.val_dataset is not None\n if len(self.val_dataset) > DATAFRAME_SAMPLES_WARNING_THRESHOLD:\n warnings.warn(f\"Validation set has ({len(self.val_dataset)} samples), consider using get_val_dataloader() instead\")\n feature_names = self.dataset_config.get_feature_names(flatten_ppi=flatten_ppi)\n return create_df_from_dataloader(dataloader=self.get_val_dataloader(),\n feature_names=feature_names,\n flatten_ppi=flatten_ppi,\n silent=self.silent)\n
"},{"location":"reference_cesnet_dataset/#datasets.cesnet_dataset.CesnetDataset.get_test_df","title":"get_test_df","text":"get_test_df(flatten_ppi: bool = False) -> pd.DataFrame\n
Creates a test Pandas `DataFrame`. The dataframe is in sequential (datetime) order.
When the dataset is used in the open-world setting and unknown classes are defined, the returned test dataframe is composed of `test_known_size` samples of known classes followed by `test_unknown_size` samples of unknown classes.
Memory usage
The whole test set is loaded into memory. If the dataset size is larger than `'S'`, consider using `get_test_dataloader` instead.
Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `flatten_ppi` | `bool` | Whether to flatten the PPI sequence into individual columns (named `IPT_X`, `DIR_X`, `SIZE_X`, `PUSH_X`, X being the index of the packet) or keep one `PPI` column with 2D data. | `False` |

Returns:

| Type | Description |
| --- | --- |
| `DataFrame` | Test data as a dataframe. |

Source code in cesnet_datazoo\\datasets\\cesnet_dataset.py
def get_test_df(self, flatten_ppi: bool = False) -> pd.DataFrame:\n \"\"\"\n Creates test Pandas [`DataFrame`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html). The dataframe is in sequential (datetime) order.\n\n\n When the dataset is used in the open-world setting, and unknown classes are defined,\n the returned test dataframe is composed of `test_known_size` samples of known classes followed by `test_unknown_size` samples of unknown classes.\n\n\n !!! warning \"Memory usage\"\n\n The whole test set is loaded into memory. If the dataset size is larger than `'S'`, consider using `get_test_dataloader` instead.\n\n Parameters:\n flatten_ppi: Whether to flatten the PPI sequence into individual columns (named `IPT_X`, `DIR_X`, `SIZE_X`, `PUSH_X`, *X* being the index of the packet) or keep one `PPI` column with 2D data.\n\n Returns:\n Test data as a dataframe.\n \"\"\"\n self._check_before_dataframe(check_test=True)\n assert self.dataset_config is not None and self.test_dataset is not None\n if len(self.test_dataset) > DATAFRAME_SAMPLES_WARNING_THRESHOLD:\n warnings.warn(f\"Test set has ({len(self.test_dataset)} samples), consider using get_test_dataloader() instead\")\n feature_names = self.dataset_config.get_feature_names(flatten_ppi=flatten_ppi)\n return create_df_from_dataloader(dataloader=self.get_test_dataloader(),\n feature_names=feature_names,\n flatten_ppi=flatten_ppi,\n silent=self.silent)\n
"},{"location":"reference_cesnet_dataset/#datasets.cesnet_dataset.CesnetDataset.get_num_classes","title":"get_num_classes","text":"get_num_classes() -> int\n
Returns the number of classes in the current configuration of the dataset.
Source code in cesnet_datazoo\\datasets\\cesnet_dataset.py
def get_num_classes(self) -> int:\n \"\"\"Returns the number of classes in the current configuration of the dataset.\"\"\"\n if self.class_info is None:\n raise ValueError(\"Dataset is not initialized, use set_dataset_config_and_initialize() before getting the number of classes\")\n return self.class_info.num_classes\n
"},{"location":"reference_cesnet_dataset/#datasets.cesnet_dataset.CesnetDataset.get_known_apps","title":"get_known_apps","text":"get_known_apps() -> list[str]\n
Returns the list of known applications in the current configuration of the dataset.
Source code in cesnet_datazoo\\datasets\\cesnet_dataset.py
def get_known_apps(self) -> list[str]:\n \"\"\"Returns the list of known applications in the current configuration of the dataset.\"\"\"\n if self.class_info is None:\n raise ValueError(\"Dataset is not initialized, use set_dataset_config_and_initialize() before getting known apps\")\n return self.class_info.known_apps\n
"},{"location":"reference_cesnet_dataset/#datasets.cesnet_dataset.CesnetDataset.get_unknown_apps","title":"get_unknown_apps","text":"get_unknown_apps() -> list[str]\n
Returns the list of unknown applications in the current configuration of the dataset.
Source code in cesnet_datazoo\\datasets\\cesnet_dataset.py
def get_unknown_apps(self) -> list[str]:\n \"\"\"Returns the list of unknown applications in the current configuration of the dataset.\"\"\"\n if self.class_info is None:\n raise ValueError(\"Dataset is not initialized, use set_dataset_config_and_initialize() before getting unknown apps\")\n return self.class_info.unknown_apps\n
"},{"location":"reference_cesnet_dataset/#datasets.cesnet_dataset.CesnetDataset.compute_dataset_statistics","title":"compute_dataset_statistics","text":"compute_dataset_statistics(\n num_samples: int | Literal[\"all\"] = 10000000,\n num_workers: int = 4,\n batch_size: int = 16384,\n disabled_apps: Optional[list[str]] = None,\n) -> None\n
Computes dataset statistics and saves them to the `statistics_path` folder.
Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `num_samples` | `int` or `Literal['all']` | Number of samples to use for computing the statistics. | `10000000` |
| `num_workers` | `int` | Number of workers for loading data. | `4` |
| `batch_size` | `int` | Number of samples per batch for loading data. | `16384` |
| `disabled_apps` | `Optional[list[str]]` | List of applications to exclude from the statistics. | `None` |
Source code in cesnet_datazoo\\datasets\\cesnet_dataset.py
def compute_dataset_statistics(self, num_samples: int | Literal[\"all\"] = 10_000_000, num_workers: int = 4, batch_size: int = 16384, disabled_apps: Optional[list[str]] = None) -> None:\n \"\"\"\n Computes dataset statistics and saves them to the `statistics_path` folder.\n\n Parameters:\n num_samples: Number of samples to use for computing the statistics.\n num_workers: Number of workers for loading data.\n batch_size: Number of samples per batch for loading data.\n disabled_apps: List of applications to exclude from the statistics.\n \"\"\"\n if disabled_apps:\n bad_disabled_apps = [a for a in disabled_apps if a not in self.available_classes]\n if len(bad_disabled_apps) > 0:\n raise ValueError(f\"Bad applications in disabled_apps {bad_disabled_apps}. Use applications available in dataset.available_classes\")\n if not os.path.exists(self.statistics_path):\n os.mkdir(self.statistics_path)\n compute_dataset_statistics(database_path=self.database_path,\n tables_app_enum=self._tables_app_enum,\n tables_cat_enum=self._tables_cat_enum,\n output_dir=self.statistics_path,\n packet_histograms=self.metadata.packet_histograms,\n flowstats_features_boolean=self.metadata.flowstats_features_boolean,\n protocol=self.metadata.protocol,\n extra_fields=not self.name.startswith(\"CESNET-TLS22\"),\n disabled_apps=disabled_apps if disabled_apps is not None else [],\n num_samples=num_samples,\n num_workers=num_workers,\n batch_size=batch_size,\n silent=self.silent)\n
"},{"location":"reference_dataset_config/","title":"Config class","text":""},{"location":"reference_dataset_config/#config.DatasetConfig","title":"config.DatasetConfig","text":"The main class for the configuration of:
When initializing this class, pass a `CesnetDataset` instance to be configured and the desired configuration. Available options are here.
Attributes:
Name Type Descriptiondataset
InitVar[CesnetDataset]
The dataset instance to be configured.
data_root
str
Taken from the dataset instance.
database_filename
str
Taken from the dataset instance.
database_path
str
Taken from the dataset instance.
servicemap_path
str
Taken from the dataset instance.
flowstats_features
list[str]
Taken from dataset.metadata.flowstats_features
.
flowstats_features_boolean
list[str]
Taken from dataset.metadata.flowstats_features_boolean
.
flowstats_features_phist
list[str]
Taken from dataset.metadata.packet_histograms
if use_packet_histograms
is true, otherwise an empty list.
other_fields
list[str]
Taken from dataset.metadata.other_fields
if return_other_fields
is true, otherwise an empty list.
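The three flowstats lists above determine how many flow statistics features a configuration yields. According to get_flowstats_features_len in the source, the count is the length of flowstats_features plus the length of flowstats_features_boolean plus PHIST_BIN_COUNT entries per packet histogram. The following is a minimal sketch of that arithmetic; the feature names and the bin count of 8 are assumptions made for illustration, since the actual constant is not shown on this page.

```python
# Stand-in for the PHIST_BIN_COUNT constant of the package; the value 8 is an assumption.
PHIST_BIN_COUNT = 8


def flowstats_features_len(features, features_boolean, features_phist, bin_count=PHIST_BIN_COUNT):
    # Each packet histogram expands into bin_count individual bin features.
    return len(features) + len(features_boolean) + bin_count * len(features_phist)


# Hypothetical feature lists, for illustration only.
n_features = flowstats_features_len(
    features=['BYTES', 'PACKETS', 'DURATION'],
    features_boolean=['FLOW_ENDREASON_IDLE', 'FLOW_ENDREASON_END'],
    features_phist=['PHIST_SRC_SIZES', 'PHIST_DST_SIZES'],
)
print(n_features)
```

With use_packet_histograms disabled, flowstats_features_phist is empty and the last term drops out.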
Configuration options
Attributes:
Name Type Description
need_train_set
bool
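Use to disable the train set. When the train set is disabled, apps_selection has to be FIXED and no train dates or train period can be specified. Default: True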
need_val_set
bool
Use to disable the validation set. When need_train_set
is false, the validation set will also be disabled. Default: True
need_test_set
bool
Use to disable the test set. Default: True
train_period_name
str
Name of the train period. See instructions.
train_dates
list[str]
Dates used for creating a train set.
train_dates_weigths
Optional[list[int]]
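Weights for a non-uniform distribution of samples across train dates. Has to have the same length as train_dates, and train_size cannot be all when weights are specified.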
val_approach
ValidationApproach
How a validation set should be created. Either split train data into train and validation or have a separate validation period. Default: SPLIT_FROM_TRAIN
train_val_split_fraction
float
The fraction of validation samples when splitting from the train set. Default: 0.2
val_period_name
str
Name of the validation period. See instructions.
val_dates
list[str]
Dates used for creating a validation set.
test_period_name
str
Name of the test period. See instructions.
test_dates
list[str]
Dates used for creating a test set.
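The configuration also checks the time order of the selected dates: when some test dates are before or equal to the last train date, a warning is issued, because this might lead to improper evaluation. The following is a minimal sketch of this check, assuming dates in the YYYYMMDD format parsed by __post_init__. The function name and its boolean return value are hypothetical and not part of the package.

```python
import warnings
from datetime import datetime


def check_time_order(train_dates, test_dates):
    # Dates are YYYYMMDD strings, the format parsed in __post_init__.
    train = [datetime.strptime(d, '%Y%m%d').date() for d in train_dates]
    test = [datetime.strptime(d, '%Y%m%d').date() for d in test_dates]
    # An overlap only triggers a warning; the configuration stays valid.
    if train and test and min(test) <= max(train):
        warnings.warn('Some test dates are before or equal to the last train date.')
        return False
    return True


# Hypothetical dates, for illustration only.
ordered = check_time_order(['20220101', '20220102'], ['20220103'])
```

The same comparison applies to validation dates when the validation approach uses a separate validation period.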
apps_selection
AppSelection
How to select application classes. Default: ALL_KNOWN
apps_selection_topx
int
Number of top applications (X) to take as known. Has to be greater than 0 when apps_selection is TOPX_KNOWN.
apps_selection_background_unknown
list[str]
Provide a list of background traffic classes to be used as unknown.
apps_selection_fixed_known
list[str]
Provide a list of manually selected known applications.
apps_selection_fixed_unknown
list[str]
Provide a list of manually selected unknown applications.
disabled_apps
list[str]
List of applications to be disabled and not used at all.
min_train_samples_check
MinTrainSamplesCheck
How to handle applications with not enough training samples. Default: DISABLE_APPS
min_train_samples_per_app
int
Defines the threshold for not enough: the minimum number of training samples an application should have. Default: 100
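The apps_selection options above are mutually exclusive: each selection mode requires its own option and rejects the options of the other modes. The following sketch restates the rules enforced in __post_init__ as a lookup table. The function and table names are hypothetical, the modes are written as plain strings instead of AppSelection members, and the checks of disabled_apps and of class names against dataset.available_classes are left out.

```python
APPS_OPTIONS = (
    'apps_selection_topx',
    'apps_selection_background_unknown',
    'apps_selection_fixed_known',
    'apps_selection_fixed_unknown',
)
# Per selection mode: (options that have to be set, options that may be set).
RULES = {
    'ALL_KNOWN': (set(), set()),
    'TOPX_KNOWN': ({'apps_selection_topx'}, {'apps_selection_topx'}),
    'BACKGROUND_UNKNOWN': (
        {'apps_selection_background_unknown'},
        {'apps_selection_background_unknown'},
    ),
    'FIXED': (
        {'apps_selection_fixed_known'},
        {'apps_selection_fixed_known', 'apps_selection_fixed_unknown'},
    ),
}


def check_apps_selection(mode, **options):
    unexpected = set(options) - set(APPS_OPTIONS)
    if unexpected:
        raise TypeError(f'Unexpected options: {sorted(unexpected)}')
    # An option counts as set when it is non-zero or non-empty.
    given = {name for name, value in options.items() if value}
    required, allowed = RULES[mode]
    missing = required - given
    forbidden = given - allowed
    if missing:
        raise ValueError(f'{sorted(missing)} has to be specified for {mode}')
    if forbidden:
        raise ValueError(f'{sorted(forbidden)} cannot be specified for {mode}')


# Hypothetical application names, for illustration only.
check_apps_selection(
    'FIXED',
    apps_selection_fixed_known=['app-a'],
    apps_selection_fixed_unknown=['app-b'],
)
```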
random_state
int
Fix all random processes performed during dataset initialization. Default: 420
fold_id
int
To perform N-fold cross-validation, set this to 1..N
. Each fold will use the same configuration but a different random seed. Default: 0
train_workers
int
Number of workers for loading train data. 0
means that the data will be loaded in the main process. Default: 4
test_workers
int
Number of workers for loading test data. 0
means that the data will be loaded in the main process. Default: 1
val_workers
int
Number of workers for loading validation data. 0
means that the data will be loaded in the main process. Default: 1
batch_size
int
Number of samples per batch. Default: 192
test_batch_size
int
Number of samples per batch for loading validation and test data. Default: 2048
preload_val
bool
Whether to dump the validation set with numpy.savez_compressed
and preload it in future runs. Useful when running a lot of experiments with the same dataset configuration. Default: True
preload_test
bool
Whether to dump the test set with numpy.savez_compressed
and preload it in future runs. Default: False
train_size
int | Literal['all']
Size of the train set. See instructions. Default: all
val_known_size
int | Literal['all']
Size of the validation set. See instructions. Default: all
test_known_size
int | Literal['all']
Size of the test set. See instructions. Default: all
val_unknown_size
int | Literal['all']
Size of the unknown classes validation set. Use for evaluation in the open-world setting. Default: 0
test_unknown_size
int | Literal['all']
Size of the unknown classes test set. Use for evaluation in the open-world setting. Default: 0
train_dataloader_order
DataLoaderOrder
Whether to load train data in sequential or random order. Default: RANDOM
train_dataloader_seed
Optional[int]
Seed for loading train data in random order. Default: None
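The random_state and fold_id options also appear in the directory where data for a configuration are stored. In the source, _get_train_data_path joins data_root, a train-data folder, a folder named from a 10-character parameter hash and random_state, and a fold folder. The following is a minimal sketch of that scheme. TrainParams is a hypothetical stand-in for the TrainDataParams class of the package, and the file and period names are made up.

```python
import dataclasses
import hashlib
import json
import os


@dataclasses.dataclass
class TrainParams:
    # Hypothetical stand-in for the TrainDataParams class of the package.
    database_filename: str
    train_period_name: str
    train_tables_paths: list
    disabled_apps: list


def train_data_path(data_root, params, random_state, fold_id):
    # Serialize with sorted keys so that equal parameters give equal hashes.
    payload = json.dumps(dataclasses.asdict(params), sort_keys=True, default=str)
    params_hash = hashlib.sha256(payload.encode()).hexdigest()[:10]
    return os.path.join(data_root, 'train-data', f'{params_hash}_{random_state}', f'fold_{fold_id}')


# Made-up parameter values, for illustration only.
params = TrainParams('example.h5', 'example-period', ['/flows/D20220101'], [])
path = train_data_path('data', params, random_state=420, fold_id=0)
print(path)
```

Two runs with the same parameters, random_state, and fold_id map to the same directory, while changing any of them gives a different one.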
return_other_fields
bool
Whether to return auxiliary fields, such as communicating hosts, flow times, and more fields extracted from the ClientHello message. Default: False
return_tensors
bool
Use for returning torch.Tensor
from dataloaders. Dataframes are not available when this option is used. Default: False
use_packet_histograms
bool
Whether to use packet histogram features, if available in the dataset. Default: False
use_tcp_features
bool
Whether to use TCP features, if available in the dataset. Default: False
use_push_flags
bool
Whether to use push flags in packet sequences, if available in the dataset. Default: False
fit_scalers_samples
int | float
Used when a scaling transformation is configured and requires fitting. If float, the fraction of train samples used for fitting, which has to be greater than 0 and at most 1. Otherwise, the absolute number of samples. Default: 0.25
ppi_transform
Optional[Callable]
Transform function for PPI sequences. See the transforms page for more information. Default: None
flowstats_transform
Optional[Callable]
Transform function for flow statistics. See the transforms page for more information. Default: None
flowstats_phist_transform
Optional[Callable]
Transform function for packet histograms. See the transforms page for more information. Default: None
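The use_push_flags option above changes the layout of PPI sequences: per get_ppi_feature_names in the source, the flattened PPI features are the inter-packet times, directions, and sizes of each packet position, followed by push flags when enabled. The following is a minimal sketch of that naming, assuming a maximum sequence length of 30, which matches the PPI shape stated in the dataloader documentation. The standalone function is for illustration and is not the package API.

```python
# Stand-in for the PPI_MAX_LEN constant; 30 matches the documented PPI shape.
PPI_MAX_LEN = 30


def ppi_feature_names(use_push_flags=False, max_len=PPI_MAX_LEN):
    names = []
    # Three channels are always present: inter-packet times, directions, sizes.
    for prefix in ('IPT', 'DIR', 'SIZE'):
        names += [f'{prefix}_{i}' for i in range(1, max_len + 1)]
    # The fourth channel is present only when push flags are used.
    if use_push_flags:
        names += [f'PUSH_{i}' for i in range(1, max_len + 1)]
    return names


names_without_push = ppi_feature_names()
names_with_push = ppi_feature_names(use_push_flags=True)
print(len(names_without_push), len(names_with_push))
```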
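How to configure train, validation, and test sets
There are three options for how to define train/validation/test dates.
1. Choose a predefined time period (train_period_name, val_period_name, or test_period_name) available in dataset.time_periods and leave the list of dates (train_dates, val_dates, or test_dates) empty.
2. Provide a list of dates and a name for the time period. The dates are checked against dataset.available_dates.
3. Do not specify anything and use the dataset's defaults dataset.default_train_period_name and dataset.default_test_period_name.
There are two options for configuring sizes of train/validation/test sets.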
1. Select an appropriate dataset size (default is S) when creating the CesnetDataset instance and leave train_size, val_known_size, and test_known_size with their default all value. This will create train/validation/test sets with all samples available in the selected dataset size (depending on the selected dates and validation approach).
2. Provide exact sizes in train_size, val_known_size, and test_known_size. This will create train/validation/test sets of the given sizes by taking a random subset. This is especially useful when you use the ORIG dataset size and want to control the size of experiments.
Tip
The default approach for creating a validation set is to randomly split the train data into train and validation. The second approach is to define separate validation dates. See ValidationApproach.
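The three options for defining dates can be sketched as a single resolution function. It follows the checks that __post_init__ performs for the train and test sets; for a separate validation period there is no default, so the third option does not apply there. The function name and the period names in the example are hypothetical.

```python
def resolve_dates(period_name, dates, time_periods, default_period_name):
    # Option 2: explicit dates have to come with a name for the time period.
    if dates and not period_name:
        raise ValueError('A period name has to be specified when dates are set')
    if dates:
        return period_name, list(dates)
    # Option 1: a predefined time period and an empty list of dates.
    if period_name:
        if period_name not in time_periods:
            raise ValueError(f'Unknown period name {period_name}')
        return period_name, list(time_periods[period_name])
    # Option 3: nothing specified, fall back to the default period of the dataset.
    return default_period_name, list(time_periods[default_period_name])


# Hypothetical time periods, for illustration only.
time_periods = {
    'WEEK-A': ['20220101', '20220102'],
    'WEEK-B': ['20220108'],
}
predefined = resolve_dates('WEEK-B', [], time_periods, 'WEEK-A')
custom = resolve_dates('my-period', ['20220102'], time_periods, 'WEEK-A')
default = resolve_dates('', [], time_periods, 'WEEK-A')
```

In the package, explicitly provided dates are additionally checked against dataset.available_dates, which this sketch leaves out.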
Source code in cesnet_datazoo\\config.py
@dataclass(config=C)\nclass DatasetConfig():\n \"\"\"\n The main class for the configuration of:\n\n - Train, validation, test sets (dates, sizes, validation approach).\n - Application selection \u2014 either the standard closed-world setting (only *known* classes) or the open-world setting (*known* and *unknown* classes).\n - Data transformations. See the [transforms][transforms] page for more information.\n - Dataloader options like batch sizes, order of loading, or number of workers.\n\n When initializing this class, pass a [`CesnetDataset`][datasets.cesnet_dataset.CesnetDataset] instance to be configured and the desired configuration. Available options are [here][config.DatasetConfig--configuration-options].\n\n Attributes:\n dataset: The dataset instance to be configured.\n data_root: Taken from the dataset instance.\n database_filename: Taken from the dataset instance.\n database_path: Taken from the dataset instance.\n servicemap_path: Taken from the dataset instance.\n flowstats_features: Taken from `dataset.metadata.flowstats_features`.\n flowstats_features_boolean: Taken from `dataset.metadata.flowstats_features_boolean`.\n flowstats_features_phist: Taken from `dataset.metadata.packet_histograms` if `use_packet_histograms` is true, otherwise an empty list.\n other_fields: Taken from `dataset.metadata.other_fields` if `return_other_fields` is true, otherwise an empty list.\n\n # Configuration options\n\n Attributes:\n need_train_set: Use to disable the train set. `Default: True`\n need_val_set: Use to disable the validation set. When `need_train_set` is false, the validation set will also be disabled. `Default: True`\n need_test_set: Use to disable the test set. `Default: True`\n train_period_name: Name of the train period. 
See [instructions][config.DatasetConfig--how-to-configure-train-validation-and-test-sets].\n train_dates: Dates used for creating a train set.\n train_dates_weigths: To use a non-uniform distribution of samples across train dates.\n val_approach: How a validation set should be created. Either split train data into train and validation or have a separate validation period. `Default: SPLIT_FROM_TRAIN`\n train_val_split_fraction: The fraction of validation samples when splitting from the train set. `Default: 0.2`\n val_period_name: Name of the validation period. See [instructions][config.DatasetConfig--how-to-configure-train-validation-and-test-sets].\n val_dates: Dates used for creating a validation set.\n test_period_name: Name of the test period. See [instructions][config.DatasetConfig--how-to-configure-train-validation-and-test-sets].\n test_dates: Dates used for creating a test set.\n\n apps_selection: How to select application classes. `Default: ALL_KNOWN`\n apps_selection_topx: Take top X as known.\n apps_selection_background_unknown: Provide a list of background traffic classes to be used as unknown.\n apps_selection_fixed_known: Provide a list of manually selected known applications.\n apps_selection_fixed_unknown: Provide a list of manually selected unknown applications.\n disabled_apps: List of applications to be disabled and not used at all.\n min_train_samples_check: How to handle applications with *not enough* training samples. `Default: DISABLE_APPS`\n min_train_samples_per_app: Defines the threshold for *not enough*. `Default: 100`\n\n random_state: Fix all random processes performed during dataset initialization. `Default: 420`\n fold_id: To perform N-fold cross-validation, set this to `1..N`. Each fold will use the same configuration but a different random seed. `Default: 0`\n train_workers: Number of workers for loading train data. `0` means that the data will be loaded in the main process. 
`Default: 4`\n test_workers: Number of workers for loading test data. `0` means that the data will be loaded in the main process. `Default: 1`\n val_workers: Number of workers for loading validation data. `0` means that the data will be loaded in the main process. `Default: 1`\n batch_size: Number of samples per batch. `Default: 192`\n test_batch_size: Number of samples per batch for loading validation and test data. `Default: 2048`\n preload_val: Whether to dump the validation set with `numpy.savez_compressed` and preload it in future runs. Useful when running a lot of experiments with the same dataset configuration. `Default: True`\n preload_test: Whether to dump the test set with `numpy.savez_compressed` and preload it in future runs. `Default: False`\n train_size: Size of the train set. See [instructions][config.DatasetConfig--how-to-configure-train-validation-and-test-sets]. `Default: all`\n val_known_size: Size of the validation set. See [instructions][config.DatasetConfig--how-to-configure-train-validation-and-test-sets]. `Default: all`\n test_known_size: Size of the test set. See [instructions][config.DatasetConfig--how-to-configure-train-validation-and-test-sets]. `Default: all`\n val_unknown_size: Size of the unknown classes validation set. Use for evaluation in the open-world setting. `Default: 0`\n test_unknown_size: Size of the unknown classes test set. Use for evaluation in the open-world setting. `Default: 0`\n train_dataloader_order: Whether to load train data in sequential or random order. `Default: RANDOM`\n train_dataloader_seed: Seed for loading train data in random order. `Default: None`\n\n return_other_fields: Whether to return [auxiliary fields][other-fields], such as communicating hosts, flow times, and more fields extracted from the ClientHello message. `Default: False`\n return_tensors: Use for returning `torch.Tensor` from dataloaders. Dataframes are not available when this option is used. 
`Default: False`\n use_packet_histograms: Whether to use packet histogram features, if available in the dataset. `Default: True`\n use_tcp_features: Whether to use TCP features, if available in the dataset. `Default: True`\n use_push_flags: Whether to use push flags in packet sequences, if available in the dataset. `Default: False`\n fit_scalers_samples: Used when scaling transformation is configured and requires fitting. Fraction of train samples used for fitting, if float. The absolute number of samples otherwise. `Default: 0.25`\n ppi_transform: Transform function for PPI sequences. See the [transforms][transforms] page for more information. `Default: None`\n flowstats_transform: Transform function for flow statistics. See the [transforms][transforms] page for more information. `Default: None`\n flowstats_phist_transform: Transform function for packet histograms. See the [transforms][transforms] page for more information. `Default: None`\n\n # How to configure train, validation, and test sets\n There are three options for how to define train/validation/test dates.\n\n 1. Choose a predefined time period (`train_period_name`, `val_period_name`, or `test_period_name`) available in `dataset.time_periods` and leave the list of dates (`train_dates`, `val_dates`, or `test_dates`) empty.\n 2. Provide a list of dates and a name for the time period. The dates are checked against `dataset.available_dates`.\n 3. Do not specify anything and use the dataset's defaults `dataset.default_train_period_name` and `dataset.default_test_period_name`.\n\n There are two options for configuring sizes of train/validation/test sets.\n\n 1. 
Select an appropriate dataset size (default is `S`) when creating the [`CesnetDataset`][datasets.cesnet_dataset.CesnetDataset] instance and leave `train_size`, `val_known_size`, and `test_known_size` with their default `all` value.\n This will create train/validation/test sets with all samples available in the selected dataset size (of course, depending on the selected dates and validation approach).\n 2. Provide exact sizes in `train_size`, `val_known_size`, and `test_known_size`. This will create train/validation/test sets of the given sizes by doing a random subset.\n This is especially useful when using the `ORIG` dataset size and want to control the size of experiments.\n\n !!! tip Validation set\n The default approach for creating a validation set is to randomly split the train data into train and validation. The second approach is to define separate validation dates. See [ValidationApproach][config.ValidationApproach].\n\n \"\"\"\n dataset: InitVar[CesnetDataset]\n data_root: str = field(init=False)\n database_filename: str = field(init=False)\n database_path: str = field(init=False)\n servicemap_path: str = field(init=False)\n flowstats_features: list[str] = field(init=False)\n flowstats_features_boolean: list[str] = field(init=False)\n flowstats_features_phist: list[str] = field(init=False)\n other_fields: list[str] = field(init=False)\n\n need_train_set: bool = True\n need_val_set: bool = True\n need_test_set: bool = True\n train_period_name: str = \"\"\n train_dates: list[str] = field(default_factory=list)\n train_dates_weigths: Optional[list[int]] = None\n val_approach: ValidationApproach = ValidationApproach.SPLIT_FROM_TRAIN\n train_val_split_fraction: float = 0.2\n val_period_name: str = \"\"\n val_dates: list[str] = field(default_factory=list)\n test_period_name: str = \"\"\n test_dates: list[str] = field(default_factory=list)\n\n apps_selection: AppSelection = AppSelection.ALL_KNOWN\n apps_selection_topx: int = 0\n apps_selection_background_unknown: 
list[str] = field(default_factory=list)\n apps_selection_fixed_known: list[str] = field(default_factory=list)\n apps_selection_fixed_unknown: list[str] = field(default_factory=list)\n disabled_apps: list[str] = field(default_factory=list)\n min_train_samples_check: MinTrainSamplesCheck = MinTrainSamplesCheck.DISABLE_APPS\n min_train_samples_per_app: int = 100\n\n random_state: int = 420\n fold_id: int = 0\n train_workers: int = 4\n test_workers: int = 1\n val_workers: int = 1\n batch_size: int = 192\n test_batch_size: int = 2048\n preload_val: bool = True\n preload_test: bool = False\n train_size: int | Literal[\"all\"] = \"all\"\n val_known_size: int | Literal[\"all\"] = \"all\"\n test_known_size: int | Literal[\"all\"] = \"all\"\n val_unknown_size: int | Literal[\"all\"] = 0\n test_unknown_size: int | Literal[\"all\"] = 0\n train_dataloader_order: DataLoaderOrder = DataLoaderOrder.RANDOM\n train_dataloader_seed: Optional[int] = None\n\n return_other_fields: bool = False\n return_tensors: bool = False\n use_packet_histograms: bool = False\n use_tcp_features: bool = False\n use_push_flags: bool = False\n fit_scalers_samples: int | float = 0.25\n ppi_transform: Optional[Callable] = None\n flowstats_transform: Optional[Callable] = None\n flowstats_phist_transform: Optional[Callable] = None\n\n def __post_init__(self, dataset: CesnetDataset):\n \"\"\"\n Ensures valid configuration. 
Catches all incompatible options and raise exceptions as soon as possible.\n \"\"\"\n self.data_root = dataset.data_root\n self.servicemap_path = dataset.servicemap_path\n self.database_filename = dataset.database_filename\n self.database_path = dataset.database_path\n\n if not self.need_train_set:\n self.need_val_set = False\n if self.apps_selection != AppSelection.FIXED:\n raise ValueError(\"Application selection has to be fixed when need_train_set is false\")\n if (len(self.train_dates) > 0 or self.train_period_name != \"\"):\n raise ValueError(\"train_dates and train_period_name cannot be specified when need_train_set is false\")\n else:\n # Configure train dates\n if len(self.train_dates) > 0 and self.train_period_name == \"\":\n raise ValueError(\"train_period_name has to be specified when train_dates are set\")\n if len(self.train_dates) == 0 and self.train_period_name != \"\":\n if self.train_period_name not in dataset.time_periods:\n raise ValueError(f\"Unknown train_period_name {self.train_period_name}. Use time period available in dataset.time_periods\")\n self.train_dates = dataset.time_periods[self.train_period_name]\n if len(self.train_dates) == 0 and self.train_period_name == \"\":\n self.train_period_name = dataset.default_train_period_name\n self.train_dates = dataset.time_periods[dataset.default_train_period_name]\n # Configure test dates\n if not self.need_test_set:\n if (len(self.test_dates) > 0 or self.test_period_name != \"\"):\n raise ValueError(\"test_dates and test_period_name cannot be specified when need_test_set is false\")\n else:\n if len(self.test_dates) > 0 and self.test_period_name == \"\":\n raise ValueError(\"test_period_name has to be specified when test_dates are set\")\n if len(self.test_dates) == 0 and self.test_period_name != \"\":\n if self.test_period_name not in dataset.time_periods:\n raise ValueError(f\"Unknown test_period_name {self.test_period_name}. 
Use time period available in dataset.time_periods\")\n self.test_dates = dataset.time_periods[self.test_period_name]\n if len(self.test_dates) == 0 and self.test_period_name == \"\":\n self.test_period_name = dataset.default_test_period_name\n self.test_dates = dataset.time_periods[dataset.default_test_period_name]\n # Configure val dates\n if (not self.need_val_set or self.val_approach == ValidationApproach.SPLIT_FROM_TRAIN) and (len(self.val_dates) > 0 or self.val_period_name != \"\"):\n raise ValueError(\"val_dates and val_period_name cannot be specified when need_val_set is false or the validation approach is split-from-train\")\n if self.val_approach == ValidationApproach.VALIDATION_DATES:\n if len(self.val_dates) > 0 and self.val_period_name == \"\":\n raise ValueError(\"val_period_name has to be specified when val_dates are set\")\n if len(self.val_dates) == 0 and self.val_period_name != \"\":\n if self.val_period_name not in dataset.time_periods:\n raise ValueError(f\"Unknown val_period_name {self.val_period_name}. Use time period available in dataset.time_periods\")\n self.val_dates = dataset.time_periods[self.val_period_name]\n if len(self.val_dates) == 0 and self.val_period_name == \"\":\n raise ValueError(\"val_period_name and val_dates (or val_period_name from dataset.time_periods) have to be specified when the validation approach is validation-dates\")\n # Check if train, val, and test dates are available in the dataset\n bad_train_dates = [t for t in self.train_dates if t not in dataset.available_dates]\n bad_val_dates = [t for t in self.val_dates if t not in dataset.available_dates]\n bad_test_dates = [t for t in self.test_dates if t not in dataset.available_dates]\n if len(bad_train_dates) > 0:\n raise ValueError(f\"Bad train dates {bad_train_dates}. Use dates available in dataset.available_dates (collection period {dataset.metadata.collection_period})\" \\\n + (f\". 
These dates are missing from the dataset collection period {dataset.metadata.missing_dates_in_collection_period}\" if dataset.metadata.missing_dates_in_collection_period else \"\"))\n if len(bad_val_dates) > 0:\n raise ValueError(f\"Bad validation dates {bad_val_dates}. Use dates available in dataset.available_dates (collection period {dataset.metadata.collection_period})\" \\\n + (f\". These dates are missing from the dataset collection period {dataset.metadata.missing_dates_in_collection_period}\" if dataset.metadata.missing_dates_in_collection_period else \"\"))\n if len(bad_test_dates) > 0:\n raise ValueError(f\"Bad test dates {bad_test_dates}. Use dates available in dataset.available_dates (collection period {dataset.metadata.collection_period})\" \\\n + (f\". These dates are missing from the dataset collection period {dataset.metadata.missing_dates_in_collection_period}\" if dataset.metadata.missing_dates_in_collection_period else \"\"))\n # Check time order of train, val, and test periods\n train_dates = [datetime.strptime(date_str, \"%Y%m%d\").date() for date_str in self.train_dates]\n test_dates = [datetime.strptime(date_str, \"%Y%m%d\").date() for date_str in self.test_dates]\n if len(train_dates) > 0 and len(test_dates) > 0 and min(test_dates) <= max(train_dates):\n warnings.warn(f\"Some test dates ({min(test_dates).strftime('%Y%m%d')}) are before or equal to the last train date ({max(train_dates).strftime('%Y%m%d')}). This might lead to improper evaluation and should be avoided.\")\n if self.val_approach == ValidationApproach.VALIDATION_DATES:\n # Train dates are guaranteed to be set\n val_dates = [datetime.strptime(date_str, \"%Y%m%d\").date() for date_str in self.val_dates]\n if min(val_dates) <= max(train_dates):\n warnings.warn(f\"Some validation dates ({min(val_dates).strftime('%Y%m%d')}) are before or equal to the last train date ({max(train_dates).strftime('%Y%m%d')}). 
This might lead to improper evaluation and should be avoided.\")\n if len(test_dates) > 0 and min(test_dates) <= max(val_dates):\n warnings.warn(f\"Some test dates ({min(test_dates).strftime('%Y%m%d')}) are before or equal to the last validation date ({max(val_dates).strftime('%Y%m%d')}). This might lead to improper evaluation and should be avoided.\")\n # Configure features\n self.flowstats_features = dataset.metadata.flowstats_features\n self.flowstats_features_boolean = dataset.metadata.flowstats_features_boolean\n self.other_fields = dataset.metadata.other_fields if self.return_other_fields else []\n if self.use_packet_histograms:\n if len(dataset.metadata.packet_histograms) == 0:\n raise ValueError(\"This dataset does not support use_packet_histograms\")\n self.flowstats_features_phist = dataset.metadata.packet_histograms\n else:\n self.flowstats_features_phist = []\n if self.flowstats_phist_transform is not None:\n raise ValueError(\"flowstats_phist_transform cannot be specified when use_packet_histograms is false\")\n if dataset.metadata.protocol == Protocol.TLS:\n if self.use_tcp_features:\n self.flowstats_features_boolean = self.flowstats_features_boolean + SELECTED_TCP_FLAGS\n if self.use_push_flags and \"PUSH_FLAG\" not in dataset.metadata.ppi_features:\n raise ValueError(\"This TLS dataset does not support use_push_flags\")\n if dataset.metadata.protocol == Protocol.QUIC:\n if self.use_tcp_features:\n raise ValueError(\"QUIC datasets do not support use_tcp_features\")\n if self.use_push_flags:\n raise ValueError(\"QUIC datasets do not support use_push_flags\")\n # When train_dates_weigths are used, train_size and val_known_size have to be specified\n if self.train_dates_weigths is not None:\n if not self.need_train_set:\n raise ValueError(\"train_dates_weigths cannot be specified when need_train_set is false\")\n if len(self.train_dates_weigths) != len(self.train_dates):\n raise ValueError(\"train_dates_weigths has to have the same length as 
train_dates\")\n if self.train_size == \"all\":\n raise ValueError(\"train_size cannot be 'all' when train_dates_weigths are speficied\")\n if self.val_approach == ValidationApproach.SPLIT_FROM_TRAIN and self.val_known_size == \"all\":\n raise ValueError(\"val_known_size cannot be 'all' when train_dates_weigths are speficied and validation_approach is split-from-train\")\n # App selection\n if self.apps_selection == AppSelection.ALL_KNOWN:\n self.val_unknown_size = 0\n self.test_unknown_size = 0\n if self.apps_selection_topx != 0 or len(self.apps_selection_background_unknown) > 0 or len(self.apps_selection_fixed_known) > 0 or len(self.apps_selection_fixed_unknown) > 0:\n raise ValueError(\"apps_selection_topx, apps_selection_background_unknown, apps_selection_fixed_known, and apps_selection_fixed_unknown cannot be specified when application selection is all-known\")\n if self.apps_selection == AppSelection.TOPX_KNOWN:\n if self.apps_selection_topx == 0:\n raise ValueError(\"apps_selection_topx has to be greater than 0 when application selection is top-x-known\")\n if len(self.apps_selection_background_unknown) > 0 or len(self.apps_selection_fixed_known) > 0 or len(self.apps_selection_fixed_unknown) > 0:\n raise ValueError(\"apps_selection_background_unknown, apps_selection_fixed_known, and apps_selection_fixed_unknown cannot be specified when application selection is top-x-known\")\n if self.apps_selection == AppSelection.BACKGROUND_UNKNOWN:\n if len(self.apps_selection_background_unknown) == 0:\n raise ValueError(\"apps_selection_background_unknown has to be specified when application selection is background-unknown\")\n bad_apps = [a for a in self.apps_selection_background_unknown if a not in dataset.available_classes]\n if len(bad_apps) > 0:\n raise ValueError(f\"Bad applications in apps_selection_background_unknown {bad_apps}. 
Use applications available in dataset.available_classes\")\n if self.apps_selection_topx != 0 or len(self.apps_selection_fixed_known) > 0 or len(self.apps_selection_fixed_unknown) > 0:\n raise ValueError(\"apps_selection_topx, apps_selection_fixed_known, and apps_selection_fixed_unknown cannot be specified when application selection is background-unknown\")\n if self.apps_selection == AppSelection.FIXED:\n if len(self.apps_selection_fixed_known) == 0:\n raise ValueError(\"apps_selection_fixed_known has to be specified when application selection is fixed\")\n bad_apps = [a for a in self.apps_selection_fixed_known + self.apps_selection_fixed_unknown if a not in dataset.available_classes]\n if len(bad_apps) > 0:\n raise ValueError(f\"Bad applications in apps_selection_fixed_known or apps_selection_fixed_unknown {bad_apps}. Use applications available in dataset.available_classes\")\n if len(self.disabled_apps) > 0:\n raise ValueError(\"disabled_apps cannot be specified when application selection is fixed\")\n if self.min_train_samples_per_app != 0 and self.min_train_samples_per_app != 100:\n warnings.warn(\"min_train_samples_per_app is not used when application selection is fixed\")\n if self.apps_selection_topx != 0 or len(self.apps_selection_background_unknown) > 0:\n raise ValueError(\"apps_selection_topx and apps_selection_background_unknown cannot be specified when application selection is fixed\")\n # More asserts\n bad_disabled_apps = [a for a in self.disabled_apps if a not in dataset.available_classes]\n if len(bad_disabled_apps) > 0:\n raise ValueError(f\"Bad applications in disabled_apps {bad_disabled_apps}. 
Use applications available in dataset.available_classes\")\n if isinstance(self.fit_scalers_samples, float) and (self.fit_scalers_samples <= 0 or self.fit_scalers_samples > 1):\n raise ValueError(\"fit_scalers_samples has to be either float between 0 and 1 (giving the fraction of training samples used for fitting scalers) or an integer\")\n\n def get_flowstats_features_len(self) -> int:\n \"\"\"Gets the number of flow statistics features.\"\"\"\n return len(self.flowstats_features) + len(self.flowstats_features_boolean) + PHIST_BIN_COUNT * len(self.flowstats_features_phist)\n\n def get_flowstats_feature_names_expanded(self, shorter_names: bool = False) -> list[str]:\n \"\"\"Gets names of flow statistics features. Packet histograms are expanded into bin features.\"\"\"\n phist_mapping = {\n \"PHIST_SRC_SIZES\": [f\"PSIZE_BIN{i}\" for i in range(1, PHIST_BIN_COUNT + 1)],\n \"PHIST_DST_SIZES\": [f\"PSIZE_BIN{i}_REV\" for i in range(1, PHIST_BIN_COUNT + 1)],\n \"PHIST_SRC_IPT\": [f\"IPT_BIN{i}\" for i in range(1, PHIST_BIN_COUNT + 1)],\n \"PHIST_DST_IPT\": [f\"IPT_BIN{i}_REV\" for i in range(1, PHIST_BIN_COUNT + 1)],\n }\n short_names_mapping = {\n \"FLOW_ENDREASON_IDLE\": \"FEND_IDLE\",\n \"FLOW_ENDREASON_ACTIVE\": \"FEND_ACTIVE\",\n \"FLOW_ENDREASON_END\": \"FEND_END\",\n \"FLOW_ENDREASON_OTHER\": \"FEND_OTHER\",\n \"FLAG_CWR\": \"F_CWR\",\n \"FLAG_CWR_REV\": \"F_CWR_REV\",\n \"FLAG_ECE\": \"F_ECE\",\n \"FLAG_ECE_REV\": \"F_ECE_REV\",\n \"FLAG_PSH_REV\": \"F_PSH_REV\",\n \"FLAG_RST\": \"F_RST\",\n \"FLAG_RST_REV\": \"F_RST_REV\",\n \"FLAG_FIN\": \"F_FIN\",\n \"FLAG_FIN_REV\": \"F_FIN_REV\",\n }\n feature_names = self.flowstats_features[:]\n for f in self.flowstats_features_boolean:\n if shorter_names and f in short_names_mapping:\n feature_names.append(short_names_mapping[f])\n else:\n feature_names.append(f)\n for f in self.flowstats_features_phist:\n feature_names.extend(phist_mapping[f])\n assert len(feature_names) == self.get_flowstats_features_len()\n return 
feature_names\n\n def get_ppi_feature_names(self) -> list[str]:\n \"\"\"Gets the names of flattened PPI features.\"\"\"\n ppi_feature_names = [f\"IPT_{i}\" for i in range(1, PPI_MAX_LEN + 1)] + \\\n [f\"DIR_{i}\" for i in range(1, PPI_MAX_LEN + 1)] + \\\n [f\"SIZE_{i}\" for i in range(1, PPI_MAX_LEN + 1)]\n if self.use_push_flags:\n ppi_feature_names += [f\"PUSH_{i}\" for i in range(1, PPI_MAX_LEN + 1)]\n return ppi_feature_names\n\n def get_ppi_channels(self) -> list[int]:\n \"\"\"Gets the available features (channels) in PPI sequences.\"\"\"\n if self.use_push_flags:\n return TCP_PPI_CHANNELS\n else:\n return UDP_PPI_CHANNELS\n\n def get_feature_names(self, flatten_ppi: bool = False, shorter_names: bool = False) -> list[str]:\n \"\"\"\n Gets feature names.\n\n Parameters:\n flatten_ppi: Whether to flatten PPI into individual feature names or keep one `PPI` column.\n \"\"\"\n feature_names = self.get_ppi_feature_names() if flatten_ppi else [\"PPI\"]\n feature_names += self.get_flowstats_feature_names_expanded(shorter_names=shorter_names)\n return feature_names\n\n def _get_train_tables_paths(self) -> list[str]:\n return list(map(lambda t: f\"/flows/D{t}\", self.train_dates))\n\n def _get_val_tables_paths(self) -> list[str]:\n if self.val_approach == ValidationApproach.SPLIT_FROM_TRAIN:\n return list(map(lambda t: f\"/flows/D{t}\", self.train_dates))\n return list(map(lambda t: f\"/flows/D{t}\", self.val_dates))\n\n def _get_test_tables_paths(self) -> list[str]:\n return list(map(lambda t: f\"/flows/D{t}\", self.test_dates))\n\n def _get_train_data_hash(self) -> str:\n train_data_params = self._get_train_data_params()\n params_hash = hashlib.sha256(json.dumps(dataclasses.asdict(train_data_params), sort_keys=True, default=str).encode()).hexdigest()\n params_hash = params_hash[:10]\n return params_hash\n\n def _get_train_data_path(self) -> str:\n if self.need_train_set:\n params_hash = self._get_train_data_hash()\n return os.path.join(self.data_root, \"train-data\", 
f\"{params_hash}_{self.random_state}\", f\"fold_{self.fold_id}\")\n else:\n return os.path.join(self.data_root, \"train-data\", \"default\")\n\n def _get_train_data_params(self) -> TrainDataParams:\n return TrainDataParams(\n database_filename=self.database_filename,\n train_period_name=self.train_period_name,\n train_tables_paths=self._get_train_tables_paths(),\n apps_selection=self.apps_selection,\n apps_selection_topx=self.apps_selection_topx,\n apps_selection_background_unknown=self.apps_selection_background_unknown,\n apps_selection_fixed_known=self.apps_selection_fixed_known,\n apps_selection_fixed_unknown=self.apps_selection_fixed_unknown,\n disabled_apps=self.disabled_apps,\n min_train_samples_per_app=self.min_train_samples_per_app,\n min_train_samples_check=self.min_train_samples_check,)\n\n def _get_val_data_params_and_path(self, known_apps: list[str], unknown_apps: list[str]) -> tuple[TestDataParams, str]:\n assert self.val_approach == ValidationApproach.VALIDATION_DATES\n val_data_params = TestDataParams(\n database_filename=self.database_filename,\n test_period_name=self.val_period_name,\n test_tables_paths=self._get_val_tables_paths(),\n known_apps=known_apps,\n unknown_apps=unknown_apps,)\n params_hash = hashlib.sha256(json.dumps(dataclasses.asdict(val_data_params), sort_keys=True).encode()).hexdigest()\n params_hash = params_hash[:10]\n val_data_path = os.path.join(self.data_root, \"val-data\", f\"{params_hash}_{self.random_state}\")\n return val_data_params, val_data_path\n\n def _get_test_data_params_and_path(self, known_apps: list[str], unknown_apps: list[str]) -> tuple[TestDataParams, str]:\n test_data_params = TestDataParams(\n database_filename=self.database_filename,\n test_period_name=self.test_period_name,\n test_tables_paths=self._get_test_tables_paths(),\n known_apps=known_apps,\n unknown_apps=unknown_apps,)\n params_hash = hashlib.sha256(json.dumps(dataclasses.asdict(test_data_params), sort_keys=True).encode()).hexdigest()\n params_hash 
= params_hash[:10]\n test_data_path = os.path.join(self.data_root, \"test-data\", f\"{params_hash}_{self.random_state}\")\n return test_data_params, test_data_path\n\n @model_validator(mode=\"before\") # type: ignore\n @classmethod\n def check_deprecated_args(cls, values):\n kwargs = values.kwargs\n if \"train_period\" in kwargs:\n warnings.warn(\"train_period is deprecated. Use train_period_name instead.\")\n kwargs[\"train_period_name\"] = kwargs[\"train_period\"]\n if \"val_period\" in kwargs:\n warnings.warn(\"val_period is deprecated. Use val_period_name instead.\")\n kwargs[\"val_period_name\"] = kwargs[\"val_period\"]\n if \"test_period\" in kwargs:\n warnings.warn(\"test_period is deprecated. Use test_period_name instead.\")\n kwargs[\"test_period_name\"] = kwargs[\"test_period\"]\n return values\n\n def __str__(self):\n _process_tag = yaml.emitter.Emitter.process_tag\n _ignore_aliases = yaml.Dumper.ignore_aliases\n yaml.emitter.Emitter.process_tag = lambda self, *args, **kw: None\n yaml.Dumper.ignore_aliases = lambda self, *args, **kw: True\n s = yaml.dump(dataclasses.asdict(self), sort_keys=False)\n yaml.emitter.Emitter.process_tag = _process_tag\n yaml.Dumper.ignore_aliases = _ignore_aliases\n return s\n
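The cache paths above are derived from a truncated SHA-256 hash of the JSON-serialized parameters, so equal parameters resolve to the same directory. The following is a minimal standalone sketch of that idea, not the library code itself: ExampleParams is a hypothetical stand-in for TrainDataParams with only two fields, and the data_root, random state 42, and fold 0 values are made up for illustration.

```python
import dataclasses
import hashlib
import json
import os


@dataclasses.dataclass(frozen=True)
class ExampleParams:
    # Hypothetical stand-in for TrainDataParams, reduced to two fields.
    database_filename: str
    train_tables_paths: list


def params_hash(params) -> str:
    # Sorted keys make the serialized form independent of field order.
    payload = json.dumps(dataclasses.asdict(params), sort_keys=True, default=str)
    # Keep only the first 10 hex characters, as the source above does.
    return hashlib.sha256(payload.encode()).hexdigest()[:10]


a = ExampleParams('CESNET-TLS22.h5', ['/flows/D20211004', '/flows/D20211005'])
b = ExampleParams('CESNET-TLS22.h5', ['/flows/D20211004', '/flows/D20211005'])
c = ExampleParams('CESNET-TLS22.h5', ['/flows/D20211004'])

# Illustrative path layout: <data_root>/train-data/<hash>_<random_state>/fold_<id>
path_a = os.path.join('data_root', 'train-data', f'{params_hash(a)}_42', 'fold_0')
print(path_a)
```

Because the hash covers every field, changing any parameter (here, dropping one table path) yields a different directory, which presumably keeps previously prepared data from being reused under a different configuration.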
"},{"location":"reference_dataset_config/#config.DatasetConfig-functions","title":"Functions","text":""},{"location":"reference_dataset_config/#config.DatasetConfig.get_flowstats_features_len","title":"get_flowstats_features_len","text":"get_flowstats_features_len() -> int\n
Gets the number of flow statistics features.
Source code in cesnet_datazoo\\config.py
def get_flowstats_features_len(self) -> int:\n \"\"\"Gets the number of flow statistics features.\"\"\"\n return len(self.flowstats_features) + len(self.flowstats_features_boolean) + PHIST_BIN_COUNT * len(self.flowstats_features_phist)\n
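As a worked example of the formula above, the sketch below counts features for a made-up configuration. The value PHIST_BIN_COUNT = 8 and the names in flowstats_features are assumptions chosen for illustration; the real values depend on the package constants and the dataset.

```python
PHIST_BIN_COUNT = 8  # assumed value, for illustration only

# Hypothetical feature selection; real names depend on the dataset.
flowstats_features = ['BYTES', 'BYTES_REV', 'PACKETS', 'PACKETS_REV', 'DURATION']
flowstats_features_boolean = ['FLOW_ENDREASON_IDLE', 'FLOW_ENDREASON_ACTIVE']
flowstats_features_phist = ['PHIST_SRC_SIZES', 'PHIST_DST_SIZES']

# Each packet histogram contributes one feature per bin.
features_len = (
    len(flowstats_features)
    + len(flowstats_features_boolean)
    + PHIST_BIN_COUNT * len(flowstats_features_phist)
)
print(features_len)  # 5 + 2 + 8 * 2 = 23
```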
"},{"location":"reference_dataset_config/#config.DatasetConfig.get_flowstats_feature_names_expanded","title":"get_flowstats_feature_names_expanded","text":"get_flowstats_feature_names_expanded(\n shorter_names: bool = False,\n) -> list[str]\n
Gets names of flow statistics features. Packet histograms are expanded into bin features.
Source code in cesnet_datazoo\\config.py
def get_flowstats_feature_names_expanded(self, shorter_names: bool = False) -> list[str]:\n \"\"\"Gets names of flow statistics features. Packet histograms are expanded into bin features.\"\"\"\n phist_mapping = {\n \"PHIST_SRC_SIZES\": [f\"PSIZE_BIN{i}\" for i in range(1, PHIST_BIN_COUNT + 1)],\n \"PHIST_DST_SIZES\": [f\"PSIZE_BIN{i}_REV\" for i in range(1, PHIST_BIN_COUNT + 1)],\n \"PHIST_SRC_IPT\": [f\"IPT_BIN{i}\" for i in range(1, PHIST_BIN_COUNT + 1)],\n \"PHIST_DST_IPT\": [f\"IPT_BIN{i}_REV\" for i in range(1, PHIST_BIN_COUNT + 1)],\n }\n short_names_mapping = {\n \"FLOW_ENDREASON_IDLE\": \"FEND_IDLE\",\n \"FLOW_ENDREASON_ACTIVE\": \"FEND_ACTIVE\",\n \"FLOW_ENDREASON_END\": \"FEND_END\",\n \"FLOW_ENDREASON_OTHER\": \"FEND_OTHER\",\n \"FLAG_CWR\": \"F_CWR\",\n \"FLAG_CWR_REV\": \"F_CWR_REV\",\n \"FLAG_ECE\": \"F_ECE\",\n \"FLAG_ECE_REV\": \"F_ECE_REV\",\n \"FLAG_PSH_REV\": \"F_PSH_REV\",\n \"FLAG_RST\": \"F_RST\",\n \"FLAG_RST_REV\": \"F_RST_REV\",\n \"FLAG_FIN\": \"F_FIN\",\n \"FLAG_FIN_REV\": \"F_FIN_REV\",\n }\n feature_names = self.flowstats_features[:]\n for f in self.flowstats_features_boolean:\n if shorter_names and f in short_names_mapping:\n feature_names.append(short_names_mapping[f])\n else:\n feature_names.append(f)\n for f in self.flowstats_features_phist:\n feature_names.extend(phist_mapping[f])\n assert len(feature_names) == self.get_flowstats_features_len()\n return feature_names\n
"},{"location":"reference_dataset_config/#config.DatasetConfig.get_ppi_feature_names","title":"get_ppi_feature_names","text":"get_ppi_feature_names() -> list[str]\n
Gets the names of flattened PPI features.
Source code in cesnet_datazoo\\config.py
def get_ppi_feature_names(self) -> list[str]:\n \"\"\"Gets the names of flattened PPI features.\"\"\"\n ppi_feature_names = [f\"IPT_{i}\" for i in range(1, PPI_MAX_LEN + 1)] + \\\n [f\"DIR_{i}\" for i in range(1, PPI_MAX_LEN + 1)] + \\\n [f\"SIZE_{i}\" for i in range(1, PPI_MAX_LEN + 1)]\n if self.use_push_flags:\n ppi_feature_names += [f\"PUSH_{i}\" for i in range(1, PPI_MAX_LEN + 1)]\n return ppi_feature_names\n
"},{"location":"reference_dataset_config/#config.DatasetConfig.get_ppi_channels","title":"get_ppi_channels","text":"get_ppi_channels() -> list[int]\n
Gets the available features (channels) in PPI sequences.
Source code in cesnet_datazoo\\config.py
def get_ppi_channels(self) -> list[int]:\n \"\"\"Gets the available features (channels) in PPI sequences.\"\"\"\n if self.use_push_flags:\n return TCP_PPI_CHANNELS\n else:\n return UDP_PPI_CHANNELS\n
"},{"location":"reference_dataset_config/#config.DatasetConfig.get_feature_names","title":"get_feature_names","text":"get_feature_names(\n flatten_ppi: bool = False, shorter_names: bool = False\n) -> list[str]\n
Gets feature names.
Parameters:
Name | Type | Description | Default
flatten_ppi | bool | Whether to flatten PPI into individual feature names or keep one PPI column. | False
Source code in cesnet_datazoo\\config.py
def get_feature_names(self, flatten_ppi: bool = False, shorter_names: bool = False) -> list[str]:\n \"\"\"\n Gets feature names.\n\n Parameters:\n flatten_ppi: Whether to flatten PPI into individual feature names or keep one `PPI` column.\n \"\"\"\n feature_names = self.get_ppi_feature_names() if flatten_ppi else [\"PPI\"]\n feature_names += self.get_flowstats_feature_names_expanded(shorter_names=shorter_names)\n return feature_names\n
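To make the effect of flatten_ppi concrete, here is a standalone sketch that mirrors the naming logic without importing the package. PPI_MAX_LEN = 30 is taken from the PPI shape (B, [3, 4], 30) described earlier in this documentation. The function names and the flowstats_names argument are hypothetical helpers, not part of the library API.

```python
PPI_MAX_LEN = 30  # matches the last dimension of the PPI shape


def ppi_feature_names(use_push_flags: bool) -> list:
    # One name per packet position for each PPI channel.
    names = []
    for prefix in ('IPT', 'DIR', 'SIZE'):
        names += [f'{prefix}_{i}' for i in range(1, PPI_MAX_LEN + 1)]
    if use_push_flags:
        names += [f'PUSH_{i}' for i in range(1, PPI_MAX_LEN + 1)]
    return names


def feature_names(flatten_ppi: bool, use_push_flags: bool, flowstats_names: list) -> list:
    # Either one name per PPI value or a single PPI column, then flow statistics.
    head = ppi_feature_names(use_push_flags) if flatten_ppi else ['PPI']
    return head + flowstats_names


flat = feature_names(True, False, ['DURATION'])
compact = feature_names(False, False, ['DURATION'])
print(len(flat), compact)
```

With three channels, the flattened PPI contributes 90 names, and 120 when push flags are enabled.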
"},{"location":"reference_dataset_config/#enums-for-configuration","title":"Enums for configuration","text":"The following enums are used for dataset configuration.
"},{"location":"reference_dataset_config/#config.ValidationApproach","title":"config.ValidationApproach","text":"The validation approach defines which samples should be used for creating a validation set.
SPLIT_FROM_TRAIN class-attribute instance-attribute
SPLIT_FROM_TRAIN = 'split-from-train'\n
Split the train data into train and validation sets. Scikit-learn's train_test_split is used to create a random stratified validation set. The fraction of validation samples is set with train_val_split_fraction.
VALIDATION_DATES class-attribute instance-attribute
VALIDATION_DATES = 'validation-dates'\n
Use separate validation dates to create a validation set. Validation dates need to be specified in val_dates, and the name of the validation period in val_period_name.
Applications can be divided into known and unknown classes. To use a dataset in the standard closed-world setting, use ALL_KNOWN to select all applications as known. Use TOPX_KNOWN or BACKGROUND_UNKNOWN for the open-world setting and for evaluating out-of-distribution or open-set recognition methods. Use FIXED to select known and unknown applications manually.
ALL_KNOWN class-attribute instance-attribute
ALL_KNOWN = 'all-known'\n
Use all applications as known.
TOPX_KNOWN class-attribute instance-attribute
TOPX_KNOWN = 'topx-known'\n
Use the X (apps_selection_topx) most frequent applications, that is, those with the most samples, as known, and the rest as unknown. Applications with the same provider are never separated: all applications of a given provider are either known or unknown.
BACKGROUND_UNKNOWN class-attribute instance-attribute
BACKGROUND_UNKNOWN = 'background-unknown'\n
Use the list of background traffic classes (apps_selection_background_unknown) as unknown, and the rest as known.
FIXED class-attribute instance-attribute
FIXED = 'fixed'\n
Manual application selection. Provide lists of known applications (apps_selection_fixed_known) and unknown applications (apps_selection_fixed_unknown).
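The TOPX_KNOWN selection can be illustrated with a simplified sketch that ranks applications by sample count and treats the top X as known. This is an assumption-laden illustration, not the library implementation: select_topx_known is a hypothetical helper, the application labels are made up, and the rule that keeps applications of the same provider together is deliberately left out.

```python
from collections import Counter


def select_topx_known(app_labels, topx):
    # Rank applications by their number of samples, most frequent first.
    counts = Counter(app_labels)
    ranked = [app for app, _ in counts.most_common()]
    # The first topx applications are known, the rest unknown.
    return ranked[:topx], ranked[topx:]


labels = ['app-a'] * 5 + ['app-b'] * 3 + ['app-c'] * 2 + ['app-d']
known, unknown = select_topx_known(labels, topx=2)
print(known, unknown)
```

In the open-world setting, samples of the unknown applications would then serve for evaluating out-of-distribution or open-set recognition methods.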
Depending on the selected train dates, some applications might have too few samples for training; how many samples are enough depends on the selected classification model. The minimum number of samples is set with min_train_samples_per_app, which defaults to 100. With the DISABLE_APPS approach, these applications are disabled and not used for training or testing. With the WARN_AND_EXIT approach, the script prints a warning and exits when it encounters applications with too few samples. To disable this check, set min_train_samples_per_app to 0.
WARN_AND_EXIT class-attribute instance-attribute
WARN_AND_EXIT = 'warn-and-exit'\n
Warn and exit if there are not enough training samples for some applications. It is up to the user to manually add these applications to disabled_apps.
DISABLE_APPS class-attribute instance-attribute
DISABLE_APPS = 'disable-apps'\n
Disable applications with not enough training samples.
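The two approaches can be contrasted with a small sketch. It assumes that an application fails the check when its sample count is strictly below the threshold, which the text above does not spell out. The helper check_min_train_samples and the labels are hypothetical.

```python
from collections import Counter


def check_min_train_samples(train_labels, min_samples, approach):
    # A threshold of 0 disables the check entirely.
    if min_samples == 0:
        return []
    counts = Counter(train_labels)
    # Assumption: strictly fewer samples than the threshold counts as too few.
    too_few = sorted(app for app, n in counts.items() if n < min_samples)
    if too_few and approach == 'warn-and-exit':
        # Stop and leave it to the user to add these apps to disabled_apps.
        raise SystemExit(f'Not enough training samples for: {too_few}')
    # With 'disable-apps', the caller would drop these apps from train and test.
    return too_few


labels = ['app-a'] * 5 + ['app-b'] * 3 + ['app-c'] * 2 + ['app-d']
disabled = check_min_train_samples(labels, min_samples=3, approach='disable-apps')
print(disabled)
```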
"},{"location":"reference_dataset_config/#config.DataLoaderOrder","title":"config.DataLoaderOrder","text":"Validation and test sets are always loaded in sequential order, that is, ordered by date and time. The train set, however, sometimes needs to be iterated in random order (for example, when training a neural network). Use RANDOM if your classification model requires it and SEQUENTIAL otherwise. This setting affects only train_dataloader; the dataframe returned by get_train_df is always created in sequential order.
RANDOM class-attribute instance-attribute
RANDOM = 'random'\n
Iterate train data in random order.
SEQUENTIAL class-attribute instance-attribute
SEQUENTIAL = 'sequential'\n
Iterate train data in sequential (datetime) order.
"},{"location":"reference_datasets/","title":"Dataset classes","text":"These are subclasses of CesnetDataset
representing individual datasets available in cesnet-datazoo
.
Bases: CesnetDataset
Dataset class for CESNET-TLS22.
Source code in cesnet_datazoo\\datasets\\datasets.py
class CESNET_TLS22(CesnetDataset):\n \"\"\"Dataset class for [CESNET-TLS22][cesnet-tls22].\"\"\"\n name = \"CESNET-TLS22\"\n database_filename = \"CESNET-TLS22.h5\"\n bucket_url = \"https://liberouter.org/datazoo/download?bucket=cesnet-tls22\"\n available_dates = _CESNET_TLS22_AVAILABLE_DATES\n time_periods = {\n \"W-2021-40\": [\"20211004\", \"20211005\", \"20211006\", \"20211007\", \"20211008\", \"20211009\", \"20211010\"],\n \"W-2021-41\": [\"20211011\", \"20211012\", \"20211013\", \"20211014\", \"20211015\", \"20211016\", \"20211017\"],\n }\n default_train_period_name = \"W-2021-40\"\n default_test_period_name = \"W-2021-41\"\n _tables_app_enum = _CESNET_TLS22_TABLES_APP_ENUM\n _tables_cat_enum = _CESNET_TLS22_TABLES_CATEGORY_ENUM\n
"},{"location":"reference_datasets/#datasets.datasets.CESNET_QUIC22","title":"datasets.datasets.CESNET_QUIC22","text":" Bases: CesnetDataset
Dataset class for CESNET-QUIC22.
Source code in cesnet_datazoo\\datasets\\datasets.py
class CESNET_QUIC22(CesnetDataset):\n \"\"\"Dataset class for [CESNET-QUIC22][cesnet-quic22].\"\"\"\n name = \"CESNET-QUIC22\"\n database_filename = \"CESNET-QUIC22.h5\"\n bucket_url = \"https://liberouter.org/datazoo/download?bucket=cesnet-quic22\"\n available_dates = _CESNET_QUIC22_AVAILABLE_DATES\n time_periods = {\n \"W-2022-44\": [\"20221031\", \"20221101\", \"20221102\", \"20221103\", \"20221104\", \"20221105\", \"20221106\"],\n \"W-2022-45\": [\"20221107\", \"20221108\", \"20221109\", \"20221110\", \"20221111\", \"20221112\", \"20221113\"],\n \"W-2022-46\": [\"20221114\", \"20221115\", \"20221116\", \"20221117\", \"20221118\", \"20221119\", \"20221120\"],\n \"W-2022-47\": [\"20221121\", \"20221122\", \"20221123\", \"20221124\", \"20221125\", \"20221126\", \"20221127\"],\n \"W45-47\": [\"20221107\", \"20221108\", \"20221109\", \"20221110\", \"20221111\", \"20221112\", \"20221113\",\n \"20221114\", \"20221115\", \"20221116\", \"20221117\", \"20221118\", \"20221119\", \"20221120\",\n \"20221121\", \"20221122\", \"20221123\", \"20221124\", \"20221125\", \"20221126\", \"20221127\"],\n }\n default_train_period_name = \"W-2022-44\"\n default_test_period_name = \"W-2022-45\"\n _tables_app_enum = _CESNET_QUIC22_TABLES_APP_ENUM\n _tables_cat_enum = _CESNET_QUIC22_TABLES_CATEGORY_ENUM\n
"},{"location":"reference_datasets/#datasets.datasets.CESNET_TLS_Year22","title":"datasets.datasets.CESNET_TLS_Year22","text":" Bases: CesnetDataset
Dataset class for CESNET-TLS-Year22.
Source code in cesnet_datazoo\\datasets\\datasets.py
class CESNET_TLS_Year22(CesnetDataset):\n \"\"\"Dataset class for [CESNET-TLS-Year22][cesnet-tls-year22].\"\"\"\n name = \"CESNET-TLS-Year22\"\n database_filename = \"CESNET-TLS-Year22.h5\"\n bucket_url = \"https://liberouter.org/datazoo/download?bucket=cesnet-tls-year22\"\n available_dates = _CESNET_TLS_YEAR22_AVAILABLE_DATES\n time_periods = _CESNET_TLS_YEAR22_TIME_PERIODS\n default_train_period_name = \"M-2022-9\"\n default_test_period_name = \"M-2022-10\"\n _tables_app_enum = _CESNET_TLS_YEAR22_TABLES_APP_ENUM\n _tables_cat_enum = _CESNET_TLS_YEAR22_TABLES_CATEGORY_ENUM\n
"},{"location":"transforms/","title":"Transforms","text":"The cesnet_datazoo package supports configurable transforms of input data, similar to what torchvision provides for computer vision. Input features are split into three groups, each with its own transformation: PPI sequences, flow statistics, and packet histograms.
ppi_transform of DatasetConfig is applied to PPI sequences.
flowstats_transform is applied to flow statistics (excluding boolean features, such as flow end reasons or TCP flags).
flowstats_phist_transform is applied to packet histograms.
Transforms are implemented in a separate package CESNET Models. See cesnet_models.transforms documentation for details.
Limitations
The current implementation does not support composing transformations.
"},{"location":"transforms/#available-transformations","title":"Available transformations","text":"PPI sequences
Flow statistics
Packet histograms
More transformations will be implemented in future versions.
"},{"location":"transforms/#data-scaling","title":"Data scaling","text":"Transformations implementing data scaling will be fitted, if needed, on a subset of training data during dataset initialization.
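The idea of fitting a scaling transform on a subset of the training data can be sketched with the standard library alone. The class below is a hypothetical stand-in and not the cesnet_models.transforms API. The synthetic durations, the subset size of 1000, and the choice of standard scaling are all assumptions for illustration.

```python
import random
import statistics


class StandardScalerSketch:
    # Hypothetical scaler: subtract the mean and divide by the standard deviation.
    def fit(self, values):
        self.mean = statistics.fmean(values)
        # Fall back to 1.0 to avoid dividing by zero for constant features.
        self.std = statistics.pstdev(values) or 1.0
        return self

    def transform(self, values):
        return [(v - self.mean) / self.std for v in values]


rng = random.Random(0)
# Synthetic stand-in for one flow statistics feature of the train set.
train_durations = [rng.uniform(0.0, 300.0) for _ in range(10_000)]

# Fit on a random subset only, then apply the fitted scaler to all samples.
fit_subset = rng.sample(train_durations, 1_000)
scaler = StandardScalerSketch().fit(fit_subset)
scaled = scaler.transform(train_durations)
print(len(scaled))
```

Fitting on a subset keeps initialization cheap on large datasets, at the cost of statistics that only approximate those of the full train set.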
"}]} \ No newline at end of file diff --git a/sitemap.xml.gz b/sitemap.xml.gz index b0f62ab..98b3e81 100755 Binary files a/sitemap.xml.gz and b/sitemap.xml.gz differ