Commit

Deployed fd517b8 with MkDocs version: 1.5.3
janluxemburk committed Feb 14, 2024
1 parent 7570777 commit 1f79af8
Showing 6 changed files with 312 additions and 189 deletions.
18 changes: 11 additions & 7 deletions dataloaders/index.html
@@ -566,26 +566,30 @@ <h1 id="using-dataloaders">Using dataloaders</h1>
<p>Apart from loading data into dataframes, the <code>cesnet-datazoo</code> package provides dataloaders for processing data in smaller batches.</p>
<p>An example of how dataloaders can be used is in <code>cesnet_datazoo.datasets.loaders</code> or in the following snippet:</p>
<div class="highlight"><pre><span></span><code><span class="k">def</span> <span class="nf">load_from_dataloader</span><span class="p">(</span><span class="n">dataloader</span><span class="p">:</span> <span class="n">DataLoader</span><span class="p">):</span>
<span class="n">other_fields</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">data_ppi</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">data_flowstats</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">labels</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">batch_ppi</span><span class="p">,</span> <span class="n">batch_flowstats</span><span class="p">,</span> <span class="n">batch_labels</span> <span class="ow">in</span> <span class="n">dataloader</span><span class="p">:</span>
<span class="k">for</span> <span class="n">batch_other_fields</span><span class="p">,</span> <span class="n">batch_ppi</span><span class="p">,</span> <span class="n">batch_flowstats</span><span class="p">,</span> <span class="n">batch_labels</span> <span class="ow">in</span> <span class="n">dataloader</span><span class="p">:</span>
<span class="n">other_fields</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">batch_other_fields</span><span class="p">)</span>
<span class="n">data_ppi</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">batch_ppi</span><span class="p">)</span>
<span class="n">data_flowstats</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">batch_flowstats</span><span class="p">)</span>
<span class="n">labels</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">batch_labels</span><span class="p">)</span>
<span class="n">df_other_fields</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">concat</span><span class="p">(</span><span class="n">other_fields</span><span class="p">,</span> <span class="n">ignore_index</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
<span class="n">data_ppi</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">concatenate</span><span class="p">(</span><span class="n">data_ppi</span><span class="p">)</span>
<span class="n">data_flowstats</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">concatenate</span><span class="p">(</span><span class="n">data_flowstats</span><span class="p">)</span>
<span class="n">labels</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">concatenate</span><span class="p">(</span><span class="n">labels</span><span class="p">)</span>
<span class="k">return</span> <span class="n">data_ppi</span><span class="p">,</span> <span class="n">data_flowstats</span><span class="p">,</span> <span class="n">labels</span>
<span class="k">return</span> <span class="n">df_other_fields</span><span class="p">,</span> <span class="n">data_ppi</span><span class="p">,</span> <span class="n">data_flowstats</span><span class="p">,</span> <span class="n">labels</span>
</code></pre></div>
<p>When a dataloader is iterated, the returned data are in the format <code>tuple(batch_ppi, batch_flowstats, batch_labels)</code>. The batch size <em>B</em> is configured with <code>batch_size</code> and <code>test_batch_size</code> config options.
<p>When a dataloader is iterated, the returned data are in the format <code>tuple(batch_other_fields, batch_ppi, batch_flowstats, batch_labels)</code>. Batch size <em>B</em> is configured with <code>batch_size</code> and <code>test_batch_size</code> config options.
The shapes are:</p>
<ul>
<li>batch_ppi - <code>(B, [3, 4], 30)</code> - the middle dimension is either 4 when TCP push flags are used (<code>use_push_flags</code>) or 3 otherwise.</li>
<li>batch_flowstats <code>(B, F)</code> - where F is the number of flowstats features computed with <a class="autorefs autorefs-internal" href="../reference_dataset_config/#config.DatasetConfig.get_flowstats_features_len">DatasetConfig.get_flowstats_features_len</a>. To get the order and names of flowstats features, call <a class="autorefs autorefs-internal" href="../reference_dataset_config/#config.DatasetConfig.get_flowstats_feature_names_expanded">DatasetConfig.get_flowstats_feature_names_expanded</a>. The batch_flowstats array includes flow statistics, TCP features (if available and configured), and bins of packet histograms (if available and configured). See the <a class="autorefs autorefs-internal" href="../features/#features">data features</a> page for more information about features.</li>
<li>batch_labels <code>(B)</code> - integer labels encoded with <code>LabelEncoder</code> available at <code>dataset.encoder</code>.</li>
<li>batch_other_fields <code>pd.DataFrame (B, C)</code> - a Pandas DataFrame with <a class="autorefs autorefs-internal" href="../features/#other-fields">auxiliary fields</a>, such as communicating hosts, flow times, and more fields extracted from the ClientHello message. If the <code>return_other_fields</code> config option is false, this will be an empty DataFrame. DataFrame columns C depend on the used dataset and are available at <code>dataset_config.other_fields</code>.</li>
<li>batch_ppi - <code>np.ndarray (B, [3, 4], 30)</code> - the middle dimension is either 4 when TCP push flags are used (<code>use_push_flags</code>) or 3 otherwise.</li>
<li>batch_flowstats <code>np.ndarray (B, F)</code> - where F is the number of flowstats features computed with <a class="autorefs autorefs-internal" href="../reference_dataset_config/#config.DatasetConfig.get_flowstats_features_len">DatasetConfig.get_flowstats_features_len</a>. To get the order and names of flowstats features, call <a class="autorefs autorefs-internal" href="../reference_dataset_config/#config.DatasetConfig.get_flowstats_feature_names_expanded">DatasetConfig.get_flowstats_feature_names_expanded</a>. The batch_flowstats array includes flow statistics, TCP features (if available and configured), and bins of packet histograms (if available and configured). See the <a class="autorefs autorefs-internal" href="../features/#features">data features</a> page for more information about features.</li>
<li>batch_labels <code>np.ndarray (B)</code> - integer labels encoded with a <code>LabelEncoder</code> instance available at <code>dataset.encoder</code>.</li>
</ul>
<p>Data returned from dataloaders are scaled depending on the selected configuration; see <a class="autorefs autorefs-internal" href="../reference_dataset_config/#config.DatasetConfig"><code>DatasetConfig</code></a> for options.</p>
<p>PPI and flow statistics features returned from dataloaders are scaled depending on the selected configuration; see <a class="autorefs autorefs-internal" href="../reference_dataset_config/#config.DatasetConfig"><code>DatasetConfig</code></a> for options.</p>
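The documented batch shapes can be checked with a small helper. The following is a minimal sketch and not part of `cesnet-datazoo`: the names `expected_batch_shapes` and `check_batch` are hypothetical, the numbers in the usage lines are illustrative, and the stand-in objects only mimic the `.shape` attribute that `np.ndarray` and `pd.DataFrame` expose. It assumes `return_other_fields` is enabled and that the batch is full-sized.

```python
from types import SimpleNamespace

PPI_LEN = 30  # length of the last PPI dimension, as documented above


def expected_batch_shapes(batch_size, flowstats_len, n_other_fields, use_push_flags):
    """Return the documented shape of each batch element (hypothetical helper)."""
    # The middle PPI dimension is 4 with TCP push flags and 3 otherwise.
    ppi_channels = 4 if use_push_flags else 3
    return {
        "batch_other_fields": (batch_size, n_other_fields),
        "batch_ppi": (batch_size, ppi_channels, PPI_LEN),
        "batch_flowstats": (batch_size, flowstats_len),
        "batch_labels": (batch_size,),
    }


def check_batch(batch, expected):
    """Return the names of batch elements whose .shape differs from the expected shape."""
    names = ("batch_other_fields", "batch_ppi", "batch_flowstats", "batch_labels")
    return [name for name, item in zip(names, batch) if tuple(item.shape) != expected[name]]


# Usage with stand-in objects; real batches would come from iterating a dataloader.
expected = expected_batch_shapes(batch_size=192, flowstats_len=43, n_other_fields=5, use_push_flags=True)
batch = (
    SimpleNamespace(shape=(192, 5)),      # stands in for the other-fields DataFrame
    SimpleNamespace(shape=(192, 4, 30)),  # stands in for the PPI array
    SimpleNamespace(shape=(192, 43)),     # stands in for the flowstats array
    SimpleNamespace(shape=(192,)),        # stands in for the labels array
)
mismatches = check_batch(batch, expected)
```

In real use, `flowstats_len` would come from `DatasetConfig.get_flowstats_features_len` and `n_other_fields` from the length of `dataset_config.other_fields`. The last batch of an iteration may hold fewer than `batch_size` samples, so a check like this is best applied to the first batch.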



47 changes: 21 additions & 26 deletions features/index.html
@@ -847,7 +847,7 @@ <h2 id="tcp-features">TCP features</h2>
</tbody>
</table>
<h2 id="other-fields">Other fields</h2>
<p>Most of those fields are not yet available in the <a class="autorefs autorefs-internal" href="../reference_cesnet_dataset/#datasets.cesnet_dataset.CesnetDataset"><code>CesnetDataset</code></a> class. To access them, create an instance of <code>cesnet_datazoo.pytables_data.pytables_dataset.PyTablesDataset</code> and set <code>return_all_fields</code> to true.</p>
<p>Datasets contain auxiliary information about samples, such as communicating hosts, flow times, and more fields extracted from the ClientHello message. The <a class="autorefs autorefs-internal" href="../dataset_metadata/#metadata">dataset metadata</a> page lists available fields in individual datasets. </p>
<table>
<thead>
<tr>
@@ -860,80 +860,75 @@ <h2 id="other-fields">Other fields</h2>
<tr>
<td>ID</td>
<td>Per-dataset unique flow identifier</td>
<td></td>
<td><code>return_other_fields</code></td>
</tr>
<tr>
<td>TIME_FIRST</td>
<td>Timestamp of the first packet in format <em>YYYY-MM-DDTHH-MM-SS.ffffff</em></td>
<td></td>
<td>Timestamp of the first packet</td>
<td><code>return_other_fields</code></td>
</tr>
<tr>
<td>TIME_LAST</td>
<td>Timestamp of the last packet in format <em>YYYY-MM-DDTHH-MM-SS.ffffff</em></td>
<td></td>
<td>Timestamp of the last packet</td>
<td><code>return_other_fields</code></td>
</tr>
<tr>
<td>SRC_IP</td>
<td>Source IP address</td>
<td><code>return_ips</code></td>
<td><code>return_other_fields</code></td>
</tr>
<tr>
<td>DST_IP</td>
<td>Destination IP address</td>
<td><code>return_ips</code></td>
<td><code>return_other_fields</code></td>
</tr>
<tr>
<td>DST_ASN</td>
<td>Destination Autonomous System number</td>
<td></td>
<td><code>return_other_fields</code></td>
</tr>
<tr>
<td>SRC_PORT</td>
<td>Source port</td>
<td><code>return_ips</code></td>
<td><code>return_other_fields</code></td>
</tr>
<tr>
<td>DST_PORT</td>
<td>Destination port</td>
<td><code>return_ips</code></td>
<td><code>return_other_fields</code></td>
</tr>
<tr>
<td>PROTOCOL</td>
<td>Transport protocol</td>
<td></td>
<td><code>return_other_fields</code></td>
</tr>
<tr>
<td>TLS_SNI / QUIC_SNI</td>
<td>Server Name Indication domain</td>
<td></td>
<td><code>return_other_fields</code></td>
</tr>
<tr>
<td>TLS_JA3</td>
<td>JA3 fingerprint</td>
<td></td>
<td><code>return_other_fields</code></td>
</tr>
<tr>
<td>QUIC_VERSION</td>
<td>QUIC protocol version</td>
<td></td>
<td><code>return_other_fields</code></td>
</tr>
<tr>
<td>QUIC_USER_AGENT</td>
<td>User agent string if available in the QUIC Initial Packet</td>
<td></td>
</tr>
<tr>
<td>APP</td>
<td>Web service label</td>
<td></td>
</tr>
<tr>
<td>CATEGORY</td>
<td>Service category label</td>
<td></td>
<td><code>return_other_fields</code></td>
</tr>
</tbody>
</table>
<!--
| APP | Web service label | |
| CATEGORY | Service category label | |
-->

<h2 id="details-about-packet-histograms-and-ppi">Details about packet histograms and PPI</h2>
<p>Due to differences in implementation between the packet sequences (<a href="https://github.com/CESNET/ipfixprobe/blob/master/process/pstats.cpp">pstats.cpp</a>) and packet histograms (<a href="https://github.com/CESNET/ipfixprobe/blob/master/process/phists.cpp">phists.cpp</a>) plugins of the ipfixprobe exporter, the number of packets in PPI and histograms can differ (even for flows shorter than 30 packets). The differences are summarized in the following table.
Note that this applies to TLS over TCP datasets.</p>