Commit 1f79af8

Deployed fd517b8 with MkDocs version: 1.5.3
1 parent 7570777 commit 1f79af8

6 files changed

+312
-189
lines changed

dataloaders/index.html

Lines changed: 11 additions & 7 deletions
@@ -566,26 +566,30 @@ <h1 id="using-dataloaders">Using dataloaders</h1>
566566
<p>Apart from loading data into dataframes, the <code>cesnet-datazoo</code> package provides dataloaders for processing data in smaller batches.</p>
567567
<p>An example of how dataloaders can be used is in <code>cesnet_datazoo.datasets.loaders</code> or in the following snippet:</p>
568568
<div class="highlight"><pre><span></span><code><span class="k">def</span> <span class="nf">load_from_dataloader</span><span class="p">(</span><span class="n">dataloader</span><span class="p">:</span> <span class="n">DataLoader</span><span class="p">):</span>
569+
<span class="n">other_fields</span> <span class="o">=</span> <span class="p">[]</span>
569570
<span class="n">data_ppi</span> <span class="o">=</span> <span class="p">[]</span>
570571
<span class="n">data_flowstats</span> <span class="o">=</span> <span class="p">[]</span>
571572
<span class="n">labels</span> <span class="o">=</span> <span class="p">[]</span>
572-
<span class="k">for</span> <span class="n">batch_ppi</span><span class="p">,</span> <span class="n">batch_flowstats</span><span class="p">,</span> <span class="n">batch_labels</span> <span class="ow">in</span> <span class="n">dataloader</span><span class="p">:</span>
573+
<span class="k">for</span> <span class="n">batch_other_fields</span><span class="p">,</span> <span class="n">batch_ppi</span><span class="p">,</span> <span class="n">batch_flowstats</span><span class="p">,</span> <span class="n">batch_labels</span> <span class="ow">in</span> <span class="n">dataloader</span><span class="p">:</span>
574+
<span class="n">other_fields</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">batch_other_fields</span><span class="p">)</span>
573575
<span class="n">data_ppi</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">batch_ppi</span><span class="p">)</span>
574576
<span class="n">data_flowstats</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">batch_flowstats</span><span class="p">)</span>
575577
<span class="n">labels</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">batch_labels</span><span class="p">)</span>
578+
<span class="n">df_other_fields</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">concat</span><span class="p">(</span><span class="n">other_fields</span><span class="p">,</span> <span class="n">ignore_index</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
576579
<span class="n">data_ppi</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">concatenate</span><span class="p">(</span><span class="n">data_ppi</span><span class="p">)</span>
577580
<span class="n">data_flowstats</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">concatenate</span><span class="p">(</span><span class="n">data_flowstats</span><span class="p">)</span>
578581
<span class="n">labels</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">concatenate</span><span class="p">(</span><span class="n">labels</span><span class="p">)</span>
579-
<span class="k">return</span> <span class="n">data_ppi</span><span class="p">,</span> <span class="n">data_flowstats</span><span class="p">,</span> <span class="n">labels</span>
582+
<span class="k">return</span> <span class="n">df_other_fields</span><span class="p">,</span> <span class="n">data_ppi</span><span class="p">,</span> <span class="n">data_flowstats</span><span class="p">,</span> <span class="n">labels</span>
580583
</code></pre></div>
581-
<p>When a dataloader is iterated, the returned data are in the format <code>tuple(batch_ppi, batch_flowstats, batch_labels)</code>. The batch size <em>B</em> is configured with <code>batch_size</code> and <code>test_batch_size</code> config options.
584+
<p>When a dataloader is iterated, the returned data are in the format <code>tuple(batch_other_fields, batch_ppi, batch_flowstats, batch_labels)</code>. Batch size <em>B</em> is configured with <code>batch_size</code> and <code>test_batch_size</code> config options.
582585
The shapes are:</p>
583586
<ul>
584-
<li>batch_ppi - <code>(B, [3, 4], 30)</code> - the middle dimension is either 4 when TCP push flags are used (<code>use_push_flags</code>) or 3 otherwise.</li>
585-
<li>batch_flowstats <code>(B, F)</code> - where F is the number of flowstats features computed with <a class="autorefs autorefs-internal" href="../reference_dataset_config/#config.DatasetConfig.get_flowstats_features_len">DatasetConfig.get_flowstats_features_len</a>. To get the order and names of flowstats features, call <a class="autorefs autorefs-internal" href="../reference_dataset_config/#config.DatasetConfig.get_flowstats_feature_names_expanded">DatasetConfig.get_flowstats_feature_names_expanded</a>. The batch_flowstats array includes flow statistics, TCP features (if available and configured), and bins of packet histograms (if available and configured). See the <a class="autorefs autorefs-internal" href="../features/#features">data features</a> page for more information about features.</li>
586-
<li>batch_labels <code>(B)</code> - integer labels encoded with <code>LabelEncoder</code> available at <code>dataset.encoder</code>.</li>
587+
<li>batch_other_fields <code>pd.DataFrame (B, C)</code> - a Pandas DataFrame with <a class="autorefs autorefs-internal" href="../features/#other-fields">auxiliary fields</a>, such as communicating hosts, flow times, and more fields extracted from the ClientHello message. If the <code>return_other_fields</code> config option is false, this will be an empty DataFrame. DataFrame columns C depend on the used dataset and are available at <code>dataset_config.other_fields</code>.</li>
588+
<li>batch_ppi - <code>np.ndarray (B, [3, 4], 30)</code> - the middle dimension is either 4 when TCP push flags are used (<code>use_push_flags</code>) or 3 otherwise.</li>
589+
<li>batch_flowstats <code>np.ndarray (B, F)</code> - where F is the number of flowstats features computed with <a class="autorefs autorefs-internal" href="../reference_dataset_config/#config.DatasetConfig.get_flowstats_features_len">DatasetConfig.get_flowstats_features_len</a>. To get the order and names of flowstats features, call <a class="autorefs autorefs-internal" href="../reference_dataset_config/#config.DatasetConfig.get_flowstats_feature_names_expanded">DatasetConfig.get_flowstats_feature_names_expanded</a>. The batch_flowstats array includes flow statistics, TCP features (if available and configured), and bins of packet histograms (if available and configured). See the <a class="autorefs autorefs-internal" href="../features/#features">data features</a> page for more information about features.</li>
590+
<li>batch_labels <code>np.ndarray (B)</code> - integer labels encoded with a <code>LabelEncoder</code> instance available at <code>dataset.encoder</code>.</li>
587591
</ul>
588-
<p>Data returned from dataloaders are scaled depending on the selected configuration; see <a class="autorefs autorefs-internal" href="../reference_dataset_config/#config.DatasetConfig"><code>DatasetConfig</code></a> for options.</p>
592+
<p>PPI and flow statistics features returned from dataloaders are scaled depending on the selected configuration; see <a class="autorefs autorefs-internal" href="../reference_dataset_config/#config.DatasetConfig"><code>DatasetConfig</code></a> for options.</p>
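
The four-element batch format introduced by this change can be tried out without the package installed. The following is a minimal sketch, not the package's own code: `fake_dataloader` is a hypothetical stand-in that yields tuples with the shapes documented above, and the `ID` column, the batch size, and the number of flowstats features are assumptions chosen for illustration. `load_from_dataloader` mirrors the updated snippet from the diff.

```python
import numpy as np
import pandas as pd


def fake_dataloader(num_batches=3, batch_size=4, num_flowstats=5, use_push_flags=False):
    """Hypothetical stand-in for a cesnet-datazoo dataloader (not part of the package)."""
    rng = np.random.default_rng(0)
    # The middle PPI dimension is 4 with TCP push flags and 3 otherwise.
    ppi_channels = 4 if use_push_flags else 3
    for i in range(num_batches):
        # Auxiliary fields arrive as a DataFrame with one row per sample.
        batch_other_fields = pd.DataFrame(
            {"ID": np.arange(i * batch_size, (i + 1) * batch_size)}
        )
        batch_ppi = rng.normal(size=(batch_size, ppi_channels, 30))
        batch_flowstats = rng.normal(size=(batch_size, num_flowstats))
        batch_labels = rng.integers(0, 10, size=batch_size)
        yield batch_other_fields, batch_ppi, batch_flowstats, batch_labels


def load_from_dataloader(dataloader):
    """Collect all batches, following the updated four-element tuple format."""
    other_fields, data_ppi, data_flowstats, labels = [], [], [], []
    for batch_other_fields, batch_ppi, batch_flowstats, batch_labels in dataloader:
        other_fields.append(batch_other_fields)
        data_ppi.append(batch_ppi)
        data_flowstats.append(batch_flowstats)
        labels.append(batch_labels)
    # DataFrames are joined with pd.concat; arrays with np.concatenate.
    df_other_fields = pd.concat(other_fields, ignore_index=True)
    return (
        df_other_fields,
        np.concatenate(data_ppi),
        np.concatenate(data_flowstats),
        np.concatenate(labels),
    )


df_other_fields, data_ppi, data_flowstats, labels = load_from_dataloader(fake_dataloader())
print(df_other_fields.shape, data_ppi.shape, data_flowstats.shape, labels.shape)
# prints: (12, 1) (12, 3, 30) (12, 5) (12,)
```

Code written against the previous three-element format would fail to unpack these batches, so existing loops need the extra leading `batch_other_fields` variable even when `return_other_fields` is false and the DataFrame is empty.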
589593

590594

591595

features/index.html

Lines changed: 21 additions & 26 deletions
@@ -847,7 +847,7 @@ <h2 id="tcp-features">TCP features</h2>
847847
</tbody>
848848
</table>
849849
<h2 id="other-fields">Other fields</h2>
850-
<p>Most of those fields are not yet available in the <a class="autorefs autorefs-internal" href="../reference_cesnet_dataset/#datasets.cesnet_dataset.CesnetDataset"><code>CesnetDataset</code></a> class. To access them, create an instance of <code>cesnet_datazoo.pytables_data.pytables_dataset.PyTablesDataset</code> and set <code>return_all_fields</code> to true.</p>
850+
<p>Datasets contain auxiliary information about samples, such as communicating hosts, flow times, and more fields extracted from the ClientHello message. The <a class="autorefs autorefs-internal" href="../dataset_metadata/#metadata">dataset metadata</a> page lists available fields in individual datasets. </p>
851851
<table>
852852
<thead>
853853
<tr>
@@ -860,80 +860,75 @@ <h2 id="other-fields">Other fields</h2>
860860
<tr>
861861
<td>ID</td>
862862
<td>Per-dataset unique flow identifier</td>
863-
<td></td>
863+
<td><code>return_other_fields</code></td>
864864
</tr>
865865
<tr>
866866
<td>TIME_FIRST</td>
867-
<td>Timestamp of the first packet in format <em>YYYY-MM-DDTHH-MM-SS.ffffff</em></td>
868-
<td></td>
867+
<td>Timestamp of the first packet</td>
868+
<td><code>return_other_fields</code></td>
869869
</tr>
870870
<tr>
871871
<td>TIME_LAST</td>
872-
<td>Timestamp of the last packet in format <em>YYYY-MM-DDTHH-MM-SS.ffffff</em></td>
873-
<td></td>
872+
<td>Timestamp of the last packet</td>
873+
<td><code>return_other_fields</code></td>
874874
</tr>
875875
<tr>
876876
<td>SRC_IP</td>
877877
<td>Source IP address</td>
878-
<td><code>return_ips</code></td>
878+
<td><code>return_other_fields</code></td>
879879
</tr>
880880
<tr>
881881
<td>DST_IP</td>
882882
<td>Destination IP address</td>
883-
<td><code>return_ips</code></td>
883+
<td><code>return_other_fields</code></td>
884884
</tr>
885885
<tr>
886886
<td>DST_ASN</td>
887887
<td>Destination Autonomous System number</td>
888-
<td></td>
888+
<td><code>return_other_fields</code></td>
889889
</tr>
890890
<tr>
891891
<td>SRC_PORT</td>
892892
<td>Source port</td>
893-
<td><code>return_ips</code></td>
893+
<td><code>return_other_fields</code></td>
894894
</tr>
895895
<tr>
896896
<td>DST_PORT</td>
897897
<td>Destination port</td>
898-
<td><code>return_ips</code></td>
898+
<td><code>return_other_fields</code></td>
899899
</tr>
900900
<tr>
901901
<td>PROTOCOL</td>
902902
<td>Transport protocol</td>
903-
<td></td>
903+
<td><code>return_other_fields</code></td>
904904
</tr>
905905
<tr>
906906
<td>TLS_SNI / QUIC_SNI</td>
907907
<td>Server Name Indication domain</td>
908-
<td></td>
908+
<td><code>return_other_fields</code></td>
909909
</tr>
910910
<tr>
911911
<td>TLS_JA3</td>
912912
<td>JA3 fingerprint</td>
913-
<td></td>
913+
<td><code>return_other_fields</code></td>
914914
</tr>
915915
<tr>
916916
<td>QUIC_VERSION</td>
917917
<td>QUIC protocol version</td>
918-
<td></td>
918+
<td><code>return_other_fields</code></td>
919919
</tr>
920920
<tr>
921921
<td>QUIC_USER_AGENT</td>
922922
<td>User agent string if available in the QUIC Initial Packet</td>
923-
<td></td>
924-
</tr>
925-
<tr>
926-
<td>APP</td>
927-
<td>Web service label</td>
928-
<td></td>
929-
</tr>
930-
<tr>
931-
<td>CATEGORY</td>
932-
<td>Service category label</td>
933-
<td></td>
923+
<td><code>return_other_fields</code></td>
934924
</tr>
935925
</tbody>
936926
</table>
927+
<!--
928+
| APP | Web service label | |
929+
| CATEGORY | Service category label | |
930+
-->
931+
937932
<h2 id="details-about-packet-histograms-and-ppi">Details about packet histograms and PPI</h2>
938933
<p>Due to differences in implementation between packet sequences (<a href="https://github.com/CESNET/ipfixprobe/blob/master/process/pstats.cpp">pstats.cpp</a>) and packet histogram (<a href="https://github.com/CESNET/ipfixprobe/blob/master/process/phists.cpp">phist.cpp</a>) plugins of the ipfixprobe exporter, the number of packets in PPI and histograms can differ (even for flows shorter than 30 packets). The differences are summarized in the following table.
939934
Note that this is related to TLS over TCP datasets.</p>
