<p>When a dataloader is iterated, the returned data are in the format <code>tuple(batch_ppi, batch_flowstats, batch_labels)</code>. The batch size <em>B</em> is configured with <code>batch_size</code> and <code>test_batch_size</code> config options.
584
+
<p>When a dataloader is iterated, the returned data are in the format <code>tuple(batch_other_fields, batch_ppi, batch_flowstats, batch_labels)</code>. Batch size <em>B</em> is configured with <code>batch_size</code> and <code>test_batch_size</code> config options.
582
585
The shapes are:</p>
583
586
<ul>
584
-
<li>batch_ppi - <code>(B, [3, 4], 30)</code> - the middle dimension is either 4 when TCP push flags are used (<code>use_push_flags</code>) or 3 otherwise.</li>
585
-
<li>batch_flowstats <code>(B, F)</code> - where F is the number of flowstats features computed with <a class="autorefs autorefs-internal" href="../reference_dataset_config/#config.DatasetConfig.get_flowstats_features_len">DatasetConfig.get_flowstats_features_len</a>. To get the order and names of flowstats features, call <a class="autorefs autorefs-internal" href="../reference_dataset_config/#config.DatasetConfig.get_flowstats_feature_names_expanded">DatasetConfig.get_flowstats_feature_names_expanded</a>. The batch_flowstats array includes flow statistics, TCP features (if available and configured), and bins of packet histograms (if available and configured). See the <a class="autorefs autorefs-internal" href="../features/#features">data features</a> page for more information about features.</li>
586
-
<li>batch_labels <code>(B)</code> - integer labels encoded with <code>LabelEncoder</code> available at <code>dataset.encoder</code>.</li>
587
+
<li>batch_other_fields <code>pd.DataFrame (B, C)</code> - a Pandas DataFrame with <a class="autorefs autorefs-internal" href="../features/#other-fields">auxiliary fields</a>, such as communicating hosts, flow times, and more fields extracted from the ClientHello message. If the <code>return_other_fields</code> config option is false, this will be an empty DataFrame. DataFrame columns C depend on the used dataset and are available at <code>dataset_config.other_fields</code>.</li>
588
+
<li>batch_ppi - <code>np.ndarray (B, [3, 4], 30)</code> - the middle dimension is either 4 when TCP push flags are used (<code>use_push_flags</code>) or 3 otherwise.</li>
589
+
<li>batch_flowstats <code>np.ndarray (B, F)</code> - where F is the number of flowstats features computed with <a class="autorefs autorefs-internal" href="../reference_dataset_config/#config.DatasetConfig.get_flowstats_features_len">DatasetConfig.get_flowstats_features_len</a>. To get the order and names of flowstats features, call <a class="autorefs autorefs-internal" href="../reference_dataset_config/#config.DatasetConfig.get_flowstats_feature_names_expanded">DatasetConfig.get_flowstats_feature_names_expanded</a>. The batch_flowstats array includes flow statistics, TCP features (if available and configured), and bins of packet histograms (if available and configured). See the <a class="autorefs autorefs-internal" href="../features/#features">data features</a> page for more information about features.</li>
590
+
<li>batch_labels <code>np.ndarray (B)</code> - integer labels encoded with a <code>LabelEncoder</code> instance available at <code>dataset.encoder</code>.</li>
587
591
</ul>
588
-
<p>Data returned from dataloaders are scaled depending on the selected configuration; see <a class="autorefs autorefs-internal" href="../reference_dataset_config/#config.DatasetConfig"><code>DatasetConfig</code></a> for options.</p>
592
+
<p>PPI and flow statistics features returned from dataloaders are scaled depending on the selected configuration; see <a class="autorefs autorefs-internal" href="../reference_dataset_config/#config.DatasetConfig"><code>DatasetConfig</code></a> for options.</p>
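The four-element batch layout described above can be checked with a small shape validator. This is a minimal sketch, not part of cesnet_datazoo: `check_batch` and `PPI_LEN` are hypothetical names, and the demo uses stand-in objects that only carry a `.shape` attribute instead of real `np.ndarray` and `pd.DataFrame` batches.

```python
from types import SimpleNamespace

PPI_LEN = 30  # last PPI dimension, taken from the documented shape (B, [3, 4], 30)


def check_batch(batch, flowstats_len, use_push_flags=False):
    """Unpack one dataloader batch and verify the documented shapes.

    Hypothetical helper: it accepts any objects exposing `.shape`,
    which np.ndarray and pd.DataFrame both do. Returns the batch size B.
    """
    batch_other_fields, batch_ppi, batch_flowstats, batch_labels = batch
    batch_size = batch_labels.shape[0]
    # The middle PPI dimension is 4 with TCP push flags, 3 otherwise.
    ppi_channels = 4 if use_push_flags else 3

    expected = {
        "batch_ppi": (batch_ppi.shape, (batch_size, ppi_channels, PPI_LEN)),
        "batch_flowstats": (batch_flowstats.shape, (batch_size, flowstats_len)),
        "batch_labels": (batch_labels.shape, (batch_size,)),
    }
    for name, (actual, wanted) in expected.items():
        if tuple(actual) != wanted:
            raise ValueError(f"{name}: expected shape {wanted}, got {tuple(actual)}")

    # With return_other_fields set to false the DataFrame is empty (0 rows).
    if batch_other_fields.shape[0] not in (0, batch_size):
        raise ValueError("batch_other_fields: row count does not match batch size")
    return batch_size


# Stand-in batch: B = 64, no push flags, F = 17 (an arbitrary illustrative value).
fake_batch = (
    SimpleNamespace(shape=(0, 0)),       # empty batch_other_fields
    SimpleNamespace(shape=(64, 3, 30)),  # batch_ppi
    SimpleNamespace(shape=(64, 17)),     # batch_flowstats
    SimpleNamespace(shape=(64,)),        # batch_labels
)
print(check_batch(fake_batch, flowstats_len=17))
```

In real use, `flowstats_len` would presumably come from `DatasetConfig.get_flowstats_features_len`, and the batch from iterating one of the dataloaders.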
<p>Most of those fields are not yet available in the <a class="autorefs autorefs-internal" href="../reference_cesnet_dataset/#datasets.cesnet_dataset.CesnetDataset"><code>CesnetDataset</code></a> class. To access them, create an instance of <code>cesnet_datazoo.pytables_data.pytables_dataset.PyTablesDataset</code> and set <code>return_all_fields</code> to true.</p>
850
+
<p>Datasets contain auxiliary information about samples, such as communicating hosts, flow times, and more fields extracted from the ClientHello message. The <a class="autorefs autorefs-internal" href="../dataset_metadata/#metadata">dataset metadata</a> page lists available fields in individual datasets.</p>
<td>Timestamp of the first packet in format <em>YYYY-MM-DDTHH-MM-SS.ffffff</em></td>
868
-
<td></td>
867
+
<td>Timestamp of the first packet</td>
868
+
<td><code>return_other_fields</code></td>
869
869
</tr>
870
870
<tr>
871
871
<td>TIME_LAST</td>
872
-
<td>Timestamp of the last packet in format <em>YYYY-MM-DDTHH-MM-SS.ffffff</em></td>
873
-
<td></td>
872
+
<td>Timestamp of the last packet</td>
873
+
<td><code>return_other_fields</code></td>
874
874
</tr>
875
875
<tr>
876
876
<td>SRC_IP</td>
877
877
<td>Source IP address</td>
878
-
<td><code>return_ips</code></td>
878
+
<td><code>return_other_fields</code></td>
879
879
</tr>
880
880
<tr>
881
881
<td>DST_IP</td>
882
882
<td>Destination IP address</td>
883
-
<td><code>return_ips</code></td>
883
+
<td><code>return_other_fields</code></td>
884
884
</tr>
885
885
<tr>
886
886
<td>DST_ASN</td>
887
887
<td>Destination Autonomous System number</td>
888
-
<td></td>
888
+
<td><code>return_other_fields</code></td>
889
889
</tr>
890
890
<tr>
891
891
<td>SRC_PORT</td>
892
892
<td>Source port</td>
893
-
<td><code>return_ips</code></td>
893
+
<td><code>return_other_fields</code></td>
894
894
</tr>
895
895
<tr>
896
896
<td>DST_PORT</td>
897
897
<td>Destination port</td>
898
-
<td><code>return_ips</code></td>
898
+
<td><code>return_other_fields</code></td>
899
899
</tr>
900
900
<tr>
901
901
<td>PROTOCOL</td>
902
902
<td>Transport protocol</td>
903
-
<td></td>
903
+
<td><code>return_other_fields</code></td>
904
904
</tr>
905
905
<tr>
906
906
<td>TLS_SNI / QUIC_SNI</td>
907
907
<td>Server Name Indication domain</td>
908
-
<td></td>
908
+
<td><code>return_other_fields</code></td>
909
909
</tr>
910
910
<tr>
911
911
<td>TLS_JA3</td>
912
912
<td>JA3 fingerprint</td>
913
-
<td></td>
913
+
<td><code>return_other_fields</code></td>
914
914
</tr>
915
915
<tr>
916
916
<td>QUIC_VERSION</td>
917
917
<td>QUIC protocol version</td>
918
-
<td></td>
918
+
<td><code>return_other_fields</code></td>
919
919
</tr>
920
920
<tr>
921
921
<td>QUIC_USER_AGENT</td>
922
922
<td>User agent string if available in the QUIC Initial Packet</td>
923
-
<td></td>
924
-
</tr>
925
-
<tr>
926
-
<td>APP</td>
927
-
<td>Web service label</td>
928
-
<td></td>
929
-
</tr>
930
-
<tr>
931
-
<td>CATEGORY</td>
932
-
<td>Service category label</td>
933
-
<td></td>
923
+
<td><code>return_other_fields</code></td>
934
924
</tr>
935
925
</tbody>
936
926
</table>
927
+
<!--
928
+
| APP | Web service label | |
929
+
| CATEGORY | Service category label | |
930
+
-->
931
+
937
932
<h2 id="details-about-packet-histograms-and-ppi">Details about packet histograms and PPI</h2>
938
933
<p>Due to differences in implementation between packet sequences (<a href="https://github.com/CESNET/ipfixprobe/blob/master/process/pstats.cpp">pstats.cpp</a>) and packet histogram (<a href="https://github.com/CESNET/ipfixprobe/blob/master/process/phists.cpp">phists.cpp</a>) plugins of the ipfixprobe exporter, the number of packets in PPI and histograms can differ (even for flows shorter than 30 packets). The differences are summarized in the following table.
939
934
Note that this is related to TLS over TCP datasets.</p>