Major revision to extend SPADE with new capabilities. Now it is possible to
set a voting strategy for the pseudolabeler. It is possible to have a separate
number of GMM components per model. The `alpha` weight parameter can now be set
separately for positive and negative pseudo-labels.
See the updated README for more details.
PiperOrigin-RevId: 718863197
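
The commit message introduces a configurable voting strategy for the pseudolabeler but does not list the available strategies. The README text changed in this commit only describes the default, unanimous voting. The sketch below is one way such a vote could be combined. It is an illustration rather than the SPADE implementation: the function name `combine_votes`, the `"majority"` option, and the `UNLABELED` marker are all hypothetical.

```python
import numpy as np

UNLABELED = -1  # hypothetical marker for points that receive no pseudo-label


def combine_votes(votes, strategy="unanimous"):
    """Combine per-model votes into pseudo-labels.

    votes: array-like of shape (n_models, n_points), 1 = anomalous, 0 = normal.
    Returns an array of 1 (positive), 0 (negative), or UNLABELED per point.
    """
    votes = np.asarray(votes)
    n_models = votes.shape[0]
    positive_votes = votes.sum(axis=0)

    if strategy == "unanimous":
        # Every model must agree before a point is labeled.
        is_positive = positive_votes == n_models
        is_negative = positive_votes == 0
    elif strategy == "majority":
        # Assumed alternative: a strict majority decides; ties stay unlabeled.
        is_positive = positive_votes * 2 > n_models
        is_negative = positive_votes * 2 < n_models
    else:
        raise ValueError(f"Unknown voting strategy: {strategy!r}")

    labels = np.full(votes.shape[1], UNLABELED)
    labels[is_positive] = 1
    labels[is_negative] = 0
    return labels


votes = [[1, 1, 0, 0],
         [1, 0, 0, 1],
         [1, 1, 0, 0]]
unanimous_labels = combine_votes(votes, "unanimous")
majority_labels = combine_votes(votes, "majority")
```

Under unanimous voting, only the first and third points are labeled, because they are the only ones on which all three models agree. This matches the README's note that adding models to the ensemble makes consensus less likely.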
README.md: 26 additions, 10 deletions (+26 -10)
@@ -78,6 +78,8 @@ The metric reported by the pipeline is model [AUC](https://developers.google.com

 <span style="color:red;background-color:lightgrey">label_col_name (string)</span>: The name of the label column in the input BigQuery table.

+<span style="color:red;background-color:lightgrey">labels_are_strings</span>: Whether the labels in the input dataset are strings or integers.
+
 <span style="color:red;background-color:lightgrey">positive_data_value (integer)</span>: The value used in the label column to denote positive data - data points that are anomalous. “1” can be used, for example.

 <span style="color:red;background-color:lightgrey">negative_data_value (integer)</span>: The value used in the label column to denote negative data - data points that are not anomalous. “0” can be used, for example.
@@ -99,17 +101,21 @@ one class classifier ensemble to label a point as negative. The higher this valu

 <span style="color:yellow;background-color:lightgrey">data_test_gcs_uri</span>: Cloud Storage location to store the CSV data to be used for evaluating the supervised model. Note that the positive and negative label values must also be the same in this testing set. It is okay to have your test labels in that form, or use 1 for positive and 0 for negative. Use exactly one of BigQuery locations or GCS locations.

-<span style="color:yellow;background-color:lightgrey">upload_only</span>: Use this setting in conjunction with `output_bigquery_table_path` or `data_output_gcs_uri`. When `True`, the algorithm will just upload the pseudo labeled data to the specified table, and will skip training a supervised model. When set to `False`, the algorithm will also train a supervised model and upload it to a GCS location. Default is `False`.
+<span style="color:yellow;background-color:lightgrey">upload_only (bool)</span>: Use this setting in conjunction with `output_bigquery_table_path` or `data_output_gcs_uri`. When `True`, the algorithm will just upload the pseudo labeled data to the specified table, and will skip training a supervised model. When set to `False`, the algorithm will also train a supervised model and upload it to a GCS location. Default is `False`.

 <span style="color:yellow;background-color:lightgrey">output_bigquery_table_path</span>: A complete BigQuery path in the form of 'project.dataset.table' to be used for uploading the pseudo labeled data. This includes features and new labels. By default, we will use the column names from the input_bigquery_table_path BigQuery table. Use exactly one of BigQuery locations or GCS locations.

 <span style="color:yellow;background-color:lightgrey">data_output_gcs_uri</span>: Cloud Storage location used for uploading the pseudo labeled data as CSV. This includes features and new labels. By default, we will use the column names from the data_input_gcs_uri table. Use exactly one of BigQuery locations or GCS locations.

-<span style="color:yellow;background-color:lightgrey">alpha (float)</span>: Sample weights for weighting the loss function, only for pseudo-labeled data from the occ ensemble. Original data that is labeled will have a weight of 1.0. By default, we use alpha = 1.0.
+<span style="color:yellow;background-color:lightgrey">voting_strategy (bool)</span>: The voting strategy to use when determining if a data point is anomalous. By default, we use unanimous voting, meaning all the models in the ensemble need to agree in order to label a data point as anomalous.
+
+<span style="color:yellow;background-color:lightgrey">alpha (float)</span>: Sample weights for weighting the loss function, only for positively pseudo-labeled data from the occ ensemble. Original data that is labeled will have a weight of 1.0. If this is provided and `alpha_negative_pseudolabels` is not provided, then this value will be used for both positive and negative pseudo-labeled data. By default, we use alpha = 1.0.
+
+<span style="color:yellow;background-color:lightgrey">alpha_negative_pseudolabels (float)</span>: Sample weights for weighting the loss function, only for negatively pseudo-labeled data from the occ ensemble. Original data that is labeled will have a weight of 1.0. If this is not provided, then the `alpha` value will be used for both positive and negative pseudo-labeled data. By default, we use alpha_negative_pseudolabels = 1.0.

 <span style="color:yellow;background-color:lightgrey">ensemble_count</span>: Integer representing the number of one class classifiers in the ensemble used for pseudo labeling unlabeled data points. The more models in the ensemble, the less likely it is for all the models to gain consensus, and thus will reduce the amount of labeled data points. By default, we use 5 one class classifiers.

-<span style="color:yellow;background-color:lightgrey">n_components</span>: Integer representing the number of components to use in the one class classifier ensemble. By default, we use 1 component.
+<span style="color:yellow;background-color:lightgrey">n_components</span>: The number of components to use in the one class classifier ensemble. By default, we use 1 component. Pass a single integer if all the ensemble models should have the same number of components. Pass a space-separated list of integers if you want to use different numbers of components for each model in the ensemble. By default, we use 1 component.

 <span style="color:yellow;background-color:lightgrey">covariance_type</span>: String representing the covariance type to use in the one class classifier ensemble. By default, we use 'full' covariance. Note that when there are many components, a 'full' covariance matrix may not be suitable.
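
The `alpha` fallback and the two accepted forms of `n_components` described in the diff above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code from this commit. The helper names (`resolve_alphas`, `expand_n_components`, `sample_weights`) are hypothetical. It also assumes that an unset `alpha_negative_pseudolabels` is represented as `None`, and that a list of component counts must have exactly one entry per ensemble model.

```python
from typing import List, Optional, Tuple, Union


def resolve_alphas(
    alpha: float = 1.0,
    alpha_negative_pseudolabels: Optional[float] = None,
) -> Tuple[float, float]:
    """Return (positive_weight, negative_weight) for pseudo-labeled data."""
    if alpha_negative_pseudolabels is None:
        # Assumption: "not provided" is modeled as None, so alpha covers both.
        return alpha, alpha
    return alpha, alpha_negative_pseudolabels


def expand_n_components(
    n_components: Union[int, str] = 1, ensemble_count: int = 5
) -> List[int]:
    """Turn a single integer or a space-separated list into one value per model."""
    values = [int(v) for v in str(n_components).split()]
    if len(values) == 1:
        return values * ensemble_count
    if len(values) != ensemble_count:
        # Assumption: a per-model list must match the ensemble size.
        raise ValueError("Expected one n_components value per ensemble model.")
    return values


def sample_weights(labels, is_pseudolabeled, alpha=1.0,
                   alpha_negative_pseudolabels=None) -> List[float]:
    """Originally labeled rows get 1.0; pseudo-labeled rows get an alpha weight."""
    pos_w, neg_w = resolve_alphas(alpha, alpha_negative_pseudolabels)
    weights = []
    for label, pseudo in zip(labels, is_pseudolabeled):
        if not pseudo:
            weights.append(1.0)
        elif label == 1:
            weights.append(pos_w)
        else:
            weights.append(neg_w)
    return weights


weights = sample_weights(
    labels=[1, 0, 1, 0],
    is_pseudolabeled=[False, True, True, False],
    alpha=0.5,
    alpha_negative_pseudolabels=0.1,
)
per_model_components = expand_n_components("1 2 3", ensemble_count=3)
```

Here only the second and third rows are pseudo-labeled, so they take the negative and positive weights respectively, while the originally labeled rows keep a weight of 1.0.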