@@ -72,3 +72,216 @@ only three attributes must be set:
* selection_metrics\_: The values of the selection metric as a 1-D numpy floating point array. A
  higher selection metric indicates a better model.
* metric_name\_: The name of the selection metric as a string. Used for interpretation.
+
+ .. _a_cust_sams:
+
+ Simulated annealing model selection (SAMS)
+ ------------------------------------------
+
+ Simulated annealing model selection, or SAMS, was devised by
+ `Wolters and Bingham (2012) <https://www.tandfonline.com/doi/abs/10.1198/TECH.2011.08157>`_.
+ It is a model selection algorithm which, instead of relying on statistical
+ significance as most methods do, simulates many models and looks at what the
+ well-fitting models have in common. The algorithm works in three stages:
+
+ * The simulation stage: the algorithm simulates many models of a fixed size using simulated
+   annealing and sorts them by their :math:`R^2`. Commonly 10000 or 20000 models are simulated,
+   but the number depends on the problem at hand. A rough sketch of this stage is shown after
+   this list.
+ * The reduction stage: the algorithm takes the simulated models and determines which
+   1-factor, 2-factor, 3-factor, etc. combinations are most common. In other words, it looks
+   at which submodel of size k occurs most frequently in the well-fitting models.
+ * The selection stage: the algorithm takes the most frequently occurring submodels of each size
+   and compares them to determine an ordering. The ordering is based on the entropy, which is
+   explained later.
+
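+ The following is a minimal, illustrative sketch of the simulation stage, not the library's
+ actual implementation. The term pool, the iteration count, and the hypothetical `r2` callable
+ (which should fit the candidate model and return its :math:`R^2`) are placeholders.
+
+ >>> import numpy as np
+ >>>
+ >>> def anneal_once(terms, k, r2, niter=200, T0=1.0):
+ >>>     # Search for one well-fitting model of exactly k terms
+ >>>     model = list(np.random.choice(terms, k, replace=False))
+ >>>     for i in range(niter):
+ >>>         T = T0 * (1 - i / niter)  # decreasing temperature
+ >>>         # Propose swapping one term for a term not yet in the model
+ >>>         new = list(model)
+ >>>         new[np.random.randint(k)] = np.random.choice(
+ >>>             [t for t in terms if t not in model])
+ >>>         # Accept improvements; accept worse models with a temperature-dependent probability
+ >>>         delta = r2(new) - r2(model)
+ >>>         if delta > 0 or np.random.rand() < np.exp(delta / max(T, 1e-6)):
+ >>>             model = new
+ >>>     return model
+ >>>
+ >>> # Simulate many models and sort them by R2, as in the simulation stage
+ >>> models = [anneal_once(range(10), 4, r2) for _ in range(10000)]
+ >>> models = sorted(models, key=r2, reverse=True)
+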
+ As you may notice, the algorithm does not output just one model. It outputs multiple models,
+ ordered by how much confidence it has in each. The last two stages of the algorithm
+ use the result of the first stage to automatically determine this ordering. However, the user
+ may also manually inspect a raster plot of the results, which looks as follows:
+
+
+ Each row is a model, each column is a potential term in the model, and the color indicates the
+ coefficient of the term. Any term not in the model has a coefficient of zero, which is
+ plotted in white. By looking at largely colored columns, we can determine which
+ submodels occur most often.
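+
+ As an illustration only (not the package's own plotting code), such a raster plot can be drawn
+ from a matrix of coefficients, one row per model and one column per term, using matplotlib.
+ The `coefficients` array below is a hypothetical example.
+
+ >>> import numpy as np
+ >>> import matplotlib.pyplot as plt
+ >>>
+ >>> # Hypothetical (nmodels x nterms) matrix of fitted coefficients; zeros mean "term not in model"
+ >>> coefficients = np.random.randn(50, 10) * (np.random.rand(50, 10) < 0.3)
+ >>> lim = np.abs(coefficients).max()
+ >>> # Diverging colormap centered at zero so that absent terms are plotted in white
+ >>> plt.imshow(coefficients, cmap='RdBu', vmin=-lim, vmax=lim, aspect='auto')
+ >>> plt.xlabel('Term')
+ >>> plt.ylabel('Model')
+ >>> plt.colorbar()
+ >>> plt.show()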
+
+ .. note::
+    In some cases, multiple distinct models may perform equally well. Such a scenario is
+    difficult to detect in the raster plot, and also with the entropy criterion. Luckily, we
+    can also cluster the results in the raster plot to make such groups more visible. The
+    terms in each model are binary encoded according to whether each effect is present or not,
+    and a k-means clustering is run on this representation. Such a scenario
+    is investigated in `Wolters and Bingham (2012) <https://www.tandfonline.com/doi/abs/10.1198/TECH.2011.08157>`_.
+
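+ As a rough sketch of that clustering step (not the library's internal code), the models from the
+ simulation stage can be binary encoded and clustered with, for example, scikit-learn. The
+ `models` array and `nterms` below are hypothetical.
+
+ >>> import numpy as np
+ >>> from sklearn.cluster import KMeans
+ >>>
+ >>> # Hypothetical simulation output: each row lists the term indices of one model
+ >>> models = np.array([[0, 1, 3], [0, 1, 4], [0, 2, 3], [0, 1, 3]])
+ >>> nterms = 6
+ >>> # Binary encode each model: 1 if the term is present, 0 otherwise
+ >>> encoded = np.zeros((len(models), nterms))
+ >>> for i, m in enumerate(models):
+ >>>     encoded[i, m] = 1
+ >>> # Cluster the encoded models to reveal groups of distinct, equally good models
+ >>> labels = KMeans(n_clusters=2, n_init=10).fit_predict(encoded)
+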
+ See :py:class:`SamsRegressor <pyoptex.analysis.estimators.sams.estimator.SamsRegressor>` for information
+ on the parameters.
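+
+ A typical call might look as follows. This is only a sketch: it assumes a scikit-learn style
+ `fit` interface, that `factors`, `Y2X` and `dep` have been constructed as in the example at the
+ end of this page, and that `data` is a hypothetical dataframe with the factor columns and a
+ response column 'Y'.
+
+ >>> regr = SamsRegressor(
+ >>>     factors, Y2X,
+ >>>     mode='weak', dependencies=dep,
+ >>>     forced_model=np.array([0], np.int_)
+ >>> )
+ >>> regr.fit(data.drop(columns='Y'), data['Y'])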
+
+
+ Entropy calculations
+ ^^^^^^^^^^^^^^^^^^^^
+
+ The entropy criterion is the key addition that allows the algorithm to perform automated model
+ selection. The entropy is computed as
+
+ .. math::
+
+     e = f_{o} \log_2 \left( \frac{f_{o}}{f_{t}} \right) + (1 - f_{o}) \log_2 \left( \frac{1 - f_{o}}{1 - f_{t}} \right)
+
+ where :math:`f_{o}` is the observed frequency of the submodel in the simulation phase, and
+ :math:`f_{t}` is the theoretical frequency with which this submodel would occur when randomly
+ sampling hereditary models.
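+
+ As a small numerical illustration of this formula (with made-up frequencies):
+
+ >>> from math import log2
+ >>>
+ >>> f_o = 0.30  # observed frequency of the submodel among the simulated models
+ >>> f_t = 0.05  # theoretical frequency under random sampling of hereditary models
+ >>> e = f_o * log2(f_o / f_t) + (1 - f_o) * log2((1 - f_o) / (1 - f_t))
+ >>> round(e, 3)
+ 0.467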
+
+ In `Wolters and Bingham (2012) <https://www.tandfonline.com/doi/abs/10.1198/TECH.2011.08157>`_,
+ the authors performed simulations on screening designs for different model selection algorithms.
+ The oracle method requires prior knowledge about the true model: each term is tested for significance.
+ The AICc method comes from Akaike's Information Criterion (corrected): a search through the
+ hereditary models was performed, from which the best model according to the AICc was selected.
+ This approach, together with the Bayesian Information Criterion (BIC), is commonly applied in practice.
+ The last method is the new SAMS method with entropy selection.
+
+ .. list-table:: Part of the simulation results from Wolters and Bingham (2012) (values in %)
+    :align: center
+    :widths: 1 1 1 1 1
+    :header-rows: 1
+
+    * - Method
+      - Correct
+      - Underfitted
+      - Overfitted
+      - (Partially) Wrong
+    * - Oracle
+      - 62.8
+      - 37.2
+      - 0
+      - 0
+    * - AICc
+      - 7.2
+      - 0.7
+      - 53.8
+      - 38.3
+    * - SAMS
+      - 43.3
+      - 16.2
+      - 15.8
+      - 24.7
+
+ The SAMS method with entropy selection significantly outperforms AICc, with 43.3% of the models
+ identified correctly. Even the oracle method, which has prior knowledge about the true
+ model, only identified 62.8% of the models correctly. AICc found only about 7.2% of the models,
+ making it not very suitable for this kind of scenario.
+
+ .. _samplers_sams:
+
+ There is one downside to the entropy criterion: only in the specific case where the model
+ is a (partial) response surface model with weak heredity can :math:`f_{t}`
+ be computed exactly. To keep the algorithm generic, a fallback was implemented
+ which approximates the entropy using a model sampler. Three different
+ samplers are implemented:
+ :py:func:`sample_model_dep_onebyone <pyoptex.utils.model.sample_model_dep_onebyone>`,
+ :py:func:`sample_model_dep_mcmc <pyoptex.analysis.estimators.models.model.sample_model_dep_mcmc>`
+ and :py:func:`sample_model_dep_random <pyoptex.utils.model.sample_model_dep_random>`.
+
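+ The idea behind all three samplers is the same: draw many random hereditary models and use the
+ fraction that contains the submodel of interest as an estimate of :math:`f_{t}`. A minimal,
+ self-contained sketch of this Monte Carlo estimate (independent of the pyoptex samplers, with
+ `sample_model` a hypothetical callable returning one random hereditary model as a list of term
+ indices) could look like this:
+
+ >>> def estimate_ft(submodel, sample_model, nsamples=10000):
+ >>>     # Fraction of randomly sampled hereditary models that contain the submodel
+ >>>     hits = sum(set(submodel) <= set(sample_model()) for _ in range(nsamples))
+ >>>     return hits / nsamples
+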
+ For each of these samplers, we ran simulations similar to those in
+ `Wolters and Bingham (2012) <https://www.tandfonline.com/doi/abs/10.1198/TECH.2011.08157>`_.
+ We start from a PB12 (Plackett-Burman) design. Next, we generate a random hereditary model
+ by sampling 1 to 4 main effects, :math:`n_{main}`, and sequentially sampling :math:`4 - n_{main}`
+ interaction effects. Note that this is a weak heredity submodel of a partial response surface
+ model where each factor has a linear effect and two-factor interactions.
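+
+ As an illustration of how such a random ground-truth model can be generated (a simplified
+ sketch of the procedure described above, not the exact simulation code):
+
+ >>> import numpy as np
+ >>> from itertools import combinations
+ >>>
+ >>> nfactors = 11  # a PB12 design has 11 two-level factors
+ >>> n_main = np.random.randint(1, 5)  # 1 to 4 main effects
+ >>> mains = list(np.random.choice(nfactors, n_main, replace=False))
+ >>> # Sequentially sample interactions obeying weak heredity (at least one parent main effect)
+ >>> interactions = []
+ >>> candidates = [c for c in combinations(range(nfactors), 2) if set(c) & set(mains)]
+ >>> for _ in range(4 - n_main):
+ >>>     idx = np.random.randint(len(candidates))
+ >>>     interactions.append(candidates.pop(idx))
+ >>> true_model = mains + interactions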
+
+ The results are:
+
+ .. list-table:: Simulations of different entropy approximations (values in %)
+    :align: center
+    :widths: 1 1 1 1 1
+    :header-rows: 1
+
+    * - Method
+      - Correct
+      - Underfitted
+      - Overfitted
+      - (Partially) Wrong
+    * - Exact entropy
+      - 43.7
+      - 30.3
+      - 10.5
+      - 15.5
+    * - One-by-one
+      - 37.3
+      - 12.6
+      - 23.8
+      - 26.3
+    * - Markov chain Monte Carlo (MCMC)
+      - 38.8
+      - 12.3
+      - 23.3
+      - 25.6
+    * - Random
+      - 36.8
+      - 10.1
+      - 26.1
+      - 27.0
+
+ The first row is the exact entropy method as used in
+ `Wolters and Bingham (2012) <https://www.tandfonline.com/doi/abs/10.1198/TECH.2011.08157>`_.
+ Note that all three samplers, even though they perform worse than the exact entropy based on the
+ percentage of correct models, still perform significantly better than AICc. When loosening the
+ classification by also counting models underfitted or overfitted by one term as correct, the exact
+ entropy method reaches 61.1% accuracy, the one-by-one sampler 59.1%, the MCMC sampler 59.3%, and
+ the random sampler 56.5%.
+
+ By default, the one-by-one
+ sampler is used as it performs almost as well as the MCMC sampler, but computes faster.
+
+ .. _warning_sams:
+
+ .. warning::
+    The implementation of SAMS uses the samplers by default. However, the exact method
+    may be used by specifying the `entropy_model_order` parameter in
+    :py:class:`SamsRegressor <pyoptex.analysis.estimators.sams.estimator.SamsRegressor>`.
+    Be careful with this parameter, as it comes with certain
+    assumptions (which are asserted in many, but not all, scenarios).
+
+    First, the heredity mode must be 'weak', otherwise the sampling method is still
+    applied. Second, the model must be generated using
+    :py:func:`partial_rsm_names <pyoptex.utils.model.partial_rsm_names>` followed by
+    :py:func:`model2Y2X <pyoptex.utils.model.model2Y2X>`. Third, the factors must
+    be ordered: first all factors which can have a quadratic effect, then
+    all factors which cannot have a quadratic effect but can have two-factor interactions,
+    and last all factors which can only have a main effect. Finally, the dependency
+    matrix must be generated using
+    :py:func:`order_dependencies <pyoptex.utils.model.order_dependencies>`.
+
+ As an example, create three factors
+
+ >>> factors = [
+ >>>     Factor('A'), Factor('B'), Factor('C')
+ >>> ]
+
+ Next, create the model orders. The order of the factor names in the dictionary
+ **must** be the same as those in the list of factors. They must also be
+ ordered `quad` - `tfi` - `lin`.
+
+ >>> entropy_model_order = {'A': 'quad', 'B': 'tfi', 'C': 'lin'}
+
+ Create the model using :py:func:`partial_rsm_names <pyoptex.utils.model.partial_rsm_names>`.
+ Note that the `quad` elements are first, then the `tfi`, and finally the `lin` elements.
+ The dictionary parameters **must** be in the same order as the factors.
+
+ >>> model = partial_rsm_names(entropy_model_order)
+ >>> Y2X = model2Y2X(model, factors)
+
+ Next, create the dependencies from the model
+
+ >>> dep = order_dependencies(model, factors)
+
+ Finally, we can fit SAMS using the exact entropy formula
+
+ >>> regr = SamsRegressor(
+ >>>     factors, Y2X,
+ >>>     mode='weak', dependencies=dep,
+ >>>     forced_model=np.array([0], np.int_),
+ >>>     entropy_model_order=entropy_model_order
+ >>> )
+