Commit 22f0cb7

Mathias Born committed:
Added more analysis documentation + few bugfixes in the code + entropy samplers
1 parent 20a63dc commit 22f0cb7

6 files changed: +457, -226 lines


docs/source/_docs/analysis/customization.rst

Lines changed: 213 additions & 0 deletions

@@ -72,3 +72,216 @@ only three attributes must be set:

* selection_metrics\_ : The values of the selection metric as a 1-D numpy floating point array. A
  higher selection metric indicates a better model.
* metric_name\_ : The name of the selection metric as a string. Used for interpretation.

.. _a_cust_sams:

Simulated annealing model selection (SAMS)
------------------------------------------

Simulated annealing model selection, or SAMS, was devised by
`Wolters and Bingham (2012) <https://www.tandfonline.com/doi/abs/10.1198/TECH.2011.08157>`_.
It is a model selection algorithm which, instead of relying on statistical
significance as is most commonly done, simulates many models and looks at what the
well-fitting models have in common. The algorithm works in three stages:
86+
87+
* The simulation stage: here the algorithm simulates many models of a fixed size using simulated annealing,
88+
and sorts them by their
89+
:math:`R^2`. Commonly it simulates 10000 or 20000 models, however it depends on the
90+
problem at hand.
91+
* The reduction stage: here the algorithm takes the simulated models and looks what the
92+
most common 1-factor, 2-factor, 3-factor, etc. combinations are. In other words, it looks
93+
at which submodel of size k occurs most frequently in the good fitting models.
94+
* The selection stage: here the algorithm takes the most occuring submodels of each size and
95+
compares them to determine an ordering. The ordering is based on the entropy which is explained
96+
later.
97+

As you may notice, the algorithm does not output just one model. It outputs multiple models,
ordered by how much confidence the algorithm has in each. The last two stages
use the result of the first stage to automatically determine this ordering; however, the user
may also manually inspect a raster plot of the results.

In such a raster plot, each row is a model, each column is a potential term in the model, and the color indicates the
coefficient of the term. Any term not in the model has a coefficient of zero, which is
plotted in white. By looking at largely colored columns, we can determine which
submodels occur most often.
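
Such a raster plot can be approximated with a generic matplotlib sketch (illustrative only, not a
pyoptex plotting utility; the `coefficients` array below is a synthetic stand-in for the fitted
coefficients of the simulated models, one row per model and one column per term):

>>> import numpy as np
>>> import matplotlib.pyplot as plt
>>> rng = np.random.default_rng(42)
>>> # Synthetic stand-in: 50 models, 10 candidate terms, roughly 30% of the terms active
>>> coefficients = rng.normal(size=(50, 10)) * (rng.random((50, 10)) < 0.3)
>>> lim = np.abs(coefficients).max()
>>> plt.imshow(coefficients, cmap='bwr', vmin=-lim, vmax=lim, aspect='auto')
>>> plt.xlabel('Term')
>>> plt.ylabel('Model')
>>> plt.colorbar(label='Coefficient')
>>> plt.show()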
109+
.. note::
110+
In some events, multiple distinct models may perform equally well. Such a scenario is
111+
difficult to detect in the raster plot, and also by the entropy criterion. Luckily, we
112+
can also cluster the results in the raster plot making them more visible. The different
113+
terms in each model are binary encoded if the effect is present or not. On this representation,
114+
a kmeans clustering is run. Such a scenario
115+
is investigated in `Wolters and Bingham (2012) <https://www.tandfonline.com/doi/abs/10.1198/TECH.2011.08157>`_.
116+
117+
See :py:class:`SamsRegressor <pyoptex.analysis.estimators.sams.estimator.SamsRegressor>` for information
118+
on the parameters.
119+
120+
121+
122+
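
As a minimal usage sketch, assuming the scikit-learn style `fit(X, y)` interface of the estimators
in this package (`data` is a hypothetical pandas dataframe with one column per factor and a
response column `Y`; `factors`, `Y2X` and `dep` are constructed as in the example at the end of
this section):

>>> from pyoptex.analysis.estimators.sams.estimator import SamsRegressor
>>> regr = SamsRegressor(factors, Y2X, mode='weak', dependencies=dep)
>>> regr.fit(data.drop(columns='Y'), data['Y'])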

Entropy calculations
^^^^^^^^^^^^^^^^^^^^

The entropy criterion is the algorithm's most effective addition for automated model
selection. The entropy is computed as

.. math::

    e = f_{o} \log_2(f_{o} / f_{t}) + (1 - f_{o}) \log_2\left(\frac{1 - f_{o}}{1 - f_{t}}\right)

where :math:`f_{o}` is the observed frequency of the submodel in the simulation phase, and
:math:`f_{t}` is the theoretical frequency with which this submodel would occur when randomly
sampling hereditary models.
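
As a quick numeric illustration with made-up frequencies (not taken from the paper): a submodel
observed in 40% of the simulated models, which would only occur in 5% of randomly sampled
hereditary models, has entropy

>>> import numpy as np
>>> f_o, f_t = 0.4, 0.05
>>> e = f_o * np.log2(f_o / f_t) + (1 - f_o) * np.log2((1 - f_o) / (1 - f_t))
>>> round(e, 3)
0.802

A larger entropy means the submodel occurs far more often among the well-fitting models than
expected by chance, and it is therefore ranked higher in the selection stage.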

In `Wolters and Bingham (2012) <https://www.tandfonline.com/doi/abs/10.1198/TECH.2011.08157>`_,
the authors performed simulations on screening designs for different model selection algorithms.
The oracle method requires prior knowledge about the true model; each term is tested for significance.
The AICc method is based on Akaike's Information Criterion (corrected): a search through the
hereditary models is performed, from which the best model according to the AICc is selected.
This, together with the Bayesian Information Criterion (BIC), is commonly applied in practice.
The last method is the new SAMS method with entropy selection.

.. list-table:: Part of the simulation results from Wolters and Bingham (2012), in % of simulated cases
    :align: center
    :widths: 1 1 1 1 1
    :header-rows: 1

    * - Method
      - Correct
      - Underfitted
      - Overfitted
      - (Partially) Wrong
    * - Oracle
      - 62.8
      - 37.2
      - 0
      - 0
    * - AICc
      - 7.2
      - 0.7
      - 53.8
      - 38.3
    * - SAMS
      - 43.3
      - 16.2
      - 15.8
      - 24.7

The SAMS method with entropy selection significantly outperforms the other practical methods, with 43.3% of models
found to be correct. Even the oracle method, which has prior knowledge about the true
model, only found 62.8% of the models. AICc only found about 7.2% of the models, making it
not very suitable for this kind of scenario.

.. _samplers_sams:

There is one downside to the entropy criterion. Only in the specific case where the model
is a (partial) response surface model with weak heredity can :math:`f_{t}`
be computed exactly. To keep the algorithm generic, a fallback was implemented
to compute an approximation of the entropy using a model sampler. Three different
samplers are implemented:
:py:func:`sample_model_dep_onebyone <pyoptex.utils.model.sample_model_dep_onebyone>`,
:py:func:`sample_model_dep_mcmc <pyoptex.analysis.estimators.models.model.sample_model_dep_mcmc>`,
and :py:func:`sample_model_dep_random <pyoptex.utils.model.sample_model_dep_random>`.
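
The samplers follow the `func(dep, model_size, N, forced, mode)` calling convention documented for
the `sampler` argument of `entropies_approx` in `pyoptex.analysis.estimators.sams.entropy`. A small,
illustrative call (the argument values are made up; `dep` is the dependency matrix produced by
:py:func:`order_dependencies <pyoptex.utils.model.order_dependencies>`):

>>> from pyoptex.utils.model import sample_model_dep_onebyone
>>> # Draw 10000 random weak-heredity models of size 4
>>> # (assuming `None` is accepted when no terms are forced into the model)
>>> samples = sample_model_dep_onebyone(dep, 4, 10000, None, 'weak')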

For each of these samplers, we ran simulations similar to those of
`Wolters and Bingham (2012) <https://www.tandfonline.com/doi/abs/10.1198/TECH.2011.08157>`_.
We start from a PB12 (Plackett-Burman) design. Next, we generate a random hereditary model
by sampling 1 to 4 main effects, :math:`n_{main}`, and sequentially sampling :math:`4 - n_{main}`
interaction effects. Note that this is a weak heredity submodel of a partial response surface
design where each factor has linear effects and two-factor interactions.
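
To make this simulation protocol concrete, below is a rough sketch of drawing one such random
weak-heredity model with plain NumPy (illustrative only; this is not the exact code used for the
reported simulations):

>>> import numpy as np
>>> rng = np.random.default_rng()
>>> n_factors = 11  # a PB12 design has 11 two-level factors
>>> n_main = rng.integers(1, 5)  # 1 up to 4 main effects
>>> mains = rng.choice(n_factors, n_main, replace=False)
>>> interactions = []
>>> while len(interactions) < 4 - n_main:
>>>     a = rng.choice(mains)  # weak heredity: one parent is a sampled main effect
>>>     b = rng.integers(n_factors)  # the other parent may be any factor
>>>     if a != b and {a, b} not in map(set, interactions):
>>>         interactions.append((int(a), int(b)))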

The results are as follows.

.. list-table:: Simulations of different entropy approximations, in % of simulated cases
    :align: center
    :widths: 1 1 1 1 1
    :header-rows: 1

    * - Method
      - Correct
      - Underfitted
      - Overfitted
      - (Partially) Wrong
    * - Exact entropy
      - 43.7
      - 30.3
      - 10.5
      - 15.5
    * - One-by-one
      - 37.3
      - 12.6
      - 23.8
      - 26.3
    * - Markov chain Monte Carlo (MCMC)
      - 38.8
      - 12.3
      - 23.3
      - 25.6
    * - Random
      - 36.8
      - 10.1
      - 26.1
      - 27.0

The first row is the exact entropy method as used in
`Wolters and Bingham (2012) <https://www.tandfonline.com/doi/abs/10.1198/TECH.2011.08157>`_.
Note that all three samplers, even though they perform worse than the exact entropy in terms of the percentage
of correct models, still perform significantly better than AICc. When loosening the classification by
also counting models underfitted or overfitted by one term as correct, the exact entropy method
reaches 61.1% accuracy, the one-by-one sampler 59.1%, the MCMC sampler 59.3%, and the random sampler 56.5%.

By default, the one-by-one
sampler is used, as it performs almost as well as the MCMC sampler, but computes faster.

.. _warning_sams:

.. warning::
    The implementation of SAMS uses the samplers by default; however, the exact method
    may be used by specifying the `entropy_model_order` parameter of
    :py:class:`SamsRegressor <pyoptex.analysis.estimators.sams.estimator.SamsRegressor>`.
    This parameter should be used with care, as it comes with certain
    assumptions (which are asserted in many, but not all, scenarios).

    First, the heredity mode must be 'weak', otherwise the sampling method is still
    applied. Second, the model must be generated using
    :py:func:`partial_rsm_names <pyoptex.utils.model.partial_rsm_names>` followed by
    :py:func:`model2Y2X <pyoptex.utils.model.model2Y2X>`. Third, the factors must
    be ordered: first all factors which can have a quadratic effect, then
    all factors which cannot have quadratic effects but can have two-factor interactions,
    and finally all factors which can only have a main effect. Finally, the dependency
    matrix must be generated using
    :py:func:`order_dependencies <pyoptex.utils.model.order_dependencies>`.

As an example, create three factors

>>> factors = [
>>>     Factor('A'), Factor('B'), Factor('C')
>>> ]

Next, create the model orders. The order of the factor names in the dictionary
**must** be the same as those in the list of factors. They also must be
ordered `quad` - `tfi` - `lin`.

>>> entropy_model_order = {'A': 'quad', 'B': 'tfi', 'C': 'lin'}

Create the model using :py:func:`partial_rsm_names <pyoptex.utils.model.partial_rsm_names>`.
Note that the `quad` elements are first, then the `tfi`, and finally the `lin` elements.
The dictionary parameters **must** be in the same order as the factors.

>>> model = partial_rsm_names(entropy_model_order)
>>> Y2X = model2Y2X(model, factors)

Next, create the dependencies from the model

>>> dep = order_dependencies(model, factors)

Finally, we can fit SAMS using the exact entropy formula

>>> regr = SamsRegressor(
>>>     factors, Y2X,
>>>     mode='weak', dependencies=dep,
>>>     forced_model=np.array([0], np.int_),
>>>     entropy_model_order=entropy_model_order
>>> )

src/pyoptex/analysis/estimators/sams/entropy.py

Lines changed: 5 additions & 137 deletions

@@ -4,143 +4,11 @@

 import numpy as np

-from ....utils.model import sample_model_dep
+from ....utils.model import sample_model_dep_onebyone
 from ....utils.numba import numba_int2bool

-from .models.model import Model
-def sample_mcmc(dep, size, forced=None, mode=None, N=1, skip=10):
-    # Create the SAMS modeller
-    m = Model(np.zeros((0, len(dep))), np.zeros((0,)), mode=mode, forced=forced, dep=dep)
-
-    # Initialize a random model
-    model = np.zeros((size,), dtype=np.int_)
-    m.init(model)
-
-    # Intialize the samples
-    samples = np.zeros((N, size), dtype=np.int_)
-
-    # Warmup phase
-    for i in range(1000):
-        m.mutate(model)
-
-    # Main sampling loop
-    for i in range(N*skip):
-        # Mutate the model
-        m.mutate(model)
-
-        # Every skip, store the result
-        if i % skip == 0:
-            samples[int(i/skip)] = model
-
-    return samples
-
-def sample_random(dep, size, forced=None, mode=None, N=1):
-    assert mode == 'weak', 'Mode must be weak'
-
-    #########################
-    # Initialize number of dependencies
-    nb_dep = np.ma.masked_where(~dep, np.zeros_like(dep, dtype=np.int_)).harden_mask()
-
-    # At the true positions in these columns, set a 1
-    affected = ~np.any(dep, axis=1)
-    nb_dep[:, affected] = 1
-    affected = np.any(dep[:, affected], axis=1)
-
-    while np.any(affected):
-        # Alter the affected positions
-        nb_dep[:, affected] = np.min(nb_dep[affected], axis=1).compressed() + 1
-        affected = np.any(dep[:, affected], axis=1)
-
-    #########################
-
-    # Initialize the models
-    models = np.zeros((N, size), dtype=np.int_)
-    models[:, :forced.size] = forced
-
-    # Fix the forced model
-    if forced is not None and forced.size > 0:
-        # Convert submodel to binary array
-        affected = model[:forced.size]
-        submodelb = np.zeros(len(dep), dtype=np.int_)
-        submodelb[affected] = 1
-
-        # Update the model
-        nb_dep[:, affected] -= 1
-        affected = np.any(dep[:, affected], axis=1)
-        while np.any(affected):
-            # Alter the affected positions
-            nb_dep[:, affected] = np.min(nb_dep[affected], axis=1) - submodelb[affected] + 1
-            affected = np.any(dep[:, affected], axis=1)
-
-    # Sample all models
-    for model in models:
-        # Initialize i
-        i = forced.size
-        j = forced.size
-        nb_dep_ = nb_dep.copy()
-
-        # Loop until a full model
-        while i < size:
-
-            # Compute the minimal path for each term
-            min_path = np.min(nb_dep_, axis=1).filled(0)
-
-            # Sample the first
-            choices = np.ones(len(dep), dtype=np.bool_)
-            choices[min_path >= size - i] = False # Remove those with too many dependencies
-            choices[model[:i]] = False # Remove already in the model
-            choices = np.flatnonzero(choices)
-            model[i] = np.random.choice(choices)
-
-            # TODO: purely random sampling is a problem for true sampling
-
-            # Check if already hereditary
-            if min_path[model[i]] > 0:
-                # Update with dependencies
-                choices = np.copy(dep[model[i]])
-                choices[min_path >= size - i - 1] = False
-                choices[model[:i+1]] = False
-                choices = np.flatnonzero(choices)
-
-                # Check if there are any choices
-                while choices.size != 0:
-                    # Sample a new term
-                    i += 1
-                    model[i] = np.random.choice(choices)
-
-                    # Check for heredity
-                    if min_path[model[i]] <= 0:
-                        break
-
-                    # Determine new choices
-                    choices = np.copy(dep[model[i]])
-                    choices[min_path >= size - i - 1] = False
-                    choices[model[:i+1]] = False
-                    choices = np.flatnonzero(choices)
-
-            # Increase the model size
-            i += 1
-
-            # Convert submodel to binary array
-            affected = model[j:i]
-            submodelb = np.zeros(len(dep), dtype=np.int_)
-            submodelb[affected] = 1
-
-            # Update the model
-            nb_dep_[:, affected] -= 1
-            affected = np.any(dep[:, affected], axis=1)
-            while np.any(affected):
-                # Alter the affected positions
-                nb_dep_[:, affected] = np.min(nb_dep_[affected], axis=1) - submodelb[affected] + 1
-                affected = np.any(dep[:, affected], axis=1)
-
-            # Set j to i for next iteration
-            j = i
-
-    return models
-
 def entropies_approx(submodels, freqs, model_size, dep, mode,
-                     forced=None, N=10000, eps=1e-6):
+                     forced=None, N=10000, sampler=sample_model_dep_onebyone, eps=1e-6):
     """
     Compute the approximate entropy by sampling N random models
     and observing the frequency of each submodel.
@@ -177,6 +45,8 @@ def entropies_approx(submodels, freqs, model_size, dep, mode,
     N : int
         The number of random samples to draw to compute the
         theoretical frequency of a submodel.
+    sampler : func(dep, model_size, N, forced, mode)
+        The sampler to use when generating random hereditary models.
     eps : float
         A numerical stability parameter in computing the entropy.

@@ -186,9 +56,7 @@ def entropies_approx(submodels, freqs, model_size, dep, mode,
         An array of floats of the same length as the submodels.
     """
     # Generate random samples
-    # samples = sample_model_dep(dep, model_size, N, forced, mode)
-    # samples = sample_mcmc(dep, model_size, forced, mode, N, skip=10)
-    samples = sample_random(dep, model_size, forced, mode, N)
+    samples = sampler(dep, model_size, N, forced, mode)

     # Convert samples to a boolean array
     samples = numba_int2bool(samples, len(dep))