Skip to content

Commit b3360a8

Browse files
author
Mathias Born
committed
Added SAMS examples + completed docs on SAMS
1 parent 22f0cb7 commit b3360a8

File tree

7 files changed

+290
-14
lines changed

7 files changed

+290
-14
lines changed

docs/source/_docs/analysis/customization.rst

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -117,7 +117,7 @@ submodels occur most often.
117117
See :py:class:`SamsRegressor <pyoptex.analysis.estimators.sams.estimator.SamsRegressor>` for information
118118
on the parameters.
119119

120-
120+
.. _a_cust_sams_entropy:
121121

122122
Entropy calculations
123123
^^^^^^^^^^^^^^^^^^^^

docs/source/_docs/analysis/example_scenarios.rst

Lines changed: 145 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -23,6 +23,8 @@
2323
Example scenarios
2424
=================
2525

26+
.. _a_drop_p_value:
27+
2628
Dropping based on p-values
2729
--------------------------
2830

@@ -66,6 +68,7 @@ we wish to investigate.
6668
>>> Y2X = model2Y2X(model, factors)
6769

6870
Then we need to decide on the model constraints. There are three types of models:
71+
6972
* No heredity: This means any term can occur in the model, without any restrictions.
7073
* Weak heredity: This means that if a term such as :math:`A \times B` occurs in the model,
7174
either :math:`A`, :math:`B`, or both must also occur in the model. Similar for
@@ -267,3 +270,145 @@ indicate the random effect groups with a color.
267270
:width: 100%
268271
:alt: The residual diagnostics
269272
:align: center
273+
274+
275+
Simulated Annealing Model Selection (SAMS)
276+
------------------------------------------
277+
278+
While model selection is often performed based on p-values or metrics such
279+
as the AICc or BIC, SAMS improves on most of them. For more extensive
280+
information on the algorithm, see :ref:`a_cust_sams`.
281+
282+
In this example, we have six factors, A through F, and wish to detect
283+
the weak heredity model :math:`A + C + A \times B`. The full Python
284+
script is at
285+
|link-qc-pre|\ |version|\ |link-qc-mid0-sams|\ sams_generic.py\ |link-qc-mid1|\ sams_generic.py\ |link-qc-post|.
286+
287+
First, the imports
288+
289+
>>> import numpy as np
290+
>>> import pandas as pd
291+
>>>
292+
>>> from pyoptex.utils import Factor
293+
>>> from pyoptex.utils.model import model2Y2X, order_dependencies, partial_rsm_names
294+
>>> from pyoptex.analysis import SamsRegressor
295+
>>> from pyoptex.analysis.utils.plot import plot_res_diagnostics
296+
297+
Next, we define the factors and simulate some data.
298+
299+
>>> # Define the factors
300+
>>> factors = [
301+
>>> Factor('A'), Factor('B'), Factor('C'),
302+
>>> Factor('D'), Factor('E'), Factor('F'),
303+
>>> ]
304+
>>>
305+
>>> # The number of random observations
306+
>>> N = 200
307+
>>>
308+
>>> # Define the data
309+
>>> data = pd.DataFrame(np.random.rand(N, len(factors)) * 2 - 1, columns=[str(f.name) for f in factors])
310+
>>> data['Y'] = 2*data['A'] + 3*data['C'] - 4*data['A']*data['B'] + 5\
311+
>>> + np.random.normal(0, 1, N)
312+
313+
Then, as in any analysis, we define the Y2X function, which is a full
314+
response surface model, and the corresponding heredity dependencies.
315+
316+
>>> # Create the model
317+
>>> model_order = {str(f.name): 'quad' for f in factors}
318+
>>> model = partial_rsm_names(model_order)
319+
>>> Y2X = model2Y2X(model, factors)
320+
>>>
321+
>>> # Define the dependencies
322+
>>> dependencies = order_dependencies(model, factors)
323+
324+
Finally, we fit the SAMS model
325+
326+
>>> regr = SamsRegressor(
327+
>>> factors, Y2X,
328+
>>> dependencies=dependencies, mode='weak',
329+
>>> forced_model=np.array([0], np.int_),
330+
>>> model_size=6, nb_models=5000, skipn=1000,
331+
>>> )
332+
>>> regr.fit(data.drop(columns='Y'), data['Y'])
333+
334+
.. note::
335+
By specifying the model_order parameter in the SamsRegressor,
336+
we can use the exact entropy caluclations. For more information,
337+
see :ref:`a_cust_sams_entropy`. The full Python script is at
338+
|link-qc-pre|\ |version|\ |link-qc-mid0-sams|\ sams_partial_rsm.py\ |link-qc-mid1|\ sams_partial_rsm.py\ |link-qc-post|.
339+
340+
Finally, we can analyze the generated models. To manually extract a model, use
341+
the :py:func:`plot_selection <pyoptex.analysis.estimators.sams.estimator.SamsRegressor.plot_selection>`
342+
343+
>>> regr.plot_selection().show()
344+
345+
.. figure:: /assets/img/raster_plot.png
346+
:width: 100%
347+
:alt: The raster plot of the SAMS algorithm.
348+
:align: center
349+
350+
:py:class:`SamsRegressor <pyoptex.analysis.estimators.sams.estimator.SamsRegressor>`
351+
is a :py:class:`MultiRegressionMixin <pyoptex.analysis.mixins.fit_mixin.MultiRegressionMixin>`,
352+
meaning it finds multiple good-fitting models and orders them. By default, the best
353+
can be analyzed as before
354+
355+
>>> # Print the summary
356+
>>> print(regr.summary())
357+
OLS Regression Results
358+
==============================================================================
359+
Dep. Variable: y R-squared: 0.858
360+
Model: OLS Adj. R-squared: 0.856
361+
Method: Least Squares F-statistic: 394.5
362+
Date: Tue, 07 Jan 2025 Prob (F-statistic): 9.15e-83
363+
Time: 15:23:33 Log-Likelihood: -88.642
364+
No. Observations: 200 AIC: 185.3
365+
Df Residuals: 196 BIC: 198.5
366+
Df Model: 3
367+
Covariance Type: nonrobust
368+
==============================================================================
369+
coef std err t P>|t| [0.025 0.975]
370+
------------------------------------------------------------------------------
371+
const -0.0043 0.027 -0.159 0.874 -0.058 0.049
372+
x1 0.8045 0.048 16.689 0.000 0.709 0.900
373+
x2 1.1409 0.045 25.356 0.000 1.052 1.230
374+
x3 -1.7373 0.084 -20.769 0.000 -1.902 -1.572
375+
==============================================================================
376+
Omnibus: 1.979 Durbin-Watson: 2.166
377+
Prob(Omnibus): 0.372 Jarque-Bera (JB): 1.934
378+
Skew: -0.238 Prob(JB): 0.380
379+
Kurtosis: 2.932 Cond. No. 3.17
380+
==============================================================================
381+
382+
Or
383+
384+
>>> # Print the formula in encoded form
385+
>>> print(regr.model_formula(model=model))
386+
-0.004 * cst + 0.805 * A + 1.141 * C + -1.737 * A * B
387+
388+
Prediction is still the same.
389+
390+
>>> data['pred'] = regr.predict(data.drop(columns='Y'))
391+
392+
And the residual plot of the highest entropy model can be found using
393+
394+
>>> plot_res_diagnostics(
395+
>>> data, y_true='Y', y_pred='pred',
396+
>>> textcols=[str(f.name) for f in factors],
397+
>>> ).show()
398+
399+
.. note::
400+
If the best model is not the desired model, you can extract any other model
401+
in the list by accessing :py:attr:`models\_ <pyoptex.analysis.estimators.sams.estimator.SamsRegressor.models\_>`
402+
after fitting. These are ordered by highest entropy first.
403+
404+
Once you selected a model, you can refit it, similar to the
405+
:ref:`p-value example <a_drop_p_value>`. Instead of simply predicting based on the `regr`, we can transform
406+
the result to a strong model
407+
408+
>>> terms_strong = model2strong(regr.terms_, dependencies)
409+
>>> model = model.iloc[terms_strong]
410+
>>> Y2X = model2Y2X(model, factors)
411+
412+
And fit a simple model
413+
414+
>>> regr_simple = SimpleRegressor(factors, Y2X).fit(data.drop(columns='Y'), data['Y'])
78.1 KB
Loading
Lines changed: 66 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,66 @@
1+
2+
#!/usr/bin/env python3
3+
4+
import numpy as np
5+
import pandas as pd
6+
7+
from pyoptex._seed import set_seed
8+
from pyoptex.utils import Factor
9+
from pyoptex.utils.model import model2Y2X, order_dependencies, partial_rsm_names
10+
from pyoptex.analysis import SamsRegressor
11+
from pyoptex.analysis.utils.plot import plot_res_diagnostics
12+
13+
# Seed randomization
14+
set_seed(42)
15+
16+
# Define the factors
17+
factors = [
18+
Factor('A'), Factor('B'), Factor('C'),
19+
Factor('D'), Factor('E'), Factor('F'),
20+
]
21+
22+
# The number of random observations
23+
N = 200
24+
25+
# Define the data
26+
data = pd.DataFrame(np.random.rand(N, len(factors)) * 2 - 1, columns=[str(f.name) for f in factors])
27+
data['Y'] = 2*data['A'] + 3*data['C'] - 4*data['A']*data['B'] + 5\
28+
+ np.random.normal(0, 1, N)
29+
30+
# Define the model orders
31+
model_order = {str(f.name): 'quad' for f in factors}
32+
33+
# Create the model
34+
model = partial_rsm_names(model_order)
35+
Y2X = model2Y2X(model, factors)
36+
37+
# Define the dependencies
38+
dependencies = order_dependencies(model, factors)
39+
40+
# Create the regressor
41+
regr = SamsRegressor(
42+
factors, Y2X,
43+
dependencies=dependencies, mode='weak',
44+
forced_model=np.array([0], np.int_),
45+
model_size=6, nb_models=5000, skipn=1000,
46+
)
47+
regr.fit(data.drop(columns='Y'), data['Y'])
48+
49+
# Plot the raster plot
50+
regr.plot_selection().show()
51+
52+
# Print the summary
53+
print(regr.summary())
54+
55+
# Print the formula in encoded form
56+
print(regr.model_formula(model=model))
57+
58+
# Predict
59+
data['pred'] = regr.predict(data.drop(columns='Y'))
60+
61+
# Plot the residual diagnostics
62+
plot_res_diagnostics(
63+
data, y_true='Y', y_pred='pred',
64+
textcols=[str(f.name) for f in factors],
65+
).show()
66+
Lines changed: 67 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,67 @@
1+
2+
#!/usr/bin/env python3
3+
4+
import numpy as np
5+
import pandas as pd
6+
7+
from pyoptex._seed import set_seed
8+
from pyoptex.utils import Factor
9+
from pyoptex.utils.model import model2Y2X, order_dependencies, partial_rsm_names
10+
from pyoptex.analysis import SamsRegressor
11+
from pyoptex.analysis.utils.plot import plot_res_diagnostics
12+
13+
# Seed randomization
14+
set_seed(42)
15+
16+
# Define the factors
17+
factors = [
18+
Factor('A'), Factor('B'), Factor('C'),
19+
Factor('D'), Factor('E'), Factor('F'),
20+
]
21+
22+
# The number of random observations
23+
N = 200
24+
25+
# Define the data
26+
data = pd.DataFrame(np.random.rand(N, len(factors)) * 2 - 1, columns=[str(f.name) for f in factors])
27+
data['Y'] = 2*data['A'] + 3*data['C'] - 4*data['A']*data['B'] + 5\
28+
+ np.random.normal(0, 1, N)
29+
30+
# Define the model orders
31+
model_order = {str(f.name): 'quad' for f in factors}
32+
33+
# Create the model
34+
model = partial_rsm_names(model_order)
35+
Y2X = model2Y2X(model, factors)
36+
37+
# Define the dependencies
38+
dependencies = order_dependencies(model, factors)
39+
40+
# Create the regressor
41+
regr = SamsRegressor(
42+
factors, Y2X,
43+
dependencies=dependencies, mode='weak',
44+
forced_model=np.array([0], np.int_),
45+
model_size=6, nb_models=5000, skipn=1000,
46+
entropy_model_order=model_order
47+
)
48+
regr.fit(data.drop(columns='Y'), data['Y'])
49+
50+
# Plot the raster plot
51+
regr.plot_selection().show()
52+
53+
# Print the summary
54+
print(regr.summary())
55+
56+
# Print the formula in encoded form
57+
print(regr.model_formula(model=model))
58+
59+
# Predict
60+
data['pred'] = regr.predict(data.drop(columns='Y'))
61+
62+
# Plot the residual diagnostics
63+
plot_res_diagnostics(
64+
data, y_true='Y', y_pred='pred',
65+
textcols=[str(f.name) for f in factors],
66+
).show()
67+

src/pyoptex/analysis/estimators/sams/estimator.py

Lines changed: 9 additions & 11 deletions
Original file line numberDiff line numberDiff line change
@@ -37,8 +37,6 @@ class SamsRegressor(MultiRegressionMixin):
3737
.. note::
3838
A more detailed guide on SAMS can be found at :ref:`a_cust_sams`.
3939
40-
TODO: entropy calculations + examples + index
41-
4240
Attributes
4341
----------
4442
dependencies : np.array(2d)
@@ -82,7 +80,7 @@ class SamsRegressor(MultiRegressionMixin):
8280
calculations.
8381
nterms_bnb : None or int or iterable(int)
8482
The fixed sizes of submodels to apply the branch-and-bound algorithm
85-
on. If None, every size from one to the `model_size` - 2 is tested
83+
on. If None, every size from one to the `model_size` - 2 (inclusive) is tested
8684
as recommended by the original paper. If an int, every size from
8785
one until the specified number is tested. If an iterable, only the
8886
values from the iterable are tested.
@@ -95,13 +93,13 @@ class SamsRegressor(MultiRegressionMixin):
9593
to three minutes.
9694
entropy_sampler : func(dep, model_size, N, forced, mode)
9795
The sampler to use when generating random hereditary models. See the
98-
:ref:`samplers_sams` for an indication on which sampler to use.
96+
documentation on customizing SAMS for an indication on which sampler to use.
9997
entropy_sampling_N : int
10098
The number of random samples to draw using the sampler to compute
10199
the theoretical frequencies of the submodels.
102100
entropy_model_order : dict(str: ('lin' or 'tfi' or 'quad'))
103101
The order of the terms in the model. Please read the warning in
104-
:ref:`warning_sams`.
102+
the documentation on customizing SAMS.
105103
tqdm : bool
106104
Whether to use tqdm to track the progress
107105
@@ -193,7 +191,7 @@ def __init__(self, factors=(), Y2X=identityY2X, random_effects=(),
193191
calculations.
194192
nterms_bnb : None or int or iterable(int)
195193
The fixed sizes of submodels to apply the branch-and-bound algorithm
196-
on. If None, every size from one to the `model_size` - 2 is tested
194+
on. If None, every size from one to the `model_size` - 2 (inclusive) is tested
197195
as recommended by the original paper. If an int, every size from
198196
one until the specified number is tested. If an iterable, only the
199197
values from the iterable are tested.
@@ -206,13 +204,13 @@ def __init__(self, factors=(), Y2X=identityY2X, random_effects=(),
206204
to three minutes.
207205
entropy_sampler : func(dep, model_size, N, forced, mode)
208206
The sampler to use when generating random hereditary models. See the
209-
:ref:`samplers_sams` for an indication on which sampler to use.
207+
documentation on customizing SAMS for an indication on which sampler to use.
210208
entropy_sampling_N : int
211209
The number of random samples to draw using the sampler to compute
212210
the theoretical frequencies of the submodels.
213211
entropy_model_order : dict(str: ('lin' or 'tfi' or 'quad'))
214212
The order of the terms in the model. Please read the warning in
215-
:ref:`warning_sams`.
213+
the documentation on customizing SAMS.
216214
tqdm : bool
217215
Whether to use tqdm to track the progress
218216
"""
@@ -252,7 +250,7 @@ def _regr_params(self, X, y):
252250
self._model_size = self.model_size if self.model_size is not None else int(len(X) / 3)
253251

254252
# Set some default values
255-
self._nterms_bnb = self._model_size - 2 if self.nterms_bnb is None else self.nterms_bnb
253+
self._nterms_bnb = self._model_size - 1 if self.nterms_bnb is None else self.nterms_bnb
256254
self._nterms_bnb = range(len(self.forced_model) + 1, self._nterms_bnb) \
257255
if isinstance(self._nterms_bnb, int) else self._nterms_bnb
258256
self._est_ratios = np.ones(len(self._re)) if len(self._re) > 0 and self.est_ratios is None else self.est_ratios
@@ -311,8 +309,8 @@ def _validate_fit(self, X, y):
311309
model_types_ord = {'quad': 2, 'tfi': 1, 'lin': 0}
312310

313311
# Reorder the model types based on the factors
314-
assert all(str(f.name) in self.entropy_model_order.keys() for f in factors), 'All factors must have an entropy model order specified'
315-
entropy_model_order = {str(f.name): self.entropy_model_order[str(f.name)] for f in factors}
312+
assert all(str(f.name) in self.entropy_model_order.keys() for f in self._factors), 'All factors must have an entropy model order specified'
313+
entropy_model_order = {str(f.name): self.entropy_model_order[str(f.name)] for f in self._factors}
316314

317315
# Assert the ordering
318316
assert np.all(np.diff([model_types_ord[typ] for typ in entropy_model_order.values()]) <= 0), 'Model types must be ordered quad > tfi > lin for entropy calculations'

0 commit comments

Comments
 (0)