Example scenarios
=================

.. _a_drop_p_value:

Dropping based on p-values
--------------------------

@@ -66,6 +68,7 @@ we wish to investigate.
>>> Y2X = model2Y2X(model, factors)

Then we need to decide on the model constraints. There are three types of models:

* No heredity: This means any term can occur in the model, without any restrictions.
* Weak heredity: This means that if a term such as :math:`A \times B` occurs in the model,
  either :math:`A`, :math:`B`, or both must also occur in the model. Similar for
@@ -267,3 +270,145 @@ indicate the random effect groups with a color.
   :width: 100%
   :alt: The residual diagnostics
   :align: center


Simulated Annealing Model Selection (SAMS)
------------------------------------------

While model selection is often performed based on p-values, or on information
criteria such as the AICc or BIC, SAMS improves on most of these approaches. For more
extensive information on the algorithm, see :ref:`a_cust_sams`.

In this example, we have six factors, A through F, and wish to detect
the weak heredity model :math:`A + C + A \times B`. The full Python
script is at
|link-qc-pre|\ |version|\ |link-qc-mid0-sams|\ sams_generic.py\ |link-qc-mid1|\ sams_generic.py\ |link-qc-post|.

First, the imports:

>>> import numpy as np
>>> import pandas as pd
>>>
>>> from pyoptex.utils import Factor
>>> from pyoptex.utils.model import model2Y2X, order_dependencies, partial_rsm_names
>>> from pyoptex.analysis import SamsRegressor
>>> from pyoptex.analysis.utils.plot import plot_res_diagnostics

Next, we define the factors and simulate some data.

>>> # Define the factors
>>> factors = [
>>>     Factor('A'), Factor('B'), Factor('C'),
>>>     Factor('D'), Factor('E'), Factor('F'),
>>> ]
>>>
>>> # The number of random observations
>>> N = 200
>>>
>>> # Define the data
>>> data = pd.DataFrame(np.random.rand(N, len(factors)) * 2 - 1, columns=[str(f.name) for f in factors])
>>> data['Y'] = 2*data['A'] + 3*data['C'] - 4*data['A']*data['B'] + 5\
>>>     + np.random.normal(0, 1, N)

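Because the data are simulated, your numbers will differ from the outputs shown
below. To make a run reproducible, you can seed NumPy's random generator first
(a standard NumPy call, not part of the example above):

>>> np.random.seed(42)
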
Then, as in any analysis, we define the Y2X function, which is a full
response surface model, and the corresponding heredity dependencies.

>>> # Create the model
>>> model_order = {str(f.name): 'quad' for f in factors}
>>> model = partial_rsm_names(model_order)
>>> Y2X = model2Y2X(model, factors)
>>>
>>> # Define the dependencies
>>> dependencies = order_dependencies(model, factors)

Finally, we fit the SAMS model:

>>> regr = SamsRegressor(
>>>     factors, Y2X,
>>>     dependencies=dependencies, mode='weak',
>>>     # keep the first term (here the intercept) in every sampled model
>>>     forced_model=np.array([0], np.int_),
>>>     model_size=6, nb_models=5000, skipn=1000,
>>> )
>>> regr.fit(data.drop(columns='Y'), data['Y'])

.. note::
    By specifying the model_order parameter in the SamsRegressor,
    we can use the exact entropy calculations. For more information,
    see :ref:`a_cust_sams_entropy`. The full Python script is at
    |link-qc-pre|\ |version|\ |link-qc-mid0-sams|\ sams_partial_rsm.py\ |link-qc-mid1|\ sams_partial_rsm.py\ |link-qc-post|.

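    As a minimal sketch, this reuses the ``model_order`` dictionary defined
    earlier; only the extra keyword differs from the fit above:

    >>> regr = SamsRegressor(
    >>>     factors, Y2X,
    >>>     dependencies=dependencies, mode='weak',
    >>>     forced_model=np.array([0], np.int_),
    >>>     model_size=6, nb_models=5000, skipn=1000,
    >>>     model_order=model_order,
    >>> )
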
We can now analyze the generated models. To manually inspect and extract a model, use the
:py:meth:`plot_selection <pyoptex.analysis.estimators.sams.estimator.SamsRegressor.plot_selection>`
method:

>>> regr.plot_selection().show()

.. figure:: /assets/img/raster_plot.png
   :width: 100%
   :alt: The raster plot of the SAMS algorithm.
   :align: center

:py:class:`SamsRegressor <pyoptex.analysis.estimators.sams.estimator.SamsRegressor>`
is a :py:class:`MultiRegressionMixin <pyoptex.analysis.mixins.fit_mixin.MultiRegressionMixin>`,
meaning it finds multiple well-fitting models and orders them. By default, the best
model can be analyzed as before:

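The complete, entropy-ordered collection is available after fitting through the
``models_`` attribute (documented in the note at the end of this example). A quick
sketch of accessing it:

>>> models = regr.models_  # all retained models, highest entropy first
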
>>> # Print the summary
>>> print(regr.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.858
Model:                            OLS   Adj. R-squared:                  0.856
Method:                 Least Squares   F-statistic:                     394.5
Date:                Tue, 07 Jan 2025   Prob (F-statistic):           9.15e-83
Time:                        15:23:33   Log-Likelihood:                -88.642
No. Observations:                 200   AIC:                             185.3
Df Residuals:                     196   BIC:                             198.5
Df Model:                           3
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         -0.0043      0.027     -0.159      0.874      -0.058       0.049
x1             0.8045      0.048     16.689      0.000       0.709       0.900
x2             1.1409      0.045     25.356      0.000       1.052       1.230
x3            -1.7373      0.084    -20.769      0.000      -1.902      -1.572
==============================================================================
Omnibus:                        1.979   Durbin-Watson:                   2.166
Prob(Omnibus):                  0.372   Jarque-Bera (JB):                1.934
Skew:                          -0.238   Prob(JB):                        0.380
Kurtosis:                       2.932   Cond. No.                        3.17
==============================================================================

Alternatively, print the fitted formula in encoded form:

>>> # Print the formula in encoded form
>>> print(regr.model_formula(model=model))
-0.004 * cst + 0.805 * A + 1.141 * C + -1.737 * A * B

The selected terms, :math:`A`, :math:`C` and :math:`A \times B`, recover the
simulated model; they correspond to x1, x2 and x3 in the summary above.

Prediction works the same as before:

>>> data['pred'] = regr.predict(data.drop(columns='Y'))

The residual diagnostics of the highest-entropy model can then be plotted using

>>> plot_res_diagnostics(
>>>     data, y_true='Y', y_pred='pred',
>>>     textcols=[str(f.name) for f in factors],
>>> ).show()

.. note::
    If the best model is not the desired model, you can extract any other model
    in the list by accessing :py:attr:`models\_ <pyoptex.analysis.estimators.sams.estimator.SamsRegressor.models_>`
    after fitting. The models are ordered by highest entropy first.

    Once you have selected a model, you can refit it, similar to the
    :ref:`p-value example <a_drop_p_value>`. Instead of simply predicting based on ``regr``,
    we can transform the result to a strong-heredity model

    >>> # model2strong is assumed importable with the other model utilities
    >>> from pyoptex.utils.model import model2strong
    >>> from pyoptex.analysis import SimpleRegressor
    >>> terms_strong = model2strong(regr.terms_, dependencies)
    >>> model = model.iloc[terms_strong]
    >>> Y2X = model2Y2X(model, factors)

    And fit a simple model

    >>> regr_simple = SimpleRegressor(factors, Y2X).fit(data.drop(columns='Y'), data['Y'])
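
    Predicting with the refitted model then works like any other regressor in this
    document (a sketch, assuming the same ``predict`` signature as used above):

    >>> data['pred'] = regr_simple.predict(data.drop(columns='Y'))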