Variable selection in statistical models
When building statistical models, one of the most critical steps is variable selection: choosing which predictors or features to include in the model. The goal is a model that is both accurate and interpretable, avoiding overfitting and underfitting. Below, I explore issues related to variable selection, the common methods used, and best practices for robust model performance.
The following content is mainly based on the work by Sauerbrei et al. (1), which I am still learning; it should therefore not be taken as a definitive guide to variable selection.
Types of Models
Variable selection should start by defining what kind of model one is developing. Shmueli (2) distinguishes three kinds of statistical models: descriptive, explanatory, and predictive. The first sentence of the abstract reads, “Statistical modeling is a powerful tool for developing and testing theories by way of causal explanation, prediction, and description.”
- Descriptive models summarize patterns in the data without causal claims
- Explanatory models aim to test causal hypotheses
- Predictive models aim to forecast new observations accurately
The type of model determines which selection criteria to use. Akaike’s information criterion (AIC) is in principle preferable for predictive models, whereas Schwarz’s Bayesian information criterion (BIC) is more appropriate for descriptive models.
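To make the two criteria concrete, the sketch below computes AIC and BIC by hand for two nested Gaussian linear models. The data, seed, and `aic_bic` helper are made up for illustration; they are not from Sauerbrei et al. (1).

```python
import numpy as np

def aic_bic(X, y):
    """AIC and BIC of a Gaussian linear model fitted by least squares."""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta) ** 2)
    # Maximised Gaussian log-likelihood with the ML variance estimate rss/n
    loglik = -0.5 * n * (np.log(2 * np.pi) + np.log(rss / n) + 1)
    p = k + 1  # +1 parameter for the error variance
    return 2 * p - 2 * loglik, p * np.log(n) - 2 * loglik

rng = np.random.default_rng(0)
n = 200
x1, x2 = rng.normal(size=(2, n))
y = 1.0 + 2.0 * x1 + rng.normal(size=n)  # x2 is pure noise

X_small = np.column_stack([np.ones(n), x1])
X_full = np.column_stack([np.ones(n), x1, x2])
print("small:", aic_bic(X_small, y))
print("full: ", aic_bic(X_full, y))
```

Note that BIC penalises each extra parameter more heavily than AIC whenever \(\ln n > 2\) (here \(\ln 200 \approx 5.3\) versus 2), which is why it tends to favour the sparser, more descriptive model.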
Selection Strategies
Once candidate variables are listed (using prior knowledge and, ideally, a directed acyclic graph), common algorithmic strategies include:
- Forward selection (FS): starts with no variables and adds one at a time based on model performance
- Backward elimination (BE): starts with all variables and removes the least significant at each step; preferred because it starts with a plausible full model
- Stepwise: combines FS with additional BE
- Best subset selection: exhaustive search over all subsets
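To make the BE idea concrete, here is a minimal numpy-only sketch that, at each step, removes the variable whose deletion most improves AIC and stops when no removal helps. The simulated data and the `backward_eliminate` helper are my own illustration, not an implementation from the cited work.

```python
import numpy as np

def aic(X, y):
    """AIC of a Gaussian linear model fitted by least squares."""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta) ** 2)
    loglik = -0.5 * n * (np.log(2 * np.pi) + np.log(rss / n) + 1)
    return 2 * (k + 1) - 2 * loglik  # +1 parameter for the error variance

def backward_eliminate(X, y, names):
    """Start from the full model; greedily drop variables while AIC improves."""
    keep = list(range(X.shape[1]))
    best = aic(X, y)
    while len(keep) > 1:
        scores = {j: aic(X[:, [c for c in keep if c != j]], y) for j in keep}
        j = min(scores, key=scores.get)
        if scores[j] >= best:
            break  # no single removal improves AIC
        best = scores[j]
        keep.remove(j)
    return [names[c] for c in keep]

rng = np.random.default_rng(1)
n = 300
Z = rng.normal(size=(n, 3))
y = 1.0 + 3.0 * Z[:, 0] + rng.normal(size=n)  # only z0 carries signal
X = np.column_stack([np.ones(n), Z])
print(backward_eliminate(X, y, ["intercept", "z0", "z1", "z2"]))
```

Forward selection is the mirror image: start from the empty (or intercept-only) model and add the variable whose inclusion improves the criterion most.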
Additional approaches covered by Sauerbrei et al. (1) include:
- Change-in-estimate criterion
- Lasso — shrinks coefficients and performs selection simultaneously
- Elastic net — combines L1 (Lasso) and L2 (Ridge) penalties
- Boosting — component-wise boosting is a forward stagewise regression procedure applicable to generalised linear models
- Resampling-based methods: bootstrap inclusion frequencies (BIF) and stability selection
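The lasso and BIF ideas can be combined in a short sketch: a lasso fitted by iterative soft-thresholding (ISTA, a standard proximal-gradient scheme), refitted on bootstrap resamples to record how often each variable is selected. The data, penalty value, and helper names are illustrative assumptions, not prescriptions from the cited work.

```python
import numpy as np

def lasso_ista(X, y, lam, steps=500):
    """Lasso via iterative soft-thresholding (proximal gradient descent)."""
    n, p = X.shape
    beta = np.zeros(p)
    L = np.linalg.norm(X, 2) ** 2 / n  # Lipschitz constant of the gradient
    for _ in range(steps):
        grad = X.T @ (X @ beta - y) / n
        z = beta - grad / L
        beta = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)
    return beta

def inclusion_frequencies(X, y, lam, B=50, seed=0):
    """BIF: fraction of bootstrap resamples in which each variable is selected."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    counts = np.zeros(p)
    for _ in range(B):
        idx = rng.integers(0, n, size=n)
        counts += np.abs(lasso_ista(X[idx], y[idx], lam)) > 1e-8
    return counts / B

rng = np.random.default_rng(2)
n, p = 200, 5
X = rng.normal(size=(n, p))
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(size=n)  # x2..x4 are noise
freq = inclusion_frequencies(X, y, lam=0.1)
print(np.round(freq, 2))
```

Variables with inclusion frequencies near 1 are stable choices; noise variables typically show low, unstable frequencies.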
Filter, Wrapper, and Embedded Methods
A broader taxonomy divides methods into three categories:
Filter methods apply statistical measures to score each feature independently of the model:
- Correlation coefficient (linear relationships)
- Chi-square test (categorical variables)
- ANOVA F-value (continuous variables)
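As a toy illustration of the filter idea, the snippet below scores each feature by its absolute Pearson correlation with the outcome, independently of any model (simulated data; the chi-square and F-value variants follow the same pattern with different score functions):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500
X = rng.normal(size=(n, 4))
y = 1.5 * X[:, 2] + rng.normal(size=n)  # only feature 2 is informative

# Filter step: score each feature by |Pearson correlation| with y
scores = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
ranking = np.argsort(scores)[::-1]  # best-scoring feature first
print(ranking)
```

The informative feature should dominate the ranking; a filter would then keep the top-k features before any model is fitted.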
Wrapper methods evaluate feature subsets based on actual model performance, including forward selection, backward elimination, and recursive feature elimination (RFE).
Embedded methods perform selection during model training:
- Lasso (L1): penalty on the absolute value of coefficients; can shrink some exactly to zero
- Ridge (L2): penalty on the square of coefficients; reduces the impact of less important variables without performing full selection
- Tree-based methods (Random Forests, Gradient Boosting): inherently rank variables by their contribution to reducing impurity
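The contrast between ridge and ordinary least squares is easy to see from the closed-form ridge estimate \(\hat\beta = (X^\top X + \lambda I)^{-1} X^\top y\): the penalty shrinks coefficients toward zero without setting any exactly to zero. A minimal sketch, with simulated data and an arbitrary penalty chosen for illustration:

```python
import numpy as np

def ridge(X, y, lam):
    """Closed-form ridge estimate: (X'X + lam*I)^{-1} X'y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(4)
n, p = 100, 3
X = rng.normal(size=(n, p))
y = X @ np.array([2.0, 0.5, 0.0]) + rng.normal(size=n)

ols = ridge(X, y, 0.0)       # lam = 0 recovers ordinary least squares
shrunk = ridge(X, y, 50.0)   # a large penalty pulls coefficients toward zero
print(np.round(ols, 2), np.round(shrunk, 2))
```

The lasso replaces the squared penalty with an absolute-value one, which is what allows some coefficients to hit exactly zero and thus perform selection.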
For generalised linear mixed models, Bolker et al. (3) provide a practical introduction to model selection and inference.
Post-Selection Inference
Since the main aim of descriptive models is the interpretation of estimated regression coefficients, point estimates should be accompanied by confidence intervals and sometimes also by \(p\) values. Post-selection inference is challenging because the same data used for selection are also used for estimation, which can bias confidence intervals.
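One simple (if statistically inefficient) remedy is sample splitting: select variables on one half of the data and estimate coefficients and confidence intervals on the other half, so the estimation step is not biased by the selection step. The sketch below uses a toy correlation-threshold selection rule of my own; the data, threshold, and normal-approximation intervals are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 400
X = rng.normal(size=(n, 6))
y = 1.0 * X[:, 0] + rng.normal(size=n)  # only x0 carries signal

# Split: select variables on the first half, estimate on the second half
half = n // 2
Xs, ys = X[:half], y[:half]
Xe, ye = X[half:], y[half:]

# Toy selection rule (illustrative): keep features with |corr| above a threshold
corr = np.abs([np.corrcoef(Xs[:, j], ys)[0, 1] for j in range(X.shape[1])])
selected = np.where(corr > 0.15)[0]

# Estimation half: OLS fit with normal-approximation 95% intervals,
# valid because this half played no role in the selection
Xsel = Xe[:, selected]
beta, *_ = np.linalg.lstsq(Xsel, ye, rcond=None)
resid = ye - Xsel @ beta
sigma2 = resid @ resid / (len(ye) - len(selected))
se = np.sqrt(sigma2 * np.diag(np.linalg.inv(Xsel.T @ Xsel)))
for j, b, s in zip(selected, beta, se):
    print(f"x{j}: {b:.2f} +/- {1.96 * s:.2f}")
```

Splitting halves the sample available to each step; more efficient selection-aware procedures exist, but the split makes the core problem (same data used twice) easy to see.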
Generalised additive models (GAMs) offer a flexible extension of linear models; see Sauerbrei et al. (1) for a state-of-the-art overview.
Best Practices
- Understand your data: Perform exploratory data analysis (EDA) before selection
- Use domain knowledge: Leverage subject-matter expertise to guide candidate variables
- Use prior information and DAGs: Drawing a directed acyclic graph (DAG) helps identify confounders and mediators
- Cross-validation: Always validate model performance on held-out data
- Combine methods: Filter methods can narrow the field; wrapper or embedded methods can fine-tune
- Re-evaluate regularly: As new data arrive or the problem changes, revisit the selection
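The cross-validation point can be sketched in a few lines: compare candidate variable sets by their mean held-out squared error rather than their in-sample fit. The K-fold helper and simulated data below are my own minimal illustration.

```python
import numpy as np

def kfold_mse(X, y, cols, k=5, seed=0):
    """Mean held-out squared error of an OLS fit using the given columns."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, k)
    errs = []
    for f in folds:
        train = np.setdiff1d(idx, f)  # everything not in the held-out fold
        beta, *_ = np.linalg.lstsq(X[np.ix_(train, cols)], y[train], rcond=None)
        pred = X[np.ix_(f, cols)] @ beta
        errs.append(np.mean((y[f] - pred) ** 2))
    return float(np.mean(errs))

rng = np.random.default_rng(6)
n = 300
X = rng.normal(size=(n, 4))
y = 2.0 * X[:, 0] + rng.normal(size=n)

print(kfold_mse(X, y, [0]))        # informative feature only
print(kfold_mse(X, y, [1, 2, 3]))  # noise features only
```

Held-out error rewards the variable set that generalises, which is exactly what in-sample criteria alone cannot guarantee.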