Variable selection in statistical models
When building statistical models, one of the most critical steps is variable selection: choosing which predictors or features to include in the model. The goal is a model that is both accurate and interpretable, avoiding overfitting and underfitting. Below, I explore issues related to variable selection, the common methods used, and best practices for robust model performance.
The following content is mainly based on the work by Sauerbrei et al. (1), which I am still learning; it should therefore not be taken as a definitive guide to variable selection.
Types of Models
Variable selection should start by defining what kind of model one is developing. Shmueli (2) distinguishes three kinds of statistical models: descriptive, explanatory, and predictive. The first sentence of the abstract reads, “Statistical modeling is a powerful tool for developing and testing theories by way of causal explanation, prediction, and description.”
- Descriptive models summarize patterns in the data without causal claims
- Explanatory models aim to test causal hypotheses
- Predictive models aim to forecast new observations accurately
The type of model determines which selection criteria to use. Akaike’s information criterion (AIC) is in principle preferable for predictive models, whereas Schwarz’s Bayesian information criterion (BIC) is more appropriate for descriptive models.
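To make the two criteria concrete, the sketch below computes AIC and BIC by hand for two nested Gaussian linear models. The data, seed, and `aic_bic` helper are made up for illustration; they are not from Sauerbrei et al. (1).

```python
import numpy as np

def aic_bic(X, y):
    """AIC and BIC of a Gaussian linear model fitted by least squares."""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta) ** 2)
    # Maximised Gaussian log-likelihood with the ML variance estimate rss/n
    loglik = -0.5 * n * (np.log(2 * np.pi) + np.log(rss / n) + 1)
    p = k + 1  # +1 parameter for the error variance
    return 2 * p - 2 * loglik, p * np.log(n) - 2 * loglik

rng = np.random.default_rng(0)
n = 200
x1, x2 = rng.normal(size=(2, n))
y = 1.0 + 2.0 * x1 + rng.normal(size=n)  # x2 is pure noise

X_small = np.column_stack([np.ones(n), x1])
X_full = np.column_stack([np.ones(n), x1, x2])
print("small:", aic_bic(X_small, y))
print("full: ", aic_bic(X_full, y))
```

Note that BIC penalises each extra parameter more heavily than AIC whenever \(\ln n > 2\) (here \(\ln 200 \approx 5.3\) versus 2), which is why it tends to favour the sparser, more descriptive model.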
Selection Strategies
Once candidate variables are listed (using prior knowledge and, ideally, a directed acyclic graph), common algorithmic strategies include:
- Forward selection (FS): starts with no variables and adds one at a time based on model performance
- Backward elimination (BE): starts with all variables and removes the least significant at each step; preferred because it starts with a plausible full model
- Stepwise: combines FS with additional BE
- Best subset selection: exhaustive search over all subsets
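To make the BE idea concrete, here is a minimal numpy-only sketch that, at each step, removes the variable whose deletion most improves AIC and stops when no removal helps. The simulated data and the `backward_eliminate` helper are my own illustration, not an implementation from the cited work.

```python
import numpy as np

def aic(X, y):
    """AIC of a Gaussian linear model fitted by least squares."""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta) ** 2)
    loglik = -0.5 * n * (np.log(2 * np.pi) + np.log(rss / n) + 1)
    return 2 * (k + 1) - 2 * loglik  # +1 parameter for the error variance

def backward_eliminate(X, y, names):
    """Start from the full model; greedily drop variables while AIC improves."""
    keep = list(range(X.shape[1]))
    best = aic(X, y)
    while len(keep) > 1:
        scores = {j: aic(X[:, [c for c in keep if c != j]], y) for j in keep}
        j = min(scores, key=scores.get)
        if scores[j] >= best:
            break  # no single removal improves AIC
        best = scores[j]
        keep.remove(j)
    return [names[c] for c in keep]

rng = np.random.default_rng(1)
n = 300
Z = rng.normal(size=(n, 3))
y = 1.0 + 3.0 * Z[:, 0] + rng.normal(size=n)  # only z0 carries signal
X = np.column_stack([np.ones(n), Z])
print(backward_eliminate(X, y, ["intercept", "z0", "z1", "z2"]))
```

Forward selection is the mirror image: start from the empty (or intercept-only) model and add the variable whose inclusion improves the criterion most.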
Additional approaches covered by Sauerbrei et al. (1) include:
- Change-in-estimate criterion
- Lasso — shrinks coefficients and performs selection simultaneously
- Elastic net — combines L1 (Lasso) and L2 (Ridge) penalties
- Boosting — component-wise boosting is a forward stagewise regression procedure applicable to generalised linear models
- Resampling-based methods: bootstrap inclusion frequencies (BIF) and stability selection
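The lasso and BIF ideas can be combined in a short sketch: a lasso fitted by iterative soft-thresholding (ISTA, a standard proximal-gradient scheme), refitted on bootstrap resamples to record how often each variable is selected. The data, penalty value, and helper names are illustrative assumptions, not prescriptions from the cited work.

```python
import numpy as np

def lasso_ista(X, y, lam, steps=500):
    """Lasso via iterative soft-thresholding (proximal gradient descent)."""
    n, p = X.shape
    beta = np.zeros(p)
    L = np.linalg.norm(X, 2) ** 2 / n  # Lipschitz constant of the gradient
    for _ in range(steps):
        grad = X.T @ (X @ beta - y) / n
        z = beta - grad / L
        beta = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)
    return beta

def inclusion_frequencies(X, y, lam, B=50, seed=0):
    """BIF: fraction of bootstrap resamples in which each variable is selected."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    counts = np.zeros(p)
    for _ in range(B):
        idx = rng.integers(0, n, size=n)
        counts += np.abs(lasso_ista(X[idx], y[idx], lam)) > 1e-8
    return counts / B

rng = np.random.default_rng(2)
n, p = 200, 5
X = rng.normal(size=(n, p))
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(size=n)  # x2..x4 are noise
freq = inclusion_frequencies(X, y, lam=0.1)
print(np.round(freq, 2))
```

Variables with inclusion frequencies near 1 are stable choices; noise variables typically show low, unstable frequencies.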
Filter, Wrapper, and Embedded Methods
A broader taxonomy divides methods into three categories:
Filter methods apply statistical measures to score each feature independently of the model:
- Correlation coefficient (linear relationships)
- Chi-square test (categorical variables)
- ANOVA F-value (continuous variables)
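As a toy illustration of the filter idea, the snippet below scores each feature by its absolute Pearson correlation with the outcome, independently of any model (simulated data; the chi-square and F-value variants follow the same pattern with different score functions):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500
X = rng.normal(size=(n, 4))
y = 1.5 * X[:, 2] + rng.normal(size=n)  # only feature 2 is informative

# Filter step: score each feature by |Pearson correlation| with y
scores = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
ranking = np.argsort(scores)[::-1]  # best-scoring feature first
print(ranking)
```

The informative feature should dominate the ranking; a filter would then keep the top-k features before any model is fitted.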
Wrapper methods evaluate feature subsets based on actual model performance, including forward selection, backward elimination, and recursive feature elimination (RFE).
Embedded methods perform selection during model training:
- Lasso (L1): penalty on the absolute value of coefficients; can shrink some exactly to zero
- Ridge (L2): penalty on the square of coefficients; reduces the impact of less important variables without performing full selection
- Tree-based methods (Random Forests, Gradient Boosting): inherently rank variables by their contribution to reducing impurity
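The contrast between ridge and ordinary least squares is easy to see from the closed-form ridge estimate \(\hat\beta = (X^\top X + \lambda I)^{-1} X^\top y\): the penalty shrinks coefficients toward zero without setting any exactly to zero. A minimal sketch, with simulated data and an arbitrary penalty chosen for illustration:

```python
import numpy as np

def ridge(X, y, lam):
    """Closed-form ridge estimate: (X'X + lam*I)^{-1} X'y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(4)
n, p = 100, 3
X = rng.normal(size=(n, p))
y = X @ np.array([2.0, 0.5, 0.0]) + rng.normal(size=n)

ols = ridge(X, y, 0.0)       # lam = 0 recovers ordinary least squares
shrunk = ridge(X, y, 50.0)   # a large penalty pulls coefficients toward zero
print(np.round(ols, 2), np.round(shrunk, 2))
```

The lasso replaces the squared penalty with an absolute-value one, which is what allows some coefficients to hit exactly zero and thus perform selection.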
For generalised linear mixed models, Bolker et al. (3) provide a practical introduction to model selection and inference.
Post-Selection Inference
Since the main aim of descriptive models is the interpretation of estimated regression coefficients, point estimates should be accompanied by confidence intervals and sometimes also by \(p\) values. Post-selection inference is challenging because the same data used for selection are also used for estimation, which can bias confidence intervals.
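One simple (if statistically inefficient) remedy is sample splitting: select variables on one half of the data and estimate coefficients and confidence intervals on the other half, so the estimation step is not biased by the selection step. The sketch below uses a toy correlation-threshold selection rule of my own; the data, threshold, and normal-approximation intervals are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 400
X = rng.normal(size=(n, 6))
y = 1.0 * X[:, 0] + rng.normal(size=n)  # only x0 carries signal

# Split: select variables on the first half, estimate on the second half
half = n // 2
Xs, ys = X[:half], y[:half]
Xe, ye = X[half:], y[half:]

# Toy selection rule (illustrative): keep features with |corr| above a threshold
corr = np.abs([np.corrcoef(Xs[:, j], ys)[0, 1] for j in range(X.shape[1])])
selected = np.where(corr > 0.15)[0]

# Estimation half: OLS fit with normal-approximation 95% intervals,
# valid because this half played no role in the selection
Xsel = Xe[:, selected]
beta, *_ = np.linalg.lstsq(Xsel, ye, rcond=None)
resid = ye - Xsel @ beta
sigma2 = resid @ resid / (len(ye) - len(selected))
se = np.sqrt(sigma2 * np.diag(np.linalg.inv(Xsel.T @ Xsel)))
for j, b, s in zip(selected, beta, se):
    print(f"x{j}: {b:.2f} +/- {1.96 * s:.2f}")
```

Splitting halves the sample available to each step; more efficient selection-aware procedures exist, but the split makes the core problem (same data used twice) easy to see.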
Generalised additive models (GAMs) offer a flexible extension of linear models; see Sauerbrei et al. (1) for a state-of-the-art overview.
Best Practices
- Understand your data: Perform exploratory data analysis (EDA) before selection
- Use domain knowledge: Leverage subject-matter expertise to guide candidate variables
- Use prior information and DAGs: Drawing a directed acyclic graph (DAG) helps identify confounders and mediators
- Cross-validation: Always validate model performance on held-out data
- Combine methods: Filter methods can narrow the field; wrapper or embedded methods can fine-tune
- Re-evaluate regularly: As new data arrive or the problem changes, revisit the selection
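The cross-validation point can be sketched in a few lines: compare candidate variable sets by their mean held-out squared error rather than their in-sample fit. The K-fold helper and simulated data below are my own minimal illustration.

```python
import numpy as np

def kfold_mse(X, y, cols, k=5, seed=0):
    """Mean held-out squared error of an OLS fit using the given columns."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, k)
    errs = []
    for f in folds:
        train = np.setdiff1d(idx, f)  # everything not in the held-out fold
        beta, *_ = np.linalg.lstsq(X[np.ix_(train, cols)], y[train], rcond=None)
        pred = X[np.ix_(f, cols)] @ beta
        errs.append(np.mean((y[f] - pred) ** 2))
    return float(np.mean(errs))

rng = np.random.default_rng(6)
n = 300
X = rng.normal(size=(n, 4))
y = 2.0 * X[:, 0] + rng.normal(size=n)

print(kfold_mse(X, y, [0]))        # informative feature only
print(kfold_mse(X, y, [1, 2, 3]))  # noise features only
```

Held-out error rewards the variable set that generalises, which is exactly what in-sample criteria alone cannot guarantee.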