Generalized Variable Importance Metric: An approach to identify important predictors from machine learning models


Abstract

Interpreting black-box machine learning methods poses a significant challenge, and existing approaches are often data- and model-specific. In this thesis, a “Generalized Variable Importance Metric (GVIM)” is defined to measure predictor importance from black-box methods without relying on model-based parameters. GVIM, defined for a predictor through the true conditional expectation function, assesses the predictor’s impact on a continuous or binary response. A permutation-based approach to estimating GVIM is proposed, akin to those of Breiman (2001) and Fisher et al. (2019a). However, black-box models underestimate GVIM when predictors are correlated. Through a bias-variance decomposition, the source of this bias is identified, its pattern under high correlation is demonstrated, and ways to minimize it are suggested. The primary bias stems from black-box models’ limited ability to extrapolate to regions that, because of the correlations, have low probability. A conditional GVIM (CGVIM) based on Strobl et al. (2008) is introduced, its bias-variance decomposition is derived, and its relationship with predictor correlations is shown. Both GVIM and CGVIM exhibit a quadratic relationship with the conditional average treatment effect (CATE). Finally, the application of GVIM and CGVIM is demonstrated by investigating risk factors for cognitive decline using data from the Canadian Longitudinal Study on Aging (CLSA). The proposed methods are model-agnostic and offer a causal interpretation, which is crucial for clinical and public health research. Understanding exposure-outcome relationships is vital in health science, where traditional models such as regression are preferred for interpretability while machine learning excels at prediction. Being model-agnostic, GVIM and CGVIM allow researchers to choose their preferred machine learning model without sacrificing prediction or inference.
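As an illustration of the permutation idea the abstract builds on, the sketch below implements plain Breiman-style permutation importance (the rise in prediction error when one predictor's column is shuffled). This is a minimal stand-in, not the GVIM/CGVIM estimators themselves; the model, data, and function names here are hypothetical choices for demonstration.

```python
import numpy as np

def permutation_importance(predict, X, y, n_repeats=10, seed=None):
    """Breiman-style permutation importance: the increase in mean-squared
    error when one predictor's column is shuffled, which breaks its link
    to the response while preserving its marginal distribution."""
    rng = np.random.default_rng(seed)
    baseline = np.mean((y - predict(X)) ** 2)
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        losses = []
        for _ in range(n_repeats):
            Xp = X.copy()
            Xp[:, j] = rng.permutation(Xp[:, j])  # shuffle predictor j only
            losses.append(np.mean((y - predict(Xp)) ** 2))
        importances[j] = np.mean(losses) - baseline
    return importances

# Toy data: y depends strongly on x0, weakly on x1, not at all on x2.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=500)

# Least-squares fit as a stand-in for any black-box learner.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
imp = permutation_importance(lambda X_: X_ @ beta, X, y, seed=1)
print(imp.round(2))
```

With independent predictors this recovers the intended ordering (x0 most important, x2 near zero); the correlated-predictor bias discussed in the abstract arises precisely because such unconditional shuffling pushes the model into low-probability regions of the predictor space.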

Keywords

Causal Inference, Correlation Distortion, Machine Learning, Variable Importance


Creative Commons license

Except where otherwise noted, this item's license is described as Attribution 4.0 International.

Items in TSpace are protected by copyright, with all rights reserved, unless otherwise indicated.