Generalized Variable Importance Metric: An approach to identify important predictors from machine learning models

dc.contributor.advisor: Kustra, Rafal
dc.contributor.author: Khan, Mohammad Kaviul Anam
dc.contributor.department: Dalla Lana School of Public Health
dc.date: 2025-06
dc.date.accepted: 2025-06
dc.date.accessioned: 2025-07-31T15:34:09Z
dc.date.available: 2025-07-31T15:34:09Z
dc.date.convocation: 2025-06
dc.date.issued: 2025-06
dc.description.abstract: Interpreting black-box machine learning methods poses a significant challenge, with existing approaches often being data- and model-specific. In this thesis, a “Generalized Variable Importance Metric (GVIM)” is defined to measure predictor importance utilizing black-box methods without relying on model-based parameters. GVIM, which is defined for a predictor using the true conditional expectation function, assesses the predictor’s impact on a continuous or binary response. A permutation-based approach to estimate GVIM is proposed in this thesis, akin to those by Breiman (2001) and Fisher et al. (2019a). However, black-box models underestimate GVIM when predictors are correlated. Through a bias-variance decomposition, the source of the bias is identified and its pattern in high-correlation scenarios is demonstrated, suggesting ways to minimize it. The primary bias stems from black-box models’ limited ability to extrapolate to regions that have low probability because of the correlations. A conditional GVIM method (CGVIM) based on Strobl et al. (2008) is introduced, its bias-variance decomposition is derived, and its relationship with predictor correlations is shown. Both GVIM and CGVIM exhibit a quadratic relationship with the conditional average treatment effect (CATE). Finally, I demonstrate the application of GVIM and CGVIM to investigate risk factors for cognitive decline using data from the Canadian Longitudinal Study on Aging (CLSA). The proposed method is model-agnostic and offers a causal interpretation, which is crucial for clinical and public health research. Understanding exposure-outcome relationships is vital in health science, where traditional models like regression are preferred for interpretability, but machine learning excels in prediction. GVIM and CGVIM, being model-agnostic, allow researchers to choose their preferred machine learning model without sacrificing prediction or inference capabilities.
dc.description.degree: Ph.D.
dc.identifier.uri: https://hdl.handle.net/1807/145099
dc.rights: Attribution 4.0 International
dc.rights.uri: http://creativecommons.org/licenses/by/4.0/
dc.subject: Causal Inference
dc.subject: Correlation Distortion
dc.subject: Machine Learning
dc.subject: Variable Importance
dc.subject.classification: 0308
dc.title: Generalized Variable Importance Metric: An approach to identify important predictors from machine learning models
dc.type: Thesis

Files

Original bundle

Name: Khan_Mohammad_Kaviul_Anam_202506_PhD_thesis.pdf
Size: 3.83 MB
Format: Adobe Portable Document Format

License bundle

Name: cc_by.rdf
Size: 10.76 KB
Format: RDF serialized in XML
Name: TSpace_LAC_SGS_license_MOA2015.txt
Size: 2.45 KB
Format: Plain Text
Name: TSpace_LAC_SGS_license_MOA2015.pdf
Size: 69.65 KB
Format: Adobe Portable Document Format