Generalized Variable Importance Metric: An approach to identify important predictors from machine learning models
| dc.contributor.advisor | Kustra, Rafal | |
| dc.contributor.author | Khan, Mohammad Kaviul Anam | |
| dc.contributor.department | Dalla Lana School of Public Health | |
| dc.date | 2025-06 | |
| dc.date.accepted | 2025-06 | |
| dc.date.accessioned | 2025-07-31T15:34:09Z | |
| dc.date.available | 2025-07-31T15:34:09Z | |
| dc.date.convocation | 2025-06 | |
| dc.date.issued | 2025-06 | |
| dc.description.abstract | Interpreting black-box machine learning methods poses a significant challenge, with existing approaches often being data- and model-specific. In this thesis, a “Generalized Variable Importance Metric (GVIM)” is defined to measure predictor importance utilizing black-box methods without relying on model-based parameters. GVIM, which is defined for a predictor using the true conditional expectation function, assesses the predictor’s impact on a continuous or binary response. A permutation-based approach to estimate GVIM is proposed in this thesis, akin to those by Breiman (2001) and Fisher et al. (2019a). However, black-box models underestimate GVIM when predictors are correlated. Through a bias-variance decomposition, the source of the bias is identified and its pattern in high-correlation scenarios is demonstrated, suggesting ways to minimize it. The primary bias stems from black-box models’ limited ability to extrapolate to regions that have low probability because of the correlations. A conditional GVIM method (CGVIM) based on Strobl et al. (2008) is introduced, its bias-variance decomposition is derived, and its relationship with predictor correlations is shown. Both GVIM and CGVIM exhibited a quadratic relationship with the conditional average treatment effect (CATE). Finally, I demonstrated the application of GVIM and CGVIM to investigate risk factors for cognitive decline using data from the Canadian Longitudinal Study on Aging (CLSA). The proposed method is model-agnostic and offers a causal interpretation, which is crucial for clinical and public health research. Understanding exposure-outcome relationships is vital in health science, where traditional models like regression are preferred for interpretability, but machine learning excels in prediction. GVIM and CGVIM, being model-agnostic, allow researchers to choose their preferred machine learning model without sacrificing prediction or inference capabilities. | |
| dc.description.degree | Ph.D. | |
| dc.identifier.uri | https://hdl.handle.net/1807/145099 | |
| dc.rights | Attribution 4.0 International | |
| dc.rights.uri | http://creativecommons.org/licenses/by/4.0/ | |
| dc.subject | Causal Inference | |
| dc.subject | Correlation Distortion | |
| dc.subject | Machine Learning | |
| dc.subject | Variable Importance | |
| dc.subject.classification | 0308 | |
| dc.title | Generalized Variable Importance Metric: An approach to identify important predictors from machine learning models | |
| dc.type | Thesis |
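The abstract describes a permutation-based estimator akin to those of Breiman (2001) and Fisher et al. (2019a). As a rough illustration of that general idea only, the sketch below computes a generic permutation importance: the average increase in mean squared error when one predictor column is shuffled. It is not the thesis's GVIM or CGVIM estimator. The function names, the toy data, and the stand-in `predict` function are all hypothetical, and the predictors are drawn independently, so the correlation-induced bias discussed in the abstract does not arise here.

```python
import random


def mse(predict, X, y):
    """Mean squared error of `predict` over the rows of X."""
    return sum((predict(row) - yi) ** 2 for row, yi in zip(X, y)) / len(y)


def permutation_importance(predict, X, y, j, n_repeats=20, seed=0):
    """Average increase in MSE after shuffling column j of X."""
    rng = random.Random(seed)
    base = mse(predict, X, y)
    increases = []
    for _ in range(n_repeats):
        col = [row[j] for row in X]
        rng.shuffle(col)  # break the link between column j and y
        X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, col)]
        increases.append(mse(predict, X_perm, y) - base)
    return sum(increases) / n_repeats


# Hypothetical toy data: y depends strongly on x0, weakly on x1, not on x2.
data_rng = random.Random(1)
X = [[data_rng.gauss(0, 1) for _ in range(3)] for _ in range(500)]
y = [3.0 * r[0] + 0.5 * r[1] + data_rng.gauss(0, 0.1) for r in X]


def predict(row):
    """Stand-in for a fitted black-box model (assumption: the true mean)."""
    return 3.0 * row[0] + 0.5 * row[1]


scores = [permutation_importance(predict, X, y, j) for j in range(3)]
```

In this toy setup the scores rank the predictors in the order x0, x1, x2, and the score for x2 is zero because the stand-in model never reads that column.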
Files
Original bundle
- Name: Khan_Mohammad_Kaviul_Anam_202506_PhD_thesis.pdf
- Size: 3.83 MB
- Format: Adobe Portable Document Format