Document Type
Article
Publication Date
6-4-2025
Abstract
Effective variable selection is central to the success of health services research, where large, complex datasets often include numerous variables with varying degrees of relevance. This paper presents a structured approach to variable selection, highlighting the importance of combining domain expertise with advanced analytical techniques to ensure the inclusion of only the most pertinent variables. We explore several methods, including manual selection, correlation matrices, random forests, and stepwise regression, each with its strengths and limitations in managing multicollinearity, dimensionality, and interpretability. By carefully preprocessing variables—removing redundant, irrelevant, or missing data—and applying feature selection tools like decision tree-based algorithms, researchers can streamline their models to focus on the most impactful predictors. This approach not only improves the reliability and precision of findings but also enhances the interpretability of complex models, particularly when working with social determinants of health (SDOH). Through a case study using the LexisNexis SDOH dataset, we illustrate how these methods can be tailored to identify patients at highest risk for adverse health outcomes. The proposed framework fosters more accurate, actionable insights and supports targeted interventions that aim to reduce health inequities.
Language
English
Publication Title
Health Services and Outcomes Research Methodology
Rights
© The Author(s) 2025. This is an Open Access work distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/) which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 International License.
Recommended Citation
Dong, W., Lal, T., Liu, F. et al. Methodological considerations for optimal variable selection in machine learning for health services research. Health Serv Outcomes Res Method 25, 474–486 (2025). https://doi.org/10.1007/s10742-025-00347-8
Manuscript Version
Final Publisher Version