TY - JOUR
T1 - Comparison of statistical and machine learning models for healthcare cost data
T2 - A simulation study motivated by Oncology Care Model (OCM) data
AU - Mazumdar, Madhu
AU - Lin, Jung Yi Joyce
AU - Zhang, Wei
AU - Li, Lihua
AU - Liu, Mark
AU - Dharmarajan, Kavita
AU - Sanderson, Mark
AU - Isola, Luis
AU - Hu, Liangyuan
N1 - Funding Information:
Research reported in this publication was supported in part by the National Cancer Institute Cancer Center Support Grant P30CA196521–01 awarded to the Tisch Cancer Institute of the Icahn School of Medicine at Mount Sinai (TCI-ISMMS). MM, LL, LH, JL are members of the Biostatistics Shared Resource Facility for TCI-ISMMS and were provided support for this project in terms of protected time. The funder had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Publisher Copyright:
© 2020 The Author(s).
PY - 2020/4/25
Y1 - 2020/4/25
AB - Background: The Oncology Care Model (OCM) was developed as a payment model to encourage participating practices to provide better-quality care for cancer patients at a lower cost. The risk-adjustment model used in OCM is a Gamma generalized linear model (Gamma GLM) with log-link. The predicted value of expense for the episodes identified for our academic medical center (AMC), based on the model fitted to the national data, did not correlate well with our observed expense. This motivated us to fit the Gamma GLM to our AMC data and compare it with two other flexible modeling methods: Random Forest (RF) and Partially Linear Additive Quantile Regression (PLAQR). We also performed a simulation study to assess comparative performance of these methods and examined the impact of non-linearity and interaction effects, two understudied aspects in the field of cost prediction. Methods: The simulation was designed with an outcome of cost generated from four distributions: Gamma, Weibull, Log-normal with a heteroscedastic error term, and heavy-tailed. Simulation parameters both similar to and different from OCM data were considered. The performance metrics considered were the root mean square error (RMSE), mean absolute prediction error (MAPE), and cost accuracy (CA). Bootstrap resampling was utilized to estimate the operating characteristics of the performance metrics, which were described by boxplots. Results: RF attained the best performance with lowest RMSE, MAPE, and highest CA for most of the scenarios. When the models were misspecified, their performance was further differentiated. Model performance differed more for non-exponential than exponential outcome distributions. Conclusions: RF outperformed Gamma GLM and PLAQR in predicting overall and top-decile costs. RF demonstrated improved prediction under various scenarios common in healthcare cost modeling. Additionally, RF did not require prespecification of outcome distribution, non-linearity effects, or interaction terms. Therefore, RF appears to be the best tool to predict average cost. However, when the goal is to estimate extreme expenses, e.g., high-cost episodes, the accuracy gained by RF versus its computational costs may need to be considered.
KW - Generalized linear model
KW - Machine learning
KW - Oncology care model
KW - Quantile regression
KW - Risk-adjustment model
UR - http://www.scopus.com/inward/record.url?scp=85084030686&partnerID=8YFLogxK
U2 - 10.1186/s12913-020-05148-y
DO - 10.1186/s12913-020-05148-y
M3 - Article
C2 - 32334595
AN - SCOPUS:85084030686
SN - 1472-6963
VL - 20
JO - BMC Health Services Research
JF - BMC Health Services Research
IS - 1
M1 - 350
ER -