COVID-19 Pandemic
Applying machine learning to identify unrecognized COVID-19 deaths attributed to other causes of death in the United States
Mathew V. Kiang*, Dielle J. Lundberg, Rafeya Raquib, Benjamin Huynh, Richard Li, M. Maria Glymour, Andrew C. Stokes
The true number of deaths caused by the SARS-CoV-2 virus in the US has been debated since the start of the pandemic. Excess mortality models are often used but cannot distinguish deaths due to SARS-CoV-2 infection from deaths due to other pandemic-related causes such as health care interruptions or social and economic impacts. We use a novel machine learning (ML) approach trained on individual death certificate data to differentiate these deaths and produce refined estimates of COVID-19 mortality from 2020 to 2021. Leveraging variation in death certificate accuracy by place of death (i.e., COVID-19 deaths are more accurately reported in hospital settings due to widespread testing and national guidance), we fit six ML models across four covariate sets, which included individual, county, and pandemic characteristics. We tuned models and assessed performance using a train-validate-test split on the in-hospital deaths. The final model, XGBoost, had high out-of-sample performance (AUC ROC: 0.90; sensitivity: 0.80; specificity: 0.85). We used this model to predict COVID-19 mortality for out-of-hospital deaths and calculated adjusted reporting ratios (ARRs), defined as the number of predicted COVID-19 deaths relative to the number of officially reported COVID-19 deaths. We estimated the actual number of COVID-19 deaths to be 31% higher than official reports (ARR: 1.31; 95% uncertainty interval [95% UI]: 1.30, 1.32). There was substantial variation between groups (Figure). For example, COVID-19 was underreported more frequently on death certificates recorded as Hispanic (ARR: 1.44; 95% UI: 1.42, 1.45) and on those recorded as male (ARR: 1.35; 95% UI: 1.33, 1.38). Major differences in ARR between counties and over time indicate how incomplete reporting of COVID-19 deaths could influence pandemic response. Our estimates of overall underreported COVID-19 deaths are consistent with unexplained excess mortality during the pandemic. Incorporating ML into the US death reporting system may provide more rapid, accurate, and complete mortality estimates.
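The reported quantities (sensitivity, specificity, and the ARR) are simple ratios of counts, and a minimal Python sketch may make their definitions concrete. The function names and the toy counts below are hypothetical and are not taken from the study; the sketch also assumes a plain predicted-to-reported ratio for the ARR and does not attempt to reproduce the authors' adjustment or uncertainty-interval procedure, which the abstract does not describe.

```python
from typing import Sequence, Tuple


def sensitivity_specificity(
    y_true: Sequence[bool], y_pred: Sequence[bool]
) -> Tuple[float, float]:
    """Return (sensitivity, specificity) for binary labels and predictions."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    tn = sum(1 for t, p in zip(y_true, y_pred) if not t and not p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative label")
    return tp / (tp + fn), tn / (tn + fp)


def reporting_ratio(n_predicted: int, n_reported: int) -> float:
    """Predicted COVID-19 deaths divided by officially reported COVID-19 deaths.

    Assumption: a plain count ratio; the study's exact ARR construction may differ.
    """
    if n_reported <= 0:
        raise ValueError("n_reported must be positive")
    return n_predicted / n_reported


# Toy held-out set (made-up counts): 10 true COVID-19 deaths, 20 other deaths.
y_true = [True] * 10 + [False] * 20
# Hypothetical classifier output: 8 of 10 positives found, 17 of 20 negatives found.
y_pred = [True] * 8 + [False] * 2 + [True] * 3 + [False] * 17

sens, spec = sensitivity_specificity(y_true, y_pred)
ratio = reporting_ratio(n_predicted=131, n_reported=100)  # made-up counts
print(f"sensitivity={sens:.2f} specificity={spec:.2f} ratio={ratio:.2f}")
```

Under this simplified definition, a ratio above 1 means the model predicts more COVID-19 deaths than were officially recorded; a ratio of 1.31, for instance, corresponds to 31% more predicted than reported deaths.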