Science Communication & Media
Assessing the Validity of Public Web Search Data for Predictive Modeling of Epidemics
Marie-Laure Charpignon*, Anika Puri, Maimuna Majumder
Non-conventional data collected from public web searches, including Google Search Trends (GST), are increasingly used for predictive modeling of epidemics. Here, we investigated the validity of leveraging GST data to predict COVID-19 progression. Specifically, we examined the relationship between COVID-19-related morbidity-mortality (ie, case and death counts) and public interest in COVID-19 as ascertained through GST. We focused on the United States and analyzed data from February 2020 to May 2022, inclusive. We found that states with a larger share of Republican-leaning voters had lower levels of public interest in COVID-19 relative to their levels of COVID-19 morbidity-mortality. This pattern was most prominent during the Omicron wave. Further, we characterized the spatio-temporal dynamics of decoupling between Internet search interest and COVID-19 morbidity-mortality by state. Decoupling was defined as the point-in-time difference between the normalized GST value and the corresponding normalized COVID-19 morbidity-mortality; a negative value indicates greater COVID-19 morbidity-mortality than the corresponding level of public interest. As the pandemic progressed, we found greater negative decoupling between Internet search interest and COVID-19 morbidity-mortality in Republican-leaning states than in Democrat-leaning states. This result calls into question the effectiveness and generalizability of using GST to build epidemic forecasting models across locations. To account for confounders of the relationship between political leaning and Internet search interest, we implemented a multivariable regression adjusting for local levels of social vulnerability, Internet penetration, and vaccine uptake. The presence of decoupling between COVID-19 morbidity-mortality and public interest suggests that the information value of GST may vary over time and by political leaning. Thus, caution is warranted when employing such data for spatio-temporal epidemic models of COVID-19 and other infectious diseases.
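The decoupling measure defined above can be illustrated with a minimal sketch. The abstract does not say how the two series were normalized, so min-max scaling to [0, 1] within each state's time series is an assumption made here for illustration only; the function names `min_max_normalize` and `decoupling` and the input values are likewise hypothetical and do not come from the study.

```python
from typing import List, Sequence


def min_max_normalize(values: Sequence[float]) -> List[float]:
    """Scale a series to [0, 1]. ASSUMPTION: the abstract does not name
    the normalization it used; min-max scaling is one plausible choice."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant series carries no variation; map it to zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def decoupling(gst: Sequence[float], burden: Sequence[float]) -> List[float]:
    """Point-in-time difference between normalized search interest and
    normalized morbidity-mortality for one location.

    A negative value means the disease burden is higher than the
    corresponding level of public interest."""
    if len(gst) != len(burden):
        raise ValueError("series must cover the same time points")
    gst_norm = min_max_normalize(gst)
    burden_norm = min_max_normalize(burden)
    return [g - b for g, b in zip(gst_norm, burden_norm)]


# Made-up values for a single hypothetical state over three time points:
# search interest falls while case counts rise.
search_interest = [100, 60, 20]
case_counts = [50, 125, 200]
print(decoupling(search_interest, case_counts))  # [1.0, 0.0, -1.0]
```

In this toy series the measure moves from positive to negative as interest wanes and cases climb, which is the direction of decoupling the abstract reports as more pronounced in Republican-leaning states.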