Abstract Search – Society for Epidemiologic Research

Big Data/Machine Learning/AI

A framework for collecting data for public health research from the web using large language models

Thomas Berkane*, Marie-Laure Charpignon, Maimuna S. Majumder

Many public health researchers rely on data from the web in their work, from epidemiological reports and vaccination statistics published by governments to public sentiment on health topics discussed in news and social media. However, these data are typically scattered across the web and become useful for longitudinal or cross-sectional studies only after they are compiled into structured datasets. Further, manually searching for data points is time-consuming and prone to human error.

We propose a framework that automates web-scale collection of research data end-to-end, leveraging large language models (LLMs). Given a user-provided description of the target dataset, our framework generates search queries, navigates the web to find relevant pages, selectively extracts data, performs quality control, and produces a structured dataset. The framework operates in a human-in-the-loop manner, allowing users to inspect and adjust the data collection process at each stage to ensure alignment with their goals. In addition to mitigating LLM hallucinations through grounding, we correct for two types of bias introduced by search engines: webpage recency and user geographical location. The framework maintains transparency by linking each data point to its original source. The quality control step automatically flags potentially anomalous data points for user review, such as outliers and duplicates.
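The quality control and provenance ideas described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `DataPoint` fields, the `flag_for_review` function, the duplicate key (same entity and date), and the median-absolute-deviation outlier rule with a 3.5 cutoff are all assumptions chosen for illustration. The example values and URLs are made up.

```python
from dataclasses import dataclass, field
from statistics import median
from typing import List


@dataclass
class DataPoint:
    """One extracted value, stored with the page it was extracted from."""
    entity: str       # hypothetical field, e.g. a country or US state
    date: str         # ISO date string
    value: float
    source_url: str   # provenance: link back to the original source
    flags: List[str] = field(default_factory=list)


def flag_for_review(points: List[DataPoint], threshold: float = 3.5) -> List[DataPoint]:
    """Attach 'duplicate' and 'outlier' flags in place; nothing is deleted."""
    if not points:
        return points

    # Duplicates: a later point repeating an (entity, date) pair already seen.
    seen = set()
    for p in points:
        key = (p.entity, p.date)
        if key in seen:
            p.flags.append("duplicate")
        seen.add(key)

    # Outliers: modified z-score based on the median absolute deviation (MAD).
    values = [p.value for p in points]
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad > 0:  # if MAD is zero, the score is undefined, so skip this check
        for p in points:
            if 0.6745 * abs(p.value - med) / mad > threshold:
                p.flags.append("outlier")
    return points


# Made-up example values, for illustration only.
example = [
    DataPoint("Haiti", "2022-10-01", 120, "https://example.org/a"),
    DataPoint("Haiti", "2022-10-08", 135, "https://example.org/b"),
    DataPoint("Haiti", "2022-10-08", 135, "https://example.org/c"),
    DataPoint("Haiti", "2022-10-15", 128, "https://example.org/d"),
    DataPoint("Haiti", "2022-10-22", 13000, "https://example.org/e"),
]
for p in flag_for_review(example):
    if p.flags:
        print(p.date, p.value, p.flags, p.source_url)
```

Flagged points are kept rather than dropped, so a user reviewing them can follow `source_url` back to the page and decide whether the value is an extraction error or a genuine observation.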

After validating each step of our framework, we present three case studies illustrating its application to collecting diverse types of public health data from the web: (1) time series of cholera cases globally, (2) US state-level COVID-19 contact tracing app downloads, and (3) timelines of natural disasters (events that often lead to disease outbreaks) in Haiti and Cameroon. The dataset derived for case study (3) is shown in Fig. 1.

Future research will extend our framework to extract public health data from modalities beyond text, such as images and PDFs.