Big Data/Machine Learning/AI
Secure extraction of structured epidemiologic data from multilingual pathology reports using on-premise large language models Janos Tibor Fekete* Janos Tibor Fekete Fekete Fekete Fekete Fekete Semmelweis University, Dept. of Bioinformatics, Budapest, Hungary
Background: The transformation of unstructured clinical text into structured, analysis-ready data remains a major bottleneck in epidemiology and cancer research. Although large language models (LLMs) offer substantial potential for clinical data extraction, their adoption is limited by concerns related to data sovereignty, multilingual performance, and infrastructure requirements. We introduce CIDER (ClinIcal Data ExtractoR), an open-source, on-premise LLM-based pipeline designed for secure, scalable extraction of structured epidemiologic variables from multilingual pathology reports.
Methods: CIDER employs an asynchronous FastAPI architecture combined with a vLLM inference engine to deploy the Qwen3-VL-32B-Instruct-FP8 model within an air-gapped institutional environment. System performance was evaluated using 2,073 real-world Hungarian-language histopathology reports from an institutional OnkoBank. Extraction accuracy was assessed against expert-curated registry data for seven key variables (sex, year of surgery, T stage, N stage, primary tumor site, histology, and tumor size). Technical reproducibility was evaluated across three independent runs (temperature = 0.1).
Results: CIDER demonstrated high concordance with expert-curated data, exceeding 98% accuracy for demographic variables and maintaining >95% accuracy for T stage and >92% for N stage. Importantly, the system recovered clinically valid information missing from manual curation, including 62.8% of previously unrecorded T stages (n = 713) and 91.5% of missing tumor size values (n = 289). Performance was highly stable across repeated runs and remained robust even at extreme temperature settings.
Conclusions: CIDER shows that locally deployed, open-source LLMs can achieve near–expert-level performance in extracting structured epidemiologic data from complex, English and non-English clinical narratives while fully preserving data sovereignty. CIDER is publicly accessible at: https://llm.gyorffylab.com/cider.
