Multi-Hospital Electronic Health Record Foundation Models Without Data Sharing: A Comparison of Federated Learning and Inference-Time Ensembling. | Englander Institute for Precision Medicine

Title	Multi-Hospital Electronic Health Record Foundation Models Without Data Sharing: A Comparison of Federated Learning and Inference-Time Ensembling.
Publication Type	Journal Article
Year of Publication	2026
Authors	Elemento O
Journal	medRxiv
Date Published	2026 Apr 27
Abstract	BACKGROUND: Foundation models for electronic health records (EHRs) perform strongly on clinical prediction, but every published model has been trained within a single health system. No multi-institutional EHR foundation model currently exists, largely because privacy regulations and governance barriers block data pooling across hospitals. Two strategies could build such models without pooling: federated learning (exchanges model weights) and inference-time ensembling (exchanges only predictions at query time). Whether either is viable for autoregressive EHR foundation models, and whether individual hospitals benefit from participating, is not established.
METHODS: We trained a generative pretrained transformer (GPT) style EHR foundation model on 100,163 Medical Information Mart for Intensive Care (MIMIC-IV) patients, partitioned into five heterogeneously distributed (non-IID) sites by Dirichlet allocation over International Classification of Diseases (ICD) chapters. We compared centralized training, federated averaging, and inference-time ensembling, and each hospital's solo model against the ensemble including it. Models were evaluated on 15,012 held-out patients using per-condition area under the receiver operating characteristic curve (AUROC) for five acute conditions and micro-averaged area under the precision-recall curve (AUPRC) across 2,590 diagnoses.
RESULTS: Centralized training achieved per-condition AUROC 0.75-0.85 and overall AUPRC 0.376. Federated averaging recovered 85% of centralized AUPRC (0.321) and 98-100% of per-condition AUROC. Inference-time ensembling, requiring no training-time exchange, recovered 77% of AUPRC (0.291) and 97-99% of per-condition AUROC. An estimated 87% of participating hospitals received a better model from the ensemble than from training alone; only hospitals with ~40% of the network's patients matched the ensemble on their own. FedProx collapsed to the marginal baseline.
CONCLUSIONS: Multi-institutional EHR foundation models can be built without pooling patient data. Inference-time ensembling benefits most participating hospitals and imposes the lightest governance burden; federated learning recovers more performance but requires weight sharing. These findings offer a practical path toward collaborative clinical AI.
DOI	10.64898/2026.04.24.26351702
Alternate Journal	medRxiv
PubMed ID	42094144
PubMed Central ID	PMC13142595
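
Illustrative note: the abstract contrasts two ways of combining sites without pooling data, and the difference may be easier to see in code. The sketch below is not the authors' implementation. It assumes a symmetric Dirichlet prior with a hypothetical concentration `alpha`, one site mixture per ICD chapter, size-weighted parameter averaging for the federated step, and an unweighted mean of predicted probabilities for the ensemble. None of these details are stated in the abstract, and all function names are made up for illustration. Only the AUPRC figures in the final lines come from the abstract.

```python
import random


def dirichlet_site_shares(n_sites, alpha, rng):
    """Sample one symmetric Dirichlet vector via normalized Gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(n_sites)]
    total = sum(draws)
    return [d / total for d in draws]


def partition_by_chapter(patient_chapters, n_sites, alpha, seed=0):
    """Assign patients to sites; each chapter gets its own site mixture.

    Assumption: small alpha gives more skewed (more non-IID) sites.
    Returns a list of patient-index lists, one per site.
    """
    rng = random.Random(seed)
    chapters = sorted(set(patient_chapters))
    shares = {ch: dirichlet_site_shares(n_sites, alpha, rng) for ch in chapters}
    sites = [[] for _ in range(n_sites)]
    for idx, chapter in enumerate(patient_chapters):
        site = rng.choices(range(n_sites), weights=shares[chapter])[0]
        sites[site].append(idx)
    return sites


def federated_average(site_weights, site_sizes):
    """Aggregation step of federated averaging: size-weighted mean of
    per-site parameter vectors. Only weights leave each site."""
    total = sum(site_sizes)
    n_params = len(site_weights[0])
    return [
        sum(w[j] * n for w, n in zip(site_weights, site_sizes)) / total
        for j in range(n_params)
    ]


def ensemble_predict(site_probs):
    """Inference-time ensembling (assumed form): unweighted mean of the
    probabilities each site's model returns for one query."""
    n_sites = len(site_probs)
    n_labels = len(site_probs[0])
    return [sum(p[k] for p in site_probs) / n_sites for k in range(n_labels)]


def recovery(score, centralized):
    """Fraction of the centralized score that a method recovers."""
    return score / centralized


# Toy data: 1,000 hypothetical patients, each tagged with one of 4 chapters.
toy_rng = random.Random(1)
toy_chapters = [toy_rng.choice("ABCD") for _ in range(1000)]
toy_sites = partition_by_chapter(toy_chapters, n_sites=5, alpha=0.5)
print([len(s) for s in toy_sites])

# AUPRC values reported in the abstract.
print(round(recovery(0.321, 0.376), 2))  # federated averaging: 0.85
print(round(recovery(0.291, 0.376), 2))  # inference-time ensembling: 0.77
```

In this sketch the two strategies differ only in what crosses the hospital boundary: `federated_average` consumes parameter vectors during training, while `ensemble_predict` consumes nothing but per-query probabilities, which is consistent with the lighter governance burden the abstract attributes to ensembling.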