Englander Institute for Precision Medicine

Landmark AI Project Harnesses Voice to Diagnose Disease, Releases First Data


Researchers at USF Health and Weill Cornell Medicine, as part of an expansive, multi-institutional project investigating voice as a biomarker for disease, have reached a significant milestone by publishing the first version of their clinically validated voice dataset to an online artificial intelligence platform where it will be an invaluable resource for researchers across the globe.  

The National Institutes of Health-funded project, Voice as a Biomarker of Health, seeks to build an ethically sourced, AI-enabled database of 10,000 human voices from patients with different illnesses to help doctors diagnose and treat diseases, such as cancer and depression, based on the sound of a patient’s voice.

The initial data release includes more than 12,500 separate recordings from 306 participants across the United States and Canada. Published on Health Data Nexus and available to the community of health researchers studying voice, the data release comes at the end of the second year of the four-year, $14 million project, with several additional releases scheduled for the next two years. Already among the largest collections of human voices, the repository will become the world’s flagship database for AI voice and health by the end of the study.

“There is so much information in these first recordings and we are excited to receive feedback on it because what we are developing will be an unequalled resource for the scientific community,” said Dr. Yaël Bensoussan, director of the USF Health Voice Center and co-lead of the project. “It is really important for us to understand what people can do with this initial data and what kinds of clinical questions they can answer.”

As one of four precision health data projects funded by the NIH Common Fund’s Bridge2AI program, Voice as a Biomarker of Health aims to introduce a transformative new method of diagnosing and treating diseases by training AI models to identify illnesses through changes in the human voice, with vast implications for the clinical setting. 

The University of South Florida is the lead institution for the project in collaboration with Weill Cornell Medicine and 10 other institutions across the United States and Canada. Dr. Bensoussan of the USF Health Morsani College of Medicine, and Dr. Olivier Elemento, director of the Englander Institute for Precision Medicine at Weill Cornell Medicine, are the project’s co-principal investigators.

While previous research utilizing voice and AI to detect disease is encouraging, it has been limited due to the small size of data sets, as well as concerns over data security, ownership and bias. Voice as a Biomarker of Health is addressing that shortcoming by bringing together medical voice, AI engineering and ethics experts to generate a landmark voice database using privacy-preserving AI. 

“Artificial intelligence is revolutionizing our ability to detect and understand disease, and this groundbreaking voice dataset is a monumental step forward in that journey,” said Dr. Elemento, who is also a professor of physiology and biophysics at Weill Cornell Medicine. “These clinically validated data, combined with cutting-edge AI techniques, pave the way for new diagnostic possibilities and groundbreaking innovations that will transform patient care globally.”

The newly published data set is particularly notable for the breadth and quality of its recordings, which were collected across numerous institutions from outpatient clinical settings. The data is clinically validated and standardized across locations, and all participants perform the same tests and acoustic tasks. The three types of acoustic tasks (respiratory, voice, and speech and linguistic) include over 20 tasks such as breathing at rest, coughing, enunciating “E” at long intervals, reading specific passages, free speech and other voice-related activities.

This highest-quality, standardized data will be essential in corroborating existing voice algorithms and fueling the development of new discoveries, said Dr. Bensoussan.

“Researchers will be able to use it as a benchmark data set to confirm that their algorithms are valid,” she said. “For example, some startups have already developed algorithms to diagnose voice biomarkers with their proprietary data, and our data set can be used to see if it also works with people with different types of diseases.”

Accompanying the data release is a Bridge2AI Voice Prep Kit offering a host of tools to researchers for preprocessing and utilizing the data. The Bridge2AI consortium is also hosting the 2025 Voice AI Symposium and Hackathon, April 22-24, in Tampa, FL, which will connect clinician-scientists, researchers, patients and top minds in AI to advance the application of voice AI in health care and the utility of the database in pioneering new discoveries. 

“An undertaking of this size and scope is a team effort, with so many investigators and institutions in the United States and Canada coming together to fuel discoveries in health care,” Dr. Bensoussan said. “From this work, and the research it will enable in the rest of the world, I think you are going to see a lot of progress and some very impactful products developed.”

 # # #

Appendix:

Voice as a Biomarker of Disease lead investigators: 

A version of this story first appeared on the University of South Florida newsroom.

 
