Englander Institute for Precision Medicine

A Multimodal Approach for Deep-Learning Classification of Vocal Fold Pathologies in Stroboscopy.

Title: A Multimodal Approach for Deep-Learning Classification of Vocal Fold Pathologies in Stroboscopy.
Publication Type: Journal Article
Year of Publication: 2026
Authors: Surapaneni S, Kutler RB, Setzen SA, Kim YEun, Yao P, Siddiqui SH, Pitman MJ, Sulica L, Elemento O, Khosravi P, Rameau A
Journal: Laryngoscope
Date Published: 2026 Jan 05
ISSN: 1531-4995
Abstract

OBJECTIVE: To develop and validate a multimodal deep-learning classifier, trained on stroboscopic image, voice, and clinicodemographic data, that differentiates among three vocal fold (VF) states: healthy (HVF), unilateral VF paralysis (UVFP), and VF lesions, including benign and malignant pathologies.

METHODS: Patients with UVFP (n = 54), VF lesions (n = 42), and HVF (n = 41) were retrospectively identified. Image frames and voice samples were extracted from stroboscopic videos, and clinicodemographic variables were collected from the electronic health record. Patient-level data were split into independent training (80%) and testing (20%) sets. Visual features were extracted with the DINOv2 vision transformer, and acoustic features were extracted with Librosa. All three feature modalities were combined using a custom multilayer perceptron. Unimodal models using only image or only voice data were trained for comparison. Accuracy and F1 scores were used to evaluate the models.
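The fusion step described above can be illustrated with a minimal early-fusion sketch: per-modality feature vectors are concatenated and passed through a small multilayer perceptron head that outputs probabilities over the three VF classes. All dimensions and weights here are assumptions for illustration (the abstract does not report exact feature sizes or architecture), and the weights are untrained placeholders, not the authors' model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical feature dimensions (assumptions, not reported in the paper):
# DINOv2 ViT embeddings are commonly 768-d; librosa-derived acoustic features
# (e.g., averaged MFCCs) might be ~40-d; a handful of clinicodemographic fields.
img_feat = rng.standard_normal(768)
audio_feat = rng.standard_normal(40)
clinical_feat = rng.standard_normal(5)

# Early fusion: concatenate all modalities into a single vector.
x = np.concatenate([img_feat, audio_feat, clinical_feat])

def relu(z):
    return np.maximum(z, 0.0)

def softmax(z):
    e = np.exp(z - z.max())  # subtract max for numerical stability
    return e / e.sum()

# Toy two-layer perceptron head mapping fused features to the three VF
# classes (HVF, UVFP, lesion). Weights are random placeholders.
W1 = rng.standard_normal((128, x.size)) * 0.02
b1 = np.zeros(128)
W2 = rng.standard_normal((3, 128)) * 0.02
b2 = np.zeros(3)

probs = softmax(W2 @ relu(W1 @ x + b1) + b2)
print(probs)  # three class probabilities summing to 1
```

Concatenation-then-MLP ("early fusion") is one of the simplest ways to combine heterogeneous modalities; it lets a single classifier weigh image, audio, and clinical signals jointly, at the cost of requiring all modalities to be present for every patient.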

RESULTS: On a held-out test set, the multimodal classifier achieved higher accuracy (76.9%) than the image-only classifier (61.5%) and the audio-only classifier (65.4%). On an external dataset, the multimodal classifier's accuracy dropped to 45%, though this still improved on the 42% and 31% achieved by the image-only and audio-only models, respectively.
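The evaluation metrics reported above (accuracy and F1) can be computed directly from predicted and true labels. A minimal sketch, using hypothetical label lists and macro-averaged F1 across the three VF classes (the abstract does not specify the averaging scheme, so macro averaging is an assumption):

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions matching the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred, labels=("HVF", "UVFP", "lesion")):
    """Unweighted mean of per-class F1 scores (macro averaging)."""
    f1s = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

# Toy example with made-up labels:
y_true = ["HVF", "HVF", "UVFP", "lesion"]
y_pred = ["HVF", "UVFP", "UVFP", "lesion"]
print(accuracy(y_true, y_pred))  # 0.75
```

Macro averaging weights each class equally, which matters here because the three classes are imbalanced (n = 54, 42, and 41).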

CONCLUSIONS: In this proof-of-concept study, we developed a multimodal dataset and classifier for VF pathology, demonstrating the potential of combining stroboscopic frames, voice, and clinicodemographic data. The multimodal classifier achieved higher accuracy than the image-only and audio-only models. Future work should validate these findings on larger datasets.

DOI: 10.1002/lary.70355
Alternate Journal: Laryngoscope
PubMed ID: 41489089
Grant List: K76 AG079040 / AG / NIA NIH HHS / United States
OT2 OD032720 / CD / ODCDC CDC HHS / United States
