Advancing Speech Recognition With Vocal Tract Kinematics

Olorundamilola Kazeem, University of California, Berkeley


Articulatory Automatic Speech Recognition (AASR) represents a significant advance in speech technology, enhancing traditional Automatic Speech Recognition (ASR) systems by integrating the physical dynamics of speech production. This study investigates the potential and challenges of incorporating vocal tract kinematics, such as tongue and lip movements, airflow, and muscle activity, into ASR systems. By capturing these dynamic articulatory features, AASR provides a richer and more accurate representation of speech, improving recognition accuracy particularly in noisy environments and for speakers with diverse accents or speech impairments. A key advantage of AASR is its efficacy in low-resource settings, where traditional ASR systems built on acoustic features are often inadequate. Compared with traditional acoustic features (e.g., the Mel-spectrogram), articulatory features are interpretable, low-dimensional, and speaker-agnostic, making them a valuable option for resource-constrained scenarios. Furthermore, the inherently interpretable nature of articulatory features supports the development of transparent and trustworthy AI, offering greater insight into a model's decision-making process and enhancing trust in ASR technologies.
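To make the dimensionality contrast concrete, the sketch below compares a conventional 80-bin Mel-spectrogram frame with a hypothetical articulatory frame. The 12-channel articulatory representation (e.g., x/y coordinates of tongue, lip, and jaw sensor positions), the frame rate, and the synthetic audio are illustrative assumptions, not specifics from this work; real articulatory trajectories would come from measurement hardware or an acoustic-to-articulatory inversion model.

```python
import numpy as np
import librosa

# Synthetic 1-second, 16 kHz signal standing in for a real speech recording.
sr = 16000
y = np.random.randn(sr).astype(np.float32)

# Acoustic features: an 80-bin Mel-spectrogram, the conventional ASR input
# (25 ms windows, 10 ms hop).
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=400, hop_length=160, n_mels=80)
log_mel = librosa.power_to_db(mel)  # shape: (80, n_frames) -> 80 dims per frame

# Articulatory features: low-dimensional vocal tract kinematics, assumed here
# to be 12 channels (e.g., x/y positions of tongue tip, tongue body, lips,
# jaw). A zero placeholder is used in place of real measured or inverted
# trajectories.
n_frames = log_mel.shape[1]
articulatory = np.zeros((12, n_frames), dtype=np.float32)  # 12 dims per frame

print("acoustic frame dimension:     ", log_mel.shape[0])       # 80
print("articulatory frame dimension: ", articulatory.shape[0])  # 12
```

The per-frame dimensionality gap (80 vs. roughly a dozen channels) is one reason articulatory representations are attractive in resource-constrained settings, and each channel maps to a physically meaningful articulator, which underlies the interpretability claim above.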