Interspeech-2024 Tutorial

Presenters

Dr. Yossi Adi - The Hebrew University of Jerusalem, Israel

Yossi Adi is an Assistant Professor at the School of Computer Science and Engineering, The Hebrew University of Jerusalem. Prior to his current position, Yossi was a staff research scientist on Meta's Fundamental AI Research (FAIR) team. Yossi holds a Ph.D. in computer science from Bar-Ilan University and has received several prestigious awards, including the IAAI Best Doctoral Dissertation Award (2020) and the Alon Scholarship (2023). Yossi's research spans core machine learning and deep learning algorithms, with a specific emphasis on their application to spoken language modeling. Yossi has published numerous papers on speech language models at top-tier machine learning, natural language processing, and speech conferences and journals, including Lakhotia et al. (2021), Kharitonov et al. (2021), Polyak et al. (2021), Kreuk et al. (2022), Sicherman and Adi (2023), and Hassid et al. (2024). Yossi has served on several technical committees, including the IEEE Machine Learning for Signal Processing (MLSP) Technical Committee and the Workshop on Machine Learning in Speech and Language Processing (MLSLP). He has also released an open-source package for spoken language processing (textless-lib) and is co-organizing two special sessions on the topic at Interspeech 2024.

Dr. Soumi Maiti - Carnegie Mellon University, USA

Soumi Maiti is a post-doctoral researcher at the Language Technologies Institute at Carnegie Mellon University. Her research focuses on the application of machine learning to speech processing systems, with a particular interest in understanding human perception of speech and languages. Soumi holds a Ph.D. in computer science from the City University of New York. She has previously held positions at Apple, Google, and Interactions LLC in various capacities. Her recent research includes speech language models (Maiti et al., 2023) and establishing the evaluation of speech generative models (Maiti et al., 2023; Saeki et al., 2024). She previously took part in the educational short course "Inclusive Neural Speech Synthesis" at ICASSP 2022.

Prof. Shinji Watanabe - Carnegie Mellon University, USA

Shinji Watanabe is an Associate Professor at Carnegie Mellon University. His research interests include automatic speech recognition, speech enhancement, spoken language understanding, and machine learning for speech and language processing. He has published over 400 papers in peer-reviewed journals and conferences and received several awards, including the best paper award from IEEE ASRU in 2019. His recent studies include speech foundation models, e.g., self-supervised learning (Mohamed et al., 2022), reproduction of the OpenAI Whisper models (Peng et al., 2023), and speech language models (Maiti et al., 2023; Huang et al., 2023). He is also interested in establishing the evaluation of speech generative models based on speech language models (Maiti et al., 2023; Saeki et al., 2024). He has extensive experience conducting tutorials, e.g., at ICASSP 2012, 2021, and 2022 and at Interspeech 2016, 2019, 2022, and 2023, including "Self-supervised Representation Learning for Speech Processing" at Interspeech 2022, which is closely related to this tutorial. He is an IEEE and ISCA Fellow.


Details & Materials

Below you will find all the relevant information and materials for the tutorial.

Where

Hall: Acesso, Interspeech 2024, Kos.

When

September 1, 2024, 1:45 PM - 4:45 PM.

Slides

The presentation slides for the tutorial can be downloaded using the following link.