


24th Interspeech 2023: Dublin, Ireland
- Naomi Harte, Julie Carson-Berndsen, Gareth Jones:
24th Annual Conference of the International Speech Communication Association, Interspeech 2023, Dublin, Ireland, August 20-24, 2023. ISCA 2023
Keynote 1 ISCA Medallist
- Shrikanth Narayanan:
Bridging Speech Science and Technology - Now and Into the Future. 1
Speech Synthesis: Prosody and Emotion
- Jianrong Wang, Yaxin Zhao, Li Liu, Tianyi Xu, Qi Li, Sen Li:
Emotional Talking Head Generation based on Memory-Sharing and Attention-Augmented Networks. 2-6 - Zhaoci Liu, Zhen-Hua Ling, Ya-Jun Hu, Jia Pan, Jin-Wei Wang, Yun-Di Wu:
Speech Synthesis with Self-Supervisedly Learnt Prosodic Representations. 7-11 - Haobin Tang, Xulong Zhang, Jianzong Wang, Ning Cheng, Jing Xiao:
EmoMix: Emotion Mixing via Diffusion Models for Emotional Speech Synthesis. 12-16 - Detai Xin, Shinnosuke Takamichi, Ai Morimatsu, Hiroshi Saruwatari:
Laughter Synthesis using Pseudo Phonetic Tokens with a Large-scale In-the-wild Laughter Corpus. 17-21 - Rui Liu, Haolin Zuo, De Hu, Guanglai Gao, Haizhou Li:
Explicit Intensity Control for Accented Text-to-speech. 22-26 - Guangyan Zhang, Thomas Merritt, Manuel Sam Ribeiro, Biel Tura Vecino, Kayoko Yanagisawa, Kamil Pokora, Abdelhamid Ezzerg, Sebastian Cygert, Ammar Abbas, Piotr Bilinski, Roberto Barra-Chicote, Daniel Korzekwa, Jaime Lorenzo-Trueba:
Comparing normalizing flows and diffusion models for prosody and acoustic modelling in text-to-speech. 27-31
Statistical Machine Translation
- Paul-Ambroise Duquenne, Holger Schwenk, Benoît Sagot:
Modular Speech-to-Text Translation for Zero-Shot Cross-Modal Transfer. 32-36 - Proyag Pal, Brian Thompson, Yogesh Virkar, Prashant Mathur, Alexandra Chronopoulou, Marcello Federico:
Improving Isochronous Machine Translation with Target Factors and Auxiliary Counters. 37-41 - Kun Song, Yi Ren, Yi Lei, Chunfeng Wang, Kun Wei, Lei Xie, Xiang Yin, Zejun Ma:
StyleS2ST: Zero-shot Style Transfer for Direct Speech-to-speech Translation. 42-46 - Marco Gaido, Sara Papi, Matteo Negri, Marco Turchi:
Joint Speech Translation and Named Entity Recognition. 47-51 - Gerard Sant, Carlos Escolano:
Analysis of Acoustic information in End-to-End Spoken Language Translation. 52-56 - Peidong Wang, Eric Sun, Jian Xue, Yu Wu, Long Zhou, Yashesh Gaur, Shujie Liu, Jinyu Li:
LAMASSU: A Streaming Language-Agnostic Multilingual Speech Recognition and Translation Model Using Neural Transducers. 57-61
Self-Supervised Learning in ASR
- Yifan Peng, Yui Sudo, Muhammad Shakeel, Shinji Watanabe:
DPHuBERT: Joint Distillation and Pruning of Self-Supervised Speech Models. 62-66 - Salah Zaiem, Titouan Parcollet, Slim Essid:
Automatic Data Augmentation for Domain Adapted Fine-Tuning of Self-Supervised Speech Representations. 67-71 - Zhao Yang, Dianwen Ng, Chong Zhang, Xiao Fu, Rui Jiang, Wei Xi, Yukun Ma, Chongjia Ni, Eng Siong Chng, Bin Ma, Jizhong Zhao:
Dual Acoustic Linguistic Self-supervised Representation Learning for Cross-Domain Speech Recognition. 72-76 - Murali Karthick Baskar, Andrew Rosenberg, Bhuvana Ramabhadran, Kartik Audhkhasi:
O-1: Self-training with Oracle and 1-best Hypothesis. 77-81 - Ziyang Ma, Zhisheng Zheng, Changli Tang, Yujin Wang, Xie Chen:
MT4SSL: Boosting Self-Supervised Speech Representation Learning by Integrating Multiple Targets. 82-86 - Léa-Marie Lam-Yee-Mui, Lucas Ondel Yang, Ondrej Klejch:
Comparing Self-Supervised Pre-Training and Semi-Supervised Training for Speech Recognition in Languages with Weak Language Models. 87-91
Prosody
- Xinya Zhang, Ying Chen:
Chinese EFL Learners' Perception of English Prosodic Focus. 92-96 - Thomas Sostarics, Jennifer Cole:
Pitch Accent Variation and the Interpretation of Rising and Falling Intonation in American English. 97-101 - Jianjing Kuang, May Pik Yu Chan, Nari Rhee:
Tonal coarticulation as a cue for upcoming prosodic boundary. 102-106 - Sophie Repp, Lara Muhtz, Johannes Heim:
Alignment of Beat Gestures and Prosodic Prominence in German. 107-111 - Hannah White, Joshua Penney, Andy Gibson, Anita Szakay, Felicity Cox:
Creak Prevalence and Prosodic Context in Australian English. 112-116 - Kübra Bodur, Roxane Bertrand, James Sneed German, Stéphane Rauzy, Corinne Fredouille, Christine Meunier:
Speech reduction: position within French prosodic structure. 117-121
Speech Production
- Ziyu Zhu, Yujie Chi, Zhao Zhang, Kiyoshi Honda, Jianguo Wei:
Transvelar Nasal Coupling Contributing to Speaker Characteristics in Non-nasal Vowels. 122-126 - Yuto Otani, Shun Sawada, Hidefumi Ohmura, Kouichi Katsurada:
Speech Synthesis from Articulatory Movements Recorded by Real-time MRI. 127-131 - Zheng Yuan, Aldo Pastore, Dorina De Jong, Hao Xu, Luciano Fadiga, Alessandro D'Ausilio:
The ART of Conversation: Measuring Phonetic Convergence and Deliberate Imitation in L2-Speech with a Siamese RNN. 132-136 - James J. Mahshie, Michael Larsen:
Did you see that? Exploring the role of vision in the development of consonant feature contrasts in children with cochlear implants. 137-140
Dysarthric Speech Assessment
- Loes van Bemmel, Chiara Pesenti, Xue Wei, Helmer Strik:
Automatic assessments of dysarthric speech: the usability of acoustic-phonetic features. 141-145 - Chowdam Venkata Thirumala Kumar, Tanuka Bhattacharjee, Yamini Belur, Atchayaram Nalini, Ravi Yadav, Prasanta Kumar Ghosh:
Classification of Multi-class Vowels and Fricatives From Patients Having Amyotrophic Lateral Sclerosis with Varied Levels of Dysarthria Severity. 146-150 - Jinzi Qi, Hugo Van hamme:
Parameter-efficient Dysarthric Speech Recognition Using Adapter Fusion and Householder Transformation. 151-155 - Enno Hermann, Mathew Magimai-Doss:
Few-shot Dysarthric Speech Recognition with Text-to-Speech Data Augmentation. 156-160 - Dianna Yee, Colin Lea, Jaya Narain, Zifang Huang, Lauren Tooley, Jeffrey P. Bigham, Leah Findlater:
Latent Phrase Matching for Dysarthric Speech. 161-165 - Eun Jung Yeo, Kwanghee Choi, Sunhee Kim, Minhwa Chung:
Speech Intelligibility Assessment of Dysarthric Speech by using Goodness of Pronunciation with Uncertainty Quantification. 166-170
Speech Coding: Transmission and Enhancement
- Youqiang Zheng, Li Xiao, Weiping Tu, Yuhong Yang, Xinmeng Xu:
CQNV: A Combination of Coarsely Quantized Bitstream and Neural Vocoder for Low Rate Speech Coding. 171-175 - Naoyuki Kamo, Marc Delcroix, Tomohiro Nakatani:
Target Speech Extraction with Conditional Diffusion Model. 176-180 - Elad Cohen, Hai Victor Habi, Arnon Netzer:
Towards Fully Quantized Neural Networks For Speech Enhancement. 181-185 - Youshan Zhang, Jialu Li:
Complex Image Generation SwinTransformer Network for Audio Denoising. 186-190
Speech Recognition: Signal Processing, Acoustic Modeling, Robustness, Adaptation 1
- Yochai Blau, Rohan Agrawal, Lior Madmony, Gary Wang, Andrew Rosenberg, Zhehuai Chen, Zorik Gekhman, Genady Beryozkin, Parisa Haghani, Bhuvana Ramabhadran:
Using Text Injection to Improve Recognition of Personal Identifiers in Speech. 191-195 - Tamás Grósz, Yaroslav Getman, Ragheb Al-Ghezi, Aku Rouhe, Mikko Kurimo:
Investigating wav2vec2 context representations and the effects of fine-tuning, a case-study of a Finnish model. 196-200 - Jan Lehecka, Jan Svec, Josef V. Psutka, Pavel Ircing:
Transformer-based Speech Recognition Models for Oral History Archives in English, German, and Czech. 201-205 - Mayank Kumar Singh, Naoya Takahashi, Naoyuki Onoe:
Iteratively Improving Speech Recognition and Voice Conversion. 206-210 - Kavan Fatehi, Ayse Küçükyilmaz:
LABERT: A Combination of Local Aggregation and Self-Supervised Speech Representation Learning for Detecting Informative Hidden Units in Low-Resource ASR Systems. 211-215 - Hongfei Xue, Qijie Shao, Peikun Chen, Pengcheng Guo, Lei Xie, Jie Liu:
TranUSR: Phoneme-to-word Transcoder Based Unified Speech Representation Learning for Cross-lingual Speech Recognition. 216-220 - Zelin Wu, Tsendsuren Munkhdalai, Pat Rondon, Golan Pundak, Khe Chai Sim, Christopher Li:
Dual-Mode NAM: Effective Top-K Context Injection for End-to-End ASR. 221-225 - Hang Zhou, Xiaoxu Zheng, Yunhe Wang, Michael Bi Mi, Deyi Xiong, Kai Han:
GhostRNN: Reducing State Redundancy in RNN with Cheap Operations. 226-230 - Haoyu Wang, Siyuan Wang, Wei-Qiang Zhang, Hongbin Suo, Yulong Wan:
Task-Agnostic Structured Pruning of Speech Representation Models. 231-235 - Naoyuki Kanda, Takuya Yoshioka, Yang Liu:
Factual Consistency Oriented Speech Recognition. 236-240 - Yassir Fathullah, Chunyang Wu, Yuan Shangguan, Junteng Jia, Wenhan Xiong, Jay Mahadeokar, Chunxi Liu, Yangyang Shi, Ozlem Kalinli, Mike Seltzer, Mark J. F. Gales:
Multi-Head State Space Model for Speech Recognition. 241-245 - Yingying Gao, Shilei Zhang, Zihao Cui, Chao Deng, Junlan Feng:
Cascaded Multi-task Adaptive Learning Based on Neural Architecture Search. 246-250 - Kinan Martin, Jon Gauthier, Canaan Breiss, Roger Levy:
Probing Self-supervised Speech Models for Phonetic and Phonemic Information: A Case Study in Aspiration. 251-255 - Philip Harding, Sibo Tong, Simon Wiesler:
Selective Biasing with Trie-based Contextual Adapters for Personalised Speech Recognition using Neural Transducers. 256-260
Analysis of Speech and Audio Signals 1
- Xiao-Min Zeng, Yan Song, Ian McLoughlin, Lin Liu, Li-Rong Dai:
Robust Prototype Learning for Anomalous Sound Detection. 261-265 - Saksham Singh Kushwaha, Magdalena Fuentes:
A multimodal prototypical approach for unsupervised sound classification. 266-270 - Penghui Wen, Kun Hu, Wenxi Yue, Sen Zhang, Wanlei Zhou, Zhiyong Wang:
Robust Audio Anti-Spoofing with Fusion-Reconstruction Learning on Multi-Order Spectrograms. 271-275 - Jinhua Liang, Xubo Liu, Haohe Liu, Huy Phan, Emmanouil Benetos, Mark D. Plumbley, Wenwu Wang:
Adapting Language-Audio Models as Few-Shot Audio Learners. 276-280 - Mengwei Wang, Zhe Yang:
TFECN: Time-Frequency Enhanced ConvNet for Audio Classification. 281-285 - Won-Gook Choi, Joon-Hyuk Chang:
Resolution Consistency Training on Time-Frequency Domain for Semi-Supervised Sound Event Detection. 286-290 - Kang Li, Yan Song, Ian McLoughlin, Lin Liu, Jin Li, Li-Rong Dai:
Fine-tuning Audio Spectrogram Transformer with Task-aware Adapters for Sound Event Detection. 291-295 - Dianwen Ng, Yang Xiao, Jia Qi Yip, Zhao Yang, Biao Tian, Qiang Fu, Eng Siong Chng, Bin Ma:
Small Footprint Multi-channel Network for Keyword Spotting with Centroid Based Awareness. 296-300 - Wei Xie, Yanxiong Li, Qianhua He, Wenchang Cao, Tuomas Virtanen:
Few-shot Class-incremental Audio Classification Using Adaptively-refined Prototypes. 301-305 - Mohammad Hassan Vali, Tom Bäckström:
Interpretable Latent Space Using Space-Filling Curves for Phonetic Analysis in Voice Conversion. 306-310 - Eduard Tulchinskii, Kristian Kuznetsov, Laida Kushnareva, Daniil Cherniavskii, Serguei Barannikov, Irina Piontkovskaya, Sergey I. Nikolenko, Evgeny Burnaev:
Topological Data Analysis for Speech Processing. 311-315 - Kangwook Jang, Sungnyun Kim, Se-Young Yun, Hoirin Kim:
Recycle-and-Distill: Universal Compression Strategy for Transformer-based Speech SSL Models with Attention Map Reusing and Masking Distillation. 316-320 - Timm Koppelmann, Semih Agcaer, Rainer Martin:
Personalized Acoustic Scene Classification in Ultra-low Power Embedded Devices Using Privacy-preserving Data Augmentation. 321-325 - Wei-Cheng Lin, Luca Bondi, Shabnam Ghaffarzadegan:
Background Domain Switch: A Novel Data Augmentation Technique for Robust Sound Event Detection. 326-330 - Yuanbo Hou, Siyang Song, Cheng Luo, Andrew Mitchell, Qiaoqiao Ren, Weicheng Xie, Jian Kang, Wenwu Wang, Dick Botteldooren:
Joint Prediction of Audio Event and Annoyance Rating in an Urban Soundscape by Hierarchical Graph Representation Learning. 331-335 - Hejing Zhang, Jian Guan, Qiaoxi Zhu, Feiyang Xiao, Youde Liu:
Anomalous Sound Detection Using Self-Attention-Based Frequency Pattern Analysis of Machine Sounds. 336-340 - Yifei Xin, Yuexian Zou:
Improving Audio-Text Retrieval via Hierarchical Cross-Modal Interaction and Auxiliary Captions. 341-345 - Suhas BN, Sarah Rajtmajer, Saeed Abdullah:
Differential Privacy enabled Dementia Classification: An Exploration of the Privacy-Accuracy Trade-off in Speech Signal Data. 346-350 - Shijun Wang, Jón Guðnason, Damian Borth:
Learning Emotional Representations from Imbalanced Speech Data for Speech Emotion Recognition and Emotional Text-to-Speech. 351-355 - Swarup Ranjan Behera, Pailla Balakrishna Reddy, Achyut Mani Tripathi, Megavath Bharadwaj Rathod, Tejesh Karavadi:
Towards Multi-Lingual Audio Question Answering. 356-360
Speech Recognition: Architecture, Search, and Linguistic Components 1
- Hanan Aldarmaki, Ahmad Ghannam:
Diacritic Recognition Performance in Arabic ASR. 361-365 - Jari Kolehmainen, Yile Gu, Aditya Gourav, Prashanth Gurunath Shivakumar, Ankur Gandhe, Ariya Rastrow, Ivan Bulyko:
Personalization for BERT-based Discriminative Speech Recognition Rescoring. 366-370 - Aravind Krishnan, Jesujoba O. Alabi, Dietrich Klakow:
On the N-gram Approximation of Pre-trained Language Models. 371-375 - Tianyu Huang, Chung Hoon Hong, Carl Wivagg, Kanna Shimizu:
Record Deduplication for Entity Distribution Modeling in ASR Transcripts. 376-380 - Aakriti Agrawal, Milind Rao, Anit Kumar Sahu, Gopinath Chennupati, Andreas Stolcke:
Learning When to Trust Which Teacher for Weakly Supervised ASR. 381-385 - Lu Huang, Boyu Li, Jun Zhang, Lu Lu, Zejun Ma:
Text-only Domain Adaptation using Unified Speech-Text Representation in Transducer. 386-390
Speech Recognition: Technologies and Systems for New Applications 1
- Puyuan Peng, Shang-Wen Li, Okko Räsänen, Abdelrahman Mohamed, David Harwath:
Syllable Discovery and Cross-Lingual Generalization in a Visually Grounded, Self-Supervised Speech Model. 391-395 - Puyuan Peng, Brian Yan, Shinji Watanabe, David Harwath:
Prompting the Hidden Talent of Web-Scale Speech Models for Zero-Shot Task Generalization. 396-400 - Roger K. Moore, Ricard Marxer:
Progress and Prospects for Spoken Language Technology: Results from Five Sexennial Surveys. 401-405 - Ramon Sanabria, Ondrej Klejch, Hao Tang, Sharon Goldwater:
Acoustic Word Embeddings for Untranscribed Target Languages with Continued Pretraining and Learned Pooling. 406-410 - Mohan Shi, Zhihao Du, Qian Chen, Fan Yu, Yangze Li, Shiliang Zhang, Jie Zhang, Li-Rong Dai:
CASA-ASR: Context-Aware Speaker-Attributed ASR. 411-415 - Shun Takahashi, Sakriani Sakti:
Unsupervised Learning of Discrete Latent Representations with Data-Adaptive Dimensionality from Continuous Speech Streams. 416-420 - Gaobin Yang, Jun Du, Maokui He, Shutong Niu, Baoxiang Li, Jiakui Li, Chin-Hui Lee:
AD-TUNING: An Adaptive CHILD-TUNING Approach to Efficient Hyperparameter Optimization of Child Networks for Speech Processing Tasks in the SUPERB Benchmark. 421-425 - Jeremy H. M. Wong, Huayun Zhang, Nancy F. Chen:
Distilling knowledge from Gaussian process teacher to neural network student. 426-430 - Saurabhchand Bhati, Jesús Villalba, Laureano Moro-Velázquez, Thomas Thebaud, Najim Dehak:
Segmental SpeechCLIP: Utilizing Pretrained Image-text Models for Audio-Visual Learning. 431-435 - Christiaan Jacobs, Nathanaël Carraz Rakotonirina, Everlyn Asiko Chimoto, Bruce A. Bassett, Herman Kamper:
Towards hate speech detection in low-resource languages: Comparing ASR to acoustic word embeddings on Wolof and Swahili. 436-440 - Ruan van der Merwe, Herman Kamper:
Mitigating Catastrophic Forgetting for Few-Shot Spoken Word Classification Through Meta-Learning. 441-445 - Martin Polácek, Petr Cerva, Jindrich Zdánský, Lenka Weingartová:
Online Punctuation Restoration using ELECTRA Model for streaming ASR Systems. 446-450 - Szu-Jui Chen, Debjyoti Paul, Yutong Pang, Peng Su, Xuedong Zhang:
Language Agnostic Data-Driven Inverse Text Normalization. 451-455 - Zih-Ching Chen, Chao-Han Huck Yang, Bo Li, Yu Zhang, Nanxin Chen, Shuo-Yiin Chang, Rohit Prabhavalkar, Hung-yi Lee, Tara N. Sainath:
How to Estimate Model Transferability of Pre-Trained Speech Models? 456-460 - Mana Ihori, Hiroshi Sato, Tomohiro Tanaka, Ryo Masumura, Saki Mizuno, Nobukatsu Hojo:
Transcribing Speech as Spoken and Written Dual Text Using an Autoregressive Model. 461-465
Lexical and Language Modeling for ASR
- Kamer Ali Yuksel, Thiago Castro Ferreira, Golara Javadi, Mohamed Al-Badrashiny, Ahmet Gunduz:
NoRefER: a Referenceless Quality Metric for Automatic Speech Recognition via Semi-Supervised Language Model Fine-Tuning with Contrastive Learning. 466-470 - Yile Gu, Prashanth Gurunath Shivakumar, Jari Kolehmainen, Ankur Gandhe, Ariya Rastrow, Ivan Bulyko:
Scaling Laws for Discriminative Speech Recognition Rescoring Models. 471-475 - Hong Liu, Zhaobiao Lv, Zhijian Ou, Wenbo Zhao, Qing Xiao:
Exploring Energy-based Language Models with Different Architectures and Training Methods for Speech Recognition. 476-480 - Yukun Feng, Ming Tu, Rui Xia, Chuanzeng Huang, Yuxuan Wang:
Memory Augmented Lookup Dictionary Based Language Modeling for Automatic Speech Recognition. 481-485 - Yu Iwamoto, Takahiro Shinozaki:
Memory Network-Based End-To-End Neural ES-KMeans for Improved Word Segmentation. 486-490 - Yui Sudo, Kazuya Hata, Kazuhiro Nakadai:
Retraining-free Customized ASR for Enharmonic Words Based on a Named-Entity-Aware Model and Phoneme Similarity Estimation. 491-495
Language Identification and Diarization
- Winstead Zhu, Md. Iftekhar Tanveer, Yang Janet Liu, Seye Ojumu, Rosie Jones:
Lightweight and Efficient Spoken Language Identification of Long-form Audio. 496-500 - Jagabandhu Mishra, Jayadev N. Patil, Amartya Chowdhury, S. R. Mahadeva Prasanna:
End to End Spoken Language Diarization with Wav2vec Embeddings. 501-505 - Oriol Nieto, Zeyu Jin, Franck Dernoncourt, Justin Salamon:
Efficient Spoken Language Recognition via Multilabel Classification. 506-510 - Pavel Matejka, Anna Silnova, Josef Slavícek, Ladislav Mosner, Oldrich Plchot, Michal Klco, Junyi Peng, Themos Stafylakis, Lukás Burget:
Description and Analysis of ABC Submission to NIST LRE 2022. 511-515 - Tanel Alumäe, Kunnar Kukk, Viet Bac Le, Claude Barras, Abdel Messaoudi, Waad Ben Kheder:
Exploring the Impact of Pretrained Models and Web-Scraped Data for the 2022 NIST Language Recognition Evaluation. 516-520 - Jesús Villalba, Jonas Borgstrom, Maliha Jahan, Saurabh Kataria, Leibny Paola García, Pedro A. Torres-Carrasquillo, Najim Dehak:
Advances in Language Recognition in Low Resource African Languages: The JHU-MIT Submission for NIST LRE22. 521-525
Speech Quality Assessment
- Xinyu Liang, Fredrik Cumlin, Christian Schüldt, Saikat Chatterjee:
DeePMOS: Deep Posterior Mean-Opinion-Score of Speech. 526-530 - Ashwini Dasare, Pradyoth Hegde, Supritha M. Shetty, Deepak K. T.:
The Role of Formant and Excitation Source Features in Perceived Naturalness of Low Resource Tribal Language TTS: An Empirical Study. 531-535 - Wuxuan Gong, Jing Wang, Yitong Liu, Hongwen Yang:
A no-reference speech quality assessment method based on neural network with densely connected convolutional architecture. 536-540 - Bao Thang Ta, Minh Tu Le, Nhat Minh Le, Van Hai Do:
Probing Speech Quality Information in ASR Systems. 541-545 - Cheng-Hung Hu, Yusuke Yasuda, Tomoki Toda:
Preference-based training framework for automatic speech quality assessment using deep neural network. 546-550 - Wannaphong Phatthiyaphaibun, Chompakorn Chaksangchaichot, Thanawin Rakthanmanon, Ekapol Chuangsuwanich, Sarana Nutanong:
Crowdsourced Data Validation for ASR Training. 551-555
Feature Modeling for ASR
- Zhouyuan Huo, Khe Chai Sim, Dongseong Hwang, Tsendsuren Munkhdalai, Tara N. Sainath, Pedro Moreno Mengibar:
Re-investigating the Efficient Transfer Learning of Speech Foundation Model using Feature Fusion Methods. 556-560 - Gege Qi, Yuefeng Chen, Xiaofeng Mao, Xiaojun Jia, Ranjie Duan, Rong Zhang, Hui Xue:
Robust Automatic Speech Recognition via WavAugment Guided Phoneme Adversarial Training. 561-565 - Zhi-Hao Lai, Tian-Hao Zhang, Qi Liu, Xinyuan Qian, Li-Fang Wei, Feng Chen, Song-Lu Chen, Xu-Cheng Yin:
InterFormer: Interactive Local and Global Features Fusion for Automatic Speech Recognition. 566-570 - Yizhou Tan, Haojun Ai, Shengchen Li, Feng Zhang:
Transductive Feature Space Regularization for Few-shot Bioacoustic Event Detection. 571-575 - Jisung Wang, Haram Lee, Myungwoo Oh:
Incorporating L2 Phonemes Using Articulatory Features for Robust Speech Recognition. 576-580 - Titouan Parcollet, Shucong Zhang, Rogier van Dalen, Alberto Gil C. P. Ramos, Sourav Bhattacharya:
On the (In)Efficiency of Acoustic Feature Extractors for Self-Supervised Speech Representation Learning. 581-585
Interfacing Speech Technology and Phonetics
- Louis ten Bosch, Martijn Bentum, Lou Boves:
Phonemic competition in end-to-end ASR models. 586-590 - Vincent Hughes, Jessica Wormald, Paul Foulkes, Philip Harrison, Finnian Kelly, David van der Vloed, Poppy Welch, Chenzi Xu:
Automatic speaker recognition with variation across vocal conditions: a controlled experiment with implications for forensics. 591-595 - Bernhard C. Geiger, Barbara Schuppler:
Exploring Graph Theory Methods For the Analysis of Pronunciation Variation in Spontaneous Speech. 596-600 - Bryony Nuttall, Philip Harrison, Vincent Hughes:
Automatic Speaker Recognition performance with matched and mismatched female bilingual speech data. 601-605
Speech Synthesis: Multilinguality
- Hongsun Yang, Ji-Hoon Kim, Yooncheol Ju, Ilhwan Kim, Byeong-Yeol Kim, Shukjae Choi, Hyung Yong Kim:
FACTSpeech: Speaking a Foreign Language Pronunciation Using Only Your Native Characters. 606-610 - Hoyeon Lee, Hyun-Wook Yoon, Jong-Hwan Kim, Jae-Min Kim:
Cross-Lingual Transfer Learning for Phrase Break Prediction with Multilingual Language Model. 611-615 - Sen Liu, Yiwei Guo, Chenpeng Du, Xie Chen, Kai Yu:
DSE-TTS: Dual Speaker Embedding for Cross-Lingual Text-to-Speech. 616-620 - Konstantinos Markopoulos, Georgia Maniati, Georgios Vamvoukakis, Nikolaos Ellinas, Georgios Vardaxoglou, Panos Kakoulidis, Junkwang Oh, Gunu Jho, Inchul Hwang, Aimilios Chalamandaris, Pirros Tsiakoulis, Spyros Raptis:
Generating Multilingual Gender-Ambiguous Text-to-Speech Voices. 621-625 - Rohan Badlani, Rafael Valle, Kevin J. Shih, João Felipe Santos, Siddharth Gururani, Bryan Catanzaro:
RAD-MMM: Multilingual Multiaccented Multispeaker Text To Speech. 626-630 - Giulia Comini, Manuel Sam Ribeiro, Fan Yang, Heereen Shim, Jaime Lorenzo-Trueba:
Multilingual context-based pronunciation learning for Text-to-Speech. 631-635
Speech Emotion Recognition 1
- Minh Tran, Yufeng Yin, Mohammad Soleymani:
Personalized Adaptation with Pre-trained Speech Encoders for Continuous Emotion Recognition. 636-640 - Huang-Cheng Chou, Lucas Goncalves, Seong-Gyun Leem, Chi-Chun Lee, Carlos Busso:
The Importance of Calibration: Rethinking Confidence and Performance of Speech Multi-label Emotion Classifiers. 641-645 - Mohammad Ibrahim Malik, Siddique Latif, Raja Jurdak, Björn W. Schuller:
A Preliminary Study on Augmenting Speech Emotion Recognition using a Diffusion Model. 646-650 - Basmah Alsenani, Tanaya Guha, Alessandro Vinciarelli:
Privacy Risks in Speech Emotion Recognition: A Systematic Study on Gender Inference Attack. 651-655 - James Tavernor, Matthew Perez, Emily Mower Provost:
Episodic Memory For Domain-Adaptable, Robust Speech Emotion Recognition. 656-660 - Chaoyue Ding, Jiakui Li, Daoming Zong, Baoxiang Li, Tian-Hao Zhang, Qunyan Zhou:
Stable Speech Emotion Recognition with Head-k-Pooling Loss. 661-665
Show and Tell: Health applications and emotion recognition
- Matthew Gibson, Ievgen Karaulov, Oleksii Zhelo, Filip Jurcícek:
A Personalised Speech Communication Application for Dysarthric Speakers. 666-667 - Sun-Kyung Lee, Jong-Hwan Kim:
Video Multimodal Emotion Recognition System for Real World Applications. 668-669 - Mahdin Rohmatillah, Bobbi Aditya, Li-Jen Yang, Bryan Gautama Ngo, Willianto Sulaiman, Jen-Tzung Chien:
Promoting Mental Self-Disclosure in a Spoken Dialogue System. 670-671 - Pawel Bujnowski, Bartlomiej Kuzma, Bartlomiej Paziewski, Jacek Rutkowski, Joanna Marhula, Zuzanna Bordzicka, Piotr Andruszkiewicz:
"Select language, modality or put on a mask!" Experiments with Multimodal Emotion Recognition. 672-673 - Hannah Valentine, Joel MacAuslan, Maria I. Grigos, Marisha Speights:
My Vowels Matter: Formant Automation Tools for Diverse Child Speech. 674-675 - Nicky Chong-White, Arun Sebastian, Jorge Mejia:
NEMA: An Ecologically Valid Tool for Assessing Hearing Devices, Advanced Algorithms, and Communication in Diverse Listening Environments. 676-677 - Vikram Ramanarayanan, David Pautler, Lakshmi Arbatti, Abhishek Hosamath, Michael Neumann, Hardik Kothare, Oliver Roesler, Jackson Liscombe, Andrew Cornish, Doug Habberstad, Vanessa Richter, David Fox, David Suendermann-Oeft, Ira Shoulson:
When Words Speak Just as Loudly as Actions: Virtual Agent Based Remote Health Assessment Integrating What Patients Say with What They Do. 678-679 - Kowshik Siva Sai Motepalli, Vamshiraghusimha Narasinga, Harsha Pathuri, Hina Khan, Sangeetha Mahesh, Ajish K. Abraham, Anil Kumar Vuppala:
Stuttering Detection Application. 680-681 - Mario Zusag, Laurin Wagner:
Providing Interpretable Insights for Neurological Speech and Cognitive Disorders from Interactive Serious Games. 682-683 - Jacob C. Solinsky, Raymond L. Finzel, Martin Michalowski, Serguei Pakhomov:
Automated Neural Nursing Assistant (ANNA): An Over-The-Phone System for Cognitive Monitoring. 684-685 - Ankit Gupta, Abhijeet Bishnu, Mandar Gogate, Kia Dashtipour, Tughrul Arslan, Ahsan Adeel, Amir Hussain, Tharmalingam Ratnarajah, Mathini Sellathurai:
5G-IoT Cloud based Demonstration of Real-Time Audio-Visual Speech Enhancement for Multimodal Hearing-aids. 686-687 - Mohsin Raza, Adewale Adetomi, Khubaib Ahmed, Amir Hussain, Tughrul Arslan, Ahsan Adeel:
Towards Two-point Neuron-inspired Energy-efficient Multimodal Open Master Hearing Aid. 688-689
Spoken Dialog Systems and Conversational Analysis 1
- Xuxin Cheng, Wanshi Xu, Ziyu Yao, Zhihong Zhu, Yaowei Li, Hongxiang Li, Yuexian Zou:
FC-MTLF: A Fine- and Coarse-grained Multi-Task Learning Framework for Cross-Lingual Spoken Language Understanding. 690-694 - Xuxin Cheng, Ziyu Yao, Zhihong Zhu, Yaowei Li, Hongxiang Li, Yuexian Zou:
C²A-SLU: Cross and Contrastive Attention for Improving ASR Robustness in Spoken Language Understanding. 695-699 - Henry Weld, Sijia Hu, Siqu Long, Josiah Poon, Soyeon Caren Han:
Tri-level Joint Natural Language Understanding for Multi-turn Conversational Datasets. 700-704 - Gaëlle Laperrière, Ha Nguyen, Sahar Ghannay, Bassam Jabaian, Yannick Estève:
Semantic Enrichment Towards Efficient Speech Representations. 705-709 - Yosuke Kashiwagi, Siddhant Arora, Hayato Futami, Jessica Huynh, Shih-Lun Wu, Yifan Peng, Brian Yan, Emiru Tsunoo, Shinji Watanabe:
Tensor decomposition for minimization of E2E SLU model toward on-device processing. 710-714 - Tianjun Mao, Chenghong Zhang:
DiffSLU: Knowledge Distillation Based Diffusion Model for Cross-Lingual Spoken Language Understanding. 715-719 - Siddhant Arora, Hayato Futami, Yosuke Kashiwagi, Emiru Tsunoo, Brian Yan, Shinji Watanabe:
Integrating Pretrained ASR and LM to Perform Sequence Generation for Spoken Language Understanding. 720-724 - Zhiyuan Zhu, Yusheng Liao, Yu Wang, Yunfeng Guan:
Contrastive Learning Based ASR Robust Knowledge Selection For Spoken Dialogue System. 725-729 - Seongmin Park, Jinkyu Seo, Jihwa Lee:
Unsupervised Dialogue Topic Segmentation in Hyperdimensional Space. 730-734 - Umberto Cappellazzo, Daniele Falavigna, Alessio Brutti:
An Investigation of the Combination of Rehearsal and Knowledge Distillation in Continual Learning for Spoken Language Understanding. 735-739 - Zhenhe Wu, Xiaoguang Yu, Meng Chen, Liangqing Wu, Jiahao Ji, Zhoujun Li:
Enhancing New Intent Discovery via Robust Neighbor-based Contrastive Learning. 740-744 - Andreas Schwarz, Di He, Maarten Van Segbroeck, Mohammed Hethnawi, Ariya Rastrow:
Personalized Predictive ASR for Latency Reduction in Voice Assistants. 745-749 - Avik Ray, Yilin Shen, Hongxia Jin:
Compositional Generalization in Spoken Language Understanding. 750-754 - Zefei Li, Anil Ramakrishna, Anna Rumshisky, Andy Rosenbaum, Saleh Soltan, Rahul Gupta:
Sampling bias in NLU models: Impact and Mitigation. 755-759 - Jiarui Lu, Bo-Hsiang Tseng, Joel Ruben Antony Moniz, Site Li, Xueyun Zhu, Hong Yu, Murat Akbacak:
5IDER: Unified Query Rewriting for Steering, Intent Carryover, Disfluencies, Entity Carryover and Repair. 760-764 - Xiaohan Shi, Xingfeng Li, Tomoki Toda:
Emotion Awareness in Multi-utterance Turn for Improving Emotion Prediction in Multi-Speaker Conversation. 765-769 - Minghan Wang, Yinglu Li, Jiaxin Guo, Xiaosong Qiao, Zongyao Li, Hengchao Shang, Daimeng Wei, Shimin Tao, Min Zhang, Hao Yang:
WhiSLU: End-to-End Spoken Language Understanding with Whisper. 770-774
Speech Coding and Enhancement 1
- Chuan Wen, Sarah Verhulst:
Biophysically-inspired single-channel speech enhancement in the time domain. 775-779 - Md Asif Jalal, Pablo Peso Parada, Jisi Zhang, Mete Ozay, Karthikeyan Saravanan, Myoungji Han, Jungin Lee, Seokyeong Jung:
On-Device Speaker Anonymization of Acoustic Embeddings for ASR based on Flexible Location Gradient Reversal Layer. 780-784 - Hye-jin Shim, Rosa González Hautamäki, Md. Sahidullah, Tomi Kinnunen:
How to Construct Perfect and Worse-than-Coin-Flip Spoofing Countermeasures: A Word of Warning on Shortcut Learning. 785-789 - Zhifeng Kong, Wei Ping, Ambrish Dantrey, Bryan Catanzaro:
CleanUNet 2: A Hybrid Speech Denoising Model on Waveform and Spectrogram. 790-794 - Zhuangqi Chen, Xianjun Xia, Cheng Chen, Xianke Wang, Yanhong Leng, Li Chen, Roberto Togneri, Yijian Xiao, Piao Ding, Shenyi Song, Pingjian Zhang:
A Two-stage Progressive Neural Network for Acoustic Echo Cancellation. 795-799 - Linping Xu, Jiawei Jiang, Dejun Zhang, Xianjun Xia, Li Chen, Yijian Xiao, Piao Ding, Shenyi Song, Sixing Yin, Ferdous Sohel:
An Intra-BRNN and GB-RVQ Based END-TO-END Neural Audio Codec. 800-803 - Shucong Zhang, Malcolm Chadwick, Alberto Gil C. P. Ramos, Titouan Parcollet, Rogier van Dalen, Sourav Bhattacharya:
Real-Time Personalised Speech Enhancement Transformers with Dynamic Cross-attended Speaker Representations. 804-808 - Nursadul Mamun, John H. L. Hansen:
CFTNet: Complex-valued Frequency Transformation Network for Speech Enhancement. 809-813 - Hejung Yang, Hong-Goo Kang:
Feature Normalization for Fine-tuning Self-Supervised Models in Speech Enhancement. 814-818 - Wei Xiao, Wenzhe Liu, Meng Wang, Shan Yang, Yupeng Shi, Yuyong Kang, Dan Su, Shidong Shang, Dong Yu:
Multi-mode Neural Speech Coding Based on Deep Generative Networks. 819-823 - Soo Hyun Bae, Seok Wan Chae, Youngseok Kim, Keunsang Lee, Hyunjin Lim, Lae-Hoon Kim:
Streaming Dual-Path Transformer for Speech Enhancement. 824-828 - Mahsa Kadkhodaei Elyaderani, Shahram Shirani:
Sequence-to-Sequence Multi-Modal Speech In-Painting. 829-833 - Hao Zhang, Meng Yu, Yuzhong Wu, Tao Yu, Dong Yu:
Hybrid AHS: A Hybrid of Kalman Filter and Deep Learning for Acoustic Howling Suppression. 834-838 - Chun-Wei Ho, Chao-Han Huck Yang, Sabato Marco Siniscalchi:
Differentially Private Adapters for Parameter Efficient Acoustic Modeling. 839-843 - Rui-Chen Zheng, Yang Ai, Zhen-Hua Ling:
Incorporating Ultrasound Tongue Images for Audio-Visual Speech Enhancement through Knowledge Distillation. 844-848 - Yasufumi Uezu, Sicheng Wang, Teruki Toya, Masashi Unoki:
Consonant-emphasis Method Incorporating Robust Consonant-section Detection to Improve Intelligibility of Bone-conducted speech. 849-853 - Hiroshi Sato, Ryo Masumura, Tsubasa Ochiai, Marc Delcroix, Takafumi Moriya, Takanori Ashihara, Kentaro Shinayama, Saki Mizuno, Mana Ihori, Tomohiro Tanaka, Nobukatsu Hojo:
Downstream Task Agnostic Speech Enhancement with Self-Supervised Representation Loss. 854-858 - Joon Byun, Seungmin Shin, Jongmo Sung, Seungkwon Beack, Youngcheol Park:
Perceptual Improvement of Deep Neural Network (DNN) Speech Coder Using Parametric and Non-parametric Density Models. 859-863 - Dongheon Lee, Dayun Choi, Jung-Woo Choi:
DeFT-AN RT: Real-time Multichannel Speech Enhancement using Dense Frequency-Time Attentive Network and Non-overlapping Synthesis Window. 864-868
Speech Recognition: Signal Processing, Acoustic Modeling, Robustness, Adaptation 2
- Kyungmin Lee, Haeri Kim, Sichen Jin, Jinhwan Park, Youngho Han:
A More Accurate Internal Language Model Score Estimation for the Hybrid Autoregressive Transducer. 869-873 - Kyungmin Lee, Hyeontaek Lim, Munhwan Lee, Hong-Gee Kim:
Attention Gate Between Capsules in Fully Capsule-Network Speech Recognition. 874-878 - Jiatong Shi, Dan Berrebbi, William Chen, En-Pei Hu, Wei-Ping Huang, Ho-Lam Chung, Xuankai Chang, Shang-Wen Li, Abdelrahman Mohamed, Hung-yi Lee, Shinji Watanabe:
ML-SUPERB: Multilingual Speech Universal PERformance Benchmark. 884-888 - Do-Hee Kim, Daeyeol Shim, Joon-Hyuk Chang:
General-purpose Adversarial Training for Enhanced Automatic Speech Recognition Model Generalization. 889-893 - Keke Zhao, Peng Song, Shaokai Li, Wenming Zheng:
Joint Instance Reconstruction and Feature Subspace Alignment for Cross-Domain Speech Emotion Recognition. 894-898 - Takafumi Moriya, Hiroshi Sato, Tsubasa Ochiai, Marc Delcroix, Takanori Ashihara, Kohei Matsuura, Tomohiro Tanaka, Ryo Masumura, Atsunori Ogawa, Taichi Asami:
Knowledge Distillation for Neural Transducer-based Target-Speaker ASR: Exploiting Parallel Mixture/Single-Talker Speech Data. 899-903 - Yist Y. Lin, Tao Han, Haihua Xu, Van Tung Pham, Yerbolat Khassanov, Tze Yuang Chong, Yi He, Lu Lu, Zejun Ma:
Random Utterance Concatenation Based Data Augmentation for Improving Short-video Speech Recognition. 904-908 - Nithish Muthuchamy Selvaraj, Xiaobao Guo, Adams Wai-Kin Kong, Bingquan Shen, Alex C. Kot:
Adapter Incremental Continual Learning of Efficient Audio Spectrogram Transformers. 909-913 - Tian-Hao Zhang, Haibo Qin, Zhi-Hao Lai, Song-Lu Chen, Qi Liu, Feng Chen, Xinyuan Qian, Xu-Cheng Yin:
Rethinking Speech Recognition with A Multimodal Perspective via Acoustic and Semantic Cooperative Decoding. 914-918 - Zheng Liang, Zheshu Song, Ziyang Ma, Chenpeng Du, Kai Yu, Xie Chen:
Improving Code-Switching and Name Entity Recognition in ASR with Speech Editing based Data Augmentation. 919-923 - Dongji Gao, Matthew Wiesner, Hainan Xu, Leibny Paola García, Daniel Povey, Sanjeev Khudanpur:
Bypass Temporal Classification: Weakly Supervised Automatic Speech Recognition with Imperfect Transcripts. 924-928 - Shubo Lv, Xiong Wang, Sining Sun, Long Ma, Lei Xie:
DCCRN-KWS: An Audio Bias Based Model for Noise Robust Small-Footprint Keyword Spotting. 929-933 - Li Fu, Siqi Li, Qingtao Li, Fangzhu Li, Liping Deng, Fan Lu, Meng Chen, Youzheng Wu, Xiaodong He:
OTF: Optimal Transport based Fusion of Supervised and Self-Supervised Learning Models for Automatic Speech Recognition. 934-938 - Maurits J. R. Bleeker, Pawel Swietojanski, Stefan Braun, Xiaodan Zhuang:
Approximate Nearest Neighbour Phrase Mining for Contextual Speech Recognition. 939-943 - Steven Vander Eeckt, Hugo Van hamme:
Rehearsal-Free Online Continual Learning for Automatic Speech Recognition. 944-948
Speech Recognition: Technologies and Systems for New Applications 2
- Kaiqi Fu, Shaojun Gao, Shuju Shi, Xiaohai Tian, Wei Li, Zejun Ma:
Phonetic and Prosody-aware Self-supervised Learning Approach for Non-native Fluency Scoring. 949-953 - Shuju Shi, Kaiqi Fu, Yiwei Gu, Xiaohai Tian, Shaojun Gao, Wei Li, Zejun Ma:
Disentangling the Contribution of Non-native Speech in Automated Pronunciation Assessment. 954-958 - Hyungshin Ryu, Sunhee Kim, Minhwa Chung:
A Joint Model for Pronunciation Assessment and Mispronunciation Detection and Diagnosis with Multi-task Learning. 959-963 - Xing Wei, Roeland van Hout, Catia Cucchiarini, Danielle Reuvekamp, Helmer Strik:
Assessing Intelligibility in Non-native Speech: Comparing Measures Obtained at Different Levels. 964-968 - Yukang Liang, Kaitao Song, Shaoguang Mao, Huiqiang Jiang, Luna Qiu, Yuqing Yang, Dongsheng Li, Linli Xu, Lili Qiu:
End-to-End Word-Level Pronunciation Assessment with MASK Pre-training. 969-973 - Fu-An Chao, Tien-Hong Lo, Tzu-I Wu, Yao-Ting Sung, Berlin Chen:
A Hierarchical Context-aware Modeling Approach for Multi-aspect and Multi-granular Pronunciation Assessment. 974-978 - Yingxiang Gao, Jaehyun Choi, Nobuaki Minematsu, Noriko Nakanishi, Daisuke Saito:
Automatic Prediction of Language Learners' Listenability Using Speech and Text Features Extracted from Listening Drills. 979-983 - Ram C. M. C. Shekar, Mu Yang, Kevin Hirschi, Stephen D. Looney, Okim Kang, John H. L. Hansen:
Assessment of Non-Native Speech Intelligibility using Wav2vec2-based Mispronunciation Detection and Multi-level Goodness of Pronunciation Transformer. 984-988 - Rao Ma, Mengjie Qian, Mark J. F. Gales, Kate M. Knill:
Adapting an Unadaptable ASR System. 989-993 - Jungbae Park, Seungtaek Choi:
Addressing Cold Start Problem for End-to-end Automatic Speech Scoring. 994-998 - Manuel Sam Ribeiro, Giulia Comini, Jaime Lorenzo-Trueba:
Improving grapheme-to-phoneme conversion by learning pronunciations from speech recordings. 999-1003 - Caitlin Richter, Ragnar Pálsson, Luke O'Brien, Kolbrún Friðriksdóttir, Branislav Bédi, Eydís Huld Magnúsdóttir, Jón Guðnason:
Orthography-based Pronunciation Scoring for Better CAPT Feedback. 1004-1008 - Hongfu Liu, Mingqian Shi, Ye Wang:
Zero-Shot Automatic Pronunciation Assessment. 1009-1013 - Tuong Tu Huu, Viet-Thanh Pham, Thi Thu Trang Nguyen, Thai Lai Dao:
Mispronunciation detection and diagnosis model for tonal language, applied to Vietnamese. 1014-1018
Keynote 2
- Virginia Dignum:
Beyond the AI hype: Balancing Innovation and Social Responsibility. 1019
Paralinguistics 1
- Georg Stemmer, Paulo López-Meyer, Juan A. del Hoyo Ontiveros, Jose A. Lopez, Héctor A. Cordourier, Tobias Bocklet:
Detection of Emotional Hotspots in Meetings Using a Cross-Corpus Approach. 1020-1024 - Takuto Matsuda, Yoshiko Arimoto:
Detection of Laughter and Screaming Using the Attention and CTC Models. 1025-1029 - Debasmita Bhattacharya, Jie Chi, Julia Hirschberg, Peter Bell:
Capturing Formality in Speech Across Domains and Languages. 1030-1034 - Jialu Li, Mark Hasegawa-Johnson, Nancy L. McElwain:
Towards Robust Family-Infant Audio Analysis Based on Unsupervised Pretraining of Wav2vec 2.0 on Large-Scale Unlabeled Family Audio. 1035-1039 - Kathrin Feindt, Martina Rossi, Ghazaleh Esfandiari-Baiat, Axel G. Ekström, Margaret Zellers:
Cues to next-speaker projection in conversational Swedish: Evidence from reaction times. 1040-1044 - Abeer A. N. Buker, Huda Alsofyani, Alessandro Vinciarelli:
Multiple Instance Learning for Inference of Child Attachment From Paralinguistic Aspects of Speech. 1045-1049
Speech Enhancement and Denoising
- Sefik Emre Eskimez, Takuya Yoshioka, Alex Ju, Min Tang, Tanel Pärnamaa, Huaming Wang:
Real-Time Joint Personalized Speech Enhancement and Acoustic Echo Cancellation. 1050-1054 - Andong Li, Weixin Meng, Guochen Yu, Wenzhe Liu, Xiaodong Li, Chengshi Zheng:
TaylorBeamixer: Learning Taylor-Inspired All-Neural Multi-Channel Speech Enhancement from Beam-Space Dictionary Perspective. 1055-1059 - Yulong Wang, Xueliang Zhang:
MFT-CRN: Multi-scale Fourier Transform for Monaural Speech Enhancement. 1060-1064 - Zilu Guo, Jun Du, Chin-Hui Lee, Yu Gao, Wenbin Zhang:
Variance-Preserving-Based Interpolation Diffusion Models for Speech Enhancement. 1065-1069 - Hassan Taherian, Ashutosh Pandey, Daniel Wong, Buye Xu, DeLiang Wang:
Multi-input Multi-output Complex Spectral Mapping for Speaker Separation. 1070-1074 - Maurice Oberhag, Daniel Neudek, Rainer Martin, Tobias Rosenkranz, Henning Puder:
Short-term Extrapolation of Speech Signals Using Recursive Neural Networks in the STFT Domain. 1075-1079
Speech Synthesis: Evaluation
- Ayushi Pandey, Jens Edlund, Sébastien Le Maguer, Naomi Harte:
Listener sensitivity to deviating obstruents in WaveNet. 1080-1084 - Joonyong Park, Shinnosuke Takamichi, Tomohiko Nakamura, Kentaro Seki, Detai Xin, Hiroshi Saruwatari:
How Generative Spoken Language Modeling Encodes Noisy Speech: Investigation from Phonetics to Syntactics. 1085-1089 - Joshua Camp, Tom Kenter, Lev Finkelstein, Rob Clark:
MOS vs. AB: Evaluating Text-to-Speech Systems Reliably Using Clustered Standard Errors. 1090-1094 - Hui Wang, Shiwan Zhao, Xiguang Zheng, Yong Qin:
RAMP: Retrieval-Augmented MOS Prediction via Confidence-based Dynamic Weighting. 1095-1099 - Gerda Ana Melnik-Leroy, Gediminas Navickas:
Can Better Perception Become a Disadvantage? Synthetic Speech Perception in Congenitally Blind Users. 1100-1103 - Erica Cooper, Junichi Yamagishi:
Investigating Range-Equalizing Bias in Mean Opinion Score Ratings of Synthesized Speech. 1104-1108
End-to-end Spoken Dialog Systems
- Mutian He, Philip N. Garner:
Can ChatGPT Detect Intent? Evaluating Large Language Models for Spoken Language Understanding. 1109-1113 - Shangeth Rajaa:
Improving End-to-End SLU performance with Prosodic Attention and Distillation. 1114-1118 - Suyoun Kim, Akshat Shrivastava, Duc Le, Ju Lin, Ozlem Kalinli, Michael L. Seltzer:
Modality Confidence Aware Training for Robust End-to-End Spoken Language Understanding. 1119-1123 - Lingyan Huang, Tao Li, Haodong Zhou, Qingyang Hong, Lin Li:
Cross-Modal Semantic Alignment before Fusion for Two-Pass End-to-End Spoken Language Understanding. 1124-1128 - Vishal Sunder, Eric Fosler-Lussier, Samuel Thomas, Hong-Kwang Jeff Kuo, Brian Kingsbury:
ConvKT: Conversation-Level Knowledge Transfer for Context Aware End-to-End Spoken Language Understanding. 1129-1133 - Xuxin Cheng, Zhihong Zhu, Ziyu Yao, Hongxiang Li, Yaowei Li, Yuexian Zou:
GhostT5: Generate More Features with Cheap Operations to Improve Textless Spoken Question Answering. 1134-1138
Biosignal-enabled Spoken Communication
- Kaibo Zhang, Lili Cao, Yiming Ding, Yanru Li, Chao Zhang, Ji Wu, Demin Han:
Obstructive Sleep Apnea Detection using Pre-trained Speech Representations. 1139-1143 - Ruicong Wang, Siqi Cai, Haizhou Li:
EEG-based Auditory Attention Detection with Spatiotemporal Graph and Graph Convolutional Network. 1144-1148 - Rachel Beeson, Korin Richmond:
Silent Speech Recognition with Articulator Positions Estimated from Tongue Ultrasound and Lip Video. 1149-1153 - Kai Yang, Zhuang Xie, Di Zhou, Longbiao Wang, Gaoyan Zhang:
Auditory Attention Detection in Real-Life Scenarios Using Common Spatial Patterns from EEG. 1154-1158 - Soowon Kim, Young-Eun Lee, Seo-Hyun Lee, Seong-Whan Lee:
Diff-E: Diffusion-based Learning for Decoding Imagined Speech EEG. 1159-1163 - Tamás Gábor Csapó, Frigyes Viktor Arthur, Péter Nagy, Ádám Boncz:
Towards Ultrasound Tongue Image prediction from EEG during speech production. 1164-1168 - László Tóth, Amin Honarmandi Shandiz, Gábor Gosztolya, Tamás Gábor Csapó:
Adaptation of Tongue Ultrasound-Based Silent Speech Interfaces Using Spatial Transformer Networks. 1169-1173 - Kevin Scheck, Tanja Schultz:
STE-GAN: Speech-to-Electromyography Signal Conversion using Generative Adversarial Networks. 1174-1178 - Inge Salomons, Eder del Blanco, Eva Navas, Inma Hernáez:
Spanish Phone Confusion Analysis for EMG-Based Silent Speech Interfaces. 1179-1183 - Huiyan Li, Mingyi Wang, Han Gao, Shuo Zhao, Guang Li, You Wang:
Hybrid Silent Speech Interface Through Fusion of Electroencephalography and Electromyography. 1184-1188
Neural-based Speech and Acoustic Analysis
- Eklavya Sarkar, Mathew Magimai-Doss:
Can Self-Supervised Neural Representations Pre-Trained on Human Speech distinguish Animal Callers? 1189-1193 - Jinjin Cai, Sudip Vhaduri, Xiao Luo:
Discovering COVID-19 Coughing and Breathing Patterns from Unlabeled Data Using Contrastive Learning with Varying Pre-Training Domains. 1194-1198 - Yifei Xin, Dongchao Yang, Yuexian Zou:
Background-aware Modeling for Weakly Supervised Sound Event Detection. 1199-1203 - Prerak Srivastava, Antoine Deleforge, Archontis Politis, Emmanuel Vincent:
How to (Virtually) Train Your Speaker Localizer. 1204-1208 - Sreyan Ghosh, Utkarsh Tyagi, S. Ramaneswaran, Harshvardhan Srivastava, Dinesh Manocha:
MMER: Multimodal Multi-task Learning for Speech Emotion Recognition. 1209-1213 - Tanmay Khandelwal, Rohan Kumar Das:
A Multi-Task Learning Framework for Sound Event Detection using High-level Acoustic Characteristics of Sounds. 1214-1218
DiGo - Dialog for Good: Speech and Language Technology for Social Good
- Michael Neumann, Hardik Kothare, Doug Habberstad, Vikram Ramanarayanan:
A Multimodal Investigation of Speech, Text, Cognitive and Facial Video Features for Characterizing Depression With and Without Medication. 1219-1223 - Angus Addlesee, Marco Damonte:
Understanding Disrupted Sentences Using Underspecified Abstract Meaning Representation. 1224-1228 - Anjalie Field, Prateek Verma, Nay San, Jennifer L. Eberhardt, Dan Jurafsky:
Developing Speech Processing Pipelines for Police Accountability. 1229-1233 - Éva Székely, Joakim Gustafson, Ilaria Torre:
Prosody-controllable Gender-ambiguous Speech Synthesis: A Tool for Investigating Implicit Bias in Speech Perception. 1234-1238 - Jean-Luc Rouas, Yaru Wu, Takaaki Shochi:
Affective attributes of French caregivers' professional speech. 1239-1243
Speech Recognition: Signal Processing, Acoustic Modeling, Robustness, Adaptation 3
- Edresson Casanova, Christopher Shulby, Alexander Korolev, Arnaldo Cândido Júnior, Anderson da Silva Soares, Sandra M. Aluísio, Moacir Antonelli Ponti:
ASR data augmentation in low-resource settings using cross-lingual multi-speaker TTS and cross-lingual voice conversion. 1244-1248 - Yue Gu, Zhihao Du, Shiliang Zhang, Qian Chen, Jiqing Han:
Personality-aware Training based Speaker Adaptation for End-to-end Speech Recognition. 1249-1253 - Aoi Ito, Tatsuya Komatsu, Yusuke Fujita, Yusuke Kida:
Target Vocabulary Recognition Based on Multi-Task Learning with Decomposed Teacher Sequences. 1254-1258 - Gaofei Shen, Afra Alishahi, Arianna Bisazza, Grzegorz Chrupala:
Wave to Syntax: Probing spoken language models for syntax. 1259-1263 - Burin Naowarat, Philip Harding, Pasquale D'Alterio, Sibo Tong, Bashar Awwad Shiekh Hasan:
Effective Training of Attention-based Contextual Biasing Adapters with Synthetic Audio for Personalised ASR. 1264-1268 - Ziyang Ma, Zhisheng Zheng, Guanrou Yang, Yu Wang, Chao Zhang, Xie Chen:
Pushing the Limits of Unsupervised Unit Discovery for SSL Speech Representation. 1269-1273 - Mirazul Haque, Rutvij Shah, Simin Chen, Berrak Sisman, Cong Liu, Wei Yang:
SlothSpeech: Denial-of-service Attack Against Speech Recognition Models. 1274-1278 - Zhihan Wang, Feng Hou, Ruili Wang:
CLRL-Tuning: A Novel Continual Learning Approach for Automatic Speech Recognition. 1279-1283 - Li-Fang Lai, Nicole R. Holliday:
Exploring Sources of Racial Bias in Automatic Speech Recognition through the Lens of Rhythmic Variation. 1284-1288 - Guangzhi Sun, Xianrui Zheng, Chao Zhang, Philip C. Woodland:
Can Contextual Biasing Remain Effective with Whisper and GPT-2? 1289-1293 - Daisuke Niizumi, Daiki Takeuchi, Yasunori Ohishi, Noboru Harada, Kunio Kashino:
Masked Modeling Duo for Speech: Specializing General-Purpose Audio Representation to Speech using Denoising Distillation. 1294-1298 - Xiaodong Cui, George Saon, Brian Kingsbury:
Improving RNN Transducer Acoustic Models for English Conversational Speech Recognition. 1299-1303 - Jiamin Xie, John H. L. Hansen:
MixRep: Hidden Representation Mixup for Low-Resource Speech Recognition. 1304-1308 - Chenda Li, Yao Qian, Zhuo Chen, Naoyuki Kanda, Dongmei Wang, Takuya Yoshioka, Yanmin Qian, Michael Zeng:
Adapting Multi-Lingual ASR Models for Handling Multiple Talkers. 1314-1318 - Dianwen Ng, Chong Zhang, Ruixi Zhang, Yukun Ma, Trung Hieu Nguyen, Chongjia Ni, Shengkui Zhao, Qian Chen, Wen Wang, Eng Siong Chng, Bin Ma:
Adapter-tuning with Effective Token-dependent Representation Shift for Automatic Speech Recognition. 1319-1323 - Yiting Lu, Philip Harding, Kanthashree Mysore Sathyendra, Sibo Tong, Xuandi Fu, Jing Liu, Feng-Ju Chang, Simon Wiesler, Grant P. Strimel:
Model-Internal Slot-triggered Biasing for Domain Expansion in Neural Transducer ASR Models. 1324-1328 - Zengwei Yao, Wei Kang, Fangjun Kuang, Liyong Guo, Xiaoyu Yang, Yifan Yang, Long Lin, Daniel Povey:
Delay-penalized CTC Implemented Based on Finite State Transducer. 1329-1333
Speech Recognition: Architecture, Search, and Linguistic Components 2
- Jiaxu Zhu, Weinan Tong, Yaoxun Xu, Changhe Song, Zhiyong Wu, Zhao You, Dan Su, Dong Yu, Helen Meng:
Text-Only Domain Adaptation for End-to-End Speech Recognition through Down-Sampling Acoustic Representation. 1334-1338 - Zhipeng Chen, Haihua Xu, Yerbolat Khassanov, Yi He, Lu Lu, Zejun Ma, Ji Wu:
Knowledge Distillation Approach for Efficient Internal Language Model Estimation. 1339-1343 - Jiban Adhikary, Keith Vertanen:
Language Model Personalization for Improved Touchscreen Typing. 1344-1348 - Minkyu Jung, Ohhyeok Kwon, Seunghyun Seo, Soonshin Seo:
Blank Collapse: Compressing CTC Emission for the Faster Decoding. 1349-1353 - Cal Peyser, Zhong Meng, Rohit Prabhavalkar, Andrew Rosenberg, Tara N. Sainath, Michael Picheny, Kyunghyun Cho, Ke Hu:
Improving Joint Speech-Text Representations Without Alignment. 1354-1358 - Robert Flynn, Anton Ragni:
Leveraging Cross-Utterance Context For ASR Decoding. 1359-1363 - Minglun Han, Feilong Chen, Jing Shi, Shuang Xu, Bo Xu:
Knowledge Transfer from Pre-trained Language Models to Cif-based Speech Recognizers via Hierarchical Distillation. 1364-1368 - Emiru Tsunoo, Hayato Futami, Yosuke Kashiwagi, Siddhant Arora, Shinji Watanabe:
Integration of Frame- and Label-synchronous Beam Search for Streaming Encoder-decoder Speech Recognition. 1369-1373 - Dongcheng Jiang, Chao Zhang, Philip C. Woodland:
A Neural Time Alignment Module for End-to-End Automatic Speech Recognition. 1374-1378 - Yuang Li, Yu Wu, Jinyu Li, Shujie Liu:
Accelerating Transducers through Adjacent Token Merging. 1379-1383 - Siyuan Feng, Ming Tu, Rui Xia, Chuanzeng Huang, Yuxuan Wang:
Language-Universal Phonetic Representation in Multilingual Speech Pretraining for Low-Resource Speech Recognition. 1384-1388 - Wenxuan Wang, Guodong Ma, Yuke Li, Binbin Du:
Language-Routing Mixture of Experts for Multilingual and Code-Switching Speech Recognition. 1389-1393 - Jaeyoung Lee, Masato Mimura, Tatsuya Kawahara:
Embedding Articulatory Constraints for Low-resource Speech Recognition Based on Large Pre-trained Model. 1394-1398 - Xuankai Chang, Brian Yan, Yuya Fujita, Takashi Maekaku, Shinji Watanabe:
Exploration of Efficient End-to-End ASR using Discretized Input from Self-Supervised Learning. 1399-1403 - Alexandra Antonova, Evelina Bakhturina, Boris Ginsburg:
SpellMapper: A non-autoregressive neural spellchecker for ASR customization with candidate retrieval based on n-gram mappings. 1404-1408 - Shaan Bijwadia, Shuo-Yiin Chang, Weiran Wang, Zhong Meng, Hao Zhang:
Text Injection for Capitalization and Turn-Taking Prediction in Speech Models. 1409-1413 - Igor Gitman, Vitaly Lavrukhin, Aleksandr Laptev, Boris Ginsburg:
Confidence-based Ensembles of End-to-End Speech Recognition Models. 1414-1418 - Jie Chi, Brian Lu, Jason Eisner, Peter Bell, Preethi Jyothi, Ahmed M. Ali:
Unsupervised Code-switched Text Generation from Parallel Text. 1419-1423 - Dingyi Wang, Mengjie Luo, Lin Li, Xiaoqin Wang, Shushan Qiao, Yumei Zhou:
A Binary Keyword Spotting System with Error-Diffusion Based Feature Binarization. 1424-1428 - Siyuan Feng, Ming Tu, Rui Xia, Chuanzeng Huang, Yuxuan Wang:
Language-universal Phonetic Encoder for Low-resource Speech Recognition. 1429-1433 - Chong-En Lin, Kuan-Yu Chen:
A Lexical-aware Non-autoregressive Transformer-based ASR Model. 1434-1438 - Joshua Jansen van Vüren, Thomas Niesler:
Improving Under-Resourced Code-Switched Speech Recognition: Large Pre-trained Models or Architectural Interventions. 1439-1443
Spoken Language Translation, Information Retrieval, Summarization, Resources, and Evaluation 1
- Jerome R. Bellegarda:
Pragmatic Pertinence: A Learnable Confidence Metric to Assess the Subjective Quality of LM-Generated Text. 1444-1448 - Yuanchao Li, Zeyu Zhao, Ondrej Klejch, Peter Bell, Catherine Lai:
ASR and Emotional Speech: A Word-Level Investigation of the Mutual Impact of Speech and Emotion Recognition. 1449-1453 - Roshan Sharma, Siddhant Arora, Kenneth Zheng, Shinji Watanabe, Rita Singh, Bhiksha Raj:
BASS: Block-wise Adaptation for Speech Summarization. 1454-1458 - Meena M. Chandra Shekar, John H. L. Hansen:
Speaker Tracking using Graph Attention Networks with Varying Duration Utterances across Multi-Channel Naturalistic Data: Fearless Steps Apollo-11 Audio Corpus. 1459-1463 - Tianfang Yan, Kikuo Maekawa, Yukiko Nota, Masayuki Hirata:
Combining language corpora in a Japanese electromagnetic articulography database for acoustic-to-articulatory inversion. 1464-1467 - Xiaoheng Zhang, Yang Li:
A Dual Attention-based Modality-Collaborative Fusion Network for Emotion Recognition. 1468-1472 - Cristian Chivriga, Rinita Roy:
Large Dataset Generation of Synchronized Music Audio and Lyrics at Scale using Teacher-Student Paradigm. 1473-1477 - Adhiraj Banerjee, Vipul Arora:
Enc-Dec RNN Acoustic Word Embeddings learned via Pairwise Prediction. 1478-1482 - Samantha Kotey, Rozenn Dahyot, Naomi Harte:
Query Based Acoustic Summarization for Podcasts. 1483-1487 - Ying Shi, Dong Wang, Lantian Li, Jiqing Han, Shi Yin:
Spot Keywords From Very Noisy and Mixed Speech. 1488-1492 - Khandokar Md. Nayem, Ran Xue, Ching-Yun Chang, Akshaya Vishnu Kudlu Shanbhogue:
Knowledge Distillation on Joint Task End-to-End Speech Translation. 1493-1497 - Hao Yang, Jinming Zhao, Gholamreza Haffari, Ehsan Shareghi:
Investigating Pre-trained Audio Encoders in the Low-Resource Condition. 1498-1502 - Guan-Wei Wu, Guan-Ting Lin, Shang-Wen Li, Hung-yi Lee:
Improving Textless Spoken Language Understanding with Discrete Units as Intermediate Target. 1503-1507
Speech, Voice, and Hearing Disorders 1
- Eungbeom Kim, Yunkee Chae, Jaeheon Sim, Kyogu Lee:
Debiased Automatic Speech Recognition for Dysarthric Speech via Sample Reweighting with Sample Affinity Test. 1508-1512 - Katerina Papadimitriou, Gerasimos Potamianos:
Multimodal Locally Enhanced Transformer for Continuous Sign Language Recognition. 1513-1517 - Monica González Machorro, Pascal Hecker, Uwe D. Reichel, Helly N. Hammer, Robert Hoepner, Lisa Pedrotti, Alisha Zmutt, Hesam Sagha, Johan van Beek, Florian Eyben, Dagmar M. Schuller, Björn W. Schuller, Bert Arnrich:
Towards Supporting an Early Diagnosis of Multiple Sclerosis using Vocal Features. 1518-1522 - Siddharth Rathod, Monil Charola, Akshat Vora, Yash Jogi, Hemant A. Patil:
Whisper Features for Dysarthric Severity-Level Classification. 1523-1527 - Jiyang Tang, William Chen, Xuankai Chang, Shinji Watanabe, Brian MacWhinney:
A New Benchmark of Aphasia Speech Recognition and Detection Based on E-Branchformer and Multi-task Learning. 1528-1532 - Zhengjun Yue, Erfan Loweimi, Zoran Cvetkovic:
Dysarthric Speech Recognition, Detection and Classification using Raw Phase and Magnitude Spectra. 1533-1537 - Sebastian P. Bayerl, Dominik Wagner, Ilja Baumann, Florian Hönig, Tobias Bocklet, Elmar Nöth, Korbinian Riedhammer:
A Stutter Seldom Comes Alone - Cross-Corpus Stuttering Detection as a Multi-label Problem. 1538-1542 - Tanuka Bhattacharjee, Anjali Jayakumar, Yamini Belur, Atchayaram Nalini, Ravi Yadav, Prasanta Kumar Ghosh:
Transfer Learning to Aid Dysarthria Severity Classification for Patients with Amyotrophic Lateral Sclerosis. 1543-1547 - Helin Wang, Thomas Thebaud, Jesús Villalba, Myra Sydnor, Becky Lammers, Najim Dehak, Laureano Moro-Velázquez:
DuTa-VC: A Duration-aware Typical-to-atypical Voice Conversion Approach with Diffusion Probabilistic Model. 1548-1552 - Ramin Hedeshy, Raphael Menges, Steffen Staab:
CNVVE: Dataset and Benchmark for Classifying Non-verbal Voice. 1553-1557 - Massa Baali, Ibrahim Almakky, Shady Shehata, Fakhri Karray:
Arabic Dysarthric Speech Recognition Using Adversarial and Signal-Based Augmentation. 1558-1562 - Theodoros Kouzelis, Georgios Paraskevopoulos, Athanasios Katsamanis, Vassilis Katsouros:
Weakly-supervised forced alignment of disfluent speech using phoneme-level modeling. 1563-1567 - Michal Novotný, Tereza Tykalová, Michal Simek, Tomás Kouba, Jan Rusz:
Glottal source analysis of voice deficits in basal ganglia dysfunction: evidence from de novo Parkinson's disease and Huntington's disease. 1568-1572 - Jihyun Mun, Sunhee Kim, Myeong-Ju Kim, Jiwon Ryu, Sejoong Kim, Minhwa Chung:
An Analysis of Glottal Features of Chronic Kidney Disease Speech and Its Application to CKD Detection. 1573-1577 - Varun Belagali, M. V. Achuth Rao, Prasanta Kumar Ghosh:
Weakly supervised glottis segmentation in high-speed videoendoscopy using bounding box labels. 1578-1582
Speech Recognition: Technologies and Systems for New Applications 3
- Zhengyang Li, Chenwei Liang, Timo Lohrenz, Marvin Sach, Björn Möller, Tim Fingscheidt:
An Efficient and Noise-Robust Audiovisual Encoder for Audiovisual Speech Recognition. 1583-1587 - Satwinder Singh, Feng Hou, Ruili Wang:
A Novel Self-training Approach for Low-resource Speech Recognition. 1588-1592 - Zhifu Gao, Zerui Li, Jiaming Wang, Haoneng Luo, Xian Shi, Mengzhe Chen, Yabin Li, Lingyun Zuo, Zhihao Du, Shiliang Zhang:
FunASR: A Fundamental End-to-End Speech Recognition Toolkit. 1593-1597 - Pingchuan Ma, Niko Moritz, Stavros Petridis, Christian Fuegen, Maja Pantic:
Streaming Audio-Visual Speech Recognition with Alignment Regularization. 1598-1602 - Adriana Fernandez-Lopez, Honglie Chen, Pingchuan Ma, Alexandros Haliassos, Stavros Petridis, Maja Pantic:
SparseVSR: Lightweight and Noise Robust Visual Speech Recognition. 1603-1607 - Allen Chang, Xiaoyuan Zhu, Aarav Monga, Seoho Ahn, Tejas Srinivasan, Jesse Thomason:
Multimodal Speech Recognition for Language-Guided Embodied Agents. 1608-1612
Spoken Term Detection and Voice Search
- Kumari Nishu, Minsik Cho, Devang Naik:
Matching Latent Encoding for Audio-Text based Keyword Spotting. 1613-1617 - P. Sudhakar, K. Sreenivasa Rao, Pabitra Mitra:
Self-Paced Pattern Augmentation for Spoken Term Detection in Zero-Resource. 1618-1622 - Gene-Ping Yang, Yue Gu, Qingming Tang, Dongsu Du, Yuzong Liu:
On-Device Constrained Self-Supervised Speech Representation Learning for Keyword Spotting via Knowledge Distillation. 1623-1627 - Umberto Michieli, Pablo Peso Parada, Mete Ozay:
Online Continual Learning in Keyword Spotting for Low-Resource Devices via Pooling High-Order Temporal Statistics. 1628-1632 - Seunghan Yang, Byeonggeun Kim, Kyuhong Shim, Simyoung Chang:
Improving Small Footprint Few-shot Keyword Spotting with Supervision on Auxiliary Data. 1633-1637 - Chouchang Yang, Yashas Malur Saidutta, Rakshith Sharma Srinivasa, Ching Hua Lee, Yilin Shen, Hongxia Jin:
Robust Keyword Spotting for Noisy Environments by Leveraging Speech Enhancement and Speech Presence Probability. 1638-1642
Models for Streaming ASR
- Yuting Yang, Yuke Li, Binbin Du:
Enhancing the Unified Streaming and Non-streaming Model with Contrastive Learning. 1643-1647 - Xingchen Song, Di Wu, Binbin Zhang, Zhendong Peng, Bo Dang, Fuping Pan, Zhiyong Wu:
ZeroPrompt: Streaming Acoustic Encoders are Zero-Shot Masked LMs. 1648-1652 - Hanbyul Kim, Seunghyun Seo, Lukas Lee, Seolki Baek:
Improved Training for End-to-End Streaming Automatic Speech Recognition Model with Punctuation. 1653-1657 - Goeric Huybrechts, Srikanth Ronanki, Xilai Li, Hadis Nosrati, Sravan Bodapati, Katrin Kirchhoff:
DCTX-Conformer: Dynamic context carry-over for low latency unified streaming and non-streaming Conformer. 1658-1662 - Kyuhong Shim, Jinkyu Lee, Simyoung Chang, Kyuwoong Hwang:
Knowledge Distillation from Non-streaming to Streaming ASR Encoder using Auxiliary Non-streaming Layer. 1663-1667 - Tianyi Xu, Zhanheng Yang, Kaixun Huang, Pengcheng Guo, Ao Zhang, Biao Li, Changru Chen, Chao Li, Lei Xie:
Adaptive Contextual Biasing for Transducer Based Streaming Speech Recognition. 1668-1672
Source Separation
- Héctor Martel, Julius Richter, Kai Li, Xiaolin Hu, Timo Gerkmann:
Audio-Visual Speech Separation in Noisy Environments with a Lightweight Iterative Model. 1673-1677 - Kohei Saijo, Tetsuji Ogawa:
Remixing-based Unsupervised Source Separation from Scratch. 1678-1682 - Yuki Okamoto, Kanta Shimonishi, Keisuke Imoto, Kota Dohi, Shota Horiguchi, Yohei Kawaguchi:
CAPTDURE: Captioned Sound Dataset of Single Sources. 1683-1687 - Hokuto Munakata, Ryu Takeda, Kazunori Komatani:
Recursive Sound Source Separation with Deep Learning-based Beamforming for Unknown Number of Sources. 1688-1692 - Ladislav Mosner, Oldrich Plchot, Junyi Peng, Lukás Burget, Jan Cernocký:
Multi-Channel Speech Separation with Cross-Attention and Beamforming. 1693-1697 - Deokjun Eom, Woo Hyun Nam, Kyung-Rae Kim:
Background-Sound Controllable Voice Source Separation. 1698-1702
Speech and Language in Health: From Remote Monitoring to Medical Conversations 1
- Daniel Escobar-Grisales, Tomás Arias-Vergara, Cristian David Ríos-Urrego, Elmar Nöth, Adolfo M. García, Juan Rafael Orozco-Arroyave:
An Automatic Multimodal Approach to Analyze Linguistic and Acoustic Cues on Parkinson's Disease Patients. 1703-1707 - Khanh-Tung Tran, Truong Hoang, Duy Khuong Nguyen, Hoang D. Nguyen, Xuan-Son Vu:
Personalization for Robust Voice Pathology Detection in Sound Waves. 1708-1712 - Helen Meng, Brian Mak, Man-Wai Mak, Helene H. Fung, Xianmin Gong, Timothy C. Y. Kwok, Xunying Liu, Vincent C. T. Mok, Patrick C. M. Wong, Jean Woo, Xixin Wu, Ka Ho Wong, Sean Shensheng Xu, Naijun Zheng, Ranzo Huang, Jiawen Kang, Xiaoquan Ke, Junan Li, Jinchao Li, Yi Wang:
Integrated and Enhanced Pipeline System to Support Spoken Language Analytics for Screening Neurocognitive Disorders. 1713-1717 - Minxue Niu, Amrit Romana, Mimansa Jaiswal, Melvin G. McInnis, Emily Mower Provost:
Capturing Mismatch between Textual and Acoustic Emotion Expressions for Mood Identification in Bipolar Disorder. 1718-1722 - Qifei Li, Dong Wang, Yiming Ren, Yingming Gao, Ya Li:
FTA-net: A Frequency and Time Attention Network for Speech Depression Detection. 1723-1727 - Salvatore Fara, Orlaith Hickey, Alexandra Livia Georgescu, Stefano Goria, Emilia Molimpakis, Nicholas Cummins:
Bayesian Networks for the robust and unbiased prediction of depression and its symptoms utilizing speech and multimodal data. 1728-1732 - Tianzi Wang, Shoukang Hu, Jiajun Deng, Zengrui Jin, Mengzhe Geng, Yi Wang, Helen Meng, Xunying Liu:
Hyper-parameter Adaptation of Conformer ASR Systems for Elderly and Dysarthric Speech Recognition. 1733-1737 - Edward L. Campbell, Judith Dineley, Pauline Conde, Faith Matcham, Katie M. White, Carolin Oetzmann, Sara Simblett, Stuart Bruce, Amos A. Folarin, Til Wykes, Srinivasan Vairavan, Richard J. B. Dobson, Laura Docío Fernández, Carmen García-Mateo, Vaibhav A. Narayan, Matthew Hotopf, Nicholas Cummins:
Classifying depression symptom severity: Assessment of speech representations in personalized and generalized machine learning models. 1738-1742 - Shabnam Ghaffarzadegan, Luca Bondi, Ho-Hsiang Wu, Sirajum Munir, Kelly J. Shields, Samarjit Das, Joseph Aracri:
Active Learning for Abnormal Lung Sound Data Curation and Detection in Asthma. 1743-1747 - Paula Andrea Pérez-Toro, Tomás Arias-Vergara, Franziska Braun, Florian Hönig, Carlos Andrés Tobón-Quintero, David Aguillón, Francisco Lopera, Liliana Hincapié-Henao, Maria Schuster, Korbinian Riedhammer, Andreas Maier, Elmar Nöth, Juan Rafael Orozco-Arroyave:
Automatic Assessment of Alzheimer's across Three Languages Using Speech and Language Features. 1748-1752 - Mengzhe Geng, Xurong Xie, Rongfeng Su, Jianwei Yu, Zengrui Jin, Tianzi Wang, Shujie Hu, Zi Ye, Helen Meng, Xunying Liu:
On-the-Fly Feature Based Rapid Speaker Adaptation for Dysarthric and Elderly Speech Recognition. 1753-1757 - Jan Svihlík, Vojtech Illner, Petr Krýze, Mário Sousa, Paul Krack, Elina Tripoliti, Robert Jech, Jan Rusz:
Relationship between LTAS-based spectral moments and acoustic parameters of hypokinetic dysarthria in Parkinson's disease. 1758-1762 - Eduardo Alvarado, Nicolás Grágeda, Alejandro Luzanto, Rodrigo Mahú, Jorge Wuth, Laura Mendoza, Richard M. Stern, Néstor Becerra Yoma:
Respiratory distress estimation in human-robot interaction scenario. 1763-1767 - Emma Reyner-Fuentes, Esther Rituerto-González, Isabel Trancoso, Carmen Peláez-Moreno:
Prediction of the Gender-based Violence Victim Condition using Speech: What do Machine Learning Models rely on? 1768-1772 - Monil Charola, Aastha Kachhi, Hemant A. Patil:
Whisper Encoder features for Infant Cry Classification. 1773-1777
Speech Perception
- Nika Jurov, William J. Idsardi, Naomi H. Feldman:
A neural architecture for selective attention to speech features. 1778-1782 - Mingyue Huo, Yinglun Sun, Daniel Fogerty, Yan Tang:
Quantifying Informational Masking due to Masker Intelligibility in Same-talker Speech-in-speech Perception. 1783-1787 - Santiago Cuervo, Ricard Marxer:
On the Benefits of Self-supervised Learned Speech Representations for Predicting Human Phonetic Misperceptions. 1788-1792 - Felicia Schulz, Mirella De Sisto, M. Paula M. P. Roncaglia-Denissen, Peter Hendrix:
Predicting Perceptual Centers Located at Vowel Onset in German Speech Using Long Short-Term Memory Networks. 1793-1797 - Martin Cooke, María Luisa García Lecumberri:
Exploring the mutual intelligibility breakdown caused by sculpting speech from a competing speech signal. 1798-1802 - Mafuyu Kitahara, Naoya Watabe, Hiroto Noguchi, Chuyu Huang, Ayako Hashimoto, Ai Mizoguchi:
Perception of Incomplete Voicing Neutralization of Obstruents in Tohoku Japanese. 1803-1807
Phonetics and Phonology: Languages and Varieties
- Jasmin Pöhnlein, Felicitas Kleber:
The emergence of obstruent-intrinsic f0 and VOT as cues to the fortis/lenis contrast in West Central Bavarian. 1808-1812 - William N. Havard, Yaya Sy, Camila Scaff, Loann Peurey, Alejandrina Cristià:
〈'〉 in Tsimane': a Preliminary Investigation. 1813-1817 - Dennis Hoffmann, Maria O'Reilly:
Segmental features of Brazilian (Santa Catarina) Hunsrik. 1818-1822 - Louise Ratko, Joshua Penney, Felicity Cox:
Opening or Closing? An Electroglottographic Analysis of Voiceless Coda Consonants in Australian English. 1823-1827 - Franka Zebe:
Increasing aspiration of word-medial fortis plosives in Swiss Standard German. 1828-1832 - Bowei Shao, Philipp Buech, Anne Hermes, Maria Giavazzi:
Lexical Stress and Velar Palatalization in Italian: A spatio-temporal Interaction. 1833-1837
Paralinguistics 2
- Zihan Wu, Neil Scheidwasser-Clow, Karl El Hajal, Milos Cernak:
Speaker Embeddings as Individuality Proxy for Voice Stress Detection. 1838-1842 - Jingyao Wu, Ting Dang, Vidhyasaharan Sethu, Eliathamby Ambikairajah:
From Interval to Ordinal: A HMM based Approach for Emotion Label Conversion. 1843-1847 - Zhiyu Zhang, Da Liu, Shengqiang Liu, Anna Wang, Jie Gao, Yali Li:
Turbo your multi-modal classification with contrastive learning. 1848-1852 - Georgios Ioannides, Michael Owen, Andrew Fletcher, Viktor Rozgic, Chao Wang:
Towards Paralinguistic-Only Speech Representations for End-to-End Speech Emotion Recognition. 1853-1857 - Ruiteng Zhang, Jianguo Wei, Xugang Lu, Yongwei Li, Junhai Xu, Di Jin, Jianhua Tao:
SOT: Self-supervised Learning-Assisted Optimal Transport for Unsupervised Adaptive Speech Emotion Recognition. 1858-1862 - Lokesh Bansal, S. Pavankumar Dubagunta, Malolan Chetlur, Pushpak Jagtap, Aravind Ganapathiraju:
On the Efficacy and Noise-Robustness of Jointly Learned Speech Emotion and Automatic Speech Recognition. 1863-1867 - Shao-Hao Lu, Yun-Shao Lin, Chi-Chun Lee:
Speaking State Decoder with Transition Detection for Next Speaker Prediction. 1868-1872 - Yuki Kitagishi, Naohiro Tawara, Atsunori Ogawa, Ryo Masumura, Taichi Asami:
What are differences? Comparing DNN and Human by Their Performance and Characteristics in Speaker Age Estimation. 1873-1877 - Joop Arts, Khiet P. Truong:
Effects of perceived gender on the perceived social function of laughter. 1878-1882 - Tilak Purohit, Bogdan Vlasenko, Mathew Magimai-Doss:
Implicit phonetic information modeling for speech emotion recognition. 1883-1887 - Seong-Gyun Leem, Daniel Fulford, Jukka-Pekka Onnela, David Gard, Carlos Busso:
Computation and Memory Efficient Noise Adaptation of Wav2Vec2.0 for Noisy Speech Emotion Recognition with Skip Connection Adapters. 1888-1892 - Yang Liu, Haoqin Sun, Geng Chen, Qingyue Wang, Zhen Zhao, Xugang Lu, Longbiao Wang:
Multi-Level Knowledge Distillation for Speech Emotion Recognition in Noisy Conditions. 1893-1897 - Abinay Reddy Naini, Ali N. Salman, Carlos Busso:
Preference Learning Labels by Anchoring on Consecutive Annotations. 1898-1902 - Orchid Chetia Phukan, Arun Balaji Buduru, Rajesh Sharma:
Transforming the Embeddings: A Lightweight Technique for Speech Emotion Recognition Tasks. 1903-1907 - Cheng Lu, Hailun Lian, Wenming Zheng, Yuan Zong, Yan Zhao, Sunan Li:
Learning Local to Global Feature Aggregation for Speech Emotion Recognition. 1908-1912 - Xuechen Wang, Shiwan Zhao, Yong Qin:
Supervised Contrastive Learning with Nearest Neighbor Search for Speech Emotion Recognition. 1913-1917
Speaker and Language Identification 1
- Viet-Thanh Pham, Xuan Thai Hoa Nguyen, Vu Hoang, Thi Thu Trang Nguyen:
Vietnam-Celeb: a large-scale dataset for Vietnamese speaker recognition. 1918-1922 - Mu Yang, Ram C. M. C. Shekar, Okim Kang, John H. L. Hansen:
What Can an Accent Identifier Learn? Probing Phonetic and Prosodic Information in a Wav2vec2-based Accent Identification Model. 1923-1927 - Yooyoung Lee, Craig S. Greenberg, Eliot Godard, Asad A. Butt, Elliot Singer, Trang Nguyen, Lisa P. Mason, Douglas A. Reynolds:
The 2022 NIST Language Recognition Evaluation. 1928-1932 - Salvatore Sarni, Sandro Cumani, Sabato Marco Siniscalchi, Andrea Bottino:
Description and analysis of the KPT system for NIST Language Recognition Evaluation 2022. 1933-1937 - Jia Qi Yip, Duc-Tuan Truong, Dianwen Ng, Chong Zhang, Yukun Ma, Trung Hieu Nguyen, Chongjia Ni, Shengkui Zhao, Eng Siong Chng, Bin Ma:
ACA-Net: Towards Lightweight Speaker Verification using Asymmetric Cross Attention. 1938-1942 - Jiadi Yao, Chengdong Liang, Zhendong Peng, Binbin Zhang, Xiao-Lei Zhang:
Branch-ECAPA-TDNN: A Parallel Branch Architecture to Capture Local and Global Features for Speaker Verification. 1943-1947 - Vishwanath Pratap Singh, Md. Sahidullah, Tomi Kinnunen:
Speaker Verification Across Ages: Investigating Deep Speaker Embedding Sensitivity to Age Mismatch in Enrollment and Test Speech. 1948-1952 - Spandan Dey, Premjeet Singh, Goutam Saha:
Wavelet Scattering Transform for Improving Generalization in Low-Resourced Spoken Language Identification. 1953-1957 - Srijith Radhakrishnan, Chao-Han Huck Yang, Sumeer Ahmad Khan, Narsis A. Kiani, David Gomez-Cabrero, Jesper N. Tegnér:
A Parameter-Efficient Learning Approach to Arabic Dialect Identification with Pre-Trained General-Purpose Speech Model. 1958-1962 - Pablo Andrés Tamayo Flórez, Rubén Manrique, Bernardo Pereira Nunes:
HABLA: A Dataset of Latin American Spanish Accents for Voice Anti-spoofing. 1963-1967 - Rui Li, Zhiwei Xie, Haihua Xu, Yizhou Peng, Hexin Liu, Hao Huang, Eng Siong Chng:
Self-supervised Learning Representation based Accent Recognition with Persistent Accent Memory. 1968-1972 - Bei Liu, Haoyu Wang, Yanmin Qian:
Extremely Low Bit Quantization for Mobile Speaker Verification Systems Under 1MB Memory. 1973-1977 - Sourya Dipta Das, Yash Vadi, Abhishek Unnam, Kuldeep Yadav:
Unsupervised Out-of-Distribution Dialect Detection with Mahalanobis Distance. 1978-1982 - Hervé Bredin:
pyannote.audio 2.1 speaker diarization pipeline: principle, benchmark, and recipe. 1983-1987 - Jingyu Li, Wei Liu, Zhaoyang Zhang, Jiong Wang, Tan Lee:
Model Compression for DNN-based Speaker Verification Using Weight Quantization. 1988-1992 - Bhavik Vachhani, Dipesh K. Singh, Rustom Lawyer:
Multi-resolution Approach to Identification of Spoken Languages and To Improve Overall Language Diarization System Using Whisper Model. 1993-1997 - Chang Zeng, Xin Wang, Xiaoxiao Miao, Erica Cooper, Junichi Yamagishi:
Improving Generalization Ability of Countermeasures for New Mismatch Scenario by Combining Multiple Advanced Regularization Terms. 1998-2002 - Zhida Song, Liang He, Baowei Zhao, Minqiang Xu, Yu Zheng:
Dynamic Fully-Connected Layer for Large-Scale Speaker Verification. 2003-2007
Show and Tell: Speech tools, speech enhancement, speech synthesis
- Hendrik Schröter, Alberto N. Escalante-B., Tobias Rosenkranz, Andreas Maier:
DeepFilterNet: Perceptually Motivated Real-Time Speech Enhancement. 2008-2009 - Felix Burkhardt, Florian Eyben, Björn W. Schuller:
Nkululeko: Machine Learning Experiments on Speaker Characteristics Without Programming. 2010-2011 - Sébastien Le Maguer, Mark Anderson, Naomi Harte:
Sp1NY: A Quick and Flexible Speech Visualisation Tool in Python. 2012-2013 - Niamh Corkey, Johannah O'Mahony, Simon King:
Intonation Control for Neural Text-to-Speech Synthesis with Polynomial Models of F0. 2014-2015 - Éva Székely, Siyang Wang, Joakim Gustafson:
So-to-Speak: An Exploratory Platform for Investigating the Interplay between Style and Prosody in TTS. 2016-2017 - Takayuki Arai, Tsukasa Yoshinaga, Akiyoshi Iida:
Comparing /b/ and /d/ with a Single Physical Model of the Human Vocal Tract to Visualize Droplets Produced while Speaking. 2018-2019 - Erik Ekstedt, Gabriel Skantze:
Show & Tell: Voice Activity Projection and Turn-taking. 2020-2021 - Héctor A. Cordourier, Georg Stemmer, Sinem Aslan, Tobias Bocklet, Himanshu Bhalla:
Real Time Detection of Soft Voice for Speech Enhancement. 2022-2023 - Avani Tanna, Michael Saxon, Amr El Abbadi, William Yang Wang:
Data Augmentation for Diverse Voice Conversion in Noisy Environments. 2024-2025 - Mandar Gogate, Kia Dashtipour, Amir Hussain:
Application for Real-time Audio-Visual Speech Enhancement. 2026-2027
Speech Synthesis and Voice Conversion
- Eunseop Yoon, Hee Suk Yoon, Dhananjaya Gowda, SooHwan Eom, Daehyeok Kim, John B. Harvill, Heting Gao, Mark Hasegawa-Johnson, Chanwoo Kim, Chang D. Yoo:
Mitigating the Exposure Bias in Sentence-Level Grapheme-to-Phoneme (G2P) Transduction. 2028-2032 - Oleg Rybakov, Fadi Biadsy, Xia Zhang, Liyang Jiang, Phoenix Meadowlark, Shivani Agrawal:
Streaming Parrotron for on-device speech-to-speech conversion. 2033-2037 - Zein Shaheen, Tasnima Sadekova, Yulia Matveeva, Alexandra Shirshova, Mikhail A. Kudinov:
Exploiting Emotion Information in Speaker Embeddings for Expressive Text-to-Speech. 2038-2042 - Takuma Okamoto, Tomoki Toda, Hisashi Kawai:
E2E-S2S-VC: End-To-End Sequence-To-Sequence Voice Conversion. 2043-2047 - Yerin Choi, Myoung-Wan Koo:
DC CoMix TTS: An End-to-End Expressive TTS with Discrete Code Collaborated with Mixer. 2048-2052 - Matthew Baas, Benjamin van Niekerk, Herman Kamper:
Voice Conversion With Just Nearest Neighbors. 2053-2057 - Kou Tanaka, Takuhiro Kaneko, Hirokazu Kameoka, Shogo Seki:
CFVC: Conditional Filtering for Controllable Voice Conversion. 2058-2062 - Ziqian Ning, Yuepeng Jiang, Pengcheng Zhu, Jixun Yao, Shuai Wang, Lei Xie, Mengxiao Bi:
DualVC: Dual-mode Voice Conversion using Intra-model Knowledge Distillation and Hybrid Predictive Coding. 2063-2067 - Yun Chen, Lingxiao Yang, Qi Chen, Jian-Huang Lai, Xiaohua Xie:
Attention-based Interactive Disentangling Network for Instance-level Emotional Voice Conversion. 2068-2072 - Bohan Wang, Damien Ronssin, Milos Cernak:
ALO-VC: Any-to-any Low-latency One-shot Voice Conversion. 2073-2077 - Christoph Minixhofer, Ondrej Klejch, Peter Bell:
Evaluating and reducing the distance between synthetic and real speech distributions. 2078-2082 - Waris Quamer, Anurag Das, Ricardo Gutierrez-Osuna:
Decoupling Segmental and Prosodic Cues of Non-native Speech through Vector Quantization. 2083-2087 - Hiroki Kanagawa, Takafumi Moriya, Yusuke Ijima:
VC-T: Streaming Voice Conversion Based on Neural Transducer. 2088-2092 - Suhita Ghosh, Arnab Das, Yamini Sinha, Ingo Siegert, Tim Polzehl, Sebastian Stober:
Emo-StarGAN: A Semi-Supervised Any-to-Many Non-Parallel Emotion-Preserving Voice Conversion. 2093-2097 - Meiying Chen, Zhiyao Duan:
ControlVC: Zero-Shot Voice Conversion with Time-Varying Controls on Pitch and Speed. 2098-2102 - Yeonjong Choi, Chao Xie, Tomoki Toda:
Reverberation-Controllable Voice Conversion Using Reverberation Time Estimator. 2103-2107 - Cheng Yu, Yang Li, Weiqin Zu, Fanglei Sun, Zheng Tian, Jun Wang:
Cross-utterance Conditioned Coherent Speech Editing. 2108-2112
Spoken Language Translation, Information Retrieval, Summarization, Resources, and Evaluation 2
- Jianrong Wang, Yuchen Huo, Li Liu, Tianyi Xu, Qi Li, Sen Li:
MAVD: The First Open Large-Scale Mandarin Audio-Visual Dataset with Depth Information. 2113-2117 - Lantian Li, Xiaolou Li, Haoyu Jiang, Chen Chen, Ruihai Hou, Dong Wang:
CN-Celeb-AV: A Multi-Genre Audio-Visual Dataset for Person Recognition. 2118-2122 - Yuhang Li, Xiao Wei, Yuke Si, Longbiao Wang, Xiaobao Wang, Jianwu Dang:
Improving Zero-shot Cross-domain Slot Filling via Transformer-based Slot Semantics Fusion. 2123-2127 - Wooseok Shin, Hyun Joon Park, Jin Sob Kim, Dongwon Kim, Seungjin Lee, Sung Won Han:
Rethinking Transfer and Auxiliary Learning for Improving Audio Captioning Transformer. 2128-2132 - Viet Dac Lai, Abel Salinas, Hao Tan, Trung Bui, Quan Tran, Seunghyun Yoon, Hanieh Deilamsalehy, Franck Dernoncourt, Thien Huu Nguyen:
Boosting Punctuation Restoration with Data Generation and Reinforcement Learning. 2133-2137 - Yi-Fen Liu, Xiang-Li Lu:
J-ToneNet: A Transformer-based Encoding Network for Improving Tone Classification in Continuous Speech via F0 Sequences. 2138-2142 - Jonathan E. Avila, Nigel G. Ward:
Towards Cross-Language Prosody Transfer for Dialog. 2143-2147 - Santosh Kesiraju, Marek Sarvas, Tomás Pavlícek, Cécile Macaire, Alejandro Ciuba:
Strategies for Improving Low Resource Speech to Text Translation Relying on Pre-trained ASR Models. 2148-2152 - Alkis Koudounas, Moreno La Quatra, Lorenzo Vaiani, Luca Colomba, Giuseppe Attanasio, Eliana Pastor, Luca Cagliero, Elena Baralis:
ITALIC: An Italian Intent Classification Dataset. 2153-2157 - Janine Rugayan, Giampiero Salvi, Torbjørn Svendsen:
Perceptual and Task-Oriented Assessment of a Semantic Metric for ASR Evaluation. 2158-2162 - Guangpeng Li, Lu Chen, Kai Yu:
How ChatGPT is Robust for Spoken Language Understanding? 2163-2167 - Rong Ye, Chengqi Zhao, Tom Ko, Chutong Meng, Tao Wang, Mingxuan Wang, Jun Cao:
GigaST: A 10,000-hour Pseudo Speech Translation Corpus. 2168-2172 - Jiaxin Fan, Yong Zhang, Hanzhang Li, Jianzong Wang, Zhitao Li, Sheng Ouyang, Ning Cheng, Jing Xiao:
Boosting Chinese ASR Error Correction with Dynamic Error Scaling Mechanism. 2173-2177 - Per Fallgren, Jens Edlund:
Crowdsource-based Validation of the Audio Cocktail as a Sound Browsing Tool. 2178-2182 - Yunxiang Li, Pengfei Liu, Xixin Wu, Helen Meng:
PunCantonese: A Benchmark Corpus for Low-Resource Cantonese Punctuation Restoration from Speech Transcripts. 2183-2187 - Shuhei Kato, Taiichi Hashimoto:
Speech-to-Face Conversion Using Denoising Diffusion Probabilistic Models. 2188-2192 - Yuta Nishikawa, Satoshi Nakamura:
Inter-connection: Effective Connection between Pre-trained Encoder and Decoder for Speech Translation. 2193-2197
Novel Transformer Models for ASR
- Martin Radfar, Paulina Lyskawa, Brandon Trujillo, Yi Xie, Kai Zhen, Jahn Heymann, Denis Filimonov, Grant P. Strimel, Nathan Susanj, Athanasios Mouchtaris:
Conmer: Streaming Conformer Without Self-attention for Interactive Voice Assistants. 2198-2202 - Do-Hee Kim, Ji-Eun Choi, Joon-Hyuk Chang:
Intra-ensemble: A New Method for Combining Intermediate Outputs in Transformer-based Automatic Speech Recognition. 2203-2207 - Yifan Peng, Kwangyoun Kim, Felix Wu, Brian Yan, Siddhant Arora, William Chen, Jiyang Tang, Suwon Shon, Prashant Sridhar, Shinji Watanabe:
A Comparative Study on E-Branchformer vs Conformer in Speech Recognition, Translation, and Understanding Tasks. 2208-2212 - Florian Mai, Juan Zuluaga-Gomez, Titouan Parcollet, Petr Motlícek:
HyperConformer: Multi-head HyperMixer for Efficient Speech Recognition. 2213-2217 - Carlos Carvalho, Alberto Abad:
Memory-augmented conformer for improved end-to-end long-form ASR. 2218-2222 - Mingyu Cui, Jiawen Kang, Jiajun Deng, Xi Yin, Yutao Xie, Xie Chen, Xunying Liu:
Towards Effective and Compact Contextual Representation for Conformer Transducer Speech Recognition Systems. 2223-2227
Speaker Recognition 1
- Yafeng Chen, Siqi Zheng, Hui Wang, Luyao Cheng, Qian Chen, Jiajun Qi:
An Enhanced Res2Net with Local and Global Feature Fusion for Speaker Verification. 2228-2232 - Jian Zhang, Liang He, Xiaochen Guo, Jing Ma:
A Study on Visualization of Voiceprint Feature. 2233-2237 - Ivan Yakovlev, Anton Okhotnikov, Nikita Torgashov, Rostislav Makarov, Yuri Voevodin, Konstantin Simonchik:
VoxTube: a multilingual speaker recognition dataset. 2238-2242 - Pengqi Li, Lantian Li, Askar Hamdulla, Dong Wang:
Visualizing Data Augmentation in Deep Speaker Recognition. 2243-2247
Cross-lingual and Multilingual ASR
- Zhilong Zhang, Wei Wang, Yanmin Qian:
Fast and Efficient Multilingual Self-Supervised Pre-training for Low-Resource Speech Recognition. 2248-2252 - Wei Wang, Yanmin Qian:
UniSplice: Universal Cross-Lingual Data Splicing for Low-Resource ASR. 2253-2257 - Kevin Glocker, Aaricia Herygers, Munir Georges:
Allophant: Cross-lingual Phoneme Recognition with Articulatory Attributes. 2258-2262 - Li Li, Dongxing Xu, Haoran Wei, Yanhua Long:
Phonetic-assisted Multi-Target Units Modeling for Improving Conformer-Transducer ASR system. 2263-2267 - Andrew Rouditchenko, Sameer Khurana, Samuel Thomas, Rogério Feris, Leonid Karlinsky, Hilde Kuehne, David Harwath, Brian Kingsbury, James R. Glass:
Comparison of Multilingual Self-Supervised and Weakly-Supervised Speech Pre-Training for Adaptation to Unseen Languages. 2268-2272 - Haoyu Wang, Siyuan Wang, Wei-Qiang Zhang, Jinfeng Bai:
DistilXLSR: A Light Weight Cross-Lingual Speech Representation Model. 2273-2277
Voice Conversion
- Hai Zhu, Huayi Zhan, Hong Cheng, Ying Wu:
Emotional Voice Conversion with Semi-Supervised Generative Modeling. 2278-2282 - Ha-Yeong Choi, Sang-Hoon Lee, Seong-Whan Lee:
Diff-HierVC: Diffusion-based Hierarchical Voice Conversion with Robust Pitch Generation and Masked Prior for Zero-shot Speaker Adaptation. 2283-2287 - Pengfei Wei, Xiang Yin, Chunfeng Wang, Zhonghao Li, Xinghua Qu, Zhiqiang Xu, Zejun Ma:
S2CD: Self-heuristic Speaker Content Disentanglement for Any-to-Any Voice Conversion. 2288-2292 - Le Xu, Rongxiu Zhong, Ying Liu, Huibao Yang, Shilei Zhang:
Flow-VAE VC: End-to-End Flow Framework with Contrastive Loss for Zero-shot Voice Conversion. 2293-2297 - Zhonghua Liu, Shijun Wang, Ning Chen:
Automatic Speech Disentanglement for Voice Conversion using Rank Module and Speech Augmentation. 2298-2302 - Wonjune Kang, Mark Hasegawa-Johnson, Deb Roy:
End-to-End Zero-Shot Voice Conversion with Location-Variable Convolutions. 2303-2307
Speech and Language in Health: From Remote Monitoring to Medical Conversations 2
- Franziska Braun, Sebastian P. Bayerl, Paula Andrea Pérez-Toro, Florian Hönig, Hartmut Lehfeld, Thomas Hillemacher, Elmar Nöth, Tobias Bocklet, Korbinian Riedhammer:
Classifying Dementia in the Presence of Depression: A Cross-Corpus Study. 2308-2312 - Shujie Hu, Xurong Xie, Mengzhe Geng, Mingyu Cui, Jiajun Deng, Guinan Li, Tianzi Wang, Helen Meng, Xunying Liu:
Exploiting Cross-Domain And Cross-Lingual Ultrasound Tongue Imaging Features For Elderly And Dysarthric Speech Recognition. 2313-2317 - Dominik Wagner, Ilja Baumann, Franziska Braun, Sebastian P. Bayerl, Elmar Nöth, Korbinian Riedhammer, Tobias Bocklet:
Multi-class Detection of Pathological Speech with Latent Features: How does it perform on unseen data? 2318-2322 - Hardik Kothare, Michael Neumann, Jackson Liscombe, Jordan R. Green, Vikram Ramanarayanan:
Responsiveness, Sensitivity and Clinical Utility of Timing-Related Speech Biomarkers for Remote Monitoring of ALS Disease Progression. 2323-2327 - Mengzhe Geng, Zengrui Jin, Tianzi Wang, Shujie Hu, Jiajun Deng, Mingyu Cui, Guinan Li, Jianwei Yu, Xurong Xie, Xunying Liu:
Use of Speech Impairment Severity for Dysarthric Speech Recognition. 2328-2332 - Mohammed Mosuily, Lindsay Welch, Jagmohan Chauhan:
MMLung: Moving Closer to Practical Lung Health Estimation using Smartphones. 2333-2337 - Siyuan Chen, Colin A. Grambow, Mojtaba Kadkhodaie Elyaderani, Alireza Sadeghi, Federico Fancellu, Thomas Schaaf:
Investigating the Utility of Synthetic Data for Doctor-Patient Conversation Summarization. 2338-2342 - Jinhan Wang, Vijay Ravi, Abeer Alwan:
Non-uniform Speaker Disentanglement For Depression Detection From Raw Speech Signals. 2343-2347 - Kubilay Can Demir, Tobias Weise, Matthias May, Axel Schmid, Andreas Maier, Seung Hee Yang:
PoCaPNet: A Novel Approach for Surgical Phase Recognition Using Speech and X-Ray Images. 2348-2352 - Michael Neumann, Hardik Kothare, Vikram Ramanarayanan:
Combining Multiple Multimodal Speech Features into an Interpretable Index Score for Capturing Disease Progression in Amyotrophic Lateral Sclerosis. 2353-2357 - Adria Mallol-Ragolta, Nils Urbach, Shuo Liu, Anton Batliner, Björn W. Schuller:
The MASCFLICHT Corpus: Face Mask Type and Coverage Area Recognition from Speech. 2358-2362 - Catarina Botelho, Alberto Abad, Tanja Schultz, Isabel Trancoso:
Towards Reference Speech Characterization for Health Applications. 2363-2367 - Cristian David Ríos-Urrego, Jan Rusz, Elmar Nöth, Juan Rafael Orozco-Arroyave:
Automatic Classification of Hypokinetic and Hyperkinetic Dysarthria based on GMM-Supervectors. 2368-2372 - Judith Dineley, Ewan Carr, Faith Matcham, Johnny Downs, Richard J. B. Dobson, Thomas F. Quatieri, Nicholas Cummins:
Towards robust paralinguistic assessment for real-world mobile health (mHealth) monitoring: an initial study of reverberation effects on speech. 2373-2377
Pathological Speech Analysis 1
- Leif E. R. Simmatis, Timothy Pommeé, Yana Yunusova:
Multimodal Assessment of Bulbar Amyotrophic Lateral Sclerosis (ALS) Using a Novel Remote Speech Assessment App. 2378-2382 - David Martínez, Dayana Ribas, Eduardo Lleida:
On the Use of High Frequency Information for Voice Pathology Classification. 2383-2387 - Anna Favaro, Tianyu Cao, Thomas Thebaud, Jesús Villalba, Ankur A. Butala, Najim Dehak, Laureano Moro-Velázquez:
Do Phonatory Features Display Robustness to Characterize Parkinsonian Speech Across Corpora? 2388-2392 - Sudarsana Reddy Kadiri, Manila Kodali, Paavo Alku:
Severity Classification of Parkinson's Disease from Speech using Single Frequency Filtering-based Features. 2393-2397 - Michal Simek, Tomás Kouba, Michal Novotný, Tereza Tykalová, Jan Rusz:
Comparison of acoustic measures of dysphonia in Parkinson's disease and Huntington's disease: Effect of sex and speaking task. 2398-2402 - Lucía Gómez-Zaragozá, Simone Wills, Cristian Tejedor García, Javier Marín-Morales, Mariano Alcañiz, Helmer Strik:
Alzheimer Disease Classification through ASR-based Transcriptions: Exploring the Impact of Punctuation and Pauses. 2403-2407
Multimodal Speech Emotion Recognition
- Taesik Gong, Josh Belanich, Krishna Somandepalli, Arsha Nagrani, Brian Eoff, Brendan Jou:
LanSER: Language-Model Supported Speech Emotion Recognition. 2408-2412 - Jiachen Luo, Huy Phan, Joshua D. Reiss:
Fine-tuned RoBERTa Model with a CNN-LSTM Network for Conversational Emotion Recognition. 2413-2417 - Eimear Stanley, Eric DeMattos, Anita Klementiev, Piotr Ozimek, Georgia Clarke, Michael Berger, Dimitri Palaz:
Emotion Label Encoding Using Word Embeddings for Speech Emotion Recognition. 2418-2422 - Zhongjie Li, Gaoyan Zhang, Longbiao Wang, Jianwu Dang:
Discrimination of the Different Intents Carried by the Same Text Through Integrating Multimodal Information. 2423-2427 - Zhi Li, Ryu Takeda, Takahiro Hara:
Meta-domain Adversarial Contrastive Learning for Alleviating Individual Bias in Self-sentiment Predictions. 2428-2432 - Ziping Zhao, Tian Gao, Haishuai Wang, Björn W. Schuller:
SWRR: Feature Map Classifier Based on Sliding Window Attention and High-Response Feature Reuse for Multimodal Emotion Recognition. 2433-2437
Speech Coding and Enhancement 2
- Xinmeng Xu, Weiping Tu, Yuhong Yang:
PCNN: A Lightweight Parallel Conformer Neural Network for Efficient Monaural Speech Enhancement. 2438-2442 - Chang Han, Xinmeng Xu, Weiping Tu, Yuhong Yang, Yajie Liu:
Exploring the Interactions Between Target Positive and Negative Information for Acoustic Echo Cancellation. 2443-2447 - Pavel Andreev, Nicholas Babaev, Azat Saginbaev, Ivan Shchekotov, Aibek Alanov:
Iterative autoregression: a novel trick to improve your low-latency speech enhancement model. 2448-2452 - Pin-Jui Ku, Chao-Han Huck Yang, Sabato Marco Siniscalchi, Chin-Hui Lee:
A Multi-dimensional Deep Structured State Space Approach to Speech Enhancement Using Small-footprint Models. 2453-2457 - Lior Frenkel, Jacob Goldberger, Shlomo E. Chazan:
Domain Adaptation for Speech Enhancement in a Large Domain Gap. 2458-2462 - Vasily Zadorozhnyy, Qiang Ye, Kazuhito Koishida:
SCP-GAN: Self-Correcting Discriminator Optimization for Training Consistency Preserving Metric GAN on Speech Enhancement Tasks. 2463-2467 - Liang Liu, Haixin Guan, Jinlong Ma, Wei Dai, Guangyong Wang, Shaowei Ding:
A Mask Free Neural Network for Monaural Speech Enhancement. 2468-2472 - Li-Wei Chen, Yao-Fei Cheng, Hung-Shin Lee, Yu Tsao, Hsin-Min Wang:
A Training and Inference Strategy Using Noisy and Enhanced Speech as Target for Speech Enhancement without Clean Speech. 2473-2477 - Ashutosh Pandey, Ke Tan, Buye Xu:
A Simple RNN Model for Lightweight, Low-compute and Low-latency Multichannel Speech Enhancement in the Time Domain. 2478-2482 - Jianwei Yu, Hangting Chen, Yi Luo, Rongzhi Gu, Chao Weng:
High Fidelity Speech Enhancement with Band-split RNN. 2483-2487 - Jiuxin Lin, Peng Wang, Heinrich Dinkel, Jun Chen, Zhiyong Wu, Zhiyong Yan, Yongqing Wang, Junbo Zhang, Yujun Wang:
Focus on the Sound around You: Monaural Target Speaker Extraction via Distance and Speaker Information. 2488-2492 - Anton Kovalyov, Kashyap Patel, Issa M. S. Panahi:
DFSNet: A Steerable Neural Beamformer Invariant to Microphone Array Configuration for Real-Time, Low-Latency Speech Enhancement. 2493-2497 - Xuechen Liu, Md. Sahidullah, Kong Aik Lee, Tomi Kinnunen:
Speaker-Aware Anti-spoofing. 2498-2502 - Shoko Araki, Ayako Yamamoto, Tsubasa Ochiai, Kenichi Arai, Atsunori Ogawa, Tomohiro Nakatani, Toshio Irino:
Impact of Residual Noise and Artifacts in Speech Enhancement Errors on Intelligibility of Human and Machine. 2503-2507 - Marvin Sach, Jan Franzen, Bruno Defraene, Kristoff Fluyt, Maximilian Strake, Wouter Tirry, Tim Fingscheidt:
EffCRN: An Efficient Convolutional Recurrent Network for High-Performance Speech Enhancement. 2508-2512 - JungPhil Park, Jeong-Hwan Choi, Yungyeo Kim, Joon-Hyuk Chang:
HAD-ANC: A Hybrid System Comprising an Adaptive Filter and Deep Neural Networks for Active Noise Control. 2513-2517 - Minghang Chu, Jing Wang, Yaoyao Ma, Zhiwei Fan, Mengtao Yang, Chao Xu, Zhi Tao, Di Wu:
MSAF: A Multiple Self-Attention Field Method for Speech Enhancement. 2518-2522 - Hangting Chen, Jianwei Yu, Yi Luo, Rongzhi Gu, Weihua Li, Zhuocheng Lu, Chao Weng:
Ultra Dual-Path Compression For Joint Echo Cancellation And Noise Suppression. 2523-2527 - Yixin Wan, Yuan Zhou, Xiulian Peng, Kai-Wei Chang, Yan Lu:
ABC-KD: Attention-Based-Compression Knowledge Distillation for Deep Learning-Based Noise Suppression. 2528-2532 - Lorenz Diener, Marju Purin, Sten Sootla, Ando Saabas, Robert Aichner, Ross Cutler:
PLCMOS - A Data-driven Non-intrusive Metric for The Evaluation of Packet Loss Concealment Algorithms. 2533-2537
Phonetics, Phonology, and Prosody 1
- Petra Wagner, Simon Betz:
Effects of Meter, Genre and Experience on Pausing, Lengthening and Prosodic Phrasing in German Poetry Reading. 2538-2542 - Tünde Szalay, John Holik, Duy Duong Nguyen, James Morandini, Catherine J. Madill:
Comparing first spectral moment of Australian English /s/ between straight and gay voices using three analysis window sizes. 2543-2547 - Chihiro Taguchi, Yusuke Sakai, Parisa Haghani, David Chiang:
Universal Automatic Phonetic Transcription into the International Phonetic Alphabet. 2548-2552 - Linda Gerlach, Kirsty McDougall, Finnian Kelly, Anil Alexander:
Voice Twins: Discovering Extremely Similar-sounding, Unrelated Speakers. 2553-2557 - Hannah Hedegard, Andrea Fröhlich, Fabian Tomaschek, Carina Steiner, Adrian Leemann:
Filling the population statistics gap: Swiss German reference data on F0 and speech tempo for forensic contexts. 2558-2562 - Mathilde Hutin, Liesbeth Degand, Marc Allassonnière-Tang:
Investigating the Syntax-Discourse Interface in the Phonetic Implementation of Discourse Markers. 2563-2567 - Robert Essery, Philip Harrison, Vincent Hughes:
Evaluation of a Forensic Automatic Speaker Recognition System with Emotional Speech Recordings. 2568-2572 - Emily Ahn, Gina-Anne Levow, Richard A. Wright, Eleanor Chodroff:
An Outlier Analysis of Vowel Formants from a Corpus Phonetics Pipeline. 2573-2577 - Liao Qu, Xianwei Zou, Xiang Li, Yandong Wen, Rita Singh, Bhiksha Raj:
The Hidden Dance of Phonemes and Visage: Unveiling the Enigmatic Link between Phonemes and Facial Features. 2578-2582 - Reed Blaylock, Shrikanth Narayanan:
Beatboxing Kick Drum Kinematics. 2583-2587 - Huali Zhou, Xianming Bei, Nengheng Zheng, Qinglin Meng:
Effects of hearing loss and amplification on Mandarin consonant perception. 2588-2592 - Roland Adams, Calbert Graham:
An Acoustic Analysis of Fricative Variation in Three Accents of English. 2593-2597 - Karolina Bros:
Acoustic cues to stress perception in Spanish - a mismatch negativity study. 2598-2602 - Mitko Sabev, Bistra Andreeva, Christoph Gabriel, Jonas Gruenke:
Bulgarian Unstressed Vowel Reduction: Received Views vs Corpus Findings. 2603-2607 - Shelly Jain, Priyanshi Pal, Anil Kumar Vuppala, Prasanta Kumar Ghosh, Chiranjeevi Yarra:
An Investigation of Indian Native Language Phonemic Influences on L2 English Pronunciations. 2608-2612 - Hye-Sook Park, Sunhee Kim:
Identifying Stable Sections for Formant Frequency Extraction of French Nasal Vowels Based on Difference Thresholds. 2613-2617 - Nicolas Audibert, Francesca Carbone, Maud Champagne-Lavau, Aurélien Said Housseini, Caterina Petrone:
Evaluation of delexicalization methods for research on emotional speech. 2618-2622
Spoken Dialog Systems and Conversational Analysis 2
- Jay Kejriwal, Stefan Benus:
Relationship between auditory and semantic entrainment using Deep Neural Networks (DNN). 2623-2627 - Jay Kejriwal, Stefan Benus, Lina Maria Rojas-Barahona:
Unsupervised Auditory and Semantic Entrainment Models with Deep Neural Networks. 2628-2632 - Elizabeth Nielsen, Mark Steedman, Sharon Goldwater:
Parsing dialog turns with prosodic features in English. 2633-2637 - Toshiki Muromachi, Yoshinobu Kano:
Estimation of Listening Response Timing by Generative Model and Parameter Control of Response Substantialness Using Dynamic-Prompt-Tune. 2638-2642 - Tahiya Chowdhury, Verónica Romero, Amanda Stent:
Parameter Selection for Analyzing Conversations with Autism Spectrum Disorder. 2643-2647 - Sai Srujana Buddi, Utkarsh Oggy Sarawgi, Tashweena Heeramun, Karan Sawnhey, Ed Yanosik, Saravana Rathinam, Saurabh Adya:
Efficient Multimodal Neural Networks for Trigger-less Voice Assistants. 2648-2652 - Rachel Ostrand, Victor S. Ferreira, David Piorkowski:
Rapid Lexical Alignment to a Conversational Agent. 2653-2657 - Fuma Kurata, Mao Saeki, Shinya Fujie, Yoichi Matsuyama:
Multimodal Turn-Taking Model Using Visual Cues for End-of-Utterance Prediction in Spoken Dialogue Systems. 2658-2662 - Nobukatsu Hojo, Saki Mizuno, Satoshi Kobashikawa, Ryo Masumura, Mana Ihori, Hiroshi Sato, Tomohiro Tanaka:
Audio-Visual Praise Estimation for Conversational Video based on Synchronization-Guided Multimodal Transformer. 2663-2667 - Jin Sakuma, Shinya Fujie, Huaibo Zhao, Tetsunori Kobayashi:
Improving the response timing estimation for spoken dialogue systems by reducing the effect of speech recognition delay. 2668-2672 - Keulbit Kim, Namhyun Cho:
Focus-attention-enhanced Crossmodal Transformer with Metric Learning for Multimodal Speech Emotion Recognition. 2673-2677 - Haotian Wang, Jun Du, Hengshun Zhou, Chin-Hui Lee, Yuling Ren, Jiangjiang Zhao:
A Multiple-Teacher Pruning Based Self-Distillation (MT-PSD) Approach to Model Compression for Audio-Visual Wake Word Spotting. 2678-2682 - Anika A. Spiesberger, Andreas Triantafyllopoulos, Iosif Tsangko, Björn W. Schuller:
Abusive Speech Detection in Indic Languages Using Acoustic Features. 2683-2687 - Digvijay Ingle, Ayush Kumar, Jithendra Vepa:
Listening To Silences In Contact Center Conversations Using Textual Cues. 2688-2692 - Heuiyeen Yeen, Minju Kim, Myoung-Wan Koo:
I Learned Error, I Can Fix It! : A Detector-Corrector Structure for ASR Error Calibration. 2693-2697 - Maeva Garnier, Éric Le Ferrand, Fabien Ringeval:
Verbal and nonverbal feedback signals in response to increasing levels of miscommunication. 2698-2702 - Shahin Amiriparian, Lukas Christ, Regina Kushtanova, Maurice Gerczuk, Alexandra Teynor, Björn W. Schuller:
Speech-Based Classification of Defensive Communication: A Novel Dataset and Results. 2703-2707 - Sarenne Wallbridge, Peter Bell, Catherine Lai:
Quantifying the perceptual value of lexical and non-lexical channels in speech. 2708-2712 - Kazuya Tsubokura, Yurie Iribe, Norihide Kitaoka:
Relationships Between Gender, Personality Traits and Features of Multi-Modal Data to Responses to Spoken Dialog Systems Breakdown. 2713-2717 - Huan Zhao, Bo Li, Zixing Zhang:
Speaker-aware Cross-modal Fusion Architecture for Conversational Emotion Recognition. 2718-2722
Analysis of Speech and Audio Signals 2
- Zhiheng Liao, Feifei Xiong, Juan Luo, Minjie Cai, Eng Siong Chng, Jinwei Feng, Xionghu Zhong:
Blind Estimation of Room Impulse Response from Monaural Reverberant Speech with Segmental Generative Neural Network. 2723-2727 - Xin Ren, Juan Luo, Xionghu Zhong, Minjie Cai:
Emotion-Aware Audio-Driven Face Animation via Contrastive Feature Disentanglement. 2728-2732 - Kanta Shimonishi, Kota Dohi, Yohei Kawaguchi:
Anomalous Sound Detection Based on Sound Separation. 2733-2737 - Vitória S. Fahed, Emer P. Doheny, Madeleine M. Lowery:
Random Forest Classification of Breathing Phases from Audio Signals Recorded using Mobile Devices. 2738-2742 - Youngdo Ahn, Chengyi Wang, Yu Wu, Jong Won Shin, Shujie Liu:
GRAVO: Learning to Generate Relevant Audio from Visual Features with Noisy Online Videos. 2743-2747 - Wanyue Zhai, Mark Hasegawa-Johnson:
Wav2ToBI: a new approach to automatic ToBI transcription. 2748-2752 - Lijian Gao, Qirong Mao, Ming Dong:
Joint-Former: Jointly Regularized and Locally Down-sampled Conformer for Semi-supervised Sound Event Detection. 2753-2757 - Chirag Goel, Surya Koppisetti, Ben Colman, Ali Shahriyari, Gaurav Bharaj:
Towards Attention-based Contrastive Learning for Audio Spoof Detection. 2758-2762 - Yifei Xin, Xiulian Peng, Yan Lu:
Masked Audio Modeling with CLAP and Multi-Objective Learning. 2763-2767 - Manuele Rusci, Tinne Tuytelaars:
Few-Shot Open-Set Learning for On-Device Customization of KeyWord Spotting Systems. 2768-2772 - Abdul Hameed Azeemi, Ihsan Ayyub Qazi, Agha Ali Raza:
Self-Supervised Dataset Pruning for Efficient Training in Audio Anti-spoofing. 2773-2777 - W. Ronny Huang, Hao Zhang, Shankar Kumar, Shuo-Yiin Chang, Tara N. Sainath:
Semantic Segmentation with Bidirectional Language Models Improves Long-form ASR. 2778-2782 - Théo Mariotte, Anthony Larcher, Silvio Montrésor, Jean-Hugh Thomas:
Multi-microphone Automatic Speech Segmentation in Meetings Based on Circular Harmonics Features. 2783-2787 - Jing Li, Yanhua Long, Yijie Li, Dongxing Xu:
Advanced RawNet2 with Attention-based Channel Masking for Synthetic Speech Detection. 2788-2792 - Juan Carlos Martinez-Sevilla, María Alfaro-Contreras, Jose J. Valero-Mas, Jorge Calvo-Zaragoza:
Insights into end-to-end audio-to-score transcription with real recordings: A case study with saxophone works. 2793-2797 - Yuan Gong, Sameer Khurana, Leonid Karlinsky, James R. Glass:
Whisper-AT: Noise-Robust Automatic Speech Recognizers are Also Strong General Audio Event Taggers. 2798-2802 - Jingran Gong, Ning Chen:
Synthetic Voice Spoofing Detection based on Feature Pyramid Conformer. 2803-2807 - Yuankun Xie, Haonan Cheng, Yutian Wang, Long Ye:
Learning A Self-Supervised Domain-Invariant Feature Representation for Generalized Audio Deepfake Detection. 2808-2812 - Mine Kerpicci, Van Nguyen, Shuhua Zhang, Erik Visser:
Application of Knowledge Distillation to Multi-Task Speech Representation Learning. 2813-2817 - Xilin Jiang, Yinghao Aaron Li, Nima Mesgarani:
DeCoR: Defy Knowledge Forgetting by Predicting Earlier Audio Codes. 2818-2822 - Antonio Almudévar, Alfonso Ortega, Luis Vicente, Antonio Miguel, Eduardo Lleida:
Variational Classifier for Unsupervised Anomalous Sound Detection under Domain Generalization. 2823-2827 - Jiu Feng, Mehmet Hamza Erol, Joon Son Chung, Arda Senocak:
FlexiAST: Flexibility is What AST Needs. 2828-2832 - Ji Won Yoon, Seok Min Kim, Nam Soo Kim:
MCR-Data2vec 2.0: Improving Self-supervised Speech Pre-training via Model-level Consistency Regularization. 2833-2837 - Xubo Liu, Qiushi Huang, Xinhao Mei, Haohe Liu, Qiuqiang Kong, Jianyuan Sun, Shengchen Li, Tom Ko, Yu Zhang, H. Lilian Tang, Mark D. Plumbley, Volkan Kiliç, Wenwu Wang:
Visually-Aware Audio Captioning With Adaptive Audio-Visual Attention. 2838-2842
Speech Coding: Privacy
- Apiwat Ditthapron, Emmanuel O. Agu, Adam C. Lammert:
Masking Kernel for Learning Energy-Efficient Representations for Speaker Recognition and Mobile Health. 2843-2847 - Bajian Xiang, Hongkun Liu, Zedong Wu, Su Shen, Xiangdong Zhang:
eSTImate: A Real-time Speech Transmission Index Estimator With Speech Enhancement Auxiliary Task Using Self-Attention Feature Pyramid Network. 2848-2852 - Junyu Wang:
Efficient Encoder-Decoder and Dual-Path Conformer for Comprehensive Feature Learning in Speech Enhancement. 2853-2857 - Minh Tran, Mohammad Soleymani:
Privacy-preserving Representation Learning for Speech Understanding. 2858-2862 - Michele Panariello, Massimiliano Todisco, Nicholas W. D. Evans:
Vocoder drift in x-vector-based speaker anonymization. 2863-2867 - Michele Panariello, Wanying Ge, Hemlata Tak, Massimiliano Todisco, Nicholas W. D. Evans:
Malafide: a novel adversarial convolutive noise attack against deepfake and spoofing detection systems. 2868-2872
Analysis of Neural Speech Representations
- Salah Zaiem, Youcef Kemiche, Titouan Parcollet, Slim Essid, Mirco Ravanelli:
Speech Self-Supervised Representation Benchmarking: Are We Doing it Right? 2873-2877 - Olivier Zhang, Olivier Le Blouch, Nicolas Gengembre, Damien Lolive:
An extension of disentanglement metrics and its application to voice. 2878-2882 - Badr M. Abdullah, Mohammed Maqsood Shaik, Bernd Möbius, Dietrich Klakow:
An Information-Theoretic Analysis of Self-supervised Discrete Representations of Speech. 2883-2887 - Takanori Ashihara, Takafumi Moriya, Kohei Matsuura, Tomohiro Tanaka, Yusuke Ijima, Taichi Asami, Marc Delcroix, Yukinori Honma:
SpeechGLUE: How Well Can Self-Supervised Speech Models Capture Linguistic Knowledge? 2888-2892 - Akira Sasou, Yang Chen:
Comparison of GIF- and SSL-based Features in Pathological-voice Detection. 2893-2897 - Hanyu Meng, Vidhyasaharan Sethu, Eliathamby Ambikairajah:
What is Learnt by the LEArnable Front-end (LEAF)? Adapting Per-Channel Energy Normalisation (PCEN) to Noisy Conditions. 2898-2902
End-to-end ASR
- Ryo Masumura, Naoki Makishima, Taiga Yamane, Yoshihiko Yamazaki, Saki Mizuno, Mana Ihori, Mihiro Uchida, Keita Suzuki, Hiroshi Sato, Tomohiro Tanaka, Akihiko Takashima, Satoshi Suzuki, Takafumi Moriya, Nobukatsu Hojo, Atsushi Ando:
End-to-End Joint Target and Non-Target Speakers ASR. 2903-2907 - Xianzhao Chen, Yist Y. Lin, Kang Wang, Yi He, Zejun Ma:
Improving Frame-level Classifier for Word Timings with Non-peaky CTC in End-to-End Automatic Speech Recognition. 2908-2912 - Naoki Makishima, Keita Suzuki, Satoshi Suzuki, Atsushi Ando, Ryo Masumura:
Joint Autoregressive Modeling of End-to-End Multi-Talker Overlapped Speech Recognition and Utterance-level Timestamp Prediction. 2913-2917 - Yuchen Hu, Nana Hou, Chen Chen, Eng Siong Chng:
Dual-Path Style Learning for End-to-End Noise-Robust Speech Recognition. 2918-2922 - Xuefei Wang, Yanhua Long, Yijie Li, Haoran Wei:
Multi-pass Training and Cross-information Fusion for Low-resource End-to-end Accented Speech Recognition. 2923-2927 - Vladimir Bataev, Roman Korostik, Evgeny Shabalin, Vitaly Lavrukhin, Boris Ginsburg:
Text-only domain adaptation for end-to-end ASR using integrated text-to-mel-spectrogram generator. 2928-2932
Spoken Language Understanding, Summarization, and Information Retrieval
- He Huang, Jagadeesh Balam, Boris Ginsburg:
Leveraging Pretrained ASR Encoders for Effective and Efficient End-to-End Speech Intent Classification and Slot Filling. 2933-2937 - Heerin Yang, Seung-won Hwang, Jungmin So:
Relation-based Counterfactual Data Augmentation and Contrastive Learning for Robustifying Natural Language Inference Models. 2938-2942 - Kohei Matsuura, Takanori Ashihara, Takafumi Moriya, Tomohiro Tanaka, Takatomo Kano, Atsunori Ogawa, Marc Delcroix:
Transfer Learning from Pre-trained Language Models Improves End-to-End Speech Summarization. 2943-2947 - Soham Deshmukh, Benjamin Elizalde, Huaming Wang:
Audio Retrieval with WavText5K and CLAP Training. 2948-2952 - Umberto Cappellazzo, Muqiao Yang, Daniele Falavigna, Alessio Brutti:
Sequence-Level Knowledge Distillation for Class-Incremental End-to-End Spoken Language Understanding. 2953-2957 - Jen-Tzung Chien, Shang-En Li:
Contrastive Disentangled Learning for Memory-Augmented Transformer. 2958-2962
Invariant and Robust Pre-trained Acoustic Models
- Maureen de Seyssel, Marvin Lavechin, Hadrien Titeux, Arthur Thomas, Gwendal Virlet, Andrea Santos Revilla, Guillaume Wisniewski, Bogdan Ludusan, Emmanuel Dupoux:
ProsAudit, a prosodic benchmark for self-supervised speech models. 2963-2967 - Oli Danyi Liu, Hao Tang, Sharon Goldwater:
Self-supervised Predictive Coding Models Encode Speaker and Phonetic Information in Orthogonal Subspaces. 2968-2972 - Mark Hallap, Emmanuel Dupoux, Ewan Dunbar:
Evaluating context-invariance in unsupervised speech representations. 2973-2977 - Chutong Meng, Junyi Ao, Tom Ko, Mingxuan Wang, Haizhou Li:
CoBERT: Self-Supervised Speech Representation Learning Through Code Representation Learning. 2978-2982 - Heng-Jui Chang, Alexander H. Liu, James R. Glass:
Self-supervised Fine-tuning for Improved Content Representations by Speaker-invariant Clustering. 2983-2987 - Jingru Lin, Xianghu Yue, Junyi Ao, Haizhou Li:
Self-Supervised Acoustic Word Embedding Learning via Correspondence Transformer Encoder. 2988-2992
Pathological Speech Analysis 2
- Mary Paterson, James Moor, Luisa Cutillo:
A Pipeline to Evaluate the Effects of Noise on Machine Learning Detection of Laryngeal Cancer. 2993-2997 - Pingyue Zhang, Mengyue Wu, Kai Yu:
ReCLR: Reference-Enhanced Contrastive Learning of Audio Representation for Depression Detection. 2998-3002 - José Vicente Egas López, Veronika Svindt, Judit Bóna, Ildikó Hoffmann, Gábor Gosztolya:
Automated Multiple Sclerosis Screening Based on Encoded Speech Representations. 3003-3007 - Thomas Melistas, Lefteris Kapelonis, Nikolaos Antoniou, Petros Mitseas, Dimitris Sgouropoulos, Theodoros Giannakopoulos, Athanasios Katsamanis, Shrikanth Narayanan:
Cross-Lingual Features for Alzheimer's Dementia Detection from Speech. 3008-3012 - Mario Zusag, Laurin Wagner, Theresa Bloder:
Careful Whisper - leveraging advances in automatic speech recognition for robust and interpretable aphasia subtype classification. 3013-3017 - Jenthe Thienpondt, Caroline M. Speksnijder, Kris Demuynck:
Behavioral Analysis of Pathological Speaker Embeddings of Patients During Oncological Treatment of Oral Cancer. 3018-3022
Speech Synthesis: Representation Learning
- Hyungchan Yoon, Seyun Um, Changhwan Kim, Hong-Goo Kang:
Adversarial Learning of Intermediate Acoustic Feature for End-to-End Lightweight Text-to-Speech. 3023-3027 - Cheng-Ping Hsieh, Subhankar Ghosh, Boris Ginsburg:
Adapter-Based Extension of Multi-Speaker Text-To-Speech Model for New Speakers. 3028-3032 - Ramanan Sivaguru, Vasista Sai Lodagala, Srinivasan Umesh:
SALTTS: Leveraging Self-Supervised Speech Representations for improved Text-to-Speech Synthesis. 3033-3037 - Heeseung Kim, Sungwon Kim, Jiheum Yeom, Sungroh Yoon:
UnitSpeech: Speaker-adaptive Speech Synthesis with Untranscribed Data. 3038-3042 - Dinh Son Dang, Tung Lam Nguyen, Bao Thang Ta, Tien Thanh Nguyen, Thi Ngoc Anh Nguyen, Dang Linh Le, Nhat Minh Le, Van Hai Do:
LightVoc: An Upsampling-Free GAN Vocoder Based On Conformer And Inverse Short-time Fourier Transform. 3043-3047 - Yuki Saito, Shinnosuke Takamichi, Eiji Iimori, Kentaro Tachibana, Hiroshi Saruwatari:
ChatGPT-EDSS: Empathetic Dialogue Speech Synthesis Trained from ChatGPT-derived Context Word Embeddings. 3048-3052
Speech Perception, Production, and Acquisition 1
- Jian Gao, Hanbo Sun, Cheng Cao, Zheng Du:
Human Transcription Quality Improvement. 3053-3057 - Olympia Simantiraki, Yannis Pantazis, Martin Cooke:
The effect of masking noise on listeners' spectral tilt preferences. 3058-3062 - Anaïs Tran Ngoc, Fanny Meunier, Julien Meyer:
The Effect of Whistled Vowels on Whistled Word Categorization for Naive Listeners. 3063-3067 - Puja Bharati, Sabyasachi Chandra, Shayamal Kumar Das Mandal:
Automatic Deep Neural Network-Based Segmental Pronunciation Error Detection of L2 English Speech (L1 Bengali). 3068-3072 - Lixia Hao, Qi Gong, Jinsong Zhang:
The effect of stress on Mandarin tonal perception in continuous speech for Spanish-speaking learners. 3073-3077 - Amélie Elmerich, Jiayin Gao, Angélique Amelot, Lise Crevier-Buchman, Shinji Maeda:
Combining acoustic and aerodynamic data collection: A perceptual evaluation of acoustic distortions. 3078-3082 - Benjamin Elie, Alice Turk:
Estimating virtual targets for lingual stop consonants using general Tau theory. 3083-3087 - Mark Gibson:
Using Random Forests to classify language as a function of syllable timing in two groups: children with cochlear implants and with normal hearing. 3088-3092