


ASRU 2021: Cartagena, Colombia
- IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2021, Cartagena, Colombia, December 13-17, 2021. IEEE 2021, ISBN 978-1-6654-3739-4

- Christian Huber, Juan Hussain, Sebastian Stüker, Alexander Waibel:

Instant One-Shot Word-Learning for Context-Specific Neural Sequence-to-Sequence Speech Recognition. 1-7 - Maxime Burchi, Valentin Vielzeuf:

Efficient Conformer: Progressive Downsampling and Grouped Attention for Automatic Speech Recognition. 8-15 - Florian Boyer, Yusuke Shinohara, Takaaki Ishii, Hirofumi Inaguma, Shinji Watanabe:

A Study of Transducer Based End-to-End ASR with ESPnet: Architecture, Auxiliary Loss and Decoding Strategies. 16-23 - Norbert Braunschweiler, Rama Doddipatla, Simon Keizer, Svetlana Stoyanchev:

A Study on Cross-Corpus Speech Emotion Recognition and Data Augmentation. 24-30 - Sebastian P. Bayerl, Aniruddha Tammewar, Korbinian Riedhammer, Giuseppe Riccardi:

Detecting Emotion Carriers by Combining Acoustic and Lexical Representations. 31-38 - Raghavendra Pappagari, Piotr Zelasko, Jesús Villalba, Laureano Moro-Velázquez, Najim Dehak:

Beyond Isolated Utterances: Conversational Emotion Recognition. 39-46 - Yosuke Higuchi, Nanxin Chen, Yuya Fujita, Hirofumi Inaguma, Tatsuya Komatsu, Jaesong Lee, Jumon Nozaki, Tianzi Wang, Shinji Watanabe:

A Comparative Study on Non-Autoregressive Modelings for Speech-to-Text Generation. 47-54 - Fu-An Chao, Shao-Wei Fan-Jiang, Bi-Cheng Yan, Jeih-weih Hung, Berlin Chen:

TENET: A Time-Reversal Enhancement Network for Noise-Robust ASR. 55-61 - Liqiang He, Shulin Feng, Dan Su, Dong Yu:

Latency-Controlled Neural Architecture Search for Streaming Speech Recognition. 62-67 - Sei Ueno, Masato Mimura, Shinsuke Sakai, Tatsuya Kawahara:

Data Augmentation for ASR Using TTS Via a Discrete Representation. 68-75 - Keqi Deng, Songjun Cao, Yike Zhang, Long Ma:

Improving Hybrid CTC/Attention End-to-End Speech Recognition with Pretrained Acoustic and Language Models. 76-82 - Linchen Zhu, Wenjie Liu, Linquan Liu, Edward Lin:

Improving ASR Error Correction Using N-Best Hypotheses. 83-89 - Prachi Singh, Sriram Ganapathy:

Self-Supervised Metric Learning With Graph Clustering For Speaker Diarization. 90-97 - Shota Horiguchi, Shinji Watanabe, Paola García, Yawen Xue, Yuki Takashima, Yohei Kawaguchi:

Towards Neural Diarization for Unlimited Numbers of Speakers Using Global and Local Attractors. 98-105 - Yi Ma, Kong Aik Lee, Ville Hautamäki, Haizhou Li:

PL-EESR: Perceptual Loss Based End-to-End Robust Speaker Representation Extraction. 106-113 - Naohiro Tawara, Atsunori Ogawa, Yuki Kitagishi, Hosana Kamiyama, Yusuke Ijima:

Robust Speech-Age Estimation Using Local Maximum Mean Discrepancy Under Mismatched Recording Conditions. 114-121 - Meng Liu, Longbiao Wang, Kong Aik Lee, Hanyi Zhang, Chang Zeng, Jianwu Dang:

DeepLip: A Benchmark for Deep Learning-Based Audio-Visual Lip Biometrics. 122-129 - Jeong-Hwan Choi, Joon-Young Yang, Joon-Hyuk Chang:

Short-Utterance Embedding Enhancement Method Based on Time Series Forecasting Technique for Text-Independent Speaker Verification. 130-137 - Yan Gao, Titouan Parcollet, Nicholas D. Lane:

Distilling Knowledge from Ensembles of Acoustic Models for Joint CTC-Attention End-to-End Speech Recognition. 138-145 - Biel Tura, Santiago Escuder, Ferran Diego, Carlos Segura, Jordi Luque:

Efficient Keyword Spotting by Capturing Long-Range Interactions with Temporal Lambda Networks. 146-153 - Mohan Li, Rama Doddipatla:

Improving HS-DACS Based Streaming Transformer ASR with Deep Reinforcement Learning. 154-161 - Xianrui Zheng, Chao Zhang, Philip C. Woodland:

Adapting GPT, GPT-2 and BERT Language Models for Speech Recognition. 162-168 - Jakob Poncelet, Hugo Van hamme:

Comparison of Self-Supervised Speech Pre-Training Methods on Flemish Dutch. 169-176 - Timo Lohrenz, Patrick Schwarz, Zhengyang Li, Tim Fingscheidt:

Relaxed Attention: A Simple Method to Boost Performance of End-to-End Automatic Speech Recognition. 177-184 - Xuechen Liu, Md. Sahidullah, Tomi Kinnunen:

Optimized Power Normalized Cepstral Coefficients Towards Robust Deep Speaker Verification. 185-190 - Pierre Champion, Thomas Thebaud, Gaël Le Lan, Anthony Larcher, Denis Jouvet:

On the Invertibility of a Voice Privacy System Using Embedding Alignment. 191-197 - Jingyu Li, Si Ioi Ng, Tan Lee:

Improving Text-Independent Speaker Verification with Auxiliary Speakers Using Graph. 198-205 - Li Zhang, Qing Wang, Lei Xie:

Duality Temporal-Channel-Frequency Attention Enhanced Speaker Representation Learning. 206-213 - Fangyuan Wang, Zhigang Song, Hongchen Jiang, Bo Xu:

MACCIF-TDNN: Multi Aspect Aggregation of Channel and Context Interdependence Features in TDNN-Based Speaker Verification. 214-219 - Zhuo Li, Ce Fang, Runqiu Xiao, Wenchao Wang, Yonghong Yan:

SI-Net: Multi-Scale Context-Aware Convolutional Block for Speaker Verification. 220-227 - Xuankai Chang, Takashi Maekaku, Pengcheng Guo, Jing Shi, Yen-Ju Lu, Aswin Shanmugam Subramanian, Tianzi Wang, Shu-Wen Yang, Yu Tsao, Hung-yi Lee, Shinji Watanabe:

An Exploration of Self-Supervised Pretrained Representations for End-to-End Speech Recognition. 228-235 - Dhanush Bekal, Ashish Shenoy, Monica Sunkara, Sravan Bodapati, Katrin Kirchhoff:

Remember the Context! ASR Slot Error Correction Through Memorization. 236-243 - Yu-An Chung, Yu Zhang, Wei Han, Chung-Cheng Chiu, James Qin, Ruoming Pang, Yonghui Wu:

w2v-BERT: Combining Contrastive Learning and Masked Language Modeling for Self-Supervised Speech Pre-Training. 244-250 - Zhehuai Chen, Yu Zhang, Andrew Rosenberg, Bhuvana Ramabhadran, Gary Wang, Pedro J. Moreno:

Injecting Text in Self-Supervised Speech Pretraining. 251-258 - Anton Ratnarajah, Zhenyu Tang, Dinesh Manocha:

TS-RIR: Translated Synthetic Room Impulse Responses for Speech Augmentation. 259-266 - Peter Vieting, Christoph Lüscher, Wilfried Michel, Ralf Schlüter, Hermann Ney:

On Architectures and Training for Raw Waveform Feature Extraction in ASR. 267-274 - Rajeev Rikhye, Quan Wang, Qiao Liang, Yanzhang He, Ian McGraw:

Multi-User Voicefilter-Lite via Attentive Speaker Embedding. 275-282 - Midia Yousefi, John H. L. Hansen:

Speaker Conditioning of Acoustic Models Using Affine Transformation for Multi-Speaker Speech Recognition. 283-288 - Szu-Jui Chen, Wei Xia, John H. L. Hansen:

Scenario Aware Speech Recognition: Advancements for Apollo Fearless Steps & CHiME-4 Corpora. 289-295 - Naoyuki Kanda, Xiong Xiao, Jian Wu, Tianyan Zhou, Yashesh Gaur, Xiaofei Wang, Zhong Meng, Zhuo Chen, Takuya Yoshioka:

A Comparative Study of Modular and Joint Approaches for Speaker-Attributed ASR on Monaural Long-Form Audio. 296-303 - Tom O'Malley, Arun Narayanan, Quan Wang, Alex Park, James Walker, Nathan Howard:

A Conformer-Based ASR Frontend for Joint Acoustic Echo Cancellation, Speech Enhancement and Speech Separation. 304-311 - Arun Narayanan, Chung-Cheng Chiu, Tom O'Malley, Quan Wang, Yanzhang He:

Cross-Attention Conformer for Context Modeling in Speech Enhancement for ASR. 312-319 - Li Fu, Xiaoxiao Li, Libo Zi, Zhengchen Zhang, Youzheng Wu, Xiaodong He, Bowen Zhou:

Incremental Learning for End-to-End Automatic Speech Recognition. 320-327 - Fan Yu, Haoneng Luo, Pengcheng Guo, Yuhao Liang, Zhuoyuan Yao, Lei Xie, Yingying Gao, Leijing Hou, Shilei Zhang:

Boundary and Context Aware Training for CIF-Based Non-Autoregressive End-to-End ASR. 328-334 - Jing Zhao, Gui-Xin Shi, Guan-Bo Wang, Wei-Qiang Zhang:

Automatic Speech Recognition for Low-Resource Languages: The THUEE Systems for the IARPA OpenASR20 Evaluation. 335-341 - Chandran Savithri Anoop, Prathosh A. P., A. G. Ramakrishnan:

Unsupervised Domain Adaptation Schemes for Building ASR in Low-Resource Languages. 342-349 - Mariana Rodrigues Makiuchi, Kuniaki Uto, Koichi Shinoda:

Multimodal Emotion Recognition with High-Level Speech and Text Features. 350-357 - Zhi Zhu, Yoshinao Sato:

Speech Emotion Recognition Using Semi-Supervised Learning with Efficient Labeling Strategies. 358-365 - Jin Li, Nan Yan, Lan Wang:

Unsupervised Cross-Lingual Speech Emotion Recognition Using Pseudo Multilabel. 366-373 - Shi-wook Lee:

Ensemble of Domain Adversarial Neural Networks for Speech Emotion Recognition. 374-379 - Hayato Futami, Hirofumi Inaguma, Masato Mimura, Shinsuke Sakai, Tatsuya Kawahara:

ASR Rescoring and Confidence Estimation with Electra. 380-387 - Sachin Singh, Ashutosh Gupta, Aman Maghan, Dhananjaya Gowda, Shatrughan Singh, Chanwoo Kim:

Comparative Study of Different Tokenization Strategies for Streaming End-to-End ASR. 388-394 - Dhananjaya Gowda, Abhinav Garg, Jiyeon Kim, Mehul Kumar, Sachin Singh, Ashutosh Gupta, Ankur Kumar, Nauman Dawalatabad, Aman Maghan, Shatrughan Singh, Chanwoo Kim:

HiTNet: Byte-to-BPE Hierarchical Transcription Network for End-to-End Speech Recognition. 395-402 - Nauman Dawalatabad, Tushar Vatsal, Ashutosh Gupta, Sungsoo Kim, Shatrughan Singh, Dhananjaya Gowda, Chanwoo Kim:

Two-Pass End-to-End ASR Model Compression. 403-410 - Qinglin Zhang, Qian Chen, Yali Li, Jiaqing Liu, Wen Wang:

Sequence Model with Self-Adaptive Sliding Window for Efficient Spoken Document Segmentation. 411-418 - Bidisha Sharma, Maulik C. Madhavi, Xuehao Zhou, Haizhou Li:

Exploring Teacher-Student Learning Approach for Multi-Lingual Speech-to-Intent Classification. 419-426 - Tan Liu, Wu Guo:

Topic Classification on Spoken Documents Using Deep Acoustic and Linguistic Features. 427-432 - Shota Orihashi, Yoshihiro Yamazaki, Naoki Makishima, Mana Ihori, Akihiko Takashima, Tomohiro Tanaka, Ryo Masumura:

Hierarchical Knowledge Distillation for Dialogue Sequence Labeling. 433-440 - Jaeyun Song, Hajin Shim, Eunho Yang:

Learning How Long to Wait: Adaptively-Constrained Monotonic Multihead Attention for Streaming ASR. 441-448 - Wei Liu, Tan Lee:

Utterance-Level Neural Confidence Measure for End-to-End Children Speech Recognition. 449-456 - Kiran Praveen, Hardik B. Sailor, Abhishek Pandey:

Warped Ensembles: A Novel Technique for Improving CTC Based End-to-End Speech Recognition. 457-464 - Shun-Po Chuang, Heng-Jui Chang, Sung-Feng Huang, Hung-yi Lee:

Non-Autoregressive Mandarin-English Code-Switching Speech Recognition. 465-472 - Ashutosh Gupta, Aditya Jayasimha, Aman Maghan, Shatrughan Singh, Dhananjaya Gowda, Chanwoo Kim:

Voice to Action: Spoken Language Understanding for Memory-Constrained Systems. 473-479 - Jen-Tzung Chien, Chih-Jung Tsai:

Variational Sequential Modeling, Learning and Understanding. 480-486 - Takatomo Kano, Atsunori Ogawa, Marc Delcroix, Shinji Watanabe:

Attention-Based Multi-Hypothesis Fusion for Speech Summarization. 487-494 - Koichiro Ito, Masaki Murata, Tomohiro Ohno, Shigeki Matsubara:

Estimating the Generation Timing of Responsive Utterances by Active Listeners of Spoken Narratives. 495-502 - Feng-Ju Chang, Jing Liu, Martin Radfar, Athanasios Mouchtaris, Maurizio Omologo, Ariya Rastrow, Siegfried Kunzmann:

Context-Aware Transformer Transducer for Speech Recognition. 503-510 - Suwa Xu, Jinwon Lee, Jim Steele:

PSVD: Post-Training Compression of LSTM-Based RNN-T Models. 511-517 - Vimal Manohar, Tatiana Likhomanenko, Qiantong Xu, Wei-Ning Hsu, Ronan Collobert, Yatharth Saraf, Geoffrey Zweig, Abdelrahman Mohamed:

Kaizen: Continuously Improving Teacher Using Exponential Moving Average for Semi-Supervised Speech Recognition. 518-525 - Rui Zhao, Jian Xue, Jinyu Li, Wenning Wei, Lei He, Yifan Gong:

On Addressing Practical Challenges for RNN-Transducer. 526-533 - Felix Weninger, Marco Gaudesi, Ralf Leibold, Roberto Gemello, Puming Zhan:

Dual-Encoder Architecture with Encoder Selection for Joint Close-Talk and Far-Talk Speech Recognition. 534-540 - Mohammad Omar Khursheed, Christin Jose, Rajath Kumar, Gengshen Fu, Brian Kulis, Santosh Kumar Cheekatmalla:

Tiny-CRNN: Streaming Wakeword Detection in a Low Footprint Setting. 541-547 - Shaojin Ding, Ye Jia, Ke Hu, Quan Wang:

Textual Echo Cancellation. 548-555 - Daniel Escobar-Grisales, Cristian D. Ríos-Urrego, Diego Alexander Lopez-Santander, Jeferson David Gallo-Aristizábal, Juan Camilo Vásquez-Correa, Elmar Nöth, Juan Rafael Orozco-Arroyave:

Colombian Dialect Recognition Based on Information Extracted from Speech and Text Signals. 556-563 - Yangyang Xia, Buye Xu, Anurag Kumar:

Incorporating Real-World Noisy Speech in Neural-Network-Based Speech Enhancement Systems. 564-570 - Takuya Higuchi, Anmol Gupta, Chandra Dhir:

Multi-Task Learning with Cross Attention for Keyword Spotting. 571-578 - Xinhao Wang, Christopher Hamill:

Automatic Generation of Diagnostic Content Feedback in Spoken Language Learning and Assessment. 579-586 - Thomas Schaaf, Longxiang Zhang, Alireza Bayestehtashk, Mark C. Fuhs, Shahid Durrani, Susanne Burger, Monika Woszczyna, Thomas Polzin:

Are You Dictating to Me? Detecting Embedded Dictations in Doctor-Patient Conversations. 587-593 - Zongyang Du, Berrak Sisman, Kun Zhou, Haizhou Li:

Expressive Voice Conversion: A Joint Framework for Speaker Identity and Emotional Style Transfer. 594-601 - Mengxin Chai, Shaotong Guo, Cheng Gong, Longbiao Wang, Jianwu Dang, Ju Zhang:

Learning Language and Speaker Information for Code-Switch Speech Synthesis with Limited Data. 602-609 - Takuma Okamoto, Tomoki Toda, Hisashi Kawai:

Multi-Stream HiFi-GAN with Data-Driven Waveform Decomposition. 610-617 - Sergey Nikonorov, Berrak Sisman, Mingyang Zhang, Haizhou Li:

DEEPA: A Deep Neural Analyzer for Speech and Singing Vocoding. 618-625 - Daxin Tan, Liqun Deng, Yu Ting Yeung, Xin Jiang, Xiao Chen, Tan Lee:

EditSpeech: A Text Based Speech Editing System Using Partial Inference and Bidirectional Fusion. 626-633 - Raymond Chung, Brian Mak:

On-The-Fly Data Augmentation for Text-to-Speech Style Transfer. 634-641 - Wen-Chin Huang, Tomoki Hayashi, Xinjian Li, Shinji Watanabe, Tomoki Toda:

On Prosody Modeling for ASR+TTS Based Voice Conversion. 642-649 - Ming-Chi Yen, Wen-Chin Huang, Kazuhiro Kobayashi, Yu-Huai Peng, Shu-Wei Tsai, Yu Tsao, Tomoki Toda, Jyh-Shing Roger Jang, Hsin-Min Wang:

Mandarin Electrolaryngeal Speech Voice Conversion with Sequence-to-Sequence Modeling. 650-657 - Jiangyu Han, Wei Rao, Yanhua Long, Jiaen Liang:

Attention-Based Scaling Adaptation for Target Speech Extraction. 658-662 - Huiyu Shi, Xi Chen, Tianlong Kong, Shouyi Yin, Peng Ouyang:

GLMSnet: Single Channel Speech Separation Framework in Noisy and Reverberant Environments. 663-670 - Lu Zhang, Chenxing Li, Feng Deng, Xiaorui Wang:

Multi-Task Audio Source Separation. 671-678 - Wei Rao, Yihui Fu, Yanxin Hu, Xin Xu, Yvkai Jv, Jiangyu Han, Zhongjie Jiang, Lei Xie, Yannan Wang, Shinji Watanabe, Zheng-Hua Tan, Hui Bu, Tao Yu, Shidong Shang:

ConferencingSpeech Challenge: Towards Far-Field Multi-Channel Speech Enhancement for Video Conferencing. 679-686 - Khaled Hechmi, Trung Ngo Trong, Ville Hautamäki, Tomi Kinnunen:

VoxCeleb Enrichment for Age and Gender Recognition. 687-693 - Carlos Escolano, Marta R. Costa-jussà, José A. R. Fonollosa, Carlos Segura:

Enabling Zero-Shot Multilingual Spoken Language Translation with Language-Specific Encoders and Decoders. 694-701 - Neil Zeghidour, Olivier Teboul, David Grangier:

DIVE: End-to-End Speech Diarization Via Iterative Speaker Embedding. 702-709 - Damien Ronssin, Milos Cernak:

AC-VC: Non-Parallel Low Latency Phonetic Posteriorgrams Based Voice Conversion. 710-716 - Marvin Borsdorf, Haizhou Li, Tanja Schultz:

Target Language Extraction at Multilingual Cocktail Parties. 717-724 - Jose Antonio Lopez Saenz, Md Asif Jalal, Rosanna Milner, Thomas Hain:

Attention Based Model for Segmental Pronunciation Error Detection. 725-732 - Elizabeth Salesky, Julian Mäder, Severin Klinger:

Assessing Evaluation Metrics for Speech-to-Speech Translation. 733-740 - Songxiang Liu, Yuewen Cao, Dan Su, Helen Meng:

DiffSVC: A Diffusion Probabilistic Model for Singing Voice Conversion. 741-748 - Takaaki Saeki, Shinnosuke Takamichi, Hiroshi Saruwatari:

Low-Latency Incremental Text-to-Speech Synthesis with Distilled Context Prediction Network. 749-756 - Björn Plüster, Cornelius Weber, Leyuan Qu, Stefan Wermter:

Hearing Faces: Target Speaker Text-to-Speech Synthesis from a Face. 757-764 - Bhagyashree Mukherjee, Anusha Prakash, Hema A. Murthy:

Analysis of Conversational Speech with Application to Voice Adaptation. 765-772 - Ruolan Liu, Xue Wen, Chunhui Lu, Liming Song, June Sig Sung:

Vibrato Learning in Multi-Singer Singing Voice Synthesis. 773-779 - Guangzhi Sun, Chao Zhang, Philip C. Woodland:

Tree-Constrained Pointer Generator for End-to-End Contextual Speech Recognition. 780-787 - Nick Rossenbach, Mohammad Zeineldeen, Benedikt Hilmes, Ralf Schlüter, Hermann Ney:

Comparing the Benefit of Synthetic Training Data for Various Automatic Speech Recognition Architectures. 788-795 - Dmitriy Serdyuk, Otavio Braga, Olivier Siohan:

Audio-Visual Speech Recognition is Worth $32\times 32\times 8$ Voxels. 796-802 - Andrea Carmantini, Steve Renals, Peter Bell:

Leveraging Linguistic Knowledge for Accent Robustness of End-to-End Models. 803-810 - Abbas Khosravani, Philip N. Garner, Alexandros Lazaridis:

An Evaluation Benchmark for Automatic Speech Recognition of German-English Code-Switching. 811-816 - Abbas Khosravani, Philip N. Garner, Alexandros Lazaridis:

Learning to Translate Low-Resourced Swiss German Dialectal Speech into Standard German Text. 817-823 - Marco Gaudesi, Felix Weninger, Dushyant Sharma, Puming Zhan:

ChannelAugment: Improving Generalization of Multi-Channel ASR by Training with Input Channel Randomization. 824-829 - Chia-Yu Li, Ngoc Thang Vu:

Improving Speech Recognition on Noisy Speech via Speech Enhancement with Multi-Discriminators CycleGAN. 830-836 - Kai Wei, Thanh Tran, Feng-Ju Chang, Kanthashree Mysore Sathyendra, Thejaswi Muniyappa, Jing Liu, Anirudh Raju, Ross McGowan, Nathan Susanj, Ariya Rastrow, Grant P. Strimel:

Attentive Contextual Carryover for Multi-Turn End-to-End Spoken Language Understanding. 837-844 - Zheng Gao, Mohamed Abdelhady, Radhika Arava, Xibin Gao, Qian Hu, Wei Xiao, Thahir Mohamed:

X-SHOT: Learning to Rank Voice Applications Via Cross-Locale Shard-Based Co-Training. 845-852 - Akshat Gupta, Olivia Deng, Akruti Kushwaha, Saloni Mittal, William Zeng, Sai Krishna Rallabandi, Alan W. Black:

Intent Recognition and Unsupervised Slot Identification for Low-Resourced Spoken Dialog Systems. 853-860 - Kishan Sachdeva, Joshua Maynez, Olivier Siohan:

Action Item Detection in Meetings Using Pretrained Transformers. 861-868 - Joo-Kyung Kim, Guoyin Wang, Sungjin Lee, Young-Bum Kim:

Deciding Whether to Ask Clarifying Questions in Large-Scale Spoken Language Understanding. 869-876 - Guan-Lin Chao, Ian R. Lane:

Human-Agent Collaboration Strategies for Vision-Grounded Instruction Following. 877-884 - Binghuai Lin, Liyuan Wang:

Uncertainty-Aware Pseudo-Labeling for Spoken Language Assessment. 885-891 - Xuan Ji, Lu Lu, Fuming Fang, Jianbo Ma, Lei Zhu, Jinke Li, Dongdi Zhao, Ming Liu, Feijun Jiang:

An End-to-End Far-Field Keyword Spotting System with Neural Beamforming. 892-899 - Rohith Aralikatti, Anton Ratnarajah, Zhenyu Tang, Dinesh Manocha:

Improving Reverberant Speech Separation with Synthetic Room Impulse Responses. 900-906 - Hsin-Tien Chiang, Yi-Chiao Wu, Cheng Yu, Tomoki Toda, Hsin-Min Wang, Yih-Chun Hu, Yu Tsao:

HASA-Net: A Non-Intrusive Hearing-Aid Speech Assessment Network. 907-913 - Ankita Pasad, Ju-Chieh Chou, Karen Livescu:

Layer-Wise Analysis of a Self-Supervised Speech Representation Model. 914-921 - Hirofumi Inaguma, Siddharth Dalmia, Brian Yan, Shinji Watanabe:

Fast-MD: Fast Multi-Decoder End-to-End Speech Translation with Non-Autoregressive Hidden Intermediates. 922-929 - Xulong Zhang, Jianzong Wang, Ning Cheng, Edward Xiao, Jing Xiao:

CycleGEAN: Cycle Generative Enhanced Adversarial Network for Voice Conversion. 930-937 - Huaizhen Tang, Xulong Zhang, Jianzong Wang, Ning Cheng, Zhen Zeng, Edward Xiao, Jing Xiao:

TGAVC: Improving Autoencoder Voice Conversion with Text-Guided and Adversarial Training. 938-945 - Aolan Sun, Jianzong Wang, Ning Cheng, Methawee Tantrawenith, Zhiyong Wu, Helen Meng, Edward Xiao, Jing Xiao:

Reconstructing Dual Learning for Neural Voice Conversion Using Relatively Few Samples. 946-953 - Chuan-En Hsu, Mahdin Rohmatillah, Jen-Tzung Chien:

Multitask Generative Adversarial Imitation Learning for Multi-Domain Dialogue System. 954-961 - Asier López-Zorrilla, M. Inés Torres, Heriberto Cuayáhuitl:

Audio Embeddings Help to Learn Better Dialogue Policies. 962-968 - Christian Geishauser, Songbo Hu, Hsien-Chin Lin, Nurul Lubis, Michael Heck, Shutong Feng, Carel van Niekerk, Milica Gasic:

What does the User Want? Information Gain for Hierarchical Dialogue Policy Optimisation. 969-976 - Simon Keizer, Norbert Braunschweiler, Svetlana Stoyanchev, Rama Doddipatla:

Dialogue Strategy Adaptation to New Action Sets Using Multi-Dimensional Modelling. 977-983 - Jiyeon Kim, Mehul Kumar, Dhananjaya Gowda, Abhinav Garg, Chanwoo Kim:

Semi-Supervised Transfer Learning for Language Expansion of End-to-End Speech Recognition Models to Low-Resource Languages. 984-988 - Jiyeon Kim, Mehul Kumar, Dhananjaya Gowda, Abhinav Garg, Chanwoo Kim:

A Comparison of Streaming Models and Data Augmentation Methods for Robust Speech Recognition. 989-995 - Rongzhi Gu, Shi-Xiong Zhang, Meng Yu, Dong Yu:

3D Spatial Features for Multi-Channel Target Speech Separation. 996-1002 - Yifan Guo, Yifan Chen, Gaofeng Cheng, Pengyuan Zhang, Yonghong Yan:

Far-Field Speech Recognition Based on Complex-Valued Neural Networks and Inter-Frame Similarity Difference Method. 1003-1010 - Bo Li, Ruoming Pang, Tara N. Sainath, Anmol Gulati, Yu Zhang, James Qin, Parisa Haghani, W. Ronny Huang, Min Ma, Junwen Bai:

Scaling End-to-End Models for Large-Scale Multilingual ASR. 1011-1018 - Jiahong Yuan, Xingyu Cai, Dongji Gao, Renjie Zheng, Liang Huang, Kenneth Church:

Decoupling Recognition and Transcription in Mandarin ASR. 1019-1025 - Xiaohui Zhang, Vimal Manohar, David Zhang, Frank Zhang, Yangyang Shi, Nayan Singhal, Julian Chan, Fuchun Peng, Yatharth Saraf, Mike Seltzer:

On Lattice-Free Boosted MMI Training of HMM and CTC-Based Full-Context ASR Models. 1026-1033 - Chengrui Zhu, Keyu An, Huahuan Zheng, Zhijian Ou:

Multilingual and Crosslingual Speech Recognition Using Phonological-Vector Based Phone Embeddings. 1034-1041 - Markus Müller, Samridhi Choudhary, Clement Chung, Athanasios Mouchtaris, Siegfried Kunzmann:

In Pursuit of Babel - Multilingual End-to-End Spoken Language Understanding. 1042-1049 - Peter Wu, Jiatong Shi, Yifan Zhong, Shinji Watanabe, Alan W. Black:

Cross-Lingual Transfer for Speech Processing Using Acoustic Language Similarity. 1050-1057 - Yasufumi Moriya, Gareth J. F. Jones:

An ASR N-Best Transcript Neural Ranking Model for Spoken Content Retrieval. 1058-1064 - Shao-Wei Fan-Jiang, Bi-Cheng Yan, Tien-Hong Lo, Fu-An Chao, Berlin Chen:

Towards Robust Mispronunciation Detection and Diagnosis for L2 English Learners with Accent-Modulating Methods. 1065-1070 - Chuanbo Zhu, Ryo Hakoda, Daisuke Saito, Nobuaki Minematsu, Noriko Nakanishi, Tazuko Nishimura:

Multi-Granularity Annotation of Instantaneous Intelligibility of Learners' Utterances Based on Shadowing Techniques. 1071-1078 - Ralph Scheuerer, Tino Haderlein, Elmar Nöth, Tobias Bocklet:

Applying X-Vectors on Pathological Speech After Larynx Removal. 1079-1086 - Chao-Han Huck Yang, Linda Liu, Ankur Gandhe, Yile Gu, Anirudh Raju, Denis Filimonov, Ivan Bulyko:

Multi-Task Language Modeling for Improving Speech Recognition of Rare Words. 1087-1093 - Nay San, Martijn Bartelds, Mitchell Browne, Lily Clifford, Fiona Gibson, John Mansfield, David Nash, Jane Simpson, Myfany Turpin, Maria Vollmer, Sasha Wilmoth, Dan Jurafsky:

Leveraging Pre-Trained Representations to Improve Access to Untranscribed Speech from Endangered Languages. 1094-1101 - Wentao Zhu, Tianlong Kong, Shun Lu, Jixiang Li, Dawei Zhang, Feng Deng, Xiaorui Wang, Sen Yang, Ji Liu:

SpeechNAS: Towards Better Trade-Off Between Latency and Accuracy for Large-Scale Speaker Verification. 1102-1109 - Mickael Rouvier, Pierre-Michel Bousquet:

Studying Squeeze-and-Excitation Used in CNN for Speaker Verification. 1110-1115 - Woo Hyun Kang, Jahangir Alam, Abderrahim Fathan:

Hybrid Network with Multi-Level Global-Local Statistics Pooling for Robust Text-Independent Speaker Recognition. 1116-1123 - Zhenning Tan, Yuguang Yang, Eunjung Han, Andreas Stolcke:

Improving Speaker Identification for Shared Devices by Adapting Embeddings to Speaker Subsets. 1124-1131 - Xuechen Liu, Md. Sahidullah, Tomi Kinnunen:

Parameterized Channel Normalization for Far-Field Deep Speaker Verification. 1132-1138 - Juan Manuel Coria, Hervé Bredin, Sahar Ghannay, Sophie Rosset:

Overlap-Aware Low-Latency Online Speaker Diarization Based on End-to-End Local Segmentation. 1139-1146 - Seokhwan Kim, Yang Liu, Di Jin, Alexandros Papangelis, Karthik Gopalakrishnan, Behnam Hedayatnia, Dilek Hakkani-Tür:

"How Robust R U?": Evaluating Task-Oriented Dialogue Systems on Spoken Conversations. 1147-1154 - Sivanand Achanta, Albert Antony, Ladan Golipour, Jiangchuan Li, Tuomo Raitio, Ramya Rasipuram, Francesco Rossi, Jennifer Shi, Jaimin Upadhyay, David Winarsky, Hepeng Zhang:

On-Device Neural Speech Synthesis. 1155-1161 - Amrith Setlur, Aman Madaan, Tanmay Parekh, Yiming Yang, Alan W. Black:

Towards Using Heterogeneous Relation Graphs for End-to-End TTS. 1162-1169 - Mingqiu Wang, Hagen Soltau, Laurent El Shafey, Izhak Shafran:

Word-Level Confidence Estimation for RNN Transducers. 1170-1177 - Hira Dhamyal, Ayesha Ali, Ihsan Ayyub Qazi, Agha Ali Raza:

Using Self Attention DNNs to Discover Phonemic Features for Audio Deep Fake Detection. 1178-1184 - Raghavendra Pappagari, Piotr Zelasko, Agnieszka Mikolajczyk, Piotr Pezik, Najim Dehak:

Joint Prediction of Truecasing and Punctuation for Conversational Speech in Low-Resource Scenarios. 1185-1191















