23rd Interspeech 2022: Incheon, Korea
- Hanseok Ko, John H. L. Hansen:
23rd Annual Conference of the International Speech Communication Association, Interspeech 2022, Incheon, Korea, September 18-22, 2022. ISCA 2022
Speech Synthesis: Toward end-to-end synthesis
- Hyunjae Cho, Wonbin Jung, Junhyeok Lee, Sang Hoon Woo:
SANE-TTS: Stable And Natural End-to-End Multilingual Text-to-Speech. 1-5
- Hanbin Bae, Young-Sun Joo:
Enhancement of Pitch Controllability using Timbre-Preserving Pitch Augmentation in FastPitch. 6-10
- Martin Lenglet, Olivier Perrotin, Gérard Bailly:
Speaking Rate Control of end-to-end TTS Models by Direct Manipulation of the Encoder's Output Embeddings. 11-15
- Yooncheol Ju, Ilhwan Kim, Hongsun Yang, Ji-Hoon Kim, Byeongyeol Kim, Soumi Maiti, Shinji Watanabe:
TriniTTS: Pitch-controllable End-to-end TTS without External Aligner. 16-20
- Dan Lim, Sunghee Jung, Eesung Kim:
JETS: Jointly Training FastSpeech2 and HiFi-GAN for End to End Text to Speech. 21-25
Technology for Disordered Speech
- Rosanna Turrisi, Leonardo Badino:
Interpretable dysarthric speaker adaptation based on optimal-transport. 26-30
- Zhengjun Yue, Erfan Loweimi, Heidi Christensen, Jon Barker, Zoran Cvetkovic:
Dysarthric Speech Recognition From Raw Waveform with Parametric CNNs. 31-35
- Luke Prananta, Bence Mark Halpern, Siyuan Feng, Odette Scharenborg:
The Effectiveness of Time Stretching for Enhancing Dysarthric Speech for Improved Dysarthric Speech Recognition. 36-40
- Lester Phillip Violeta, Wen-Chin Huang, Tomoki Toda:
Investigating Self-supervised Pretraining Frameworks for Pathological Speech Recognition. 41-45
- Chitralekha Bhat, Ashish Panda, Helmer Strik:
Improved ASR Performance for Dysarthric Speech Using Two-stage Data Augmentation. 46-50
- Abner Hernandez, Paula Andrea Pérez-Toro, Elmar Nöth, Juan Rafael Orozco-Arroyave, Andreas K. Maier, Seung Hee Yang:
Cross-lingual Self-Supervised Speech Representations for Improved Dysarthric Speech Recognition. 51-55
Neural Network Training Methods for ASR I
- Mun-Hak Lee, Joon-Hyuk Chang, Sang-Eon Lee, Ju-Seok Seong, Chanhee Park, Haeyoung Kwon:
Regularizing Transformer-based Acoustic Models by Penalizing Attention Weights. 56-60
- David M. Chan, Shalini Ghosh:
Content-Context Factorized Representations for Automated Speech Recognition. 61-65
- Georgios Karakasidis, Tamás Grósz, Mikko Kurimo:
Comparison and Analysis of New Curriculum Criteria for End-to-End ASR. 66-70
- Deepak Baby, Pasquale D'Alterio, Valentin Mendelev:
Incremental learning for RNN-Transducer based speech recognition models. 71-75
- Andrew Hard, Kurt Partridge, Neng Chen, Sean Augenstein, Aishanee Shah, Hyun Jin Park, Alex Park, Sara Ng, Jessica Nguyen, Ignacio López-Moreno, Rajiv Mathews, Françoise Beaufays:
Production federated keyword spotting via distillation, filtering, and joint federated-centralized training. 76-80
Acoustic Phonetics and Prosody
- Jieun Song, Hae-Sung Jeon, Jieun Kiaer:
Use of prosodic and lexical cues for disambiguating wh-words in Korean. 81-85
- Vinicius Ribeiro, Yves Laprie:
Autoencoder-Based Tongue Shape Estimation During Continuous Speech. 86-90
- Giuseppe Magistro, Claudia Crocco:
Phonetic erosion and information structure in function words: the case of mia. 91-95
- Miran Oh, Yoon-Jeong Lee:
Dynamic Vertical Larynx Actions Under Prosodic Focus. 96-100
- Leah Bradshaw, Eleanor Chodroff, Lena A. Jäger, Volker Dellwo:
Fundamental Frequency Variability over Time in Telephone Interactions. 101-105
Spoken Machine Translation
- Ioannis Tsiamas, Gerard I. Gállego, José A. R. Fonollosa, Marta R. Costa-jussà:
SHAS: Approaching optimal Segmentation for End-to-End Speech Translation. 106-110
- Jinming Zhao, Hao Yang, Gholamreza Haffari, Ehsan Shareghi:
M-Adapter: Modality Adaptation for End-to-End Speech-to-Text Translation. 111-115
- Mohd Abbas Zaidi, Beomseok Lee, Sangha Kim, Chanwoo Kim:
Cross-Modal Decision Regularization for Simultaneous Speech Translation. 116-120
- Ryo Fukuda, Katsuhito Sudoh, Satoshi Nakamura:
Speech Segmentation Optimization using Segmented Bilingual Speech Corpus for End-to-end Speech Translation. 121-125
- Kirandevraj R, Vinod Kumar Kurmi, Vinay P. Namboodiri, C. V. Jawahar:
Generalized Keyword Spotting using ASR embeddings. 126-130
(Multimodal) Speech Emotion Recognition I
- Youngdo Ahn, Sung Joo Lee, Jong Won Shin:
Multi-Corpus Speech Emotion Recognition for Unseen Corpus Using Corpus-Wise Weights in Classification Loss. 131-135
- Junghun Kim, Yoojin An, Jihie Kim:
Improving Speech Emotion Recognition Through Focus and Calibration Attention Mechanisms. 136-140
- Joosung Lee:
The Emotion is Not One-hot Encoding: Learning with Grayscale Label for Emotion Recognition in Conversation. 141-145
- Andreas Triantafyllopoulos, Johannes Wagner, Hagen Wierstorf, Maximilian Schmitt, Uwe D. Reichel, Florian Eyben, Felix Burkhardt, Björn W. Schuller:
Probing speech emotion recognition transformers for linguistic knowledge. 146-150
- Navin Raj Prabhu, Guillaume Carbajal, Nale Lehmann-Willenbrock, Timo Gerkmann:
End-To-End Label Uncertainty Modeling for Speech-based Arousal Recognition Using Bayesian Neural Networks. 151-155
- Matthew Perez, Mimansa Jaiswal, Minxue Niu, Cristina Gorrostieta, Matthew Roddy, Kye Taylor, Reza Lotfian, John Kane, Emily Mower Provost:
Mind the gap: On the value of silence representations to lexical-based speech emotion recognition. 156-160
- Huang-Cheng Chou, Chi-Chun Lee, Carlos Busso:
Exploiting Co-occurrence Frequency of Emotions in Perceptual Evaluations To Train A Speech Emotion Classifier. 161-165
- Hira Dhamyal, Bhiksha Raj, Rita Singh:
Positional Encoding for Capturing Modality Specific Cadence for Emotion Detection. 166-170
Dereverberation, Noise Reduction, and Speaker Extraction
- Tuan Vu Ho, Maori Kobayashi, Masato Akagi:
Speak Like a Professional: Increasing Speech Intelligibility by Mimicking Professional Announcer Voice with Voice Conversion. 171-175
- Tuan Vu Ho, Quoc Huy Nguyen, Masato Akagi, Masashi Unoki:
Vector-quantized Variational Autoencoder for Phase-aware Speech Enhancement. 176-180
- Minseung Kim, Hyungchan Song, Sein Cheong, Jong Won Shin:
iDeepMMSE: An improved deep learning approach to MMSE speech and noise power spectrum estimation for speech enhancement. 181-185
- Kuo-Hsuan Hung, Szu-Wei Fu, Huan-Hsin Tseng, Hsin-Tien Chiang, Yu Tsao, Chii-Wann Lin:
Boosting Self-Supervised Embeddings for Speech Enhancement. 186-190
- Seorim Hwang, Youngcheol Park, Sungwook Park:
Monoaural Speech Enhancement Using a Nested U-Net with Two-Level Skip Connections. 191-195
- Hannah Muckenhirn, Aleksandr Safin, Hakan Erdogan, Felix de Chaumont Quitry, Marco Tagliasacchi, Scott Wisdom, John R. Hershey:
CycleGAN-based Unpaired Speech Dereverberation. 196-200
- Ashutosh Pandey, DeLiang Wang:
Attentive Training: A New Training Framework for Talker-independent Speaker Extraction. 201-205
- Tyler Vuong, Richard M. Stern:
Improved Modulation-Domain Loss for Neural-Network-based Speech Enhancement. 206-210
- Chiang-Jen Peng, Yun-Ju Chan, Yih-Liang Shen, Cheng Yu, Yu Tsao, Tai-Shih Chi:
Perceptual Characteristics Based Multi-objective Model for Speech Enhancement. 211-215
- Marc Delcroix, Keisuke Kinoshita, Tsubasa Ochiai, Katerina Zmolíková, Hiroshi Sato, Tomohiro Nakatani:
Listen only to me! How well can target speech extraction handle false alarms? 216-220
- Hao Shi, Longbiao Wang, Sheng Li, Jianwu Dang, Tatsuya Kawahara:
Monaural Speech Enhancement Based on Spectrogram Decomposition for Convolutional Neural Network-sensitive Feature Extraction. 221-225
- Jean-Marie Lemercier, Joachim Thiemann, Raphael Koning, Timo Gerkmann:
Neural Network-augmented Kalman Filtering for Robust Online Speech Dereverberation in Noisy Reverberant Environments. 226-230
Source Separation II
- Nicolás Schmidt, Jordi Pons, Marius Miron:
PodcastMix: A dataset for separating music and speech in podcasts. 231-235
- Kohei Saijo, Robin Scheibler:
Independence-based Joint Dereverberation and Separation with Neural Source Model. 236-240
- Kohei Saijo, Robin Scheibler:
Spatial Loss for Unsupervised Multi-channel Source Separation. 241-245
- Samuel Bellows, Timothy W. Leishman:
Effect of Head Orientation on Speech Directivity. 246-250
- Kohei Saijo, Tetsuji Ogawa:
Unsupervised Training of Sequential Neural Beamformer Using Coarsely-separated and Non-separated Signals. 251-255
- Marvin Borsdorf, Kevin Scheck, Haizhou Li, Tanja Schultz:
Blind Language Separation: Disentangling Multilingual Cocktail Party Voices by Language. 256-260
- Mateusz Guzik, Konrad Kowalczyk:
NTF of Spectral and Spatial Features for Tracking and Separation of Moving Sound Sources in Spherical Harmonic Domain. 261-265
- Jack Deadman, Jon Barker:
Modelling Turn-taking in Multispeaker Parties for Realistic Data Simulation. 266-270
- Christoph Böddeker, Tobias Cord-Landwehr, Thilo von Neumann, Reinhold Haeb-Umbach:
An Initialization Scheme for Meeting Separation with Spatial Mixture Models. 271-275
- Seongkyu Mun, Dhananjaya Gowda, Jihwan Lee, Changwoo Han, Dokyun Lee, Chanwoo Kim:
Prototypical speaker-interference loss for target voice separation using non-parallel audio samples. 276-280
Embedding and Network Architecture for Speaker Recognition
- Pierre-Michel Bousquet, Mickael Rouvier, Jean-François Bonastre:
Reliability criterion based on learning-phase entropy for speaker recognition with neural network. 281-285
- Bei Liu, Zhengyang Chen, Yanmin Qian:
Attentive Feature Fusion for Robust Speaker Verification. 286-290
- Bei Liu, Zhengyang Chen, Yanmin Qian:
Dual Path Embedding Learning for Speaker Verification with Triplet Attention. 291-295
- Bei Liu, Zhengyang Chen, Shuai Wang, Haoyu Wang, Bing Han, Yanmin Qian:
DF-ResNet: Boosting Speaker Verification Performance with Depth-First Design. 296-300
- Ruida Li, Shuo Fang, Chenguang Ma, Liang Li:
Adaptive Rectangle Loss for Speaker Verification. 301-305
- Yang Zhang, Zhiqiang Lv, Haibin Wu, Shanshan Zhang, Pengfei Hu, Zhiyong Wu, Hung-yi Lee, Helen Meng:
MFA-Conformer: Multi-scale Feature Aggregation Conformer for Automatic Speaker Verification. 306-310
- Leying Zhang, Zhengyang Chen, Yanmin Qian:
Enroll-Aware Attentive Statistics Pooling for Target Speaker Verification. 311-315
- Yusheng Tian, Jingyu Li, Tan Lee:
Transport-Oriented Feature Aggregation for Speaker Embedding Learning. 316-320
- Mufan Sang, John H. L. Hansen:
Multi-Frequency Information Enhanced Channel Attention Module for Speaker Representation Learning. 321-325
- Linjun Cai, Yuhong Yang, Xufeng Chen, Weiping Tu, Hongyang Chen:
CS-CTCSCONV1D: Small footprint speaker verification with channel split time-channel-time separable 1-dimensional convolution. 326-330
- Pengqi Li, Lantian Li, Askar Hamdulla, Dong Wang:
Reliable Visualization for Deep Speaker Recognition. 331-335
- Zhiyuan Peng, Xuanji He, Ke Ding, Tan Lee, Guanglu Wan:
Unifying Cosine and PLDA Back-ends for Speaker Verification. 336-340
- Yuheng Wei, Junzhao Du, Hui Liu, Qian Wang:
CTFALite: Lightweight Channel-specific Temporal and Frequency Attention Mechanism for Enhancing the Speaker Embedding Extractor. 341-345
Speech Representation II
- Weidong Chen, Xiaofen Xing, Xiangmin Xu, Jianxin Pang, Lan Du:
SpeechFormer: A Hierarchical Efficient Framework Incorporating the Characteristics of Speech. 346-350
- David Feinberg:
VoiceLab: Software for Fully Reproducible Automated Voice Analysis. 351-355
- Joel Shor, Subhashini Venugopalan:
TRILLsson: Distilled Universal Paralinguistic Speech Representations. 356-360
- Nan Li, Meng Ge, Longbiao Wang, Masashi Unoki, Sheng Li, Jianwu Dang:
Global Signal-to-noise Ratio Estimation Based on Multi-subband Processing Using Convolutional Neural Network. 361-365
- Mostafa Sadeghi, Paul Magron:
A Sparsity-promoting Dictionary Model for Variational Autoencoders. 366-370
- Yan Zhao, Jincen Wang, Ru Ye, Yuan Zong, Wenming Zheng, Li Zhao:
Deep Transductive Transfer Regression Network for Cross-Corpus Speech Emotion Recognition. 371-375
- John H. L. Hansen, Zhenyu Wang:
Audio Anti-spoofing Using Simple Attention Module and Joint Optimization Based on Additive Angular Margin Loss and Meta-learning. 376-380
- Boris Bergsma, Minhao Yang, Milos Cernak:
PEAF: Learnable Power Efficient Analog Acoustic Features for Audio Recognition. 381-385
- Gasser Elbanna, Alice Biryukov, Neil Scheidwasser-Clow, Lara Orlandic, Pablo Mainar, Mikolaj Kegler, Pierre Beckmann, Milos Cernak:
Hybrid Handcrafted and Learnable Audio Representation for Analysis of Speech Under Cognitive and Physical Load. 386-390
- Shijun Wang, Hamed Hemati, Jón Guðnason, Damian Borth:
Generative Data Augmentation Guided by Triplet Loss for Speech Emotion Recognition. 391-395
- Sarthak Yadav, Neil Zeghidour:
Learning neural audio features without supervision. 396-400
- Yixuan Zhang, Heming Wang, DeLiang Wang:
Densely-connected Convolutional Recurrent Network for Fundamental Frequency Estimation in Noisy Speech. 401-405
- Abu Zaher Md Faridee, Hannes Gamper:
Predicting label distribution improves non-intrusive speech quality estimation. 406-410
- Takanori Ashihara, Takafumi Moriya, Kohei Matsuura, Tomohiro Tanaka:
Deep versus Wide: An Analysis of Student Architectures for Task-Agnostic Knowledge Distillation of Self-Supervised Speech Models. 411-415
- Abdul Hameed Azeemi, Ihsan Ayyub Qazi, Agha Ali Raza:
Dataset Pruning for Resource-constrained Spoofed Audio Detection. 416-420
Speech Synthesis: Linguistic Processing, Paradigms and Other Topics II
- Jaesung Tae, Hyeongju Kim, Taesu Kim:
EdiTTS: Score-based Editing for Controllable Text-to-Speech. 421-425
- Jie Chen, Changhe Song, Deyi Tuo, Xixin Wu, Shiyin Kang, Zhiyong Wu, Helen Meng:
Improving Mandarin Prosodic Structure Prediction with Multi-level Contextual Information. 426-430
- Zalan Borsos, Matthew Sharifi, Marco Tagliasacchi:
SpeechPainter: Text-conditioned Speech Inpainting. 431-435
- Song Zhang, Ken Zheng, Xiaoxu Zhu, Baoxiang Li:
A polyphone BERT for Polyphone Disambiguation in Mandarin Chinese. 436-440
- Mutian He, Jingzhou Yang, Lei He, Frank K. Soong:
Neural Lexicon Reader: Reduce Pronunciation Errors in End-to-end TTS by Leveraging External Textual Knowledge. 441-445
- Jian Zhu, Cong Zhang, David Jurgens:
ByT5 model for massively multilingual grapheme-to-phoneme conversion. 446-450
- Puneet Mathur, Franck Dernoncourt, Quan Hung Tran, Jiuxiang Gu, Ani Nenkova, Vlad I. Morariu, Rajiv Jain, Dinesh Manocha:
DocLayoutTTS: Dataset and Baselines for Layout-informed Document-level Neural Speech Synthesis. 451-455
- Guangyan Zhang, Kaitao Song, Xu Tan, Daxin Tan, Yuzi Yan, Yanqing Liu, Gang Wang, Wei Zhou, Tao Qin, Tan Lee, Sheng Zhao:
Mixed-Phoneme BERT: Improving BERT with Mixed Phoneme and Sup-Phoneme Representations for Text to Speech. 456-460
- Junrui Ni, Liming Wang, Heting Gao, Kaizhi Qian, Yang Zhang, Shiyu Chang, Mark Hasegawa-Johnson:
Unsupervised Text-to-Speech Synthesis by Unsupervised Automatic Speech Recognition. 461-465
- Tho Nguyen Duc Tran, The Chuong Chu, Vu Hoang, Trung Huu Bui, Steven Hung Quoc Truong:
An Efficient and High Fidelity Vietnamese Streaming End-to-End Speech Synthesis. 466-470
- Cassia Valentini-Botinhao, Manuel Sam Ribeiro, Oliver Watts, Korin Richmond, Gustav Eje Henter:
Predicting pairwise preferences between TTS audio stimuli using parallel ratings data and anti-symmetric twin neural networks. 471-475
- Zikai Chen, Lin Wu, Junjie Pan, Xiang Yin:
An Automatic Soundtracking System for Text-to-Speech Audiobooks. 476-480
- Daxin Tan, Guangyan Zhang, Tan Lee:
Environment Aware Text-to-Speech Synthesis. 481-485
- Artem Ploujnikov, Mirco Ravanelli:
SoundChoice: Grapheme-to-Phoneme Models with Semantic Disambiguation. 486-490
- Evelina Bakhturina, Yang Zhang, Boris Ginsburg:
Shallow Fusion of Weighted Finite-State Transducer and Language Model for Text Normalization. 491-495
- Yogesh Virkar, Marcello Federico, Robert Enyedi, Roberto Barra-Chicote:
Prosodic alignment for off-screen automatic dubbing. 496-500
- Qibing Bai, Tom Ko, Yu Zhang:
A Study of Modeling Rising Intonation in Cantonese Neural Speech Synthesis. 501-505
- Hirokazu Kameoka, Takuhiro Kaneko, Shogo Seki, Kou Tanaka:
CAUSE: Crossmodal Action Unit Sequence Estimation from Speech. 506-510
- Binu Nisal Abeysinghe, Jesin James, Catherine I. Watson, Felix Marattukalam:
Visualising Model Training via Vowel Space for Text-To-Speech Systems. 511-515
Other Topics in Speech Recognition
- Aaqib Saeed:
Binary Early-Exit Network for Adaptive Inference on Low-Resource Devices. 516-520
- Naoyuki Kanda, Jian Wu, Yu Wu, Xiong Xiao, Zhong Meng, Xiaofei Wang, Yashesh Gaur, Zhuo Chen, Jinyu Li, Takuya Yoshioka:
Streaming Speaker-Attributed ASR with Token-Level Speaker Embeddings. 521-525
- Naoki Makishima, Satoshi Suzuki, Atsushi Ando, Ryo Masumura:
Speaker consistency loss and step-wise optimization for semi-supervised joint training of TTS and ASR using unpaired text data. 526-530
- Yi-Kai Zhang, Da-Wei Zhou, Han-Jia Ye, De-Chuan Zhan:
Audio-Visual Generalized Few-Shot Learning with Prototype-Based Co-Adaptation. 531-535
- Junteng Jia, Jay Mahadeokar, Weiyi Zheng, Yuan Shangguan, Ozlem Kalinli, Frank Seide:
Federated Domain Adaptation for ASR with Full Self-Supervision. 536-540
- Longfei Yang, Wenqing Wei, Sheng Li, Jiyi Li, Takahiro Shinozaki:
Augmented Adversarial Self-Supervised Learning for Early-Stage Alzheimer's Speech Detection. 541-545
- Zvi Kons, Hagai Aronowitz, Edmilson da Silva Morais, Matheus Damasceno, Hong-Kwang Kuo, Samuel Thomas, George Saon:
Extending RNN-T-based speech recognition systems with emotion and language classification. 546-549
- Alexandra Antonova, Evelina Bakhturina, Boris Ginsburg:
Thutmose Tagger: Single-pass neural model for Inverse Text Normalization. 550-554
- Yeonjin Cho, Sara Ng, Trang Tran, Mari Ostendorf:
Leveraging Prosody for Punctuation Prediction of Spontaneous Speech. 555-559
- Fan Yu, Zhihao Du, Shiliang Zhang, Yuxiao Lin, Lei Xie:
A Comparative Study on Speaker-attributed Automatic Speech Recognition in Multi-party Meetings. 560-564
Audio Deep PLC (Packet Loss Concealment) Challenge
- Yuansheng Guan, Guochen Yu, Andong Li, Chengshi Zheng, Jie Wang:
TMGAN-PLC: Audio Packet Loss Concealment using Temporal Memory Generative Adversarial Network. 565-569
- Jean-Marc Valin, Ahmed Mustafa, Christopher Montgomery, Timothy B. Terriberry, Michael Klingbeil, Paris Smaragdis, Arvindh Krishnaswamy:
Real-Time Packet Loss Concealment With Mixed Generative and Predictive Model. 570-574
- Baiyun Liu, Qi Song, Mingxue Yang, Wuwen Yuan, Tianbao Wang:
PLCNet: Real-time Packet Loss Concealment with Semi-supervised Generative Adversarial Network. 575-579
- Lorenz Diener, Sten Sootla, Solomiya Branets, Ando Saabas, Robert Aichner, Ross Cutler:
INTERSPEECH 2022 Audio Deep Packet Loss Concealment Challenge. 580-584
- Nan Li, Xiguang Zheng, Chen Zhang, Liang Guo, Bing Yu:
End-to-End Multi-Loss Training for Low Delay Packet Loss Concealment. 585-589
Robust Speaker Recognition
- Ju-ho Kim, Jungwoo Heo, Hye-jin Shim, Ha-Jin Yu:
Extended U-Net for Speaker Verification in Noisy Environments. 590-594
- Seunghan Yang, Debasmit Das, Janghoon Cho, Hyoungwoo Park, Sungrack Yun:
Domain Agnostic Few-shot Learning for Speaker Verification. 595-599
- Qiongqiong Wang, Kong Aik Lee, Tianchi Liu:
Scoring of Large-Margin Embeddings for Speaker Verification: Cosine or PLDA? 600-604