


SLT 2024: Macao
IEEE Spoken Language Technology Workshop, SLT 2024, Macao, December 2-5, 2024. IEEE 2024, ISBN 979-8-3503-9225-8
- Chih-Kai Yang, Kuan-Po Huang, Hung-Yi Lee: Do Prompts Really Prompt? Exploring the Prompt Understanding Capability of Whisper. 1-8
- Xugang Lu, Peng Shen, Yu Tsao, Hisashi Kawai: Temporal Order Preserved Optimal Transport-Based Cross-Modal Knowledge Transfer Learning for ASR. 1-8
- Xiaoxue Gao, Nancy F. Chen: Speech-Mamba: Long-Context Speech Recognition with Selective State Spaces Models. 1-8
- Yoshiki Masuyama, Koichi Miyazaki, Masato Murata: Mamba-Based Decoder-Only Approach with Bidirectional Speech Modeling for Speech Recognition. 1-6
- Dominik Wagner, Ilja Baumann, Thomas Ranzenberger, Korbinian Riedhammer, Tobias Bocklet: Personalizing Large Sequence-to-Sequence Speech Foundation Models With Speaker Representations. 1-6
- Vladimir Bataev, Hainan Xu, Daniel Galvez, Vitaly Lavrukhin, Boris Ginsburg: Label-Looping: Highly Efficient Decoding For Transducers. 7-13
- Mohan Shi, Zengrui Jin, Yaoxun Xu, Yong Xu, Shi-Xiong Zhang, Kun Wei, Yiwen Shao, Chunlei Zhang, Dong Yu: Advancing Multi-Talker ASR Performance With Large Language Models. 14-21
- Gil Keren, Wei Zhou, Ozlem Kalinli: Token-Weighted RNN-T For Learning From Flawed Data. 22-29
- Hukai Huang, Jiayan Lin, Kaidi Wang, Yishuang Li, Wenhao Guan, Lin Li, Qingyang Hong: Enhancing Code-Switching Speech Recognition With LID-Based Collaborative Mixture of Experts Model. 30-36
- Edward Storey, Naomi Harte, Peter Bell: Language Bias in Self-Supervised Learning For Automatic Speech Recognition. 37-42
- Yihan Wu, Yifan Peng, Yichen Lu, Xuankai Chang, Ruihua Song, Shinji Watanabe: Robust Audiovisual Speech Recognition Models with Mixture-of-Experts. 43-48
- Shaoshi Ling, Guoli Ye, Rui Zhao, Yifan Gong: Hybrid Attention-Based Encoder-Decoder Model for Efficient Language Model Adaptation. 49-55
- Yiwen Shao, Yong Xu, Sanjeev Khudanpur, Dong Yu: Spatialemb: Extract and Encode Spatial Information for 1-Stage Multi-Channel Multi-Speaker ASR on Arbitrary Microphone Arrays. 56-63
- Yingyi Ma, Zhe Liu, Ozlem Kalinli: Effective Text Adaptation For LLM-Based ASR Through Soft Prompt Fine-Tuning. 64-69
- Yui Sudo, Yosuke Fukumoto, Muhammad Shakeel, Yifan Peng, Shinji Watanabe: Contextualized Automatic Speech Recognition With Dynamic Vocabulary. 78-85
- Yi-Cheng Wang, Li-Ting Pai, Bi-Cheng Yan, Hsin-Wei Wang, Chi-Han Lin, Berlin Chen: An Effective Context-Balanced Adaptation Approach for Long-Tailed Speech Recognition. 94-101
- Geeticka Chauhan, Steve Chien, Om Thakkar, Abhradeep Thakurta, Arun Narayanan: Training Large ASR Encoders With Differential Privacy. 102-109
- Cindy Tseng, Yun Tang, Vijendra Raj Apsingekar: Transducer Consistency Regularization For Speech to Text Applications. 110-117
- Liang-Hsuan Tseng, Zih-Ching Chen, Wei-Shun Chang, Cheng-Kuang Lee, Tsung-Ren Huang, Hung-Yi Lee: Leave No Knowledge Behind During Knowledge Distillation: Towards Practical and Effective Knowledge Distillation For Code-Switching ASR Using Realistic Data. 118-125
- Guanrou Yang, Ziyang Ma, Zhifu Gao, Shiliang Zhang, Xie Chen: CTC-Assisted LLM-Based Contextual ASR. 126-131
- Dongcheng Jiang, Chao Zhang, Philip C. Woodland: Automatic Time Alignment Generation For End-to-End ASR Using Acoustic Probability Modelling. 132-139
- Kwok Chin Yuen, Jia Qi Yip, Eng Siong Chng: Continual Learning With Embedding Layer Surgery and Task-Wise Beam Search Using Whisper. 140-146
- Zhehuai Chen, He Huang, Oleksii Hrinchuk, Krishna C. Puvvada, Nithin Rao Koluguri, Piotr Zelasko, Jagadeesh Balam, Boris Ginsburg: Bestow: Efficient and Streamable Speech Language Model with The Best of Two Worlds in GPT and T5. 147-154
- Peter Vieting, Simon Berger, Thilo von Neumann, Christoph Boeddeker, Ralf Schlüter, Reinhold Haeb-Umbach: Combining TF-GridNet And Mixture Encoder For Continuous Speech Separation For Meeting Transcription. 155-162
- Ryan Whetten, Titouan Parcollet, Adel Moumen, Marco Dinarelli, Yannick Estève: An Analysis of Linear Complexity Attention Substitutes With Best-RQ. 169-176
- Narla John Metilda Sagaya Mary, Srinivasan Umesh: Lite ASR Transformer: A Light Weight Transformer Architecture For Automatic Speech Recognition. 185-192
- Hao Shi, Yuan Gao, Zhaoheng Ni, Tatsuya Kawahara: Serialized Speech Information Guidance with Overlapped Encoding Separation for Multi-Speaker Automatic Speech Recognition. 193-199
- Jakob Poncelet, Yujun Wang, Hugo Van hamme: Efficient Extraction of Noise-Robust Discrete Units from Self-Supervised Speech Models. 200-207
- Vyas Raina, Mark J. F. Gales: Controlling Whisper: Universal Acoustic Adversarial Attacks to Control Multi-Task Automatic Speech Recognition Models. 208-215
- Yash Jogi, Vaibhav Aggarwal, Shabari S. Nair, Yash Verma, Aayush Kubba: Improving Rare-Word Recognition of Whisper in Zero-Shot Settings. 216-223
- Robin Amann, Zhaolin Li, Barbara Bruno, Jan Niehues: Augmenting Automatic Speech Recognition Models With Disfluency Detection. 224-231
- Yuting Yang, Yuke Li, Lifeng Zhou, Binbin Du, Haoqi Zhu: Enhancing Unified Streaming and Non-Streaming ASR Through Curriculum Learning With Easy-To-Hard Tasks. 232-239
- Hang Shao, Bei Liu, Wei Wang, Xun Gong, Yanmin Qian: DQ-Whisper: Joint Distillation and Quantization for Efficient Multilingual Speech Recognition. 240-246
- Shih-Heng Wang, Jiatong Shi, Chien-Yu Huang, Shinji Watanabe, Hung-Yi Lee: Fusion Of Discrete Representations and Self-Augmented Representations for Multilingual Automatic Speech Recognition. 247-254
- Nithin Rao Koluguri, Travis M. Bartley, Hainan Xu, Oleksii Hrinchuk, Jagadeesh Balam, Boris Ginsburg, Georg Kucsko: Longer is (Not Necessarily) Stronger: Punctuated Long-Sequence Training for Enhanced Speech Recognition and Translation. 255-262
- Yu Xi, Wen Ding, Kai Yu, Junjie Lai: Semi-Supervised Learning For Code-Switching ASR With Large Language Model Filter. 263-270
- Peter Plantinga, Jaekwon Yoo, Abenezer Girma, Chandra Dhir: Parameter Averaging Is All You Need To Prevent Forgetting. 271-278
- Zeyu Zhao, Peter Bell: Advancing CTC Models for Better Speech Alignment: A Topological Approach. 279-285
- Ziqian Wang, Jiayao Sun, Zihan Zhang, Xingchen Li, Jie Liu, Lei Xie: Dualsep: A Light-Weight Dual-Encoder Convolutional Recurrent Network For Real-Time In-Car Speech Separation. 286-293
- Leying Zhang, Yao Qian, Linfeng Yu, Heming Wang, Hemin Yang, Shujie Liu, Long Zhou, Yanmin Qian: DDTSE: Discriminative Diffusion Model for Target Speech Extraction. 294-301
- Rong Chao, Wen-Huang Cheng, Moreno La Quatra, Sabato Marco Siniscalchi, Chao-Han Huck Yang, Szu-Wei Fu, Yu Tsao: An Investigation of Incorporating Mamba For Speech Enhancement. 302-308
- Chien-Chun Wang, Li-Wei Chen, Hung-Shin Lee, Berlin Chen, Hsin-Min Wang: Effective Noise-Aware Data Simulation For Domain-Adaptive Speech Enhancement Leveraging Dynamic Stochastic Perturbation. 309-316
- Zhihang Sun, Andong Li, Rilin Chen, Hao Zhang, Meng Yu, Yi Zhou, Dong Yu: SMRU: Split-And-Merge Recurrent-Based UNet For Acoustic Echo Cancellation And Noise Suppression. 317-324
- Junjie Li, Ke Zhang, Shuai Wang, Haizhou Li, Man-Wai Mak, Kong Aik Lee: On the Effectiveness of Enrollment Speech Augmentation For Target Speaker Extraction. 325-332
- Chenda Li, Samuele Cornell, Shinji Watanabe, Yanmin Qian: Diffusion-Based Generative Modeling With Discriminative Guidance for Streamable Speech Enhancement. 333-340
- Dashanka De Silva, Siqi Cai, Saurav Pahuja, Tanja Schultz, Haizhou Li: Neurospex: Neuro-Guided Speaker Extraction With Cross-Modal Fusion. 341-348
- Jiahe Wang, Shuai Wang, Junjie Li, Ke Zhang, Yanmin Qian, Haizhou Li: Enhancing Speaker Extraction Through Rectifying Target Confusion. 349-356
- Da-Hee Yang, Joon-Hyuk Chang: Diff-PLC: A Diffusion-Based Approach For Effective Packet Loss Concealment. 357-363
- Yun Liu, Xuechen Liu, Junichi Yamagishi: Improving Curriculum Learning For Target Speaker Extraction With Synthetic Speakers. 364-370
- Chao-Han Huck Yang, Taejin Park, Yuan Gong, Yuanchao Li, Zhehuai Chen, Yen-Ting Lin, Chen Chen, Yuchen Hu, Kunal Dhawan, Piotr Zelasko, Chao Zhang, Yun-Nung Chen, Yu Tsao, Jagadeesh Balam, Boris Ginsburg, Sabato Marco Siniscalchi, Eng Siong Chng, Peter Bell, Catherine Lai, Shinji Watanabe, Andreas Stolcke: Large Language Model Based Generative Error Correction: A Challenge and Baselines For Speech Recognition, Speaker Tagging, and Emotion Recognition. 371-378
- Han Jiang, Wenyu Wang, Yiquan Zhou, Hongwu Ding, Jiacheng Xu, Jihua Zhu: FGCL: Fine-Grained Contrastive Learning For Mandarin Stuttering Event Detection. 379-384
- Hongfei Xue, Rong Gong, Mingchen Shao, Xin Xu, Lezhi Wang, Lei Xie, Hui Bu, Jiaming Zhou, Yong Qin, Jun Du, Ming Li, Binbin Zhang, Bin Jia: Findings of the 2024 Mandarin Stuttering Event Detection and Automatic Speech Recognition Challenge. 385-392
- Shangkun Huang, Dejun Zhang, Jing Deng, Rong Zheng: Enhanced ASR For Stuttering Speech: Combining Adversarial and Signal-Based Data Augmentation. 393-400
- Tzu-Quan Lin, Guan-Ting Lin, Hung-Yi Lee, Hao Tang: Property Neurons in Self-Supervised Speech Transformers. 401-408
- Zexin Cai, Henry Li Xinyuan, Ashi Garg, Leibny Paola García-Perera, Kevin Duh, Sanjeev Khudanpur, Nicholas Andrews, Matthew Wiesner: Privacy Versus Emotion Preservation Trade-Offs in Emotion-Preserving Speaker Anonymization. 409-414
- Sung-Lin Yeh, Hao Tang: Estimating the Completeness of Discrete Speech Units. 415-422
- Takanori Ashihara, Takafumi Moriya, Shota Horiguchi, Junyi Peng, Tsubasa Ochiai, Marc Delcroix, Kohei Matsuura, Hiroshi Sato: Investigation of Speaker Representation for Target-Speaker Speech Processing. 423-430
- Yuanchao Li, Pinzhen Chen, Peter Bell, Catherine Lai: Crossmodal ASR Error Correction With Discrete Speech Units. 431-438
- Yi-Cheng Lin, Tzu-Quan Lin, Chih-Kai Yang, Ke-Han Lu, Wei-Chih Chen, Chun-Yi Kuan, Hung-Yi Lee: Listen and Speak Fairly: a Study on Semantic Gender Bias in Speech Integrated Large Language Models. 439-446
- Sungnyun Kim, Kangwook Jang, Sangmin Bae, Hoirin Kim, Se-Young Yun: Learning Video Temporal Dynamics With Cross-Modal Attention For Robust Audio-Visual Speech Recognition. 447-454
- Lemeng Wu, Zhaoheng Ni, Bowen Shi, Gaël Le Lan, Anurag Kumar, Varun Nagaraja, Xinhao Mei, Yunyang Xiong, Bilge Soran, Raghuraman Krishnamoorthi, Wei-Ning Hsu, Yangyang Shi, Vikas Chandra: Data Efficient Reflow for Few Step Audio Generation. 455-461
- Roger Hsiao, Liuhui Deng, Erik McDermott, Ruchir Travadi, Xiaodan Zhuang: Optimizing Byte-Level Representation For End-To-End ASR. 462-467
- Wen Ding, Fei Jia, Hainan Xu, Yu Xi, Junjie Lai, Boris Ginsburg: Romanization Encoding For Multilingual ASR. 468-475
- Tzu-Ting Yang, Hsin-Wei Wang, Yi-Cheng Wang, Berlin Chen: Enhancing Code-Switching ASR Leveraging Non-Peaky CTC Loss and Deep Language Posterior Injection. 476-481
- Chang Liu, Zhen-Hua Ling, Ya-Jun Hu: Language-Independent Prosody-Enhanced Speech Representations For Multilingual Speech Synthesis. 482-488
- Shahar Elisha, Andrew McDowell, Mariano Beguerisse-Díaz, Emmanouil Benetos: Classification Of Spontaneous And Scripted Speech For Multilingual Audio. 489-495
- Yu Pan, Yuguang Yang, Yuheng Huang, Tiancheng Jin, Jingjing Yin, Yanni Hu, Heng Lu, Lei Ma, Jianjun Zhao: GMP-TL: Gender-Augmented Multi-Scale Pseudo-Label Enhanced Transfer Learning For Speech Emotion Recognition. 496-501
- Huang-Cheng Chou, Haibin Wu, Lucas Goncalves, Seong-Gyun Leem, Ali Salman, Carlos Busso, Hung-Yi Lee, Chi-Chun Lee: Embracing Ambiguity And Subjectivity Using The All-Inclusive Aggregation Rule For Evaluating Multi-Label Speech Emotion Recognition Systems. 502-509
- Haibin Wu, Huang-Cheng Chou, Kai-Wei Chang, Lucas Goncalves, Jiawei Du, Jyh-Shing Roger Jang, Chi-Chun Lee, Hung-Yi Lee: Open-Emotion: A Reproducible EMO-Superb For Speech Emotion Recognition Systems. 510-517
- Yuanchao Li, Peter Bell, Catherine Lai: Speech Emotion Recognition With ASR Transcripts: a Comprehensive Study on Word Error Rate and Fusion Techniques. 518-525
- Ariadna Sanchez, Alice Ross, Nina Markl: Beyond The Binary: Limitations and Possibilities of Gender-Related Speech Technology Research. 526-532
- Shi-wook Lee: Enhancing Domain Generalization in Speech Emotion Recognition by Combining Domain-Variant Representations and Domain-Invariant Classifiers. 533-539
- Xiao-Hang Jiang, Yang Ai, Rui-Chen Zheng, Hui-Peng Du, Ye-Xin Lu, Zhen-Hua Ling: MDCTCodec: A Lightweight MDCT-Based Neural Audio Codec Towards High Sampling Rate and Low Bitrate Scenarios. 540-547
- Haohan Guo, Fenglong Xie, Dongchao Yang, Hui Lu, Xixin Wu, Helen Meng: Addressing Index Collapse of Large-Codebook Speech Tokenizer With Dual-Decoding Product-Quantized Variational Auto-Encoder. 548-553
- Jiaqi Li, Dongmei Wang, Xiaofei Wang, Yao Qian, Long Zhou, Shujie Liu, Midia Yousefi, Canrun Li, Chung-Hsien Tsai, Zhen Xiao, Yanqing Liu, Junkun Chen, Sheng Zhao, Jinyu Li, Zhizheng Wu, Michael Zeng: Investigating Neural Audio Codecs For Speech Language Model-Based Speech Generation. 554-561
- Jiatong Shi, Jinchuan Tian, Yihan Wu, Jee-Weon Jung, Jia Qi Yip, Yoshiki Masuyama, William Chen, Yuning Wu, Yuxun Tang, Massa Baali, Dareen Alharthi, Dong Zhang, Ruifan Deng, Tejes Srivastava, Haibin Wu, Alexander H. Liu, Bhiksha Raj, Qin Jin, Ruihua Song, Shinji Watanabe: ESPnet-Codec: Comprehensive Training and Evaluation of Neural Codecs For Audio, Music, and Speech. 562-569
- Haibin Wu, Xuanjun Chen, Yi-Cheng Lin, Kaiwei Chang, Jiawei Du, Ke-Han Lu, Alexander H. Liu, Ho-Lam Chung, Yuan-Kuei Wu, Dongchao Yang, Songxiang Liu, Yi-Chiao Wu, Xu Tan, James R. Glass, Shinji Watanabe, Hung-Yi Lee: Codec-Superb @ SLT 2024: A Lightweight Benchmark For Neural Audio Codec Models. 570-577
- Shuiyun Liu, Yuxiang Kong, Pengcheng Guo, Weiji Zhuang, Peng Gao, Yujun Wang, Lei Xie: Optimizing Dysarthria Wake-Up Word Spotting: an End-to-End Approach For SLT 2024 LRDWWS Challenge. 578-585
- Shiyao Wang, Jiaming Zhou, Shiwan Zhao, Yong Qin: PB-LRDWWS System For the SLT 2024 Low-Resource Dysarthria Wake-Up Word Spotting Challenge. 586-591
- Ming Gao, Hang Chen, Jun Du, Xin Xu, Hongxiao Guo, Hui Bu, Ming Li, Chin-Hui Lee: Summary of Low-Resource Dysarthria Wake-Up Word Spotting Challenge. 592-599
- Ada Defne Tur, Adel Moumen, Mirco Ravanelli: Progres: Prompted Generative Rescoring on ASR N-Best. 600-607
- Moreno La Quatra, Valerio Mario Salerno, Yu Tsao, Sabato Marco Siniscalchi: FlanEC: Exploring Flan-T5 for Post-ASR Error Correction. 608-615
- Zhipeng Li, Xiaofen Xing, Jun Wang, Shuaiqi Chen, Guoqiao Yu, Guanglu Wan, Xiangmin Xu: As-Speech: Adaptive Style For Speech Synthesis. 616-622
- Hieu-Thi Luong, Duc-Tuan Truong, Kong Aik Lee, Eng Siong Chng: Room Impulse Responses Help Attackers to Evade Deep Fake Detection. 623-629
- Hankun Wang, Chenpeng Du, Yiwei Guo, Shuai Wang, Xie Chen, Kai Yu: Attention-Constrained Inference For Robust Decoder-Only Text-to-Speech. 630-637
- Fei Liu, Yang Ai, Hui-Peng Du, Ye-Xin Lu, Rui-Chen Zheng, Zhen-Hua Ling: Stage-Wise and Prior-Aware Neural Speech Phase Prediction. 638-644
- Haohan Guo, Fenglong Xie, Kun Xie, Dongchao Yang, Dake Guo, Xixin Wu, Helen Meng: SoCodec: A Semantic-Ordered Multi-Stream Speech Codec For Efficient Language Model Based Text-to-Speech Synthesis. 645-651
- Sung-Feng Huang, Heng-Cheng Kuo, Zhehuai Chen, Xuesong Yang, Chao-Han Huck Yang, Yu Tsao, Yu-Chiang Frank Wang, Hung-Yi Lee, Szu-Wei Fu: Detecting the Undetectable: Assessing the Efficacy of Current Spoof Detection Methods Against Seamless Speech Edits. 652-659
- Hiroaki Hyodo, Shinnosuke Takamichi, Tomohiko Nakamura, Junya Koguchi, Hiroshi Saruwatari: DNN-Based Ensemble Singing Voice Synthesis With Interactions Between Singers. 660-667
- Sotirios Karapiperis, Nikolaos Ellinas, Alexandra Vioni, Junkwang Oh, Gunu Jho, Inchul Hwang, Spyros Raptis: Investigating Disentanglement in a Phoneme-Level Speech Codec for Prosody Modeling. 668-674
- Chang Zeng, Chunhui Wang, Xiaoxiao Miao, Jian Zhao, Zhonglin Jiang, Yong Chen: Instructsing: High-Fidelity Singing Voice Generation Via Instructing Yourself. 675-681
- Sefik Emre Eskimez, Xiaofei Wang, Manthan Thakker, Canrun Li, Chung-Hsien Tsai, Zhen Xiao, Hemin Yang, Zirun Zhu, Min Tang, Xu Tan, Yanqing Liu, Sheng Zhao, Naoyuki Kanda: E2 TTS: Embarrassingly Easy Fully Non-Autoregressive Zero-Shot TTS. 682-689
- Haibin Wu, Xiaofei Wang, Sefik Emre Eskimez, Manthan Thakker, Daniel Tompkins, Chung-Hsien Tsai, Canrun Li, Zhen Xiao, Sheng Zhao, Jinyu Li, Naoyuki Kanda: Laugh Now Cry Later: Controlling Time-Varying Emotional States of Flow-Matching-Based Zero-Shot Text-To-Speech. 690-697
- Zhengyang Chen, Shuai Wang, Mingyang Zhang, Xuechen Liu, Junichi Yamagishi, Yanmin Qian: Disentangling The Prosody And Semantic Information With Pre-Trained Model For In-Context Learning Based Zero-Shot Voice Conversion. 698-704
- Zhikang Niu, Sanyuan Chen, Long Zhou, Ziyang Ma, Xie Chen, Shujie Liu: NDVQ: Robust Neural Audio Codec With Normal Distribution-Based Vector Quantization. 705-710
- Yisi Liu, Bohan Yu, Drake Lin, Peter Wu, Cheol Jun Cho, Gopala Krishna Anumanchipalli: Fast, High-Quality and Parameter-Efficient Articulatory Synthesis Using Differentiable DSP. 711-718
- Yifeng Yu, Jiatong Shi, Yuning Wu, Yuxun Tang, Shinji Watanabe: Visinger2+: End-to-End Singing Voice Synthesis Augmented by Self-Supervised Learning Representation. 719-726
- Waris Quamer, Ricardo Gutierrez-Osuna: End-To-End Streaming Model For Low-Latency Speech Anonymization. 727-734
- Raymond Chung: Emotion-Coherent Speech Data Augmentation And Self-Supervised Contrastive Style Training For Enhancing Kids's Story Speech Synthesis. 735-741
- Philip H. Lee, Ismail Rasim Ulgen, Berrak Sisman: Discrete Unit Based Masking For Improving Disentanglement in Voice Conversion. 742-749
- Kazuki Yamauchi, Yuki Saito, Hiroshi Saruwatari: Cross-Dialect Text-to-Speech In Pitch-Accent Language Incorporating Multi-Dialect Phoneme-Level Bert. 750-757
- Xueyao Zhang, Zihao Fang, Yicheng Gu, Haopeng Chen, Lexiao Zou, Junan Zhang, Liumeng Xue, Zhizheng Wu: Leveraging Diverse Semantic-Based Audio Pretrained Models for Singing Voice Conversion. 758-765
- Christoph Minixhofer, Ondrej Klejch, Peter Bell: TTSDS - Text-to-Speech Distribution Score. 766-773
- Anmol Guragain, Tianchi Liu, Zihan Pan, Hardik B. Sailor, Qiongqiong Wang: Speech Foundation Model Ensembles for the Controlled Singing Voice Deepfake Detection (CTRSVDD) Challenge 2024. 774-781
- You Zhang, Yongyi Zang, Jiatong Shi, Ryuichi Yamamoto, Tomoki Toda, Zhiyao Duan: SVDD 2024: The Inaugural Singing Voice Deepfake Detection Challenge. 782-787
- Qishan Zhang, Shuangbing Wen, Fangke Yan, Tao Hu, Jun Li: XWSB: A Blend System Utilizing XLS-R and Wavlm With SLS Classifier Detection System for SVDD 2024 Challenge. 788-794
- Yankai Wang, Yuxuan Du, Dejun Zhang, Rong Zheng, Jing Deng: Integrating Self-Supervised Pre-Training With Adversarial Learning for Synthesized Song Detection. 795-802
- Wen-Chin Huang, Szu-Wei Fu, Erica Cooper, Ryandhimas E. Zezario, Tomoki Toda, Hsin-Min Wang, Junichi Yamagishi, Yu Tsao: The Voicemos Challenge 2024: Beyond Speech Quality Prediction. 803-810
- Yu-Fei Shi, Yang Ai, Ye-Xin Lu, Hui-Peng Du, Zhen-Hua Ling: Pitch-and-Spectrum-Aware Singing Quality Assessment with Bias Correction and Model Fusion. 811-817
- Kaito Baba, Wataru Nakata, Yuki Saito, Hiroshi Saruwatari: The T05 System for the voicemos challenge 2024: Transfer Learning from Deep Image Classifier to Naturalness MOS Prediction of High-Quality Synthetic Speech. 818-824
- Jiun-Ting Li, Bi-Cheng Yan, Tien-Hong Lo, Yi-Cheng Wang, Yung-Chang Hsu, Berlin Chen: Automated Speaking Assessment of Conversation Tests with Novel Graph-Based Modeling on Spoken Response Coherence. 825-832
- Luca Becker, Philip Pracht, Peter Sertdal, Jil Uboreck, Alexander Bendel, Rainer Martin: Conditional Label Smoothing For LLM-Based Data Augmentation in Medical Text Classification. 833-840
- Mengfei Guo, Si Chen, Yi Huang, Junlan Feng: Plan, Generate and Optimize: Extending Large Language Models for Dialogue Systems Via Prompt-Based Collaborative Method. 841-848
- Mahdin Rohmatillah, Jen-Tzung Chien: Taming NLU Noise: Student-Teacher Learning for Robust Dialogue Policy. 849-856
- Stanislaw Kacprzak, Konrad Kowalczyk: Heightceleb - An Enrichment of Voxceleb Dataset With Speaker Height Information. 857-862
- Masao Someki, Kwanghee Choi, Siddhant Arora, William Chen, Samuele Cornell, Jionghao Han, Yifan Peng, Jiatong Shi, Vaibhav Srivastav, Shinji Watanabe: ESPnet-EZ: Python-Only ESPnet For Easy Fine-Tuning And Integration. 863-870
- Yi-Cheng Lin, Wei-Chih Chen, Hung-Yi Lee: Spoken Stereoset: on Evaluating Social Bias Toward Speaker in Speech Large Language Models. 871-878
- Xueyao Zhang, Liumeng Xue, Yicheng Gu, Yuancheng Wang, Jiaqi Li, Haorui He, Chaoren Wang, Songting Liu, Xi Chen, Junan Zhang, Zihao Fang, Haopeng Chen, Tze Ying Tang, Lexiao Zou, Mingxuan Wang, Jun Han, Kai Chen, Haizhou Li, Zhizheng Wu: Amphion: an Open-Source Audio, Music, and Speech Generation Toolkit. 879-884
- Haorui He, Zengqiang Shang, Chaoren Wang, Xuyuan Li, Yicheng Gu, Hua Hua, Liwei Liu, Chen Yang, Jiaqi Li, Peiyang Shi, Yuancheng Wang, Kai Chen, Pengyuan Zhang, Zhizheng Wu: Emilia: An Extensive, Multilingual, and Diverse Speech Dataset For Large-Scale Speech Generation. 885-890
- William Chen, Brian Yan, Chih-Chen Chen, Shinji Watanabe: Floras 50: A Massively Multilingual Multitask Benchmark for Long-Form Conversational Speech. 891-898
- Hirofumi Inaguma, Ilia Kulikov, Zhaoheng Ni, Sravya Popuri, Paden Tomasello: Massively Multilingual Forced Aligner Leveraging Self-Supervised Discrete Units. 899-905
- Tejes Srivastava, Ju-Chieh Chou, Priyank Shroff, Karen Livescu, Christopher Graziul: Speech Recognition For Analysis of Police Radio Communication. 906-912
- Taaha Kazi, Ruiliang Lyu, Sizhe Zhou, Dilek Hakkani-Tür, Gokhan Tur: Large Language Models as User-Agents For Evaluating Task-Oriented-Dialogue Systems. 913-920
- Jiawei Du, I-Ming Lin, I-Hsiang Chiu, Xuanjun Chen, Haibin Wu, Wenze Ren, Yu Tsao, Hung-Yi Lee, Jyh-Shing Roger Jang: DFADD: The Diffusion and Flow-Matching Based Audio Deepfake Dataset. 921-928
- Peizhuo Liu, Li Wang, Renqiang He, Haorui He, Lei Wang, Huadi Zheng, Jie Shi, Tong Xiao, Zhizheng Wu: SPMIS: An Investigation of Synthetic Spoken Misinformation Detection. 929-936
- Yi-Jen Shih, Zoi Gkalitsiou, Alexandros G. Dimakis, David Harwath: Self-Supervised Speech Models For Word-Level Stuttered Speech Detection. 937-944
- Wen-Hsuan Peng, Sally Chen, Berlin Chen: Enhancing Automatic Speech Assessment Leveraging Heterogeneous Features and Soft Labels For Ordinal Classification. 945-952
- Yerin Choi, Jeehyun Lee, Myoung-Wan Koo: Speech Recognition-Based Feature Extraction For Enhanced Automatic Severity Classification in Dysarthric Speech. 953-960
- Andy T. Liu, Yi-Cheng Lin, Haibin Wu, Stefan Winkler, Hung-Yi Lee: Efficient Training of Self-Supervised Speech Foundation Models on a Compute Budget. 961-968
- Xinhu Zheng, Anbai Jiang, Bing Han, Yanmin Qian, Pingyi Fan, Jia Liu, Wei-Qiang Zhang: Improving Anomalous Sound Detection Via Low-Rank Adaptation Fine-Tuning of Pre-Trained Audio Models. 969-974
- Tuan Nguyen, Corinne Fredouille, Alain Ghio, Mathieu Balaguer, Virginie Woisard: Exploring ASR-Based WAV2VEC2 for Automated Speech Disorder Assessment: Insights and Analysis. 975-982
- Chang Feng, Yiyang Zhao, Guangzhi Sun, Zehua Chen, Shuai Wang, Chao Zhang, Mingxing Xu, Thomas Fang Zheng: Hierarchical Multi-Path and Multi-Model Selection For Fake Speech Detection. 983-990
- Huayun Zhang, Jeremy H. M. Wong, Geyu Lin, Nancy F. Chen: Semi-Supervised Learning for Robust Speech Evaluation. 991-998
- Pai Zhu, Jacob W. Bartel, Dhruuv Agarwal, Kurt Partridge, Hyun Jin Park, Quan Wang: GE2E-KWS: Generalized End-to-End Training and Evaluation for Zero-Shot Keyword Spotting. 999-1006
- Gene-Ping Yang, Hao Tang: A Simple HMM with Self-Supervised Representations for Phone Segmentation. 1007-1014
- Saurabhchand Bhati, Yuan Gong, Leonid Karlinsky, Hilde Kuehne, Rogério Feris, James R. Glass: DASS: Distilled Audio State Space Models are Stronger and More Duration-Scalable Learners. 1015-1022
- David Qiu, David Rim, Shaojin Ding, Oleg Rybakov, Yanzhang He: Rand: Robustness Aware Norm Decay for Quantized Neural Networks. 1023-1030
- Ziyang Zhang, Andrew Thwaites, Alexandra Woolgar, Brian Moore, Chao Zhang: SWIM: Short-Window CNN Integrated With Mamba for EEG-Based Auditory Spatial Attention Decoding. 1031-1038
- Xuanru Zhou, Cheol Jun Cho, Ayati Sharma, Brittany Morin, David Baquirin, Jet Vonk, Zoe Ezzes, Zachary Miller, Boon Lead Tee, Maria Luisa Gorno-Tempini, Jiachen Lian, Gopala Anumanchipalli: Stutter-Solver: End-To-End Multi-Lingual Dysfluency Detection. 1039-1046
- Huadong Lin, Yirong Chen, Wenyu Tao, Mingyu Chen, Xiangmin Xu, Xiaofen Xing: Domain Adaption and Unified Knowledge Base Motivate Better Retrieval Models in Dialog Systems With RAG. 1047-1052
- Siavash Shams, Sukru Samet Dindar, Xilin Jiang, Nima Mesgarani: SSAMBA: Self-Supervised Audio Representation Learning With Mamba State Space Model. 1053-1059
- Chun-Yi Kuan, Chih-Kai Yang, Wei-Ping Huang, Ke-Han Lu, Hung-Yi Lee: Speech-Copilot: Leveraging Large Language Models for Speech Processing Via Task Decomposition, Modularization, and Program Generation. 1060-1067
- Rui Zhao, Jinyu Li, Ruchao Fan, Matt Post: CTC-GMM: CTC Guided Modality Matching For Fast and Accurate Streaming Speech Translation. 1068-1075
- Peter Polák, Ondrej Bojar: Long-Form End-To-End Speech Translation VIA Latent Alignment Segmentation. 1076-1082
- Yi-Jyun Sun, Suvodip Dey, Dilek Hakkani-Tür, Gokhan Tur: Confidence Estimation For LLM-Based Dialogue State Tracking. 1083-1090
- Yucheng Cai, Si Chen, Yuxuan Wu, Yi Huang, Junlan Feng, Zhijian Ou: The 2nd Futuredial Challenge: Dialog Systems With Retrieval Augmented Generation (Futuredial-RAG). 1091-1098
- Mengjie Qian, Rao Ma, Adian Liusie, Erfan Loweimi, Kate M. Knill, Mark J. F. Gales: Zero-Shot Audio Topic Reranking Using Large Language Models. 1099-1106
- Henry Li Xinyuan, Sonal Joshi, Thomas Thebaud, Jesús Villalba, Najim Dehak, Sanjeev Khudanpur: Clean Label Attacks Against SLU Systems. 1107-1114
- Mohan Li, Cong-Thanh Do, Simon Keizer, Youmna Farag, Svetlana Stoyanchev, Rama Doddipatla: WHISMA: A Speech-LLM to Perform Zero-Shot Spoken Language Understanding. 1115-1122
- Vishal Sunder, Eric Fosler-Lussier: Improving Transducer-Based Spoken Language Understanding With Self-Conditioned CTC and Knowledge Transfer. 1123-1130
- Ryota Komatsu, Takahiro Shinozaki: Self-Supervised Syllable Discovery Based on Speaker-Disentangled Hubert. 1131-1136
- Junkai Wu, Xulin Fan, Bo-Ru Lu, Xilin Jiang, Nima Mesgarani, Mark Hasegawa-Johnson, Mari Ostendorf: Just ASR + LLM? A Study on Speech Large Language Models' Ability to Identify And Understand Speaker in Spoken Dialogue. 1137-1143
- Zhiyong Chen, Zhiqi Ai, Xinnuo Li, Shugong Xu: Enhancing Open-Set Speaker Identification Through Rapid Tuning With Speaker Reciprocal Points and Negative Sample. 1144-1149
- Chang Zeng, Xiaoxiao Miao, Xin Wang, Erica Cooper, Junichi Yamagishi: Spoofing-Aware Speaker Verification Robust Against Domain and Channel Mismatches. 1150-1157
- Yibo Bai, Xiao-Lei Zhang, Xuelong Li: Adversarial Purification For Speaker Verification By Two-Stage Diffusion Models. 1158-1164
- Wei-Cheng Tseng, Yi-Jen Shih, David Harwath, Raymond Mooney: Measuring Sound Symbolism In Audio-Visual Models. 1165-1172
- Ivan Kukanov, Janne Laakkonen, Tomi Kinnunen, Ville Hautamäki: Meta-Learning Approaches For Improving Detection of Unseen Speech Deepfakes. 1173-1178
- Chenyang Guo, Liping Chen, Zhuhai Li, Kong Aik Lee, Zhen-Hua Ling, Wu Guo: On The Generation and Removal of Speaker Adversarial Perturbation For Voice-Privacy Protection. 1179-1184
- Tianchi Liu, Ivan Kukanov, Zihan Pan, Qiongqiong Wang, Hardik B. Sailor, Kong Aik Lee: Towards Quantifying and Reducing Language Mismatch Effects in Cross-Lingual Speech Anti-Spoofing. 1185-1192
- Min Ma, Gary Wang, Kyle Kastner, Isaac Caswell, Charles Yoon, Andrew Rosenberg: Enhancing Low-Resource Spoken Language Identification Via Cross-Modality Retrieval and Cross-Lingual Text-to-Speech Synthesis. 1193-1200
- Shota Horiguchi, Atsushi Ando, Takafumi Moriya, Takanori Ashihara, Hiroshi Sato, Naohiro Tawara, Marc Delcroix: Recursive Attentive Pooling For Extracting Speaker Embeddings From Multi-Speaker Recordings. 1201-1208
- Massa Baali, Abdulhamid Aldoobi, Hira Dhamyal, Rita Singh, Bhiksha Raj: PDAF: A Phonetic Debiasing Attention Framework For Speaker Verification. 1209-1216
- Narla John Metilda Sagaya Mary, S. Umesh: Inx-Speakerhub: A 2000-Hour Indian Multilingual Speaker Verification Corpus. 1217-1223
- Weiqing Wang, Kunal Dhawan, Taejin Park, Krishna C. Puvvada, Ivan Medennikov, Somshubra Majumdar, He Huang, Jagadeesh Balam, Boris Ginsburg: Resource-Efficient Adaptation of Speech Foundation Models for Multi-Speaker ASR. 1224-1231
- Sreekanth Sankala: Exploring Self-Supervised Representations for Text-Dependent Speaker Verification. 1232-1239
- Xinlei Ma, Wenhuan Lu, Ruiteng Zhang, Junhai Xu, Xugang Lu, Jianguo Wei: Distillation-Based Feature Extraction Algorithm For Source Speaker Verification. 1240-1246
- Qing Wang, Hongmei Guo, Jian Kang, Mengjie Du, Jie Li, Xiao-Lei Zhang, Lei Xie: Speaker Contrastive Learning For Source Speaker Tracing. 1247-1253
- Ze Li, Yuke Lin, Tian Yao, Hongbin Suo, Pengyuan Zhang, Yanzhen Ren, Zexin Cai, Hiromitsu Nishizaki, Ming Li: The Database and Benchmark For the Source Speaker Tracing Challenge 2024. 1254-1261
