题名: |
针对目标说话人的语音增强模型研究
|
作者: |
刘思行
|
学号: |
SZ2216038
|
保密级别: |
公开
|
语种: |
chi
|
学科代码: |
085404
|
学科: |
工学 - 电子信息 - 计算机技术
|
学生类型: |
硕士
|
学位: |
工程硕士
|
入学年份: |
2022
|
学校: |
南京航空航天大学
|
院系: |
计算机科学与技术学院/人工智能学院
|
专业: |
电子信息(专业学位)
|
研究方向: |
语音增强
|
导师姓名: |
杨群
|
导师单位: |
计算机科学与技术学院/人工智能学院
|
完成日期: |
2025-03-31
|
答辩日期: |
2025-03-10
|
外文题名: |
Research on Speech Enhancement Model for Target Speaker
|
关键词: |
目标说话人语音增强 ; 扩散模型 ; 漂移项 ; 噪声去除 ; 非目标说话人语音去除
|
外文关键词: |
Target Speaker Speech Enhancement ; Diffusion Model ; Drift Term ; Noise Removal ; Non-Target Speaker Speech Removal
|
摘要: |
︿
语音增强技术旨在抑制语音中的各种干扰声,并从中提取有用的语音信号。传统的语音增强方法主要研究去除环境噪声和声学混响。近年来,复杂多说话人场景提出了针对多说话人混合语音中目标说话人语音增强的需求,这给传统的语音增强方法带来了挑战。扩散模型作为一种生成式方法,具有泛化性能强的优点,因此本文基于扩散模型开展目标说话人语音增强研究。根据任务特点,本文将目标说话人语音增强任务划分为噪声去除子任务和非目标说话人语音去除子任务。本文创新工作如下:
(1)在噪声去除子任务中,当前基于分数扩散模型的语音增强方法注重随机微分方程中的扩散项,而忽视漂移项,这导致其对扩散模型的逆向过程建模存在不足,从而影响了语音增强的效果。针对此问题,本文提出了一种基于扩散模型的语音去噪方法。首先,本文通过引入可学习的漂移项,实现对扩散模型逆向过程的精准建模。接着,本文提出一个新的语音增强框架 Drift-DiffuSE,该框架使用分数子模块和漂移子模块分别对扩散项和漂移项进行建模。然后,本文引入可变漂移步长,实现对扩散模型逆向去噪过程的控制。此外,本文还设计了漂移损失,实现分数子模块和漂移子模块的联合优化。实验结果证明,本文所提方法增强出的语音信号在感知质量上达到了生成式模型中的最优;在泛化性能上,也优于现有的基于扩散模型的语音增强方法。
(2)在非目标说话人语音去除子任务中,现有的基于扩散模型的非目标说话人语音去除方法通常依赖于从干净语音中提取固定的目标说话人特征,并以此指导语音增强模型去除非目标说话人语音信号。然而,这种方法与扩散模型的迭代去除特性不相符合,限制了扩散模型对非目标说话人语音的去除能力。针对此问题,本文提出了一种基于扩散模型的非目标说话人语音去除方法。首先,本文设计了一个说话人特征提取子模块,其融合了注意力机制和时序建模网络,可以生成与扩散模型特性相符的、可变长度的目标说话人特征。然后,本文引入了对比损失和说话人分类损失,探索其对非目标说话人语音去除任务的影响。实验结果证明,本文提出的方法较现有的基于扩散模型的非目标说话人语音去除方法,在各项指标上都达到了最优,同时,本文所提损失函数也对非目标说话人语音去除任务有促进作用。
﹀
|
外文摘要: |
︿
Speech enhancement aims to mitigate various interfering sounds and extract useful speech signals. Traditional approaches in this domain have predominantly focused on eliminating environmental noise and acoustic reverberation. However, the emergence of complex multi-speaker scenarios has created a need to enhance the speech of a target speaker within mixed multi-speaker audio, presenting a significant challenge to conventional speech enhancement methods. The diffusion model, a generative method recognized for its robust generalization capability, serves as the foundation for this study's exploration of target speaker speech enhancement. This research divides the target speaker speech enhancement task into two subtasks: the removal of background noise and the removal of non-target speaker speech. The contributions of this thesis are as follows:
(1) For the noise removal subtask, existing speech enhancement techniques based on score-based diffusion models have predominantly emphasized the diffusion term of the stochastic differential equation at the expense of the drift term. This imbalance leads to inadequate modeling of the reverse diffusion process, which degrades the quality of speech enhancement. To rectify this, the present study introduces a noise removal method grounded in diffusion models. First, a learnable drift term is integrated to accurately model the reverse process of the diffusion model. A novel speech enhancement framework, Drift-DiffuSE, is then proposed, which models the diffusion and drift terms independently through a score sub-module and a drift sub-module, respectively. Moreover, this study introduces a variable drift step size to regulate the reverse denoising process of the diffusion model. Additionally, a drift loss is devised to jointly optimize the score and drift sub-modules. Empirical results indicate that the speech signals enhanced by the proposed method attain the best perceptual quality among generative models and surpass existing diffusion-based speech enhancement techniques in generalization performance.
(2) Regarding the elimination of non-target speaker speech, extant methods based on diffusion models typically depend on the extraction of fixed features of target speakers from clean speech to guide the speech enhancement model in removing non-target speaker signals. However, this approach is not well aligned with the iterative removal characteristics of diffusion models, thereby constraining their efficacy in eliminating non-target speaker speech. To address this limitation, this thesis proposes a novel method for the removal of non-target speaker speech based on diffusion models. Initially, a speaker feature extraction submodule is designed, incorporating attention mechanisms and temporal modeling networks, capable of generating variable-length target speaker features that are congruent with the characteristics of diffusion models. Additionally, this study introduces contrastive loss and speaker classification loss to investigate their impact on the task of non-target speaker speech removal. Experimental results demonstrate that the proposed method outperforms existing diffusion model-based methods for non-target speaker speech removal across various metrics, and the introduced loss functions also contribute positively to the effectiveness of this task.
﹀
|
参考文献: |
︿
[1] Zhang Q, Qian X, Ni Z, et al. A Time-Frequency Attention Module for Neural Speech Enhancement[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2023, 31:462–475.
[2] Zheng C, Zhang H, Liu W, et al. Sixty Years of Frequency-domain Monaural Speech Enhancement: From traditional to deep learning methods[J]. Trends in Hearing, 2023, 27:23312165231209913.
[3] Tai W, Zhou F, Trajcevski G, et al. Revisiting denoising diffusion probabilistic models for speech enhancement: Condition collapse, efficiency and refinement[C]. Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, 2023. 13627–13635.
[4] Richter J, Welker S, Lemercier J M, et al. Speech Enhancement and Dereverberation With Diffusion-Based Generative Models[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2023, 31:2351–2364.
[5] Kamo N, Delcroix M, Nakatani T. Target Speech Extraction with Conditional Diffusion Model[C]. Interspeech, 2023. 176–180.
[6] Zhang L, Qian Y, Yu L, et al. DDTSE: Discriminative diffusion model for target speech extraction[C]. 2024 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2024. 294–301.
[7] 陈欢, 邱晓晖. 改进谱减法语音增强算法的研究[J]. 计算机技术与发展, 2014, 24(4):69–71.
[8] 白帆, 李含雁, 李勇滔, 等. 基于改进谱减法的工地语音增强方法[J]. 装备制造技术, 2023, (08):45–47.
[9] 郭莉莉, 陈永红. 一种改进的谱减法语音增强算法[J]. 通信技术, 2021, 54(06):1350–1355.
[10] 涂井先, 冀占江, 覃桂茳. 降低多通道维纳滤波语音增强方法计算复杂度的新策略[J]. 计算机应用与软件, 2023, 40(11):149–155.
[11] 董胡, 徐雨明, 马振中, 等. 基于小波包与自适应维纳滤波的语音增强算法[J]. 计算机技术与发展, 2020, 30(01):50–53.
[12] Dendrinos M N, Bakamidis S G, Carayannis G. Speech enhancement from noise: A regenerative approach[J]. Speech Communication, 1991, 10:45–57.
[13] Lim J S, Oppenheim A V. Enhancement and bandwidth compression of noisy speech[J]. Proceedings of the IEEE, 1979, 67:1586–1604.
[14] Roman N, Wang D, Brown G J. Speech segregation based on sound localization[J]. The Journal of the Acoustical Society of America, 2003, 114(4):2236–2252.
[15] Vincent E, Virtanen T, Gannot S. Audio source separation and speech enhancement[M]. John Wiley & Sons, 2018.
[16] Pandey A, Wang D. A new framework for CNN-based speech enhancement in the time domain[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2019, 27(7):1179–1188.
[17] Gao T, Du J, Dai L R, et al. Densely connected progressive learning for LSTM-based speech enhancement[C]. 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018. 5054–5058.
[18] Zhang L, Pei K, Li W, et al. A New U-Net Speech Enhancement Framework Based on Correlation Characteristics of Speech[R]. SAE Technical Paper, 2024.
[19] Tan K, Wang D. A Convolutional Recurrent Neural Network for Real-Time Speech Enhancement[C]. Interspeech, 2018. 3229–3233.
[20] Hu Y, Liu Y, Lv S, et al. DCCRN: Deep Complex Convolution Recurrent Network for Phase-Aware Speech Enhancement[C]. Interspeech, 2020. 2472–2476.
[21] Chen J, Wang Z, Tuo D, et al. FullSubNet+: Channel Attention FullSubNet with Complex Spectrograms for Speech Enhancement[C]. ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022. 7857–7861.
[22] Zheng C, Peng X, Zhang Y, et al. Interactive speech and noise modeling for speech enhancement[C]. Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, 2021. 14549–14557.
[23] Pascual S, Bonafonte A, Serrà J. SEGAN: Speech Enhancement Generative Adversarial Network[J]. ArXiv, 2017, abs/1703.09452.
[24] Baby D, Verhulst S. SERGAN: Speech enhancement using relativistic generative adversarial networks with gradient penalty[C]. ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019. 106–110.
[25] Fu S W, Yu C, Hsieh T A, et al. MetricGAN+: An Improved Version of MetricGAN for Speech Enhancement[C]. Interspeech, 2021. 201–205.
[26] Guo C, Huang Z, Li H, et al. MetricGAN+ Speech Enhancement Model Combining Multi-scale and Multi-resolution Feature Network and Joint Perception Loss[C]. 2023 IEEE 4th International Conference on Pattern Recognition and Machine Learning (PRML). IEEE, 2023. 510–514.
[27] Su J, Jin Z, Finkelstein A. HiFi-GAN: High-Fidelity Denoising and Dereverberation Based on Speech Deep Features in Adversarial Networks[J]. ArXiv, 2020, abs/2006.05694.
[28] Cao R, Abdulatif S, Yang B. CMGAN: Conformer-based Metric GAN for Speech Enhancement[J]. ArXiv, 2022, abs/2203.15149.
[29] Leglaive S, Girin L, Horaud R. A Variance Modeling Framework Based on Variational Autoencoders for Speech Enhancement[C]. 2018 IEEE 28th International Workshop on Machine Learning for Signal Processing (MLSP). IEEE, 2018. 1–6.
[30] Richter J, Carbajal G, Gerkmann T. Speech Enhancement with Stochastic Temporal Convolutional Networks[C]. Interspeech, 2020. 4516–4520.
[31] Strauss M, Edler B. A Flow-Based Neural Network for Time Domain Speech Enhancement[C]. ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021. 5754–5758.
[32] Welker S, Richter J, Gerkmann T. Speech Enhancement with Score-Based Generative Models in the Complex STFT Domain[C]. Proc. Interspeech 2022, 2022. 2928–2932.
[33] Lu Y J, Wang Z Q, Watanabe S, et al. Conditional Diffusion Probabilistic Model for Speech Enhancement[C]. ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022. 7402–7406.
[34] Reddy C K A, Dubey H, Gopal V, et al. ICASSP 2022 Deep Noise Suppression Challenge[C]. ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022. 9271–9275.
[35] Ju Y, Rao W, Yan X, et al. TEA-PSE: Tencent-Ethereal-Audio-Lab Personalized Speech Enhancement System for ICASSP 2022 DNS Challenge[C]. ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022. 9291–9295.
[36] Desplanques B, Thienpondt J, Demuynck K. ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification[J]. ArXiv, 2020, abs/2005.07143.
[37] Ju Y, Zhang S, Rao W, et al. TEA-PSE 2.0: Sub-Band Network for Real-Time Personalized Speech Enhancement[C]. 2022 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2023. 472–479.
[38] Dubey H, Gopal V, Cutler R, et al. ICASSP 2023 Deep Noise Suppression Challenge[J]. IEEE Open Journal of Signal Processing, 2023, 5:725–737.
[39] Ju Y, Chen J, Zhang S, et al. TEA-PSE 3.0: Tencent-Ethereal-Audio-Lab Personalized Speech Enhancement System for ICASSP 2023 DNS Challenge[C]. ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023. 1–2.
[40] Wang Q, López-Moreno I, Saglam M, et al. VoiceFilter-Lite: Streaming Targeted Voice Separation for On-Device Speech Recognition[J]. ArXiv, 2020, abs/2009.04323.
[41] Schröter H, Maier A, Escalante-B A N, et al. DeepFilterNet2: Towards real-time speech enhancement on embedded devices for full-band audio[C]. 2022 International Workshop on Acoustic Signal Enhancement (IWAENC). IEEE, 2022. 1–5.
[42] Serre T, Fontaine M, Benhaim É, et al. A Lightweight Dual-Stage Framework for Personalized Speech Enhancement Based on DeepFilterNet2[C]. 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops (ICASSPW), 2024. 780–784.
[43] Pärnamaa T, Saabas A. Personalized Speech Enhancement Without a Separate Speaker Embedding Model[J]. ArXiv, 2024, abs/2406.09928.
[44] Nguyen T, Sun G, Zheng X, et al. Conditional Diffusion Model for Target Speaker Extraction[J]. ArXiv, 2023, abs/2310.04791.
[45] Wang D. On Ideal Binary Mask As the Computational Goal of Auditory Scene Analysis[C]. Speech Separation by Humans and Machines, 2005. 181–197.
[46] Narayanan A, Wang D. Ideal ratio mask estimation using deep neural networks for robust speech recognition[C]. 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, 2013. 7092–7096.
[47] Williamson D S, Wang Y, Wang D. Complex Ratio Masking for Monaural Speech Separation[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2016, 24:483–492.
[48] Rix A W, Beerends J G, Hollier M, et al. Perceptual evaluation of speech quality (PESQ)-a new method for speech quality assessment of telephone networks and codecs[C]. 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.01CH37221), 2001, 2:749–752.
[49] Jensen J H, Taal C H. An Algorithm for Predicting the Intelligibility of Speech Masked by Modulated Noise Maskers[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2016, 24:2009–2022.
[50] Le Roux J, Wisdom S, Erdogan H, et al. SDR–half-baked or well done?[C]. ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019. 626–630.
[51] Lu Y J, Tsao Y, Watanabe S. A study on speech enhancement based on diffusion probabilistic model[C]. 2021 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC). IEEE, 2021. 659–666.
[52] Chen C, Hu Y, Weng W, et al. Metric-oriented speech enhancement using diffusion probabilistic model[C]. ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023. 1–5.
[53] Hu Y, Chen C, Li R, et al. Noise-aware Speech Enhancement using Diffusion Probabilistic Model[J]. ArXiv, 2023, abs/2307.08029.
[54] Zhang C, Zhang C, Zheng S, et al. A Survey on Audio Diffusion Models: Text To Speech Synthesis and Enhancement in Generative AI[J]. ArXiv, 2023, abs/2303.13336.
[55] Lemercier J M, Richter J, Welker S, et al. StoRM: A diffusion-based stochastic regeneration model for speech enhancement and dereverberation[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2023, 31:2724–2737.
[56] Shi H, Shimada K, Hirano M, et al. Diffusion-based speech enhancement with joint generative and predictive decoders[C]. ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2024. 12951–12955.
[57] Serrà J, Pascual S, Pons J, et al. Universal Speech Enhancement with Score-based Diffusion[J]. ArXiv, 2022, abs/2206.03065.
[58] Li Y, Turner R E. Gradient Estimators for Implicit Models[J]. ArXiv, 2017, abs/1705.07107.
[59] Richter J, Frintrop S, Gerkmann T. Audio-Visual Speech Enhancement with Score-Based Generative Models[J]. ArXiv, 2023, abs/2306.01432.
[60] Kim D, Yang D H, Kim D, et al. Guided conditioning with predictive network on score-based diffusion model for speech enhancement[C]. Proc. Interspeech 2024, 2024. 1190–1194.
[61] Zhang Q, Chen Y. Diffusion Normalizing Flow[J]. Advances in Neural Information Processing Systems, 2021, 34:16280–16291.
[62] Gerkmann T, Martin R. Empirical Distributions of DFT-domain Speech Coefficients Based on Estimated Speech Variances[C]. Proc. Int. Workshop Acoust. Echo Noise Control, 2010. 1–4.
[63] Song Y, Sohl-Dickstein J N, Kingma D P, et al. Score-Based Generative Modeling through Stochastic Differential Equations[J]. ArXiv, 2020, abs/2011.13456.
[64] Brock A, Donahue J, Simonyan K. Large Scale GAN Training for High Fidelity Natural Image Synthesis[J]. ArXiv, 2018, abs/1809.11096.
[65] Wu Y, He K. Group Normalization[J]. International Journal of Computer Vision, 2018, 128:742–755.
[66] Zhang R. Making convolutional networks shift-invariant again[C]. International Conference on Machine Learning. PMLR, 2019. 7324–7334.
[67] Ramachandran P, Zoph B, Le Q V. Swish: a self-gated activation function[J]. arXiv preprint arXiv:1710.05941, 2017, 7(1):5.
[68] Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need[C]. Proceedings of the 31st International Conference on Neural Information Processing Systems, 2017. 6000–6010.
[69] Song Y, Ermon S. Generative modeling by estimating gradients of the data distribution[J]. Advances in Neural Information Processing Systems, 2019, 32:11918–11930.
[70] Gilks W R, Richardson S, Spiegelhalter D. Markov Chain Monte Carlo in Practice[M]. CRC Press, 1995.
[71] Botinhao C V, Wang X, Takaki S, et al. Investigating RNN-based speech enhancement methods for noise-robust text-to-speech[C]. 9th ISCA Speech Synthesis Workshop, 2016. 159–165.
[72] Veaux C, Yamagishi J, King S. The Voice Bank corpus: Design, collection and data analysis of a large regional accent speech database[C]. 2013 International Conference Oriental COCOSDA held jointly with 2013 Conference on Asian Spoken Language Research and Evaluation (O-COCOSDA/CASLRE). IEEE, 2013. 1–4.
[73] Thiemann J, Ito N, Vincent E. Diverse Environments Multichannel Acoustic Noise Database (DEMAND), 2013.
[74] Garofolo J, Graff D, Paul D, et al. CSR-I (WSJ0) Complete LDC93S6A. Linguistic Data Consortium, Philadelphia, 1993.
[75] Snyder D, Chen G, Povey D. MUSAN: A Music, Speech, and Noise Corpus, 2015. arXiv:1510.08484v1.
[76] Luo Y, Mesgarani N. Conv-TasNet: Surpassing ideal time-frequency magnitude masking for speech separation[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2019, 27(8):1256–1266.
[77] Li A, Zheng C, Zhang L, et al. Glance and gaze: A collaborative learning framework for single-channel speech enhancement[J]. Applied Acoustics, 2022, 187:108499.
[78] Bie X, Leglaive S, Alameda-Pineda X, et al. Unsupervised speech enhancement using dynamical variational autoencoders[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2022, 30:2993–3007.
[79] Scheibler R, Fujita Y, Shirahata Y, et al. Universal Score-based Speech Enhancement with High Content Preservation[J]. ArXiv, 2024, abs/2406.12194.
[80] Xu Y, Chen H, Yu J, et al. SECap: Speech Emotion Captioning with Large Language Model[J]. ArXiv, 2023, abs/2312.10381.
[81] Thoidis I, Goehring T. Using deep learning to improve the intelligibility of a target speaker in noisy multi-talker environments for people with normal hearing and hearing loss[J]. The Journal of the Acoustical Society of America, 2024, 156(1):706–724.
[82] Shan Y, Zhu Z, Long T, et al. Contrastive Diffuser: Planning Towards High Return States via Contrastive Learning[J]. ArXiv, 2024, abs/2402.02772.
[83] Gutmann M U, Hyvärinen A. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models[C]. International Conference on Artificial Intelligence and Statistics, 2010. 307–361.
[84] Hershey J R, Chen Z, Roux J L, et al. Deep clustering: Discriminative embeddings for segmentation and separation[C]. 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016. 31–35.
[85] Cosentino J, Pariente M, Cornell S, et al. LibriMix: An open-source dataset for generalizable speech separation[J]. arXiv preprint, 2020, arXiv:2005.11262.
[86] Panayotov V, Chen G, Povey D, et al. Librispeech: An ASR corpus based on public domain audio books[C]. 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015. 5206–5210.
[87] Žmolíková K, Delcroix M, Kinoshita K, et al. SpeakerBeam: Speaker Aware Neural Network for Target Speaker Extraction in Speech Mixtures[J]. IEEE Journal of Selected Topics in Signal Processing, 2019, 13:800–814.
[88] Xu C, Rao W, Chng E S, et al. SpEx: Multi-Scale Time Domain Speaker Extraction Network[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2020, 28:1370–1384.
[89] Ge M, Xu C, Wang L, et al. SpEx+: A Complete Time Domain Speaker Extraction Network[C]. Interspeech, 2020. 1406–1410.
[90] Liu K, Du Z, Wan X, et al. X-SEPFORMER: End-To-End Speaker Extraction Network with Explicit Optimization on Speaker Confusion[C]. ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023. 1–5.
[91] Nguyen T, Sun G, Zheng X, et al. Conditional Diffusion Model for Target Speaker Extraction[J]. ArXiv, 2023, abs/2310.04791.
﹀
|
中图分类号: |
TP391
|
馆藏号: |
2025-016-0161
|
开放日期: |
2025-09-29
|