
Multichannel Speech Enhancement of Target Speaker Based on Wakeup Word Mask Estimation with Deep Neural Network


Affiliations
1 Institute of Information Technology, Hightech Research & Development Center, Kim Il Sung University, Pyongyang, Democratic People's Republic of Korea

In this paper, we address a multichannel speech enhancement method based on wakeup word mask estimation using a Deep Neural Network (DNN). The wakeup word is an important clue for identifying the target speaker. We use a DNN to estimate a wakeup word mask and a noise mask, and apply them to separate the mixed wakeup word signal into the target speaker's speech and background noise. A Convolutional Recurrent Neural Network (CRNN) is used to exploit both short- and long-term time-frequency dependencies of sequences such as speech signals. Generalized Eigenvector (GEV) beamforming estimates a spatial filter from the masks to enhance the subsequent speech command of the target speaker and suppress undesirable noise. Experimental results show that the proposed method is more robust to noise, improving the Signal-to-Noise Ratio (SNR) and speech recognition accuracy.
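As an illustrative sketch only (not the paper's implementation), the mask-based GEV beamforming step described above can be prototyped as follows: the estimated speech and noise masks weight the per-frame outer products of the multichannel STFT to form spatial covariance matrices, and the beamformer for each frequency bin is the principal generalized eigenvector of the speech covariance with respect to the noise covariance. All function and variable names here are assumptions for illustration.

```python
import numpy as np
from scipy.linalg import eigh

def gev_beamformer(Y, speech_mask, noise_mask):
    """Estimate a per-frequency GEV beamformer from time-frequency masks.

    Y           : (F, T, C) complex multichannel STFT
    speech_mask : (F, T) real-valued mask for the target (wakeup word) speech
    noise_mask  : (F, T) real-valued mask for the background noise
    Returns w   : (F, C) beamformer weights, one filter per frequency bin.
    """
    F, T, C = Y.shape
    w = np.zeros((F, C), dtype=complex)
    for f in range(F):
        Yf = Y[f]  # (T, C) frames for this frequency bin
        # Mask-weighted spatial covariance matrices (sums of rank-1 outer products)
        outer = np.einsum('tc,td->tcd', Yf, Yf.conj())
        phi_xx = (speech_mask[f, :, None, None] * outer).sum(0)
        phi_nn = (noise_mask[f, :, None, None] * outer).sum(0)
        phi_nn += 1e-6 * np.eye(C)  # regularize so phi_nn is positive definite
        # Principal generalized eigenvector of (phi_xx, phi_nn) maximizes output SNR
        _, vecs = eigh(phi_xx, phi_nn)
        w[f] = vecs[:, -1]  # eigh returns ascending eigenvalues; take the largest
    return w

# Apply the beamformer: enhanced STFT X_hat[f, t] = w[f]^H Y[f, t]
# X_hat = np.einsum('fc,ftc->ft', w.conj(), Y)
```

Note that the GEV solution is defined only up to a per-frequency complex scale, so practical systems typically add a normalization step (e.g., blind analytic normalization) before resynthesis.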

Keywords

Multichannel Speech Enhancement, Wakeup Word, Mask Estimation, Beamforming, Deep Neural Network (DNN).

References

  • B.Y. Xia, and C.C. Bao, Speech enhancement with weighted denoising auto-encoder, Proc. 14th Annual Conf. of the International Speech Communication Association, Lyon, France, 2013, 3411–3415.
  • J. Heymann, L. Drude, A. Chinaev, and R. Haeb-Umbach, BLSTM supported GEV beamformer front-end for the 3rd CHiME challenge, Proc. IEEE Workshop on Automatic Speech Recognition and Understanding, Scottsdale, AZ, 2015, 444-451.
  • B.D. Van Veen, and K.M. Buckley, Beamforming: a versatile approach to spatial filtering, IEEE ASSP Magazine, 5(2), 1988, 4-24.
  • S. Doclo, W. Kellermann, S. Makino, and S. Nordholm, Multichannel signal enhancement algorithms for assisted listening devices, IEEE Signal Processing Magazine, 32(2), 2015, 18-30.
  • T. Hori, Z. Chen, H. Erdogan, J.R. Hershey, J. Le Roux, V. Mitra, and S. Watanabe, Multi-microphone speech recognition integrating beamforming, robust feature extraction, and advanced DNN/RNN backend, Computer Speech and Language, 46, 2017, 401-418.
  • Y. Kida, D. Tran, M. Omachi, T. Taniguchi, and Y. Fujita, Speaker selective beamformer with keyword mask estimation, Proc. 2018 IEEE Workshop on Spoken Language Technology, Athens, Greece, 2018, 528-534.
  • E. Warsitz, and R. Haeb-Umbach, Blind acoustic beamforming based on generalized eigenvalue decomposition, IEEE Transactions on Audio, Speech, and Language Processing, 15(5), 2007, 1529-1539.
  • J. Heymann, L. Drude, and R. Haeb-Umbach, Neural network based spectral mask estimation for acoustic beamforming, Proc. 41st IEEE International Conf. on Acoustics, Speech and Signal Processing, Shanghai, PRC, 2016, 196–200.
  • J. Heymann, L. Drude, and R. Haeb-Umbach, A generic neural acoustic beamforming architecture for robust multi-channel speech processing, Computer Speech & Language, 46, 2017, 374-385.
  • L. Yin, H. Ying, L.D. Kun, L. Rui, and Y.M. Hao, Chinese sign language recognition based on two-stream CNN and LSTM network, International Journal of Advanced Networking and Applications, 14(6), 2023, 5666-5671.
  • P. Elechi, E. Okowa, and O.P. Illuma, Analysis of a SONAR detecting system using multi-beamforming algorithm, International Journal of Advanced Networking and Applications, 14(5), 2023, 5596-5601.
  • D. Amodei, S. Ananthanarayan, R. Anubhai, J.L. Bai, E. Battenberg, C. Case, J. Casper, B. Catanzaro, and Q. Cheng, Deep speech 2: End-to-end speech recognition in English and Mandarin, Proc. 33rd International Conf. on Machine Learning, New York, NY, 2016.
  • Y.B. Zhou, C.M. Xiong, and R. Socher, Regularization techniques for end-to-end speech recognition, Patent, San Francisco, CA, US, US20190130896A1, 2019.
  • F.Y. Hou, L. Xie, and Z.H. Fu, Investigating neural network based query-by-example keyword spotting approach for personalized wake-up word detection in Mandarin Chinese, Proc. 10th International Symposium on Chinese Spoken Language Processing, Tianjin, PRC, 2017.
  • G.G. Chen, C. Parada, and G. Heigold, Small-footprint keyword spotting using deep neural networks, Proc. 2014 IEEE International Conf. on Acoustics, Speech and Signal Processing, Florence, Italy, 2014.
  • Y.D. Zhang, N. Suda, L.Z. Lai, and V. Chandra, Hello Edge: Keyword spotting on microcontrollers, arXiv: 1711.07128, 2017.
  • T.N. Sainath, and C. Parada, Convolutional neural networks for small-footprint keyword spotting, Proc. 16th Annual Conf. of the International Speech Communication Association, Dresden, Germany, 2015.
  • A. Krueger, E. Warsitz, and R. Haeb-Umbach, Speech enhancement with a GSC-like structure employing eigenvector-based transfer function ratios estimation, IEEE Transactions on Audio, Speech and Language Processing, 19(1), 2011, 206–219.
  • H. Lucy, The MagPi (Raspberry Pi Trading Ltd, 30 Station Road, Cambridge, 2018).


Authors

Chol Nam Om
Institute of Information Technology, Hightech Research & Development Center, Kim Il Sung University, Pyongyang, Democratic People's Republic of Korea
Hyok Kwak
Institute of Information Technology, Hightech Research & Development Center, Kim Il Sung University, Pyongyang, Democratic People's Republic of Korea
Chong Il Kwak
Institute of Information Technology, Hightech Research & Development Center, Kim Il Sung University, Pyongyang, Democratic People's Republic of Korea
Song Gum Ho
Institute of Information Technology, Hightech Research & Development Center, Kim Il Sung University, Pyongyang, Democratic People's Republic of Korea
Hyon Gyong Jang
Institute of Information Technology, Hightech Research & Development Center, Kim Il Sung University, Pyongyang, Democratic People's Republic of Korea
