
A Gated Recurrent Unit Based Robust Voice Activity Detector


Affiliations
1 Institute of Information Technology, Hightech Research & Development Center, Kim Il Sung University, Pyongyang, Democratic People's Republic of Korea
 

Voice activity detection (VAD), which identifies speech and non-speech segments in a speech signal, is a challenging task in noisy environments and is essential for many speech applications. In this paper, we propose a Gated Recurrent Unit (GRU) based VAD that uses MFCCs augmented with delta and delta-delta features in low signal-to-noise ratio (SNR) environments to overcome the shortcomings of traditional VAD models. We compare the proposed method with traditional methods on speech signals corrupted by 10 types of noise at low SNRs. Experimental results reveal that the GRU-based method is superior to the traditional methods under all the considered noisy environments, indicating that the GRU network improves the performance of speech detection.

Keywords

voice activity detection, deep neural network, recurrent neural network, gated recurrent unit.
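As background for the approach named in the abstract, the sketch below illustrates its two main ingredients in plain Python: the standard regression-based delta and delta-delta feature augmentation applied to MFCC frames, and a single GRU update step following Cho et al. (2014). This is a minimal illustration, not the authors' implementation: the weight names, the scalar (one-unit) GRU, and the edge-clamping in the delta window are assumptions made for brevity.

```python
import math

def deltas(frames, N=2):
    """Regression-based delta features over per-frame coefficient
    vectors (list of lists), using the standard window formula
    d_t = sum_n n * (c_{t+n} - c_{t-n}) / (2 * sum_n n^2)."""
    T, D = len(frames), len(frames[0])
    denom = 2 * sum(n * n for n in range(1, N + 1))
    out = []
    for t in range(T):
        row = []
        for d in range(D):
            num = 0.0
            for n in range(1, N + 1):
                prev = frames[max(t - n, 0)][d]      # clamp at sequence edges
                nxt = frames[min(t + n, T - 1)][d]
                num += n * (nxt - prev)
            row.append(num / denom)
        out.append(row)
    return out

def augment(mfcc):
    """Concatenate static MFCCs with their delta and delta-delta
    features, the augmentation described in the abstract."""
    d1 = deltas(mfcc)
    d2 = deltas(d1)
    return [s + a + b for s, a, b in zip(mfcc, d1, d2)]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gru_step(x, h, w):
    """One scalar GRU update (Cho et al., 2014): update gate z,
    reset gate r, candidate state, and a gated interpolation."""
    z = sigmoid(w['wz'] * x + w['uz'] * h + w['bz'])   # update gate
    r = sigmoid(w['wr'] * x + w['ur'] * h + w['br'])   # reset gate
    cand = math.tanh(w['wh'] * x + w['uh'] * (r * h) + w['bh'])
    return (1.0 - z) * h + z * cand
```

For a constant-slope one-coefficient track such as [[1.0], [2.0], [3.0], [4.0], [5.0]], the interior delta values equal the slope, and augment yields 3-dimensional frames. A full VAD along these lines would run such augmented frames through one or more GRU layers and classify each frame as speech or non-speech.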

References

  • S.F. Boll, Suppression of Acoustic Noise in Speech Using Spectral Subtraction, IEEE Transactions on Acoustics, Speech and Signal Processing, 27(2), 1979, 113-120.
  • A. Benyassine, E. Shlomot, H.Y. Su, D. Massaloux, C. Lamblin, and J.P. Petit, ITU-T Recommendation G.729 Annex B: a Silence Compression Scheme for Use with G.729 Optimized for V.70 Digital Simultaneous Voice and Data Applications, IEEE Communications Magazine, 35(9), 1997, 64-73.
  • S.B. Tong, N.X. Chen, Y.M. Qian, and K. Yu, Evaluating VAD for Automatic Speech Recognition, Proc. 12th International Conf. on Signal Processing, Hangzhou, PRC, 2014, 2308-2314.
  • L. Rabiner, and M.R. Sambur, An Algorithm for Determining the Endpoints of Isolated Utterances, Bell System Technical Journal, 54(2), 1975, 297-315.
  • J. Ramirez, J.C. Segura, C. Benitez, A. de la Torre, and A. Rubio, Efficient Voice Activity Detection Algorithms Using Long-Term Speech Information, Speech Communication, 42(3-4), 2004, 271-287.
  • X.K. Yang, L. He, D. Qu, and W.Q. Zhang, Voice Activity Detection Algorithm Based on Long-Term Pitch Information, EURASIP Journal on Audio, Speech and Music Processing, 2016:14, 2016, 1-9.
  • Y.N. Ma, and A. Nishihara, Efficient Voice Activity Detection Algorithm Using Long-Term Spectral Flatness Measure, EURASIP Journal on Audio, Speech and Music Processing, 2013:21, 2013, 1-18.
  • K. Ishizuka, T. Nakatani, M. Fujimoto, and N. Miyazaki, Noise Robust Voice Activity Detection Based on Periodic to Aperiodic Component Ratio, Speech Communication, 52(1), 2010, 41-60.
  • J.S. Sohn, N.S. Kim, and W.Y. Sung, A Statistical Model-Based Voice Activity Detection, IEEE Signal Processing Letters, 6(1), 1999, 1-3.
  • E.Q. Dong, G.Z. Liu, Y.T. Zhou, and X.D. Zhang, Applying Support Vector Machines to Voice Activity Detection, Proc. 6th International Conf. on Signal Processing, Beijing, PRC, 2002, 1124-1127.
  • T. Kinnunen, E. Chernenko, M. Tuononen, P. Fränti, and H.Z. Li, Voice Activity Detection Using MFCC Features and Support Vector Machine, Proc. International Conf. on Speech and Computer, 2007, 556-561.
  • Q.H. Jo, J.H. Chang, J.W. Shin, and N.S. Kim, Statistical Model-Based Voice Activity Detection Using Support Vector Machine, IET Signal Processing, 3(3), 2009, 205-210.
  • G. Ferroni, R. Bonfigli, E. Principi, S. Squartini, and F. Piazza, A Deep Neural Network Approach for Voice Activity Detection in Multi-Room Domestic Scenarios, Proc. International Joint Conf. on Neural Networks, Killarney, Ireland, 2015, 1-8.
  • X.L. Zhang, and D.L. Wang, Boosting Contextual Information for Deep Neural Network Based Voice Activity Detection, IEEE/ACM Transactions on Audio, Speech and Language Processing, 24(2), 2016, 252-264.
  • M. Espi, M. Fujimoto, K. Kinoshita, and T. Nakatani, Exploiting Spectro-Temporal Locality in Deep Learning Based Acoustic Event Detection, EURASIP Journal on Audio, Speech and Music Processing, 2015:26, 2015, 1-12.
  • S.M. Valentin, N.P. Tatiana, and A.P. Alexey, Robust Voice Activity Detection with Deep Maxout Neural Networks, Modern Applied Science, 9(8), 2015, 153-159.
  • X.L. Zhang, and J. Wu, Deep Belief Networks Based Voice Activity Detection, IEEE Transactions on Audio, Speech and Language Processing, 21(4), 2013, 697-710.
  • S.Y. Chang, B. Li, G. Simko, T.N. Sainath, A. Tripathi, A. van den Oord, and O. Vinyals, Temporal Modeling Using Dilated Convolution and Gating for Voice-Activity-Detection, Proc. IEEE International Conf. on Acoustics, Speech and Signal Processing, Calgary, Canada, 2018, 5549-5553.
  • A. Sehgal, and N. Kehtarnavaz, A Convolutional Neural Network Smartphone App for Real-Time Voice Activity Detection, IEEE Access, 6, 2018, 9017-9026.
  • M. Lavechin, M.P. Gill, R. Bousbib, H. Bredin, and L.P. Garcia-Perera, End-to-End Domain-Adversarial Voice Activity Detection, Proc. Conference of the International Speech Communication Association, Shanghai, PRC, 2020, 3685-3689.
  • T.J. Xu, H. Zhang, and X.L. Zhang, Polishing the Classical Likelihood Ratio Test by Supervised Learning for Voice Activity Detection, Proc. Conference of the International Speech Communication Association, Shanghai, PRC, 2020, 3675-3679.
  • Z.P. Zheng, J.Z. Wang, N. Cheng, J. Luo, and J. Xiao, MLNET: an Adaptive Multiple Receptive-Field Attention Neural Network for Voice Activity Detection, Proc. Conference of the International Speech Communication Association, Shanghai, PRC, 2020, 3695-3699.
  • T. Mikolov, M. Karafiat, L. Burget, J. Cernocky, and S. Khudanpur, Recurrent Neural Network Based Language Model, Proc. Conference of the International Speech Communication Association, Makuhari, Japan, 2010, 1045-1048.
  • S. Dwijayanti, K. Yamamori, and M. Miyoshi, Enhancement of Speech Dynamics for Voice Activity Detection Using DNN, EURASIP Journal on Audio, Speech and Music Processing, 2018:10, 2018, 1-15.
  • K.H. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, Learning Phrase Representations Using RNN Encoder-Decoder for Statistical Machine Translation, arXiv preprint, arXiv:1406.1078, 2014.
  • S. Hochreiter, and J. Schmidhuber, Long Short-Term Memory, Neural Computation, 9(8), 1997, 1735-1780.
  • F.A. Gers, N.N. Schraudolph, and J. Schmidhuber, Learning Precise Timing with LSTM Recurrent Networks, Journal of Machine Learning Research, 3(1), 2003, 115-143.
  • J.Y. Chung, C. Gulcehre, K.H. Cho, and Y. Bengio, Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling, arXiv preprint, arXiv:1412.3555, 2014.
  • M. Schuster, and K.K. Paliwal, Bidirectional Recurrent Neural Networks, IEEE Transactions on Signal Processing, 45(11), 1997, 2673-2681.
  • NOISEX-92 Database, Rice University, Available at: http://spib.linse.ufsc.br/noise.html. Accessed on 22 Feb 2017.
  • J.L. Ba, J.R. Kiros, and G.E. Hinton, Layer Normalization, arXiv preprint, arXiv:1607.06450, 2016.
  • J.S. Garofolo, L.F. Lamel, W.M. Fisher, J.G. Fiscus, D.S. Pallett, and N.L. Dahlgren, DARPA TIMIT Acoustic-Phonetic Continuous Speech Corpus CD-ROM, NIST Interagency/Internal Report, NISTIR-4930, NIST, Gaithersburg, 1993.
  • 100 Nonspeech Environmental Sounds, Available at: http://www.pudn.com/Download/item/id/3457634.html, 2018.
  • D. Kingma, and J. Ba, Adam: a Method for Stochastic Optimization, arXiv preprint, arXiv:1412.6980, 2014.
  • R.O. Duda, P.E. Hart, and D.G. Stork, Pattern Classification, 2nd edn, Wiley-Interscience, New York, 2001.






Authors

Il Han, Chol Nam Om, Un Il Kim, Jang Su Kim
Institute of Information Technology, Hightech Research & Development Center, Kim Il Sung University, Pyongyang, Democratic People's Republic of Korea

