
Estimating Social Background Profiling of Indian Speakers by Acoustic Speech Features


Affiliations
1 Faculty of Integrated Technologies, Universiti Brunei Darussalam, Jalan Tungku Link, Brunei Darussalam
 

Social background profiling refers to estimating the geographical origin of speakers from their speech features. Accent-profiling methods that use linguistic features require phoneme alignment and transcription of the speech samples. This paper proposes a purely acoustic accent-profiling model, composed of multiple convolutional networks with global average-pooling layers, to classify the temporal sequence of acoustic features. The bottleneck representations of the convolutional networks, trained on the original signals and their low-pass filtered copies, are fed to a Support Vector Machine classifier for the final prediction. The model has been evaluated on a speech dataset of Indian speakers from social backgrounds spread across India. Up to 85% accuracy is achievable for classifying the geographic origin of speakers corresponding to regional Indian languages, 17% higher than a benchmark deep learning model using the same features. The results also indicate that accents are easier to classify from speech in the speakers' second language than in their native language.
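The pipeline described above — parallel feature extractors over the original signal and a low-pass filtered copy, global average pooling over the time axis, and an SVM on the fused embeddings — can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: hand-crafted framewise features (log-energy and a zero-crossing measure) stand in for the learned CNN bottleneck representations, and the signals, cutoff frequency, and two "accent" classes are synthetic.

```python
import numpy as np
from scipy.signal import butter, lfilter
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

SR = 8000          # sample rate of the synthetic signals (Hz)
FRAME = 200        # 25 ms frames at 8 kHz

def low_pass(signal, sr, cutoff_hz, order=4):
    # Butterworth low-pass filter: attenuates content above cutoff_hz
    b, a = butter(order, cutoff_hz / (sr / 2), btype="low")
    return lfilter(b, a, signal)

def framewise_features(signal):
    # Stand-in for the CNN's framewise acoustic features:
    # per-frame log-energy and mean absolute sign-change (zero-crossing proxy)
    frames = signal[: len(signal) // FRAME * FRAME].reshape(-1, FRAME)
    log_energy = np.log(np.mean(frames ** 2, axis=1) + 1e-8)
    zcr = np.mean(np.abs(np.diff(np.sign(frames), axis=1)), axis=1)
    return np.stack([log_energy, zcr], axis=1)          # shape (T, 2)

def gap(features):
    # Global average pooling over time: (T, D) -> (D,)
    return features.mean(axis=0)

def fused_embedding(signal, sr, cutoff_hz=600):
    # One embedding per "view" (original and low-pass filtered), concatenated
    views = [signal, low_pass(signal, sr, cutoff_hz)]
    return np.concatenate([gap(framewise_features(v)) for v in views])

# Toy dataset: two classes of noisy tones at different base frequencies
rng = np.random.default_rng(0)
X, y = [], []
for label, f0 in [(0, 300.0), (1, 1200.0)]:
    for _ in range(10):
        t = np.arange(SR) / SR
        sig = np.sin(2 * np.pi * (f0 + rng.normal(0, 20)) * t)
        sig += 0.1 * rng.standard_normal(SR)
        X.append(fused_embedding(sig, SR))
        y.append(label)

Xtr, Xte, ytr, yte = train_test_split(
    np.array(X), np.array(y), test_size=0.3, stratify=y, random_state=0)
clf = SVC(kernel="rbf").fit(Xtr, ytr)
print(f"held-out accuracy: {clf.score(Xte, yte):.2f}")
```

The fused vector is deliberately low-dimensional here (two features per view); in the actual model the concatenated bottleneck representations of the trained convolutional networks would take its place before the SVM.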

Keywords

Accent identification, Low-pass filtering, Ensemble learning, Native language identification, Speaker profiling.


Authors

Mohammad Ali Humayun
Faculty of Integrated Technologies, Universiti Brunei Darussalam, Jalan Tungku Link, Brunei Darussalam
Hayati Yassin
Faculty of Integrated Technologies, Universiti Brunei Darussalam, Jalan Tungku Link, Brunei Darussalam
Pg Emeroylariffion Abas
Faculty of Integrated Technologies, Universiti Brunei Darussalam, Jalan Tungku Link, Brunei Darussalam
