Laboratoire d'Informatique et Systèmes (LIS) - UMR CNRS 7020

Master's research internship: Deep learning for speech perception
(French title: Modélisation de la perception de la parole par apprentissage profond)

Length of internship: 5-6 months
Start date: between January and March
Contact: Ricard Marxer

Context

Recent deep learning (DL) developments have been key to breakthroughs in many artificial intelligence (AI) tasks such as automatic speech recognition (ASR) [1] and speech enhancement [2]. Over the past decade, the performance of such systems on reference corpora has increased steadily, driven by improvements in data modeling and representation learning techniques. However, our understanding of human speech perception has not benefited from these advances. This internship lays the groundwork for a project that proposes to gain knowledge about our perception of speech by means of large-scale data-driven modeling and statistical methods. By leveraging modern deep learning techniques and exploiting large corpora of data, we aim to build models capable of predicting human comprehension of speech at a higher level of detail than any existing approach [3].

This internship is funded by the ANR JCJC project MIM (Microscopic Intelligibility Modeling), which aims to predict and describe speech perception at the stimulus, listener and sub-word levels. The project will also fund a PhD position; the call for applications will be published in the coming months. A follow-up PhD within the project could be foreseen for the successful candidate of this internship.

Subject

The goal of this internship is to produce the first DL-based models that predict human intelligibility at the sublexical level. This translates into predicting the positions in the audio stimuli where confusions will occur and the types of confusions, in other words which phonemes are confused with which others, on an individual basis. We will use data from a public corpus of consistent confusions that we have produced (http://spandh.dcs.shef.ac.uk/ECCC/): speech-in-noise stimuli that evoke the same misrecognition among multiple listeners. In order to simplify this first approach to microscopic intelligibility prediction, we restrict ourselves to single-word data. This reduces the lexical factors to aspects such as usage frequency and neighborhood density, significantly limiting the complexity of the required language model.

Consistent confusions are valuable experimental data about the human speech perception process. They provide targets for how intelligibility models should differ from automatic speech recognition (ASR) systems: while ASR models are optimised to recognise what has been uttered, the proposed models should output what has been perceived by a set of listeners.

A first sub-task consists of creating baseline models that predict listeners' responses to the noisy speech stimuli. We will target predictions at different levels of granularity, such as predicting the type of confusion, which phones are misperceived, or how a particular phone is confused. Several models regularly used in speech recognition tasks will be trained and evaluated on predicting the misperceptions in the consistent confusion corpus. We will first focus on well-established models such as GMM-HMMs and/or simple deep learning architectures. Advanced neural topologies such as TDNNs, CTC-based or attention-based models (e.g. CPC, wav2vec 2.0) will also be explored, even though the relatively small amount of training data in the corpus is likely to be a limiting factor. A minimal illustration of such a baseline predictor is sketched below.
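To make the baseline sub-task concrete, here is a minimal sketch of a per-position phone-confusion predictor in PyTorch. It assumes that phone segment boundaries are available from a forced alignment of the target word and that listener responses have been normalised into per-position distributions; the names (PhoneConfusionNet), the feature dimension and the phone inventory size are hypothetical placeholders, and the soft-target cross-entropy objective is only one plausible training choice, not the evaluation metrics defined in [3] nor the project's actual architecture.

```python
# Illustrative sketch only: a minimal per-position phone-confusion predictor.
# All names, dimensions and the training objective are assumptions, not the
# MIM project's models or the metrics of [3].
import torch
import torch.nn as nn

N_PHONES = 40   # assumed size of the phone inventory (placeholder)
FEAT_DIM = 40   # assumed acoustic feature dimension, e.g. log-mel bands

class PhoneConfusionNet(nn.Module):
    """Maps acoustic frames of a noisy word to a distribution over the
    phones perceived at each phone position of the target word."""
    def __init__(self, feat_dim=FEAT_DIM, hidden=128, n_phones=N_PHONES):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, n_phones)

    def forward(self, frames, segment_bounds):
        # frames: (T, feat_dim) features of one stimulus
        # segment_bounds: (start, end) frame indices per phone position,
        # assumed to come from a forced alignment of the target word
        enc, _ = self.encoder(frames.unsqueeze(0))   # (1, T, 2*hidden)
        enc = enc.squeeze(0)
        pooled = torch.stack([enc[s:e].mean(dim=0) for s, e in segment_bounds])
        return self.head(pooled)                     # (n_positions, n_phones) logits

# Training target: the empirical distribution of listener responses per position
# (dummy data below). Soft-target cross-entropy is one plausible objective.
model = PhoneConfusionNet()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

frames = torch.randn(120, FEAT_DIM)                  # dummy 120-frame stimulus
segment_bounds = [(0, 40), (40, 80), (80, 120)]      # dummy 3-phone alignment
listener_dist = torch.softmax(torch.randn(3, N_PHONES), dim=-1)  # dummy normalised responses

optimizer.zero_grad()
logits = model(frames, segment_bounds)
loss = nn.functional.cross_entropy(logits, listener_dist)
loss.backward()
optimizer.step()
```

In practice, the recurrent encoder above could be swapped for the more advanced topologies mentioned in the text (TDNNs, CTC- or attention-based models) while keeping the same per-position outputs and listener-response targets.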
As a starting point, we envisage solving the three tasks described in [3]: 1) predicting the probability of a misrecognition occurring at each position of the word; 2) given the position, predicting a distribution over particular phone misperceptions; and 3) predicting the reported words and the number of times each has been perceived among a set of listeners. Predictions will be evaluated using the metrics defined in [3], with random and oracle predictions used as references. These baseline models will be trained using only in-domain data and optimised on word recognition tasks.

Profile

The candidate shall have the following profile:
- Master 2 level or equivalent in one of the following fields: machine learning, computer science, applied mathematics, statistics, signal processing
- Good English written and spoken language skills
- Programming skills, preferably in Python

Furthermore, the ideal candidate would have:
- Experience in one of the main DL frameworks (e.g. PyTorch, TensorFlow)
- Basic knowledge of speech or audio processing

Application procedure

If interested, please contact ricard.marxer@lis-lab.fr.

References

[1] Barker, J., Marxer, R., Vincent, E., & Watanabe, S. (2017). The third 'CHiME' speech separation and recognition challenge: Analysis and outcomes. Computer Speech & Language, 46, 605-626.
[2] Marxer, R., & Barker, J. (2017). Binary mask estimation strategies for constrained imputation-based speech enhancement. In Proc. Interspeech 2017 (pp. 1988-1992).
[3] Marxer, R., Cooke, M., & Barker, J. (2015). A framework for the evaluation of microscopic intelligibility models. In Proc. Interspeech 2015 (pp. 2558-2562).