The University of Bordeaux invites applications for a 2-year, full-time postdoctoral researcher position in Automatic Speech Recognition. The position is part of the FVLLMONTI project on efficient speech-to-speech translation for embedded autonomous devices, funded by the European Commission.

To apply, please send by email a single PDF file containing:
- a full CV (including publication list)
- a cover letter (describing your qualifications, research interests and motivation for applying)
- evidence of software development experience (an active GitHub/GitLab profile or similar)
- two of your key publications
- contact information for two referees
- academic certificates (PhD, Diploma/Master, Bachelor)

Details on the position are given below.

Job description: Post-doctoral position in Automatic Speech Recognition
Duration: 24 months
Starting date: as early as possible (from March 1st, 2021)
Project: European FETPROACT project FVLLMONTI (starts January 2021)
Location: Bordeaux Computer Science Lab (LaBRI, CNRS UMR 5800), Bordeaux, France (Image and Sound team)
Salary: from 2,086.45 EUR to 2,304.88 EUR/month (estimated net salary after taxes, depending on experience)
Contact: jean-luc.rouas@labri.fr

Short description:
The applicant will be in charge of developing state-of-the-art Automatic Speech Recognition systems for English and French, as well as the related Machine Translation systems, using deep neural networks. The objective is to provide exact specifications of the designed systems to the other project partners, who specialize in hardware. Adjustments will have to be made to take the hardware constraints into account (memory and energy budgets that limit the number of parameters, computation time, etc.) while keeping an eye on performance metrics (WER and BLEU scores). Once a satisfactory trade-off is reached, more exploratory work will be carried out on using emotion/attitude/affect recognition on the speech samples to supply additional information to the translation system.
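To make this trade-off concrete, the sketch below shows how the parameter count and weight memory of a candidate model can be estimated with PyTorch. It is purely illustrative: the layer count and dimensions are assumptions made for the example, not the project's actual specification.

    # Illustrative only: sizing a Transformer encoder to check it against a
    # hardware memory budget. All hyperparameters here are made up.
    import torch.nn as nn

    layer = nn.TransformerEncoderLayer(d_model=256, nhead=4, dim_feedforward=2048)
    encoder = nn.TransformerEncoder(layer, num_layers=12)

    n_params = sum(p.numel() for p in encoder.parameters())
    print(f"parameters: {n_params:,}")
    # Each float32 parameter occupies 4 bytes; float16 halves the footprint.
    print(f"weight memory: {n_params * 4 / 2**20:.1f} MiB (float32), "
          f"{n_params * 2 / 2**20:.1f} MiB (float16)")

Shrinking d_model, the number of layers, or the weight precision trades memory and energy against recognition accuracy; this is the loop the post-doc will iterate with the hardware partners.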
Context of the project:
The aim of the FVLLMONTI project is to build a lightweight autonomous in-ear device for speech-to-speech translation. Today's pocket translation devices are IoT products that require internet connectivity, which is generally energy-inefficient. While machine translation (MT) and Natural Language Processing (NLP) performance has greatly improved, embedded, lightweight, energy-efficient hardware remains elusive. Existing solutions based on artificial neural networks (NNs) are computation-intensive and energy-hungry, requiring server-based implementations, which also raises data protection and privacy concerns. Today's 2D electronic architectures suffer from "unscalable" interconnect and are thus still far from competing with biological neural systems in real-time information-processing capability at comparable energy consumption. Recent advances in materials science, device technology and synaptic architectures have the potential to fill this gap with novel disruptive technologies that go beyond conventional CMOS. A promising solution comes from vertical nanowire field-effect transistors (VNWFETs), which can unlock the full potential of truly unconventional 3D circuit density and performance.

Role:
The tasks assigned to the Computer Science lab are the design of the Automatic Speech Recognition (for French and English) and Machine Translation (English to French and French to English) systems. Speech synthesis will not be explored in the project, but an open-source implementation will be used for demonstration purposes. Both the ASR and MT tasks benefit from Transformer architectures over convolutional (CNN) or recurrent (RNN) neural network architectures (see Karita et al. below). The role of the applicant will therefore be to design and implement state-of-the-art ASR systems using Transformer networks (e.g. with the ESPnet toolkit) and to assist another post-doctoral researcher with the MT systems. Once the performance of these baseline systems is satisfactory, details of the networks (e.g. number of layers, parameter values) will be passed on to our hardware design partners. Based on their feedback, the networks will be adjusted to fit the hardware constraints while degrading performance as little as possible.

The second part of the project will focus on keeping up with the latest innovations and translating them into hardware specifications. For example, recent research suggests that adding convolutional layers to the Transformer architecture (the "Conformer" network, Gulati et al. below) can reduce the number of parameters of the model, which is critical for the memory usage of the hardware system. A minimal sketch of the Conformer's convolution module is given after the references.

Finally, more exploratory work will be carried out on the detection of social affects (i.e. the vocal expression of the intent of the speaker: 'politeness', 'irony', etc.). The additional information gathered by this detection will be supplied to the translation system for potential use in a future speech synthesis system.

Required skills:
- PhD in Automatic Speech Recognition (preferred) or Machine Translation using deep neural networks
- Knowledge of the most widely used toolboxes/frameworks (e.g. TensorFlow, PyTorch, ESPnet)
- Good programming skills (Python)
- Good communication skills (frequent interactions with hardware specialists)
- Interest in hardware design is a plus

Selected references:
S. Karita et al., "A Comparative Study on Transformer vs RNN in Speech Applications," Proc. IEEE ASRU, Singapore, 2019, pp. 449-456, doi: 10.1109/ASRU46091.2019.9003750.
A. Gulati et al., "Conformer: Convolution-augmented Transformer for Speech Recognition," arXiv preprint arXiv:2005.08100, 2020.
J.-L. Rouas et al., "Categorisation of spoken social affects in Japanese: human vs. machine," Proc. ICPhS, 2019.
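For orientation, here is a minimal sketch (assuming PyTorch) of the convolution module that distinguishes a Conformer block from a plain Transformer block, following the description in Gulati et al. above; the hyperparameters are illustrative only.

    # Conformer convolution module: pointwise conv + GLU, depthwise conv,
    # BatchNorm, Swish, pointwise conv, wrapped in a residual connection.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ConformerConvModule(nn.Module):
        def __init__(self, d_model=256, kernel_size=31, dropout=0.1):
            super().__init__()
            self.norm = nn.LayerNorm(d_model)
            self.pointwise1 = nn.Conv1d(d_model, 2 * d_model, 1)  # doubled; GLU halves it back
            self.depthwise = nn.Conv1d(d_model, d_model, kernel_size,
                                       padding=(kernel_size - 1) // 2,
                                       groups=d_model)  # one filter per channel
            self.batch_norm = nn.BatchNorm1d(d_model)
            self.pointwise2 = nn.Conv1d(d_model, d_model, 1)
            self.dropout = nn.Dropout(dropout)

        def forward(self, x):
            # x: (batch, time, d_model); Conv1d expects (batch, channels, time)
            residual = x
            x = self.norm(x).transpose(1, 2)
            x = F.glu(self.pointwise1(x), dim=1)
            x = F.silu(self.batch_norm(self.depthwise(x)))  # Swish == SiLU
            x = self.dropout(self.pointwise2(x)).transpose(1, 2)
            return x + residual

    x = torch.randn(8, 100, 256)            # a batch of 8 utterances, 100 frames each
    print(ConformerConvModule()(x).shape)   # torch.Size([8, 100, 256])

The depthwise/pointwise split is what keeps the parameter count low: the depthwise kernel adds d_model * kernel_size weights instead of the d_model^2 * kernel_size of a full convolution, which is the property that makes the Conformer attractive under the project's memory constraints.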