Internship : syntactic analysis of speech without transcription Advisers: Alexis Nasr, Ricard Marxer & Benoit Favre LIS/CNRS, Aix-Marseille University Spring-Summer 2022 Description Context Syntactic analysis, or syntactic parsing, consists in predicting a tree representation of the syntactic relationship between words of a sentence. A range of paradigms and methods have been proposed over the years for solving that task (see among others the methods presented by Zhang (2020)). Beyond text, parsing speech recordings is an important task for developing pervasive applications with spoken interactions (Tur and De Mori, 2011; Damonte et al., 2019; Tran and Ostendorf, 2021). It is also very difficult because of two main reasons: idealized models of language that were developed for text do not apply completely to speech, and automatically generated transcriptions are not devoid of errors. The later is problematic because syntax is deeply linked to the representation of linguistic content as a sequence of words. Objectives The goal of this internship is to reconsider this axiom: what if we could perform syntactic parsing of speech recordings without relying on an explicit transcription. This study will explore an alternate representation of the speech signal as a sequence of automatically extracted symbols representing sub-lexical units. The extraction of these units will be performed using discrete representations learned from audio signal, such as the VQ-WAV2VEC model (Baevski et al., 2019). The quantized speech segments will be fed to a transition-based parser, that typically considers attaching the current word to a partial syntax tree, with additional transitions that accumulate sub-lexical units to form tokens. Such parser can be trained with regular procedures for transition-based parsing (Dary and Nasr, 2021; Nivre, 2013). Scientific program In order to learn a syntactic analyzer able to input this type of representations, we will use the ORFEO corpus 1 (Benzitoun et al., 2016). That corpus is composed of transcriptions of speech recordings annotated with syntactic analyses. The speech signal and alignments to the word transcripts are also available (when a word begins and ends in the speech signal). The idea is to map the speech signal to discrete units using a model such as VQ-WAV2VEC mentioned above and to project the syntactic annotations on sequences of such symbols. At the end of this step, it is possible to train a syntactic analyzer which inputs sequences of symbols originating from VQ-WAV2VEC and outputs a dependency tree derived from a sequence of transitions. This process can be divided into the following steps: 1. Learning discrete representations from a large set of speech recordings such as Mozilla common voice (Ardila et al., 2019), the EPAC corpus (Esteve et al., 2010) or the non-annotated part of the Orfeo corpus 2. Extract these representations on the part of Orfeo with syntax annotations 3. Transfer syntax annotations from words to the discrete representations 4. Create a new transition system to allow for sub-lexical units 5. Train and evaluate a dependency parser in those conditions 1 https://repository.ortolang.fr/api/content/cefc-orfeo/11/documentation/site-orfeo/index.html In addition, it will be interesting to explore the possibility to self-train a parser on large quantities of unannotated speech transcripts, and to explore variations of the discretization strategies and transition systems. Additional information - Expected skills: Master-level computer science, interest for linguistics, python programming, deep learning, Pytorch, rigorous mind, tenacity. - Location: the internship will take place at LIS/CNRS on the Luminy campus of Aix-Marseille University. - Dates: Spring-summer 2022, duration 5-6 months. - Wages: regulatory internship salary (about 500 euros/month). - Computation: the intern will have access to the Jean-Zay GPU cluster for running experiments. Send a CV and cover letter to benoit.favre@lis-lab.fr, alexis.nasr@lis-lab.fr and ricard.marxer@lis-lab.fr before November 1st, 2022 References Meishan Zhang. A survey of syntactic-semantic parsing based on constituent and dependency structures. Science China Technological Sciences, pages 1-23, 2020. Gokhan Tur and Renato De Mori. Spoken language understanding: Systems for extracting semantic information from speech. John Wiley & Sons, 2011. Marco Damonte, Rahul Goel, and Tagyoung Chung. Practical semantic parsing for spoken language understanding. arXiv preprint arXiv:1903.04521, 2019. Trang Tran and Mari Ostendorf. Assessing the use of prosody in constituency parsing of imperfect transcripts. arXiv preprint arXiv:2106.07794, 2021. Alexei Baevski, Steffen Schneider, and Michael Auli. vq-wav2vec: Self-supervised learning of discrete speech representations. arXiv preprint arXiv:1910.05453, 2019. Franck Dary and Alexis Nasr. The reading machine: a versatile framework for studying incremental parsing strategies. In The 17th International Conference on Parsing Technologies, 2021. Joakim Nivre. Transition-based parsing. Uppsala universitet, 2013. Christophe Benzitoun, Jeanne-Marie Debaisieux, and Henri-José Deulofeu. Le projet orféo: un corpus d'étude pour le français contemporain. Corpus, (15), 2016. Rosana Ardila, Megan Branson, Kelly Davis, Michael Henretty, Michael Kohler, Josh Meyer, Reuben Morais, Lindsay Saunders, Francis M Tyers, and Gregor Weber. Common voice: A massively-multilingual speech corpus. arXiv preprint arXiv:1912.06670, 2019. Yannick Esteve, Thierry Bazillon, Jean-Yves Antoine, Frédéric Béchet, and Jérôme Farinas. The epac corpus: Manual and automatic annotations of conversational speech in french broadcast news. In LREC. Citeseer, 2010.