Internship: joint speech segmentation and syntactic analysis

Advisors: Alexis Nasr & Benoit Favre
LIS/CNRS, Aix-Marseille University
Spring-Summer 2022

Description

Context

Segmenting texts into sentences is a standard task in natural language processing which does not pose great difficulty, in particular thanks to punctuation marks at the end of sentences (Read et al., 2012). The situation is more complex for speech transcriptions, where punctuation is generally absent, in particular when they are the output of automatic speech recognition. Segmentation can be performed on the basis of the word sequence alone, but the results are usually not very good¹ (Zelasko et al., 2018). To improve over lexical-only models, one can add prosody (in the form of F0, energy, pause duration...) and syntax (Favre et al., 2008). Adding syntactic features to segmentation raises a chicken-and-egg problem: syntactic parsers cannot handle unsegmented input, while segmenting speech requires the output of parsing.

Objectives

The goal of this internship is to develop a joint model of syntactic parsing and sentence segmentation for spoken recordings, based on lexical and prosodic features. The circular dependency between syntactic parsing and segmentation can be handled with online transition-based parsing that does not assume sentence boundaries are given, as proposed for example in (Nasr et al., 2020). Compared to traditional transition-based parsing, this kind of parser adds a special transition for predicting sentence boundaries, which flushes the current tree and starts a new one. In this context, speech-derived features, such as prosody, could be added to the classifier to inform its segmentation decisions. A potential benefit is that speech features might also help predict syntactic structures, in addition to yielding more accurate segmentation.

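To make the boundary transition concrete, here is a minimal Python sketch of an arc-standard-style transition system extended with an end-of-sentence transition. It is an illustration only, not the parser of (Nasr et al., 2020): the names State, apply, features and parse, the transition labels SHIFT/LEFT/RIGHT/EOS, and the pause values are all assumptions made for this example. The transition sequence is supplied by hand, whereas in the envisioned model a classifier would predict each transition from lexical and prosodic features.

```python
from dataclasses import dataclass, field


@dataclass
class State:
    words: list                                    # unsegmented word stream
    next_word: int = 0                             # index of the next word to read
    start: int = 0                                 # where the current sentence began
    stack: list = field(default_factory=list)      # word indices awaiting attachment
    arcs: list = field(default_factory=list)       # (head, dependent) pairs of the current tree
    sentences: list = field(default_factory=list)  # flushed (words, arcs) pairs


def apply(state, transition):
    """Apply one transition in place (hypothetical transition inventory)."""
    if transition == "SHIFT":
        # Read the next word of the stream onto the stack.
        state.stack.append(state.next_word)
        state.next_word += 1
    elif transition == "LEFT":
        # The stack top becomes the head of the word just below it.
        dependent = state.stack.pop(-2)
        state.arcs.append((state.stack[-1], dependent))
    elif transition == "RIGHT":
        # The word below the stack top becomes the head of the stack top.
        dependent = state.stack.pop()
        state.arcs.append((state.stack[-1], dependent))
    elif transition == "EOS":
        # Sentence boundary: flush the current tree and start a new one.
        segment = state.words[state.start:state.next_word]
        state.sentences.append((segment, state.arcs))
        state.stack, state.arcs = [], []
        state.start = state.next_word
    else:
        raise ValueError(f"unknown transition: {transition}")


def features(state, pauses):
    """Toy feature extraction: one lexical and one prosodic feature.

    `pauses[i]` is assumed to be the silence duration (in seconds)
    following word i; a real system would add F0, energy, etc.
    """
    last = state.next_word - 1
    return {
        "stack_top": state.words[state.stack[-1]] if state.stack else None,
        "pause_after_last_word": pauses[last] if last >= 0 else 0.0,
    }


def parse(words, transitions):
    """Run a given transition sequence over an unsegmented word stream."""
    state = State(words=list(words))
    for transition in transitions:
        apply(state, transition)
    return state.sentences
```

For instance, parse(["i", "agree", "we", "left"], ["SHIFT", "SHIFT", "LEFT", "EOS", "SHIFT", "SHIFT", "LEFT", "EOS"]) yields two segments, each with its own dependency arcs; arc indices refer to positions in the whole stream. The features function marks the place where prosodic cues such as pause duration would enter the classifier input.
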
Scientific program

The work will be carried out on a corpus of speech transcriptions annotated with syntactic trees, such as the data from the ORFEO project. First, speech-derived features, such as F0, energy and pause duration, will be extracted using a standard toolkit. Then, the parser model (Dary and Nasr, 2021) will be adapted to handle this new source of information. The resulting system will be trained jointly to perform both syntactic parsing and segmentation, and evaluated on both tasks. Different ways of extracting speech features will be evaluated, such as simple features from Kaldi², more advanced representations from openSMILE (Eyben et al., 2010), or unsupervised pre-trained representations such as HuBERT (Hsu et al., 2021). Different models for integrating speech features will also be compared.

Additional information

- Skills: Master's-level computer science, an interest in linguistics, Python programming, deep learning, PyTorch, rigor and tenacity.
- Location: the internship will take place at LIS/CNRS on the Luminy campus of Aix-Marseille University.
- Dates: Spring-summer 2022, duration 5-6 months.
- Wages: statutory internship stipend (about 500 euros/month).
- Computation: the intern will have access to the Jean-Zay GPU cluster for running experiments.

Send a CV and cover letter to benoit.favre@lis-lab.fr & alexis.nasr@lis-lab.fr before November 1st, 2021.

¹ See results reported at https://github.com/benob/recasepunc
² https://kaldi-asr.org/

References

Jonathon Read, Rebecca Dridan, Stephan Oepen, and Lars Jørgen Solberg. Sentence boundary detection: A long solved problem? In Proceedings of COLING 2012: Posters, pages 985-994, 2012.

Piotr Zelasko, Piotr Szymanski, Jan Mizgajski, Adrian Szymczak, Yishay Carmiel, and Najim Dehak. Punctuation prediction model for conversational speech. arXiv preprint arXiv:1807.00543, 2018.

Benoit Favre, Dilek Hakkani-Tur, Slav Petrov, and Dan Klein. Efficient sentence segmentation using syntactic features. In 2008 IEEE Spoken Language Technology Workshop, pages 77-80. IEEE, 2008.

Alexis Nasr, Franck Dary, Frédéric Bechet, and Benoît Favre. Annotation syntaxique automatique de la partie orale du orféo. Langages, (3):87-102, 2020.

Franck Dary and Alexis Nasr. The reading machine: a versatile framework for studying incremental parsing strategies. In The 17th International Conference on Parsing Technologies, 2021.

Florian Eyben, Martin Wöllmer, and Björn Schuller. openSMILE: the Munich versatile and fast open-source audio feature extractor. In Proceedings of the 18th ACM international conference on Multimedia, pages 1459-1462, 2010.

Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed. HuBERT: Self-supervised speech representation learning by masked prediction of hidden units. arXiv preprint arXiv:2106.07447, 2021.