Title: Recurrent neural network models of the textualization process

Location: LIPN (UMR CNRS 7030), Université Sorbonne Paris Nord (99 Avenue Jean Baptiste Clément, 93430 Villetaneuse)

Advisors: Nadi Tomeh (tomeh@lipn.fr), Joseph Le Roux (leroux@lipn.fr)

Context
=======

In the context of the ANR project Pro-TEXT (https://pro-text.huma-num.fr/), we aim to conduct a comprehensive linguistic analysis of the textualization process, i.e. the real-time, progressive construction of a text. We will study "bursts of writing", the textual segments produced between "pauses", in order to shed light on the relation between regularities of language performance and cognitive and contextual constraints. The aim is to understand some of the layout mechanisms that allow language to give rise to novelty out of known and prefabricated structures. The project builds on recent developments in linguistics, cognitive psychology and machine learning to analyze burst data.

Internship proposal
===================

Burst data are collected by recording the writing process in real time with the keystroke logging tools Inputlog (http://www.inputlog.net/) and Scriptlog, which produce temporal data (chronology and duration of text production, and pause length), language data (language sequences produced continuously between two pauses), and topological textual data (dynamics of text planning and revision of already written text).

In this internship, we propose to explore the use of recurrent neural networks, such as LSTMs, to model the three types of sequences mentioned above. We plan to study several modeling tasks. For instance, pause prediction can be cast as an LSTM-based sequence labeling task in which the input text is labeled with pause positions and durations. A separate network can model the sequence of text editing actions, including insertions and deletions of characters as well as the intervening pauses; this network is trained to predict the next action, as in language modeling. Furthermore, we plan to explore data structures inspired by the stack LSTM (Dyer et al., 2015) in order to use both action and word embeddings when predicting pauses. Such a structure should allow state representations to be updated in a non-linear and non-local way while maintaining low complexity. (Illustrative sketches of these models are given in the appendix below.)

Candidate profile
=================

- Master in Computer Science or a related field
- Strong programming skills
- Good reading and writing skills in English
- Experience training deep learning models (with PyTorch/JAX)

How to apply?
=============

Send your CV and a transcript of your Master's grades to tomeh@lipn.fr and leroux@lipn.fr.

References
==========

(Dyer et al., 2015) Transition-Based Dependency Parsing with Stack Long Short-Term Memory. Chris Dyer, Miguel Ballesteros, Wang Ling, Austin Matthews, Noah A. Smith. https://aclanthology.org/P15-1033/
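
Appendix: illustrative model sketches
=====================================

The sketches below are illustrations only, not project code: all class names, label sets and hyperparameters are hypothetical assumptions. First, a minimal PyTorch sketch of pause prediction cast as sequence labeling, assuming each token of the burst text is tagged with a pause class (e.g. no pause / short / long) for the position following it:

    # Hypothetical sketch: pause prediction as LSTM sequence labeling.
    import torch
    import torch.nn as nn

    class PauseTagger(nn.Module):
        def __init__(self, vocab_size, num_pause_labels, emb_dim=128, hidden_dim=256):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, emb_dim)
            # A bidirectional LSTM reads the burst text in both directions.
            self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True,
                                bidirectional=True)
            # One label per token: the pause class after that token.
            self.classifier = nn.Linear(2 * hidden_dim, num_pause_labels)

        def forward(self, token_ids):        # token_ids: (batch, seq_len)
            states, _ = self.lstm(self.embed(token_ids))
            return self.classifier(states)   # (batch, seq_len, num_pause_labels)

    model = PauseTagger(vocab_size=10_000, num_pause_labels=3)
    tokens = torch.randint(0, 10_000, (2, 12))            # dummy batch
    logits = model(tokens)
    loss = nn.CrossEntropyLoss()(logits.reshape(-1, 3),
                                 torch.randint(0, 3, (2 * 12,)))

Pause duration, mentioned in the proposal, could equally be handled with a regression head in place of the classifier.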
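Second, a sketch of the editing-action model, trained as in language modeling to predict the next action from the actions so far; the action inventory (INSERT, DELETE, PAUSE, ...) is a hypothetical simplification:

    # Hypothetical sketch: autoregressive LSTM over editing actions.
    import torch
    import torch.nn as nn

    class ActionLM(nn.Module):
        def __init__(self, num_actions, emb_dim=64, hidden_dim=128):
            super().__init__()
            self.embed = nn.Embedding(num_actions, emb_dim)
            self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
            self.out = nn.Linear(hidden_dim, num_actions)

        def forward(self, actions):          # actions: (batch, seq_len)
            states, _ = self.lstm(self.embed(actions))
            return self.out(states)          # next-action logits at each step

    # Training pairs each prefix with the following action, as in language models.
    model = ActionLM(num_actions=5)          # e.g. INSERT, DELETE, PAUSE, BOS, EOS
    seq = torch.randint(0, 5, (4, 20))
    logits = model(seq[:, :-1])
    loss = nn.CrossEntropyLoss()(logits.reshape(-1, 5), seq[:, 1:].reshape(-1))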
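Finally, a sketch in the spirit of the stack LSTM of Dyer et al. (2015): the LSTM's state history is kept on a stack, so popping reverts the sequence summary to an earlier state in constant time, which is what makes non-linear, non-local updates cheap. The method names and shapes here are assumptions:

    # Hypothetical sketch of a stack LSTM (after Dyer et al., 2015).
    import torch
    import torch.nn as nn

    class StackLSTM(nn.Module):
        def __init__(self, input_dim, hidden_dim):
            super().__init__()
            self.cell = nn.LSTMCell(input_dim, hidden_dim)
            empty = (torch.zeros(1, hidden_dim), torch.zeros(1, hidden_dim))
            self.states = [empty]            # bottom-of-stack state

        def push(self, x):                   # x: (1, input_dim)
            # New state is computed from the current top of the stack.
            self.states.append(self.cell(x, self.states[-1]))

        def pop(self):
            # Constant-time reversal to the previous summary.
            if len(self.states) > 1:
                self.states.pop()

        def summary(self):                   # hidden state of the top element
            return self.states[-1][0]

    stack = StackLSTM(input_dim=32, hidden_dim=64)
    stack.push(torch.randn(1, 32))
    stack.push(torch.randn(1, 32))
    stack.pop()                              # summary reverts in O(1)
    h = stack.summary()                      # (1, 64)

For the internship, separate stacks of this kind could hold action and word embeddings, with their summaries combined when predicting pauses; that combination is left open here.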