Title: Identification of discontinuous variants of compound terms

Duration: 5-6 months

Proposers: Salah Aït-Mokhtar, Vassilina Nikoulina

Start date: January-February 2015

Description

The main theme of the internship is the identification of terms and concepts in domain-specific texts, with a focus on medical texts in the context of the EURECA project (http://eurecaproject.eu/). We have a dictionary-based term identification system that identifies occurrences of terms in free text, including term variants not listed in the dictionary (e.g. inflected or misspelled terms).

The internship will contribute to extending the types of variation that the term identifier can handle. In particular, the intern will work on the identification and normalization of discontinuous compound terms that occur in specific syntactic structures (e.g. coordination), using distant supervision with existing domain terminologies. An example of a discontinuous compound term is "abdominal distention" in the expression "abdominal bloating or distention".

Requirements

The ideal candidate is a student (MSc or PhD) in computational linguistics, or in computer science with a good background in NLP. S/he has a good knowledge of syntactic structures and parsing. Good programming skills, preferably in Java, are also required. Prior experience in NLP for the healthcare domain or in terminologies/ontologies is a plus.

During the internship the candidate will acquire significant knowledge of, and practice in, hybrid methods for term identification, including distant supervision based on rich terminologies and ontologies. S/he will also work closely with researchers and engineers in an international research environment.

You can find more details about this offer at http://www.xrce.xerox.com/About-XRCE/Internships/Identification-of-discontinuous-variants-of-compound-terms