Titre : Web structure extraction Description : a. ideactiv is a young start-up heavily relying on NLP in order to read that data of cultural events on the website of cultural organisations (theatres, show venues, etc.). This allows these organisations to share their events with cultural media and promote them without any manual operation. ideactiv reads ~15 000 events per 24 hours, on the website of ~10 000 organisations. Data can be seen on www.ideactiv.com ideactiv has beencreated by Thomas Chenevier, an alumnus of Ecole Polytechnique. It is used daily by hundreds of cultural organisations, and by national cultural media. Working for ideactiv is a great opportunity to get experience on "deep search engines", ie search engines that index objects according to their meaning and not just web pages according to their content. b. Topic. For the reading process to be efficient, ideactiv first generates, for each website, a series of rules that describe where each field (title, date, address, description, price...) of the event should be found in the DOM of the website of the cultural organization. Here are very simple examples: - "the title is the content of the HTML tag

" - "the date begins after the tag beginning with
" This requires to beable to detect the presence of cultural events in the organization's website, then to detect its structure, then to detect the nature of the nodes of this structure, and finally to generate these rules. The internship will focus on this last step (rules generation). It is a process that takes 5 to 20 minutes for a trained adult, depending of the difficulty, and that is not trivial. For example: i. Many different rules can be chosen. The solution must favour rules whichmake sense from a "human" perspective: - rules which are "semantically rich". For example; it is better to chose a tag like than a purely style tag like . - rules which are simple and which make sense from a "html" perspective. For example, it is better to choose a rule like "the whole content of the html component beginning with
" than "the excerpt than begins with