Title: Web structure extraction

Description:

a. ideactiv is a young start-up that relies heavily on NLP to read cultural event data from the websites of cultural organisations (theatres, show venues, etc.). This allows these organisations to share their events with cultural media and promote them without any manual work. ideactiv reads ~15,000 events every 24 hours from the websites of ~10,000 organisations. The data can be seen at www.ideactiv.com. ideactiv was created by Thomas Chenevier, an alumnus of Ecole Polytechnique. It is used daily by hundreds of cultural organisations and by national cultural media. Working for ideactiv is a great opportunity to gain experience with "deep search engines", i.e. search engines that index objects according to their meaning, and not just web pages according to their content.

b. Topic. For the reading process to be efficient, ideactiv first generates, for each website, a series of rules that describe where each field of the event (title, date, address, description, price...) should be found in the DOM of the cultural organisation's website. Here are very simple examples:
- "the title is the content of the HTML tag