Course code TPT33
Course title Information Extraction
Institution TELECOM ParisTech
Course address TELECOM ParisTech, 46 rue Barrault, 75013
City Paris
Minimum year of study 3rd year
Minimum level of English Good
Minimum level of French None
Key words Information Extraction, Semantic Web, Knowledge Bases, Web

Language English
Professor responsible Fabian M. Suchanek
Telephone +33 14581 8062
Participating professors
Number of places Minimum: 5, Maximum: 30, Reserved for local students: 6

In this course, students will learn the basics of semantic information extraction, i.e. the art and science of extracting facts from natural language documents. This includes algorithms for extraction from the Web, as well as the essentials of natural language processing and knowledge representation. We will also touch upon the Semantic Web. The goal is to understand the technology behind today's large knowledge bases such as Google's Knowledge Graph, NELL, DBpedia, and YAGO.

Programme to be followed

The course will consist of lectures and practical exercises (labs). The lectures will be interactive, with short quizzes to check understanding of each topic. The course will cover:

  • Knowledge representation (RDF, RDFS, OWL)
  • Named Entity Recognition (Regular Expressions, Tries)
  • Named Entity Annotation (Rule-based and statistical)
  • Design of extraction algorithms and evaluation
  • Disambiguation (context-based, coherence-based)
  • Instance Extraction (Hearst extraction, set expansion, iteration)
  • Fact extraction from structured sources (Wrapper induction, extraction from Wikipedia)
  • Fact extraction from text (DIPRE algorithm, POS annotation)
  • Dependency Grammars
  • Extraction by reasoning
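
To give a flavour of one topic from the list above, here is a minimal Java sketch of Hearst extraction, which harvests instances from phrases of the form "X such as A, B, and C". It is an illustration only, not course material: the class name `HearstSketch`, the `isA` output format, and the single regular expression are assumptions made for this example, and the sketch keeps only the last word before "such as" as the class name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of a single Hearst pattern ("X such as A, B, and C").
class HearstSketch {

    // Group 1: the word right before "such as" (a crude stand-in for the class).
    // Group 2: the enumeration that follows, up to the next '.' or ';'.
    private static final Pattern SUCH_AS =
            Pattern.compile("(\\w+) such as ([^.;]+)");

    // Separators inside the enumeration: a comma (optionally followed by
    // "and"/"or"), or a bare "and"/"or" between two items.
    private static final String SEPARATOR =
            "\\s*,\\s*(?:and\\s+|or\\s+)?|\\s+(?:and|or)\\s+";

    /** Returns one string "instance isA class" per extracted instance. */
    static List<String> extract(String text) {
        List<String> facts = new ArrayList<>();
        Matcher m = SUCH_AS.matcher(text);
        while (m.find()) {
            String concept = m.group(1);
            for (String item : m.group(2).split(SEPARATOR)) {
                String instance = item.trim();
                if (!instance.isEmpty()) {
                    facts.add(instance + " isA " + concept);
                }
            }
        }
        return facts;
    }

    public static void main(String[] args) {
        String sentence = "We study knowledge bases such as NELL, DBpedia, and YAGO.";
        extract(sentence).forEach(System.out::println);
    }
}
```

A pattern this simple produces noisy output on real text (for example, it cannot recognise multi-word class names), which is one reason the course pairs extraction algorithms with evaluation and disambiguation.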

Prerequisites

  • Basics of Predicate Logic
  • Basics of Probability Theory
  • Very good programming skills in Java, in particular data structures and Input/Output


Course exam Evaluation by labs