Tabular Information Extraction

Master project in WS17/18 by Tobias Matysiak

Supervised by Niklas Schnelle

Overview

This project aims at simplifying the creation of SPARQL-queries for the knowledge base Freebase. Instead of finding out the relevant Freebase types and relations by hand, the user specifies table columns in a simple table description format.

Introduction

Writing a SPARQL query for Freebase to extract some information can be quite challenging, especially when one does not know the relevant Freebase types and relations. To obtain a list of all cities, their corresponding countries, and their populations, the relevant Freebase types and relations are location.citytown, location.country and location.statistical_region.population. Furthermore, one has to know that city and country have to be linked by the relation location.location.containedby and that the population is obtained via the mediator measurement_unit.dated_integer. The resulting query looks as follows:

PREFIX fb: <http://rdf.freebase.com/ns/>
SELECT DISTINCT ?city_name ?country_name ?population WHERE {
  ?city fb:type.object.type fb:location.citytown .
  ?city fb:type.object.name ?city_name .
  ?country fb:type.object.type fb:location.country .
  ?country fb:type.object.name ?country_name .
  ?city fb:location.location.containedby ?country .
  ?city fb:location.statistical_region.population ?dated_int .
  ?dated_int fb:measurement_unit.dated_integer.number ?population .
}

Wouldn’t it be much easier for the user to just specify the columns of the desired result? The following table specification should lead to the same SPARQL query as above:

<location.citytown> | <location.country> | <location.statistical_region.population>

This project goes a step further and also handles the case where the user does not know the exact Freebase types and relations, for example when the user only knows fuzzy column names:

City | Country | Population

While the user types a fuzzy definition, the program suggests relevant types, taking the number of occurrences in the database into account.

Table description format

In the table description format, columns are separated by "|". A column can be specified either exactly or fuzzily; fuzzy definitions are not supported for translation into queries. Additionally, filters and an order can be set for any column. A column can also be explicitly linked to another column when the query translator's automatic column linking does not produce the intended result.

Fuzzy column definitions

City | Country | Population

Exact column definitions

<location.citytown> | Country | <location.statistical_region.population>

Explicitly linking a column to another

Linking to another column defined by index (starting with 0)

City | Country | Population -> 0

Setting filters

Supported relational operators: ==, !=, <, <=, >, >=

City(== "Berlin"@en) | Country | Population(<= 50000)

Define an order

Single order:

City | Country | Population [DESC]

Multi-order (a rank in applying the orders has to be specified):

City [ASC, 1] | Country | Population [DESC, 0]
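
The format described above can be parsed with a small regular expression. The following is only a sketch: the grammar (name, then optional filter, then optional order, then optional link) is inferred from the examples in this section, and the names COLUMN_RE, Column and parse_table_description are hypothetical, not taken from the project's code.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical grammar, inferred only from the examples in this section.
COLUMN_RE = re.compile(
    r"""^\s*
    (?P<name><[^<>]+>|[^<>()\[\]|]+?)                              # exact <...> or fuzzy name
    \s*(?:\(\s*(?P<op>==|!=|<=|>=|<|>)\s*(?P<value>[^)]*?)\s*\))?  # optional filter
    \s*(?:\[\s*(?P<order>ASC|DESC)\s*(?:,\s*(?P<rank>\d+)\s*)?\])? # optional order
    \s*(?:->\s*(?P<link>\d+))?                                     # optional explicit link
    \s*$""",
    re.VERBOSE,
)


@dataclass
class Column:
    name: str                           # type/relation name or fuzzy text
    exact: bool                         # True if written as <...>
    filter_op: Optional[str] = None     # one of ==, !=, <, <=, >, >=
    filter_value: Optional[str] = None  # raw filter operand, e.g. "Berlin"@en
    order: Optional[str] = None         # "ASC" or "DESC"
    rank: Optional[int] = None          # position in a multi-order
    link: Optional[int] = None          # index of the explicitly linked column


def parse_table_description(spec: str) -> List[Column]:
    columns = []
    # Assumption: "|" never occurs inside a filter value.
    for part in spec.split("|"):
        m = COLUMN_RE.match(part)
        if m is None:
            raise ValueError(f"cannot parse column definition: {part!r}")
        name = m.group("name")
        exact = name.startswith("<")
        columns.append(Column(
            name=name[1:-1] if exact else name,
            exact=exact,
            filter_op=m.group("op"),
            filter_value=m.group("value"),
            order=m.group("order"),
            rank=int(m.group("rank")) if m.group("rank") else None,
            link=int(m.group("link")) if m.group("link") else None,
        ))
    return columns
```

For example, parse_table_description('City | Country | Population -> 0') yields three fuzzy columns, the last of which carries link=0.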

Translation into SPARQL queries

In order to translate a given table description, the program first generates pairs of the parsed columns such that all columns are paired with each other. For any pair the program checks how the columns could match by using predefined rules and templates. To speed up the checks, the following data from the project Aqqu is used:

List of mediator types, list of mediator relations (relations leading to a mediator), list of relations' expected types, list of relations' target types distribution, list of relations with a reverse relation

This data was extracted from a Freebase dump.
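
The pairing step can be sketched as follows. It assumes that pairs are unordered and that a column is never paired with itself; the text does not state this explicitly, and the function name column_pairs is hypothetical.

```python
from itertools import combinations


def column_pairs(columns):
    """Return every unordered pair of column indices (i, j) with i < j.

    Assumption: pairs are unordered and a column is not paired with itself.
    """
    return list(combinations(range(len(columns)), 2))
```

Under this assumption, n columns produce n(n-1)/2 pairs, so the three columns City, Country and Population give the pairs (0, 1), (0, 2) and (1, 2), each of which is then checked against the templates.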

Autocomplete fuzzily defined columns

While typing column definitions, the user gets type and relation suggestions. For this, the autocomplete function performs a fuzzy prefix search, allowing a maximum prefix edit distance of 1. The relevant results are ranked according to the following criteria (in descending priority):

  1. Types and relations that do not start with "base" or "user" are preferred
  2. BM25-score (k=1.75, b=0.75)
  3. Types are preferred over relations
  4. Count of the type/relation

Figure: Screenshot of the autocomplete function
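
The search and ranking described above can be sketched as follows. This is an illustration under several assumptions: the prefix edit distance is computed per name token (split at "." and "_"), the BM25 score is treated as a precomputed input, and all names, scores and counts in the usage example are made up. The names Candidate, rank_key and suggest are hypothetical.

```python
import re
from dataclasses import dataclass


def prefix_edit_distance(query: str, word: str) -> int:
    """Smallest edit distance between `query` and any prefix of `word`."""
    prev = list(range(len(word) + 1))  # empty query vs. word[:j]
    for i, qc in enumerate(query, start=1):
        cur = [i]                      # query[:i] vs. empty prefix
        for j, wc in enumerate(word, start=1):
            cur.append(min(prev[j] + 1,                  # drop qc
                           cur[j - 1] + 1,               # insert wc
                           prev[j - 1] + (qc != wc)))    # match or replace
        prev = cur
    return min(prev)                   # best over all prefixes of word


@dataclass
class Candidate:
    name: str      # Freebase type or relation name
    is_type: bool  # True for types, False for relations
    bm25: float    # assumed to be precomputed (k=1.75, b=0.75 in the text)
    count: int     # number of occurrences in the database


def rank_key(c: Candidate):
    # Tuples sort element by element, mirroring the four criteria in order.
    return (c.name.startswith(("base", "user")),  # 1. non-base/user first
            -c.bm25,                              # 2. higher BM25 first
            not c.is_type,                        # 3. types before relations
            -c.count)                             # 4. higher count first


def suggest(query: str, candidates, max_ped: int = 1):
    # Assumption: a candidate matches if any of its tokens is within max_ped.
    hits = [c for c in candidates
            if any(prefix_edit_distance(query, tok) <= max_ped
                   for tok in re.split(r"[._]", c.name))]
    return sorted(hits, key=rank_key)
```

With made-up candidates, the query "citi" (one edit away from the prefix "city") would match location.citytown and rank it ahead of a hypothetical base.example.city, even if the latter had a higher BM25 score, because the first criterion takes priority.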

Templates

Type-Type-Template (TT)

The pair (<location.citytown>, <location.country>) can be matched with this template. Dropping DISTINCT from the query returns one row per matching pair of entities, so the rows can be counted per candidate relation. The relation with the highest count in this case is <location.location.containedby>.

PREFIX fb: <http://rdf.freebase.com/ns/>
SELECT DISTINCT ?candidate_relation WHERE {
  ?col1 fb:type.object.type [TYPE_COL1] .
  ?col2 fb:type.object.type [TYPE_COL2] .
  ?col1 ?candidate_relation ?col2 .
}
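
As an illustration of how the placeholders might be filled and the counts obtained, the sketch below builds a variant of the template that aggregates with COUNT and GROUP BY (SPARQL 1.1). Whether the project counts rows this way is an assumption; the names TT_COUNT_TEMPLATE and tt_count_query are hypothetical.

```python
# Variant of the TT template that counts rows per candidate relation.
# Assumption: the endpoint supports SPARQL 1.1 aggregates.
TT_COUNT_TEMPLATE = """PREFIX fb: <http://rdf.freebase.com/ns/>
SELECT ?candidate_relation (COUNT(*) AS ?count) WHERE {{
  ?col1 fb:type.object.type fb:{type_col1} .
  ?col2 fb:type.object.type fb:{type_col2} .
  ?col1 ?candidate_relation ?col2 .
}}
GROUP BY ?candidate_relation
ORDER BY DESC(?count)"""


def tt_count_query(type_col1: str, type_col2: str) -> str:
    """Fill [TYPE_COL1] and [TYPE_COL2] with concrete Freebase types."""
    return TT_COUNT_TEMPLATE.format(type_col1=type_col1, type_col2=type_col2)
```

Calling tt_count_query("location.citytown", "location.country") returns a query whose first result row would be the most frequent linking relation.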

Type-Mediator-Type-Template (TMT)

The pair (<film.film>, <film.actor>) can be matched with this template. The types are connected via the mediator <film.performance>. The best matching candidate relations are (<film.film.starring>, <film.performance.actor>).

PREFIX fb: <http://rdf.freebase.com/ns/>
SELECT DISTINCT ?candidate_relation1 ?candidate_relation2 WHERE {
  ?col1 fb:type.object.type [TYPE_COL1] .
  ?col2 fb:type.object.type [TYPE_COL2] .
  ?col1 ?candidate_relation1 ?mediator .
  ?mediator ?candidate_relation2 ?col2 .
}

Type-Relation-Template (TR)

This template matches a type with a direct relation like (<olympics.olympic_games>, <time.event.start_date>). Only the total result size can be considered here, because no candidate relation has to be found.

PREFIX fb: <http://rdf.freebase.com/ns/>
SELECT ?col2 WHERE {
  ?col1 fb:type.object.type [TYPE_COL1] .
  ?col1 [REL_COL2] ?col2 .
}

Type-Mediator-Relation-Template (TMR)

This template matches a type with an indirect relation like (<location.citytown>, <location.geocode.latitude>) via the mediator <location.geocode>. The best-matching candidate relation is <location.location.geolocation>.

PREFIX fb: <http://rdf.freebase.com/ns/>
SELECT DISTINCT ?candidate_relation WHERE {
  ?col1 fb:type.object.type [TYPE_COL1] .
  ?col1 ?candidate_relation ?mediator .
  ?mediator [REL_COL2] ?col2 .
}

Type-Relation-Mediator-Template (TRM)

This template matches a type with a direct mediator relation like (<location.citytown>, <location.statistical_region.population>). The mediator is <measurement_unit.dated_integer>.

PREFIX fb: <http://rdf.freebase.com/ns/>
SELECT DISTINCT ?candidate_relation WHERE {
  ?col1 fb:type.object.type [TYPE_COL1] .
  ?col1 [REL_COL2] ?mediator .
  ?mediator ?candidate_relation ?col2 .
}

Results

The program simplifies the creation of desired queries for Freebase by helping to explore types and relations and automatically finding linking relations between columns. Furthermore, the user does not need to know if a type has a name or not (each named type requires an additional triple in the query like ?citytown fb:type.object.name ?citytown_name) and whether a mediator has to be used for linking or not.
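
The handling of named types can be illustrated with a short sketch. The function type_triples and its has_name flag are hypothetical; how the program actually determines whether a type has a name is not described here.

```python
def type_triples(var: str, fb_type: str, has_name: bool):
    """Return the triple patterns for one typed column.

    Named types get an additional type.object.name triple; unnamed ones
    (for example mediators) do not.
    """
    triples = [f"?{var} fb:type.object.type fb:{fb_type} ."]
    if has_name:
        triples.append(f"?{var} fb:type.object.name ?{var}_name .")
    return triples
```

For the column <location.citytown>, type_triples("citytown", "location.citytown", True) produces the same two triple patterns that appear in the generated queries in this section.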

Largest cities

Table description:

<location.citytown> | <location.country> | <location.statistical_region.population> [DESC] | <location.geocode.latitude> | <location.geocode.longitude>

Generated query:

PREFIX fb: <http://rdf.freebase.com/ns/>
SELECT DISTINCT ?citytown_name ?country_name ?population_name ?latitude_name ?longitude_name WHERE {
  ?citytown fb:type.object.type fb:location.citytown .
  ?citytown fb:type.object.name ?citytown_name .
  ?country fb:type.object.type fb:location.country .
  ?country fb:type.object.name ?country_name .
  ?citytown fb:location.location.containedby ?country .
  ?citytown fb:location.statistical_region.population ?dated_integer .
  ?dated_integer fb:measurement_unit.dated_integer.number ?population_name .
  ?citytown fb:location.location.geolocation ?geocode .
  ?geocode fb:location.geocode.latitude ?latitude_name .
  ?geocode fb:location.geocode.longitude ?longitude_name .
}
ORDER BY DESC(?population_name)

The query delivers the intended results.

Films from Germany

Table description:

<film.film> | <film.film.initial_release_date> [DESC, 0] | <film.film_genre> [ASC, 1] | <film.film.country>(=="Germany"@en) | <film.director>

Generated query:

PREFIX fb: <http://rdf.freebase.com/ns/>
SELECT DISTINCT ?film_name ?initial_release_date_name ?film_genre_name ?country_name ?director_name WHERE {
  ?film fb:type.object.type fb:film.film .
  ?film fb:type.object.name ?film_name .
  ?film fb:film.film.initial_release_date ?initial_release_date_name .
  ?film_genre fb:type.object.type fb:film.film_genre .
  ?film_genre fb:type.object.name ?film_genre_name .
  ?film fb:film.film.genre ?film_genre .
  ?film fb:film.film.country ?country .
  ?country fb:type.object.name ?country_name .
  ?director fb:type.object.type fb:film.director .
  ?director fb:type.object.name ?director_name .
  ?film fb:film.film.directed_by ?director .
  FILTER(?country_name = "Germany"@en) .
}
ORDER BY DESC(?initial_release_date_name) ASC(?film_genre_name)

The query delivers the intended results.
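
The ORDER BY clause of this query lists the ordered columns by ascending rank, so [DESC, 0] comes before [ASC, 1]. A minimal sketch of that step, with the hypothetical function order_by_clause taking (variable, direction, rank) tuples:

```python
def order_by_clause(orders):
    """Build an ORDER BY clause from (variable, direction, rank) tuples.

    Columns are emitted by ascending rank; an empty input yields "".
    """
    if not orders:
        return ""
    ranked = sorted(orders, key=lambda o: o[2])
    return "ORDER BY " + " ".join(f"{direction}(?{var})"
                                  for var, direction, _ in ranked)
```

Passing the two ordered columns of the table description in their column order, order_by_clause([("initial_release_date_name", "DESC", 0), ("film_genre_name", "ASC", 1)]), reproduces the clause shown in the generated query.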

Future work

It is planned to continue the project in a master's thesis. For this purpose, the following parts allow further research and improvement: