Bachelor Project: Relation Extraction

Extracting relations from the GeoNames database


Overview

This project dealt with improving the location-based data found in the YAGO knowledge base, which is used by the semantic full-text search engine Broccoli.

To solve this task, we used the data provided by the GeoNames geographical database and wrote the program GeoReader, which extracts the relevant information and creates valid relation files in the format used by Broccoli.

Team

Anton Stepan
stepana [at] informatik [dot] uni-freiburg [dot] de

Marius Bethge
bethgem [at] informatik [dot] uni-freiburg [dot] de

Supervisors

Prof. Dr. Hannah Bast
Björn Buchhold

Institute

University of Freiburg
Department of Computer Science
Chair of Algorithms and Data Structures

Geographical Information

Free Gazetteer Data from GeoNames

Code

Documentation

link: Code overview generated by doxygen.

Compile and Run

You need GCC version ≥ 4.7 and gtest installed on your system to compile our code.
After unpacking the code, run make compile to build it.

Download the Code

from SVN

Info

programming language: C++
lines of code: ~1500

Libraries

  • GTest for testing our program
  • Bootstrap for this infosite

Usage Info and Parameters

./GeoReaderMain [options]


required options:

--all-countries-file [a]: <filename>
--no-locations-file [n]: <filename>
--country-info-file [c]: <filename>
--admin1-file [1]: <filename>
--admin2-file [2]: <filename>

optional options:
--output-dir [o]: <directory>
--population-limit [p]: <number>
--utf8 [u]
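
The option list above could be handled with POSIX getopt_long roughly as follows. This is a minimal sketch, not the GeoReader implementation: the Options struct, its field names, the defaults, and the reading of the bracketed letters as short options (-a, -n, ...) are assumptions made for illustration.

```cpp
#include <getopt.h>

#include <cassert>
#include <cstdlib>
#include <string>

// Hypothetical container for the options listed above; the field names are
// illustrative and not taken from the GeoReader sources.
struct Options {
  std::string allCountriesFile;
  std::string noLocationsFile;
  std::string countryInfoFile;
  std::string admin1File;
  std::string admin2File;
  std::string outputDir;     // empty means "not given" (assumed default)
  long populationLimit = 0;  // assumed default: no limit
  bool utf8 = false;
};

// Fills *opts from argv. Returns false on an unknown option or when one of
// the five required file options is missing.
bool parseOptions(int argc, char** argv, Options* opts) {
  assert(opts != nullptr);
  static const struct option kLongOptions[] = {
      {"all-countries-file", required_argument, nullptr, 'a'},
      {"no-locations-file", required_argument, nullptr, 'n'},
      {"country-info-file", required_argument, nullptr, 'c'},
      {"admin1-file", required_argument, nullptr, '1'},
      {"admin2-file", required_argument, nullptr, '2'},
      {"output-dir", required_argument, nullptr, 'o'},
      {"population-limit", required_argument, nullptr, 'p'},
      {"utf8", no_argument, nullptr, 'u'},
      {nullptr, 0, nullptr, 0}};

  optind = 1;  // restart scanning so the function can be called repeatedly
  int c;
  while ((c = getopt_long(argc, argv, "a:n:c:1:2:o:p:u", kLongOptions,
                          nullptr)) != -1) {
    switch (c) {
      case 'a': opts->allCountriesFile = optarg; break;
      case 'n': opts->noLocationsFile = optarg; break;
      case 'c': opts->countryInfoFile = optarg; break;
      case '1': opts->admin1File = optarg; break;
      case '2': opts->admin2File = optarg; break;
      case 'o': opts->outputDir = optarg; break;
      case 'p': opts->populationLimit = std::strtol(optarg, nullptr, 10); break;
      case 'u': opts->utf8 = true; break;
      default: return false;  // unknown option or missing argument
    }
  }
  return !opts->allCountriesFile.empty() && !opts->noLocationsFile.empty() &&
         !opts->countryInfoFile.empty() && !opts->admin1File.empty() &&
         !opts->admin2File.empty();
}
```

A caller would typically print the usage text shown above and exit when parseOptions returns false.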

Data Files from GeoNames

allCountries.zip

- File containing the bulk of the data for all locations.

countryInfo.txt

- File containing information about countries.

admin1CodesASCII.txt

- File needed for breaking down first level administrative division mappings.

admin2Codes.txt

- File needed for breaking down second level administrative division mappings.

blacklist.txt

- Entities that are being used in non-location contexts in Broccoli and are not considered to be of the class Location.
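
To illustrate how a record from the main data file can be read: the GeoNames dump stores one location per line as 19 tab-separated columns (geonameid, name, asciiname, alternatenames, latitude, longitude, feature class, feature code, country code, cc2, admin1 to admin4 codes, population, elevation, dem, timezone, modification date). The sketch below pulls out the columns that seem most relevant here. GeoRecord, parseGeoNamesLine and passesPopulationLimit are hypothetical names, and treating --population-limit as a minimum population is an assumption, not something taken from GeoReader.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>
#include <vector>

// Subset of the columns of one GeoNames record (illustrative struct).
struct GeoRecord {
  std::string name;
  std::string asciiName;
  std::string featureClass;
  std::string countryCode;
  std::string admin1Code;
  std::string admin2Code;
  long population = 0;
};

// Splits a line at every tab and keeps empty fields.
std::vector<std::string> splitTabs(const std::string& line) {
  std::vector<std::string> fields;
  std::string::size_type start = 0;
  while (true) {
    const std::string::size_type tab = line.find('\t', start);
    if (tab == std::string::npos) {
      fields.push_back(line.substr(start));
      break;
    }
    fields.push_back(line.substr(start, tab - start));
    start = tab + 1;
  }
  return fields;
}

// Parses one line of the GeoNames dump. Returns false if the line does not
// have exactly 19 columns.
bool parseGeoNamesLine(const std::string& line, GeoRecord* rec) {
  assert(rec != nullptr);
  const std::vector<std::string> f = splitTabs(line);
  if (f.size() != 19) return false;
  rec->name = f[1];
  rec->asciiName = f[2];
  rec->featureClass = f[6];
  rec->countryCode = f[8];
  rec->admin1Code = f[10];
  rec->admin2Code = f[11];
  rec->population = std::strtol(f[14].c_str(), nullptr, 10);
  return true;
}

// Assumed semantics of --population-limit: keep records whose population is
// at least the given limit.
bool passesPopulationLimit(const GeoRecord& rec, long limit) {
  return rec.population >= limit;
}
```

The admin1 and admin2 codes are only meaningful together with the country code, which is why the two admin code files are needed to resolve them to names.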

Result

With the data from GeoNames we managed to greatly extend the location-based data of Broccoli.

However, because the two data sets often denote the same location differently, it is difficult to match them.

We extracted ~130k locations from GeoNames. There are about 173k locations in Broccoli that occur with a relation.
With our version we are able to match 18% of the GeoNames locations to YAGO entities using ASCII names, or 23% using UTF-8 names.

Our unification method is very naive, and further improvement of the unification process could greatly increase the matching percentage.
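
As an illustration of what a naive unification can look like, the sketch below counts a GeoNames location as matched only if its name is identical to an entity name. This exact-match rule is an assumption made for the example; the page does not specify how GeoReader compares names, and matchedFraction is a hypothetical helper.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_set>
#include <vector>

// Returns the fraction of GeoNames names that occur verbatim in the set of
// entity names. Exact string equality is the assumed matching rule.
double matchedFraction(const std::vector<std::string>& geoNames,
                       const std::unordered_set<std::string>& entityNames) {
  if (geoNames.empty()) return 0.0;
  std::size_t matched = 0;
  for (const std::string& name : geoNames) {
    if (entityNames.count(name) > 0) ++matched;
  }
  assert(matched <= geoNames.size());
  return static_cast<double>(matched) / static_cast<double>(geoNames.size());
}
```

With such a rule, an entity named "Zürich" is missed by the ASCII name "Zurich" but found by the UTF-8 name, which is one way the choice of name column can change the matching percentage.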

Entity or Relation                                             Broccoli  Broccoli with GeoNames   Change
Cities                                                            28663                   73485   +44822
Country                                                            3714                    5000    +1286
Administrative District                                          161063                  180388   +19325
Locations occur-with 'nuclear power plant' located-in Germany         3                      25      +22
Location located-in Location located-in Location                     17                  123865  +123848