Extracting relations from the GeoNames database
This project dealt with improving the location-based data found in the YAGO knowledge base, which is used by the semantic full-text search engine Broccoli.
To solve this task, we used the data provided by the GeoNames geographical database and wrote the program GeoReader, which extracts the relevant information and creates valid relation files in the format used by Broccoli.
Anton Stepan
stepana [at] informatik [dot] uni-freiburg [dot] de
Marius Bethge
bethgem [at] informatik [dot] uni-freiburg [dot] de
To compile the code, run:
make compile
./GeoReaderMain [options]
required options:
--all-countries-file [a]: <filename>
--no-locations-file [n]: <filename>
--country-info-file [c]: <filename>
--admin1-file [1]: <filename>
--admin2-file [2]: <filename>
optional options:
--output-dir [o]: <directory>
--population-limit [p]: <number>
--utf8 [u]
- File containing the bulk of the data for all locations.
countryInfo.txt - File containing information about countries.
admin1CodesASCII.txt - File needed for breaking down first level administrative division mappings.
admin2Codes.txt - File needed for breaking down second level administrative division mappings.
blacklist.txt - Entities that are being used in non-location contexts in Broccoli and are not considered to be of the class Location.
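The way these input files fit together can be illustrated with a short Python sketch. It is not GeoReader's actual code. It assumes the tab-separated column layout of the public GeoNames dumps (name in column 2, country code in column 9, admin1 code in column 11, population in column 15) and admin1 keys of the form `DE.01`. The function names are hypothetical, and treating the population limit as a minimum and the blacklist as a set of exact names are both assumptions.

```python
# Column positions (0-based) in a GeoNames location row; assumed from the public dump layout.
NAME, COUNTRY, ADMIN1, POPULATION = 1, 8, 10, 14


def load_admin1(lines):
    """Map keys such as 'DE.01' to the name of the first level division."""
    mapping = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2:
            mapping[fields[0]] = fields[1]
    return mapping


def select_locations(lines, admin1, blacklist, population_limit=0):
    """Yield (name, admin1 name or None, country code) for every kept row.

    Assumptions: blacklist holds exact location names, and population_limit
    is a minimum population below which a row is skipped.
    """
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) <= POPULATION:
            continue  # malformed or truncated row
        name = fields[NAME]
        if name in blacklist:
            continue
        population = int(fields[POPULATION] or 0)
        if population < population_limit:
            continue
        key = f"{fields[COUNTRY]}.{fields[ADMIN1]}"
        yield name, admin1.get(key), fields[COUNTRY]
```

Both functions take any iterable of lines, so an open file handle can be passed directly. Second level divisions could be resolved the same way with a three-part key built from the country, admin1 and admin2 codes.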
With the data from GeoNames, we managed to greatly extend the location-based data of Broccoli.
However, because the two data sets denote location names differently, it is difficult to match them.

Entity or Relation | Broccoli | Broccoli with GeoNames | Change |
---|---|---|---|
Cities | 28663 | 73485 | +44822 |
Country | 3714 | 5000 | +1286 |
Administrative District | 161063 | 180388 | +19325 |
Locations occur-with 'nuclear power plant' located-in Germany | 3 | 25 | +22 |
Location located-in Location located-in Location | 17 | 123865 | +123848 |
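
The difficulty of matching location names between the two data sets can be made concrete with a small normalization sketch in Python. The project text does not say how the names differ or how matching was attempted, so this is only one possible approach. It assumes the differences are limited to case, diacritics, underscores and extra whitespace, and the function name `normalize_name` is hypothetical.

```python
import unicodedata


def normalize_name(name):
    """Reduce a location name to a comparison key.

    Assumption: names differ only in case, diacritics, underscores and
    whitespace. Real mismatches (translations, abbreviations, alternate
    names) are not handled here.
    """
    # Decompose accented characters so the accents become separate combining marks.
    decomposed = unicodedata.normalize("NFKD", name)
    # Drop the combining marks, keeping the base letters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Treat underscores as spaces, fold case, and collapse runs of whitespace.
    return " ".join(stripped.casefold().replace("_", " ").split())
```

Two names would then be treated as the same location when their keys are equal, for example `normalize_name("São_Paulo") == normalize_name("Sao Paulo")`. A key like this is lossy, so distinct places can collide, and country or administrative division information would still be needed to disambiguate.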