Automatic recognition of values in Wikipedia articles

Project type: ESE Project
Project name: Automatic recognition of values in Wikipedia articles
Author: Regina König
Supervisor: Prof. Dr. Hannah Bast, Björn Buchold

Overview

1. Project description

The goal was to automatically find values in Wikipedia articles and convert them to metric units for the semantic search engine Broccoli. The value-finding component runs as one stage of a UIMA pipeline.

2. Explanations

2.1 UIMA Framework

UIMA stands for Unstructured Information Management Architecture. It was developed by IBM in 2005 and has been maintained by Apache since 2006. The concept is to implement a pipeline that reads in unstructured information, runs it through several analysis steps in which the information is marked with specific annotations, and finally delivers the results to consumers, which process the information further.
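
The pipeline idea can be illustrated with a minimal sketch in plain Java. It deliberately does not use the Apache UIMA API: the names SketchAnnotation, SketchDocument, SketchAnnotator, DigitRunAnnotator and SketchPipeline are hypothetical stand-ins, chosen only to show how analysis steps add annotations to a shared document.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for illustration; these are NOT the Apache UIMA API.
final class SketchAnnotation {
    final int begin;    // offset of the first annotated character
    final int end;      // offset just past the last annotated character
    final String label; // what the annotation marks, e.g. "number"

    SketchAnnotation(int begin, int end, String label) {
        this.begin = begin;
        this.end = end;
        this.label = label;
    }
}

// Holds the text under analysis together with the annotations added so far.
final class SketchDocument {
    final String text;
    final List<SketchAnnotation> annotations = new ArrayList<>();

    SketchDocument(String text) {
        this.text = text;
    }
}

// One analysis step: inspects the document and may add annotations.
interface SketchAnnotator {
    void process(SketchDocument doc);
}

// Example step: marks every maximal run of ASCII digits as a "number".
final class DigitRunAnnotator implements SketchAnnotator {
    @Override
    public void process(SketchDocument doc) {
        String t = doc.text;
        int i = 0;
        while (i < t.length()) {
            if (isDigit(t.charAt(i))) {
                int start = i;
                while (i < t.length() && isDigit(t.charAt(i))) {
                    i++;
                }
                doc.annotations.add(new SketchAnnotation(start, i, "number"));
            } else {
                i++;
            }
        }
    }

    private static boolean isDigit(char c) {
        return c >= '0' && c <= '9';
    }
}

// Runs the registered steps in order over one document.
final class SketchPipeline {
    private final List<SketchAnnotator> steps = new ArrayList<>();

    SketchPipeline add(SketchAnnotator step) {
        steps.add(step);
        return this;
    }

    SketchDocument run(String text) {
        SketchDocument doc = new SketchDocument(text);
        for (SketchAnnotator step : steps) {
            step.process(doc);
        }
        return doc;
    }
}
```

A consumer would then read doc.annotations after new SketchPipeline().add(new DigitRunAnnotator()).run(text) has returned. In a real UIMA setup the framework itself provides the document container and the annotator interface.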

3. Implementation

3.1 Prework

To assess the frequency and syntax of the various value types, a statistical analysis of a number of Wikipedia articles was carried out. Not only actual units are of interest here, but also words that can indicate a type of value. For example, the word "in" followed by a number indicates that the value is very probably a year ("in 1975").
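
The cue-word idea can be sketched as follows. This is only an illustration, not the project's code: the class YearCue and the restriction to exactly four digits are assumptions made here to keep the example small.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the project's code base.
final class YearCue {
    // Cue word "in", whitespace, then exactly four digits (assumed year shape).
    private static final Pattern IN_YEAR =
        Pattern.compile("\\bin\\s+(\\d{4})\\b", Pattern.CASE_INSENSITIVE);

    // Returns every number that the cue word marks as a probable year.
    static List<Integer> probableYears(String text) {
        List<Integer> years = new ArrayList<>();
        Matcher m = IN_YEAR.matcher(text);
        while (m.find()) {
            years.add(Integer.parseInt(m.group(1)));
        }
        return years;
    }
}
```

With this sketch, a phrase such as "in 30 km" is not reported, because the number after the cue word does not have four digits. A cue word only raises the probability of a value type, so a real implementation would still have to handle false positives.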

The most frequent value syntaxes were:
Pattern               Relative Quantity in %   Example
Unit Value            15.6                     AD 1980
Value Unit            14.3                     30 km
Value%                13.1                     7%
Value without Unit    5.8                      1886
s after Value         1.6                      1980s
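
As a rough sketch, the five patterns from the table could be told apart as shown here. The class PatternClassifier, the whitespace tokenization and the two tiny unit lists are assumptions for illustration; the project's actual unit lists and matching rules are not given in this text.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical classifier for the five value syntaxes listed in the table.
final class PatternClassifier {
    // Tiny illustrative unit lists (assumed, not the project's real lists).
    private static final Set<String> PREFIX_UNITS =
        new HashSet<>(Arrays.asList("AD"));
    private static final Set<String> SUFFIX_UNITS =
        new HashSet<>(Arrays.asList("km", "m", "kg", "miles"));

    // Classifies the token at index i, looking at its direct neighbours.
    static String classify(String[] tokens, int i) {
        String token = tokens[i];
        if (token.matches("\\d+s")) {
            return "s after Value";            // e.g. 1980s
        }
        if (token.matches("\\d+(\\.\\d+)?%")) {
            return "Value%";                   // e.g. 7%
        }
        if (!token.matches("\\d+(\\.\\d+)?")) {
            return "no value";                 // token is not a number
        }
        if (i > 0 && PREFIX_UNITS.contains(tokens[i - 1])) {
            return "Unit Value";               // e.g. AD 1980
        }
        if (i + 1 < tokens.length && SUFFIX_UNITS.contains(tokens[i + 1])) {
            return "Value Unit";               // e.g. 30 km
        }
        return "Value without Unit";           // e.g. 1886
    }
}
```

For example, PatternClassifier.classify(new String[] {"30", "km"}, 0) reports the "Value Unit" pattern. The order of the checks here follows from the token shapes, not from the measured frequencies.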

3.2 Implementation

The code is written in Java.
The ValueAnnotator checks the UTF-8 encoded Wikipedia articles token by token for the occurrence of a number. If a number is found, an object of the class ReadValue is created, which contains the actual value-reading function. The search algorithm is based on the statistical analysis of the frequency of the different value patterns. As soon as the unit is found, the value is converted to metric units and the ValueAnnotator creates an annotation.
The value types to search for can be configured in config.txt.
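
The conversion step could look roughly like the following sketch. The class MetricConverter, its method names and the unit abbreviations are hypothetical and not taken from the project; only the conversion factors are standard definitions.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical converter; the project's real conversion code is not shown here.
final class MetricConverter {
    // Source unit -> metric target unit, and source unit -> multiplication factor.
    private static final Map<String, String> TARGET_UNIT = new HashMap<>();
    private static final Map<String, Double> FACTOR = new HashMap<>();

    static {
        register("mi", "km", 1.609344);   // international mile
        register("ft", "m", 0.3048);      // international foot
        register("in", "cm", 2.54);       // inch
        register("lb", "kg", 0.45359237); // avoirdupois pound
    }

    private static void register(String unit, String metricUnit, double factor) {
        TARGET_UNIT.put(unit, metricUnit);
        FACTOR.put(unit, factor);
    }

    // Returns the value expressed in the metric target unit.
    // Units that are not registered are passed through unchanged.
    static double convert(double value, String unit) {
        Double factor = FACTOR.get(unit);
        return factor == null ? value : value * factor;
    }

    // Returns the metric unit that convert() produces for the given unit.
    static String metricUnit(String unit) {
        return TARGET_UNIT.getOrDefault(unit, unit);
    }
}
```

With these factors, MetricConverter.convert(30, "mi") gives about 48.28 and MetricConverter.metricUnit("mi") gives "km". A value that is already metric, such as 5 km, is returned as it is.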

4. Results

The search distinguishes between exact values (e.g. 30 km) and ranges (e.g. 1986 - 1992).
The type "value" has 5 features:

int begin - the position in the text where the value begins
int end - the position in the text where the value ends
float value - the value itself
string unit - the unit of the value
string type - the type of the value (e.g. length)

The type "range" has two value features: the beginning and the end value of the range.
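
The two types can be pictured as plain Java classes, as in the sketch below. In UIMA, annotation types are normally declared in a type system descriptor, so the classes ValueSketch and RangeSketch are only hypothetical stand-ins that mirror the features listed above. The field names from and to, and the type name "year" used in the test data, are assumptions.

```java
// Hypothetical stand-in for the "value" type with its five features.
final class ValueSketch {
    final int begin;    // position in the text where the value begins
    final int end;      // position in the text where the value ends
    final float value;  // the value itself
    final String unit;  // unit of the value
    final String type;  // type of the value, e.g. "length"

    ValueSketch(int begin, int end, float value, String unit, String type) {
        this.begin = begin;
        this.end = end;
        this.value = value;
        this.unit = unit;
        this.type = type;
    }
}

// Hypothetical stand-in for the "range" type with its two value features.
final class RangeSketch {
    final ValueSketch from; // beginning value of the range
    final ValueSketch to;   // end value of the range

    RangeSketch(ValueSketch from, ValueSketch to) {
        this.from = from;
        this.to = to;
    }
}
```

For the text "1986 - 1992", a range would hold one value covering offsets 0 to 4 and a second one covering offsets 7 to 11.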