Bachelorprojekt: SynonymFinder

Niklas Baumert

29.01.2018 – 28.04.2018

Last Update: 29.10.2018

A collection of tools which allow the extraction of synonym data from a Wikipedia XML dump, conversion of the CrossWikis dataset, interaction with both, and evaluation of them against each other. The web interface is live on tapoa (accessible from the university network).

Requirements

  • Python 3.5
  • rocksDB 4
  • pyrocksdb v0.4 (the newer python-rocksdb does not work as of 10 April 2018; it has a bug with the VectorMemtables which is not yet fixed in the master branch)
  • wikitextparser
  • Wikipedia dump in XML format (Wikipedia:Database_download, pages-articles-multistream.xml.bz2).
  • Optionals:
    • Flask (for the web interface)
    • The CrossWikis dataset if you want to run the comparison benchmark or use the web interface to query it (lnrm.dict and inv.dict are used).
    • The WebQuestions Semantic Parses (WebQSP) dataset is used for the quality benchmark. You can provide your own data as a json file of the following format: [{"question": string, "answer": [string,...]}, ...] (see the example after this list).
    • A Freebase to Wikipedia URL mapping in the form [FreebaseID]\t\t[WikipediaURL]\n. This is required if you want to use the WebQSP dataset and my webqsp_parser.py to extract the data and create benchmark.json.
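
For illustration, a minimal benchmark.json in the documented format could be written like this (a sketch; the question and answer values are invented):

import json

# A hypothetical one-entry benchmark file in the documented format:
# [{"question": string, "answer": [string, ...]}, ...]
benchmark = [
    {"question": "who is the prince of wales",
     "answer": ["Charles,_Prince_of_Wales"]},
]
with open("data/input/benchmark.json", "w") as f:
    json.dump(benchmark, f)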

The project comes with a Dockerfile that installs the software requirements for you.

Successfully tested with Ubuntu 16.04 LTS and 17.10.

Usage

In the following I will describe how to use the tools to create and use the synonym databases. If you want to know all the details about the flags and options of each script, check out the Code References.

Preparation

I recommend creating a directory data with the sub-folders data/input and data/databases. Put the Wikipedia XML into the data/input folder. If you want to use CrossWikis, put lnrm.dict and inv.dict there too. Your benchmark.json goes into data/input as well. If you don't have one and want to use the WebQSP dataset, put any of its datasets into data/input (I used WebQSP.train.json for the benchmarks).

With all the input data in one convenient location, it will be easier to run the tools later.

I recommend using docker.

docker build -t synfinder .
docker run -it -v [data]:/data/ -p [host-port]:5000 --rm --name synonymfinder synfinder

Here [data] is the directory from the preparation and [host-port] is any free port on your machine.

You will find yourself in /opt.

In the following I will assume you use the Docker container. If you don't want to use Docker, make sure you have all the requirements installed and adjust the file locations to your setup.

Creating the Synonym Databases

The first step in using the SynonymFinder is to create the synonym databases. This is done with synonymfinder.py:

python3 src/synonymfinder.py -x /data/input/wikipedia.xml -o /data/databases/ --inplace

This will parse the Wikipedia XML dump, collect the raw synonym data and convert it into usable synonym data. As a result you will have the two rocksDB databases data/databases/dict.db/ and data/databases/inv.dict.db/. The --inplace option does all of that in a single set of databases; if you omit it, the raw synonym data is first collected into its own databases and the usable synonym data is then written into separate ones.

This will take a while!

Create the CrossWikis Databases (Optional)

python3 src/build_crosswikis_db.py /data/input/lnrm.dict /data/databases/ cw.dict.db
python3 src/build_crosswikis_db.py /data/input/inv.dict /data/databases/ cw.inv.dict.db

This will create the dictionary and the inverse dictionary of the CrossWikis data respectively.

Run the Web Query Tool

The web interface is a simple Flask application which takes a word or a set of words and returns the Wikipedia URLs sorted by their rating.

python3 src/web_interface.py /data/databases/dict.db /data/databases/inv.dict.db

If you want to be able to query the CrossWikis dataset too, start the web interface the following way.

python3 src/web_interface.py /data/databases/dict.db /data/databases/inv.dict.db -cwd /data/databases/cw.dict.db -cwid /data/databases/cw.inv.dict.db

The web interface will run at 0.0.0.0:[host-port], where [host-port] is the port you set when running the Docker container.

Benchmarks (Optional)

This step requires that you have created the CrossWikis databases!

Create the Benchmark Data

To use the webqsp_parser.py you need a Freebase ID to Wikipedia URL mapping.

python3 src/webqsp_parser.py /data/input/WebQSP.train.json /data/input/freebase_wikipedia_titles.txt /data/input/

This will generate data/input/benchmark.json.

Run the Benchmarks

python3 src/benchmark.py -sqf /data/databases/dict.db /data/databases/inv.dict.db /data/databases/cw.dict.db /data/databases/cw.inv.dict.db /data/input/benchmark.json

The -s flag runs the speed tests. The -q flag runs the quality tests. And the -f flag runs the filesize test.

The results will be printed to the terminal. The benchmarks use a best-of-5 approach for the speed tests and a rating cutoff of 1% for the quality tests. You can change both at the beginning of benchmark.py.

How it works

The synonym finder is a two-step process.

  1. Extracting raw synonym data.
  2. Building usable synonym data by calculating ratings.

Extracting Raw Synonym Data

The basic idea is from CrossWikis: take any link on a given Wikipedia page and say that its descriptor text is a synonym for the URL the link points to.

In MediaWiki markup, links are written as [[URL_s]] or [[URL_s|Text]], where URL_s is a Wikipedia URL (wikipedia.org/wiki/URL) written with spaces instead of underscores (e.g. Prince Charles) and Text is the descriptor. URL_u then denotes the real URL with underscores (e.g. Prince_Charles).

It can then be said that Text is a synonym for URL_u. If no explicit descriptor is given, URL_s acts as an implicit descriptor and therefore as a synonym for URL_u.

To keep with the example of Prince Charles: if the link is [[Prince Charles|Prince of Wales]], then Prince of Wales is a synonym for Prince_Charles. But if the link is [[Prince Charles]], then Prince Charles is a synonym for Prince_Charles.

In the first step all this data is collected as a dictionary where the keys are the descriptors (explicit or implicit) and the values are all the URL_u that have been described by that descriptor. The inverse is collected as well; there the keys are the URL_u and the values are all descriptors that have been used to describe that URL. These lists can and will contain repeats.
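
As a minimal sketch of this collection step (not the actual wiki_parser.py implementation, which uses wikitextparser), the link handling could look like this:

import re
from collections import defaultdict

LINK_RE = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]+))?\]\]")

def collect_raw_synonyms(wikitext, str_to_urls, url_to_strs):
    """Append (with repeats) every descriptor/URL_u pair found in the markup."""
    for target, descriptor in LINK_RE.findall(wikitext):
        url = target.strip().replace(" ", "_")  # URL_u, e.g. Prince_Charles
        text = (descriptor or target).strip()   # explicit or implicit descriptor
        str_to_urls[text].append(url)
        url_to_strs[url].append(text)

str_to_urls, url_to_strs = defaultdict(list), defaultdict(list)
collect_raw_synonyms("[[Prince Charles|Prince of Wales]] and [[Prince Charles]]",
                     str_to_urls, url_to_strs)
# str_to_urls == {'Prince of Wales': ['Prince_Charles'],
#                 'Prince Charles': ['Prince_Charles']}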

This alone already creates a lot of decent data and is what CrossWikis does (just on a bigger scale). To differentiate my synonym finder from CrossWikis, I also extract data from infoboxes. Those are the little boxes on the side of an article that contain useful information such as the full names of artists or the authors of books. Since infoboxes come in many types that are specific to a usage (like books, movies or people) and can also be extended (athlete uses person as a baseline and adds more information to it), I only handle a small subset of infoboxes, namely:

  • Books
  • Movies
  • Songs
  • TV Shows
  • Person
  • Musical Artists
  • Royalty
  • Sportsperson
  • Company

And for each of them only a fraction of the possible information is parsed (you can find what exactly is extracted in the extract_infoboxes function in wiki_parser.py).

One last consideration in the first step are redirects. My earlier example of Prince Charles is such a case: the page /wiki/Prince_Charles redirects to /wiki/Charles,_Prince_of_Wales, but more than half of all links that use the descriptor Prince Charles point to the Prince_Charles page. This is not what is expected, as the result should be the real page Charles,_Prince_of_Wales.

The tricky part with redirects is that you can only know that a page is a redirecting page once you have parsed it, but links to this page can occur earlier in the dump. Therefore, I decided to create a mapping that saves the URLs of all redirecting pages together with their redirect targets. This mapping is then used as a lookup in the second step.
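
A hedged sketch of how such a redirect mapping could be recorded during parsing (the actual detection lives in wiki_parser.py; #REDIRECT is the standard MediaWiki redirect syntax):

import re

REDIRECT_RE = re.compile(r"#REDIRECT\s*\[\[([^\[\]|]+)\]\]", re.IGNORECASE)

def record_redirect(url, wikitext, redirect_map):
    """If the page is a redirect, remember its target for the second step."""
    match = REDIRECT_RE.match(wikitext)
    if match:
        redirect_map[url] = match.group(1).strip().replace(" ", "_")

redirect_map = {}
record_redirect("Prince_Charles",
                "#REDIRECT [[Charles, Prince of Wales]]", redirect_map)
# redirect_map == {'Prince_Charles': 'Charles,_Prince_of_Wales'}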

Build Usable Synonym Data

For this step the raw data has already been collected, so I take each key-value pair and handle it. Remember that the value is a list of (repeating) URLs in the normal direction and a list of (repeating) descriptors in the inverse direction. I will only describe the normal direction, as the inverse one is handled exactly the same way.

First, I take all the URLs and check whether any of them is a redirecting page; if so, I replace that URL with the one the redirect points to. Like Wikipedia, I only resolve redirects one level deep: if a link goes to a redirect page which redirects to yet another redirect page, it does not get resolved.

After this is done I count how many URLs are in the list; this is the total occurrence of the descriptor (key). Then I count how often each individual URL appears in the list as score[URL].

With the total occurrence and the per-URL counts I can then calculate the rating, which is simply score[URL] / total_occurrence[key]. Each data point is saved as a list [URL, score[URL], total_occurrence[key]]. After all URLs for the current key are processed, they are sorted by this rating and saved to the database.
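
A minimal sketch of this conversion for a single key, using a plain Python dict in place of the rocksDB redirect map:

from collections import Counter

def build_entry(urls, redirect_map):
    """Resolve first-level redirects, then compute [URL, num, total] entries."""
    resolved = [redirect_map.get(u, u) for u in urls]    # one level only, like Wikipedia
    total = len(resolved)                                # total occurrence of the key
    entry = [[url, num, total] for url, num in Counter(resolved).items()]
    entry.sort(key=lambda e: e[1] / e[2], reverse=True)  # sort by rating = num / total
    return entry

print(build_entry(["Prince_Charles", "Prince_Charles", "Prince_(musician)"],
                  {"Prince_Charles": "Charles,_Prince_of_Wales"}))
# [['Charles,_Prince_of_Wales', 2, 3], ['Prince_(musician)', 1, 3]]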

Database Structure

This project uses rocksDB as a key-value database back-end.

dict.db
The primary dictionary. It maps (normalized) strings to Wikipedia URLs. The data contains a list of lists. Each sub-list contains a URL, how often this URL has been described by the string (num) and the rating, which is simply num divided by the total occurrences of the string. Normalized string keys are prefixed with lnrm__ while raw string keys are prefixed with wiki__. The total occurrence is saved under the to__ prefix (see the lookup sketch at the end of this list).
inv.dict.db
The inverse of the dictionary. It maps Wikipedia URLs to the strings that have been used to describe the URL in the text of Wikipedia pages. The data contains a list of lists. Each sub-list contains a string, how often this string has been used to describe the URL (num) and the rating, which is simply num divided by the total occurrences of the URL. The URL keys are prefixed with wiki__. The total occurrence is saved under the to__ prefix.
redirect.map.db
This database generally maps URL → URL: it maps Wikipedia pages that redirect to another page to said page. The database is created while the Wikipedia dump is parsed but is only used when the final synonym data is created; this makes sure all redirects have been collected before any attempt to replace them. It also handles the filtering of disambiguation pages: for those, instead of URL → URL, it stores URL → __DISAMBIGUATION__, which is not a valid Wikipedia URL. Every link that points to such a disambiguation page is simply ignored.
str_to_urls.db
This is a temporary database which stores a string as the key and collects as the value all URLs that this string links to, in a list which contains duplicates. By default the synonymfinder does this in-place in dict.db, but you can force the creation of this separate database. For more information, see the usage of synonymfinder.py or wiki_parser.py.
url_to_strs.db
This is a temporary database which stores a URL as the key and collects as the value all strings that this URL is described by, in a list which contains duplicates. By default the synonymfinder does this in-place in inv.dict.db, but you can force the creation of this separate database. For more information, see the usage of synonymfinder.py or wiki_parser.py.
cw.dict.db
The equivalent of dict.db for the CrossWikis dataset, but the data only contains the URLs and their ratings. Keys are in LNRM form and not prefixed.
cw.inv.dict.db
The equivalent of inv.dict.db for the CrossWikis dataset, but the data only contains the strings and their ratings. Keys are Wikipedia URLs and not prefixed.
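
To illustrate the key layout of dict.db, here is a hedged lookup sketch using pyrocksdb directly. The normalization shown is a simplification and the exact value encoding is an assumption; the authoritative code is in database.py and query_tool.py:

import rocksdb  # pyrocksdb v0.4

db = rocksdb.DB("/data/databases/dict.db", rocksdb.Options())

def lookup_normalized(query):
    """Fetch the encoded [URL, num, total] list for a normalized string key."""
    lnrm = " ".join(query.lower().split())  # simplified; the real LNRM form does more
    return db.get(("lnrm__" + lnrm).encode("utf-8"))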

Comparison with CrossWikis

Quality

This benchmark uses the WebQSP dataset to evaluate the quality of the query results. For that, the PotentialTopicEntityMention is extracted and used as the query. The gold answer for the query is the respective TopicEntityMid, which is mapped to a Wikipedia URL. This results in a dataset with a total of 3083 queries.

The code for the evaluation has been adapted from the WebQSP dataset to fit the changed data.

The following charts are plotted against a rating cutoff. This is a threshold that determines which results of a query count as answers. A cutoff of 0% means that all results are considered answers, while a cutoff of 25% means that only results with a rating above 25% are taken as answers. Obviously, a higher cutoff means fewer answers. I decided to stop at 25% because a result rated with a relevance of ¼ is already pretty important.
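
A sketch of how the cutoff turns query results into answers and how the per-query metrics could then be computed (the actual evaluation code is adapted from the WebQSP dataset; the [URL, num, total] entry format is described in the Database Structure section):

def answers(results, cutoff):
    """Keep only results whose rating num/total lies above the cutoff."""
    return [url for url, num, total in results if num / total > cutoff]

def evaluate(found, gold):
    """Precision, recall and F1 for a single query, given answer sets."""
    true_positives = len(set(found) & set(gold))
    precision = true_positives / len(found) if found else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1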

The precision develops as expected: with a rising cutoff value the precision goes up, as fewer and fewer false positives remain in the set of found answers. On average the precision of my solution is 1.66% higher than that of CrossWikis. Both solutions improve their precision with a growing cutoff, meaning that the extra low-rated results are mostly not relevant answers to the query.

Precision Bar Chart

The recall also behaves as expected: it drops as fewer results are considered answers. On average the recall of my solution is 1.94% higher than that of CrossWikis. Interestingly, CrossWikis does not have a higher recall even though it is the larger dataset.

Recall Bar Chart

Now the more interesting part: the F1-score. On average the F1-score of my solution is 1.54% higher than that of CrossWikis. The F1-score increases dramatically up to the 10% cutoff, after which it continues to rise, but not as fast as at the lower cutoff values.

F1-Score Bar Chart
Dataset      Metric      Rating Cutoff
                         0%       1%       5%       10%      15%      20%      25%
My Solution  Precision   0.70%    30.30%   66.10%   77.20%   83.60%   86.90%   89.20%
             Recall      96.90%   95.80%   93.90%   93.00%   91.80%   91.10%   90.30%
             F1-Score    1.40%    46.00%   77.60%   84.40%   87.50%   89.00%   89.80%
CrossWikis   Precision   <0.01%   31.00%   63.70%   75.20%   81.50%   83.90%   87.10%
             Recall      93.60%   93.10%   92.20%   91.20%   90.60%   89.80%   88.70%
             F1-Score    0.10%    46.50%   75.40%   82.50%   85.80%   86.70%   87.90%

Speed Test

The speed test comes in two parts: the first measures the database loading times and the second the database query times. For the query times I use the first 1000 queries from the same dataset the quality test uses. All speed tests are done with Python's timeit.repeat function and 5 runs.
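
Stripped down, the timing pattern looks like this (the dict-based db and run_queries are stand-ins for the real benchmark code in benchmark.py):

import timeit

db = {b"query %d" % i: b"result" for i in range(1000)}  # stand-in for a rocksDB handle
queries = list(db)

def run_queries():
    for q in queries:
        db.get(q)  # dict.get here; rocksdb.DB.get in the real benchmark

# repeat=5 gives five independent timings; take the best and the average
times = timeit.repeat(run_queries, repeat=5, number=1)
best, average = min(times), sum(times) / len(times)
print(best, average)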

Here the charts do not contain the rating cutoff, as it does not affect the search itself, only which elements are returned. Any time differences would therefore be caused by system resource allocation or by how fast Python does a floating-point comparison, appends to lists and returns larger lists.

We can see that the CrossWikis rocksDB databases load from disk significantly faster than those of my solution. While the times don't vary much between the best case and the average case, the difference surprised me. My guess is that it has to do with how the databases are built: my solution uses rocksDB settings tweaked to be faster with random writes, whereas CrossWikis is a presorted dataset, so its rocksDB settings are tweaked for faster sequential writes. I suspect that plays a role when loading from disk.
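
For illustration, write-optimized settings of the kind described above could look like this in pyrocksdb (illustrative values, not the settings actually used in database.py):

import rocksdb

opts = rocksdb.Options(create_if_missing=True)
opts.memtable_factory = rocksdb.VectorMemtableFactory()  # fast random inserts
opts.write_buffer_size = 64 * 1024 * 1024                # larger in-memory buffer
db = rocksdb.DB("/data/databases/dict.db", opts)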

Database Loadtimes Benchmark Bar Chart
                            Dictionary Loadtime (in s)   Inverse Dictionary Loadtime (in s)
Storage Media  Dataset      Best of 5   Average          Best of 5   Average
HDD            My Solution  3.7206      3.7760           8.4941      8.5726
               CrossWikis   0.1659      0.1688           0.1321      0.1383
SSD            My Solution  4.4675      4.5152           10.2273     10.3825
               CrossWikis   0.1971      0.2009           0.1639      0.1646

The query times, on the other hand, are the complete opposite of the database load times: here my solution beats CrossWikis hands down. I'm not completely sure what causes this dramatic difference. My first guess is that the rocksDB settings are not optimal for the CrossWikis dataset. My second, more unlikely, guess is that the sheer size of the CrossWikis dataset plays a role.

Query Speed Benchmark Bar Chart
Make 1000 Queries to the Database
Storage Media  Dataset      Best of 5 (in s)   Average (in s)
HDD            My Solution   0.8747             0.8911
               CrossWikis    3.7579             3.8059
SSD            My Solution   0.9082             0.9566
               CrossWikis    3.9001             3.9729

Surprisingly, both the load speed and the query speed seem to decrease on the SSDs. Arguably the changes are within the margin of error, but it is still counterintuitive. My guess: either the settings are to blame, or it has to do with the fact that the data was created on HDDs and rocksDB would structure it differently if it were created directly on the SSDs.

Filesize

Keep in mind that the filesizes are not directly comparable. My solution's dictionary database, for example, saves its keys in both raw and normalized form and contains the total occurrences too. Even then, my dataset has just slightly more than half the filesize of CrossWikis.

Filesize Benchmark Bar Chart
Dataset      Dictionary Database         Inverse Dictionary Database   Total Filesize
My Solution  2.30GB (2467737417 Bytes)   1.32GB (1419169724 Bytes)     3.62GB (3886907141 Bytes)
CrossWikis   3.89GB (4180006883 Bytes)   2.49GB (2670077874 Bytes)     6.38GB (6850084757 Bytes)

Currentness & Openness

Two minor benefits of my solution are that it is a more recent snapshot of similar data and that it can be rebuilt by anyone. CrossWikis was built once, with the help of data that is not accessible to everyone.

Conclusion

After presenting the data of my comparison between my solution and CrossWikis, I now come to the conclusion.

From the data we can see that my solution performed better overall than CrossWikis (at least with a 10% cutoff value). We can assume that the data extraction improvements, like the detailed parsing of the infoboxes, made the difference. I exclude the speed test results here, because I think they really only come down to sub-optimal rocksDB settings.

I think my solution is competitive with CrossWikis. The dictionary database of my solution is 41% smaller than the CrossWikis one, and it could shrink even further if it were tweaked to only contain the normalized strings. If disk space is a concern, my solution beats CrossWikis. And, as noted before, the precision is 1.66%, the recall 1.94% and the F1-score 1.54% higher.

The next reason to use my solution over CrossWikis is if you need synonyms with a bit more context. If you, for example, need to resolve artist names to their real names or ISBNs to book titles, my solution should give you results. While CrossWikis might provide results too, they are not explicitly collected there.

And the final reason to use my solution over CrossWikis is if you need the latest data. As stated before, my solution can easily be built on the most recent Wikipedia dump. Theoretically it could be rebuilt on a daily basis to ensure the most recent data.

Possible Improvements

In the following I will go over some ways in which the performance of my solution could potentially be improved or, at least, how more data could be collected from the dataset.

  1. Tweak the rocksDB settings to get faster build and read times. This is a major problem in my current implementation; I simply don't understand rocksDB well enough to tweak it optimally.
  2. Parse more Infoboxes and even more information out of each.
  3. Extract names from infoboxes and save each possible permutation as synonyms.
  4. Handle linktails. A linktail is a MediaWiki feature that allows a link like [[Help]], which displays as Help and links to the page wiki/Help, to be extended like [[Help]]ing, which displays as Helping but still links to wiki/Help. This feature has not been exploited because it is highly language specific: in English, for example, you can append -ly or -ing, but those would do nothing in German. A sketch follows after this list.
  5. A nicer web interface.
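
What such linktail handling could look like for English, as a hedged sketch based on the [[Help]]ing example above:

import re

# A link optionally followed by a lowercase tail, e.g. [[Help]]ing -> Helping.
TAIL_LINK_RE = re.compile(r"\[\[([^\[\]|]+)\]\]([a-z]*)")

match = TAIL_LINK_RE.match("[[Help]]ing")
target, tail = match.groups()
descriptor = target + tail              # 'Helping', the displayed descriptor
url = target.strip().replace(" ", "_")  # 'Help', the linked page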

Code References

benchmark.py

The benchmark tool requires you to have created the dictionary and inverse dictionary databases for both Wikipedia and CrossWikis, as well as the benchmark.json with webqsp_parser.py.

You can change some options like the rating cutoff value (default: 1%) and how many loops the speed tests take (default: 5) at the beginning of the file.

usage: benchmark.py [-h] [-s] [-f] [-q]
                    wiki_dict wiki_inv_dict crosswikis_dict
                    crosswikis_inv_dict benchmark_file

Benchmark utility to evaluate the performance of my Wikipedia-backed synonym
finder solution compared to a CrossWikis-backed one.

positional arguments:
  wiki_dict            path to the dict.db of the Wikipedia dataset
  wiki_inv_dict        path to the inv.dict.db of the Wikipedia dataset
  crosswikis_dict      path to the dict.db of the CrossWikis dataset
  crosswikis_inv_dict  path to the inv.dict.db of the CrossWikis dataset
  benchmark_file       a json file that contains the benchmark data. required
                       for the speed and quality tests.

optional arguments:
  -h, --help           show this help message and exit
  -s, --speed-test     run the speed test
  -f, --filesize-test  run the filesize test
  -q, --quality-test   run the quality test

build_crosswikis_db.py

This tool is used to create the dictionary and inverse dictionary databases from the CrossWikis dataset.

usage: build_crosswikis_db.py [-h] [-b BATCH_SIZE] [-l LOG_LEVEL]
                    input output name

Load a CrossWikis dictionary or inv.dict txt dataset into a rocksDB
database(s).

positional arguments:
input                 the input CrossWikis dataset.
output                the output directory where a new database will be
              created.
name                  the name of the database.

optional arguments:
-h, --help            show this help message and exit
-b BATCH_SIZE, --batch-size BATCH_SIZE
              determines how large the batches grow before writing
              to disk. Default is 300
-l LOG_LEVEL, --log-level LOG_LEVEL
              determines the logging level. Possible values are:
              debug, info, warning, error (warning is default).

data_converter.py

This tool is part of the main synonymfinder.py. It takes the raw synonym data that wiki_parser.py collects, counts the values and calculates the respective ratings. It can be executed as a standalone tool.

usage: data_converter.py [-h] [-l LOG_LEVEL] -i INPUT -o OUTPUT -r
                    REDIRECT_MAP

Tool to build the dictionary and inv.dict databases from a Wikipedia XML Dump.

optional arguments:
-h, --help            show this help message and exit
-l LOG_LEVEL, --log-level LOG_LEVEL
                   determines the logging level. Possible values are:
                   debug, info, warning, error (warning is default).
-i INPUT, --input INPUT
                   rocksDB input database. This has to be the result of
                   wiki_parser.
-o OUTPUT, --output OUTPUT
                   Path and name of the resulting rocksDB database.
-r REDIRECT_MAP, --redirect-map REDIRECT_MAP
                   rocksDB database that contains the mapping of all
                   redirecting Wikipedia pages to their targets.

database.py

This module is a wrapper around the pyrocksdb library. It makes usage easier by encoding and decoding data appropriately and offers support for write batches.

You can change the rocksDB settings in here too.
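
Such a wrapper could look roughly like the following sketch; the JSON value encoding is an assumption, and the actual module may encode differently:

import json
import rocksdb  # pyrocksdb v0.4

class Database:
    """Sketch of a pyrocksdb wrapper handling string keys and structured values."""

    def __init__(self, path):
        self.db = rocksdb.DB(path, rocksdb.Options(create_if_missing=True))

    def put(self, key, value, batch=None):
        data = json.dumps(value).encode("utf-8")
        (batch or self.db).put(key.encode("utf-8"), data)

    def get(self, key):
        raw = self.db.get(key.encode("utf-8"))
        return json.loads(raw.decode("utf-8")) if raw is not None else None

    def write(self, batch):
        self.db.write(batch)  # flush a rocksdb.WriteBatch in one go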

query_tool.py

This tool provides the query functions to get the synonym data from the database (both Wikipedia and CrossWikis) and sort it by ratings.

It also provides a command line interface to do interactive queries.

usage: query_tool.py [-h] [-cwd CROSSWIKIS_DICT] [-cwid CROSSWIKIS_INV_DICT]
                    wikipedia_dict wikipedia_inv_dict

Commandline query tool that allows to query both datasets.

positional arguments:
 wikipedia_dict        path to the dict.db of the Wikipedia dataset
 wikipedia_inv_dict    path to the inv.dict.db of the Wikipedia dataset

optional arguments:
 -h, --help            show this help message and exit
 -cwd CROSSWIKIS_DICT, --cw-dict CROSSWIKIS_DICT
                       path to the dict.db of the CrossWikis dataset
 -cwid CROSSWIKIS_INV_DICT, --cw-inv-dict CROSSWIKIS_INV_DICT
                       path to the inv.dict.db of the CrossWikis dataset

synonymfinder.py

The main tool of the project. It uses the functionality of wiki_parser.py and data_converter.py to create the dictionary and the inverse dictionary databases from the Wikipedia dump.

usage: synonymfinder.py [-h] --xml XML [-l LOG_LEVEL] [--inplace]
                    [--output OUTPUT_PATH] [--batch-size BATCH_SIZE]

Tool to build the dict and inv.dict databases from a Wikipedia XML Dump.

optional arguments:
-h, --help            show this help message and exit
--xml XML, -x XML     the XML Wikipedia dump file.
-l LOG_LEVEL, --log-level LOG_LEVEL
                    determines the logging level. Possible values are:
                    debug, info, warning, error (warning is default).
--inplace             operate inplace, meaning all the initial data
                    extracted from the XML will be put into the dict.db
                    and inv.dict.db respectively and then replaced when
                    the data is converted.
--output OUTPUT_PATH, -o OUTPUT_PATH
                    where the databases should be created. Same directory
                    as the script by default.
--batch-size BATCH_SIZE, -b BATCH_SIZE
                    determines how many commits a batch receives before it
                    is written to the database.

utils.py

A collection of utility functions used throughout the whole codebase.

web_interface.py

This tool brings the functionality of query_tool.py into a Flask powered web interface. It can query the Wikipedia and optionally the CrossWikis databases.

usage: web_interface.py [-h] [-cwd CROSSWIKIS_DICT]
                    [-cwid CROSSWIKIS_INV_DICT]
                    wikipedia_dict wikipedia_inv_dict

Launch a Flask web interface that allows to query the databases. If the
CrossWikis databases are omitted only the Wikipedia databases can be queried.

positional arguments:
wikipedia_dict        path to the dict.db of the Wikipedia dataset
wikipedia_inv_dict    path to the inv.dict.db of the Wikipedia dataset

optional arguments:
-h, --help            show this help message and exit
-cwd CROSSWIKIS_DICT, --cw-dict CROSSWIKIS_DICT
                    path to the dict.db of the CrossWikis dataset
-cwid CROSSWIKIS_INV_DICT, --cw-inv-dict CROSSWIKIS_INV_DICT
                    path to the inv.dict.db of the CrossWikis dataset

webqsp_parser.py

This utility extracts question-answer pairs from the WebQSP dataset. It takes the processed question and uses all of its sub-sequences as the questions. The answers are the TopicEntityMids, which by default are Freebase IDs but get mapped to valid Wikipedia URLs.

The data does not really fit the task of finding synonyms, but it does the job of providing comparable data.
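
In essence, the extraction could be sketched as follows. This is a simplified sketch that uses only the PotentialTopicEntityMention as the query (as described in the quality benchmark section); the JSON layout assumes WebQSP's documented field names, and the mapping-file parsing assumes the [FreebaseID]\t\t[WikipediaURL] format from the requirements:

import json

with open("/data/input/WebQSP.train.json") as f:
    data = json.load(f)

fb_to_wiki = {}  # FreebaseID -> WikipediaURL
with open("/data/input/freebase_wikipedia_titles.txt") as f:
    for line in f:
        freebase_id, wiki_url = line.rstrip("\n").split("\t\t")
        fb_to_wiki[freebase_id] = wiki_url

benchmark = []
for question in data["Questions"]:
    for parse in question["Parses"]:
        mid = parse["TopicEntityMid"]
        if mid in fb_to_wiki:
            benchmark.append({"question": parse["PotentialTopicEntityMention"],
                              "answer": [fb_to_wiki[mid]]})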

usage: webqsp_parser.py [-h] input mapping output

Tool to convert data from the WebQSP dataset to be used by my benchmark.

positional arguments:
    input       location of a json file from the WebQSP dataset
    mapping     location of a file that contains Freebase ID to Wikipedia URL
                mappings
    output      location where the output benchmark.json will be written to

optional arguments:
    -h, --help  show this help message and exit

wiki_parser.py

This tool is part of the main synonymfinder.py. It takes a Wikipedia XML dump, parses it and extracts the raw synonym data.

usage: wiki_parser.py [-h] [-l LOG_LEVEL] [-i INPUT] [-o OUTPUT]
                    [--batch-size BATCH_SIZE]

Extract raw data from a XML Wikipedia dump. Creates key, values databases
where values is a list of (repeating) strings.

optional arguments:
-h, --help            show this help message and exit
-l LOG_LEVEL, --log-level LOG_LEVEL
                      determines the logging level. Possible values are:
                      debug, info, warning, error (warning is default).
-i INPUT, --input INPUT
                      Location of the XML Wikipedia dump.
-o OUTPUT, --output OUTPUT
                      Where the databases should be created. Same directory
                      as the script by default.
--batch-size BATCH_SIZE, -b BATCH_SIZE
                      determines how many commits a batch receives before it
                      is written to the database.