Bachelorprojekt: SynonymFinder
Niklas Baumert
29.01.2018 – 28.04.2018
Last Update: 29.10.2018
A collection of tools that allows extracting synonym data from a Wikipedia XML dump, converting the CrossWikis dataset, interacting with both, and evaluating them against each other. The web interface is live on tapoa (accessible from the university network).
Requirements
- Python 3.5
- rocksDB 4
- pyrocksdb v0.4 (the newer python-rocksdb does not work as of 10 April 2018; it has a bug with the VectorMemtables that is not yet fixed in the master branch)
- wikitextparser
- Wikipedia dump in XML format (Wikipedia:Database_download, pages-articles-multistream.xml.bz2).
Optional:
- Flask (for the web interface)
- The CrossWikis dataset if you want to run the comparison benchmark or use the web interface to query it (lnrm.dict and inv.dict are used).
- The WebQuestions Semantic Parses (WebQSP) dataset is used for the quality benchmark. You can provide your own data as a JSON file of the following format: [{"question": string, "answer": [string,...]}, ...]
- A Freebase ID to Wikipedia URL mapping in the form [FreebaseID]\t\t[WikipediaURL]\n. This is required if you want to use the WebQSP dataset and use my webqsp_parser.py to extract the data and create the benchmark.json with it.
The project comes with a Dockerfile that installs the software requirements for you.
Successfully tested with Ubuntu 16.04 LTS and 17.10.
Usage
In the following I will describe how to use the tools to create and use the synonym databases. If you want to know all the details about the flags and options of each script, check out the Code References.
Preparation
I recommend creating a directory data with the sub-folders data/input and data/databases. Put the Wikipedia XML into the data/input folder. If you want to use CrossWikis, put lnrm.dict and inv.dict there too. Your benchmark.json goes into data/input as well. If you do not have one and want to use the WebQSP dataset, put any of its datasets into data/input (I used WebQSP.train.json for the benchmarks).
With all the input data in one convenient location it will be easier to run the tools later.
I recommend using Docker.
docker build -t synfinder .
docker run -it -v [data]:/data/ -p [host-port]:5000 --rm --name synonymfinder synfinder
Here [data] is the data directory from the preparation and [host-port] is any free port on your machine.
You will find yourself in /opt.
In the following I will assume that you use the Docker container. If you do not want to use Docker, make sure you have all the requirements installed and adjust the file locations accordingly.
Creating the Synonym Databases
The first step in using the SynonymFinder is to create the synonym databases. This is done with synonymfinder.py:
python3 src/synonymfinder.py -x /data/input/wikipedia.xml -o /data/databases/ --inplace
This will parse the Wikipedia XML dump, collect the raw synonym data and convert it to usable synonym data.
As a result you will have the two rocksDB databases data/databases/dict.db/ and data/databases/inv.dict.db/. The --inplace option does all of this in one set of databases; if you omit it, the raw synonym data is collected and saved into its own databases and the usable synonym data is then put into separate databases.
This will take a while!
Create the CrossWikis Databases (Optional)
python3 src/build_crosswikis_db.py /data/input/lnrm.dict /data/databases/ cw.dict.db
python3 src/build_crosswikis_db.py /data/input/inv.dict /data/databases/ cw.inv.dict.db
This will create the dictionary and the inverse dictionary of the CrossWikis data respectively.
Run the Web Query Tool
The web interface is a simple Flask application which takes a word or a set of words and returns the Wikipedia URLs sorted by their rating.
python3 src/web_interface.py /data/databases/dict.db /data/databases/inv.dict.db
If you want to be able to query the CrossWikis dataset too, start the web interface the following way.
python3 src/web_interface.py /data/databases/dict.db /data/databases/inv.dict.db -cwd /data/databases/cw.dict.db -cwid /data/databases/cw.inv.dict.db
The web interface will run at 0.0.0.0:[host-port], where [host-port] is the port you set when running the Docker container.
Benchmarks (Optional)
This step requires that you have created the CrossWikis databases!
Create the Benchmark Data
To use the webqsp_parser.py you need a Freebase ID to Wikipedia URL mapping.
python3 src/webqsp_parser.py /data/input/WebQSP.train.json /data/input/freebase_wikipedia_titles.txt /data/input/
This will generate data/input/benchmark.json.
Run the Benchmarks
python3 src/benchmark.py -sqf /data/databases/dict.db /data/databases/inv.dict.db /data/databases/cw.dict.db /data/databases/cw.inv.dict.db /data/input/benchmark.json
The -s flag runs the speed tests, the -q flag runs the quality tests, and the -f flag runs the filesize test.
The results are printed to the terminal. The benchmarks use a best-of-5 approach for the speed tests and a rating cutoff of 1% for the quality tests. You can change both at the beginning of benchmark.py.
How it works
The synonym finder is a two-step process.
- Extracting raw synonym data
- Building usable synonym data by calculating ratings
Extracting Raw Synonym Data
The basic idea comes from CrossWikis: you take the links on a given Wikipedia page and say that the descriptor text is a synonym for the URL the link points to.
In the MediaWiki markup, links are written as [[URL_s]] or [[URL_s|Text]], where URL_s is a Wikipedia page title (wikipedia.org/wiki/URL) written with spaces instead of underscores (e.g. Prince Charles) and Text is the descriptor. URL_u is then the corresponding real URL (e.g. Prince_Charles). It can then be said that Text is a synonym for URL_u. If no explicit descriptor is given, URL_s acts as an implicit descriptor and therefore as a synonym for URL_u.
To stay with the example of Prince Charles: if the link is [[Prince Charles|Prince of Wales]], then Prince of Wales is a synonym for Prince_Charles. But if the link is [[Prince Charles]], then Prince Charles is a synonym for Prince_Charles.
In the first step all this data is collected as a dictionary where the keys are the descriptors (explicit or implicit) and the values are all the URL_u that have been described by this descriptor. The inverse is also collected; there the keys are the URL_u and the values are all descriptors that have been used to describe said URL. These lists can and will contain repeats.
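To illustrate this first step, here is a minimal sketch of how the descriptor/URL pairs of a single page could be collected. It is deliberately simplified (a bare regex instead of a full wikitext parser) and the function name is hypothetical; the real extraction lives in wiki_parser.py.

```python
import re
from collections import defaultdict

# Simplified sketch -- the real extraction in wiki_parser.py handles far more cases.
LINK_RE = re.compile(r"\[\[([^\]|#]+)(?:\|([^\]]+))?\]\]")

def extract_raw_synonyms(wikitext):
    """Collect descriptor -> URL_u and URL_u -> descriptor lists for one page."""
    str_to_urls = defaultdict(list)  # descriptor -> [URL_u, ...] (with repeats)
    url_to_strs = defaultdict(list)  # URL_u -> [descriptor, ...] (with repeats)
    for url_s, text in LINK_RE.findall(wikitext):
        descriptor = text if text else url_s      # implicit descriptor if no |Text part
        url_u = url_s.strip().replace(" ", "_")   # spaced title -> real URL form
        str_to_urls[descriptor].append(url_u)
        url_to_strs[url_u].append(descriptor)
    return str_to_urls, url_to_strs

# [[Prince Charles|Prince of Wales]] yields ("Prince of Wales", "Prince_Charles"),
# [[Prince Charles]] yields ("Prince Charles", "Prince_Charles").
```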
This already creates a lot of decent data and is what CrossWikis does (they just do it on a bigger scale). To differentiate my synonym finder from CrossWikis, I also extract data from infoboxes. Those are the little boxes on the side of an article that contain useful information such as the full names of artists or the authors of books. Since infoboxes come in many types that are specific to a usage (like books, movies, or people) and can also be extended (athlete uses person as a baseline and adds more information to it), I only handle a tiny subset of infoboxes, namely:
- Books
- Movies
- Songs
- TV Shows
- Person
- Musical Artists
- Royalty
- Sportsperson
- Company
And for each of them only a fraction of the possible information (you can find what exactly is parsed in the extract_infoboxes function in wiki_parser.py).
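To sketch the idea behind the infobox extraction, the snippet below pulls a few fields out of handled template types with wikitextparser. The template names and field lists here are illustrative placeholders, not the actual set handled by extract_infoboxes in wiki_parser.py.

```python
import wikitextparser as wtp

# Illustrative field lists -- the real set of handled infoboxes and fields is in wiki_parser.py.
HANDLED = {
    "infobox book": ("name", "author"),
    "infobox person": ("name", "birth_name"),
}

def extract_infobox_synonyms(wikitext, page_url):
    """Return (synonym, URL) pairs taken from the handled infobox fields of one page."""
    pairs = []
    for template in wtp.parse(wikitext).templates:
        fields = HANDLED.get(template.name.strip().lower())
        if not fields:
            continue  # only a tiny subset of infobox types is handled
        for argument in template.arguments:
            if argument.name.strip().lower() in fields and argument.value.strip():
                pairs.append((argument.value.strip(), page_url))
    return pairs
```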
One last consideration in the first step is redirects. My earlier example of Prince Charles is such a case: the page /wiki/Prince_Charles redirects to /wiki/Charles,_Prince_of_Wales, but more than half of all links that use the descriptor Prince Charles link to the Prince_Charles page. This is not what is expected, as the result should be the real page Charles,_Prince_of_Wales.
The tricky part with redirects is that you can only know that a page is a redirecting page once you have parsed it, but links to this page can appear earlier in the dump. Therefore, I decided to create a mapping that saves the URLs of all redirecting pages together with their redirect targets. This mapping is then used as a lookup in the second step.
Build Usable Synonym Data
Now, for this step the raw data has already been collected, so I take each key-value pair and handle it. Remember that the value is a list of (repeating) URLs in the normal direction and a list of (repeating) descriptors in the inverse direction. I will only describe the normal direction, as the inverse is done in exactly the same way.
First, I take all the URLs and check if any of them is a redirecting page; if so, I replace that URL with the one the redirect points to. Like Wikipedia, I only resolve redirects on the first level: if a link goes to a redirect page which redirects to yet another redirect page, it does not get resolved further.
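A minimal sketch of this one-level resolution against the collected mapping (the real mapping is read from redirect.map.db, see the Database Structure section):

```python
# Sketch of one-level redirect resolution; the mapping itself lives in redirect.map.db.
redirect_map = {
    "Prince_Charles": "Charles,_Prince_of_Wales",  # redirecting page -> target page
}

def resolve(url):
    """Replace a redirecting URL with its target; only one level is followed."""
    return redirect_map.get(url, url)

assert resolve("Prince_Charles") == "Charles,_Prince_of_Wales"
assert resolve("Charles,_Prince_of_Wales") == "Charles,_Prince_of_Wales"
```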
After this is done I count how many URLs are in the list; this is the total occurrence of the descriptor (key). Then I count how often each individual URL appears in the list as score[URL].
With the total occurrence and the per-URL counts I can then calculate the rating, which is simply score[URL] / total_occurrence[key]. Each data point is saved as a list [URL, score[URL], total_occurrence[key]]. After all URLs for the current key are done, they get sorted by this rating and saved to the database.
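Put together, converting one key could look roughly like this minimal sketch (the actual work happens in data_converter.py; the function name here is hypothetical):

```python
from collections import Counter

def build_entries(resolved_urls):
    """resolved_urls: redirect-resolved URL_u values collected for one descriptor (key)."""
    total_occurrence = len(resolved_urls)
    score = Counter(resolved_urls)
    # One data point per URL: [URL, score[URL], total_occurrence[key]], sorted by rating.
    return sorted(
        ([url, num, total_occurrence] for url, num in score.items()),
        key=lambda entry: entry[1] / entry[2],
        reverse=True,
    )

print(build_entries(["Charles,_Prince_of_Wales"] * 3 + ["Prince_(musician)"]))
# [['Charles,_Prince_of_Wales', 3, 4], ['Prince_(musician)', 1, 4]] -> ratings 0.75 and 0.25
```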
Database Structure
This project uses rocksDB as a key-value database back-end.
- dict.db
- The primary dictionary. It maps (normalized) strings to Wikipedia URLs. The data contains a list of lists. Each sub-list contains a URL, how often this URL has been described by the string (num), and the rating, which is simply num divided by the total occurrences of the string. The normalized strings are prefixed with lnrm__ while the raw strings are prefixed with wiki__. The total occurrence is also saved under the to__ prefix (a lookup sketch follows this list).
- inv.dict.db
- The inverse of the dictionary. It maps Wikipedia URLs to the strings that have been used to describe the URL in the text of Wikipedia pages. The data contains a list of lists. Each sub-list contains a string, how often this string has been used to describe the URL (num), and the rating, which is simply num divided by the total occurrences of the URL. The URLs are prefixed with wiki__. The total occurrence is also saved under the to__ prefix.
- redirect.map.db
- This database generally maps URL → URL: it maps Wikipedia pages that redirect to another page to that target page. This database is created while the Wikipedia dump is parsed but is only used when the final synonym data is created. This makes sure that all redirects have been collected before any attempt is made to replace them. It also handles the filtering of disambiguation pages: for these, instead of URL → URL it stores URL → __DISAMBIGUATION__, which is not a valid Wikipedia URL. Every link that points to such a disambiguation page is simply ignored.
- str_to_urls.db
- This is a temporary database which stores a string as the key and, as the value, collects all URLs that this string links to in a list that contains duplicates. By default, the synonymfinder does this in-place in the dict.db, but you can force the creation of this database. For more information, look at the usage of synonymfinder.py or wiki_parser.py.
- url_to_strs.db
- This is a temporary database which stores a URL as the key and, as the value, collects all strings that this URL is described by in a list that contains duplicates. By default, the synonymfinder does this in-place in the inv.dict.db, but you can force the creation of this database. For more information, look at the usage of synonymfinder.py or wiki_parser.py.
- cw.dict.db
- The equivalent of dict.db for the CrossWikis dataset, but the data only contains the URLs and their ratings. Keys are in LNRM form and are not prefixed.
- cw.inv.dict.db
- The equivalent of inv.dict.db for the CrossWikis dataset, but the data only contains the strings and their ratings. Keys are Wikipedia URLs and are not prefixed.
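As a rough illustration of the key layout of dict.db, a manual lookup could look like the sketch below. The normalization shown is only a stand-in for the real LNRM form and the value encoding (JSON) is an assumption; query_tool.py and database.py implement the actual access path.

```python
import json
import rocksdb  # pyrocksdb

# Hedged sketch: assumes JSON-encoded values and a simplistic normalization.
db = rocksdb.DB("/data/databases/dict.db", rocksdb.Options(), read_only=True)

def lookup(string):
    normalized = string.lower().strip()                 # stand-in for the real LNRM form
    entries = db.get(("lnrm__" + normalized).encode())  # [[URL, num, rating], ...]
    total = db.get(("to__" + normalized).encode())      # total occurrence of the string
    return (
        json.loads(entries) if entries else [],
        json.loads(total) if total else 0,
    )
```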
Comparison with CrossWikis
Quality
This benchmark uses the WebQSP dataset to evaluate the quality of the query results. For that, the PotentialTopicEntityMention is extracted and used as the query. The gold answer for the query is the respective TopicEntityMid, which is mapped to a Wikipedia URL. This results in a dataset with a total of 3083 queries.
The code for the evaluation has also been adapted from the WebQSP dataset to fit the changed data.
The following results relate to a rating cutoff. This is a value that determines which results of a query are given as answers. A cutoff of 0% means that all results are considered answers, while a cutoff of 25% means that only results with a rating above 25% are taken as answers. Obviously, a higher cutoff means fewer answers. I decided to stop at 25% because I thought that a result rated with a relevance of ¼ is already pretty important.
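In code terms, applying the cutoff is just a filter over the ratings of the returned results (a trivial sketch):

```python
def answers_for(results, cutoff=0.25):
    """results: [(URL, rating), ...]; keep only results rated above the cutoff."""
    return [url for url, rating in results if rating > cutoff]
```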
The precision develops as expected. With a rising cutoff value the precision goes up, as fewer and fewer false positives remain in the set of found answers. On average the precision of my solution is 1.66% higher than the precision of CrossWikis. Both solutions improve their precision with a growing cutoff, meaning that the extra results with a low rating are not relevant answers to the query.
The recall also behaves as expected: it drops as fewer results are considered answers. On average the recall of my solution is 1.94% higher than the recall of CrossWikis. Interestingly, CrossWikis does not have a higher recall even though it is a larger dataset.
Now the more interesting part: the F1-score. On average the F1-score of my solution is 1.54% higher than the F1-score of CrossWikis. We can see that the F1-score dramatically increases up to the 10%-cutoff, after which the performance continues to increase, but not as fast as at the lower cutoff values.
| Dataset | Metric | 0% | 1% | 5% | 10% | 15% | 20% | 25% |
|---|---|---|---|---|---|---|---|---|
| My Solution | Precision | 0.70% | 30.30% | 66.10% | 77.20% | 83.60% | 86.90% | 89.20% |
| My Solution | Recall | 96.90% | 95.80% | 93.90% | 93.00% | 91.80% | 91.10% | 90.30% |
| My Solution | F1-Score | 1.40% | 46.00% | 77.60% | 84.40% | 87.50% | 89.00% | 89.80% |
| CrossWikis | Precision | <0.01% | 31.00% | 63.70% | 75.20% | 81.50% | 83.90% | 87.10% |
| CrossWikis | Recall | 93.60% | 93.10% | 92.20% | 91.20% | 90.60% | 89.80% | 88.70% |
| CrossWikis | F1-Score | 0.10% | 46.50% | 75.40% | 82.50% | 85.80% | 86.70% | 87.90% |
Speed Test
The speed test comes in two parts. The first test measures the database loading times and the second the database query times. For the query times I use the first 1000 queries from the same dataset that the quality test uses. All speed tests are done with Python's timeit.repeat function and run 5 times.
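The timing setup follows this pattern (a sketch; run_queries here is a hypothetical callable that fires the 1000 queries once, while benchmark.py wraps the real query functions):

```python
import timeit

def time_queries(run_queries):
    """Return (best of 5, average of 5) wall-clock times for one full query run."""
    times = timeit.repeat(run_queries, repeat=5, number=1)
    return min(times), sum(times) / len(times)
```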
Here the results do not include the rating cutoff, as it does not affect the search itself but only which elements are returned. Any time differences would therefore be caused by system resource allocation or by how Python handles a floating-point comparison, appending to lists, and returning larger lists.
We can see that the CrossWikis rocksDB databases are loaded from disk significantly faster than mine are. While the times do not vary much between the best case and the average case, the difference surprised me. My guess is that the dramatic difference has to do with how the databases are built. My solution uses rocksDB settings that are tweaked to be faster with random writes. CrossWikis, on the other hand, is a presorted dataset, so the rocksDB settings are tweaked such that sequential writes are faster. I assume this plays a role when loading from disk.
| Storage Media | Dataset | Dictionary Load Time, Best of 5 (s) | Dictionary Load Time, Average (s) | Inverse Dictionary Load Time, Best of 5 (s) | Inverse Dictionary Load Time, Average (s) |
|---|---|---|---|---|---|
| HDD | My Solution | 3.7206 | 3.7760 | 8.4941 | 8.5726 |
| HDD | CrossWikis | 0.1659 | 0.1688 | 0.1321 | 0.1383 |
| SSD | My Solution | 4.4675 | 4.5152 | 10.2273 | 10.3825 |
| SSD | CrossWikis | 0.1971 | 0.2009 | 0.1639 | 0.1646 |
The query times, on the other hand, are the complete opposite of the database load times: here my solution beats CrossWikis hands down. I am not completely sure what causes the dramatic difference. My first guess is that the rocksDB settings are not optimal for the CrossWikis dataset. My second, less likely, guess is that the size of the CrossWikis dataset plays a role here.
Time to make 1000 queries to the database:

| Storage Media | Dataset | Best of 5 (s) | Average (s) |
|---|---|---|---|
| HDD | My Solution | 0.8747 | 0.8911 |
| HDD | CrossWikis | 3.7579 | 3.8059 |
| SSD | My Solution | 0.9082 | 0.9566 |
| SSD | CrossWikis | 3.9001 | 3.9729 |
Surprisingly, both the load speed and the query speed seemed to decrease on the SSD. Arguably the changes are within the margin of error, but it still makes little sense. My guess: either the settings are to blame, or it has to do with the fact that the databases were created on HDDs and rocksDB would structure them differently had they been created directly on the SSDs.
Filesize
Keep in mind that the filesizes are not directly comparable. My solution's dictionary database for example saves its keys in both raw and normalized forms and contains the total occurrences too. But even then my dataset has just slightly more than half the filesize of CrossWikis.
| Dataset | Dictionary Database | Inverse Dictionary Database | Total Filesize |
|---|---|---|---|
| My Solution | 2.30GB (2467737417 Bytes) | 1.32GB (1419169724 Bytes) | 3.62GB (3886907141 Bytes) |
| CrossWikis | 3.89GB (4180006883 Bytes) | 2.49GB (2670077874 Bytes) | 6.38GB (6850084757 Bytes) |
Currentness & Openness
Two minor benefits of my solution are that it is a more recent snapshot of similar data and that it is open: CrossWikis was built once, with the help of data that is not accessible to everyone, while my solution can be built by anyone.
Conclusion
After presenting the data of my comparison between my solution and CrossWikis, I now come to the conclusion.
From the data we can see that my solution performs overall better than CrossWikis (at least with a 10% cutoff value). We can assume that the data extraction improvements, like the detailed parsing of the infoboxes, made the difference. I want to exclude the speed test results here, because I think they really only come down to sub-optimal rocksDB settings.
I think my solution is competitive with CrossWikis. The dictionary database of my solution is 41% smaller than the CrossWikis one, and it could be reduced even further if it were tweaked to contain only the normalized strings. If disk space is a concern, my solution beats CrossWikis. And, as stated before, the precision is 1.66%, the recall 1.94%, and the F1-score 1.54% higher.
The next reason to use my solution over CrossWikis is if you need synonyms that carry a bit more context. If you need, for example, to resolve artist names to their real names or ISBNs to book titles, my solution should give you results. While CrossWikis might provide results too, they are not explicitly collected.
And the final reason to use my solution over CrossWikis is if you need the latest data. As stated before, my solution can easily be built on the most recent Wikipedia dump. Theoretically it could be rebuilt on a daily basis to ensure the most recent data.
Possible Improvements
In the following I will go over some ways in which the performance of my solution could potentially be improved or, at least, how more data could be collected from the dataset.
- Tweak the rocksDB settings to get faster build and read times. This is a major problem in my current implementation. I simply don't understand rocksDB enough to tweak it optimally.
- Parse more Infoboxes and even more information out of each.
- Extract names from infoboxes and save each possible permutation as synonyms.
- Handle linktails. A linktail is a MediaWiki feature that allows links like [[Help]], which displays as Help and links to the page wiki/Help, to be appended to, like [[Help]]ing, which appears as Helping and still links to wiki/Help. This feature has not been exploited because it is highly language specific. In English, for example, you can add -ly or -ing, but those would do nothing in German.
- A nicer web interface.
Code References
- benchmark.py
- build_crosswikis_db.py
- data_converter.py
- database.py
- query_tool.py
- synonymfinder.py
- utils.py
- web_interface.py
- webqsp_parser.py
- wiki_parser.py
benchmark.py
The benchmark tool requires you to have created the dictionary and inverse dictionary databases for both Wikipedia and CrossWikis, as well as to have created the benchmark.json with webqsp_parser.py.
You can change some options like the rating cutoff value (default: 1%) and how many loops the speed tests take (default: 5) at the beginning of the file.
usage: benchmark.py [-h] [-s] [-f] [-q]
wiki_dict wiki_inv_dict crosswikis_dict
crosswikis_inv_dict benchmark_file
Benchmark utility to evaluate the performance of my Wikipedia-backed synonym
finder solution compared to a CrossWikis-backed one.
positional arguments:
wiki_dict path to the dict.db of the Wikipedia dataset
wiki_inv_dict path to the inv.dict.db of the Wikipedia dataset
crosswikis_dict path to the dict.db of the CrossWikis dataset
crosswikis_inv_dict path to the inv.dict.db of the CrossWikis dataset
benchmark_file a json file that contains the benchmark data. required
for the speed and quality tests.
optional arguments:
-h, --help show this help message and exit
-s, --speed-test run the speed test
-f, --filesize-test run the filesize test
-q, --quality-test run the quality test
build_crosswikis_db.py
This tool is used to create the dictionary and inverse dictionary databases from the CrossWikis dataset.
usage: build_crosswikis_db.py [-h] [-b BATCH_SIZE] [-l LOG_LEVEL]
input output name
Load a CrossWikis dictionary or inv.dict txt dataset into a rocksDB
database(s).
positional arguments:
input the input CrossWikis dataset.
output the output directory where a new database will be
created.
name the name of the database.
optional arguments:
-h, --help show this help message and exit
-b BATCH_SIZE, --batch-size BATCH_SIZE
determines how large the batches grow before writing
to disk. Default is 300
-l LOG_LEVEL, --log-level LOG_LEVEL
determines the logging level. Possible values are:
debug, info, warning, error (warning is default).
data_converter.py
This tool is part of the main synonymfinder.py. It takes the raw synonym data that wiki_parser.py collects, counts the values and calculates the respective ratings. It can be executed as a standalone tool.
usage: data_converter.py [-h] [-l LOG_LEVEL] -i INPUT -o OUTPUT -r
REDIRECT_MAP
Tool to build the dictionary and inv.dict databases from a Wikipedia XML Dump.
optional arguments:
-h, --help show this help message and exit
-l LOG_LEVEL, --log-level LOG_LEVEL
determines the logging level. Possible values are:
debug, info, warning, error (warning is default).
-i INPUT, --input INPUT
rocksDB input database. This has to be the result of
wiki_parser.
-o OUTPUT, --output OUTPUT
Path and name of the resulting rocksDB database.
-r REDIRECT_MAP, --redirect-map REDIRECT_MAP
rocksDB database that contains the mapping of all
redirecting wikipedia page to their targets.
database.py
This module is a wrapper for the pyrocksdb library. It provides easier usage by encoding and decoding data appropriately and offers support for write batches.
You can change the rocksDB settings in here too.
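For orientation, tweaking the settings there could look like the sketch below with pyrocksdb options; the exact option values the project uses are defined in database.py and may differ.

```python
import rocksdb  # pyrocksdb

def make_options():
    opts = rocksdb.Options()
    opts.create_if_missing = True
    opts.write_buffer_size = 64 * 1024 * 1024   # size a memtable may reach before it is flushed
    opts.max_write_buffer_number = 4            # number of memtables kept in memory
    opts.memtable_factory = rocksdb.VectorMemtableFactory()  # see the pyrocksdb note under Requirements
    return opts

db = rocksdb.DB("/data/databases/example.db", make_options())
```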
query_tool.py
This tool provides the query functions to get the synonym data from the database (both Wikipedia and CrossWikis) and sort it by ratings.
It also provides a command line interface to do interactive queries.
usage: query_tool.py [-h] [-cwd CROSSWIKIS_DICT] [-cwid CROSSWIKIS_INV_DICT]
wikipedia_dict wikipedia_inv_dict
Commandline query tool that allows to query both datasets.
positional arguments:
wikipedia_dict path to the dict.db of the Wikipedia dataset
wikipedia_inv_dict path to the inv.dict.db of the Wikipedia dataset
optional arguments:
-h, --help show this help message and exit
-cwd CROSSWIKIS_DICT, --cw-dict CROSSWIKIS_DICT
path to the dict.db of the CrossWikis dataset
-cwid CROSSWIKIS_INV_DICT, --cw-inv-dict CROSSWIKIS_INV_DICT
path to the inv.dict.db of the CrossWikis dataset
synonymfinder.py
The main tool of the project. It uses the functionality of wiki_parser.py and data_converter.py to create the dictionary and the inverse dictionary databases from the Wikipedia dump.
usage: synonymfinder.py [-h] --xml XML [-l LOG_LEVEL] [--inplace]
[--output OUTPUT_PATH] [--batch-size BATCH_SIZE]
Tool to build the dict and inv.dict databases from a Wikipedia XML Dump.
optional arguments:
-h, --help show this help message and exit
--xml XML, -x XML the XML Wikipedia dump file.
-l LOG_LEVEL, --log-level LOG_LEVEL
determines the logging level. Possible values are:
debug, info, warning, error (warning is default).
--inplace operate inplace, meaning all the initial data
extracted from the XML will be put into the dict.db
and inv.dict.db respectivly and then replace when the
data is converted.
--output OUTPUT_PATH, -o OUTPUT_PATH
where the databases should be created. Same directory
as the script by default.
--batch-size BATCH_SIZE, -b BATCH_SIZE
determines how many commits a batch receives before it
is written to the database.
utils.py
A collection of utility functions used throughout the whole codebase.
web_interface.py
This tool brings the functionality of query_tool.py into a Flask powered web interface. It can query the Wikipedia and optionally the CrossWikis databases.
usage: web_interface.py [-h] [-cwd CROSSWIKIS_DICT]
[-cwid CROSSWIKIS_INV_DICT]
wikipedia_dict wikipedia_inv_dict
Launch a Flask web interface that allows to query the databases. If the
CrossWikis databases are omitted only the Wikipedia databases can be queried.
positional arguments:
wikipedia_dict path to the dict.db of the Wikipedia dataset
wikipedia_inv_dict path to the inv.dict.db of the Wikipedia dataset
optional arguments:
-h, --help show this help message and exit
-cwd CROSSWIKIS_DICT, --cw-dict CROSSWIKIS_DICT
path to the dict.db of the CrossWikis dataset
-cwid CROSSWIKIS_INV_DICT, --cw-inv-dict CROSSWIKIS_INV_DICT
path to the inv.dict.db of the CrossWikis dataset
webqsp_parser.py
This utility extracts question-answer pairs from the WebQSP dataset. It takes the processed question, builds all sub-sequences, and uses them as the questions. The answers are the TopicEntityMid values, which by default are Freebase IDs but get mapped to valid Wikipedia URLs.
The data is not a perfect fit for the task of finding synonyms, but it does the job of providing comparable data.
usage: webqsp_parser.py [-h] input mapping output
Tool to convert data from the WebQSP dataset to be used by my benchmark.
positional arguments:
input location of a json file from the WebQSP dataset
mapping location of a file that contais Freebase ID to Wikipedia URL
mappings
output location where the output benchmark.json will be written to
optional arguments:
-h, --help show this help message and exit
wiki_parser.py
This tool is part of the main synonymfinder.py. It takes a Wikipedia XML dump, parses it and extracts the raw synonym data.
usage: wiki_parser.py [-h] [-l LOG_LEVEL] [-i INPUT] [-o OUTPUT]
[--batch-size BATCH_SIZE]
Extract raw data from a XML Wikipedia dump. Creates key, values databases
where values is a list of (repeating) strings.
optional arguments:
-h, --help show this help message and exit
-l LOG_LEVEL, --log-level LOG_LEVEL
determines the logging level. Possible values are:
debug, info, warning, error (warning is default).
-i INPUT, --input INPUT
Location of the XML Wikipedia dump.
-o OUTPUT, --output OUTPUT
Where the databases should be created. Same directory
as the script by default.
--batch-size BATCH_SIZE, -b BATCH_SIZE
determines how many commits a batch receives before it
is written to the database.