This site provides the accompanying materials for our publication.

Code

The current version of our code is available for download on GitHub.

The repository includes all code as well as instructions on how to download the required data and reproduce our results. We also provide a complete Virtuoso index of Freebase for download. A Linux system is required to run the code.
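As a quick illustration (not part of our instructions), the following minimal Python sketch queries a local Virtuoso SPARQL endpoint once the Freebase index has been loaded. The endpoint URL (Virtuoso's default) and the example query are assumptions and may need to be adapted to your installation.

# Minimal sketch: query a local Virtuoso endpoint serving the Freebase index.
# The endpoint URL and the example query are illustrative assumptions.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://localhost:8890/sparql")  # Virtuoso default
sparql.setQuery("""
PREFIX fb: <http://rdf.freebase.com/ns/>
SELECT ?name WHERE {
  fb:m.0d3k14 fb:type.object.name ?name .  # m.0d3k14: John F. Kennedy
}
LIMIT 10
""")
sparql.setReturnFormat(JSON)
results = sparql.query().convert()
for binding in results["results"]["bindings"]:
    print(binding["name"]["value"])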

Results

We provide the output of our system on the Free917 and WebQuestions datasets, as well as a script to compute the evaluation results. All files are available on GitHub.

evaluation.py: Extended evaluation script from Berant et al.

The script has two simple extensions compared to the version by Berant et al.: 1) it reports accuracy in addition to average F1, and 2) it accounts for different date formats when comparing dates.
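For illustration, here is a minimal sketch of what these two extensions amount to. The actual code in evaluation.py may differ; the function names and the set of date formats below are assumptions.

from datetime import datetime

def normalize_date(value):
    # Try several date formats and map them to a common form, so that
    # e.g. "1961" and "1961-01-01" compare as equal.
    for fmt in ("%Y-%m-%d", "%d %B %Y", "%B %Y", "%Y"):
        try:
            return datetime.strptime(value, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return value  # not a recognized date; compare the raw string

def accuracy(f1_scores):
    # Fraction of questions answered perfectly (per-question F1 of 1.0).
    return sum(1 for f1 in f1_scores if f1 == 1.0) / len(f1_scores)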

final_f917_evaluation_output.txt: Results on the test part of the Free917 dataset

It can be evaluated with the script above to produce the following output:

> python evaluation.py final_f917_evaluation_output.txt
Number of questions: 276
Average recall over questions: 0.677553043687
Average precision over questions: 0.72050651442
Average f1 over questions: 0.67350224897
Accuracy over questions: 0.659420289855
F1 of average recall and average precision: 0.698369935688
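Note that the last line is not the mean of the per-question F1 scores ("Average f1 over questions") but the harmonic mean of the averaged recall and precision. This can be verified from the numbers above:

# Harmonic mean of the averaged recall and precision reported above.
def f1(precision, recall):
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(f1(0.72050651442, 0.677553043687))  # -> 0.698369..., matching the last line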

final_f917_eo_evaluation_output.txt: Results on the test part of the Free917 dataset with the entity lexicon

It can be evaluated with the script above to produce the following output:

> python evaluation.py final_f917_eo_evaluation_output.txt
Number of questions: 276
Average recall over questions: 0.777616143426
Average precision over questions: 0.81884057971
Average f1 over questions: 0.779488902448
Accuracy over questions: 0.764492753623
F1 of average recall and average precision: 0.797696103436

final_wq_evaluation_output.txt: Results on the test part of the WebQuestions dataset

It can be evaluated with the script above to produce the following output:

> python evaluation.py final_wq_evaluation_output.txt
Number of questions: 2032
Average recall over questions: 0.604353752656
Average precision over questions: 0.496435505027
Average f1 over questions: 0.494262559655
Accuracy over questions: 0.369094488189
F1 of average recall and average precision: 0.545104629829

Error Analysis

We manually classified the errors for a subset of questions from each dataset. The list is available as a Google Spreadsheet.

Aqqu Error Analysis (Google Spreadsheet)

For each dataset, we looked at a sample of 50 questions that were not answered correctly. For Free917, a question counts as answered incorrectly if its F1 is below 1.0; for WebQuestions, we considered an answer incorrect if the candidate with the highest F1 was not ranked first. Note that the spreadsheet lists all questions that were not answered correctly and we analyzed a sample of these, i.e., some questions have no error analysis.
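As an illustration only, these selection criteria can be sketched as follows. The field names are hypothetical; the actual analysis was performed on our system's output files.

import random

def incorrect_free917(question):
    # Free917: incorrect if the F1 of the returned answer is below 1.0.
    return question["f1"] < 1.0

def incorrect_webquestions(question):
    # WebQuestions: incorrect if the top-ranked candidate is not the
    # candidate with the highest achievable F1.
    best_f1 = max(c["f1"] for c in question["candidates"])
    return question["candidates"][0]["f1"] < best_f1

def sample_for_analysis(questions, is_incorrect, k=50, seed=0):
    errors = [q for q in questions if is_incorrect(q)]
    return random.Random(seed).sample(errors, min(k, len(errors)))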

Aqqu Demo

A running demo of our system. We make no guarantees regarding availability or performance.

Aqqu Demo