More Accurate Question Answering on Freebase (CIKM 2015)
Hannah Bast and Elmar Haussmann, University of Freiburg
This site provides the accompanying materials for our publication.
Code
Download: The current version of our code is on GitHub. This includes all code and instructions on how to download the required data and reproduce our results. We also provide a complete Virtuoso index of Freebase for download. You will need a Linux system to run the code.
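Once the index is running, it can be sanity-checked with a SPARQL query against the local endpoint. The following is a minimal sketch, assuming Virtuoso's default SPARQL endpoint URL and the standard Freebase RDF namespace; the MID in the query is only an example and not part of our materials.

# Minimal sketch: query the local Virtuoso index of Freebase.
# Assumes Virtuoso's default SPARQL endpoint (http://localhost:8890/sparql).
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://localhost:8890/sparql")
sparql.setReturnFormat(JSON)
sparql.setQuery("""
PREFIX fb: <http://rdf.freebase.com/ns/>
SELECT ?name WHERE {
  fb:m.02mjmr fb:type.object.name ?name .  # m.02mjmr: an example MID
  FILTER (LANG(?name) = "en")
}
""")

results = sparql.query().convert()
for binding in results["results"]["bindings"]:
    print(binding["name"]["value"])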
Results
We provide the output of our system on the Free917 and WebQuestions datasets, as well as a script to compute the evaluation results. All of the files are available on GitHub.
evaluation.py Extended evaluation script from Berant et al.
The script has two simple extensions compared to the version by Berant et al.: 1) it reports accuracy in addition to average F1, and 2) it accounts for different date formats when comparing dates. A sketch of both extensions follows below.
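The following is a minimal sketch of what these two extensions amount to; the function names, the set of date formats, and the input format are assumptions for illustration, not the actual code of evaluation.py.

# Illustrative sketch of the two extensions; not the actual script code.
from datetime import datetime

def normalize_date(value):
    # Parse a date string in one of several common formats so that, e.g.,
    # "1990-02-01" and "02/01/1990" compare as equal. The set of formats
    # covered here is an assumption for illustration.
    for fmt in ("%Y-%m-%d", "%m/%d/%Y", "%d %B %Y", "%Y"):
        try:
            return datetime.strptime(value, fmt).date()
        except ValueError:
            continue
    return value  # not a recognized date; compare as a plain string

def accuracy(results):
    # Fraction of questions answered perfectly (F1 == 1.0), given
    # (question, f1) pairs; this is the "Accuracy over questions" line.
    return sum(1 for _, f1 in results if f1 == 1.0) / len(results)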
final_f917_evaluation_output.txt Results on the test part of the Free917 dataset
Can be evaluated with the script above to produce the following output.
> python evaluation.py final_f917_evaluation_output.txt
Number of questions: 276
Average recall over questions: 0.677553043687
Average precision over questions: 0.72050651442
Average f1 over questions: 0.67350224897
Accuracy over questions: 0.659420289855
F1 of average recall and average precision: 0.698369935688
final_f917_eo_evaluation_output.txt Results on the test part of the Free917 dataset with the entity lexicon
Can be evaluated with the script above to produce the following output.
> python evaluation.py final_f917_eo_evaluation_output.txt
Number of questions: 276
Average recall over questions: 0.777616143426
Average precision over questions: 0.81884057971
Average f1 over questions: 0.779488902448
Accuracy over questions: 0.764492753623
F1 of average recall and average precision: 0.797696103436
final_wq_evaluation_output.txt Results on the test part of the WebQuestions dataset
Can be evaluated with the script above to produce the following output.
> python evaluation.py final_wq_evaluation_output.txt
Number of questions: 2032
Average recall over questions: 0.604353752656
Average precision over questions: 0.496435505027
Average f1 over questions: 0.494262559655
Accuracy over questions: 0.369094488189
F1 of average recall and average precision: 0.545104629829
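The last line of each output is the harmonic mean of the average precision and average recall reported above it, which can be verified directly; for example, for the WebQuestions output:

# Check: "F1 of average recall and average precision" is the harmonic
# mean of the two averages (values from the WebQuestions output above).
p, r = 0.496435505027, 0.604353752656
print(2 * p * r / (p + r))  # -> 0.5451046..., matching the reported value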
Error Analysis
We have manually classified the errors for a subset of questions from each dataset. The list is available as a Google Spreadsheet.
Aqqu Error Analysis (Google Spreadsheet)
For each dataset we looked at a sample of 50 questions that were not answered correctly. For Free917, a question is answered incorrectly if its F1 is smaller than 1.0; for WebQuestions, we considered an answer incorrect if the candidate with the highest F1 was not ranked first. Note: the spreadsheet lists all questions not answered correctly, and we analyzed a sample of these, i.e. some questions are without error analysis.
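The selection criteria above can be made concrete with a small sketch; the data structures below (F1 per question for Free917, ranked candidate F1 lists for WebQuestions) are assumptions for illustration, not our actual analysis code.

# Sketch of the error selection criteria described above.
import random

def free917_errors(results):
    # results: (question, f1) pairs; incorrect iff F1 < 1.0.
    return [q for q, f1 in results if f1 < 1.0]

def webquestions_errors(candidates):
    # candidates: question -> candidate F1 scores in rank order;
    # incorrect iff the top-ranked candidate lacks the highest F1.
    return [q for q, f1s in candidates.items() if f1s and f1s[0] < max(f1s)]

demo = [("q1", 1.0), ("q2", 0.5), ("q3", 0.0)]
print(free917_errors(demo))                    # -> ['q2', 'q3']
print(random.sample(free917_errors(demo), 2))  # 50 per dataset in our analysis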
Aqqu Demo
A running demo of our system. No guarantees regarding availability or performance.