Bachelor Project: UniPal


Overview/Information


Project Type: Bachelor Project
Project Title: UniPal
Participants: Tanyu Tanev, Pascal Muckenhirn
Duration: 19.11.2018 - 23.06.2019
Supervisors: Claudius Korzen, Prof. Dr. Hannah Bast
Department: Chair of Algorithms and Data Structures
University: Albert Ludwigs University of Freiburg

Abstract


Have you ever tried to find information about a tutorial and gotten lost in the university webpages? That is one of the problems we tried to fix with UniPal - a chat bot designed to assist with everyday student problems, such as getting information about courses, lecturers and tutorials.
Instead of crawling the web yourself, you can now just ask UniPal and it'll fetch the information for you - just like any of your best pals1 would! You can try it out directly via the conversation window below. Just type in "When is the Maths II tutorial?" and see what happens!

UniPal



>>>


Usage


Supported questions

Below is a list of all the types of questions that can be posed to UniPal, together with examples of how to ask them! UniPal can answer all of these questions (and more) for any events from the two most recent semesters. These include:
  • Introductory lectures
  • Major lectures
  • Specialized lectures

Keep in mind that the first 8 query types support semester specification! That means you can add a <semester> at the end of the example input and the chat bot will process the query for exactly that <semester>! What this looks like can be seen in cases 10 and 11.
For cases 9 and 12, there is also the possibility to add the day for which the information should be given.
  1. How many ECTS points a lecture gives
    >>> How many points does Information Retrieval give?

  2. The lecturer who holds a lecture
    >>> Who holds Information Retrieval?

  3. The language in which a lecture is held
    >>> In which language is Information Retrieval given?

  4. Time and location of a lecture
    >>> (When and where|When|Where) is Information Retrieval held?

  5. All subjects given by a lecturer
    >>> What lectures does Hannah Bast hold?

  6. All subjects held in a specific language
    >>> What lectures are held in german?

  7. All subjects held in a specific language by a given lecturer
    >>> What lectures are held in german by Hannah Bast?

  8. Whether a given lecture is available in a given language
    >>> Is Information Retrieval held in german?

  9. Information about events held by the TF Fachschaft
    >>> What fachschaft events are there in the near future?

  10. Organisational information about the time frame of a semester
    >>> How long is SS 2019?

  11. Organisational information about the holidays of a semester
    >>> When are the holidays in SS 2019?

  12. Meal plans of canteens for a given day
    >>> What is there to eat in Mensa Flugplatz today?

Functionality

We tried to make the conversation with UniPal as smooth and natural as we could! As such, here are a couple of cool functionalities it supports:
  1. UniPal remembers what you want to ask, even when it isn't yet clear what you're talking about:

    >>> So how many ECTS does it give?
    (UniPal) Which subject are you talking about exactly?
    >>> Stochastics
    (UniPal) Stochastik für Studierende der Informatik - Vorlesung (SS 2019) gives 6 ECTS points.

    >>> And how many ECTS points does Informatik III give?
    (UniPal) Can you tell me in which semester the subject was held exactly? I found it in more than one:

    1) WS_1718
    2) WS_1819
    3) WS_1617

    >>> WS 1819
    (UniPal) Informatik III: Theoretische Informatik (WS 1819) gives 8 points.

  2. UniPal can follow your train of thought and remembers the conversation context:

    ... (continuation from above)
    >>> And which lecturer holds it?
    (UniPal) The lecturers of Informatik III: Theoretische Informatik (WS 1819) are Andreas Podelski, Matthias Heizmann & Christian-Matthias Schilling.

  3. UniPal can make guesses about what you'll say next and ask you about it (assuming that the current semester is still WS 1819):

    >>> How many points does programming in C give?
    (UniPal) Was the subject held in SS 2019? (NOTE: SS 2019 != current semester)
    >>> Yes, it was!
    (UniPal) Programmieren in C (SS 2019) gives 6 points.

  4. In addition to 2), UniPal also remembers the semester of the subject you last asked about, if it is not the most recent one:

    ... (continuation from above)
    >>> And when and where was it?
    (UniPal) Programmieren in C (SS 2019) is held on:

    Di 10:00-12:00 @ 101 HS 026


  5. Another addition to 2) is the possibility to ask follow-up questions for a different semester:

    ... (assuming the stored lecture is "computer science III: theoretical computer science" and the stored semester is WS 1718)
    >>> And who held it in WS 1819?
    (UniPal) The lecturers of Informatik III: Theoretische Informatik (WS 1819) are Andreas Podelski, Matthias Heizmann & Christian-Matthias Schilling.

Approach


The idea of answering questions about events offered at the technical faculty brings three main problems:
  1. Intent Classification (IC): Understanding the intent - what does the user want to know/do?
  2. Named Entity Recognition (NER): Understanding the subject(s) of attention - what is the user talking ABOUT?
  3. Answer Computation (AC): Computing the correct answer that correlates with the identified intent and subject(s).
For an illustration of all three problems, consider the example query "How many ECTS points does Introduction to Programming in WS 2018/19 give?" IC is the problem of understanding that the user wants to know the ECTS points of a lecture; NER is the problem of understanding that the user wants to know this about the lecture Introduction to Programming (in the winter semester 2018/19); and AC is the problem of computing and returning the correct answer, 6.


For IC, we use a deep learning model based on a bidirectional LSTM2 architecture using ELMo word embeddings3. This allows for a more flexible understanding of questions written in natural language - for example, detecting that the two following questions are equivalent:
"How many points does Introduction to Programming in WS 2018/19 give?"

is equivalent to

"Introduction to Programming - how many ECTS in WS 2018/19?"

To train the model, we built up training data - roughly 415,000 samples in total. It is generated on the basis of manually-written templates of the form "What lectures does <lect> hold?", with <lect> being a placeholder for the names of lecturers. Analogously, there are such templates for every intent UniPal supports. During the creation of the training data, these placeholders inside each template are replaced by courses, lecturers, languages, semesters, canteens and days. The tags for the courses, lecturers and languages are extracted and formatted from the VVZ API beforehand. Here are some example tags of these types:

EVENT_MICRO
LECTURER_HANNAH_BAST
LANGUAGE_DEUTSCH


The other types of tags, including semesters, canteens and days, are handcrafted and look like this:

SEMESTER_SS_2018
MENSA_FLUGPLATZ
DAY_TODAY


The training data looks like this:
how many credits does micro give in ss 2018? #TABULATOR# query_subject_credit_points
can you please tell me if there are any events offered in german by hannah bast? #TABULATOR# query_language_lecturer_events
do you know about ss 2018 free days? #TABULATOR# query_organisation_semester_free_days
The part on the left of the #TABULATOR# is the input for the model and the part on the right of it is the expected output.
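To illustrate how such training lines come about, here is a small sketch of the template expansion step. The concrete template strings and value lists are made up for the example; only the placeholder idea and the #TABULATOR# separator reflect the actual data format.

# Sketch of generating IC training lines from templates (illustrative templates and values).
templates = [
    ("what lectures does <lect> hold?", "query_lecturer_events"),
    ("how many credits does <subj> give in <sem>?", "query_subject_credit_points"),
]
lecturers = ["hannah bast"]         # extracted from the VVZ API in practice
subjects = ["micro"]                # extracted from the VVZ API in practice
semesters = ["ss 2018", "ws 1819"]  # handcrafted

def expand(text, intent):
    # Replace every placeholder with every known value of its type.
    variants = [text]
    for placeholder, values in (("<lect>", lecturers), ("<subj>", subjects), ("<sem>", semesters)):
        if placeholder in text:
            variants = [v.replace(placeholder, value) for v in variants for value in values]
    return [f"{v} #TABULATOR# {intent}" for v in variants]

for text, intent in templates:
    for line in expand(text, intent):
        print(line)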


For NER, we use the same architecture as for the IC. The idea behind generating the training data is also identical, but the data looks a bit different:
how many credits does <subj>micro<subj> give in <sem>ss#2018<sem>? #TABULATOR# EVENT_MICRO SEMESTER_SS_2018
do you know which events <lect>hannah#bast<lect> holds? #TABULATOR# LECTURER_HANNAH_BAST
do you know about <sem>ss#2018<sem> free days? #TABULATOR# SEMESTER_SS_2018
As above, the part on the left of the #TABULATOR# is the input for the model and the part on the right of it is the expected output.

Additionally, the tags in the NER training data have to be formatted in the way seen above. Every named entity has to be surrounded by a pair of angle-bracket tags, which contain information about the type of the entity. In the first sentence shown above, those are the <subj><subj> brackets around micro and the <sem><sem> brackets around ss#2018. The latter also showcases how blank space is replaced by the # character. This formatting is needed because there can be more than one named entity per user sentence, and it shows the deep learning model where each entity is positioned.
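As a small illustration of this format, the following sketch extracts the annotated entities from such a training sentence. The mapping from bracket types to tag prefixes is simplified and assumed for the example.

import re

# Sketch: pull the annotated entities out of an NER training sentence (illustrative only).
TYPE_PREFIX = {"subj": "EVENT", "sem": "SEMESTER", "lect": "LECTURER"}  # assumed mapping

def extract_tags(sentence):
    tags = []
    for tag_type, value in re.findall(r"<(\w+)>(.*?)<\1>", sentence):
        # '#' stands for a blank inside a multi-word entity, e.g. ss#2018.
        tags.append(TYPE_PREFIX[tag_type] + "_" + value.replace("#", "_").upper())
    return tags

line = "how many credits does <subj>micro<subj> give in <sem>ss#2018<sem>?"
print(extract_tags(line))  # ['EVENT_MICRO', 'SEMESTER_SS_2018']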


For AC, we use the VVZ API of the University of Freiburg. It is a lecture directory where you can look up information about courses, for example their language or their lecturer. It is possible to ask the API for a certain course, a lecturer or a language. In total, the lecture directory consists of 3,946 courses (as of 07.06.2019).

Based on the subject(s) recognised by the NER, a request query is built. It is then sent to the VVZ API in order to get all available information about the given subject. With the help of the intent recognised by the IC, the right piece of information is then extracted. Subsequently, a human-readable response is created through manually-written response templates. This works analogously to the generation of the training data - there are response templates with placeholders in them, which are replaced with the relevant information.
As an example of how the overall pipeline architecture works, this is how the first sentence above is processed:
  1. The sentence is first fed to the NER model, which recognizes the subject of the user's attention:
    Input: How many points does Introduction to Programming in WS 2018/19 give?
    Output: EVENT_INTRODUCTION_TO_PROGRAMMING SEMESTER_WS_1819
  2. Then the IC model classifies the user intent:
    Input: How many points does Introduction to Programming in WS 2018/19 give?
    Output: query_subject_credit_points
  3. The pipeline checks if the classified intent and named entities make sense together and asks the user for additional information, if required. For the sake of brevity, we have chosen an example which contains full context, so the pipeline can proceed safely.
  4. A query is generated and sent to the VVZ (Vorlesungsverzeichnis) API.
  5. The API response is parsed and the respective information (in this case - ECTS points) is given back in the form of a randomly chosen template:
    (UniPal) Introduction to Programming (WS 2018/19) gives 6 ECTS points.
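Putting these steps together, the pipeline could be sketched roughly as follows. The function names, the API endpoint and the response fields are placeholders chosen for illustration, not the actual implementation.

import random
import requests

RESPONSE_TEMPLATES = {
    # One of several manually-written templates per intent (illustrative wording).
    "query_subject_credit_points": ["{subject} ({semester}) gives {ects} ECTS points."],
}

def answer(question, classify_intent, recognize_entities):
    intent = classify_intent(question)        # IC model, e.g. "query_subject_credit_points"
    entities = recognize_entities(question)   # NER model, e.g. {"subject": ..., "semester": ...}

    # Build and send a request to the VVZ API (placeholder URL and parameters).
    course = requests.get("https://example.org/vvz/api/courses",
                          params={"name": entities["subject"],
                                  "semester": entities["semester"]}).json()

    # Extract the field that matches the intent and fill a randomly chosen response template.
    template = random.choice(RESPONSE_TEMPLATES[intent])
    return template.format(subject=course["title"],
                           semester=course["semester"],
                           ects=course["ects"])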

Evaluation


For the evaluation, we first measured the performance of the IC and NER models independently and then their combined performance when used in the pipeline architecture. We're going to discuss, one by one, what test data and metrics we used, how the evaluation was executed and what the results mean, beginning with the IC.

The test data for all three cases is a 5% subset of all data, which translates to roughly 22,000 samples. It's generated using the same idea as above. See Usage - IC for further information.


For the IC we used test data in the following format:

Input: Do you know how many credit points neuroscience for engineers gives in ss 2018?
Output: query_subject_credit_points

Before explaining what metrics we used for the evaluation and what they mean for the performance of the model, we're going to introduce how the evaluation itself was done: test data was fed into the model one sentence at a time and the expected intent was compared with the predicted one. There are three possible outcomes, and they are explained in the example below:

We're going to assume the model can only classify two user intents: query_subject_credit_points and query_subject_lecturer. The principle behind this example is easily extendable to more classes, but for the sake of clarity, we're going to stick to this 2-intent model. Now, consider the three following sentences as test data:
how many credits does micro give in ss 2018? #TABULATOR# query_subject_credit_points
who holds computer networks in ss 2019? #TABULATOR# query_subject_lecturer
how many points does maths II give? #TABULATOR# query_subject_credit_points
Next, assume we feed those sentences to the IC model one by one, and get the following three respective results:

how many credits does micro give in ss 2018? Expected: query_subject_credit_points Prediction: query_subject_credit_points (True Positive)
who holds computer networks in ss 2019? Expected: query_subject_lecturer Prediction: query_subject_lecturer (True Positive)
how many points does maths II give? Expected: query_subject_credit_points Prediction: query_subject_lecturer (False Negative + False Positive)

The labels at the end of each line categorize the results:

True Positive (TP) result - the expected and the predicted intent are the same (Examples are the correct classifications of the first two sentences)
False Negative (FN) result - the predicted intent is not the same as the expected one (query_subject_credit_points wasn't classified for the third sentence)
False Positive (FP) result - an intent was falsely predicted (query_subject_lecturer was classified for the third sentence, even though it shouldn't have been)

The metrics we used to measure the performance are precision, recall and F1 score. All three are calculated on the basis of the number of TP, FN and FP samples:

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 Score = 2 * (precision * recall) / (precision + recall)

Note that the F1 score is a combined metric based on the former two and is used to give an idea of the overall performance of the model.
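Applied to the three-sentence example above (TP = 2, FP = 1, FN = 1), this gives:

Precision = 2 / (2 + 1) ≈ 0.67
Recall = 2 / (2 + 1) ≈ 0.67
F1 Score = 2 * (0.67 * 0.67) / (0.67 + 0.67) ≈ 0.67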

A good precision score for the IC model would mean that it returns few false results. This is important, as there isn't any post-processing of the predictions of the model - UniPal just assumes that the predicted intent is the correct one. If precision has a low value, the user would often be misunderstood regarding what they want to know exactly. A good recall score, on the other hand, would mean that UniPal isn't too picky when classifying intents. This attribute is also desired from the IC model, as not all intents have the same amount of test samples - for some problems, there are many more ways to pose a question than for others. Recall helps by giving an idea about whether all classes, independent of sample size, are classified correctly.

For an even better visualization of the concept of precision and recall, please refer to this excellent article, written by Andreas Klintberg:
Explaining precision and recall

As to how we measured the performance of the model itself, we first computed the metrics for each intent separately. Afterwards, we used two methods to get the average of the metrics4:
  • Macro-averaging - take the average of a metric over all classes
  • Micro-averaging - calculate the metric with regard to the TP, FN and FP samples pooled across all classes
Macro-averaging weights every class equally, regardless of its sample size, and therefore increases the impact of improperly classified intents with a smaller sample size. Micro-averaging, on the other hand, weights every sample equally, so it reflects the overall performance across all queries. Reporting both is helpful, as the macro-averaged scores give further information about whether the more rarely asked questions will also be classified correctly.
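The difference between the two averaging schemes can be illustrated with a short sketch; the per-class counts below are made up for the example.

# Sketch of macro- vs. micro-averaged precision over per-class TP/FP counts (illustrative numbers).
counts = {
    "query_subject_credit_points": {"tp": 900, "fp": 10},
    "query_organisation_semester_free_days": {"tp": 40, "fp": 20},  # small, poorly classified class
}

# Macro: average the per-class precisions - every class weighs the same.
macro = sum(c["tp"] / (c["tp"] + c["fp"]) for c in counts.values()) / len(counts)

# Micro: pool all TP/FP counts first - large classes dominate.
total_tp = sum(c["tp"] for c in counts.values())
total_fp = sum(c["fp"] for c in counts.values())
micro = total_tp / (total_tp + total_fp)

print(round(macro, 2), round(micro, 2))  # 0.83 0.97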

After running the evaluation, the results below reflect the performance of our IC model. The high scores across the board give us confidence in the language flexibility and accuracy of the IC model.
            Macro-averaged  Micro-averaged
Precision   0.93            0.99
Recall      0.93            0.99
F1 Score    0.93            0.99

For the NER we used test data in the following format:

Input: Do you know how many credit points neuroscience for engineers gives in ss 2018?
Output: EVENT_NEUROSCIENCE_FOR_ENGINEERS SEMESTER_SS_2018

In contrast to the IC model, the NER model can be multi-output, as shown in the example above. This means that there can be more than one correct answer, as the user can give multiple named entities in order to clarify the context of their question. Because of this, we use a significance threshold, which helps UniPal determine whether a given prediction is correct or not.
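How such multi-output predictions can be scored is sketched below; the threshold value and the shape of the model output are assumptions made for illustration.

# Sketch of thresholding multi-output NER predictions and counting TP/FP/FN (assumed threshold).
THRESHOLD = 0.5  # assumed significance threshold

def score(predicted_confidences, expected_tags):
    # Keep only the tags whose confidence passes the significance threshold.
    predicted = {tag for tag, conf in predicted_confidences.items() if conf >= THRESHOLD}
    expected = set(expected_tags)
    tp = len(predicted & expected)  # correctly recognised tags
    fp = len(predicted - expected)  # tags predicted although not expected
    fn = len(expected - predicted)  # expected tags that were missed
    return tp, fp, fn

print(score({"EVENT_NEUROSCIENCE_FOR_ENGINEERS": 0.9, "SEMESTER_SS_2018": 0.8, "LECTURER_HANNAH_BAST": 0.1},
            ["EVENT_NEUROSCIENCE_FOR_ENGINEERS", "SEMESTER_SS_2018"]))  # (2, 0, 0)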

The metrics used are the same as the ones listed above for the evaluation of the IC model and they are once again based on TP, FP and FN results. A good precision score would indicate that most of the recognised tags have been correctly classified. This is important, as a named entity falsely classified with high confidence would pass the significance threshold and lead to a faulty output. A good recall score would indicate that the model isn't too picky when classifying named entities. This is once again connected to the proper classification of named entities that are used more rarely. Recall is especially important for the NER model, as there are a lot of different named entities, as described in the "Approach" section.

The method of evaluation is the same as the one described above for the evaluation of the IC model, with the difference being that the expected results are named entities, and not intents. The results below depict a stellar performance of the NER model, both in accuracy and language flexibility:
            Macro-averaged  Micro-averaged
Precision   0.94            0.96
Recall      0.95            0.98
F1 Score    0.94            0.97

For the Pipeline we used test data in the following format:

Input: Do you know how many credit points neuroscience for engineers gives in ss 2018?
Output: query_subject_credit_points EVENT_NEUROSCIENCE_FOR_ENGINEERS SEMESTER_SS_2018

The metrics, their significance and the method of evaluation are analogous to the ones listed above. The only difference is that the expected results are both intents and named entities. This makes sure that UniPal uses both models to correctly classify all the needed information when used in practice.

The results below give a representation of UniPal's overall performance. The scores show that it is able to correctly classify a high percentage of user inquiries. As the responses of UniPal depend only on the classified intent and named entities, this also implies that the chat bot is able to give satisfying answers to the posed questions.
            Macro-averaged  Micro-averaged
Precision   0.94            0.97
Recall      0.96            0.98
F1 Score    0.94            0.98

Code