
DSTC4

Fourth Dialog State Tracking Challenge
@ IWSDS2016

Pilot tasks: optional pilot tasks available at DSTC4

Pilot Task General Overview

In this fourth edition of the Dialog State Tracking Challenge, we will focus on human-human dialogs. In addition to the main task, we also propose a series of pilot tasks covering the core components of end-to-end dialog systems, based on the same dataset. More specifically, four pilot tasks are available in DSTC4. These pilot tasks are optional for all participants in the challenge:

* Spoken language understanding (SLU): The objective of this task is to tag a given utterance (either from the tourist or the tour guide) with speech acts and semantic slots.

* Speech act prediction (SAP): The objective of this task is to predict the speech act of the next turn imitating the policy of one speaker (either the tourist or the tour guide).

* Spoken language generation (SLG): The objective of this task is to generate a response utterance for one of the participants (either the tourist or the tour guide) by using the corresponding speech act and semantic slot information.

* End-to-end system (EES): The objective of this task is to develop an end-to-end system playing the part of a guide or a tourist by pipelining and/or combining different SLU, SAP and SLG systems.

Dataset General Description

In this challenge, participants will use the TourSG corpus to develop their components. TourSG consists of 35 dialog sessions on touristic information for Singapore, collected from Skype calls between three tour guides and 35 tourists. These 35 dialogs add up to 31,034 utterances and 273,580 words. All the recorded dialogs, with a total length of 21 hours, have been manually transcribed and annotated with speech act and semantic labels at the turn level.

Unlike the main task, in which dialog states are defined at the sub-dialog level and each sub-dialog has a frame structure with slot-value pairs representing the subject discussed within it, the pilot tasks provide annotations at the utterance level and, accordingly, systems must deal with slot-value pairs at the utterance level. Annotations at the utterance level involve both semantic slots and speech acts (see the example of reference annotations below).
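Purely as an illustration of what an utterance-level annotation involves, a single annotated utterance might be represented as follows. The field names and label values in this snippet are invented for the example; the authoritative annotation schema is the one defined in the pilot task handbook:

    {
      "speaker": "Guide",
      "transcript": "you could join the city tour that starts from Merlion Park",
      "speech_act": [{"act": "INI", "attributes": ["RECOMMEND"]}],
      "semantic_tags": [
        {"slot": "TOUR", "text": "city tour"},
        {"slot": "PLACE", "text": "Merlion Park"}
      ]
    }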

Evaluation General Description

Since the TourSG corpus is a collection of conversations between two specific roles, a tour guide and a tourist, the pilot tasks focus on modeling one of these two interlocutor roles. In this sense, each pilot task has two primary subtasks, one related to modeling the tour guide and the other related to modeling the tourist. Each subtask is best defined in terms of the input data the system should use and the output data it should produce:

* SLU-TOURIST subtask: in this case the input to the systems will be the utterances from both the tourist and the guide, and the system must produce both semantic tags (slot values) and speech acts for the tourist utterances only.

* SLU-GUIDE subtask: in this case the input to the systems will be the utterances from both the tourist and the guide, and the system must produce both semantic tags (slot values) and speech acts for the guide utterances only.

* SAP-TOURIST subtask: in this case the input to the systems will be the utterances and annotations (semantic tags and speech acts) from the guide along with the resulting semantic tags for the next tourist utterances, and the system must produce the speech acts for the corresponding tourist utterances.

* SAP-GUIDE subtask: in this case the input to the systems will be the utterances and annotations (semantic tags and speech acts) from the tourist along with the resulting semantic tags for the next guide utterances, and the system must produce the speech acts for the corresponding guide utterances.

* SLG-TOURIST subtask: in this case the input to the systems will be the semantic tags and speech acts from the tourist only, and the system must produce the final surface form for the tourist utterances.

* SLG-GUIDE subtask: in this case the input to the systems will be the semantic tags and speech acts from the guide only, and the system must produce the final surface form for the guide utterances.

* EES-TOURIST subtask: in this case the objective is to deploy a system able to model the tourist behavior in the dialogs. The input to the systems will be the guide utterances, and the system must produce the tourist utterances.

* EES-GUIDE subtask: in this case the objective is to deploy a system able to model the tour guide behavior in the dialogs. The input to the systems will be the tourist utterances, and the system must produce the guide utterances (a minimal sketch of such a pipelined system is given after this list).
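As a rough illustration of the pipelining idea behind the EES subtasks, the sketch below chains hypothetical SLU, SAP and SLG components into an EES-GUIDE system. All function names and signatures here are placeholders invented for the example; they are not part of any provided API:

    # Hypothetical end-to-end pipeline for the EES-GUIDE subtask:
    # each function stands in for a team's own trained component.

    def slu(utterance):
        """Tag a tourist utterance with speech acts and semantic slots."""
        raise NotImplementedError

    def sap(history):
        """Predict the speech act(s) of the next guide turn from the history."""
        raise NotImplementedError

    def slg(speech_acts, semantic_tags):
        """Generate the surface form of a guide utterance from its annotations."""
        raise NotImplementedError

    def ees_guide_turn(tourist_utterance, history):
        # Understand the tourist turn, decide what to say, then say it.
        understanding = slu(tourist_utterance)
        history.append(understanding)
        next_acts = sap(history)
        return slg(next_acts, understanding["semantic_tags"])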

Two different families of metrics will be used for evaluating the pilot tasks: classification accuracy metrics for the SLU and SAP tasks, and semantic similarity metrics for the SLG and EES tasks. For all subtasks in the pilot tasks, evaluation schedule 1 will be used (i.e., system outputs are evaluated at all turns). In DSTC4, the following evaluation metrics are used for the pilot tasks:

* SLU and SAP tasks:
-- Precision: Fraction of the semantic tags and/or speech acts generated by the system that are correct.
-- Recall: Fraction of semantic tags and/or speech acts in the gold standard that are correctly generated.
-- F-measure: The harmonic mean of precision and recall.
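As an informal sketch (not the official scoring script), the three scores above can be computed by comparing the set of tags a system outputs for an utterance against the gold-standard set:

    def precision_recall_f(system_tags, gold_tags):
        """Precision, recall and F-measure over sets of tags
        (semantic tags and/or speech acts)."""
        system_tags, gold_tags = set(system_tags), set(gold_tags)
        correct = len(system_tags & gold_tags)
        precision = correct / len(system_tags) if system_tags else 0.0
        recall = correct / len(gold_tags) if gold_tags else 0.0
        f_measure = (2 * precision * recall / (precision + recall)
                     if precision + recall else 0.0)
        return precision, recall, f_measure

    # Two of three generated tags are correct; the gold standard has four tags.
    print(precision_recall_f({"ACK", "INFO", "PLACE"},
                             {"ACK", "INFO", "PRICE", "TOUR"}))
    # -> (0.667, 0.5, 0.571), up to rounding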

* SLG and EES tasks:
-- BLEU: Geometric average of the n-gram precisions (for n = 1, 2, 3, 4) of the system-generated utterance with respect to the reference utterance.
-- AM-FM: Weighted mean of (1) the cosine similarity between the system-generated utterance and the reference utterance and (2) the normalized n-gram probability of the system-generated utterance.
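For concreteness, the following sketch computes a smoothed sentence-level BLEU score with NLTK, together with a toy bag-of-words cosine similarity in the spirit of the AM component of AM-FM (the actual metric computes the similarity in a latent semantic space, and the official evaluation setup may differ from this sketch):

    from collections import Counter
    from math import sqrt

    from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

    reference = "the tour starts at nine in the morning".split()
    hypothesis = "the tour starts at 9 am".split()

    # BLEU: geometric average of 1- to 4-gram precisions (NLTK's default
    # weights), smoothed so that short sentences do not collapse to zero.
    bleu = sentence_bleu([reference], hypothesis,
                         smoothing_function=SmoothingFunction().method1)

    # Toy stand-in for the AM component: cosine similarity between
    # bag-of-words count vectors of the two utterances.
    ref_bow, hyp_bow = Counter(reference), Counter(hypothesis)
    dot = sum(ref_bow[w] * hyp_bow[w] for w in ref_bow)
    norm = (sqrt(sum(c * c for c in ref_bow.values()))
            * sqrt(sum(c * c for c in hyp_bow.values())))
    cosine = dot / norm if norm else 0.0

    print(round(bleu, 3), round(cosine, 3))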

Regarding the operational aspects of the pilot task evaluation, a web-service (WS) implementation is required for a pilot task system to be evaluated. In this modality, participating teams are required to run their systems under a web-service architecture. The evaluation will be conducted by a master evaluation script that calls the corresponding web services at specified time slots during the evaluation dates.

In order to facilitate the evaluation of the pilot tasks, both server and client Python scripts are provided. These scripts are configured by default to be used with the development set, so each team can check that its system is working and reachable from outside its local network.
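The request/response protocol is fixed by the provided server and client scripts; purely to illustrate the general shape of such a service, the minimal JSON-over-HTTP endpoint below shows one way a team might wrap its system. The POST handler and the "utterance"/"labels" payload fields are invented for this sketch:

    import json
    from http.server import BaseHTTPRequestHandler, HTTPServer

    def my_system(utterance):
        # Dummy stand-in for an actual SLU/SAP/SLG/EES system.
        return []

    class PilotTaskHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            # Read the JSON payload sent by the evaluation client.
            length = int(self.headers.get("Content-Length", 0))
            request = json.loads(self.rfile.read(length) or b"{}")

            # Run the team's own model on the incoming turn.
            response = {"labels": my_system(request.get("utterance", ""))}

            body = json.dumps(response).encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    if __name__ == "__main__":
        # Listen on all interfaces so the master evaluation script can
        # reach the service from outside the local network.
        HTTPServer(("0.0.0.0", 8080), PilotTaskHandler).serve_forever()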

DSTC4 Pilot Task Handbook and Resources

A more comprehensive description of the pilot tasks, available datasets and evaluation protocol can be found in the official pilot task handbook: DSTC4 Pilot Task Handbook (V3).

Additional resources related to DSTC4 pilot tasks can be found in the Resources Page of this website.

Please check this page frequently for possible updates. Handbook and resource updates will also be announced through the DSTC4 mailing list. For instructions on how to subscribe to the mailing list, please refer to the Contact Information section of this website.