The Spoken CALL Shared Task is an initiative to create an open challenge dataset for speech-enabled CALL systems, jointly organised by the University of Geneva, the University of Birmingham, Radboud University and the University of Cambridge. The task is based on data collected from a speech-enabled online tool which has been used to help young Swiss German teens practise skills in English conversation. Items are prompt-response pairs, where the prompt is a piece of German text and the response is a recorded English audio file. The task is to label pairs as “accept” or “reject”, accepting responses which are grammatically and linguistically correct, so as to match a set of hidden gold-standard answers as closely as possible. Resources are provided so that a scratch system can be constructed with a minimal investment of effort, and in particular without necessarily using a speech recogniser.
The first edition of the Shared Task was carried out in 2017, with results presented at the SLaTE 2017 workshop in Stockholm. The second edition, with improved training data and improved baseline recogniser resources, was carried out in 2018, results this time being presented at a special session of Interspeech 2018 in Hyderabad. Details, including full results and links to all the papers, are available from the Shared Task 1 site and the Shared Task 2 site.
The third edition of the Shared Task will make available the same training data and resources as the second edition. There will be new test data. Given the strong results reported in the second edition, we are also making an important change: THE THIRD EDITION WILL USE A NEW METRIC. This new metric, Dfull, is defined in the instructions tab and motivated in §5 of the Interspeech 2018 overview paper. Unlike the metric used in the two previous editions of the Shared Task, which focused on optimizing performance for correct student responses (i.e. responses which should be accepted), Dfull places equal weight on correct and incorrect utterances. Since incorrect responses are considerably harder to process than correct ones, we expect Dfull to pose interesting new challenges.
Results for anonymised submissions
Id | Task | Pr | Rec | F | SA | IRej | CRej | D | DA | Dfull |
---|---|---|---|---|---|---|---|---|---|---|
BaselinePerfectRec | Text | 0.977 | 0.907 | 0.940 | 0.916 | 0.940 | 0.093 | 10.080 | 15.075 | 12.327 |
GGG | Speech | 0.901 | 0.935 | 0.918 | 0.879 | 0.736 | 0.065 | 11.348 | 3.544 | 6.342 |
HHH | Speech | 0.884 | 0.946 | 0.914 | 0.873 | 0.689 | 0.054 | 12.750 | 3.043 | 6.229 |
III | Speech | 0.883 | 0.945 | 0.913 | 0.871 | 0.688 | 0.055 | 12.416 | 3.027 | 6.130 |
OOO | Speech | 0.895 | 0.923 | 0.909 | 0.867 | 0.724 | 0.077 | 9.401 | 3.346 | 5.608 |
PPP | Speech | 0.895 | 0.923 | 0.909 | 0.867 | 0.724 | 0.077 | 9.401 | 3.346 | 5.608 |
NNN | Speech | 0.896 | 0.919 | 0.907 | 0.865 | 0.726 | 0.081 | 8.950 | 3.350 | 5.476 |
CCC | Speech | 0.879 | 0.932 | 0.905 | 0.860 | 0.681 | 0.068 | 10.082 | 2.925 | 5.430 |
AAA | Speech | 0.879 | 0.924 | 0.901 | 0.855 | 0.685 | 0.076 | 9.046 | 2.930 | 5.149 |
BBB | Speech | 0.890 | 0.905 | 0.898 | 0.852 | 0.716 | 0.095 | 7.567 | 3.185 | 4.909 |
FFF | Text | 0.892 | 0.878 | 0.885 | 0.836 | 0.729 | 0.122 | 5.998 | 3.247 | 4.413 |
DDD | Text | 0.885 | 0.886 | 0.886 | 0.837 | 0.713 | 0.114 | 6.280 | 3.087 | 4.403 |
EEE | Text | 0.893 | 0.865 | 0.879 | 0.828 | 0.736 | 0.135 | 5.449 | 3.280 | 4.227 |
Baseline | Text | 0.892 | 0.858 | 0.875 | 0.823 | 0.734 | 0.142 | 5.176 | 3.232 | 4.090 |
MMM | Text | 0.891 | 0.851 | 0.871 | 0.819 | 0.736 | 0.149 | 4.953 | 3.229 | 3.999 |
KKK | Text | 0.891 | 0.847 | 0.868 | 0.816 | 0.736 | 0.153 | 4.822 | 3.213 | 3.936 |
LLL | Text | 0.890 | 0.843 | 0.866 | 0.813 | 0.736 | 0.157 | 4.697 | 3.198 | 3.876 |
JJJ | Text | 0.659 | 0.900 | 0.761 | 0.649 | 0.236 | 0.100 | 2.356 | 1.177 | 1.665 |
The scores have been calculated using the scoring script.
This data is also available as a CSV file in the downloads tab.
The Shared Task is organised by the following people (in alphabetical order):
Claudia Baur, FTI/TIM, Université de Genève
Andrew Caines, University of Cambridge
Cathy Chua, Independent researcher
Johanna Gerlach, FTI/TIM, Université de Genève
Mengjie Qian, Department of Electronic, Electrical and Systems Engineering, University of Birmingham
Manny Rayner, FTI/TIM, Université de Genève
Martin Russell, Department of Electronic, Electrical and Systems Engineering, University of Birmingham
Helmer Strik, Centre for Language Studies (CLS), Radboud University Nijmegen
Xizi Wei, Department of Electronic, Electrical and Systems Engineering, University of Birmingham
In the speech version of the CALL shared task, each item consists of a prompt, in the form of a piece of German text, and a response, in the form of an audio file containing recorded English speech.
The data was collected from an online CALL tool used to help young Swiss German students improve their English fluency.
The task is to create software that will decide whether each response is appropriate (accept) or inappropriate (reject) in the context of the prompt. This will presumably require a combination of speech recognition and text processing methods. A response is considered appropriate if it both responds to the prompt in terms of meaning and is correct English. For example, if the prompt is
"Frag: rote Stiefel"
("Ask for: red boots"), then "I would like some red boots" or "Red boots, please" are appropriate responses. "Give me brown boots" is inappropriate because it has the wrong meaning. "I wants red boots" is inappropriate because it is incorrect English.
The task is open-ended; there are many potentially appropriate responses to each prompt.
In this version of the task, no explicit attention is paid to quality of pronunciation.
The training data has been created in two tranches. The original data, created for the first edition of the task, was hand-annotated by three native speakers. The original speech task training release directory contains the following resources:
Id | Prompt | Wavfile | Transcription | language | meaning |
11336 | Frag: rote Stiefel | 11336.wav | i'd like red boots | correct | correct |
7068 | Frag: Wie viel kostet es? | 7068.wav | how many is it | incorrect | incorrect |
8774 | Frag: Ich möchte die Rechnung | 8774.wav | i want the bills | incorrect | correct |
There is one line for each audio file. The specific CSV format is UTF-8, tab separated.
The 'language' column contains the word "correct" if the response has been judged fully correct in terms of both language and meaning by the human annotators. The 'meaning' column contains the word "correct" if the response has been judged correct in terms of meaning, but not necessarily language.
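To make the format concrete, here is a minimal Python sketch that loads a metadata file of this kind with the standard csv module; the filename is a placeholder, the sketch assumes the header row shown above is present in the file, and the rule that an item counts as "accept" exactly when its 'language' judgement is "correct" follows the definition just given.

import csv

METADATA_FILE = "training_metadata.csv"   # placeholder name: adjust to the file in the release

def load_items(path):
    """Read the tab-separated, UTF-8 metadata file into a list of dicts,
    assuming the first line is the header row (Id, Prompt, Wavfile, ...)."""
    with open(path, encoding="utf-8") as f:
        return list(csv.DictReader(f, delimiter="\t"))

items = load_items(METADATA_FILE)
for item in items:
    # An item should be accepted when it is fully correct in both language
    # and meaning, i.e. when the 'language' column contains "correct".
    gold = "accept" if item["language"] == "correct" else "reject"
    print(item["Id"], item["Wavfile"], gold)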
The new data added for the second edition of the task leverages systems developed for the first edition: an improved annotation process was used in which each item has been annotated both by machines and by humans. The metadata for the second edition of the task contains an extra column giving a summary of this annotation information, and the data has also been divided into three groups of descending reliability (A highest, C lowest). A brief summary of the annotation process is given in the Shared Task 2 release notes.
The second edition speech task training release directory contains the following resources:
Id | Prompt | Wavfile | Transcription | language | meaning | Trace |
11336 | Frag: rote Stiefel | 11336.wav | i'd like red boots | correct | correct | H: 3-1 M: 3-0 |
7068 | Frag: Wie viel kostet es? | 7068.wav | how many is it | incorrect | incorrect | H: 0-4 M: 0-0 |
8774 | Frag: Ich möchte die Rechnung | 8774.wav | i want the bills | incorrect | correct | H: 2-2 M: 0-3 |
Again, there is one line for each audio file. The specific CSV format is UTF-8, tab separated.
The 'language' column contains the word "correct" if the response has been judged fully correct in terms of both language and meaning. The 'meaning' column contains the word "correct" if the response has been judged correct in terms of meaning, but not necessarily language.
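If it is useful to unpack the Trace column programmatically, a small sketch like the following will split an entry such as "H: 3-1 M: 3-0" into its human (H) and machine (M) number pairs; the reading of each pair as counts of annotations for and against the item is an assumption here, and the Shared Task 2 release notes remain the authoritative description.

import re

# Matches trace strings of the form "H: 3-1 M: 3-0".
TRACE_PATTERN = re.compile(r"H:\s*(\d+)-(\d+)\s+M:\s*(\d+)-(\d+)")

def parse_trace(trace):
    """Split a Trace entry into human and machine number pairs.

    Interpreting each pair as (annotations in favour, annotations against)
    is an assumption; see the Shared Task 2 release notes for details."""
    m = TRACE_PATTERN.match(trace.strip())
    if m is None:
        raise ValueError(f"Unrecognised trace format: {trace!r}")
    h_for, h_against, m_for, m_against = map(int, m.groups())
    return {"human": (h_for, h_against), "machine": (m_for, m_against)}

print(parse_trace("H: 3-1 M: 3-0"))   # {'human': (3, 1), 'machine': (3, 0)}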
The speech task test release directory, which will be released on Jan 31 2018, will contain the following resources:
Id | Prompt | Wavfile |
11336 | Frag: rote Stiefel | 11336.wav |
7068 | Frag: Wie viel kostet es? | 7068.wav |
8774 | Frag: Ich möchte die Rechnung | 8774.wav |
i.e. like the training data but without transcriptions, judgements or annotation information. The specific CSV format is UTF-8, tab separated.
Groups who wish to submit an entry to the shared task should submit a CSV file, produced by running their system over the test data. The format should be the same as the test data, but with an extra column called Judgement added, in which the possible values are 'accept' and 'reject'. For example:
Id | Prompt | Wavfile | Judgement |
11336 | Frag: rote Stiefel | 11336.wav | accept |
7068 | Frag: Wie viel kostet es? | 7068.wav | reject |
8774 | Frag: Ich möchte die Rechnung | 8774.wav | reject |
There should be one line for each line in the test data. The specific CSV format is once more UTF-8, tab separated.
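As a rough illustration, the following Python sketch writes a submission file in this format; the items argument is the test data read as a list of dicts (for instance with the loader sketched earlier), and judge is a hypothetical placeholder for your own accept/reject decision function.

import csv

def write_submission(items, judge, out_path="submission.csv"):
    """Write a tab-separated, UTF-8 submission spreadsheet.

    items: list of dicts with keys Id, Prompt, Wavfile (the test data rows).
    judge: your own decision function (hypothetical here), returning
           "accept" or "reject" for a given prompt and wavfile."""
    fieldnames = ["Id", "Prompt", "Wavfile", "Judgement"]
    with open(out_path, "w", encoding="utf-8", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames, delimiter="\t")
        writer.writeheader()   # mirrors the header row of the test spreadsheet
        for item in items:
            writer.writerow({
                "Id": item["Id"],
                "Prompt": item["Prompt"],
                "Wavfile": item["Wavfile"],
                "Judgement": judge(item["Prompt"], item["Wavfile"]),
            })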
Answer spreadsheets should be submitted by email to johanna.gerlach@unige.ch and emmanuel.rayner@unige.ch.
Important: note that the metric has been changed compared to the one used in the first and second editions of the Spoken CALL Shared Task. The reasons for changing it are explained in §5 of the Interspeech 2018 paper; basically, the old metric has topped out.
The metric used to score the results is based on three intuitions: incorrect responses should be rejected; correct responses should be accepted; and accepting a response that contains a language error is worse than accepting one which is linguistically correct but has the wrong meaning.
The metric is defined as follows (there is further discussion in §5 of the Interspeech 2018 paper):
Each system response falls into one of five categories:
- Correct Accept (CA): a correct response that the system accepts
- False Reject (FR): a correct response that the system rejects
- Correct Reject (CR): an incorrect response that the system rejects
- Plain False Accept (PFA): an incorrect response that the system accepts, where the response is linguistically correct but has the wrong meaning
- Gross False Accept (GFA): an incorrect response that the system accepts, where the response contains a language error
Define CR, CA, FR, PFA and GFA to be the number of utterances in each of the above categories, and put FA = PFA + k·GFA, where k is a weighting factor that makes gross false accepts relatively more important. Then we define the differential response score on rejects, D, to be the ratio of the reject rate on incorrect utterances to the reject rate on correct utterances:
D = ( CR/(CR + FA) ) / ( FR/(FR + CA) ) = CR(FR + CA) / ( FR(CR + FA) )
Similarly, we define the differential response score on accepts, DA, to be the ratio of the accept rate on correct utterances to the accept rate on incorrect utterances:
DA = ( CA/(FR + CA) ) / ( FA/(CR + FA) ) = CA(CR + FA) / ( FA(FR + CA) )
Finally, we define the balanced differential response rate, Dfull, to be the geometric mean of D and DA. Following the derivation in §5 of the Interspeech 2018 paper, Dfull reduces to the simple formula:
Dfull = √( (CA·CR) / (FA·FR) )
We will use Dfull as the metric for evaluating the quality of systems competing in the shared task, with the weighting factor k set equal to 3.
Important: In order to prevent "gaming" of the metric by concentrating on only one out of D and DA while ignoring the other, entries are also required to reject at least 50% of all incorrect responses and accept at least 50% of all correct responses.
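For concreteness, here is a short Python sketch of the metric exactly as defined above, with the weighting factor k = 3, together with a check of the 50% accept/reject constraints; the counts in the example call are invented purely for illustration.

from math import sqrt

def dfull(ca, cr, fr, pfa, gfa, k=3):
    """Compute D, DA and Dfull from the five category counts.

    ca = correct accepts, cr = correct rejects, fr = false rejects,
    pfa = plain false accepts, gfa = gross false accepts."""
    fa = pfa + k * gfa                        # weighted false accepts
    d = (cr / (cr + fa)) / (fr / (fr + ca))   # differential response score on rejects
    da = (ca / (fr + ca)) / (fa / (cr + fa))  # differential response score on accepts
    return d, da, sqrt((ca * cr) / (fa * fr))

def meets_constraints(ca, cr, fr, pfa, gfa):
    """Check the anti-gaming requirement: reject at least 50% of incorrect
    responses and accept at least 50% of correct ones (unweighted counts)."""
    reject_rate_incorrect = cr / (cr + pfa + gfa)
    accept_rate_correct = ca / (ca + fr)
    return reject_rate_incorrect >= 0.5 and accept_rate_correct >= 0.5

# Illustrative counts only (not taken from the task data):
d, da, df = dfull(ca=800, cr=150, fr=60, pfa=30, gfa=10)
print(round(d, 3), round(da, 3), round(df, 3), meets_constraints(800, 150, 60, 30, 10))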
A baseline recogniser for the task, built using the popular Kaldi platform, will soon be available from the downloads tab.
A sample grammar, based on the one in the app used to collect the data, is provided as part of the release. The grammar is in XML format, and associates each possible prompt with a translation of the prompt into English and a set of possible responses.
A typical entry looks like this:
<prompt_unit>
<prompt>Sag: Ich möchte am Montagmorgen abreisen</prompt>
<translated_prompt>Ask for: I want to leave on monday morning</translated_prompt>
<response>i need to leave on monday morning</response>
<response>i need to leave on monday morning please</response>
<response>i should like to leave on monday morning</response>
<response>i should like to leave on monday morning please</response>
<response>i want to leave on monday morning</response>
<response>i want to leave on monday morning please</response>
<response>i would like to leave on monday morning</response>
<response>i would like to leave on monday morning please</response>
<response>i'd like to leave on monday morning</response>
<response>i'd like to leave on monday morning please</response>
</prompt_unit>
Important: the sample grammar is NOT INTENDED TO BE COMPLETE. As already noted, the task is open-ended.
In the text version of the CALL shared task, each item consists of a prompt, in the form of a piece of German text, and a response, in the form of the speech recognition result obtained from the student's recorded English audio.
The data was collected from an online CALL tool used to help young Swiss German students improve their English fluency.
The task is to create software that will decide whether each response is appropriate (accept) or inappropriate (reject) in the context of the prompt. This will require some kind of text processing method. A response is considered appropriate if it both responds to the prompt in terms of meaning and is correct English. For example, if the prompt is
"Frag: rote Stiefel"
("Ask for: red boots"), then "I would like some red boots" or "Red boots, please" are appropriate responses. "Give me brown boots" is inappropriate because it has the wrong meaning. "I wants red boots" is inappropriate because it is incorrect English.
The task is open-ended; there are many potentially appropriate responses to each prompt.
The training data has been created in two tranches. The original data, created for the first edition of the task, was hand-annotated by three native speakers. The original text task training release directory contains the following resources:
Id | Prompt | Wavfile | RecResult | Transcription | language | meaning |
11336 | Frag: rote Stiefel | 11336.wav | i'd like red boots | i'd like red boots | correct | correct |
7068 | Frag: Wie viel kostet es? | 7068.wav | how many is it | how many is it | incorrect | incorrect |
8774 | Frag: Ich möchte die Rechnung | 8774.wav | i want the bill | i want the bills | incorrect | correct |
There is one line for each audio file. The specific CSV format is UTF-8, tab separated.
The 'language' column contains the word "correct" if the response has been judged fully correct in terms of both language and meaning. The 'meaning' column contains the word "correct" if the response has been judged correct in terms of meaning, but not necessarily language.
The new data added for the second edition of the task leverages systems developed for the first edition: an improved annotation process was used in which each item has been annotated both by machines and by humans. The metadata for the second edition of the task contains an extra column giving a summary of this annotation information, and the data has also been divided into three groups of descending reliability (A highest, C lowest). A brief summary of the annotation process is given in the Shared Task 2 release notes.
The second edition text task training release directory contains the following resources:
Id | Prompt | Wavfile | RecResult | Transcription | language | meaning | Trace |
11336 | Frag: rote Stiefel | 11336.wav | i'd like red boots | i'd like red boots | correct | correct | H: 3-1 M: 3-0 |
7068 | Frag: Wie viel kostet es? | 7068.wav | how many is it | how many is it | incorrect | incorrect | H: 0-4 M: 0-0 |
8774 | Frag: Ich möchte die Rechnung | 8774.wav | i want the bill | i want the bills | incorrect | correct | H: 2-2 M: 0-3 |
Again, there is one line for each audio file. The specific CSV format is UTF-8, tab separated.
The 'language' column contains the word "correct" if the response has been judged fully correct in terms of both language and meaning. The 'meaning' column contains the word "correct" if the response has been judged correct in terms of meaning, but not necessarily language.
The fact that speech recognition is often inaccurate means that there may not always be sufficient information to make a correct decision. For example, the third utterance above (Id 8774) should be rejected, since the student has replied with a grammatically incorrect sentence ("i want the bills"); but since the recogniser has corrected the error ("i want the bill"), there is no way to determine this from the recognition result.
The text task test release directory, which will be released on Jan 31 2018, will contain the following resources:
Id | Prompt | Wavfile | RecResult |
11336 | Frag: rote Stiefel | 11336.wav | i'd like red boots |
7068 | Frag: Wie viel kostet es? | 7068.wav | how many is it |
8774 | Frag: Ich möchte die Rechnung | 8774.wav | i want the bill |
i.e. like the training data but without transcriptions or judgements. The specific CSV format is UTF-8, tab separated.
Groups who wish to submit an entry to the shared task should submit a CSV file, produced by running their system over the test data. The format should be the same as the test data, but with an extra column called Judgement added, in which the possible values are 'accept' and 'reject'. For example:
Id | Prompt | Wavfile | RecResult | Judgement |
11336 | Frag: rote Stiefel | 11336.wav | i'd like red boots | accept |
7068 | Frag: Wie viel kostet es? | 7068.wav | how many is it | reject |
8774 | Frag: Ich möchte die Rechnung | 8774.wav | i want the bill | reject |
There should be one line for each line in the test data. The specific CSV format is once more UTF-8, tab separated.
Answer spreadsheets should be submitted by email to johanna.gerlach@unige.ch and emmanuel.rayner@unige.ch.
Important: note that the metric has been changed compared to the one used in the first and second editions of the Spoken CALL Shared Task. The reasons for changing it are explained in §5 of the Interspeech 2018 paper; basically, the old metric has topped out.
The metric used to score the results is based on three intuitions: incorrect responses should be rejected; correct responses should be accepted; and accepting a response that contains a language error is worse than accepting one which is linguistically correct but has the wrong meaning.
The metric is defined as follows (there is further discussion in §5 of the Interspeech 2018 paper):
Each system response falls into one of five categories:
- Correct Accept (CA): a correct response that the system accepts
- False Reject (FR): a correct response that the system rejects
- Correct Reject (CR): an incorrect response that the system rejects
- Plain False Accept (PFA): an incorrect response that the system accepts, where the response is linguistically correct but has the wrong meaning
- Gross False Accept (GFA): an incorrect response that the system accepts, where the response contains a language error
Define CR, CA, FR, PFA and GFA to be the number of utterances in each of the above categories, and put FA = PFA + k·GFA, where k is a weighting factor that makes gross false accepts relatively more important. Then we define the differential response score on rejects, D, to be the ratio of the reject rate on incorrect utterances to the reject rate on correct utterances:
D = ( CR/(CR + FA) ) / ( FR/(FR + CA) ) = CR(FR + CA) / ( FR(CR + FA) )
Similarly, we define the differential response score on accepts, DA, to be the ratio of the accept rate on correct utterances to the accept rate on incorrect utterances:
DA = ( CA/(FR + CA) ) / ( FA/(CR + FA) ) = CA(CR + FA) / ( FA(FR + CA) )
Finally, we define the balanced differential response rate, Dfull, to be the geometric mean of D and DA. Following the derivation in §5 of the Interspeech 2018 paper, Dfull reduces to the simple formula:
Dfull = √( (CA·CR) / (FA·FR) )
We will use Dfull as the metric for evaluating the quality of systems competing in the shared task, with the weighting factor k set equal to 3.
Important: In order to prevent "gaming" of the metric by concentrating on only one out of D and DA while ignoring the other, entries are also required to reject at least 50% of all incorrect responses and accept at least 50% of all correct responses.
A sample grammar, based on the one in the app used to collect the data, is provided as part of the release. The grammar is in XML format, and associates each possible prompt with a translation of the prompt into English and a set of possible responses.
A typical entry looks like this:
<prompt_unit>
<prompt>Sag: Ich möchte am Montagmorgen abreisen</prompt>
<translated_prompt>Ask for: I want to leave on monday morning</translated_prompt>
<response>i need to leave on monday morning</response>
<response>i need to leave on monday morning please</response>
<response>i should like to leave on monday morning</response>
<response>i should like to leave on monday morning please</response>
<response>i want to leave on monday morning</response>
<response>i want to leave on monday morning please</response>
<response>i would like to leave on monday morning</response>
<response>i would like to leave on monday morning please</response>
<response>i'd like to leave on monday morning</response>
<response>i'd like to leave on monday morning please</response>
</prompt_unit>
Important: the sample grammar is NOT INTENDED TO BE COMPLETE. As already noted, the task is open-ended.
A Python3 script which carries out a baseline version of the text task is provided as part of the release. The script reads the sample XML grammar and a training data spreadsheet, then scores each item in the spreadsheet by matching the prompt and recognition result against the appropriate record in the grammar. If the recognition result is listed in the grammar as a possible response for the prompt, it is accepted, otherwise it is rejected. The results are written out as a new spreadsheet.
The files used (resource grammar, input spreadsheet and output spreadsheet) are defined at the top of the script.
Note that the script does not run under Python 2.x.
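The released script is the reference implementation; the sketch below merely illustrates the same matching idea under assumed placeholder filenames: it parses the sample XML grammar, accepts an item exactly when its recognition result appears among the responses listed for its prompt, and writes the judgements to a new spreadsheet.

import csv
import xml.etree.ElementTree as ET

GRAMMAR_FILE = "sample_grammar.xml"   # placeholder paths: edit to match
INPUT_FILE = "text_task_input.csv"    # the files in the actual release
OUTPUT_FILE = "judgements.csv"

def load_grammar(path):
    """Map each prompt to the set of responses listed for it in the grammar,
    assuming the prompt_unit elements sit under a single root element."""
    grammar = {}
    for unit in ET.parse(path).getroot().iter("prompt_unit"):
        prompt = unit.findtext("prompt").strip()
        responses = {r.text.strip() for r in unit.iter("response")}
        grammar[prompt] = responses
    return grammar

def main():
    grammar = load_grammar(GRAMMAR_FILE)
    with open(INPUT_FILE, encoding="utf-8") as f_in, \
         open(OUTPUT_FILE, "w", encoding="utf-8", newline="") as f_out:
        reader = csv.DictReader(f_in, delimiter="\t")
        writer = csv.DictWriter(f_out, fieldnames=reader.fieldnames + ["Judgement"],
                                delimiter="\t")
        writer.writeheader()
        for row in reader:
            # Accept only if the recognition result is one of the responses the
            # grammar lists for this prompt; otherwise reject.
            ok = row["RecResult"].strip() in grammar.get(row["Prompt"].strip(), set())
            row["Judgement"] = "accept" if ok else "reject"
            writer.writerow(row)

if __name__ == "__main__":
    main()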
The ideas behind the shared task are elaborated further in the following two papers:
Original paper: Baur, Claudia, Johanna Gerlach, Manny Rayner, Martin Russell, and Helmer Strik (2016). "A Shared Task for Spoken CALL?". Proc. LREC 2016, Portorož, Slovenia.
Summary of first edition: Baur, Claudia, Cathy Chua, Johanna Gerlach, Manny Rayner, Martin Russell, Helmer Strik and Xizi Wei (2017). "Overview of the 2017 Spoken CALL Shared Task". Proc. SLaTE 2017 workshop, Stockholm, Sweden.
If you have questions, please contact us at {johanna.gerlach,emmanuel.rayner}@unige.ch
Test data with annotations (same as training data).
Test data for the speech task. See task instructions tab for more info about the data format.
Download this data set if you wish to participate in the speech-processing task. It includes all the audio files for the test data.
Test data for the text processing task. See task instructions tab for more info about the data format.
Download this data set if you wish to participate in the speech-processing task. It includes all the audio files for the training data.
Download this data set if you wish to participate in the speech-processing task. It includes three tab-separated metadata files with: prompt, wavfile, transcription, language judgement, meaning judgement.
Baseline Kaldi system. This is the system JJJ that achieved the highest score in the first edition of the shared task; cf. Mengjie Qian, Xizi Wei, Peter Jančovič and Martin Russell, "The University of Birmingham 2017 SLaTE CALL Shared Task Systems".
Download this data set if you wish to participate in the text-processing task. It includes three tab-separated files with: prompt, recognition result (produced by the highest scoring system from the first edition of the shared task), transcription, language judgement, meaning judgement. The audio files can be downloaded separately under 'Speech-processing task downloads' if required.
Python3 system which carries out a baseline version of the text task by matching the prompt and recognition result against the appropriate record in the sample grammar.
Sample grammar in XML format which associates each possible prompt with a translation of the prompt into English and a set of possible responses.
Test data for the speech task.
Download this data set if you wish to participate in the speech-processing task. It includes all the audio files for the test data.
Test data for the text processing task.
Test data with annotations (same as training data).
The transcription scheme has been improved since the first task and all transcriptions have been updated accordingly. Please make sure to download the updated data below to get the new transcriptions.
Training data audio files.
Updated version of the training metadata: tab-separated file with prompt, transcription, language judgement, meaning judgement.
Updated version of the training metadata: tab-separated file with prompt, recognition result (produced by JJJ), transcription, language judgement, meaning judgement.
Test data audio files.
Updated version of the test metadata with transcriptions and language and meaning annotations.
Results should be submitted by email to johanna.gerlach@unige.ch and emmanuel.rayner@unige.ch.
Please submit a CSV result file in the format specified in the task instructions tab.
Groups may submit up to three entries for each task. When ranking the results, only the best entry from each group will be included.
Entries may use any available material (not just the material in the Spoken CALL Shared Task v. 3 training release) for training and tuning. In particular, you may use material from the two previous editions of the Spoken CALL Shared Task training and test releases in any way you think appropriate, and there is no explicit development set.
As with the previous editions of the Spoken CALL Shared Task, test set prompts will not necessarily all appear in the training set, but they will all be in the reference grammar.
Submission deadline: Tuesday April 30, 2019, 23:59 CET.