
About the task

The shared task is based on data collected from a speech-enabled online tool which has been used to help young Swiss German teens practise skills in English conversation. Items are prompt-response pairs, where the prompt is a piece of German text and the response is a recorded English audio file. The task is to label pairs as “accept” or “reject”, accepting responses which are correct in both language and meaning, so as to match a set of hidden gold standard answers as closely as possible. Initial resources are provided so that a scratch system can be constructed with a minimal investment of effort, and in particular without necessarily using a speech recogniser.

The task is now closed. The annotated test data is now available in the downloads tab, the anonymised task results in the results tab and a browsable annotated version of the test data in the test data tab.

The papers published at SLaTE 2017 are now available in the SLaTE 2017 papers tab.


The University of Birmingham team submitted the two highest scoring entries in the first edition of the Spoken CALL Shared Task.

From left to right, Mengjie Qian, Martin Russell and Xizi Wei

The University of Birmingham 2017 SLaTE CALL Shared Task Systems. Mengjie Qian, Xizi Wei, Peter Jančovič and Martin Russell



Schedule

  • July 13, 2016: Release of training material
  • March 13, 2017: Release of test material
  • March 20, 2017: Deadline for submission of results
  • May 5, 2017: SLaTE 2017 submission deadline
  • August 25-26, 2017: SLaTE 2017, presentation and discussion of results

Shared task papers published at SLaTE 2017

Overview paper

Submitted systems

Shared task results

Results for 20 anonymised submissions and three baseline systems (BaselineKaldi = system with baseline Kaldi recogniser and baseline XML grammar; BaselineNuance = system with baseline Nuance recogniser and baseline XML grammar; BaselinePerfectRec = system with input from transcriptions and baseline XML grammar).

Id                  Rec     Pr     R      F      SA     IRej   CRej   D      Resources
KKK                 Custom  0.881  0.845  0.862  0.840  0.739  0.155  4.766
JJJ                 Custom  0.871  0.848  0.859  0.838  0.717  0.152  4.710  ASR output
BaselinePerfectRec  n/a     0.995  0.781  0.875  0.839  0.989  0.219  4.512
CCC                 Kaldi   0.687  0.933  0.791  0.801  0.300  0.067  4.468
III                 Custom  0.752  0.904  0.821  0.805  0.421  0.096  4.371
MMM                 Kaldi   0.732  0.905  0.809  0.818  0.413  0.095  4.353
OOO                 Kaldi   0.739  0.899  0.812  0.818  0.430  0.101  4.273
GGG                 Kaldi   0.639  0.957  0.766  0.769  0.173  0.043  3.998
AAA                 Kaldi   0.606  0.973  0.747  0.749  0.098  0.027  3.678
FFF                 Kaldi   0.631  0.951  0.759  0.762  0.164  0.049  3.352
BBB                 Kaldi   0.748  0.862  0.801  0.798  0.461  0.138  3.335
HHH                 Kaldi   0.602  0.971  0.743  0.747  0.096  0.029  3.289
PPP                 Custom  0.838  0.795  0.816  0.786  0.660  0.205  3.217
NNN                 Kaldi   0.713  0.880  0.788  0.790  0.383  0.120  3.188
QQQ                 Custom  0.854  0.753  0.800  0.770  0.713  0.247  2.882
DDD                 Kaldi   0.777  0.811  0.794  0.779  0.539  0.189  2.857
LLL                 Nuance  0.814  0.754  0.783  0.746  0.623  0.246  2.533
BaselineNuance      Nuance  0.822  0.723  0.770  0.731  0.652  0.277  2.358
EEE                 Nuance  0.816  0.729  0.770  0.731  0.636  0.271  2.347
SSS                 Nuance  0.737  0.820  0.776  0.743  0.423  0.180  2.346
RRR                 Custom  0.884  0.584  0.703  0.668  0.818  0.416  1.965
BaselineKaldi       Kaldi   0.957  0.439  0.602  0.588  0.951  0.561  1.694
TTT                 Nuance  0.622  0.897  0.735  0.713  0.148  0.103  1.437
  • Rec = recogniser used
  • Pr = precision
  • R = recall
  • F = F-measure
  • SA = scoring average
  • IRej = rejections on incorrect responses
  • CRej = rejections on correct responses
  • D = D-measure
  • Resources = resources shared by participant

The scores have been calculated using the scoring script.

Test data

View 2017 test data online

Task instructions

Speech-processing task

In the speech version of the CALL shared task, each item consists of

  • an identifier
  • a German text prompt
  • an audio file containing an English language response

The data was collected from an online CALL tool used to help young Swiss German students improve their English fluency.

The task is to create software that will decide whether each response is appropriate (accept) or inappropriate (reject) in the context of the prompt. This will presumably require a combination of speech recognition and text processing methods. A response is considered appropriate if it both responds to the prompt in terms of meaning and is also correct English. For example, if the prompt is

"Frag: rote Stiefel"

("Ask for: red boots"), then "I would like some red boots" or "Red boots, please" are appropriate responses. "Give me brown boots" is inappropriate because it has the wrong meaning. "I wants red boots" is inappropriate because it is incorrect English.

The task is open-ended; there are many potentially appropriate responses to each prompt.

In this version of the task, no explicit attention is paid to quality of pronunciation.

Format of training data

The speech task training release directory contains the following resources:

  1. A set of 5248 audio files
  2. A CSV file of metadata in the format (headers and example lines)
Id Prompt Wavfile Transcription language meaning
11336 Frag: rote Stiefel 11336.wav i'd like red boots correct correct
7068 Frag: Wie viel kostet es? 7068.wav how many is it incorrect incorrect
8774 Frag: Ich möchte die Rechnung 8774.wav i want the bills incorrect correct

There is one line for each audio file. The specific CSV format is UTF-8, tab separated.

The 'language' and 'meaning' columns give judgements derived from three native speakers. The 'language' column contains the word "correct" if the response has been judged fully correct in terms of both language and meaning. The 'meaning' column contains the word "correct" if the response has been judged correct in terms of meaning, but not necessarily language.
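
For illustration, the metadata spreadsheet can be read with Python's standard csv module. This is a minimal sketch, not part of the official release; the file name below is a placeholder for whichever CSV file is in the release.

import csv

# Read the training metadata (UTF-8, tab-separated, with a header row).
# "training_metadata.csv" is a placeholder; use the actual file name from the release.
with open("training_metadata.csv", encoding="utf-8", newline="") as f:
    reader = csv.DictReader(f, delimiter="\t")
    rows = list(reader)

# Each row is a dict keyed by the column headers shown above.
for row in rows[:3]:
    fully_correct = row["language"] == "correct"
    print(row["Id"], row["Prompt"], row["Wavfile"], fully_correct)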

Format of test data

The speech task test release directory, which will be released on Mar 13 2017, will contain the following resources:

  1. A set of 1000 audio files
  2. A CSV file of metadata in the format (headers and example lines)
Id Prompt Wavfile
11336 Frag: rote Stiefel 11336.wav
7068 Frag: Wie viel kostet es? 7068.wav
8774 Frag: Ich möchte die Rechnung 8774.wav

i.e. like the training data but without transcriptions or judgements. The specific CSV format is UTF-8, tab separated.

Format of answers

Groups who wish to submit an entry to the shared task should submit a CSV file, produced by running their system over the test data. The format should be the same as the test data, but with an extra column called Judgement, in which the possible values are 'accept' and 'reject'. For example:

Id Prompt Wavfile Judgement
11336 Frag: rote Stiefel 11336.wav accept
7068 Frag: Wie viel kostet es? 7068.wav reject
8774 Frag: Ich möchte die Rechnung 8774.wav reject

There should be one line for each line in the test data. The specific CSV format is once more UTF-8, tab separated.
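
As a minimal sketch (not part of the official release), a submission file can be produced from the test spreadsheet as follows. The judge function is a placeholder standing in for your own accept/reject logic, and the file names are illustrative.

import csv

def judge(prompt, wavfile):
    # Placeholder for your own accept/reject decision.
    return "reject"

with open("test_metadata.csv", encoding="utf-8", newline="") as f_in, \
     open("answers.csv", "w", encoding="utf-8", newline="") as f_out:
    reader = csv.DictReader(f_in, delimiter="\t")
    writer = csv.DictWriter(f_out, fieldnames=reader.fieldnames + ["Judgement"],
                            delimiter="\t")
    writer.writeheader()
    for row in reader:
        # Keep the test columns and append the Judgement column.
        row["Judgement"] = judge(row["Prompt"], row["Wavfile"])
        writer.writerow(row)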

Answer spreadsheets will be submitted by email to johanna.gerlach@unige.ch and emmanuel.rayner@unige.ch.

Scoring metric

The metric used to score the results is based on three intuitions:

  • The system should reject incorrect answers as often as possible, and reject correct answers as seldom as possible.
  • The more pronounced the difference between the system's response to incorrect as opposed to correct answers, the more useful it will be.
  • Some system errors are more serious than others. In particular, it is worse for the system to accept a sentence which is incorrect in terms of meaning than it is to accept one which is correct in terms of meaning but incorrect in terms of language.

The metric is defined as follows (there is further discussion in §4.1 of the LREC 2016 paper):

Each system response falls into one of five categories:

  1. Correct Reject: the student's answer is incorrect, the system rejects.
  2. Correct Accept: the student's answer is correct, the system accepts.
  3. False Reject: the student's answer is correct, the system rejects.
  4. Plain False Accept: the student's answer is correct in meaning but incorrect English, the system accepts.
  5. Gross False Accept: the student's answer is incorrect in meaning, the system accepts.

Define CR, CA, FR, PFA, GFA to be the number of utterances in each of the above categories, and put FA = PFA + k·GFA where k is a weighting factor that makes gross false accepts relatively more important. Then we define the differential response score, D, to be the ratio of the reject rate on incorrect answers to the reject rate on correct utterances:

D = ( CR/(CR + FA) ) / ( FR/(FR + CA) ) = CR·(FR + CA) / ( FR·(CR + FA) )

We will use D as the metric for evaluating the quality of systems competing in the shared task, with the weighting factor k set equal to 3.
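
For illustration only (the official scoring script is provided separately in the downloads), the metric can be computed from the five category counts as follows.

def differential_response_score(cr, ca, fr, pfa, gfa, k=3):
    # FA = PFA + k*GFA, as defined above; k = 3 for the shared task.
    fa = pfa + k * gfa
    reject_rate_incorrect = cr / (cr + fa)   # weighted rejection rate on incorrect responses
    reject_rate_correct = fr / (fr + ca)     # rejection rate on correct responses
    return reject_rate_incorrect / reject_rate_correct

# Worked example: CR=90, CA=85, FR=15, PFA=15, GFA=5 gives
# 90/120 = 0.75 and 15/100 = 0.15, so D = 0.75 / 0.15 = 5.0
print(differential_response_score(90, 85, 15, 15, 5))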

Recognition resources

A baseline recogniser for the task, built using the popular Kaldi platform, is available from the downloads tab.

Grammar resources

A sample grammar, based on the one in the app used to collect the data, is provided as part of the release. The grammar is in XML format, and associates each possible prompt with

  1. a translation of the prompt into English
  2. a set of possible responses.

A typical entry looks like this:

<prompt_unit>
<prompt>Sag: Ich möchte am Montagmorgen abreisen</prompt>
<translated_prompt>Ask for: I want to leave on monday morning</translated_prompt>
<response>i need to leave on monday morning</response>
<response>i need to leave on monday morning please</response>
<response>i should like to leave on monday morning</response>
<response>i should like to leave on monday morning please</response>
<response>i want to leave on monday morning</response>
<response>i want to leave on monday morning please</response>
<response>i would like to leave on monday morning</response>
<response>i would like to leave on monday morning please</response>
<response>i'd like to leave on monday morning</response>
<response>i'd like to leave on monday morning please</response>
</prompt_unit>

Important: the sample grammar is NOT INTENDED TO BE COMPLETE. As already noted, the task is open-ended.
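
For illustration, the sample grammar can be loaded with Python's standard xml.etree.ElementTree module. This is a minimal sketch; the grammar file name below is a placeholder for the XML file in the release.

import xml.etree.ElementTree as ET

def load_grammar(path):
    # Map each prompt to the set of responses listed for it in the sample grammar.
    grammar = {}
    root = ET.parse(path).getroot()
    for unit in root.iter("prompt_unit"):
        prompt = unit.find("prompt").text
        responses = {r.text for r in unit.findall("response")}
        grammar[prompt] = responses
    return grammar

grammar = load_grammar("sample_grammar.xml")  # placeholder file name
print("i want to leave on monday morning" in
      grammar.get("Sag: Ich möchte am Montagmorgen abreisen", set()))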

Text-processing task

In the text version of the CALL shared task, each item consists of

  • an identifier
  • a German text prompt
  • an audio file containing an English language response
  • a text string representing the 1-best result of performing speech recognition on the audio file

The data was collected from an online CALL tool used to help young Swiss German students improve their English fluency.

The task is to create software that will decide whether each response is appropriate (accept) or inappropriate (reject) in the context of the prompt. This will require some kind of text processing method. A response is considered appropriate if it both responds to the prompt in terms of meaning and is also correct English. For example, if the prompt is

"Frag: rote Stiefel"

("Ask for: red boots"), then "I would like some red boots" or "Red boots, please" are appropriate responses. "Give me brown boots" is inappropriate because it has the wrong meaning. "I wants red boots" is inappropriate because it is incorrect English.

The task is open-ended; there are many potentially appropriate responses to each prompt.

Format of training data

The text task training release directory contains the following resources:

  1. A set of 5248 audio files
  2. A CSV file of metadata in the format (headers and example lines)
Id Prompt Wavfile RecResult Transcription language meaning
11336 Frag: rote Stiefel 11336.wav i'd like red boots i'd like red boots correct correct
7068 Frag: Wie viel kostet es? 7068.wav how many is it how many is it incorrect incorrect
8774 Frag: Ich möchte die Rechnung 8774.wav i want the bill i want the bills incorrect correct

There is one line for each audio file. The specific CSV format is UTF-8, tab separated.

The 'language' and 'meaning' columns give judgements derived from three native speakers. The 'language' column contains the word "correct" if the response has been judged fully correct in terms of both language and meaning. The 'meaning' column contains the word "correct" if the response has been judged correct in terms of meaning, but not necessarily language.

The fact that speech recognition is often inaccurate means that there may not always be sufficient information to make a correct decision. For example, the third utterance should be rejected, since the student has replied with a grammatically incorrect sentence, but since the recogniser has corrected the error there is no way to determine this.

Format of test data

The text task test release directory, which will be released on Mar 13 2017, will contain the following resources:

  1. A set of 1000 audio files
  2. A CSV file of metadata in the format (headers and example lines)
Id Prompt Wavfile RecResult
11336 Frag: rote Stiefel 11336.wav i'd like red boots
7068 Frag: Wie viel kostet es? 7068.wav how many is it
8774 Frag: Ich möchte die Rechnung 8774.wav i want the bill

i.e. like the training data but without transcriptions or judgements. The specific CSV format is UTF-8, tab separated.

Format of answers

Groups who wish to submit an entry to the shared task should upload a CSV file, produced by running their system over the test data. The format should be the same as the test data, but with an extra column called Judgement, in which the possible values are 'accept' and 'reject'. For example:

Id Prompt Wavfile RecResult Judgement
11336 Frag: rote Stiefel 11336.wav i'd like red boots accept
7068 Frag: Wie viel kostet es? 7068.wav how many is it reject
8774 Frag: Ich möchte die Rechnung 8774.wav i want the bill reject

There should be one line for each line in the test data. The specific CSV format is once more UTF-8, tab separated.

Answer spreadsheets will be submitted by email to johanna.gerlach@unige.ch and emmanuel.rayner@unige.ch.

Scoring metric

The metric used to score the results is based on three intuitions:

  • The system should reject incorrect answers as often as possible, and reject correct answers as seldom as possible.
  • The more pronounced the difference between the system's response to incorrect as opposed to correct answers, the more useful it will be.
  • Some system errors are more serious than others. In particular, it is worse for the system to accept a sentence which is incorrect in terms of meaning than it is to accept one which is correct in terms of meaning but incorrect in terms of language.

The metric is defined as follows (there is further discussion in §4.1 of the LREC 2016 paper):

Each system response falls into one of five categories:

  1. Correct Reject: the student's answer is incorrect, the system rejects.
  2. Correct Accept: the student's answer is correct, the system accepts.
  3. False Reject: the student's answer is correct, the system rejects.
  4. Plain False Accept: the student's answer is correct in meaning but incorrect English, the system accepts.
  5. Gross False Accept: the student's answer is incorrect in meaning, the system accepts.

Define CR, CA, FR, PFA, GFA to be the number of utterances in each of the above categories, and put FA = PFA + k·GFA where k is a weighting factor that makes gross false accepts relatively more important. Then we define the differential response score, D, to be the ratio of the reject rate on incorrect answers to the reject rate on correct utterances:

D = ( CR/(CR + FA) ) / ( FR/(FR + CA) ) = CR·(FR + CA) / ( FR·(CR + FA) )

We will use D as the metric for evaluating the quality of systems competing in the shared task, with the weighting factor k set equal to 3.

Grammar resources

A sample grammar, based on the one in the app used to collect the data, is provided as part of the release. The grammar is in XML format, and associates each possible prompt with

  1. a translation of the prompt into English
  2. a set of possible responses.

A typical entry looks like this:

<prompt_unit>
<prompt>Sag: Ich möchte am Montagmorgen abreisen</prompt>
<translated_prompt>Ask for: I want to leave on monday morning</translated_prompt>
<response>i need to leave on monday morning</response>
<response>i need to leave on monday morning please</response>
<response>i should like to leave on monday morning</response>
<response>i should like to leave on monday morning please</response>
<response>i want to leave on monday morning</response>
<response>i want to leave on monday morning please</response>
<response>i would like to leave on monday morning</response>
<response>i would like to leave on monday morning please</response>
<response>i'd like to leave on monday morning</response>
<response>i'd like to leave on monday morning please</response>
</prompt_unit>

Important: the sample grammar is NOT INTENDED TO BE COMPLETE. As already noted, the task is open-ended.

Baseline system resource

A Python3 script which carries out a baseline version of the text task is provided as part of the release. The script reads the sample XML grammar and a training data spreadsheet, then scores each item in the spreadsheet by matching the prompt and recognition result against the appropriate record in the grammar. If the recognition result is listed in the grammar as a possible response for the prompt, it is accepted, otherwise it is rejected. The results are written out as a new spreadsheet.

The files used (resource grammar, input spreadsheet and output spreadsheet) are defined at the top of the script.

Note that the script does not run under Python 2.x.
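
For illustration, the matching rule the script applies can be sketched as follows. This is not the released script itself; the grammar argument is a prompt-to-responses dictionary such as the one built in the sketch under "Grammar resources".

def baseline_judgement(prompt, rec_result, grammar):
    # Accept only if the recognition result is listed in the grammar
    # as a possible response for this prompt; otherwise reject.
    allowed = grammar.get(prompt, set())
    return "accept" if rec_result in allowed else "reject"

print(baseline_judgement("Frag: rote Stiefel",
                         "i'd like red boots",
                         {"Frag: rote Stiefel": {"i'd like red boots"}}))
# -> accept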

Further information

The ideas behind the shared task are elaborated further in the paper: Baur, Claudia, Johanna Gerlach, Emmanuel Rayner, Martin Russell, and Helmer Strik. (2016). "A Shared Task for Spoken CALL?". Proc. LREC 2016, Portorož, Slovenia.

If you have questions, please contact us at johanna.gerlach@unige.ch


Available downloads

Annotated test data

Annotated test data

Test data with language and meaning annotations (same as training data).

Speech-processing task test data

Speech processing task test data

Test data for the speech task. See task instructions tab for more info about the data format.

Text-processing task test data

Nuance text processing task test data

Test data for the text processing task using Nuance recognition results. See task instructions tab for more info about the data format.

Kaldi text processing task test data

Test data for the text processing task using Kaldi recognition results. See task instructions tab for more info about the data format.

Speech-processing task downloads


Kaldi resources

Speech recognition resources include acoustic and language models, as well as scripts to construct a minimal baseline system to produce recognition results for the speech-processing task.


Text-processing task downloads

Two data sets are available for the text-processing task: one with recognition results produced by Nuance with the original CALL-SLT grammar, another with recognition results produced by a baseline Kaldi system. Either one can be used for the task.


Common downloads (resources usable for both tasks)


2 Sep 2016: Minor correction to training data metadata sheet (1 line): line 284, item with Id 6156: language is 'incorrect'

13 Feb 2017: Correction of ~1% of the judgements in the training data

Submitting task results

Results should be submitted by email to johanna.gerlach@unige.ch and emmanuel.rayner@unige.ch.

Please submit a csv result file in the format specified in the task instructions tab.

Groups may submit up to three entries for each task. There is only one text task: entrants may use the Nuance or Kaldi versions of the data as they prefer, or combine them. When ranking the results, only the best entry from each group will be included.

Submission deadline: March 20, 2017, 23:59 CET.

Data annotation

In detail, the data was annotated as follows.

Three native speakers of English performed the annotation independently, using a web tool; for each utterance, the tool presented the original German prompt, an English translation of the prompt, and the response. The annotator filled in both the Language and Meaning fields together. Annotators did not refer to the sample grammar when annotating. The specific instructions given to the annotators are reproduced below.

A total of 6000 utterances were annotated. We had hoped that this would yield 5000 utterances where judgements from the three annotators were unanimous, but this proved slightly optimistic: in fact, only 4873 such utterances were found. In order to make up the shortfall, one of the authors, who had also previously served as an annotator, carefully reviewed the remaining utterances to select a subset where they judged that there was a strong probability the majority judgement was correct. The adjudicator tried to be as conservative as possible and exclude any examples where they felt at all uncertain. The utterances obtained using this method were added to the 'unanimous' set.

A Shared Task for Spoken CALL?

Claudia Baur, Johanna Gerlach, Manny Rayner, Martin Russell and Helmer Strik

BibTeX:

@InProceedings{BAUR16.654,
author = {Claudia Baur and Johanna Gerlach and Manny Rayner and Martin Russell and Helmer Strik},
title = {A Shared Task for Spoken CALL?},
booktitle = {Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016)},
year = {2016},
month = {may},
date = {23-28},
location = {Portorož, Slovenia},
editor = {Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Thierry Declerck and Sara Goggi and Marko Grobelnik and Bente Maegaard and Joseph Mariani and Helene Mazo and Asuncion Moreno and Jan Odijk and Stelios Piperidis},
publisher = {European Language Resources Association (ELRA)},
address = {Paris, France},
isbn = {978-2-9517408-9-1},
language = {english}
}

Interactive demo

This tab presents a toy interactive demo showing how the shared task works. On each line, you can see a German text prompt, a link to an audio file with the student's response to the prompt and a recognition result. The task is to decide whether the response is appropriate or not. The radio buttons in the "Accept/reject" column let you choose whether to accept or reject the student's response. The final columns (displayed when you press "Show score") give a transcription and gold standard judgements taken from native speakers of English. The "language" column indicates whether the response is fully correct, both in terms of meaning and in terms of being correct English. The "meaning" column only indicates whether the meaning is right, so is a weaker criterion of correctness.

There are two versions of the task:

  • For the speech version you should imagine that you are a system which includes a speech recogniser. You listen to the audio file, and decide on the basis of what you hear whether to accept or reject. For example, the prompt for the first line is "Frag : Ticket zum Trafalgar Square" ("Ask for: ticket to Trafalgar Square"), and the audio file has the response "A ticket to the Trafalgar Square". This is incorrect (the superfluous "the"), so should be rejected. The second line has the prompt "Frag : Zimmer für 7 Nächte" ("Ask for: room for 7 nights"), and the audio file has the response "hello". This is completely wrong, and should be rejected. The third line has the same prompt as the second, and the audio file has the response "A room for seven nights". This is completely correct, so should be accepted.
  • For the text version, you should imagine that the speech recognition has already been done for you. You should NOT listen to the audio file, but make your decision only on the basis of what you see in the "Recognition result" column. If speech recognition was incorrect, you may not have enough information to make a good decision. For example, the second line has the prompt "Frag : Zimmer für 7 Nächte" ("Ask for: room for 7 nights"), and the recognition result is "hello". The obvious decision is to reject, which is correct. However, in the third line, the prompt is the same and the recognition result is "an room for seven nights". This looks slightly wrong, since the first word should be "a" rather than "an". But in fact there has been a recognition error: the student has answered appropriately, and the correct decision is to accept.

Based on your annotations, the demo reports counts for the five categories:

  • Correct Reject (CR): the student's answer is incorrect, the system rejects
  • Correct Accept (CA): the student's answer is correct, the system accepts
  • False Reject (FR): the student's answer is correct, the system rejects
  • Plain False Accept (PFA): the student's answer is correct in meaning but incorrect English, the system accepts
  • Gross False Accept (GFA): the student's answer is incorrect in meaning, the system accepts

It then computes:

Weighted rejection rate on incorrect utterances = CR / (CR + PFA + k·GFA)

Rejection rate on correct utterances = FR / (FR + CA)

Differential response score = weighted rejection rate on incorrect utterances / rejection rate on correct utterances

You can change selections in the Accept/reject column to see the impact on the score.

Organisers

(in alphabetical order)

Claudia Baur, FTI/TIM, Université de Genève

Cathy Chua, Independent researcher

Johanna Gerlach, FTI/TIM, Université de Genève

Manny Rayner, FTI/TIM, Université de Genève

Martin Russell, Department of Electronic, Electrical and Systems Engineering, University of Birmingham

Helmer Strik, Centre for Language Studies (CLS), Radboud University Nijmegen

Xizi Wei, Department of Electronic, Electrical and Systems Engineering, University of Birmingham