<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-2305">
  <Title>Combining Acoustic Confidences and Pragmatic Plausibility for Classifying Spoken Chess Move Instructions</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Baseline Systems
</SectionTitle>
    <Paragraph position="0"> The general aim of our experiments is to decide whether a recognised move instruction is the one intended by the speaker. A system should accept correct recognition hypotheses and reject incorrect ones. We define the following two baseline systems for this binary decision task.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 First Hypothesis Baseline
</SectionTitle>
      <Paragraph position="0"> The first hypothesis baseline uses a confidence rejection threshold to decide whether the best recognition hypothesis should be accepted or rejected. To find an optimal value, we vary the rejection threshold linearly over the confidence scores returned by the Nuance 8.0 recogniser (integer values in the range 0-100) and use it to classify the training and development data.</Paragraph>
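The threshold sweep described above can be sketched as follows. This is a minimal illustration, not the paper's implementation; the function name and the (confidence, correctness) pair representation are assumptions.

```python
def best_threshold(data, thresholds=range(0, 101)):
    """Sweep integer confidence thresholds and return the one with the
    highest classification accuracy on the given data.

    data: list of (confidence, is_correct) pairs. A hypothesis is
    accepted when its confidence reaches the threshold; the decision
    counts as right when acceptance matches correctness.
    (Hypothetical helper; the paper uses Nuance 8.0 scores.)
    """
    def accuracy(t):
        right = sum((conf >= t) == correct for conf, correct in data)
        return right / len(data)
    return max(thresholds, key=accuracy)
```

On ties, `max` keeps the first (lowest) threshold reaching the best accuracy, which matches the low optimal threshold reported below.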
      <Paragraph position="1"> The best performing confidence threshold on the combined training and development data was 17, with an accuracy of 63.8%. This low threshold turned out to perform exactly like the majority-class baseline, which classifies all hypotheses as correctly recognised. To obtain a more balanced distribution of classification errors, we also optimised the confidence threshold according to the cost measure defined in Section 5. Under this measure, the optimal confidence rejection threshold is 45, with a classification accuracy of 60.5%. (45 is also the default confidence rejection threshold of the Nuance 8.0 speech recogniser.)</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 First Legal Move Baseline
</SectionTitle>
      <Paragraph position="0"> The first legal move baseline makes use of the constraint that user utterances only contain moves that are legal in the current board configuration. We thus first eliminate all hypotheses that denote illegal moves from the 10-best output and then apply a confidence rejection threshold to decide whether the best legal hypothesis should be accepted or rejected.</Paragraph>
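The two-step decision rule of this baseline can be sketched as follows; the function name and data representation are illustrative assumptions, not the paper's code.

```python
def first_legal_move_decision(nbest, legal_moves, threshold):
    """First-legal-move baseline (sketch): drop n-best hypotheses whose
    move is illegal in the current board configuration, then accept or
    reject the best remaining hypothesis by confidence threshold.

    nbest: list of (move, confidence) pairs, best hypothesis first.
    Returns the accepted move, or None when the input is rejected.
    """
    legal = [(m, c) for m, c in nbest if m in legal_moves]
    if not legal:
        return None            # no legal hypothesis: reject the input
    move, conf = legal[0]      # best-ranked legal hypothesis
    return move if conf >= threshold else None
```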
      <Paragraph position="1"> The best performing confidence threshold on the combined training and test data for the first legal move baseline was 23, with an accuracy of 92.4%. This threshold also optimised the cost measure defined in Section 5. The performance of both baseline systems on the test data is reported in Table 2, together with the results of the machine learning experiments.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 ML Experiments
</SectionTitle>
    <Paragraph position="0"> We devise two different machine learning experiments for selecting hypotheses from the recogniser's n-best output and from a list of all legal moves given a certain board configuration.</Paragraph>
    <Paragraph position="1"> In Experiment 1, we first filter out all illegal moves from the n-best recognition results and represent the remaining legal moves as 32-dimensional feature vectors that include acoustic confidence scores from</Paragraph>
    <Paragraph position="2"> the recogniser as well as move evaluation scores from a computer chess program. We then use machine learners to decide for each move hypothesis whether it was the one intended by the speaker. If more than one hypothesis is classified as correct, we pick the one with the highest acoustic confidence. If there is no legal move among the recognition hypotheses or all hypotheses are classified as incorrect, the input is rejected.</Paragraph>
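The selection rule just described can be sketched as follows; the triple representation and function name are assumptions for illustration, with the classifier's verdicts assumed to be precomputed.

```python
def experiment1_decision(hypotheses):
    """Experiment 1 decision rule (sketch): among the legal n-best move
    hypotheses, each already classified correct/incorrect by a learner,
    pick the 'correct' one with the highest acoustic confidence, and
    reject the input when none is classified as correct.

    hypotheses: list of (move, confidence, predicted_correct) triples.
    """
    accepted = [(m, c) for m, c, ok in hypotheses if ok]
    if not accepted:
        return None  # no legal hypothesis, or all classified incorrect
    return max(accepted, key=lambda mc: mc[1])[0]
```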
    <Paragraph position="3"> Experiment 2 adds a second classification step to Experiment 1. In case an utterance is rejected in Experiment 1, we try to find the intended move among all moves that are legal in the current situation. This is again cast as a classification problem. All legal moves are represented by 31-dimensional feature vectors that include "similarity features" with respect to the interpretation of the best recognition hypothesis, as well as move evaluation scores. Each move is then classified as either correct or incorrect. We pick a move only if it is the sole one classified as correct; otherwise the input is rejected. The average number of legal moves in the development and training games was 35.3, with a maximum of 61.</Paragraph>
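The second-stage acceptance condition is stricter than the first: a move is returned only when it is uniquely predicted correct. A minimal sketch, with an assumed pair representation:

```python
def experiment2_decision(legal_moves):
    """Second-stage rule (sketch): every legal move has been classified
    as correct or incorrect; accept only when exactly one move is
    predicted correct, otherwise reject the input.

    legal_moves: list of (move, predicted_correct) pairs.
    """
    correct = [m for m, ok in legal_moves if ok]
    return correct[0] if len(correct) == 1 else None
```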
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 Feature Sets
</SectionTitle>
      <Paragraph position="0"> The feature set for the classification of legal move hypotheses in the recogniser's n-best list (Experiment 1) consists of 32 features that can be coarsely grouped into six categories (see below). All features were automatically extracted or computed from the output of the speech recogniser, move evaluation scores, and game logs.</Paragraph>
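As a concrete illustration of how such features can be computed, here is a minimal sketch of the six acoustic-confidence statistics used in both experiments. The helper name is an assumption, and whether the paper uses population or sample variance is not stated; population statistics are assumed here.

```python
from statistics import mean, pvariance, pstdev

def confidence_features(overall, word_confs):
    """Six acoustic-confidence features (sketch): the overall
    confidence plus min, max, mean, variance, and standard deviation
    of the per-word confidences returned by the recogniser.
    Population variance/stdev are an assumption, not from the paper.
    """
    return [overall, min(word_confs), max(word_confs),
            mean(word_confs), pvariance(word_confs), pstdev(word_confs)]
```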
      <Paragraph position="1">  1. Recognition statistics (3): position in the n-best list; relative position among, and total number of, legal moves in the n-best list
 2. Acoustic confidences (6): overall acoustic confidence; min, max, mean, variance, and standard deviation of the individual word confidences
 3. Text (1): hypothesis length (in words)
 4. Depth-1 plausibility (10): raw and normalised move evaluation score with respect to the scores for all legal moves; score rank; raw score difference to the maximum score; min, max, and mean of raw scores; raw z-score; move evaluation rank and z-score among the n-best legal moves
 5. Depth-10 plausibility (10): same features as for depth-1 plausibility (at search depth 10)
 6. Game (2): ELO rating (strength) of the player; ply number
The feature set for the classification of all legal moves in Experiment 2 is summarised below. Each move is represented in terms of 31 (automatically derived) features, which can again be grouped into six categories.
 1. Similarity (5): difference size; difference bags; overlap size; overlap bag
 2. Acoustic confidences (6): same as in Experiment 1, for the best recognition hypothesis
 3. Text (2): length of the best recognition hypothesis (in words) and the recognised string (bag of words)
 4. Depth-1 plausibility (8): same as in Experiment 1 (without the features relating to n-best legal moves)
 5. Depth-10 plausibility (8): same as in Experiment 1 (without the features relating to n-best legal moves)
 6. Game (2): same as in Experiment 1</Paragraph>
      <Paragraph position="2"> The similarity features are meant to represent how close a move is to the interpretation of the best recognition result. The motivation for these features is that the machine learner might find regularities about which confusions are likely to arise in the data.
For example, the letters "b", "c", "d", "e" and "g" are phonemically similar in German (as are the letters "a" and "h", and the two digits "zwei" (two) and "drei" (three)). Although the move representations are abstractions from the actual verbalisations, the language-model data showed that most subjects referred to coordinates with single letters and digits, so there is some correspondence between the abstract representations and what was actually said.</Paragraph>
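The set-based part of these similarity features can be sketched as follows. The exact definitions of the "difference bags" and "overlap bag" features are not given in this section, so symmetric difference and intersection of the token sets are assumed here; all names are illustrative.

```python
def similarity_features(move_tokens, best_hyp_tokens):
    """Sketch of similarity features between a candidate move and the
    interpretation of the best recognition hypothesis, both given as
    token sequences (e.g. coordinate letters and digits).
    Symmetric-difference and intersection sizes are assumptions;
    the paper's full feature definitions are not reproduced here.
    """
    a, b = set(move_tokens), set(best_hyp_tokens)
    difference = a ^ b           # tokens occurring in only one of the two
    overlap = a & b              # tokens shared by both
    return {"difference_size": len(difference),
            "overlap_size": len(overlap)}
```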
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.2 Learners
</SectionTitle>
      <Paragraph position="0"> We considered three machine learners for the two classification tasks: the memory-based learner TiMBL (Daelemans et al., 2002), the rule induction learner RIPPER (Cohen, 1995), and an implementation of Support Vector Machines, SVMlight (Joachims, 1999). We trained all learners with various parameter settings on the training data and tested them on the development data. The best results for the first task (selecting legal moves from n-best lists) were achieved with SVMlight, whereas RIPPER outperformed the other two learners on the second task (selecting from all possible legal moves).</Paragraph>
      <Paragraph position="1"> SVMlight and RIPPER were therefore chosen to classify the test data in the actual experiments.</Paragraph>
    </Section>
  </Section>
</Paper>