<?xml version="1.0" standalone="yes"?> <Paper uid="W02-0703"> <Title>Spoken Language Parsing Using Phrase-Level Grammars and Trainable Classifiers</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 The Interlingua </SectionTitle> <Paragraph position="0"> The interlingua used in the NESPOLE! system is called Interchange Format (IF) (Levin et al., 1998; Levin et al., 2000). The IF defines a shallow semantic representation for task-oriented utterances that abstracts away from language-specific syntax and idiosyncrasies while capturing the meaning of the input. Each utterance is divided into semantic segments called semantic dialog units (SDUs), and an IF is assigned to each SDU.</Paragraph> <Paragraph position="1"> An IF representation consists of four parts: a speaker tag, a speech act, an optional sequence of concepts, and an optional set of arguments. The representation takes the following form: speaker : speech act +concept* (argument*) The speaker tag indicates the role of the speaker in the dialogue. The speech act captures the speaker's intention. The concept sequence, which may contain zero or more concepts, captures the focus of an SDU. The speech act and concept sequence are collectively referred to as the domain action (DA). The arguments use a feature-value representation to encode specific information from the utterance. Argument values can be atomic or complex. The IF specification defines all of the components and describes how they can be legally combined. An example utterance is shown below; its IF consists simply of a speaker tag and a speech act, with no concepts or arguments.</Paragraph> <Paragraph position="2"> Thank you very much.</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 The Hybrid Analysis Approach </SectionTitle> <Paragraph position="0"> Our hybrid analysis approach uses a combination of grammar-based parsing and machine learning techniques to transform spoken utterances into the IF representation described above. The speaker tag is assumed to be given. Thus, the goal of the analyzer is to identify the DA and the arguments.</Paragraph> <Paragraph position="1"> The hybrid analyzer operates in three stages. First, semantic grammars are used to parse an utterance into a sequence of arguments. Next, the utterance is segmented into SDUs. Finally, the DA is identified using automatic classifiers.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 Argument Parsing </SectionTitle> <Paragraph position="0"> The first stage in analysis is parsing an utterance for arguments. During this stage, utterances are parsed with phrase-level semantic grammars using the robust SOUP parser (Gavalda, 2000).</Paragraph> <Paragraph position="1"> The SOUP parser is a stochastic, chart-based, top-down parser designed to provide real-time analysis of spoken language using context-free semantic grammars. One important feature provided by SOUP is word skipping. The amount of skipping allowed is configurable, and a list of unskippable words can be defined. Another feature that is critical for phrase-level argument parsing is the ability to produce analyses consisting of multiple parse trees. SOUP also supports modular grammar development (Woszczyna et al., 1998).</Paragraph> <Paragraph position="2"> Subgrammars designed for different domains or purposes can be developed independently and applied in parallel during parsing. Parse tree nodes are marked with the label of the subgrammar that produced them. When an input can be parsed in multiple ways, SOUP can provide a ranked list of interpretations.</Paragraph>
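To make this intermediate representation concrete, the sketch below models a SOUP-style argument parse as a sequence of trees, each tagged with the subgrammar that produced it, with alternative analyses ranked by score. This is a hypothetical rendering for illustration, not SOUP's actual API; all names are assumptions.

```python
# Hypothetical rendering of a SOUP-style argument parse (not SOUP's real API).
from dataclasses import dataclass, field

@dataclass
class ParseTree:
    root_label: str         # e.g. a top-level argument label such as "who="
    subgrammar: str         # "argument", "pseudo-argument", "cross-domain", or "shared"
    children: list = field(default_factory=list)
    words: tuple = ()       # surface words covered by this tree

@dataclass
class Analysis:
    trees: list             # one utterance may yield several phrase-level trees
    score: float            # used to rank alternative interpretations

def best_analysis(analyses):
    # The prototype analyzer keeps only the best-ranked argument parse,
    # as described in the text that follows.
    return max(analyses, key=lambda a: a.score)
```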
<Paragraph position="3"> In the prototype analyzer, word skipping is only allowed between parse trees. Only the best-ranked argument parse is used for further processing.</Paragraph> <Paragraph position="4"> Four grammars are defined for argument parsing: an argument grammar, a pseudo-argument grammar, a cross-domain grammar, and a shared grammar. The argument grammar contains phrase-level rules for parsing arguments defined in the IF. Top-level argument grammar nonterminals correspond to top-level arguments in the IF.</Paragraph> <Paragraph position="5"> The pseudo-argument grammar contains top-level nonterminals that do not correspond to interlingua concepts. These rules are used for parsing common phrases that can be grouped into classes to capture more useful information for the classifiers. For example, all booked up, full, and sold out might be grouped into a class of phrases that indicate unavailability. In addition, rules in the pseudo-argument grammar can be used for contextual anchoring of ambiguous arguments. For example, the arguments [who=] and [to-whom=] have the same values. To parse these arguments properly in a sentence like "Can you send me the brochure?", we use a pseudo-argument grammar rule that refers to the arguments [who=] and [to-whom=] within the appropriate context.</Paragraph> <Paragraph position="6"> The cross-domain grammar contains rules for parsing whole DAs that are domain-independent. For example, this grammar contains rules for greetings (Hello, Good bye, Nice to meet you, etc.). Cross-domain grammar rules do not cover all possible domain-independent DAs. Instead, the rules focus on DAs with simple or no argument lists. Domain-independent DAs with complex argument lists are left to the classifiers. Cross-domain rules play an important role in the prediction of SDU boundaries.</Paragraph> <Paragraph position="7"> Finally, the shared grammar contains common grammar rules that can be used by all other subgrammars. These include definitions for most of the arguments, since many can also appear as sub-arguments. The right-hand sides (RHSs) of argument grammar rules therefore consist mostly of references to rules in the shared grammar. This method eliminates redundant rules in the argument and shared grammars and allows for more accurate grammar maintenance.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.2 Segmentation </SectionTitle> <Paragraph position="0"> The second stage of processing in the hybrid analysis approach is segmentation of the input into SDUs. The IF representation assigns DAs at the SDU level. However, since dialogue utterances often consist of multiple SDUs, utterances must be segmented into SDUs before DAs can be assigned.</Paragraph> <Paragraph position="1"> Figure 1 shows an example utterance containing four arguments segmented into two SDUs.</Paragraph> <Paragraph position="2"> Figure 1 (example utterance): hello | i would like to take a vacation in val di fiemme</Paragraph> <Paragraph position="3"> The argument parse may contain trees for cross-domain DAs, which by definition cover a complete SDU. Thus, there must be an SDU boundary on both sides of a cross-domain tree. Additionally, no SDU boundaries are allowed within parse trees.</Paragraph> <Paragraph position="4"> The prototype analyzer drops words skipped between parse trees, leaving only a sequence of trees. The parse trees on each side of a potential boundary are examined, and if either tree was constructed by the cross-domain grammar, an SDU boundary is inserted. Otherwise, a simple statistical model similar to the one described by Lavie et al. (1997) estimates the likelihood of a boundary.</Paragraph> <Paragraph position="5"> The statistical model is based only on the root labels of the parse trees immediately preceding and following the potential boundary position. Suppose the position under consideration looks like [A] * [B], where A and B are the root labels of the neighboring trees and * marks the potential boundary. The likelihood of a boundary at * is estimated from counts such as C([A *]) and C([* B]), which are computed from the training data. An evaluation of this baseline model is presented in section 6.</Paragraph>
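The text does not give the exact estimator, so the following sketch shows one plausible reading of this count-based model: since each training utterance is a single SDU, boundaries are observed only at utterance edges, and the boundary likelihood between adjacent root labels is estimated from relative frequencies. All names, and the way the two frequencies are combined, are assumptions.

```python
# A minimal sketch of the count-based boundary model described above.
# The exact estimator is not specified in the text; the combination of
# relative frequencies below is an assumption, and all names are illustrative.
from collections import Counter

class BoundaryModel:
    def __init__(self):
        self.before = Counter()  # C([* B]): boundary observed before root label B
        self.after = Counter()   # C([A *]): boundary observed after root label A
        self.total = Counter()   # C(A): total occurrences of root label A

    def train(self, utterances):
        # Each training utterance is a single SDU, so its first root label is
        # preceded by an SDU boundary and its last is followed by one.
        for root_labels in utterances:
            self.before[root_labels[0]] += 1
            self.after[root_labels[-1]] += 1
            self.total.update(root_labels)

    def boundary_likelihood(self, a, b):
        # Likelihood of an SDU boundary between adjacent trees with roots a and b.
        p_after = self.after[a] / self.total[a] if self.total[a] else 0.0
        p_before = self.before[b] / self.total[b] if self.total[b] else 0.0
        return p_after * p_before
```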
</Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.3 DA Classification </SectionTitle> <Paragraph position="0"> The third stage of analysis is the identification of the DA for each SDU using automatic classifiers. After segmentation, a cross-domain parse tree may cover an SDU. In this case, analysis is complete, since the parse tree contains the DA. Otherwise, automatic classifiers are used to assign the DA. In the prototype analyzer, the DA classification task is split into separate subtasks of classifying the speech act and the concept sequence. This reduces the complexity of each subtask and allows for the application of specialized techniques to identify each component.</Paragraph> <Paragraph position="1"> One classifier is used to identify the speech act, and a second classifier identifies the concept sequence. Both classifiers are implemented using TiMBL (Daelemans et al., 2000), a memory-based learner. Speech act classification is performed first. The input to the speech act classifier is a set of binary features that indicate whether each of the possible argument and pseudo-argument labels is present in the argument parse for the SDU. No other features are currently used. Concept sequence classification is performed after speech act classification. The concept sequence classifier uses the same feature set as the speech act classifier with one additional feature: the speech act assigned by the speech act classifier. We present an evaluation of this baseline DA classification scheme in section 6.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.4 Using the IF Specification </SectionTitle> <Paragraph position="0"> The IF specification imposes constraints on how elements of the IF representation can legally combine. DA classification can be augmented with knowledge of these constraints, which offers two advantages over otherwise naive classification. First, the analyzer must produce valid IF representations in order to be useful in a translation system. Second, using knowledge from the IF specification can improve the quality of the IF produced, and thus of the translation.</Paragraph> <Paragraph position="1"> Two elements of the IF specification are especially relevant to DA classification. First, the specification defines constraints on the composition of DAs. There are constraints on how concepts are allowed to pair with speech acts, as well as ordering constraints on how concepts are allowed to combine to form a valid concept sequence. These constraints can be used to eliminate illegal DAs during classification. The second important element of the IF specification is the definition of how arguments are licensed by speech acts and concepts. In order for an IF to be valid, at least one speech act or concept in the DA must license each argument.</Paragraph>
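A minimal sketch of how such a specification might be consulted during classification, assuming a simple table-driven encoding of the constraints; the paper does not describe the actual format of the IF specification, and all tables and names below are hypothetical.

```python
# Table-driven sketch of IF-specification checks; the concrete encoding of the
# specification and all entries below are assumptions for illustration only.

ALLOWED_CONCEPTS = {"give-information": {"availability", "price"}}  # concepts legal under each speech act
ALLOWED_CONCEPT_PAIRS = {("availability", "price")}                 # legal adjacent concept orderings
LICENSES = {                                                        # arguments licensed by each element
    "give-information": {"who="},
    "availability": {"object-name=", "time="},
    "price": {"price=", "object-name="},
}

def is_legal_da(speech_act, concepts):
    # A DA is legal if every concept may pair with the speech act and every
    # adjacent pair of concepts respects the ordering constraints.
    if any(c not in ALLOWED_CONCEPTS.get(speech_act, set()) for c in concepts):
        return False
    return all(pair in ALLOWED_CONCEPT_PAIRS for pair in zip(concepts, concepts[1:]))

def licensed_arguments(speech_act, concepts, arguments):
    # An argument is licensed if at least one element of the DA licenses it.
    licensers = [speech_act, *concepts]
    return {a for a in arguments if any(a in LICENSES.get(l, set()) for l in licensers)}
```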
<Paragraph position="2"> The prototype analyzer uses the IF specification to aid classification and to guarantee that a valid IF representation is produced. The speech act and concept sequence classifiers each provide a ranked list of possible classifications. When the best speech act and concept sequence combine to form an illegal DA, or form a legal DA that does not license all of the arguments, the analyzer attempts to find the next best legal DA that licenses the most arguments. Each of the alternative concept sequences (in ranked order) is combined with each of the alternative speech acts (in ranked order). For each possible legal DA, the analyzer checks whether all of the arguments found during parsing are licensed.</Paragraph> <Paragraph position="3"> If a legal DA is found that licenses all of the arguments, then the process stops. If not, one additional fallback strategy is used. The analyzer then tries to combine the best classified speech act with each of the concept sequences that occurred in the training data, sorted by their frequency of occurrence. Again, the analyzer checks whether each legal DA licenses all of the arguments and stops if such a DA is found. If this step fails to produce a legal DA that licenses all of the arguments, the best-ranked DA that licenses the most arguments is returned. In this case, any arguments that are not licensed by the selected DA are removed. This approach is used because it is generally better to select an alternative DA and retain more arguments than to keep the best DA and lose the information represented by the arguments. An evaluation of this strategy is presented in section 6.</Paragraph> </Section> </Section> <Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 5 Grammar Development and Classifier Training </SectionTitle> <Paragraph position="0"> During grammar development, it is generally useful to see how changes to the grammar affect the IF representations produced by the analyzer. In a purely grammar-based analysis approach, full interlingua representations are produced as the result of parsing, so testing new grammars simply requires loading them into the parser. Because the grammars used in our hybrid approach parse at the argument level, testing grammar modifications at the complete IF level requires retraining the segmentation model and the DA classifiers.</Paragraph> <Paragraph position="1"> When new grammars are ready for testing, utterance-IF pairs for the appropriate language are extracted from the training database. Each utterance-IF pair in the training data consists of a single SDU with a manually annotated IF. Using the new grammars, the argument parser is applied to each utterance to produce an argument parse.</Paragraph> <Paragraph position="2"> The counts used by the segmentation model are then recomputed based on the new argument parses. Since each utterance contains a single SDU, the counts C([* A]) and C([A *]) can be computed directly from the first and last arguments in the parse, respectively.</Paragraph> <Paragraph position="3"> Next, the training examples for the DA classifiers are constructed. Each training example for the speech act classifier consists of the speech act from the annotated IF and a vector of binary features with a positive value set for each argument or pseudo-argument label that occurs in the argument parse. The training examples for the concept sequence classifier are similar, with the addition of the annotated speech act to the feature vector. After the training examples are constructed, new classifiers are trained.</Paragraph>
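The sketch below makes this feature encoding concrete and pairs it with a toy memory-based learner in the spirit of TiMBL; the real system uses TiMBL itself, and the label inventory and all names here are hypothetical. The ranked output also corresponds to the ranked classification lists consumed by the fallback strategy of section 4.4.

```python
# Illustrative feature encoding and toy memory-based (nearest neighbor) learner.
# The real system uses TiMBL; the label inventory below is hypothetical.
ARG_LABELS = ["greeting", "who=", "to-whom=", "object-name=", "unavailability"]

def feature_vector(parse_labels, extra=()):
    # One binary feature per argument/pseudo-argument label, plus any extra
    # features (the concept sequence classifier appends the speech act).
    return tuple(1 if lab in parse_labels else 0 for lab in ARG_LABELS) + tuple(extra)

class MemoryBasedClassifier:
    # Memory-based learning: store all training examples and classify by
    # nearest neighbor under Hamming (feature overlap) distance.
    def __init__(self):
        self.examples = []  # list of (features, label) pairs

    def train(self, features, label):
        self.examples.append((features, label))

    def classify_ranked(self, features):
        # Return candidate labels ranked by the distance of their closest example.
        dist = lambda f: sum(x != y for x, y in zip(f, features))
        ranked, seen = [], set()
        for f, label in sorted(self.examples, key=lambda ex: dist(ex[0])):
            if label not in seen:
                seen.add(label)
                ranked.append(label)
        return ranked

# Two-stage classification as described above: the speech act is assigned
# first and then added as a feature for concept sequence classification.
speech_act_clf = MemoryBasedClassifier()
concept_seq_clf = MemoryBasedClassifier()
```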
<Paragraph position="4"> Two tools are available to support easy testing during grammar development. First, the entire training process can be run using a single script; retraining for a new grammar simply requires running the script with pointers to the new grammars. Second, a special development mode of the translation servers allows the grammar writers to load development grammars along with their corresponding segmentation model and DA classifiers. The translation server supports input in the form of individual utterances or files and allows the grammar developers to inspect the results of each stage of the analysis process.</Paragraph> </Section> <Section position="7" start_page="0" end_page="0" type="metho"> <SectionTitle> 6 Evaluation </SectionTitle> <Paragraph position="0"> We present the results of recent experiments that measure the performance of the analyzer components and of end-to-end translation using the analyzer. We also report the results of an ablation experiment that used earlier versions of the analyzer and IF specification.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 6.1 Translation Experiment </SectionTitle> <Paragraph position="0"> Tables 1 and 2 show end-to-end translation results for the NESPOLE! system. In this experiment, the input was a set of English utterances. The utterances were paraphrased back into English via the interlingua (Table 1) and translated into Italian (Table 2). The data used to train the DA classifiers consisted of 3350 SDUs annotated with IF representations. The test set contained 151 utterances consisting of 332 SDUs from 4 unseen dialogues. Translations were compared to human transcriptions and graded as described in (Levin et al., 2000). A grade of perfect, ok, or bad was assigned to each translation by human graders; a grade of perfect or ok is considered acceptable. The tables show the average of the grades assigned by three graders.</Paragraph> <Paragraph position="1"> The row in Table 1 labeled SR Hypotheses shows the grades when the speech recognizer output is compared directly to human transcripts. As these grades show, recognition errors can be a major source of unacceptable translations. These grades provide a rough upper bound on the translation performance that can be expected when using input from the speech recognizer, since meaning lost due to recognition errors cannot be recovered. The rows labeled Translation from Transcribed Text show the results when human transcripts are used as input. These grades reflect the combined performance of the analyzer and generator. The rows labeled Translation from SR Hypotheses show the results when the speech recognizer produces the input utterances. As expected, translation performance was worse with the introduction of recognition errors.</Paragraph> <Paragraph position="2"> Table 3 shows the performance of the segmentation model on the test set. The SDU boundary positions assigned automatically were compared with manually annotated positions.</Paragraph>
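The paper does not spell out the metric used for this comparison; a simple reading, sketched below, treats the automatic and manual boundary positions as sets and computes precision and recall over them.

```python
# Hypothetical scoring of automatic SDU boundaries against manual annotation,
# assuming boundaries are represented as sets of positions between parse trees.
def boundary_scores(predicted, gold):
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall
```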
<Paragraph position="3"> Table 4 shows the performance of the DA classifiers, and Table 5 shows the frequency of the most common DA, speech act, and concept sequence in the test set. Transcribed utterances were used as input and were segmented into SDUs before analysis. This experiment is based on only 293 SDUs; for the remaining SDUs in the test set, it was not possible to assign a valid representation based on the current IF specification.</Paragraph> <Paragraph position="4"> These results demonstrate that it is not always necessary to find the canonical DA to produce an acceptable translation. This can be seen by comparing the Domain Action accuracy from Table 4 with the Transcribed grades from Table 1. Although the DA classifiers produced the canonical DA only 43% of the time, 58% of the translations were graded as acceptable.</Paragraph> <Paragraph position="5"> In order to examine the effects of using IF specification constraints, we looked at the 182 SDUs that were not parsed by the cross-domain grammar and thus required DA classification. Table 6 shows how many DAs, speech acts, and concept sequences were changed as a result of using the constraints. DAs were changed either because the DA was illegal or because the DA did not license some of the arguments. Without the IF specification, 4% of the SDUs would have been assigned an illegal DA, and 29% of the SDUs (those with a changed DA) would have been assigned an invalid IF. Furthermore, without the IF specification, 0.38 arguments per SDU would have to be dropped, while only 0.07 arguments per SDU were dropped when using the fallback strategy. The mean number of arguments per SDU was 1.47.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 6.2 Ablation Experiment </SectionTitle> <Paragraph position="0"> Figure 2 shows the results of an ablation experiment that examined the effect of varying the training set size on DA classification accuracy. Each point represents the average accuracy using a 16-fold cross-validation setup.</Paragraph> <Paragraph position="1"> The training data contained 6409 SDU-interlingua pairs. The data were randomly divided into 16 test sets containing 400 examples each. In each fold, the remaining data were used to create training sets containing 500, 1000, 2000, 3000, 4000, 5000, and 6009 examples.</Paragraph> <Paragraph position="2"> The performance of the classifiers appears to begin leveling off at around 4000 training examples. These results seem promising with regard to the portability of the DA classifiers, since a data set of this size could be constructed in a few weeks.</Paragraph>
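The fold construction just described can be made concrete with a short sketch; train_and_score is a placeholder for training the two DA classifiers on the given examples and measuring their accuracy, and all names are illustrative.

```python
# Sketch of the 16-fold ablation setup described above (names are illustrative).
import random

def train_and_score(train_pairs, test_pairs):
    # Placeholder: train the speech act and concept sequence classifiers on
    # train_pairs, then return DA classification accuracy on test_pairs.
    ...

def ablation(pairs, n_folds=16, test_size=400,
             sizes=(500, 1000, 2000, 3000, 4000, 5000, 6009)):
    random.shuffle(pairs)
    results = {size: [] for size in sizes}
    for i in range(n_folds):
        test = pairs[i * test_size:(i + 1) * test_size]
        rest = pairs[:i * test_size] + pairs[(i + 1) * test_size:]  # 6009 pairs when len(pairs) == 6409
        for size in sizes:
            results[size].append(train_and_score(rest[:size], test))
    # Average accuracy over folds for each training set size.
    return {size: sum(accs) / len(accs) for size, accs in results.items()}
```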
</Section> </Section> <Section position="8" start_page="0" end_page="0" type="metho"> <SectionTitle> 7 Related Work </SectionTitle> <Paragraph position="0"> Lavie et al. (1997) developed a method for identifying SDU boundaries in a speech-to-speech translation system. Identifying SDU boundaries is also similar to sentence boundary detection. Stevenson and Gaizauskas (2000) use TiMBL (Daelemans et al., 2000) to identify sentence boundaries in speech recognizer output, and Gotoh and Renals (2000) use a statistical approach to identify sentence boundaries in automatic speech recognition transcripts of broadcast speech.</Paragraph> <Paragraph position="1"> Munk (1999) attempted to combine grammars and machine learning for DA classification. In Munk's SALT system, a two-layer HMM was used to segment and label arguments and speech acts, a neural network identified the concept sequences, and semantic grammars were then used to parse each argument segment. One problem with SALT was that the segmentation was often inaccurate and resulted in bad parses. In addition, SALT did not use a cross-domain grammar or an interlingua specification.</Paragraph> <Paragraph position="2"> Cattoni et al. (2001) apply statistical language models to DA classification. A word bigram model is trained for each DA in the training data, and an utterance is labeled with the DA whose model assigns it the highest likelihood. Arguments are identified using recursive transition networks. IF specification constraints are used to find the most likely valid DA and arguments.</Paragraph> </Section> </Paper>