<?xml version="1.0" standalone="yes"?>
<Paper uid="H93-1042">
  <Title>A SPEECH TO SPEECH TRANSLATION SYSTEM BUILT FROM STANDARD COMPONENTS</Title>
  <Section position="4" start_page="0" end_page="218" type="metho">
    <SectionTitle>
2. COMPONENTS AND
INTERFACES
</SectionTitle>
    <Paragraph position="0"> The speech translation process begins with SRI's DE-CIPHER(TM) system, based on hidden Markov modeling and a progressive search \[12, 13\]. It outputs to the source language processor a small lattice of word hypotheses generated using acoustic and language model scores. The language processor, for both English and Swedish, is the SRI Core Language Engine (CLE) \[1\], a unification-based, broad coverage natural language system for analysis and generation. Transfer occurs at the level of quasi logical form (QLF); transfer rules are defined in a simple declarative formalism \[2\]. Speech synthesis is performed by the Swedish Telecom PROPHON  system \[8\], based on stored polyphones. This section describes in more detail these components and their interfaces. null</Paragraph>
    <Section position="1" start_page="217" end_page="217" type="sub_section">
      <SectionTitle>
2.1. Speech Recognition
</SectionTitle>
      <Paragraph position="0"> The first component is a fast version of SRI's DE-CIPHER(TM) speaker-independent continuous speech recognition system \[12\]. It uses context-dependent phonetic-based hidden Markov models with discrete observation distributions for 4 features: cepstrum, deltacepstrum, energy and delta-energy. The models are gender-independent and the system is trained on 19,000 sentences and has a 1381-word vocabulary. The progressive recognition search \[13\] is a three-pass scheme that produces a word lattice and an N-best list for use by the language analysis component. Two recognition passes are used to create a word lattice. During the forward pass, the probabilities of all words that can end at each frame are recorded, and this information is used to prune the word lattice generated in the backward pass. The word lattice is then used as a grammar to constrain the search space of a third recognition pass, which produces an N-best list using an exact algorithm.</Paragraph>
    </Section>
    <Section position="2" start_page="217" end_page="217" type="sub_section">
      <SectionTitle>
2.2. Language Analysis and Generation
</SectionTitle>
      <Paragraph position="0"> Language analysis and generation are performed by the SRI Core Language Engine (CLE), a general natural-language processing system developed at SRI Cambridge \[1\]; two copies of the CLE are used, equipped with English and Swedish grammars respectively. The English grammar is a large, domain-independent unification-based phrase-structure grammar, augmented by a small number of domain-specific rules (Section 3.1). The Swedish grammar is a fairly direct adaptation of the English one (Section 3.3).</Paragraph>
      <Paragraph position="1"> The system's linguistic information is in declarative form, compiled in different ways for the two tasks. In analysis mode, the grammar is compiled into tables that drive a left-corner parser; input is supplied in the form of a word hypothesis lattice, and output is a set of possible semantic analyses expressed in Quasi Logical Form (QLF). QLF includes predicate-argument structure and some surface features, but also allows a semantic analysis to be only partially specified \[3\].</Paragraph>
      <Paragraph position="2"> The set of QLF analyses is then ranked in order of a priori plausibility using a set of heuristic preferences, which are partially trainable from example corpus data (Section 3.2). In generation mode, the linguistic information is compiled into another set of tables, which control a version of the Semantic Head-Driven Generation algorithm \[16\]. Here, the input is a QLF form, and the output is the set of possible surface strings which realize the form. Early forms of the analysis and generation algorithms used are described in \[1\].</Paragraph>
    </Section>
    <Section position="3" start_page="217" end_page="218" type="sub_section">
      <SectionTitle>
2.3. Speech/Language Interface
</SectionTitle>
      <Paragraph position="0"> The interface between speech recognition and source language analysis can be either a 1-best or an N-best interface. In 1-best mode, the recognizer simply passes the CLE a string representing the single best hypothesis. In N-best mode, the string is replaced by a list containing all hypotheses that are active at the end of the third recognition pass. Since the word lattice generated during the first two recognition passes significantly constrains the search space of the third pass, we can have a large number of hypotheses without a significant increase in computation.</Paragraph>
      <Paragraph position="1"> As the CLE is capable of using lattice input directly \[6\], the N-best hypotheses are combined into a new lattice before being passed to linguistic processing; in cases where divergences occur near the end of the utterance, this yields a substantial speed improvement. The different analyses produced are scored using a weighted sum of the acoustic score received from DECIPHER and the linguistic preference score produced by the CLE. When at least one linguistically valid analysis exists, this implicitly results in a selection of one of the N-best hypotheses. Our experimental findings to date indicate that N=5 gives a good tradeoff between speed and accuracy, performance surprisingly being fairly insensitive to the setting of the relative weights given to acoustic and linguistic scoring information. Some performance results are presented in Section 5.</Paragraph>
      <Paragraph position="2"> 2.4. Transfer Unification-based QLF transfer \[2\], compositionally translates a QLF of the source language to a QLF of the target language. QLF is the transfer level of choice in the system, since it is a contextually unresolved semantic representation reflecting both predicate-argument relations and linguistic features such as tense, aspect, and modality. The translation process uses declarative transfer rules containing cross-linguistic data, i.e., it specifies only the differences between the two languages. The monolingual knowledge of grammars, lexica, and preferences is used for ranking alternative target QLFs, filtering out ungrammatical QLFs, and finally generating the source language utterance.</Paragraph>
      <Paragraph position="3"> A transfer rule specifies a pair of QLF patterns; the left hand side matches a fragment of the source language QLF and the right hand side the corresponding target QLF. Table 1 breaks down transfer rules by type. As can been seen, over 90% map atomic constants to atomic constants; of the remainder, about half relate to spe- null cific lexical items, and half are general structural transfer rules. For example, the following rule expresses a mapping of English NPs postnominally modified by a progressive VP (aFiights going to Boston&amp;quot;) to Swedish NPs modified by a relative clause ( &amp;quot;Flygningar som gdr till Boston&amp;quot;):</Paragraph>
      <Paragraph position="5"> land, tr (head), \[island, form(verb(tense=pres ,perf=P, prog=n), tr (mod))2 \] Transfer variables, of the form tr(atom), show how subexpressions in the source QLF correspond to subexpressions in the target QLF. Note how the transition from a tenseless, progressive VP to a present tense, nonprogressive VP can be specified directly through changing the values of the slots of the &amp;quot;verb&amp;quot; term. This fairly simple transfer rule formalism seems to allow most important restructuring phenomena (e.g., change of aspect, object raising, argument switching, and to some extent also head switching) to be specified succinctly. The degree of compositionality in the rule set currently employed is high; normally no special transfer rules are needed to specify combinations of complex transfer. In addition, the vast majority of the rules are reversible, providing for future Swedish to English translation.</Paragraph>
    </Section>
    <Section position="4" start_page="218" end_page="218" type="sub_section">
      <SectionTitle>
2.5. Speech Synthesis
</SectionTitle>
      <Paragraph position="0"> The Prophon speech synthesis system, developed at Swedish Telecom, is an interactive environment for developing applications and conducting research in multi-lingual text-to-speech conversion. The system includes a large lexicon, a speech synthesizer and rule modules for text formatting, syntactic analysis, phonetic transcription, parameter generation and prosody. Two synthesis strategies are included in the system, formant synthesis and polyphone synthesis, i.e., concatenation of speech units of arbitrary size. In the latter case, the synthesizer accesses the database of polyphone speech waveforms according to the allophonic specification derived from the lexicon and/or phonetic transcription rules. The polyphones are concatenated and the prosody of the utteranee is imposed via the PSOLA (pitch synchronous overlap add) signal processing technique \[11\]. The Prophon system has access to information other than the text string, in particular the parse tree, which can be used to provide a better, more natural prosodic structure than normally is possible.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="218" end_page="220" type="metho">
    <SectionTitle>
3. ADAPTATION
</SectionTitle>
    <Paragraph position="0"> In this section, we describe the methods used for adapting the various processing components to the English-Swedish ATIS translation task. Section 3.1 describes the domain customization of the language component, and section 3.2 the semi-automatic method developed to customize the linguistic preference filter. Finally, section 3.3 summarizes the work carried out in adapting the English-language grammar and lexicon to Swedish.</Paragraph>
    <Section position="1" start_page="218" end_page="219" type="sub_section">
      <SectionTitle>
3.1. CLE Domain Adaptation
</SectionTitle>
      <Paragraph position="0"> We begin by describing the customizations performed to adapt the general CLE English grammar and lexicon to the ATIS domain. First, about 500 lexical entries needed to be added. Of these, about 450 were regular content words ( airfare, Boston, seven forty seven, etc.), all of which were added by a graduate student 3 using the interactive VEX lexicon acquisition tool \[7\]. About 55 other entries, not of a regular form, were also added.</Paragraph>
      <Paragraph position="1"> Of these, 26 corresponded to the letters of the alphabet, which were treated as a new syntactic class, 15 or so were interjections (Sure, OK, etc.), and seven were entries for the days of the week, which turned out to have slightly different syntactic properties in American and British English. The only genuinely new entries were for available, round trip, first class, nonstop and one way, all of which failed to fit syntactic patterns previously implemented within the grammar, (e.g. &amp;quot;Flights available from United&amp;quot;, &amp;quot;Flights to Boston first class&amp;quot;).</Paragraph>
      <Paragraph position="2"> Sixteen domain-specific phrase-structure rules were also added, most of them by the graduate student. Of these, six covered 'code' expressions (e.g. &amp;quot;Q X&amp;quot;), and eight covered 'double utterances' (e.g. &amp;quot;Flights to Boston show me the fares&amp;quot;). The remaining two rules covered ordinal expressions without determiners (&amp;quot;Next flight to Boston&amp;quot;), and PP expressions of the form 'Name to Name' (e.g. &amp;quot;Atlanta to Boston Friday&amp;quot;). Finally, the preference metrics were augmented by a preference for attaching 'from-to' PP pairs to the same constituent, (this is a domain-independent heuristic, but is particularly important in the context of the ATIS task), and the semantic collocation preference metrics (Section 3.2) 3Marie-Susanne AgnKs, the graduate student in question, was a competent linguist but had no previous experience with the CLE or other large computational grammars.</Paragraph>
      <Paragraph position="3">  were retrained with ATIS data. The grammar and lexicon customization effort has so far consumed about three person-months of specialist time, and about two and a half person-months of the graduate student. The current level of coverage is indicated in Section 5.</Paragraph>
    </Section>
    <Section position="2" start_page="219" end_page="219" type="sub_section">
      <SectionTitle>
3.2. Training Preference Heuristics
</SectionTitle>
      <Paragraph position="0"> Grammars with thorough coverage of a non-trivial sub-language tend to yield large numbers of analyses for many sentences, and rules for accurately selecting the correct analysis are difficult if not impossible to state explicitly. We therefore use a set of about twenty preference metrics to rank QLFs in order of a priori plausibility. Some metrics count occurrences of phenomena such as adjuncts, ellipsis, particular attachment configurations, or balanced conjunctions. Others, which are trained automatically, reflect the strengths of semantic collocations between triples of logical constants occurring in relevant configurations in QLFs.</Paragraph>
      <Paragraph position="1"> The overall plausibility score for a QLF under this scheme is a weighted (scaled) sum of the scores returned by the individual metrics. Initially, we chose scaling factors by hand, but this became an increasingly skilled and difficult task as more metrics were added, and it was clear that the choice would have to be repeated for other domains. The following semi-automatic optimization procedure \[4\] was therefore developed.</Paragraph>
      <Paragraph position="2"> QLFs were derived for about 4600 context-independent and context-dependent ATIS sentences of 1 to 15 words.</Paragraph>
      <Paragraph position="3"> It is easy to derive from a QLF the set of segments of the input sentence which it analyses as being either predications or arguments. These segments, taken together, effectively define a tree of roughly the form used by the Treebank project \[5\]. A user presented with all strings derived/.from any QLF for a sentence selected the correct tree (if present). A skilled judge was then able to assign trees to hundreds of sentences per hour.</Paragraph>
      <Paragraph position="4"> The &amp;quot;goodness&amp;quot; of a QLF Q with respect to an approved tree T was defined as I(Q,T) - 10. A(Q,T), where I(Q, T) is the number of string segments induced by Q and present in T, and A(Q, T) is the number induced by Q but absent from T. This choice of goodness function was found, by trial and error, to lead to a good correlation with the metrics. Optimization then consisted of minimizing, with respect to scaling factors ej for each preference metric mi, the value of ~(g, - E~ ei*~J) 2 where gl is the goodness of QLF i and sit is the score assigned to QLF i by metric fj ; to remove some &amp;quot;noise&amp;quot; from the data, all values were relativized by subtracting the (average of the) corresponding scores for the best-scoring QLF(s) for the sentence.</Paragraph>
      <Paragraph position="5"> The kth simultaneous equation, derived by setting the derivative of the above expression with respect to ck to zero for the minimum, is ~, s~(gi - Z~ cj,i~) = 0 These equations can be solved by Gaussian elimination.</Paragraph>
      <Paragraph position="6"> The optimized and hand-selected scaling factors each resuited in a correct QLF being selected for about 75% of the 157 sentences from an unseen test set that were within coverage, showing that automatic scaling can produce results as good as those derived by labourand skill-intensive hand-tuning. The value of Kendall's ranking correlation coefficient between the relativized &amp;quot;goodness&amp;quot; values and the scaled sum (reflecting the degree of agreement between the orderings induced by the two criteria) was also almost identical for the two sets of factors. However, the optimized factors achieved much better correlation (0.80 versus 0.58) under the more usual product-moment definition of correlation, o',v/o'xo'v, which the least-squares optimization used here is defined to maximize. This suggests that optimization with respect to a (non-linear) criterion that refleets ranking rather than linear agreement could lead to a still better set of scaling factors that might out-perform both the hand-selected and the least-squaresoptimal ones. A hill-climbing algorithm to determine such factors is therefore being developed.</Paragraph>
      <Paragraph position="7"> The training process allows optimization of scaling factors, and also provides data for several metrics assessing semantic collocations. In our case, we use semantic collocations extracted from QLF expressions in the form of (H1, R, H2) triples where H1 and H2 are the head predicates of phrases in a sentence and R indicates the semantic relationship (e.g. a preposition or an argument position) between the two phrases in the proposed analysis. We have found that a simple metric, original to us, that scores triples according to the average treebank score of QLFs in which they occur, performs about as well as a chi-squared metric, and better than one based on mutual information (of \[9\]).</Paragraph>
    </Section>
    <Section position="3" start_page="219" end_page="220" type="sub_section">
      <SectionTitle>
3.3. CLE Language Adaptation
</SectionTitle>
      <Paragraph position="0"> The Swedish-language customization of the CLE (S-CLE) has been developed at SICS from the English-language version by replacing English-specific modules with corresponding Swedish-language versions. 4  gests that adapting the English system to close languages is fairly easy and straight-forward. The total effort spent on the Swedish adaptation was about 14 person-months (compared with about 20 person-years for the original CLE), resulting in coverage only slightly less than that of the English version.</Paragraph>
      <Paragraph position="1"> The amount of work needed to adapt the various CLE modules to Swedish declined steadily as a function of their &amp;quot;distance&amp;quot; from surface structure. Thus the morphology rules had to be nearly completely rewritten; Swedish morphology is considerably more complex than English. In contrast, only 33 of the 401 Swedish function word entries were not derived from English counterparts, the differences being confined to variance in surface form and regular changes to the values of a small number of features. At the level of syntax, 97 (81%) of a set of 120 Swedish syntax rules were derived from exact or very similar English rules. The most common difference is some small change in the features; for example, Swedish marks for definiteness, which means that this feature often needs to be added. 11 rules (9%) originated in English rules, but had undergone major changes, e.g., some permutation or deletion of the daughters; thus Swedish time rules demand a word-order which in English would be &amp;quot;o'clock five&amp;quot;, and there is a rule that makes an NP out of a bare definite NBAR. This last rule corresponds to the English NP ~ DET NBAR rule, with the DET deleted but the other features instantiated as if it were present. Only 12 (10%) Swedish syntax rules were completely new. The percentage of changed semantic rules was even smaller.</Paragraph>
      <Paragraph position="2"> The most immediately apparent surface divergences between Swedish and English word-order stem from the strongly verb-second nature of Swedish. Formation of both YN- and WH-questions is by simple inversion of the subject and verb without the introduction of an auxiliary, thus for example &amp;quot;Did he fly with Delta?&amp;quot; is &amp;quot;FlSg han rned Delta?&amp;quot;, lit. &amp;quot;Flew he with Delta?&amp;quot;. It is worth noting that these changes can all be captured by doing no more than adjusting features. The main rules that had to be written &amp;quot;from scratch&amp;quot; are those that cover adverbials, negation, conditionals, and the common vad ...fJr construction, e.g., &amp;quot;Vad finns det fJr flygningar till Atlanta&amp;quot; (lit. &amp;quot;What are there for flights to Atlanta&amp;quot;, i.e., &amp;quot;What flights are there to Atlanta?&amp;quot;).</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="220" end_page="221" type="metho">
    <SectionTitle>
4. RATIONAL DEVELOPMENT
METHODOLOGY
</SectionTitle>
    <Paragraph position="0"> In a project like this one, where software development is taking place simultaneously at several sites, regular testing is important to ensure that changes retain inter-component compatibility. Our approach is to maintain a set of test corpora to be run through the system (from text analysis to text generation) whenever a significant change is made to the code or data. Changes in the status of a sentence - the translation it receives, or the stage at which it fails if it receives no translation - are notified to developers, which facilitates bug detection and documentation of progress.</Paragraph>
    <Paragraph position="1"> The most difficult part of the exercise is the construction of the test corpora. The original training/development corpus is a 4600-sentence subset of the ATIS corpus consisting of sentences of length not more than 15 words.</Paragraph>
    <Paragraph position="2"> For routine system testing, this corpus is too large to be convenient; if a randomly chosen subset is used instead, it is often difficult to tell whether processing failures are important or not, in the sense of representing problems that occur in a large number of corpus sentences. What is needed is a sub-corpus that contains all the commonly occurring types of construction, together with an indication of how many sentences each example in the sub-corpus represents.</Paragraph>
    <Paragraph position="3"> We have developed a systematic method for constructing representative sub-corpora, using &amp;quot;Explanation Based Learning&amp;quot; (EBL) \[15\]. The original corpus is parsed, and the resulting analysis trees are grouped into equivalence classes; then one member is chosen from each class, and stored with the number of examples it represents. In the simplest version, trees are equivalent if their leaves are of the same lexical types. The criterion for equivalence can be varied easily: we have experimented with schemes where all sub-trees representing NPs are deemed to be equivalent. When generalization is performed over non-lexical classes like NPs and PPs, the method is used recursively to extract representative examples of each generalized class.</Paragraph>
    <Paragraph position="4"> At present, three main EBL-derived sub-corpora are used for system testing. Corpus 1, used most frequently, was constructed by generalizing at the level of lexical items, and contains one sentence for each class with at least three members. This yields a corpus of 281 sentences, which together represent 1743 sentences from the original corpus. Corpus 2, the &amp;quot;lexical&amp;quot; test corpus, is a set with one analyzable phrase for each lexical item occuring at least four times in the original corpus, comprising a total of 460 phrases. Corpus 3 generalizes over NPs and PPs, and analyzes NPs by generalizing over non-recursive NP and PP constituents; one to five examples are included for each class that occurs ten or more times (depending on the size of the class), giving 244 examples. This corpus is useful for finding problems linked with constructions specific to either the NP or the sentence level, but not to a combination. The time needed to process each corpus through the system is on  the order of an hour.</Paragraph>
  </Section>
class="xml-element"></Paper>