<?xml version="1.0" standalone="yes"?>
<Paper uid="W97-0405">
  <Title>A Formal Basis for Spoken Language Translation by Analogy</Title>
  <Section position="4" start_page="32" end_page="33" type="metho">
    <SectionTitle>
2 A Hybrid Analogical Approach
</SectionTitle>
    <Paragraph position="0"> Since language is productive, a realistic analogical system needs to be able to handle linguistic constructions that do not have an exact match in the example database. Therefore it is important for a system to be able to combine fragments from more than one example expression to cover the input expression. To meet this requirement, we have designed an architecture for robust, practical translation of spoken language in limited domains that integrates morphological and syntactic linguistic processing with an analogical transfer component. The overall system is described briefly in this section.</Paragraph>
    <Section position="1" start_page="32" end_page="32" type="sub_section">
      <SectionTitle>
2.1 System Architecture
</SectionTitle>
      <Paragraph position="0"> The pipelined system architecture, shown in Figure 2, separates speech recognition, morphological  analysis, shallow parsing, and recursive analogical translation into different modules. This separation of general linguistic, domain, and transfer knowledge improves portability and scalability of the system.</Paragraph>
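To make the pipelined separation concrete, the sketch below strings the modules together as successive stages; all function names are hypothetical placeholders rather than the system's actual interfaces.

```python
# Illustrative sketch of the pipelined architecture in Figure 2; every function
# name here is a hypothetical placeholder for the corresponding module.

def translate_utterance(audio,
                        recognize_speech,      # speech recognition
                        analyze_morphology,    # e.g. JUMAN-style analysis
                        shallow_parse,         # augmented CFG / GLR shallow parsing
                        analogical_transfer,   # recursive matching against examples
                        generate_target,       # target morphology + linearization
                        synthesize_speech):    # e.g. a DECtalk-style synthesizer
    """Run one utterance through the pipeline; each stage is passed in as a
    callable so that general linguistic, domain, and transfer knowledge stay
    in separate modules."""
    words = recognize_speech(audio)
    morphemes = analyze_morphology(words)
    source_tree = shallow_parse(morphemes)
    target_tree = analogical_transfer(source_tree)
    target_text = generate_target(target_tree)
    return synthesize_speech(target_text)
```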
    </Section>
    <Section position="2" start_page="32" end_page="32" type="sub_section">
      <SectionTitle>
2.2 Shallow Source Language Analysis
</SectionTitle>
      <Paragraph position="0"> The purpose of the shallow analysis component is to identify clauses and phrases, to identify modifying relations as long as they are unambiguous (deriving a canonical interpretation in ambiguous cases), and to convert some surface variations into features.</Paragraph>
      <Paragraph position="1"> In our prototype implementation, an adapted version of the JUMAN 3.1 Japanese morphological analyzer (Kurohashi et al., 1994) is used for Part-of-Speech disambiguation, and for dictionary and thesaurus look-up.</Paragraph>
      <Paragraph position="2"> The second step of source language analysis is carried out using an augmented context-free grammar for the NLYAcc parser (Ishii, Ohta, and Saito, 1994), which is an implementation of the Generalized LR parsing algorithm (Tomita, 1985). The shallow analysis module returns a shallow syntactic parse tree with various lexical and syntactic features. It is robust enough to tolerate extragrammaticalities, disfluencies, and the like in the input.</Paragraph>
    </Section>
    <Section position="3" start_page="32" end_page="33" type="sub_section">
      <SectionTitle>
2.3 Analogical Transfer
</SectionTitle>
      <Paragraph position="0"> The recursive analogical transfer module matches the input shallow syntactic tree against the source language portions of example shallow syntactic trees.</Paragraph>
      <Paragraph position="1"> The example data is classified into different linguistic constituent levels, such as clause-level examples, phrase-level examples, and word-level examples. The system tries to match the input against examples of the largest unit.</Paragraph>
      <Paragraph position="2"> Once the system finds the best matching examples of the largest unit, it checks whether there are portions that differ significantly between the input and the example. If so, the system performs the analogical matching process again on the identified portion from the input, using examples of the corresponding smaller unit. This recursive process continues until all parts have been matched. Finally, the target language portions of the selected best matches are combined to form the complete target language expression.</Paragraph>
      <Paragraph position="3"> The analogical matching step is based on a probabilistic formalization of matching by analogy. The details of this model are described in Section 4.</Paragraph>
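The recursive matching-and-substitution process above can be sketched roughly as follows. This is an illustrative reading of the description, not the system's actual code; ExampleDB, find_best_match, differing_spans, and replace are hypothetical names.

```python
# Rough sketch of recursive analogical transfer. ExampleDB and all methods on
# it are hypothetical placeholders for the probabilistic matcher of Section 4.

UNIT_ORDER = ["clause", "phrase", "word"]  # largest to smallest constituent unit


def transfer(input_tree, example_db, level=0):
    """Translate input_tree by matching it against examples of one constituent
    level, then recursively re-matching the portions that differ."""
    best = example_db.find_best_match(input_tree, UNIT_ORDER[level])
    target = best.target_tree.copy()

    # Portions of the input that differ significantly from the example source
    # are re-translated using examples of the next smaller unit.
    if level + 1 < len(UNIT_ORDER):
        for input_part, target_slot in best.differing_spans(input_tree):
            target.replace(target_slot, transfer(input_part, example_db, level + 1))

    return target
```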
    </Section>
    <Section position="4" start_page="33" end_page="33" type="sub_section">
      <SectionTitle>
2.4 Target Language Generation
</SectionTitle>
      <Paragraph position="0"> The target language generation module is designed to perform a number of linguistic operations, such as enforcing subject-verb agreement, ensuring that required definiteness information is present (such as English determiners, quantifiers, or possessives), and generating the appropriate inflectional morphology.</Paragraph>
      <Paragraph position="1"> In our prototype implementation, we are using the PC-KIMMO system for generating English morphology (Antworth, 1990). After these operations, the shallow syntactic tree is linearized to create an expression in the target language.</Paragraph>
    </Section>
    <Section position="5" start_page="33" end_page="33" type="sub_section">
      <SectionTitle>
2.5 Speech Synthesis
</SectionTitle>
      <Paragraph position="0"> In the final step, spoken output is generated from the target language expression. In our Japanese-English prototype, this step is carried out by the DECTALK system (Hallahan, 1996).</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="33" end_page="34" type="metho">
    <SectionTitle>
3 Advantages of the Hybrid
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="33" end_page="33" type="sub_section">
      <SectionTitle>
Analogical Approach
</SectionTitle>
      <Paragraph position="0"> The hybrid approach combines analogical matching and transfer with a rule-based component that accounts for one of the fundamental properties of language: its productiveness. This section describes what we perceive to be the main advantages of the hybrid analogical approach to speech translation.</Paragraph>
    </Section>
    <Section position="2" start_page="33" end_page="33" type="sub_section">
      <SectionTitle>
3.1 Modular, Natural Knowledge Sources
</SectionTitle>
      <Paragraph position="0"> The system architecture separates general linguistic knowledge, domain knowledge, and transfer knowledge. This makes it easier to port the system to different domains and to apply it to new languages.</Paragraph>
      <Paragraph position="1"> We also consider the knowledge sources to be &amp;quot;natural&amp;quot;. By this we mean that, from the point of view of knowledge representation, each knowledge source captures certain aspects of the translation process in its most natural form. For example, the example database captures translation correspondences in a natural way - by means of corresponding natural language expressions in the source and target languages. Other, less natural means of knowledge representation would require significantly more effort to acquire and maintain. As a result, it is easier to improve the translation quality by adding and modifying examples, and by modifying the thesaurus (if necessary).</Paragraph>
    </Section>
    <Section position="3" start_page="33" end_page="33" type="sub_section">
      <SectionTitle>
3.2 Examples vs. Syntactic or Semantic Grammars
</SectionTitle>
      <Paragraph position="0"> As described above, analogical translation relies on a database of example pairs which can encode idiomatic translation correspondences at the lexical, phrasal, and clausal levels in a natural way. This is an improvement over previous approaches which rely on syntactic or semantic grammars.</Paragraph>
      <Paragraph position="1"> For example, the &amp;quot;transfer-driven&amp;quot; approach of (Sobashima et al., 1994) relies on essentially syntactically-based analysis and transfer rules that are manually annotated with examples, without providing a sound formal basis for analogical matching. This requires an extensive effort to create a body of rules that covers all possible expressions, and which can handle extragrammatical or disfluent input. As an example of a semantic-grammar based approach,</Paragraph>
      <Paragraph position="2"> &amp;quot;concept-based&amp;quot; translation (Mayfield et al., 1995) requires an extensive manual knowledge acquisition effort to create detailed, domain and task-specific templates and semantic grammars. In addition, a heavily semantics-based approach such as this suffers from a lack of generality due to the absence of linguistic processing.</Paragraph>
    </Section>
    <Section position="4" start_page="33" end_page="33" type="sub_section">
      <SectionTitle>
3.3 Examples vs. Interlingua
</SectionTitle>
      <Paragraph position="0"> The framework of Interlingua-based translation rests on the presupposition that there can be a universal, unambiguous, language-neutral, and practically (if not formally) sound knowledge representation formalism to mediate between source and target languages. In practice, defining, maintaining, and extending such a formalism for multiple, not closely related languages has proved to be a major challenge. Analogical speech translation does not rely on this presupposition, and instead seeks to capture intuitive translation correspondences.</Paragraph>
    </Section>
    <Section position="5" start_page="33" end_page="34" type="sub_section">
      <SectionTitle>
3.4 Syntactic and Lexical Distance
</SectionTitle>
      <Paragraph position="0"> In the hybrid analogical approach, the example data is categorized by linguistic constituent. For example, there are translation example pairs at the clause level, phrase level, and word level. This yields a more efficient search procedure during the matching process, while only assuming non-controversial notions of syntactic constituency. By treating syntactic similarity and semantic similarity as two separate aspects of the matching process, we derive an improvement over methods that combine these two aspects. For example, (Sato and Nagao, 1990) combine a measure of structural similarity with a measure of word distance in order to obtain the overall distance measure that is used for matching.</Paragraph>
    </Section>
    <Section position="6" start_page="34" end_page="34" type="sub_section">
      <SectionTitle>
3.5 Computational Efficiency
</SectionTitle>
      <Paragraph position="0"> Analogical translation relies on a large database of example pairs. This incurs a significant computational cost for searching and matching against all the examples, which is proportional to the number of examples multiplied by the average size of the representations of the examples. (In practice, this cost can be mitigated somewhat by clustering and indexing schemes for the example database.) Hybrid analogical translation greatly reduces the number of required examples by relying on the generality of linguistic rules.</Paragraph>
      <Paragraph position="1"> Pure statistical machine translation (Brown et al., 1993) must in principle recover the most probable alignment out of all possible alignments between the input and a translation. While this approach is theoretically intriguing, it has yet to be shown to be computationally feasible in practice.</Paragraph>
    </Section>
    <Section position="7" start_page="34" end_page="34" type="sub_section">
      <SectionTitle>
3.6 Linguistic Efficiency
</SectionTitle>
      <Paragraph position="0"> In addition to computational efficiency, we also consider a factor that might be called &amp;quot;linguistic efficiency&amp;quot;. We hold that a significant body of systematic linguistic regularities has been identified that must be accounted for somehow during the process of translation. Linguistic efficiency refers to the notion of how efficient the system is with regard to these regularities.</Paragraph>
      <Paragraph position="1"> In hybrid analogical translation, the use of a morphological and syntactic module for shallow analysis to derive a linguistic representation with syntactic and lexical features allows us to handle phenomena such as inflections, transformations, and language-specific phenomena (such as the English determiner system and certain Japanese constructions that encode politeness information) in a linguistically efficient manner.</Paragraph>
    </Section>
    <Section position="8" start_page="34" end_page="34" type="sub_section">
      <SectionTitle>
3.7 Translation Adequacy
</SectionTitle>
      <Paragraph position="0"> In order to be able to provide stylistically and pragmatically adequate translations of spoken language, it is not sufficient to merely ignore or tolerate extragrammaticalities in the input; in many cases, the information carried by such phenomena must be reflected in the target language output. The hybrid analogical approach is able to model such phenomena using probabilistic operators, which are explained in more detail in the next section.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="34" end_page="36" type="metho">
    <SectionTitle>
4 A Probabilistic Model for
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="34" end_page="34" type="sub_section">
      <SectionTitle>
Analogical Matching
</SectionTitle>
      <Paragraph position="0"> When applied to spoken language, the central step in analogical translation is a robust matching step that compares the output of the speech recognition component with the contents of the example database.</Paragraph>
      <Paragraph position="1"> This section presents the probabilistic model that provides a formal basis for this matching step.</Paragraph>
    </Section>
    <Section position="2" start_page="34" end_page="34" type="sub_section">
      <SectionTitle>
4.1 Notation
</SectionTitle>
      <Paragraph position="0"> Let I denote the input expression, consisting of a sequence of words along with certain features resulting from shallow parsing. Thus, an input expression I consists of a sequence of words iw_1, iw_2, ..., iw_n and a set of features if_1, if_2, ..., if_m. Similarly, let the source expression E of an example pair consist of ew_1, ew_2, ..., ew_p and ef_1, ef_2, ..., ef_q.</Paragraph>
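As a concrete (hypothetical) rendering of this notation, an expression can be held as a word list plus a feature dictionary keyed by feature name; the words and the feature shown below are invented for illustration.

```python
# Hypothetical data structures for the notation of Section 4.1: an expression
# is a word sequence plus shallow-parse features indexed by name.
from dataclasses import dataclass, field

@dataclass
class Expression:
    words: list                                    # iw_1..iw_n for the input; ew_1..ew_p for an example
    features: dict = field(default_factory=dict)   # if_1..if_m / ef_1..ef_q, keyed by feature name

# Invented illustrative input and example source expressions:
I = Expression(words=["a", "single", "room", "please"], features={"politeness": "plain"})
E = Expression(words=["a", "double", "room", "please"], features={"politeness": "plain"})
```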
    </Section>
    <Section position="3" start_page="34" end_page="35" type="sub_section">
      <SectionTitle>
4.2 The Noisy Channel Model
</SectionTitle>
      <Paragraph position="0"> The &amp;quot;noisy-channel&amp;quot; model from information theory has proven highly effective in speech recognition and, more recently, in language understanding (Epstein et al., 1996; Miller et al., 1994). We adopt this model for translation by analogy in the following manner.</Paragraph>
      <Paragraph position="1"> Given an input expression, the analogical matching algorithm must determine the example expression that is closest in meaning to the input expression. We denote the probability that an example expression is appropriate for translating some input as the conditional probability of the example, given the input:</Paragraph>
      <Paragraph position="2"> (1) P(E | I)</Paragraph>
      <Paragraph position="3"> Our aim is to find the example that has the highest conditional probability of being appropriate to translate the given input. We denote that example with E_max, where the max function chooses the example with the maximum conditional probability: (2) E_max = max_{E in Examples} [P(E | I)] Our approach to determining E_max is as follows. First, we can use Bayes' Law to obtain a re-expression of the conditional probability that needs to be maximized: (3) P(E | I) = P(E) P(I | E) / P(I)</Paragraph>
      <Paragraph position="5"> Since the input expression, and therefore P(I), remains constant over different examples, we can disregard the term P(I) in the denominator. Thus, we need to determine E_max, which can be defined as follows: (4) E_max = max_{E in Examples} [P(E) P(I | E)] The probability distribution over the examples P(E) encodes the prior probability of using the different examples to translate expressions in the domain. It can be used to penalize certain specialized expressions that should be used less frequently. The conditional probability distribution is estimated using a &amp;quot;distortion&amp;quot; model of utterances that is described in the next section.</Paragraph>
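A minimal sketch of the resulting search follows, assuming the prior P(E) and the distortion likelihood P(I|E) are supplied by the estimation procedures of Section 5; both functions passed in below are placeholders.

```python
import math

def best_example(input_expr, examples, prior, likelihood):
    """Return E_max = argmax over examples of P(E) * P(I | E), as in equation (4).

    `prior(ex)` and `likelihood(input_expr, ex)` are assumed to be provided
    elsewhere (Section 5); log space avoids underflow for long expressions."""
    return max(examples,
               key=lambda ex: math.log(prior(ex)) + math.log(likelihood(input_expr, ex)))
```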
    </Section>
    <Section position="4" start_page="35" end_page="35" type="sub_section">
      <SectionTitle>
4.3 Viewing Input as Distorted Examples
</SectionTitle>
      <Paragraph position="0"> The conditional probability distribution P(I|E) is modeled as follows. Consider that the speaker intends to express an underlying message S, but speech errors, certain speech properties, misrecognitions, and other factors interfere, resulting in the actual utterance I, which forms the input to the translation system (Figure 3). This is modeled using a number of &amp;quot;distortion&amp;quot; operators:  * echo-word(ew_i). This operator simply echoes the ith word, ew_i, from the example to the input.</Paragraph>
      <Paragraph position="1"> * delete-word(ew_i). This operator deletes the ith word, ew_i, from the example.</Paragraph>
      <Paragraph position="2"> * add-word(iw_j). This operator adds the jth word, iw_j, to the input.</Paragraph>
      <Paragraph position="3"> * alter-word(ew_i, iw_j). This operator alters the ith word, ew_i, from the example to the jth word, iw_j, in the input expression. The altered word is different, but usually semantically somewhat similar.</Paragraph>
      <Paragraph position="4"> * Corresponding operators for features.</Paragraph>
      <Paragraph position="5">  Given these operators, we can view the input I as an example E to which a number of distortion operators have been applied. Thus, we can represent an input expression I as an example plus a set of distortion operators: (5) I = {E, distort_1, ..., distort_z} This means that we can re-express the conditional probability distribution for an input expression I, given that the meaning expressed by example E is intended, as follows:</Paragraph>
      <Paragraph position="7"> (6) P(I | E) = P(distort_1, ..., distort_z | E)</Paragraph>
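One hedged way to encode the operator set of equation (5) in code is shown below; the representation and the example words are invented for illustration and are not taken from the paper.

```python
# Hypothetical encoding of the distortion operators; the feature operators
# (echo/delete/add/alter-feature) would be represented analogously.
from typing import NamedTuple, Optional

class Distortion(NamedTuple):
    op: str                       # "echo", "delete", "add", or "alter"
    example_word: Optional[str]   # ew_i, when the operator refers to the example
    input_word: Optional[str]     # iw_j, when the operator refers to the input

# An input I represented as an example E plus distortion operators, as in (5);
# the words here are invented placeholders.
distortions = [
    Distortion("echo", "room", "room"),
    Distortion("delete", "please", None),
    Distortion("add", None, "uh"),
    Distortion("alter", "single", "double"),
]
```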
    </Section>
    <Section position="5" start_page="35" end_page="35" type="sub_section">
      <SectionTitle>
4.4 Operator Independence Assumption
</SectionTitle>
      <Paragraph position="0"> Two independence assumptions are required to make this model computationally feasible. For the first assumption, we assume that the individual distortion operators are conditionally independent, given the example expression E, so that P(distort_1, ..., distort_z | E) is the product of the individual operator probabilities P(distort_k | E).</Paragraph>
    </Section>
    <Section position="6" start_page="35" end_page="36" type="sub_section">
      <SectionTitle>
4.5 Operator Localization Assumption
</SectionTitle>
      <Paragraph position="0"> For the second assumption, we make the assumption that the individual distortion operators only depend on the words and features that they directly involve.</Paragraph>
      <Paragraph position="1"> In effect, we stipulate that the operators only affect a strictly local portion of the input. For example, we assume that the probability of echoing a word depends only on the word itself, so that the following holds: (11) P(echo-word(ew_i) | ew_1, ..., ew_p; ef_1, ..., ef_q) = P(echo-word(ew_i) | ew_i)</Paragraph>
      <Paragraph position="2"> Similarly, we assume that the probability of, e.g., deleting a feature depends only on the feature itself, so that the following holds: (12) P(delete-feature(ef_i) | ew_1, ..., ew_p; ef_1, ..., ef_q) = P(delete-feature(ef_i) | ef_i)</Paragraph>
    </Section>
    <Section position="7" start_page="36" end_page="36" type="sub_section">
      <SectionTitle>
4.6 Computing the Match
</SectionTitle>
      <Paragraph position="0"> Given an input I and an expression E, it is straightforward to determine the probability of the feature distortion, since the features are indexed by name. In order to determine the probability of the word distortions, we must find the most probable set of distortion operators. Given an input and an example, there are many different sets of distortion operators that could relate the two. Of course, we are interested in the most straightforward relation between the two, which corresponds to the least cost or highest probability. To further complicate matters, there may not be a single unique set of distortion operators with a unique minimum cost (corresponding to a unique maximum probability); instead, there may be a number of distortion sets that all share the same minimal cost (and maximal probability).</Paragraph>
      <Paragraph position="1"> In this case, we are content to choose one of the minimal cost sets at random. This set is defined as follows: (15) Distort_max = max_{Distort} [P(Distort | E, I)] We solve this problem with a dynamic programming algorithm that finds a set of distortion operators with maximal probability. First, to obtain a distance measure, we take the negative logarithm of this expression: (16) -log P(Distort | E, I) Given that we have assumed independence between individual distortion operators above, this can be simplified as follows: (17) -log prod_{k=1}^{no. of operators} P(distort_k | E, I) We have also assumed that the distortion operators are independent of the part of the sentence that does not directly involve them, so each factor reduces to the probability of the operator given only the words and features it directly involves. This corresponds directly to the individual costs that we use for the dynamic programming equation. Let the example expression be E = ew_1, ew_2, ..., ew_p and the input expression be I = iw_1, iw_2, ..., iw_n. Then, let D(p, n) be the distance between the example and the input. This distance is defined by the following recurrence:</Paragraph>
      <Paragraph position="2"> D(p, n) = min { D(p-1, n) - log P(delete-word(ew_p)), D(p, n-1) - log P(add-word(iw_n)), D(p-1, n-1) - log P(echo-word(ew_p)) if ew_p = iw_n, D(p-1, n-1) - log P(alter-word(ew_p, iw_n)) } The result of this is the optimal alignment between the input and the example, as well as the minimum distance between them. The matcher selects the example with the smallest distance to the input, and assembles the target language portions of the selected example pairs to form a complete translation in the target language.</Paragraph>
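The recurrence can be computed with a straightforward dynamic program. The sketch below is an assumption-laden illustration: the four probability functions are placeholders for the estimates of Section 5, and only the minimum distance (not the back-traced alignment) is returned.

```python
import math

def word_distortion_distance(example_words, input_words,
                             p_echo, p_delete, p_add, p_alter):
    """Compute D(p, n), the minimum distortion cost between an example word
    sequence and an input word sequence, via dynamic programming.

    D[i][j] holds the cost of relating the first i example words to the first
    j input words; the probability functions are assumed to be supplied."""
    p, n = len(example_words), len(input_words)
    D = [[math.inf] * (n + 1) for _ in range(p + 1)]
    D[0][0] = 0.0
    for i in range(p + 1):
        for j in range(n + 1):
            if i > 0:   # delete-word(ew_i)
                D[i][j] = min(D[i][j], D[i - 1][j] - math.log(p_delete(example_words[i - 1])))
            if j > 0:   # add-word(iw_j)
                D[i][j] = min(D[i][j], D[i][j - 1] - math.log(p_add(input_words[j - 1])))
            if i > 0 and j > 0:   # echo-word if identical, otherwise alter-word
                ew, iw = example_words[i - 1], input_words[j - 1]
                prob = p_echo(ew) if ew == iw else p_alter(ew, iw)
                D[i][j] = min(D[i][j], D[i - 1][j - 1] - math.log(prob))
    return D[p][n]
```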
    </Section>
  </Section>
  <Section position="7" start_page="36" end_page="37" type="metho">
    <SectionTitle>
5 Probability Estimation
</SectionTitle>
    <Paragraph position="0"> The method for speech translation by analogy described in this paper was designed to overcome the manual knowledge acquisition bottleneck by relying on techniques from symbolic and statistical machine learning, while still allowing the kind of manual tuning that is necessary to produce high-quality translations.</Paragraph>
    <Section position="1" start_page="36" end_page="36" type="sub_section">
      <SectionTitle>
5.1 Prior Probability Distribution
</SectionTitle>
      <Paragraph position="0"> The prior probability distribution over the example database P(Examples) is used to penalize highly specialized example pairs that should be used less often. After an initial distribution is estimated, these probabilities can be adjusted to solve translation problems due to idiosyncratic examples.</Paragraph>
    </Section>
    <Section position="2" start_page="36" end_page="36" type="sub_section">
      <SectionTitle>
5.2 Alteration Probability Distribution
</SectionTitle>
      <Paragraph position="0"> The two distortion operators alter-word and alter-feature perform the function of matching semantically similar words or feature values. If a monolingual or bilingual corpus from the application domain is available, these probability distributions can be estimated using iterative methods. If neither type of corpus is available, the probabilities can be estimated with the aid of a manually-constructed thesaurus.</Paragraph>
    </Section>
    <Section position="3" start_page="36" end_page="37" type="sub_section">
      <SectionTitle>
5.3 Thesaurus-based Estimation
</SectionTitle>
      <Paragraph position="0"> A thesaurus is a semantic IS-A hierarchy whose nodes are semantic categories, and whose leaves are words. The traditional method of estimating word similarity, based on counting IS-A links, presupposes that every link encodes equal semantic distance, but in practice this is never the case (Resnik, 1995). Thus, we adopt a new method for judging semantic distance between two words. If appropriate distributional information for words is available, then the semantic similarity of two words could be estimated from the entropy of their lowest common dominating node lcdn. In the absence of distributional information, the entropy of a node depends only on the number of words that the node dominates.</Paragraph>
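A self-contained sketch of this idea follows. The paper's exact formula relating the lcdn entropy to the alteration probability is not reproduced in this excerpt, so the distance function below is an assumption: it simply uses the lcdn entropy, computed under a uniform distribution over the words the node dominates.

```python
import math

# Minimal sketch of thesaurus-based semantic distance. ThesaurusNode, lcdn,
# and semantic_distance are illustrative names, not the system's actual API.

class ThesaurusNode:
    def __init__(self, name, children=None, words=None):
        self.name = name
        self.children = children or []   # sub-categories
        self.words = words or []         # leaf words dominated directly

    def dominated_words(self):
        found = list(self.words)
        for child in self.children:
            found.extend(child.dominated_words())
        return found

    def entropy(self):
        # Without distributional data, assume a uniform distribution over the
        # dominated words, so the entropy is log2 of their count.
        return math.log2(max(len(self.dominated_words()), 1))


def lcdn(root, w1, w2):
    """Lowest common dominating node of w1 and w2 (None if not both present)."""
    best = None
    if w1 in root.dominated_words() and w2 in root.dominated_words():
        best = root
        for child in root.children:
            deeper = lcdn(child, w1, w2)
            if deeper is not None:
                best = deeper
    return best


def semantic_distance(root, w1, w2):
    # Larger lcdn entropy (a broader common category) means a larger distance.
    node = lcdn(root, w1, w2)
    return node.entropy() if node else float("inf")
```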
    </Section>
    <Section position="4" start_page="37" end_page="37" type="sub_section">
      <SectionTitle>
5.4 Other Distortion Probabilities
</SectionTitle>
      <Paragraph position="0"> The probability distributions for adding and deleting words and features can also be estimated from corpora, if available. Since there are very few Japanese spoken language corpora available, we are currently adopting a word-class based model for the remaining distributions that uses the categories of &amp;quot;strong content words&amp;quot; (nouns and verbs), &amp;quot;light content words&amp;quot; (adjectives and some adverbs), &amp;quot;grammatical function words&amp;quot; (e.g. particles and conjunctions), and &amp;quot;modifiers and adjuncts&amp;quot;. In addition, it is possible to assign specific lexical penalties to individual words.</Paragraph>
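For concreteness, a hypothetical version of such a word-class model is sketched below; the four classes follow the paper, but the numeric values are invented placeholders rather than estimated probabilities.

```python
# Hypothetical word-class model for, e.g., the delete-word distortion
# probability; the values would be tuned or estimated in a real system.
WORD_CLASS_DELETE_PROB = {
    "strong_content": 0.01,     # nouns and verbs
    "light_content": 0.05,      # adjectives and some adverbs
    "function": 0.20,           # grammatical function words: particles, conjunctions
    "modifier_adjunct": 0.10,   # modifiers and adjuncts
}

LEXICAL_PENALTIES = {}  # optional word-specific overrides ("lexical penalties")

def p_delete_word(word, word_class):
    """Probability of deleting `word`, defaulting to its word-class estimate."""
    return LEXICAL_PENALTIES.get(word, WORD_CLASS_DELETE_PROB[word_class])
```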
    </Section>
    <Section position="5" start_page="37" end_page="37" type="sub_section">
      <SectionTitle>
5.5 Learning Example Pairs
</SectionTitle>
      <Paragraph position="0"> Since the contents of the database are central to achieving high-quality translations, it is usually necessary to adjust the database manually in response to errors in the translation. At the same time, since the example database must be adapted for every new domain, it is important to minimize the amount of manual effort. For this reason, the example database was designed in such a way that it is possible to acquire new examples by a semi-automatic method consisting of an automatic extraction step from a bilingual corpus (see, for example, (Watanabe, 1993)), followed by a manual filtering and refinement step.</Paragraph>
    </Section>
  </Section>
</Paper>