<?xml version="1.0" standalone="yes"?> <Paper uid="W94-0101"> <Title>Qualitative and Quantitative Models of Speech Translation</Title> <Section position="4" start_page="3" end_page="4" type="metho"> <SectionTitle> 4. Quantitative Model Components </SectionTitle> <Paragraph position="0"> Moving to a Quantitative Model In moving to a quantitative architecture, we propose to retain many of the basic characteristics of the qualitative model: * A transfer organization with analysis, transfer, and generation components.</Paragraph> <Paragraph position="1"> * Monolingual models that can be used for both analysis and generation.</Paragraph> <Paragraph position="2"> * Translation models that exclusively code contrastive (cross-linguistic) information.</Paragraph> <Paragraph position="3"> * Hierarchical phrases capturing recursive linguistic structure.</Paragraph> <Paragraph position="4"> Instead of feature-based syntax trees and first-order logical forms we will adopt a simpler, monostratal representation that is more closely related to those found in dependency grammars (e.g. Hudson 1984). Dependency representations have been used in large-scale qualitative machine translation systems, notably by McCord (1988). The notion of a lexical 'head' of a phrase is central to these representations because they concentrate on relations between such lexical heads. In our case, the dependency representation is monostratal in that the relations may include ones normally classified as belonging to syntax, semantics or pragmatics. One salient property of our language model is that it is strongly lexical: it consists of statistical parameters associated with relations between lexical items and the number and ordering of dependents of lexical heads. This lexical anchoring facilitates statistical training and sensitivity to lexical variation and collocations. In order to gain the benefits of probabilistic modeling, we replace the task of developing large rule sets with the task of estimating large numbers of statistical parameters for the monolingual and translation models. This gives rise to a new cost trade-off between human annotation/judgement and barely tractable fully automatic training. It also necessitates further research on lexical similarity and clustering (e.g. Pereira, Tishby and Lee 1993; Dagan, Marcus and Markovitch 1993) to improve parameter estimation from sparse data.</Paragraph> <Paragraph position="5"> Translation via Lexical Relation Graphs The model associates phrases with relation graphs. A relation graph is a directed labeled graph consisting of a set of relation edges. Each edge has the form of an atomic proposition r(wi, wj) where r is a relation symbol, wi is the lexical head of a phrase and wj is the lexical head of another phrase (typically a subphrase of the phrase headed by wi). The nodes wi and wj are word occurrences representable by a word and an index, the indices uniquely identifying particular occurrences of the words in a discourse or corpus. The set of relation symbols is open ended, but the first argument of the relation is always interpreted as the head and the second as the dependent with respect to this relation. The relations in the models for the source and target languages need not be the same, or even overlap. To keep the language models simple, we will mainly restrict ourselves here to dependency graphs that are trees with unordered siblings. In particular, phrases will always be contiguous strings of words and dependents will always be heads of subphrases.</Paragraph>
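(The following short Python sketch is purely illustrative and not part of the original paper: it encodes word occurrences as word/index pairs and a relation graph as a set of labeled head-dependent edges. The relation labels and the example phrase are hypothetical.)

from dataclasses import dataclass

# A word occurrence: a word plus an index identifying this particular
# occurrence in the discourse or corpus.
@dataclass(frozen=True)
class Node:
    word: str
    index: int

# A relation edge r(wi, wj): wi is the head, wj the dependent.
@dataclass(frozen=True)
class Edge:
    relation: str   # the set of relation labels is open ended
    head: Node
    dependent: Node

# A relation graph is simply a set of relation edges.
# Hypothetical example for the phrase "the large file":
n_the, n_large, n_file = Node("the", 1), Node("large", 2), Node("file", 3)
graph = {
    Edge("det", n_file, n_the),
    Edge("mod", n_file, n_large),
}

def dependents(graph, head, relation):
    # All r-dependents of a given head node in the graph.
    return [e.dependent for e in graph if e.head == head and e.relation == relation]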
<Paragraph position="6"> Ignoring algorithmic issues relating to compactly representing and efficiently searching the space of alternative hypotheses, the overall design of the quantitative system is as follows. The speech recognizer produces a set of word-position hypotheses (perhaps in the form of a word lattice) corresponding to a set of string hypotheses for the input. The source language model is used to compute a set of possible relation graphs, with associated probabilities, for each string hypothesis. A probabilistic graph translation model then provides, for each source relation graph, the probabilities of deriving corresponding graphs with word occurrences from the target language. These target graphs include all the words of possible translations of the utterance hypotheses but do not specify the surface order of these words. Probabilities for different possible word orderings are computed according to ordering parameters which form part of the target language model.</Paragraph> <Paragraph position="7"> In the following section we explain how the probabilities for these various processing stages are combined to select the most likely target word sequence. This word sequence can then be handed to the speech synthesizer. For tighter integration between generation and synthesis, information about the derivation of the target utterance can also be passed to the synthesizer.</Paragraph> <Paragraph position="8"> The probabilities associated with phrases in the above description are computed according to the statistical models for analysis, translation, and generation. In this section we show the relationship between these models to arrive at an overall statistical model of speech translation. We are not considering training issues in this paper, though a number of now familiar techniques ranging from methods for maximum likelihood estimation to direct estimation using fully annotated data are applicable.</Paragraph> <Paragraph position="9"> The objects involved in the overall model are as follows (we omit target speech synthesis under the assumption that it proceeds deterministically from a target language word string): * As: the source language acoustics * Ws: the source language word string * Cs: the source relation graph * Ct: the target relation graph * Wt: the target language word string. Given a spoken input in the source language, we wish to find a target language string that is the most likely translation of the input. We are thus interested in the conditional probability of Wt given As. This conditional probability can be expressed as follows (cf. Chang and Su 1993): P(Wt|As) = Σ_{Ws,Cs,Ct} P(Ws|As) P(Cs|Ws,As) P(Ct|Cs,Ws,As) P(Wt|Ct,Cs,Ws,As).</Paragraph> <Paragraph position="11"> We now apply some simplifying independence assumptions concerning relation graphs. Specifically, that their derivation from word strings is independent of acoustic information; that their translation is independent of the original words and acoustics involved; and that target word string generation from target relation edges is independent of the source language representations. The extent to which these (Markovian) assumptions hold depends on the extent to which relation edges represent all the relevant information for translation.</Paragraph> <Paragraph position="12"> In particular it means they should express aspects of surface form relevant to meaning, such as topicalization, as well as predicate argument structure. In any case, the simplifying assumptions give the following: P(Wt|As) ≈ Σ_{Ws,Cs,Ct} P(Ws|As) P(Cs|Ws) P(Ct|Cs) P(Wt|Ct).
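(As a purely illustrative aside, not from the original paper, the following Python sketch gives a brute-force reading of the decomposition just stated; the component interfaces string_hypotheses, relation_graphs, target_graphs and order_prob are hypothetical placeholders, and a real system would search a compact hypothesis lattice rather than enumerate hypotheses.)

# A minimal, deliberately naive sketch of the factored model above.
def score_target(Wt, acoustics, recognizer, source_lm, transfer, target_lm):
    # P(Wt|As) is approximated by the sum over Ws, Cs, Ct of
    #   P(Ws|As) P(Cs|Ws) P(Ct|Cs) P(Wt|Ct)
    total = 0.0
    for Ws, p_ws in recognizer.string_hypotheses(acoustics):   # P(Ws|As)
        for Cs, p_cs in source_lm.relation_graphs(Ws):          # P(Cs|Ws)
            for Ct, p_ct in transfer.target_graphs(Cs):         # P(Ct|Cs)
                total += p_ws * p_cs * p_ct * target_lm.order_prob(Wt, Ct)  # P(Wt|Ct)
    return total

def best_translation(candidates, acoustics, recognizer, source_lm, transfer, target_lm):
    # Choose the candidate target word string with the highest score.
    return max(candidates,
               key=lambda Wt: score_target(Wt, acoustics, recognizer,
                                           source_lm, transfer, target_lm))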
The expression above can be rewritten with two applications of Bayes' rule, in which the factors P(Ws) cancel and P(As) can be ignored in finding the maximum of P(Wt|As). Determining the Wt that maximizes P(Wt|As) therefore involves the following factors: * P(As|Ws): source language acoustics * P(Ws|Cs): source language generation * P(Cs): source content relations * P(Ct|Cs): source to target transfer * P(Wt|Ct): target language generation. We assume that the speech recognizer provides acoustic scores proportional to P(As|Ws) (or logs thereof). Such scores are normally computed by speech recognition systems, although they are usually also multiplied by word-based language model probabilities P(Ws), which we do not require in this application context. Our approach to language modeling, which covers the content analysis and language generation factors, is presented in section 5, and the transfer probabilities fall under the translation model of section 6.</Paragraph> <Paragraph position="13"> Finally, note that by another application of Bayes' rule we can replace the two factors P(Cs)P(Ct|Cs) by P(Ct)P(Cs|Ct) without changing other parts of the model. This latter formulation allows us to apply constraints imposed by the target language model to filter inappropriate possibilities suggested by analysis and transfer. In some respects this is similar to Dagan and Itai's (1994) approach to word sense disambiguation using statistical associations in a second language.</Paragraph> </Section> <Section position="5" start_page="4" end_page="6" type="metho"> <SectionTitle> 5. Language Models Language Production Model </SectionTitle> <Paragraph position="0"> Our language model can be viewed in terms of a probabilistic generative process based on the choice of lexical 'heads' of phrases and the recursive generation of subphrases and their ordering. For this purpose, we can define the head word of a phrase to be the word that most strongly influences the way the phrase may be combined with other phrases. This notion has been central to a number of approaches to grammar for some time, including theories like dependency grammar (Hudson 1976, 1990) and HPSG (Pollard and Sag 1987). More recently, the statistical properties of associations between words, and more particularly heads of phrases, have become an active area of research (e.g. Chang, Luo, and Su 1992; Hindle and Rooth 1993).</Paragraph> <Paragraph position="1"> The language model factors the statistical derivation of a sentence with word string W as follows: P(W) = Σ_C P(C) P(W|C)</Paragraph> <Paragraph position="3"> where C ranges over relation graphs. The content model, P(C), and generation model, P(W|C), are components of the overall statistical model for spoken language translation given earlier. This decomposition of P(W) can be viewed as first deciding on the content of a sentence, formulated as a set of relation edges according to a statistical model for P(C), and then deciding on word order according to P(W|C).</Paragraph> <Paragraph position="4"> Of course, this decomposition simplifies the realities of language production in that real language is always generated in the context of some situation S (real or imaginary), so a more comprehensive model would be concerned with P(C|S), i.e. language production in context.
This is less important, however, in the translation setting since we produce Ct in the context of a source relation graph Cs, and we assume the availability of a model for P(Ct|Cs).</Paragraph> <Paragraph position="5"> Content Derivation Model The model for deriving the relation graph of a phrase is taken to consist of choosing a lexical head h0 for the phrase (what the phrase is 'about') followed by a series of 'node expansion' steps. An expansion step takes a node and chooses a possibly empty set of edges (relation labels and ending nodes) starting from that node. Here we consider only the case of relation graphs that are trees with unordered siblings.</Paragraph> <Paragraph position="6"> To start with, let us take the simplified case where a head word h has no optional or duplicated dependents (i.e. exactly one for each relation). There will be a set of edges {r1(h, w1), ..., rk(h, wk)}</Paragraph> <Paragraph position="8"> corresponding to the local tree rooted at h with dependent nodes w1...wk. The set of relation edges for the entire derivation is the union of these local edge sets.</Paragraph> <Paragraph position="9"> To determine the probability of deriving a relation graph C for a phrase headed by h0 we make use of parameters ('dependency parameters') P(w | h, r)</Paragraph> <Paragraph position="11"> for the probability, given a node h and a relation r, that w is an r-dependent of h. Under the assumption that the dependents of a head are chosen independently from each other, the probability of deriving C is: P(C) = P(Top(h0)) ∏_{r(h,w) in C} P(w | h, r)</Paragraph> <Paragraph position="13"> where P(Top(h0)) is the probability of choosing h0 to start the derivation.</Paragraph> <Paragraph position="14"> If we now remove the assumption made earlier that there is exactly one r-dependent of a head, we need to elaborate the derivation model to include choosing the number of such dependents. We model this by parameters P(n | h, r),</Paragraph> <Paragraph position="16"> that is, the probability that head h has n r-dependents.</Paragraph> <Paragraph position="17"> We will refer to this probability as a 'detail parameter'.</Paragraph> <Paragraph position="18"> Our previous assumption amounted to stating that this was always 1 for n = 1 or for n = 0. Detail parameters allow us to model, for example, the number of adjectival modifiers of a noun or the 'degree' to which a particular argument of a verb is optional. The probability of an expansion of h giving rise to local edges E(h) is now: P(E(h)) = ∏_r k(nr) P(nr | h, r) P(w1 | h, r) ... P(wnr | h, r)</Paragraph> <Paragraph position="20"> where r ranges over the set of relation labels and h has nr r-dependents w1 ... wnr. k(nr) is a combinatoric constant for taking account of the fact that we are not distinguishing permutations of the dependents (e.g. there are nr! permutations of the r-dependents of h if these dependents are all distinct).</Paragraph> <Paragraph position="21"> The probability of deriving the graph C is then P(C) = P(Top(h0)) ∏_{h in heads(C)} P(Ec(h))</Paragraph> <Paragraph position="22"> where heads(C) is the set of nodes in C and Ec(h) is the set of edges headed by h in C.</Paragraph> <Paragraph position="23"> The above formulation is only an approximation for relation graphs that are not trees because the independence assumptions which allow the dependency parameters to be simply multiplied together no longer hold for the general case. Dependency graphs with cycles do arise as the most natural analyses of certain linguistic constructions, but calculating their probabilities on a node by node basis as above may still provide probability estimates that are accurate enough for practical purposes.</Paragraph>
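(As a purely illustrative aside, not from the original paper, the following Python sketch computes the content model probability of a tree-shaped relation graph from hypothetical dependency and detail parameter tables. The factors P(0 | h, r) for relations with no dependents are omitted for brevity, and k(nr) is taken to be nr!, which assumes the dependents are distinct.)

from math import factorial

# Hypothetical parameter tables (illustration only):
#   p_top[h]            corresponds to P(Top(h))
#   p_detail[(h, r)][n] corresponds to the detail parameter P(n | h, r)
#   p_dep[(h, r)][w]    corresponds to the dependency parameter P(w | h, r)

def tree_probability(head, tree, p_top, p_detail, p_dep):
    # P(C) = P(Top(h0)) times the product of local expansion probabilities.
    return p_top[head] * expansion_probability(head, tree, p_detail, p_dep)

def expansion_probability(head, tree, p_detail, p_dep):
    # tree[head] maps each relation label r to the list of r-dependent words of head.
    p = 1.0
    for r, deps in tree.get(head, {}).items():
        n = len(deps)
        p *= factorial(n) * p_detail[(head, r)].get(n, 0.0)   # k(nr) * P(nr | h, r)
        for w in deps:
            p *= p_dep[(head, r)].get(w, 0.0)                 # P(w | h, r)
            p *= expansion_probability(w, tree, p_detail, p_dep)  # recurse into the subphrase
    # Relations with zero dependents would contribute P(0 | h, r); omitted here.
    return p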
<Paragraph position="24"> Generation Model We now return to the generation model P(W|C). As mentioned earlier, since C includes the words in W and a set of relations between them, the generation model is concerned only with surface order. One possibility is to use 'bi-relation' parameters for the probability that an ri-dependent immediately follows an rj-dependent.</Paragraph> <Paragraph position="25"> This approach is problematic for our overall statistical model because such parameters are not independent from the 'detail' parameters specifying the number of r-dependents of a head.</Paragraph> <Paragraph position="26"> We therefore adopt the use of 'sequencing' parameters, these being probabilities of particular orderings of dependents given that the multiset of dependency relations is known. We let the identity relation e stand for the head itself. Specifically, we have parameters P(s|M(s)) where s is a sequence of relation labels including an occurrence of e and M(s) is the multiset for this sequence. For a head h in a relation graph C, let s(W,C,h) be the sequence of dependent relations induced by a particular word string W generated from C. We now have P(W|C) = ∏_{h in heads(C)} [ ∏_r 1/(nr!) ] P(s(W,C,h) | M(s(W,C,h))) where h ranges over all the heads in C, and nr is the number of occurrences of r in s(W,C,h), assuming that all orderings of the nr r-dependents are equally likely. We can thus use these sequencing parameters directly in our overall model.</Paragraph> <Paragraph position="27"> To summarize, our monolingual models are specified by: the dependency parameters P(w | h, r), the detail parameters P(n | h, r), and the sequencing parameters P(s|M(s)). The overall model splits the contributions of content P(C) and ordering P(W|C). However, we may also want a model for P(W), for example for pruning speech recognition hypotheses. Combining our content and ordering models we get: P(W) = Σ_C P(Top(h0)) ∏_{h in heads(C)} P(s(W,C,h) | h) ∏_{r(h,w) in Ec(h)} P(w | h, r)</Paragraph> <Paragraph position="29"> The parameters P(s|h) can be derived by combining sequencing parameters with the detail parameters for h.</Paragraph> </Section> <Section position="6" start_page="6" end_page="8" type="metho"> <SectionTitle> 6. Translation Model </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="6" end_page="7" type="sub_section"> <SectionTitle> Mapping Relation Graphs </SectionTitle> <Paragraph position="0"> As already mentioned, the translation model defines mappings between relation graphs Cs for the source language and Ct for the target language. A direct (though incomplete) justification of translation via relation graphs may be based on a simple referential view of natural language semantics. Thus nominals and their modifiers pick out entities in a (real or imaginary) world, while verbs and their modifiers refer to actions or events in which the entities participate in roles indicated by the edge relations. Under this view, the purpose of the translation mapping is to determine a target language relation graph that provides the best approximation to the referential function induced by the source relation graph. We call this approximating referential equivalence.</Paragraph> <Paragraph position="1"> This referential view of semantics is not adequate for taking account of much of the complexity of natural language including many aspects of quantification, distributivity and modality. This means it cannot capture some of the subtleties that a theory based on logical equivalence might be expected to. On the other hand, when we proposed a logic based approach as our qualitative model, we had to restrict it to a simple first order logic anyway for computational reasons, and even then it did not appear to be practical.
Thus using the more impoverished lexical relations representation may not be costing us much in practice.</Paragraph> <Paragraph position="2"> One aspect of the representation that is particularly useful in the translation application is its convenience for partial and/or incremental representation of content: we can refine the representation by the addition of further edges. A fully specified denotation of the meaning of a sentence is rarely required for translation, and as we pointed out when discussing logic representations, a complete specification may not have been intended by the speaker. Although we have not provided a denotational semantics for sets of relation edges, we anticipate that this will be possible along the lines developed in monotonic semantics (Alshawi and Crouch 1992).</Paragraph> </Section> <Section position="2" start_page="7" end_page="8" type="sub_section"> <SectionTitle> Translation Parameters </SectionTitle> <Paragraph position="0"> To be practical, a model for P(Ct|Cs) needs to decompose the source and target graphs Cs and Ct into subgraphs small enough that subgraph translation parameters can be estimated. We do this with the help of 'node alignment relations' between the nodes of these graphs.</Paragraph> <Paragraph position="1"> These alignment relations are similar in some respects to the alignments used by Brown et al. (1990) in their surface translation model. The translation probability is then the sum of probabilities over different alignments f: P(Ct|Cs) = Σ_f P(Ct, f|Cs).</Paragraph> <Paragraph position="2"> There are different ways to model P(Ct, f|Cs) corresponding to different kinds of alignment relations and different independence assumptions about the translation mapping.</Paragraph> <Paragraph position="3"> For our quantitative design, we adopt a simple model in which lexical and relation (structural) probabilities are assumed to be independent. In this model the alignment relations are functions from the word occurrence nodes of Ct to the word occurrences of Cs. The idea is that f(vj) = wi means that the source word occurrence wi 'gave rise' to the target word occurrence vj. The inverse relation f-1 need not be a function, allowing different numbers of words in the source and target sentences.</Paragraph> <Paragraph position="4"> We decompose P(Ct, f|Cs) into 'lexical' and 'structural' probabilities as follows: P(Ct, f|Cs) = P(Nt, f|Ns) P(Et|Nt, f, Cs) where Nt and Ns are the node sets for Ct and Cs respectively, and Et is the set of edges for the target graph. The first factor P(Nt, f|Ns) is the lexical component in that it does not take into account any of the relations in the source graph Cs. This lexical component is the product of alignment probabilities for each node of Ns: P(Nt, f|Ns) = ∏_{wi in Ns} P({vi1 ... vik} | wi).</Paragraph> <Paragraph position="5"> That is, the probability that f maps exactly the (possibly empty) subset {vi1 ... vik} of Nt to wi. These sets are assumed to be disjoint for different source graph nodes, so we can replace the factors in the above product with parameters: P(M|w) where w is a source language word and M is a multiset of target language words.</Paragraph> <Paragraph position="6"> We will derive a target set of edges Et of Ct by k derivation steps which partition the set of source edges Es into subgraphs S1 ... Sk. These subgraphs give rise to disjoint sets of relation edges T1 ... Tk which together form Et. The structural component of our translation model will be the sum of derivation probabilities for such an edge set Et.</Paragraph>
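(As a purely illustrative aside, not from the original paper, the following Python sketch evaluates the lexical component defined above for a single alignment, using invented P(M|w) values for a hypothetical English-French word pair; as in the text, the target word sets assigned to different source nodes are assumed to be disjoint.)

# Hypothetical lexical translation parameters P(M | w): the probability that a
# source word w gives rise to the multiset M of target words (M may be empty).
# All values below are invented for illustration only.
lex = {
    "not": {("ne", "pas"): 0.7, ("pas",): 0.2, (): 0.1},
    "works": {("marche",): 0.6, ("fonctionne",): 0.4},
}

def lexical_component(alignment):
    # alignment maps each source node (here identified by its word) to the
    # tuple of target words it gives rise to, i.e. the inverse image under f.
    p = 1.0
    for source_word, target_multiset in alignment.items():
        p = p * lex[source_word].get(tuple(sorted(target_multiset)), 0.0)
    return p

# P(Nt, f | Ns) for one hypothetical alignment:
print(lexical_component({"works": ("marche",), "not": ("ne", "pas")}))  # 0.42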
<Paragraph position="7"> For simplicity, we assume here that the source graph Cs is a tree. This is consistent with our earlier assumptions about the source language model. We take our partitions of the source graph to be the edge sets for local trees. This ensures that the partitioning is deterministic, so the probability of a derivation is the product of the probabilities of derivation steps. More complex models with larger partitions rooted at a node are possible, but these require additional parameters for partitioning. For the simple model it remains to specify derivation step probabilities.</Paragraph> <Paragraph position="8"> The probability of a derivation step is given by parameters of the form: P(Ti' | Si', fi) where Si' and Ti' are unlabeled graphs and fi is a node alignment function from Ti' to Si'. Unlabeled graphs are just like our relation edge graphs except that the nodes are not labeled with words (the edges still have relation labels). To apply a derivation step we need a notion of graph matching that respects edge labels: g is an isomorphism (modulo node labels) from a graph G to a graph H if g is a one-one and onto function from the nodes of G to the nodes of H such that r(a, b) is in G iff r(g(a), g(b)) is in H.</Paragraph> <Paragraph position="9"> The derivation step with parameter P(Ti' | Si', fi) is applicable to the source edges Si, under the alignment f, giving rise to the target edges Ti if (i) there is an isomorphism hi from Si' to Si, (ii) there is an isomorphism gi from Ti to Ti', and (iii) for any node v of Ti it is the case that hi(fi(gi(v))) = f(v).</Paragraph> <Paragraph position="10"> This last condition ensures that the target graph partitions join up in a way that is compatible with the node alignment f. The factoring of the translation model into these lexical and structural components means that it will overgenerate, because these aspects are not independent in translation between real natural languages. It is therefore appropriate to filter translation hypotheses by rescoring according to the version of the overall statistical model that included the factors P(Ct)P(Cs|Ct), so that the target language model constrains the output of the translation model. Of course, in this case we need to model the translation relation in the 'reverse' direction. This can be done in a parallel fashion to the forward direction described above.</Paragraph> </Section> </Section> </Paper>