<?xml version="1.0" standalone="yes"?>
<Paper uid="J00-1004">
  <Title>Learning Dependency Translation Models as Collections of Finite-State Head Transducers Hiyan Alshawi*</Title>
  <Section position="3" start_page="0" end_page="47" type="metho">
    <SectionTitle>
2. Head Transducers
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="46" type="sub_section">
      <SectionTitle>
2.1 Weighted Finite-State Head Transducers
</SectionTitle>
      <Paragraph position="0"> In this section we describe the basic structure and operation of a weighted head transducer. In some respects, this description is simpler than earlier presentations (e.g., Alshawi 1996); for example, here final states are simply a subset of the transducer states whereas in other work we have described the more general case in which final states are specified by a probability distribution. The simplified description is adequate for the purposes of this paper.</Paragraph>
      <Paragraph position="1"> Formally, a weighted head transducer is a 5-tuple: an alphabet W of input symbols; an alphabet V of output symbols; a finite set Q of states q0 ..... qs; a set of final states F c Q; and a finite set T of state transitions. A transition from state q to state q' has the form (q,q',w,v,o~,fl, cl where w is a member of W or is the empty string c; v is a member of V or C/; the integer o~ is the input position; the integer fl is the output position; and the real number c is the weight or cost of the transition. A transition in which oz = 0 and fl = 0 is called a head transition.</Paragraph>
      <Paragraph position="2"> The interpretation of q, q', w, and v in transitions is similar to left-to-right transducers, i.e., in transitioning from state q to state qt, the transducer &amp;quot;reads&amp;quot; input symbol w and &amp;quot;writes&amp;quot; output symbol v, and as usual if w (or v) is e then no read (respectively write) takes place for the transition. The difference lies in the interpretation of the read position c~ and the write position ft. To interpret the transition positions as transducer actions, we consider notional input and output tapes divided into squares. On such a tape, one square is numbered 0, and the other squares are numbered 1, 2 .... rightwards from square 0, and -1,-2 .... leftwards from square 0 (Figure 1).</Paragraph>
      <Paragraph position="3"> A transition with input position ~ and output position fl is interpreted as reading w from square c~ on the input tape and writing v to square fl of the output tape; if square fl is already occupied, then v is written to the next empty square to the left of  Alshawi, Bangalore, and Douglas Learning Dependency Translation Models &lt;q, q', w, v, a, fl,, c&gt;</Paragraph>
      <Paragraph position="5"> Transition symbols and positions.</Paragraph>
      <Paragraph position="6"> fl if fl &lt; 0, or to the right of fl if fl &gt; 0, and similarly, if input was already read from position a, w is taken from the next unread square to the left of a if a &lt; 0 or to the right of c~ if a ~ 0.</Paragraph>
      <Paragraph position="7"> The operation of a head transducer is nondeterministic. It starts by taking a head transition {q, q', w0, v0, 0, 0, c} where w0 is one of the symbols (not necessarily the leftmost) in the input string. (The valid initial states are therefore implicitly defined as those with an outgoing head transition.) w0 is considered to be at square 0 of the input tape and v0 is output at square 0 of the output tape. Further state transitions may then be taken until a final state in F is reached. For a derivation to be valid, it must read each symbol in the input string exactly once. At the end of a derivation, the output string is formed by taking the sequence of symbols on the target tape, ignoring any empty squares on this tape.</Paragraph>
      <Paragraph position="8"> The cost of a derivation of an input string to an output string by a weighted head transducer is the sum of the costs of transitions taken in the derivation. We can now define the string-to-string transduction function for a head transducer to be the function that maps an input string to the output string produced by the lowest-cost valid derivation taken over all initial states and initial symbols. (Formally, the function is partial in that it is not defined on an input when there are no derivations or when there are multiple outputs with the same minimal cost.) In the transducers produced by the training method described in this paper, the source and target positions are in the set {-1,0,1}, though we have also used hand-coded transducers (Alshawi and Xia 1997) and automatically trained transducers (A1shawl and Douglas 2000) with a larger range of positions.</Paragraph>
    </Section>
    <Section position="2" start_page="46" end_page="47" type="sub_section">
      <SectionTitle>
2.2 Relationship to Standard FSTs
</SectionTitle>
      <Paragraph position="0"> The operation of a traditional left-to-right transducer can be simulated by a head transducer by starting at the leftmost input symbol and setting the positions of the first transition taken to a = 0 and fl = 0, and the positions for subsequent transitions to o~ = 1 and fl = 1. However, we can illustrate the fact that head transducers are more  Head transducer to reverse an input string of arbitrary length in the alphabet {a, b}. expressive than left-to-right transducers by the case of a finite-state head transducer that reverses a string of arbitrary length. (This cannot be performed by a traditional transducer with a finite number of states.) For example, the head transducer described below (and shown in Figure 2) with input alphabet {a, b} will reverse an input string of arbitrary length in that alphabet. The states of the example transducer are Q = {ql, q2} and F = {q2}, and it has the following transitions (costs are ignored here):</Paragraph>
      <Paragraph position="2"> The only possible complete derivations of the transducer read the input string right to left, but write it left to right, thus reversing the string.</Paragraph>
      <Paragraph position="3"> Another similar example is using a finite-state head transducer to convert a palindrome of arbitrary length into one of its component halves. This clearly requires the use of an empty string on some of the output transitions.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="47" end_page="51" type="metho">
    <SectionTitle>
3. Dependency Transduction Models
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="47" end_page="49" type="sub_section">
      <SectionTitle>
3.1 Dependency Transduction using Head Transducers
</SectionTitle>
      <Paragraph position="0"> In this section we describe dependency transduction models, which can be used for machine translation and other transduction tasks. These models consist of a collection of head transducers that are applied hierarchically. Applying the machines hierarchically means that a nonhead transition is interpreted not simply as reading an input-output pair (w, v), but instead as reading and writing a pair of strings headed by (w, v) according to the derivation of a subnetwork.</Paragraph>
      <Paragraph position="1"> For example, the head transducer shown in Figure 3 can be applied recursively in order to convert an arithmetic expression from infix to prefix (Polish) notation (as noted by Lewis and Stearns \[1968\], this transduction cannot be performed by a pushdown transducer).</Paragraph>
      <Paragraph position="2"> In the case of machine translation, the transducers derive pairs of dependency trees, a source language dependency tree and a target dependency tree. A dependency tree for a sentence, in the sense of dependency grammar (for example Hays \[1964\] and Hudson \[1984\]), is a tree in which the words of the sentence appear as nodes (we do not have terminal symbols of the kind used in phrase structure grammar). In such a tree, the parent of a node is its head and the child of a node is the node's dependent. The source and target dependency trees derived by a dependency transduction model are ordered, i.e., there is an ordering on the nodes of each local tree. This  Alshawi, Bangalore, and Douglas Learning Dependency Translation Models b b ~C~ b:b b:b Figure 3 Dependency transduction network mapping bracketed arithmetic expressions from infix to prefix notation.</Paragraph>
      <Paragraph position="3"> I I want to make a collect call I * , \[ quiero hac~ una llamada de cobr~ I Figure 4 Synchronized dependency trees derived for transducing I want to make a collect call into quiero hacer una llamada de cobrar.</Paragraph>
      <Paragraph position="4"> means, in particular, that the target sentence can be constructed directly by a simple recursive traversal of the target dependency tree. Each pair of source and target trees generated is synchronized in the sense to be formalized in Section 4.2. An example is given in Figure 4.</Paragraph>
      <Paragraph position="5"> Head transducers and dependency transduction models are thus related as follows: Each pair of local trees produced by a dependency transduction derivation is the result of a head transducer derivation. Specifically, the input to such a head transducer is the string corresponding to the flattened local source dependency tree. Similarly, the output of the head transducer derivation is the string corresponding to the flattened local target dependency tree. In other words, the head transducer is used to convert a sequence consisting of a headword w and its left and right dependent words to a sequence consisting of a target word v and its left and right dependent words (Figure 5). Since the empty string may appear in a transition in place of a source or target symbol, the number of source and target dependents can be different.</Paragraph>
      <Paragraph position="6"> The cost of a derivation produced by a dependency transduction model is the sum of all the weights of the head transducer derivations involved. When applying a dependency transduction model to language translation, we choose the target string obtained by flattening the target tree of the lowest-cost dependency derivation that also generates the source string.</Paragraph>
      <Paragraph position="7"> We have not yet indicated what weights to use for head transducer transitions. The definition of head transducers as such does not constrain these. However, for a dependency transduction model to be a statistical model for generating pairs of strings, we assign transition weights that are derived from conditional probabilities. Several</Paragraph>
      <Paragraph position="9"> probabilistic parameterizations can be used for this purpose including the following for a transition with headwords w and v and dependent words w' and v': P(q', w', v', fllw, v, q).</Paragraph>
      <Paragraph position="10"> Here q and q' are the from-state and to-state for the transition and a and fl are the source and target positions, as before. We also need parameters P(q0, ql\]w, v) for the probability of choosing a head transition (qo, ql, w,v,O,O) given this pair of headwords. To start the derivation, we need parameters P(roots(wo, vo)) for the probability of choosing w0,v0 as the root nodes of the two trees.</Paragraph>
      <Paragraph position="11"> These model parameters can be used to generate pairs of synchronized dependency trees starting with the topmost nodes of the two trees and proceeding recursively to the leaves. The probability of such a derivation can be expressed as:</Paragraph>
      <Paragraph position="13"> for a derivation in which the dependents of w and v are generated by n transitions.</Paragraph>
    </Section>
    <Section position="2" start_page="49" end_page="51" type="sub_section">
      <SectionTitle>
3.2 Transduction Algorithm
</SectionTitle>
      <Paragraph position="0"> To carry out translation with a dependency transduction model, we apply a dynamic programming search to find the optimal derivation. This algorithm can take as input either word strings, or word lattices produced by a speech recognizer. The algorithm is similar to those for context-free parsing such as chart parsing (Earley 1970) and the CKY algorithm (Younger 1967). Since word string input is a special case of word lattice input, we need only describe the case of lattices.</Paragraph>
      <Paragraph position="1"> We now present a sketch of the transduction algorithm. The algorithm works bottom-up, maintaining a set of configurations. A configuration has the form In1, n2, w, v, q, c, t\] corresponding to a bottom-up partial derivation currently in state q covering an input sequence between nodes nl and n2 of the input lattice, w and v are the topmost  Alshawi, Bangalore, and Douglas Learning Dependency Translation Models nodes in the source and target derivation trees. Only the target tree t is stored in the configuration.</Paragraph>
      <Paragraph position="2"> The algorithm first initializes configurations for the input words, and then performs transitions and optimizations to develop the set of configurations bottom-up: Initialization: For each word edge between nodes n and n ~ in the lattice with source word w0, an initial configuration is constructed for any head transition of the form (q, q', w0, v0, 0, 0, c} Such an initial configuration has the form: \[n, n t, w0, v0, q~, c, v0\] Transition: We show the case of a transition in which a new configuration results from consuming a source dependent wl to the left of a headword w and adding the corresponding target dependent Vl to the right of the target head v. Other cases are similar. The transition applied is: (q, q~, Wl, Vl, -1,1, c'} It is applicable when there are the following head and dependent configurations: \[n2,n3,w,v,q,c,t\] \[nl, n2, Wl, Vl, qf, Cl, tl\] where the dependent configuration is in a final state qf. The result of applying the transition is to add the following to the set of configurations: In1, n3, w, v, q', c + Cl q- C', t'\] where Y is the target dependency tree formed by adding tl as the rightmost dependent of t.</Paragraph>
      <Paragraph position="3"> Optimization: We also require a dynamic programming condition to remove suboptimal (sub)derivations. Whenever there are two configurations \[n, n', w, v, q, Cl, tl\] \[n, n', w, v, q, C2, t2\] and c2 &gt; Cl, the second configuration is removed from the set of configurations.</Paragraph>
      <Paragraph position="4"> If, after all applicable transitions have been taken, there are configurations spanning the entire input lattice, then the one with the lowest cost is the optimal derivation. When there are no such configurations, we take a pragmatic approach in the translation application and simply concatenate the lowest costing of the minimal length sequences of partial derivations that span the entire lattice. A Viterbi-like search of the graph formed by configurations is used to find the optimal sequence of derivations. One of the advantages of middle-out transduction is that robustness is improved through such use of partial derivations when no complete derivations are available.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="51" end_page="56" type="metho">
    <SectionTitle>
4. Training Method
</SectionTitle>
    <Paragraph position="0"> Our training method for head transducer models only requires a set of training examples. Each example, or bitext, consists of a source language string paired with a target language string. In our experiments, the bitexts are transcriptions of spoken English utterances paired with their translations into Spanish or Japanese.</Paragraph>
    <Paragraph position="1"> It is worth emphasizing that we do not necessarily expect the dependency representations produced by the training method to be traditional dependency structures for the two languages. Instead, the aim is to produce bilingual (i.e., synchronized, see below) dependency representations that are appropriate to performing the translation task for a specific language pair or specific bilingual corpus. For example, headwords in both languages are chosen to force a synchronized alignment (for better or worse) in order to simplify cases involving so-called head-switching. This contrasts with one of the traditional approaches (e.g., Dorr 1994; Watanabe 1995) to posing the translation problem, i.e., the approach in which translation problems are seen in terms of bridging the gap between the most natural monolingual representations underlying the sentences of each language.</Paragraph>
    <Paragraph position="2"> The training method has four stages: (i) Compute co-occurrence statistics from the training data. (ii) Search for an optimal synchronized hierarchical alignment for each bitext. (iii) Construct a set of head transducers that can generate these alignments with transition weights derived from maximum likelihood estimation.</Paragraph>
    <Section position="1" start_page="51" end_page="51" type="sub_section">
      <SectionTitle>
4.1 Computing Pairing Costs
</SectionTitle>
      <Paragraph position="0"> For each source word w in the data set, assign a cost, the translation pairing cost c(w, v) for all possible translations v into the target language. These translations of the source word may be zero, one, or several target language words (see Section 4.4 for discussion of the multiword case). The assignment of translation pairing costs (effectively a statistical bilingual dictionary) may be done using various statistical measures.</Paragraph>
      <Paragraph position="1"> For this purpose, a suitable statistical function needs to indicate the strength of co-occurrence correlation between source and target words, which we assume is indicative of carrying the same semantic content. Our preferred choice of statistical measure for assigning the costs is the ~ correlation measure (Gale and Church 1991). We apply this statistic to co-occurrence of the source word with all its possible translations in the data set examples. We have found that, at least for our data, this measure leads to better performance than the use of the log probabilities of target words given source words (cf. Brown et al. 1993).</Paragraph>
      <Paragraph position="2"> In addition to the correlation measure, the cost for a pairing includes a distance measure component that penalizes pairings proportionately to the difference between the (normalized) positions of the source and target words in their respective sentences.</Paragraph>
    </Section>
    <Section position="2" start_page="51" end_page="53" type="sub_section">
      <SectionTitle>
4.2 Computing Hierarchical Alignments
</SectionTitle>
      <Paragraph position="0"> As noted earlier, dependency transduction models are generative probabilistic models; each derivation generates a pair of dependency trees. Such a pair can be represented as a synchronized hierarchical alignment of two strings. A hierarchical alignment consists of four functions. The first two functions are an alignment mapping f from source words w to target words f(w) (which may be the empty string ~), and an inverse alignment mapping from target words v to source words fr(v). The inverse mapping is needed to handle mapping of target words to ~; it coincides withf for pairs without source ~. The other two functions are a source head-map g mapping source dependent words w to their heads g(w) in the source string, and a target head-map h mapping target dependent words v to their headwords h(v) in the target string. An  Alshawi, Bangalore, and Douglas Leaning Dependency Translation Models g show me nonstop flights to boston muestreme los vuelos sin escalas a boston g show me z'\ muestr~me nonstop flights to boston los vuelos sin escalas a boston Figure 6 A hierarchical alignment: alignment mappings f and f', and head-maps g and h. example hierarchical alignment is shown in Figure 6 (f and f' are shown separately for clarity).</Paragraph>
      <Paragraph position="1"> A hierarchical alignment is synchronized (i.e., it corresponds to synchronized dependency trees) if these conditions hold: Nonoverlap: If wl # w2, thenf(wl) f(w2), and similarly, if Vl V2, thenf'(vl) # d'(v2).</Paragraph>
      <Paragraph position="2"> Synchronization: if f(w) = v and v # e, then f(g(w)) = h(v), and f'(v) = w. Similarly, ifd'(v) = w and w # e, thend'(h(v)) = g(w), andf(w) = v. Phrase contiguity: The image under f of the maximal substring dominated by a headword w is a contiguous segment of the target string.</Paragraph>
      <Paragraph position="3"> (Here w and v refer to word tokens not symbols (types). We hope that the context of discussion will make the type-token distinction clear in the rest of this article.) The hierarchical alignment in Figure 6 is synchronized.</Paragraph>
      <Paragraph position="4"> Of course, translations of phrases are not always transparently related by a hierarchical alignment. In cases where the mapping between a source and target phrase is unclear (for example, one of the phrases might be an idiom), then the most reasonable choice of hierarchical alignment may be for f and f' to link the heads of the phrases only, all the other words being mapped to e, with no constraints on the monolingual head mappings h and g. (This is the approach we take to compound lexical pairings, discussed in Section 4.4.) In the hierarchical alignments produced by the training method described here, the source and target strings of a bitext are decomposed into three aligned regions, as shown in Figure 7: a head region consisting of headword w in the source and its corresponding targetf(w) in the target string, a left substring region consisting of the source substring to the left of w and its projection under f on the target string, and a right substring region consisting of the source substring to the right of w and its projection underf on the target string. The decomposition is recursive in that the left substring region is decomposed around a left headword wl, and the right substring  region is decomposed around a right headword Wr. This process of decomposition continues for each left and right substring until it only contains a single word. For each bitext there are, in general, multiple such recursive decompositions that satisfy the synchronization constraints for hierarchical alignments. We wish to find such an alignment that respects the co-occurrence statistics of bitexts as well as the phrasal structure implicit in the source and target strings. For this purpose we define a cost function on hierarchical alignments. The cost function is the sum of three terms. The first term is the total of all the translation pairing costs c(w,f(w)) of each source word w and its translation f(w) in the alignment; the second term is proportional to the distance in the source string between dependents wd and their heads g(wa); and the third term is proportional to the distance in the target string between target dependent words va and their heads h(va).</Paragraph>
      <Paragraph position="5"> The hierarchical alignment that minimizes this cost function is computed using a dynamic programming procedure. In this procedure, the pairing costs are first retrieved for each possible source-target pair allowed by the example. Adjacent source substrings are then combined to determine the lowest-cost subalignments for successively larger substrings of the bitext satisfying the constraints stated above. The successively larger substrings eventually span the entire source string, yielding the optimal hierarchical alignment for the bitext. This procedure has O(n 6) complexity in the number of words in the source (or target) sentence. In Alshawi and Douglas (2000) we describe a version of the alignment algorithm in which heads may have an arbitrary number of dependents, and in which the hierarchical alignments for the training corpus are refined by iterative reestimation.</Paragraph>
    </Section>
    <Section position="3" start_page="53" end_page="55" type="sub_section">
      <SectionTitle>
4.3 Constructing Transducers
</SectionTitle>
      <Paragraph position="0"> Building a head transducer involves creating appropriate head transducer states and tracing hypothesized head transducer transitions between them that are consistent with the hierarchical alignment of a bitext.</Paragraph>
      <Paragraph position="1"> The main transitions that are traced in our construction are those that map heads, wl and Wr, of the right and left dependent phrases of w to their translations as indicated by the alignment function f in the hierarchical alignment. The positions of the dependents in the target string are computed by comparing the positions off(wt) and f(Wr) to the position of v = f(w).</Paragraph>
      <Paragraph position="2"> In order to generalize from instances in the training data, some model states arising from different training instances are shared. In particular, in the construction described here, for a given pair (w, v) there is only one final state. (We have also tried using automatic word-clustering techniques to merge states further, but for the limited domain corpora we have used so far, the results are inconclusive.) To specify  Alshawi, Bangalore, and Douglas Learning Dependency Translation Models</Paragraph>
      <Paragraph position="4"> States and transitions constructed for the &amp;quot;swapping&amp;quot; decomposition shown in Figure 7.</Paragraph>
      <Paragraph position="5"> the sharing of states we make use of a one-to-one state-naming function C/ from sequences of strings to transducer states. The same state-naming function is used for all examples in the data set, ensuring that the transducer fragments recorded for the entire data set will form a complete collection of head transducer transition networks. null Figure 7 shows a decomposition in which w has a dependent to either side, v has both dependents to the right, and the alignment is &amp;quot;swapping&amp;quot; (f(wl) is to the right off(wr)). The construction for this decomposition case is illustrated in Figure 8 as part of a finite-state transition diagram, and described in more detail below. (The other transition arrows shown in the diagram will arise from other bitext alignments containing (w,f(w)) pairings.) Other cases covered by our algorithm (e.g., a single left source dependent but no right source dependent, or target dependents on either side of the target head) are simple variants.</Paragraph>
      <Paragraph position="6"> The detailed construction is as follows:  1. Construct a transition from sl = C/(initial) to S 2 = O'(w,f(w), head) mapping the source headword w to the target head f(w) at position 0 in source and target. (In our training construction there is only one initial state sl.) 2. Since the target dependentf(wr) is to the left of target dependentf(wl) (and we are restricting positions to {-1, 0, +1}) the Wr transition is constructed first in order that the target dependent nearest the head is output first.</Paragraph>
      <Paragraph position="7"> Construct a transition from s2 to s3 = c~(w,f(w), swapping, Wr,f(Wr) mapping the source dependent Wr at position +1 to the target dependent f(Wr) at position +1.</Paragraph>
      <Paragraph position="8"> 3. Construct a transition from s3 to s4 = cr(w,f(w),final) mapping the source dependent wl at position -1 to the target dependentf(wl) at position +1.</Paragraph>
      <Paragraph position="9">  If instead the alignment had been as in Figure 9, in which the source dependents are mapped to target dependents in a parallel rather than swapping configuration (the configuration of sin escalas and Boston around flights:los vuelos in Figure 6), the construction is the same, except for the following differences: .</Paragraph>
      <Paragraph position="10"> .</Paragraph>
      <Paragraph position="11"> Since the target dependentf(wl) is to the left of target dependentf(Wr), the wl transition is constructed first in order that the target dependent nearest the head is output first.</Paragraph>
      <Paragraph position="12"> The source and target positions are as shown in Figure 10. Instead of state s3, we use a different state ss = C/(w,f(w),parallel, wl,f(wl)).  States and transitions constructed for the &amp;quot;parallel&amp;quot; decomposition shown in Figure 9. Other states are the same as for the first case. The resulting states and transitions are shown in Figure 10.</Paragraph>
      <Paragraph position="13"> After the construction described above is applied to the entire set of aligned bi-texts in the training set, the counts for transitions are treated as event observation counts of a statistical dependency transduction model with the parameters described in Section 3.1. More specifically, the negated logs of these parameters are used as the weights for transducer transitions.</Paragraph>
    </Section>
    <Section position="4" start_page="55" end_page="56" type="sub_section">
      <SectionTitle>
4.4 Multiword Pairings
</SectionTitle>
      <Paragraph position="0"> In the translation application, source word w and target word v are generalized so they can be short substrings (compounds) of the source and target strings. Examples of such multiword pairs are show me:muestrdme and nonstop:sin escalas in Figure 6. The cost for such pairings still uses the same ~ statistic, now taking the observations to be the co-occurrences of the substrings in the training bitexts. However, in order that these costs can be comparable to the costs for simple pairings, they are multiplied by the number of words in the source substring of the pairing. null The use of compounds in pairings does not require any fundamental changes to the hierarchical alignment dynamic programming algorithm, which simply produces dependency trees with nodes that may be compounds. In the transducer construction phase of the training method, one of the words of a compound is taken to be the primary or &amp;quot;real&amp;quot; headword. (In fact, we take the least common word of a compound to be its head.) An extra chain of transitions is constructed to transduce the other words of compounds, if necessary using transitions with epsilon strings. This compilation means that the transduction algorithm is unaffected by the use of compounds when aligning training data, and there is no need for a separate compound identification phase when the transduction algorithm is applied to test data. Some results for different choices of substring lengths can be found in Alshawi, Bangalore, and Douglas (1998).</Paragraph>
      <Paragraph position="1">  Alshawi, Bangalore, and Douglas Learning Dependency Translation Models</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>