File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/05/p05-2016_metho.xml
Size: 12,800 bytes
Last Modified: 2025-10-06 14:09:48
<?xml version="1.0" standalone="yes"?> <Paper uid="P05-2016"> <Title>Dependency-Based Statistical Machine Translation</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> * Syntax-based Statistical Translation (Yamada and Knight, 2001) </SectionTitle> <Paragraph position="0"> This model extends the above by allowing all possible permutations of the RHS of the English rules.</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> * Statistical Phrase-based Translation (Koehn et al., 2003) </SectionTitle> <Paragraph position="0"> Here &quot;phrase-based&quot; means &quot;subsequence-based&quot;, as there is no guarantee that the phrases learned by the model will have any relation to what we would think of as syntactic phrases.</Paragraph> </Section> <Section position="6" start_page="0" end_page="91" type="metho"> <SectionTitle> * Dependency-based Translation (Čmejrek et al., 2003) </SectionTitle> <Paragraph position="0"> This model assumes a dependency parser for the foreign language. The syntactic structure and labels are preserved during translation. Transfer is purely lexical. A generator builds an English sentence out of the structure, labels, and translated words.</Paragraph> </Section> <Section position="7" start_page="91" end_page="92" type="metho"> <SectionTitle> 2 System Overview </SectionTitle> <Paragraph position="0"> The basic framework of our system is quite similar to that of Čmejrek et al. (2003) (we reuse many of their ancillary modules). The difference is in how we use the dependency structures.</Paragraph> <Paragraph position="1"> Čmejrek et al. only translate the lexical items. The dependency structure and any features on the nodes are preserved, and all other processing is left to the generator. In addition to lexical translation, our system models structural changes and changes to feature values, for although dependency structures are fairly well preserved across languages (Fox, 2002), there are certainly many instances where the structure must be modified.</Paragraph> <Paragraph position="2"> While the entire translation system is too large to discuss in detail here, I will provide brief descriptions of the ancillary components. References are provided, where available, for those who want more information.</Paragraph> <Section position="1" start_page="91" end_page="91" type="sub_section"> <SectionTitle> 2.1 Corpus Preparation </SectionTitle> <Paragraph position="0"> Our parallel Czech-English corpus consists of Wall Street Journal articles from 1989. The English data is from the University of Pennsylvania Treebank (Marcus et al., 1993; Marcus et al., 1994).</Paragraph> <Paragraph position="1"> The Czech translations of these articles are provided as part of the Prague Dependency Treebank (PDT) (Böhmová et al., 2001). In order to learn the parameters for our model, we must first create aligned dependency structures for the sentence pairs in our corpus. This process begins with the building of dependency structures.</Paragraph> <Paragraph position="2"> Since Czech is a highly inflected language, morphological tagging is extremely helpful for downstream processing. We generate the tags using the system described in (Hajič and Hladká, 1998).</Paragraph> <Paragraph position="3"> The tagged sentences are parsed by the Charniak parser, this time trained on Czech data from the PDT.</Paragraph> <Paragraph position="4"> The resulting phrase structures are converted to tectogrammatical dependency structures via the procedure documented in (Böhmová, 2001). Under this formalism, function words are deleted and any information contained in them is preserved in features attached to the remaining nodes. Finally, functors (such as agent or patient) are automatically assigned to nodes in the tree (Žabokrtský et al., 2002).</Paragraph> <Paragraph position="5"> On the English side, the process is simpler. We parse with the Charniak parser (Charniak, 2000) and convert the resulting phrase-structure trees to a function-argument formalism, which, like the tectogrammatical formalism, removes function words. This conversion is accomplished via deterministic application of approximately 20 rules.</Paragraph> <Paragraph position="6"> [Figure: example English phrase &quot;japan automobile dealers association&quot; with part-of-speech tags NNP NNP NNPS NN.]</Paragraph> </Section> <Section position="2" start_page="91" end_page="92" type="sub_section"> <SectionTitle> 2.2 Aligning the Dependency Structures </SectionTitle> <Paragraph position="0"> We now generate the alignments between the pairs of dependency structures we have created. We begin by producing word alignments with a model very similar to IBM Model 4 (Brown et al., 1993).</Paragraph> <Paragraph position="1"> We keep fifty possible alignments and require that each word have at least two possible alignments. We then align phrases based on the alignments of the words in each phrase span. If there is no satisfactory alignment, we allow for structural mutations. The probabilities for these mutations are refined via another round of alignment. The structural mutations allowed are described below, and a schematic sketch of the procedure follows the list. Examples are shown in phrase-structure format rather than dependency format for ease of explanation.</Paragraph> <Paragraph position="2"> * KEEP: No change. This is the default.</Paragraph> <Paragraph position="3"> * SPLIT: One English phrase aligns with two Czech phrases, and splitting the English phrase results in a better alignment. There are three types of split (left, right, middle) whose probabilities are also estimated. In the original structure of Figure 1, English node EN1 would align with Czech nodes CZ1 and CZ2. Splitting the English by adding child node EN3 results in a better alignment.</Paragraph> <Paragraph position="4"> * BUD: This adds a unary level in the English tree in the case when one English node aligns to two Czech nodes, but one of the Czech nodes is the parent of the other. In Figure 2, the Czech has one extra word &quot;společnost&quot; (&quot;company&quot;) compared with the English. English node EN1 would normally align to both CZ1 and CZ2. Adding a unary node EN2 to the English results in a better alignment.</Paragraph> <Paragraph position="5"> * ERASE: There is no corresponding Czech node for the English one. In Figure 3, the English has two nodes, EN1 and EN2, which have no corresponding Czech nodes. Erasing them brings the Czech and English structures into alignment.</Paragraph>
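<Paragraph> As a rough illustration of this phrase-alignment step, the sketch below shows one way the mutation choice could be organized in code. It is a schematic under stated assumptions, not the system's implementation: the scoring function, the threshold, the data structures, and the toy Czech-English word pairs are all hypothetical and introduced only for exposition.</Paragraph> <Paragraph>
# Illustrative sketch only (not the paper's implementation): choosing one of
# the structural mutations above (KEEP / SPLIT / BUD / ERASE) when aligning an
# English phrase span to candidate Czech spans, given word-alignment strengths.

SATISFACTORY = 0.5   # assumed threshold for a "satisfactory" alignment

def span_score(en_span, cz_span, word_align):
    """Strength of the best word link between two phrase spans (a stand-in
    for the real span-based scoring, which is not detailed here)."""
    return max((word_align.get((e, c), 0.0) for e in en_span for c in cz_span),
               default=0.0)

def choose_mutation(en_span, cz_spans, cz_parent_of, word_align):
    """Pick the mutation that best reconciles one English span with the Czech
    spans it overlaps. cz_parent_of maps a Czech span index to the index of
    its parent span (or None)."""
    scores = [span_score(en_span, cz, word_align) for cz in cz_spans]
    good = [i for i, s in enumerate(scores) if s >= SATISFACTORY]

    if len(good) == 1:
        return "KEEP"      # default: a satisfactory one-to-one alignment
    if len(good) == 2:
        i, j = good
        if cz_parent_of.get(j) == i or cz_parent_of.get(i) == j:
            return "BUD"   # one Czech node is the parent of the other
        return "SPLIT"     # two separate Czech phrases: split the English phrase
    if not good:
        return "ERASE"     # no corresponding Czech material for this English node
    return "KEEP"

# Hypothetical toy example: one English span overlapping two sibling Czech spans.
word_align = {("dealers", "prodejcu"): 0.9, ("association", "asociace"): 0.8}
print(choose_mutation(("dealers", "association"),
                      [("prodejcu",), ("asociace",)],
                      {0: None, 1: None},
                      word_align))   # prints SPLIT
</Paragraph>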
</Section> </Section> <Section position="9" start_page="92" end_page="93" type="metho"> <SectionTitle> 3 Translation Model </SectionTitle> <Paragraph position="0"> Given E, the parse of the English string, our translation model can be formalized as P(F|E). Let E1, ..., En be the nodes in the English parse, F be a parse of the Czech string, and F1, ..., Fm be the nodes in the Czech parse. Then,</Paragraph> <Paragraph position="1"> P(F|E) = P(F1, ..., Fm | E1, ..., En)</Paragraph> <Paragraph position="2"> We initially make several strong independence assumptions which we hope to eventually weaken.</Paragraph> <Paragraph position="3"> The first is that each Czech parse node is generated independently of every other one. Further, we specify that each English parse node generates exactly one (possibly NULL) Czech parse node, so that</Paragraph> <Paragraph position="4"> P(F|E) = prod_i P(Fi | Ei)</Paragraph> <Paragraph position="5"> In order to produce a Czech parse node Fi, we must generate the following.</Paragraph> <Paragraph position="6"> Lemma fi: We generate the Czech lemma fi dependent only on the English word ei.</Paragraph> <Paragraph position="7"> Part of Speech tfi: We generate the Czech part of speech tfi dependent on the part of speech of the Czech parent, tfpar(i), and the corresponding English part of speech, tei.</Paragraph> <Paragraph position="8"> Features &lt;phfi[1], ..., phfi[n]&gt;: There are several features (see Table 1) associated with each parse node. Of these, all except IND are typical morphological and analytical features. IND (indicator) is a loosely specified feature composed of functors, where assigned, and other words or small phrases (often prepositions) which are attached to the node and indicate something about the node's function in the sentence (e.g., an IND of &quot;at&quot; could indicate a locative function). We generate each Czech feature phfi[j] dependent only on its corresponding English feature phei[j].</Paragraph> <Paragraph position="9"> Head Position hi: When an English word is aligned to the head of a Czech phrase, the English word is typically also the head of its respective phrase. But this is not always the case, so we model the probability that the English head will be aligned to either the Czech head or to one of its children. To simplify, we set the probability of each particular child being the head to be uniform over the children. The head position is generated independently of the rest of the sentence.</Paragraph> <Paragraph position="10"> Structural Mutation mi: Dependency structures are fairly well preserved across languages, but there are cases when the structures need to be modified. Section 2.2 contains descriptions of the different structural changes which we model. The mutation type is generated independently of the rest of the sentence.</Paragraph> <Section position="1" start_page="93" end_page="93" type="sub_section"> <SectionTitle> 3.1 Model with Independence Assumptions </SectionTitle> <Paragraph position="0"> With all of the independence assumptions described above, the translation model becomes:</Paragraph> <Paragraph position="1"> P(F|E) = prod_i [ P(fi | ei) * P(tfi | tfpar(i), tei) * prod_j P(phfi[j] | phei[j]) * P(hi) * P(mi) ]</Paragraph>
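<Paragraph> The factored model above maps directly onto a per-node scoring routine. The following is a minimal sketch of such a computation, offered only as illustration: the dictionary-based probability tables, the smoothing floor, and the plain-dict node representation are hypothetical stand-ins for the model's actual parameter estimates and data structures. The computation is done in log space only to avoid underflow; it is the same product as above.</Paragraph> <Paragraph>
# Illustrative sketch only: scoring P(F|E) under the independence assumptions
# of Section 3.1. The node representation (plain dicts) and probability tables
# are hypothetical stand-ins, not the system's actual data structures.
import math

SMOOTH = 1e-9   # assumed probability floor for unseen events

def node_log_prob(cz, en, t):
    """log P(Fi | Ei): lemma, part of speech, features, head position, and
    structural mutation, each generated as described in Section 3."""
    logp = math.log(t["lemma"].get((cz["lemma"], en["word"]), SMOOTH))
    logp += math.log(t["pos"].get((cz["pos"], cz["parent_pos"], en["pos"]), SMOOTH))
    for j, (phf, phe) in enumerate(zip(cz["features"], en["features"])):
        logp += math.log(t["feat"].get((j, phf, phe), SMOOTH))
    logp += math.log(t["head"].get(cz["head_position"], SMOOTH))
    logp += math.log(t["mutation"].get(cz["mutation"], SMOOTH))
    return logp

def sentence_log_prob(cz_nodes, en_nodes, t):
    """Each English node generates exactly one (possibly NULL) Czech node, so
    log P(F|E) is a sum of independent per-node terms."""
    return sum(node_log_prob(cz, en, t) for cz, en in zip(cz_nodes, en_nodes))
</Paragraph>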
</Section> </Section> <Section position="10" start_page="93" end_page="93" type="metho"> <SectionTitle> 4 Training </SectionTitle> <Paragraph position="0"> The Czech and English data are preprocessed (see Section 2.1) and the resulting dependency structures are aligned (see Section 2.2). We estimate the model parameters from this aligned data by maximum likelihood estimation. In addition, we gather the inverse probabilities P(E|F) for use in the figure of merit which guides the decoder's search.</Paragraph> </Section> <Section position="11" start_page="93" end_page="94" type="metho"> <SectionTitle> 5 Decoding </SectionTitle> <Paragraph position="0"> Given a Czech sentence to translate, we first process it as described in Section 2.1. The resulting dependency structure is the input to the decoder. The decoder itself is a best-first decoder whose priority queue holds partially constructed English nodes.</Paragraph> <Paragraph position="1"> For our figure of merit to guide the search, we use the probability P(E|F). We normalize this using the perplexity (2^H) to compensate for the different number of possible values for the features ph[j]. Given two different features whose values have the same probability, the figure of merit for the feature with the greater uncertainty will be boosted. This prevents features with few possible values from monopolizing the search at the expense of the other features. Thus, for feature phei[j], the figure of merit is</Paragraph> <Paragraph position="2"> FOM(phei[j]) = P(phei[j] | phfi[j]) * 2^H(phe[j])</Paragraph> <Paragraph position="3"> Since our goal is to build a forest of partial translations, we translate each Czech dependency node independently of the others. (As more conditioning factors are added in the future, we will instead translate small subtrees rather than single nodes.) Each translated node Ei is constructed incrementally in the following order:
1. Choose the head position hi.
2. Generate the part of speech tei.
3. For j = 1..n, generate phei[j].
4. Choose a structural mutation mi.
English nodes continue to be generated until either the queue is empty or some other stopping condition is reached (e.g., having a certain number of possible translations for each Czech node). After stopping, we are left with a forest of English dependency nodes or subtrees. An illustrative sketch of the perplexity-normalized figure of merit is given below.</Paragraph>
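<Paragraph> The sketch below illustrates one way to realize the perplexity normalization described in this section. It is an illustration under stated assumptions rather than the system's code: the multiplicative form P * 2^H and the toy feature distributions are hypothetical, introduced only to show how higher-uncertainty features receive a boost. In a full decoder such a score would order the priority queue of partially constructed English nodes.</Paragraph> <Paragraph>
# Illustrative sketch only: a perplexity-normalized figure of merit, assuming
# the normalization multiplies P(phei[j] | phfi[j]) by the perplexity 2^H of
# the feature's value distribution. The toy distributions are hypothetical.
import math

def entropy(dist):
    """Entropy H (in bits) of a distribution over a feature's possible values."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0.0)

def figure_of_merit(p_value, dist):
    """P(phei[j] | phfi[j]) boosted by the perplexity 2^H, so that features
    with few possible values do not monopolize the search."""
    return p_value * 2.0 ** entropy(dist)

# Two features whose proposed values have the same probability 0.5: the one
# drawn from the higher-entropy (more uncertain) distribution gets the boost.
binary_feature = {"sg": 0.5, "pl": 0.5}                      # perplexity 2
rich_feature = {"nom": 0.5, "gen": 0.2, "dat": 0.1,
                "acc": 0.1, "loc": 0.05, "ins": 0.05}        # perplexity about 4
print(figure_of_merit(0.5, binary_feature))   # 1.0
print(figure_of_merit(0.5, rich_feature))     # roughly 2.1
</Paragraph>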
</Section> <Section position="12" start_page="94" end_page="94" type="metho"> <SectionTitle> 6 Language Model </SectionTitle> <Paragraph position="0"> We use a syntax-based language model which was originally developed for use in speech recognition (Charniak, 2001) and later adapted to work with a syntax-based machine translation system (Charniak et al., 2001). This language model requires a forest of partial phrase structures as input, so the decoder's output must be converted from dependency structures back into phrase structures; this is the inverse of the transformation performed during corpus preparation. We accomplish this with a statistical tree-transformation model whose parameters are estimated during the corpus preparation phase.</Paragraph> </Section> </Paper>