<?xml version="1.0" standalone="yes"?> <Paper uid="W03-0302"> <Title>ProAlign: Shared Task System Description</Title> <Section position="3" start_page="0" end_page="0" type="intro"> <SectionTitle> 3 Probability Model </SectionTitle> <Paragraph position="0"> We define the word alignment problem as finding the alignment A that maximizes P(A|E,F). ProAlign models P(A|E,F) directly, using a different decomposition of terms than the model used by IBM (Brown et al., 1993). In the IBM models of translation, alignments exist as artifacts of a stochastic process, where the words in the English sentence generate the words in the French sentence. Our model does not assume that one sentence generates the other. Instead it takes both sentences as given, and uses the sentences to determine an alignment.</Paragraph> <Paragraph position="1"> An alignment A consists of t links {l1,l2,...,lt}, where each lk = l(eik,fjk) for some ik and jk. We will refer to consecutive subsets of A as lji = {li,li+1,...,lj}. Given this notation, P(A|E,F) can be decomposed as follows:</Paragraph> <Paragraph position="3"> At this point, we factor P(lk|E,F,lk[?]11 ) to make computation feasible. Let Ck = {E,F,lk[?]11 } represent the context of lk. Note that both the context Ck and the link lk imply the occurrence of eik and fjk. We can rewrite</Paragraph> <Paragraph position="5"> Here P(lk|eik,fjk) is link probability given a co-occurrence of the two words, which is similar in spirit to Melamed's explicit noise model (Melamed, 2000). This term depends only on the words involved directly in the link. The ratio P(Ck|lk)P(C k|eik,fjk) modifies the link probability, providing context-sensitive information.</Paragraph> <Paragraph position="6"> Ck remains too broad to deal with in practical systems. We will consider only a subset FTk of relevant features of Ck. We will make the Na&quot;ive Bayes-style assumption that these features ft [?] FTk are conditionally independent given either lk or (eik,fjk). This produces a tractable formulation for P(A|E,F):</Paragraph> <Paragraph position="8"> More details on the probability model used by ProAlign are available in (Cherry and Lin, 2003).</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Features used in the shared task </SectionTitle> <Paragraph position="0"> For the purposes of the shared task, we use two feature types. Each type could have any number of instantiations for any number of contexts. Note that each feature type is described in terms of the context surrounding a word pair.</Paragraph> <Paragraph position="1"> The first feature type fta concerns surrounding links. It has been observed that words close to each other in the source language tend to remain close to each other in the translation (S. Vogel and Tillmann, 1996). To capture this notion, for any word pair (ei,fj), if a link l(eiprime,fjprime) exists within a window of two words (where i[?]2 [?] iprime [?] i+2 and j[?]2 [?] jprime [?] j+2), then we say that the feature fta(i[?]iprime,j [?]jprime,eiprime) is active for this context. We refer to these as adjacency features.</Paragraph> <Paragraph position="2"> The second feature type ftd uses the English parse tree to capture regularities among grammatical relations between languages. For example, when dealing with French and English, the location of the determiner with respect to its governor is never swapped during translation, while the location of adjectives is swapped frequently. 
3.1 Features used in the shared task

For the purposes of the shared task, we use two feature types. Each type can have any number of instantiations for any number of contexts. Note that each feature type is described in terms of the context surrounding a word pair.

The first feature type, ft_a, concerns surrounding links. It has been observed that words close to each other in the source language tend to remain close to each other in the translation (Vogel et al., 1996). To capture this notion, for any word pair (e_i, f_j), if a link l(e_{i'}, f_{j'}) exists within a window of two words (where i-2 \le i' \le i+2 and j-2 \le j' \le j+2), then we say that the feature ft_a(i-i', j-j', e_{i'}) is active for this context. We refer to these as adjacency features.

The second feature type, ft_d, uses the English parse tree to capture regularities in grammatical relations across the two languages. For example, between French and English, the position of a determiner relative to its governor is never swapped during translation, while the position of an adjective is swapped frequently. For any word pair (e_i, f_j), let e_{i'} be the governor of e_i, and let rel be the relationship between them. If a link l(e_{i'}, f_{j'}) exists, then we say that the feature ft_d(j-j', rel) is active for this context. We refer to these as dependency features.

Take for example Figure 2, which shows a partial alignment with all links completed except those involving "the".

[Figure 2: A partial alignment of the English sentence "the host discovers all the devices" with its French translation, together with the English parse tree.]

Given this sentence pair and English parse tree, we can extract features of both types to assist in the alignment of the_1. The word pair (the_1, l') has an active adjacency feature ft_a(+1, +1, host) as well as a dependency feature ft_d(-1, det). These two features work together to increase the probability of this correct link. In contrast, the incorrect link (the_1, les) has only ft_d(+3, det), which works to lower the link probability, since most determiners are located before their governors.

3.2 Training the model

Since we always work from a current alignment, training the model is a simple matter of counting events in that alignment. The link probability is the number of times two words are linked, divided by the number of times they co-occur. The feature probabilities can be calculated similarly, by counting the number of times a feature occurs in the context of a linked word pair, and the number of times it is active for co-occurrences of the same word pair.

Considering only a single, potentially noisy alignment for a given sentence pair risks reinforcing errors present in the current alignment during training. To avoid this problem, we sample from a space of probable alignments, as is done in IBM Models 3 and above (Brown et al., 1993), and weight counts by the likelihood of each sampled alignment under the current probability model. To further reduce the impact of rare, and potentially incorrect, events, we also smooth our probabilities using m-estimate smoothing (Mitchell, 1997).
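Returning to the two feature types of Section 3.1, the sketch below shows how the adjacency and dependency features might be enumerated for a candidate word pair. The inputs existing_links and governor are hypothetical stand-ins for the current partial alignment and the English dependency parse; the two-word window follows the definition of ft_a above.

```python
from typing import Dict, List, Set, Tuple

def extract_features(
    i: int,
    j: int,
    english: List[str],
    existing_links: Set[Tuple[int, int]],   # (i', j') index pairs already linked
    governor: Dict[int, Tuple[int, str]],   # i -> (governor index i', relation rel)
) -> List[Tuple]:
    feats: List[Tuple] = []
    # Adjacency features ft_a(i-i', j-j', e_i'): fire for each existing
    # link within a window of two words on both sides.
    for (ip, jp) in existing_links:
        if abs(i - ip) <= 2 and abs(j - jp) <= 2:
            feats.append(("fta", i - ip, j - jp, english[ip]))
    # Dependency feature ft_d(j-j', rel): fires if e_i's governor e_i'
    # is linked to some f_j'.
    if i in governor:
        ip, rel = governor[i]
        for (li, lj) in existing_links:
            if li == ip:
                feats.append(("ftd", j - lj, rel))
    return feats
```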
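The training step of Section 3.2 then reduces to weighted counting over sampled alignments, followed by m-estimate smoothing. A minimal sketch, assuming pre-computed weighted co-occurrence counts; the prior p0 and equivalent sample size m are illustrative values, not those used by ProAlign.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

def train_link_prob(
    samples: List[Tuple[float, List[Tuple[str, str]]]],  # (weight, links) per sampled alignment
    cooc_counts: Dict[Tuple[str, str], float],           # weighted co-occurrence counts
    p0: float = 0.1,   # prior estimate of the link probability (assumed)
    m: float = 1.0,    # equivalent sample size for the m-estimate (assumed)
) -> Dict[Tuple[str, str], float]:
    link_counts: Dict[Tuple[str, str], float] = defaultdict(float)
    for weight, links in samples:
        for link in links:
            # Weight each linked event by the sampled alignment's
            # likelihood under the current model.
            link_counts[link] += weight
    # m-estimate smoothing (Mitchell, 1997): (count + m*p0) / (total + m).
    return {
        pair: (link_counts[pair] + m * p0) / (cooc_counts[pair] + m)
        for pair in cooc_counts
    }
```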