<?xml version="1.0" standalone="yes"?> <Paper uid="W05-0825"> <Title>A Generalized Alignment-Free Phrase Extraction</Title> <Section position="4" start_page="141" end_page="142" type="metho"> <SectionTitle> 3 Length Model: Dynamic Programming </SectionTitle> <Paragraph position="0"> Given the word fertility definitions in the IBM Models (Brown et al., 1993), we can compute a probability that predicts phrase length: given a candidate target (English) phrase eI1 and a source (French) phrase of length J, the model estimates P(J|eI1) via a dynamic programming algorithm over the source word fertilities. Figure 2 shows an example fertility trellis for an English trigram. Each edge between two nodes represents one English word ei, and each arc between two nodes represents one candidate non-zero fertility for ei. A fertility of zero (i.e., generating a NULL word) corresponds to the direct edge between two nodes; in this way, the NULL word is naturally incorporated into the model's representation. Each arc is associated with an English word fertility probability P(phi|ei). A path phI1 through the trellis specifies the number of French words phi generated by each English word ei. Thus, the probability of generating J words from the English phrase along the Viterbi path is:</Paragraph> <Paragraph position="2"> The Viterbi path is inferred via dynamic programming in the trellis shown in the lower panel of Figure 2:</Paragraph> <Paragraph position="4"> where PNULL(0|ei) is the probability of generating a NULL word from ei; Pph(k = 1|ei) is the usual word fertility model of generating one French word from the word ei; and ph[j, i] is the cost so far of generating j words from the first i English words e1, ..., ei. After computing the cost ph[J, I], we can trace back the Viterbi path, along which the probability P(J|eI1) of generating J French words from the English phrase eI1 is computed, as shown in Eqn. 
1.</Paragraph> <Paragraph position="5"> With this phrase length model, we can compute a phrase-level fertility score for every candidate block to estimate how well the phrase pair matches in length.</Paragraph> </Section> <Section position="5" start_page="142" end_page="143" type="metho"> <SectionTitle> 4 Distortion of Centers </SectionTitle> <Paragraph position="0"> The centers of the source and target phrases are both illustrated in Figure 1. We compute a simple distortion score to estimate how far apart the two centers are in a parallel sentence pair, i.e., how close the block lies to the diagonal.</Paragraph> <Paragraph position="1"> In our algorithm, the source center ⊙fj+lj of the phrase fj+lj with length l+1 is simply a normalized relative position defined as follows:</Paragraph> <Paragraph position="3"> where |F| is the French sentence length.</Paragraph> <Paragraph position="4"> For the center of the English phrase ei+ki in the target sentence, we first define the expected corresponding relative center for every French word fj′ using the lexicalized position score as follows:</Paragraph> <Paragraph position="6"> where |E| is the English sentence length and P(fj′|ei) is the word translation lexicon estimated by the IBM Models. The position index i is weighted by the word-level translation probabilities, and the term summing P(fj′|ei) over i = 1, ..., I provides a normalization so that the expected center falls within the range of the target sentence length. 
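As one plausible reading of the lexicalized position score described above, the expected relative center of a French word can be sketched as follows; the function name, the lexicon layout, and the back-off to the sentence middle are our illustrative assumptions, not the paper's:

```python
def expected_relative_center(f_word, e_sentence, lexicon):
    """Sketch: expected relative center of French word f_word in the
    English sentence. Positions i = 1..|E| are weighted by the word
    translation probabilities P(f_word | e_i); the weight sum and the
    sentence length |E| normalize the result into (0, 1].

    lexicon: dict mapping (french_word, english_word) -> P(f | e).
    """
    weights = [lexicon.get((f_word, e), 0.0) for e in e_sentence]
    total = sum(weights)
    if total == 0.0:
        return 0.5  # assumption: back off to the middle of the sentence
    E = len(e_sentence)
    # Weighted average of the positions, normalized by |E|.
    return sum(w * i for i, w in enumerate(weights, start=1)) / (total * E)
```

For example, with a lexicon that puts most of its mass on the second of two English words, the expected center lands near the right end of the sentence.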
The expected center for ei+ki is simply an average of ⊙ei+k</Paragraph> <Paragraph position="8"> This is a general framework, and one can certainly plug in other scoring schemes or even word alignments to obtain better estimates.</Paragraph> <Paragraph position="9"> Given the estimated centers of ⊙fj+l</Paragraph> <Paragraph position="10"> and ⊙ei+ki , we can compute how close they are by the probability P(⊙ei+k</Paragraph> <Paragraph position="12"> | ⊙fj+lj ). One can start with a flat Gaussian model to enforce that the point (⊙ei+k</Paragraph> <Paragraph position="14"> , ⊙fj+lj ) is not too far off the diagonal, build an initial list of phrase pairs, and then compute the histogram to approximate this probability. Overall, for each source phrase fj+lj , the algorithm first estimates its normalized relative center in the source sentence and its projected relative center in the target sentence. The phrase length score, the center-based distortion score, and a lexicon-based score are computed for each candidate block. A local greedy search is carried out for the best-scored phrase pair (fj+lj , ei+ki ).</Paragraph> <Paragraph position="16"> In our submitted system, we computed the following seven base scores for phrase pairs: Pef(fj+lj |ei+ki ) and Pfe(ei+ki |fj+lj ), which share a similar functional form, as in Eqn. 5.</Paragraph> <Paragraph position="18"> We compute the phrase-level relative frequency in both directions: Prf(fj+lj |ei+ki ) and Prf(ei+ki |fj+lj ). We compute two other lexicon scores, which were also used in (Vogel et al., 2004): S1(fj+lj |ei+ki ) and</Paragraph> <Paragraph position="20"> In addition, we use the phrase-level fertility score computed in Section 3 via dynamic programming as one additional score for decoding.</Paragraph> <Paragraph position="21"> 2: Output: PhraseSet: phrase pair collections.
3: Loop over each sentence pair
4: for j : 0 to |F| -
1,
5: for l : 0 to MaxLength,
6: foreach fj+lj
7: compute ⊙f and ⊙E
8: left = ⊙E * |E| - MaxLength,
9: right = ⊙E * |E| + MaxLength,
10: for i : left to right,
11: for k : 0 to right,
12: compute ⊙e of ei+ki ,
13: score the phrase pair (fj+lj , ei+ki ), where score = P(⊙e|⊙f) P(l|ei+ki ) P(fj+lj |ei+ki )
14: add the top-n {(fj+lj , ei+ki )} into PhraseSet.</Paragraph> </Section> </Paper>
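The length model of Section 3 can be sketched as the following dynamic program over a fertility trellis. This is a minimal sketch assuming a toy per-word fertility table P(k|e), with k = 0 standing for the NULL (zero-fertility) edge; the function name and the data layout are ours, not the paper's:

```python
import math

def viterbi_length_prob(fertility, english_phrase, J):
    """Estimate P(J | e_1^I) along the Viterbi path: maximize the
    product of per-word fertility probabilities such that the
    fertilities of the I English words sum to J.

    fertility: dict mapping word -> {k: P(k | word)}, k >= 0,
               where k = 0 is the NULL (zero-fertility) case.
    """
    I = len(english_phrase)
    NEG = float("-inf")
    # ph[j][i]: best log-probability of generating j French words
    # from the first i English words (the trellis cost table).
    ph = [[NEG] * (I + 1) for _ in range(J + 1)]
    ph[0][0] = 0.0
    for i in range(1, I + 1):
        word = english_phrase[i - 1]
        for j in range(J + 1):
            best = NEG
            # Each arc consumes one candidate fertility k of this word.
            for k, p in fertility[word].items():
                if k <= j and p > 0.0 and ph[j - k][i - 1] > NEG:
                    best = max(best, ph[j - k][i - 1] + math.log(p))
            ph[j][i] = best
    return math.exp(ph[J][I]) if ph[J][I] > NEG else 0.0
```

For a two-word phrase where "the" has fertility 0 or 1 and "house" has fertility 1 or 2, the most likely way to produce two French words is the 1 + 1 path, and impossible length requests get probability zero.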