<?xml version="1.0" standalone="yes"?>
<Paper uid="W05-0829">
  <Title>Competitive Grouping in Integrated Phrase Segmentation and Alignment Model</Title>
  <Section position="4" start_page="0" end_page="160" type="metho">
    <SectionTitle>
3 The Core of the Integrated Phrase
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="159" type="sub_section">
      <SectionTitle>
Segmentation and Alignment
</SectionTitle>
      <Paragraph position="0"> The competitive linking algorithm (CLA) (Melamed, 1997) is a greedy word alignment algorithm. It was designed to overcome the problem of indirect associations using a simple heuristic: whenever several word tokens f_i in one half of the bilingual corpus co-occur with a particular word token e in the other half of the corpus, the word most likely to be e's translation is the one for which the likelihood L(f,e) of translational equivalence is highest. The simplicity of this algorithm depends on a one-to-one alignment assumption: each word translates to at most one other word. Thus, when a pair {f,e} is "linked", neither f nor e can be aligned with any other words. This assumption renders CLA unusable for phrase-level alignment.</Paragraph>
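      <Paragraph> The greedy one-to-one linking step can be sketched as follows. This is an illustrative reimplementation, not the authors' code: the function and variable names are our own, and the score argument stands in for the likelihood L(f,e).

```python
def competitive_linking(src_words, tgt_words, score):
    """Greedy one-to-one word linking: repeatedly link the pair with the
    highest association score, then retire both words (Melamed, 1997)."""
    pairs = [(score(f, e), i, j)
             for i, f in enumerate(src_words)
             for j, e in enumerate(tgt_words)]
    pairs.sort(reverse=True)              # most likely translations first
    linked, used_src, used_tgt = [], set(), set()
    for s, i, j in pairs:
        if i not in used_src and j not in used_tgt:
            linked.append((i, j))         # "link" the winning pair
            used_src.add(i)               # one-to-one: neither word
            used_tgt.add(j)               # can be aligned again
    return linked
```
      </Paragraph>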
      <Paragraph position="1">  We propose an extension, the competitive grouping, as the core component in the ISA model.</Paragraph>
    </Section>
    <Section position="2" start_page="159" end_page="160" type="sub_section">
      <SectionTitle>
3.1 Competitive Grouping Algorithm (CGA)
</SectionTitle>
      <Paragraph position="0"> The key modification to the competitive linking algorithm is to make it less greedy. When a word pair is found to be the winner of the competition, we allow it to invite its neighbors to join the "winner's club" and group them together as an aligned phrase pair. The one-to-one assumption is thus discarded in CGA. In addition, we introduce the locality assumption for phrase alignment: a source phrase of adjacent words can only be aligned to a target phrase composed of adjacent words. This does not hold for most language pairs in cases such as relative clauses, the passive voice, and prepositional clauses; however, the assumption renders the problem tractable. Here is a description of CGA. For a sentence pair {f,e}, represent the word pair statistics for each word pair {f_i,e_j} in a two-dimensional matrix L of size I x J, where L(i,j) = chi^2(f_i,e_j). Denote an aligned phrase pair {f~,e~} as a tuple [i_start,i_end,j_start,j_end], where f~ is f_{i_start},f_{i_start+1},...,f_{i_end}, and similarly for e~.</Paragraph>
      <Paragraph position="1"> 1. Find i* and j* such that L(i*,j*) is the highest. Create a seed phrase pair [i*,i*,j*,j*], which is simply the word pair {f_{i*},e_{j*}} itself.</Paragraph>
      <Paragraph position="2"> 2. Expand the current phrase pair [i_start,i_end,j_start,j_end] into the neighboring territory to include adjacent source and target words in the phrase alignment group. There are 8 ways to group new words into the phrase pair. For example, one can expand to the north by including an additional source word f_{i_start-1} to be aligned with all the target words in the current group, or one can expand to the northeast by including both f_{i_start-1} and e_{j_end+1} (Figure 1). (chi^2 statistics were found to be more discriminative in our experiments than other symmetric word association measures, such as averaged mutual information, phi^2 statistics, and the Dice coefficient.)</Paragraph>
      <Paragraph position="3"> Two criteria have to be satisfied for each expansion: (a) If a new source word f_{i'} is to be grouped, max_{j_start &lt;= j &lt;= j_end} L(i',j) should be no smaller than max_{1 &lt;= j &lt;= J} L(i',j). Since CGA is a greedy algorithm, as described below, this guarantees that f_{i'} will not "regret" the decision to join the phrase pair, because it has no "better" target word elsewhere to be aligned with. A similar constraint is applied if a new target word e_{j'} is to be grouped.</Paragraph>
      <Paragraph position="4"> (b) The highest value in the newly-expanded area needs to be "similar" to the seed value L(i*,j*).</Paragraph>
      <Paragraph position="5"> Expand the current phrase pair to the largest extent possible, as long as both criteria are satisfied. 3. The locality assumption means that an aligned phrase cannot be aligned again. Therefore, all the source and target words in the phrase pair are marked as "invalid" and will be skipped in the following steps.</Paragraph>
      <Paragraph position="6"> 4. If there is another valid pair {f_i,e_j}, then repeat from Step 1.</Paragraph>
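      <Paragraph> The four steps above can be sketched as follows. This is a simplified illustration rather than the authors' implementation: only the four single-direction expansions are shown (the paper allows 8, including diagonals), and the "similar to the seed" criterion 2-(b) is modeled as a ratio test against a threshold, which is our assumption.

```python
def competitive_grouping(L, sim=0.5):
    """Sketch of CGA over a score matrix L (list of I rows of J floats).
    Returns phrase pairs as inclusive spans (i_start, i_end, j_start, j_end)."""
    I, J = len(L), len(L[0])
    ok_s, ok_t = [True] * I, [True] * J     # validity flags (step 3)
    phrases = []
    while True:
        # Step 1: find the highest-scoring still-valid seed pair.
        best, bi, bj = None, -1, -1
        for i in range(I):
            for j in range(J):
                if ok_s[i] and ok_t[j] and (best is None or L[i][j] > best):
                    best, bi, bj = L[i][j], i, j
        if best is None:
            break                            # step 4: no valid pair remains
        seed = best
        i0 = i1 = bi
        j0 = j1 = bj

        def row_max(r, a, b):
            return max(L[r][a:b + 1])

        def col_max(c, a, b):
            return max(L[r][c] for r in range(a, b + 1))

        def src_ok(r):   # criterion (a): no better target outside the span
            return ok_s[r] and row_max(r, j0, j1) >= row_max(r, 0, J - 1)

        def tgt_ok(c):
            return ok_t[c] and col_max(c, i0, i1) >= col_max(c, 0, I - 1)

        # Step 2: expand while criteria (a) and (b) hold; (b) is modeled
        # here as new_max >= sim * seed (an assumption).
        grew = True
        while grew:
            grew = False
            if i0 > 0 and src_ok(i0 - 1) and row_max(i0 - 1, j0, j1) >= sim * seed:
                i0 -= 1; grew = True        # north
            if i1 + 1 != I and src_ok(i1 + 1) and row_max(i1 + 1, j0, j1) >= sim * seed:
                i1 += 1; grew = True        # south
            if j0 > 0 and tgt_ok(j0 - 1) and col_max(j0 - 1, i0, i1) >= sim * seed:
                j0 -= 1; grew = True        # west
            if j1 + 1 != J and tgt_ok(j1 + 1) and col_max(j1 + 1, i0, i1) >= sim * seed:
                j1 += 1; grew = True        # east
        # Step 3: locality -- the aligned words cannot be aligned again.
        phrases.append((i0, i1, j0, j1))
        for r in range(i0, i1 + 1):
            ok_s[r] = False
        for c in range(j0, j1 + 1):
            ok_t[c] = False
    return phrases
```
      </Paragraph>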
    </Section>
    <Section position="3" start_page="160" end_page="160" type="sub_section">
      <SectionTitle>
3.2 Exploring all possible groupings
</SectionTitle>
      <Paragraph position="0"> The similarity criterion 2-(b) described previously is used to control the granularity of phrase pairs.</Paragraph>
      <Paragraph position="1"> In cases where the pairs {f_1 f_2, e_1 e_2}, {f_1,e_1} and {f_2,e_2} are all valid translation pairs, similarity is used to control whether we align {f_1 f_2, e_1 e_2} as one phrase pair or as two shorter ones.</Paragraph>
      <Paragraph position="2"> The granularity of the phrase pairs is hard to optimize, especially when the test data is unknown. On the one hand, we prefer long phrases, since interactions among the words in the phrase (for example, word sense, morphology, and local reordering) can be encapsulated. On the other hand, long phrase pairs are less likely to occur in the test data than shorter ones and may lead to low coverage. To have both long and short phrases in the alignment, we apply a range of similarity thresholds for each of the expansion operations. With a low similarity threshold, the expanded phrase pairs tend to be large, while a higher similarity threshold results in shorter phrase pairs. As described above, CGA is a greedy algorithm, and the expansion of the seed pair restricts the possible alignments for the rest of the sentence.</Paragraph>
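      <Paragraph> A toy illustration of this trade-off, with hypothetical scores and criterion 2-(b) again modeled as a ratio test (our assumption):

```python
# L[i][j] is the association of f_{i+1} with e_{j+1} (hypothetical numbers).
L = [[0.9, 0.2],
     [0.1, 0.5]]
seed = L[0][0]                                  # seed pair {f1, e1}
expanded_max = max(L[0][1], L[1][0], L[1][1])   # best score in the expansion area

def as_one_phrase(threshold):
    # criterion 2-(b): the expanded area's best score must be "similar"
    # to the seed, modeled here as a ratio test.
    return expanded_max >= threshold * seed

low = as_one_phrase(0.3)    # True: one long pair {f1 f2, e1 e2}
high = as_one_phrase(0.9)   # False: two shorter pairs instead
```
      </Paragraph>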
      <Paragraph position="3"> Figure 4 shows an example as we explore all the possible grouping choices in a depth-first search. In the end, all unique phrase pairs along the path traveled are output as phrase translation candidates for the current sentence pair.</Paragraph>
    </Section>
    <Section position="4" start_page="160" end_page="160" type="sub_section">
      <SectionTitle>
3.3 Phrase translation probabilities
</SectionTitle>
      <Paragraph position="0"> Each aligned phrase pair {f~,e~} is assigned a likelihood score L(f~,e~), defined as:</Paragraph>
      <Paragraph position="2"> where i ranges over all the words in f~, and similarly j over the words in e~.</Paragraph>
      <Paragraph position="3"> Given the collected phrase pairs and their likelihood, we estimate the phrase translation probability</Paragraph>
      <Paragraph position="5"> No smoothing is applied to the probabilities.</Paragraph>
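      <Paragraph> One plausible estimator consistent with this description is relative-likelihood normalization over all source phrases seen with a given target phrase; the exact formula below is our assumption, not the paper's equation.

```python
from collections import defaultdict

def phrase_translation_probs(scored_pairs):
    """scored_pairs: iterable of ((src_phrase, tgt_phrase), likelihood).
    Returns a hypothetical estimate of P(src_phrase | tgt_phrase)."""
    totals = defaultdict(float)   # total likelihood mass per target phrase
    scores = defaultdict(float)
    for (src, tgt), lik in scored_pairs:
        scores[(src, tgt)] += lik
        totals[tgt] += lik
    # Relative-likelihood estimate; no smoothing is applied, as in the paper.
    return {(s, t): v / totals[t] for (s, t), v in scores.items()}
```
      </Paragraph>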
    </Section>
  </Section>
  <Section position="5" start_page="160" end_page="160" type="metho">
    <SectionTitle>
4 Learning co-occurrence information
</SectionTitle>
    <Paragraph position="0"> In most cases, word alignment information is not given and is treated as a hidden parameter in the training process. We initialize the word pair co-occurrence frequencies by assuming a uniform alignment for each sentence pair, i.e., for a sentence pair (f,e) where f has I words and e has J words, each word pair {f,e} is considered to be aligned with frequency 1/(I x J). These co-occurrence frequencies are accumulated over the whole corpus to calculate the initial L(f,e). Then we iterate the ISA model: 1. Apply the competitive grouping algorithm to each sentence pair to find all possible phrase pairs.</Paragraph>
    <Paragraph position="1"> 2. For each identified phrase pair {f~,e~}, increase the co-occurrence counts for all word pairs inside {f~,e~} with weight 1/(|f~| * |e~|).</Paragraph>
    <Paragraph position="2"> 3. Calculate L(f,e) again and go to Step 1; repeat for several iterations.</Paragraph>
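    <Paragraph> The initialization and count-update steps can be sketched as follows; the function names are illustrative, and the phrase-extraction step itself (competitive grouping) is assumed to run between them.

```python
from collections import defaultdict

def initial_cooccurrence(corpus):
    """Uniform-alignment initialization: in a sentence pair with I source
    and J target words, each word pair gets frequency 1/(I*J)."""
    counts = defaultdict(float)
    for f_sent, e_sent in corpus:
        w = 1.0 / (len(f_sent) * len(e_sent))
        for f in f_sent:
            for e in e_sent:
                counts[(f, e)] += w
    return counts

def update_counts(counts, phrase_pairs):
    """Step 2: each identified phrase pair adds weight 1/(|f~|*|e~|) to
    every word pair inside it; recomputing L(f,e) from the accumulated
    counts is then step 3 of the iteration."""
    for f_phrase, e_phrase in phrase_pairs:
        w = 1.0 / (len(f_phrase) * len(e_phrase))
        for f in f_phrase:
            for e in e_phrase:
                counts[(f, e)] += w
```
    </Paragraph>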
  </Section>
</Paper>