<?xml version="1.0" standalone="yes"?>
<Paper uid="W01-1406">
  <Title>A best-first alignment algorithm for automatic extraction of transfer mappings from bilingual corpora</Title>
  <Section position="3" start_page="0" end_page="0" type="intro">
    <SectionTitle>
3 Alignment
</SectionTitle>
    <Paragraph position="0"> We consider an alignment of two logical forms to be a set of mappings, such that each mapping is between a node or set of nodes (and the relations between them) in the source LF and a node or set of nodes (and the relations between them) in the target LF, where no node participates in more than one such mapping. In other words, we allow one-to-one, one-to-many, many-to-one and many-to-many mappings but the mappings do not overlap.</Paragraph>
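    <Paragraph> As an illustrative sketch (not part of the original system), the non-overlap constraint above can be checked in a few lines of Python. The representation of a mapping as a pair of node-identifier sets and the function name is_valid_alignment are assumptions made for this example only.

```python
from collections import Counter


def is_valid_alignment(mappings):
    """Return True if no node participates in more than one mapping.

    `mappings` is a list of (source_nodes, target_nodes) pairs; each side is
    a collection of hypothetical node identifiers, so one-to-one, one-to-many,
    many-to-one and many-to-many mappings are all representable.
    """
    source_counts = Counter(n for src, _ in mappings for n in set(src))
    target_counts = Counter(n for _, tgt in mappings for n in set(tgt))
    return all(c == 1 for c in source_counts.values()) and all(
        c == 1 for c in target_counts.values()
    )


# A many-to-one mapping and a one-to-one mapping that do not overlap.
example = [({"hacer", "clic"}, {"click"}), ({"direccion"}, {"address"})]
print(is_valid_alignment(example))
```

    Adding a further mapping that reuses the node clic would make the check fail, since that node would then participate in two mappings.</Paragraph>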
    <Paragraph position="1"> Our alignment algorithm proceeds in two phases. The first phase establishes tentative lexical correspondences between nodes in the source and target LFs. The second phase aligns nodes based on these lexical correspondences as well as structural considerations. The algorithm starts from the nodes with the tightest lexical correspondence (&quot;best-first&quot;) and works outward from these anchor points.</Paragraph>
    <Paragraph position="2"> We first present the algorithm and then illustrate how it applies to the sentence pair in Figure 1.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Finding tentative lexical correspondences
</SectionTitle>
      <Paragraph position="0"> We use a bilingual lexicon that merges data from several sources (CUP, 1995), (SoftArt, 1995), (Langenscheidt, 1997), and inverts target-to-source dictionaries to improve coverage. Our Spanish-English lexicon contains 88,500 translation pairs. We augment this with 19,762 translation correspondences acquired using statistical techniques described by Moore (2001).</Paragraph>
      <Paragraph position="1"> Like Watanabe (2000) and Meyers (2000), we use a lexicon to establish initial tentative word correspondences. However, we have found that even a relatively large bilingual dictionary has only moderately good coverage for our purposes. Hence, we pursue an aggressive matching strategy for establishing tentative word correspondences. Using the bilingual dictionary together with the derivational morphology component in our system (Pentheroudakis, 1993), we find direct translations, translations of morphological bases and derivations, and base and derived forms of translations. Fuzzy string matching is also used to identify possible correspondences. We have found that aggressive over-generation of correspondences at this phase is balanced by the more conservative second phase and results in improved overall alignment quality.</Paragraph>
      <Paragraph position="2"> We also look for matches between components of multi-word expressions and individual words. This allows us to align such expressions that may have been analyzed as a single lexicalized entity in one language but as separate words in the other.</Paragraph>
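      <Paragraph> The over-generating lookup described in this section can be sketched as follows. This is a simplified illustration under stated assumptions: the toy lexicon entries are hypothetical, the derivational morphology component is omitted, and Python's difflib.SequenceMatcher with a threshold of 0.8 stands in for the fuzzy string matcher, which the text does not specify.

```python
from difflib import SequenceMatcher

# Toy bilingual lexicon (hypothetical entries, lower-cased lemmas).
LEXICON = {
    "direccion": {"address", "direction"},
    "hipervinculo": {"hyperlink"},
    "usted": {"you"},
}


def components(lemma):
    """Split a multi-word unit such as Hyperlink_Information into its words."""
    return [part.lower() for part in lemma.split("_")]


def fuzzy_match(a, b, threshold=0.8):
    """Assumed similarity test; the threshold value is illustrative."""
    return SequenceMatcher(None, a, b).ratio() >= threshold


def tentative_correspondences(source_lemmas, target_lemmas):
    """Over-generate (source, target) pairs from lexicon and fuzzy matches."""
    pairs = set()
    for s in source_lemmas:
        for t in target_lemmas:
            # Compare individual words so that components of multi-word
            # expressions can match single words on the other side.
            for s_word in components(s):
                for t_word in components(t):
                    in_lexicon = t_word in LEXICON.get(s_word, set())
                    if in_lexicon or fuzzy_match(s_word, t_word):
                        pairs.add((s, t))
    return pairs


pairs = tentative_correspondences(
    ["direccion", "hipervinculo", "clic"],
    ["address", "Hyperlink_Information", "click"],
)
print(sorted(pairs))
```

    In this sketch, clic and click are paired by string similarity alone, while hipervinculo is paired with Hyperlink_Information through one component of the multi-word unit.</Paragraph>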
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Aligning nodes
</SectionTitle>
      <Paragraph position="0"> Our alignment procedure uses the tentative lexical correspondences established above, as well as structural cues, to create affirmative node alignments. A set of alignment grammar rules licenses only linguistically meaningful alignments. The rules are ordered to create the most unambiguous alignments (&quot;best&quot;) first and use these to disambiguate subsequent alignments. The algorithm and the alignment grammar rules are intended to be applicable across multiple languages. The rules were developed while working primarily with a Spanish-English corpus, but have also been applied to other language pairs such as French, German, and Japanese to/from English.</Paragraph>
      <Paragraph position="1"> The algorithm is as follows: 1. Initialize the sets of unaligned source and target nodes to the sets of all source and target nodes, respectively.</Paragraph>
      <Paragraph position="2"> 2. Attempt to apply the alignment rules, in the specified order, to each unaligned node or set of nodes in source and target. If a rule fails to apply to any unaligned node or set of nodes, move to the next rule.</Paragraph>
      <Paragraph position="3"> 3. If all rules fail to apply to all nodes, exit. No more alignment is possible. (Note: some nodes may remain unaligned.)</Paragraph>
      <Paragraph position="4"> 4. When a rule applies, mark the nodes or sets of nodes to which it applied as aligned to each other and remove them from the lists of unaligned source and target nodes, respectively. Go to step 2 and apply rules again, starting from the first rule.</Paragraph>
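      <Paragraph> Steps 1 to 4 above can be sketched as the following control loop. This is a minimal illustration, not the system's implementation: the rule interface (a function that returns a pair of node sets, or None when it does not apply) and the toy single-node uniqueness rule are assumptions introduced for this example.

```python
def align(source_nodes, target_nodes, rules):
    """Apply ordered rules best-first until no rule applies."""
    # Step 1: every node starts out unaligned.
    unaligned_source = set(source_nodes)
    unaligned_target = set(target_nodes)
    alignments = []
    while True:
        # Step 2: try the rules in their specified order.
        for rule in rules:
            match = rule(unaligned_source, unaligned_target, alignments)
            if match is not None:
                # Step 4: record the alignment, remove the nodes, and
                # restart from the first rule.
                src, tgt = match
                alignments.append((frozenset(src), frozenset(tgt)))
                unaligned_source -= set(src)
                unaligned_target -= set(tgt)
                break
        else:
            # Step 3: no rule applied; some nodes may remain unaligned.
            return alignments, unaligned_source, unaligned_target


def make_unique_translation_rule(correspondences):
    """Toy stand-in for a uniqueness rule, restricted to single nodes."""

    def rule(unaligned_source, unaligned_target, alignments):
        for s in sorted(unaligned_source):
            targets = [t for (s2, t) in correspondences
                       if s2 == s and t in unaligned_target]
            if len(targets) != 1:
                continue
            t = targets[0]
            sources = [s2 for (s2, t2) in correspondences
                       if t2 == t and s2 in unaligned_source]
            if sources == [s]:
                return {s}, {t}
        return None

    return rule
```

    Restarting from the first rule after every successful application is what lets an earlier, stricter rule fire once a later rule has removed a competing node from the unaligned sets. The loop terminates provided each rule returns only non-empty sets of currently unaligned nodes.</Paragraph>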
      <Paragraph position="5"> The alignment grammar currently consists of 18 rules. Below we provide the specification for some of the most important rules.</Paragraph>
      <Paragraph position="6"> 1. Bidirectionally unique translation: A set of contiguous source nodes S and a set of contiguous target nodes T such that every node in S has a lexical correspondence with every node in T and with no other target node, and every node in T has a lexical correspondence with every node in S and with no other source node. Align S and T to each other.</Paragraph>
      <Paragraph position="7"> 2. Translation + Children: A source node S and a target node T that have a lexical correspondence, such that each child of S and T is already aligned to a child of the other. Align S and T to each other.</Paragraph>
      <Paragraph position="8"> 3. Translation + Parent: A source node S and a target node T that have a lexical correspondence, such that a parent P_s of S has already been aligned to a parent P_t of T.</Paragraph>
      <Paragraph position="9"> Align S and T to each other.</Paragraph>
      <Paragraph position="10"> 4. Verb+Object to Verb: A verb V [...] a target node T, with the same part-of-speech, and no unaligned siblings, where a</Paragraph>
      <Paragraph position="12"> [...] of T, and the relationship between P_s and S is the same as that between P_t and T.</Paragraph>
      <Paragraph position="13"> Align S and T to each other. 6. Child + relationship: Analogous to the previous rule, but based on previously aligned children instead of parents. Note that rules 4-6 do not exploit lexical correspondence, relying solely on relationships between the nodes being examined and previously aligned nodes.</Paragraph>
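      <Paragraph> To make the flavor of these rules concrete, the condition of rule 3 (Translation + Parent) can be sketched as a predicate. The data structures used here (a set of correspondence pairs, dictionaries from a node to its parents, and a set of already aligned node pairs) and all node identifiers are hypothetical choices for this illustration.

```python
def translation_plus_parent(s, t, correspondences, source_parents,
                            target_parents, aligned_pairs):
    """Return True if rule 3 would license aligning source s with target t."""
    # The two nodes must have a tentative lexical correspondence.
    if (s, t) not in correspondences:
        return False
    # Some parent of s must already be aligned to some parent of t.
    return any(
        (ps, pt) in aligned_pairs
        for ps in source_parents.get(s, ())
        for pt in target_parents.get(t, ())
    )


# Toy data: one ambiguous source node with two candidate targets.
correspondences = {
    ("hipervinculo", "hyperlink"),
    ("hipervinculo", "Hyperlink_Information"),
}
source_parents = {"hipervinculo": ["direccion"]}
target_parents = {"hyperlink": ["address"]}
aligned_pairs = {("direccion", "address")}

for target in ("hyperlink", "Hyperlink_Information"):
    print(target, translation_plus_parent(
        "hipervinculo", target, correspondences,
        source_parents, target_parents, aligned_pairs))
```

    Only the candidate whose parent is already aligned to the parent of the source node passes the test, so a previously established alignment resolves an ambiguity that the lexicon alone cannot.</Paragraph>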
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.3 Alignment Example
</SectionTitle>
      <Paragraph position="0"> In this section, we illustrate the application of the alignment procedure to the example in Figure 1. In the first phase, using the bilingual lexicon, we identify the lexical correspondences depicted in Figure 1a as dotted lines. Note that each of the two instances of hipervinculo has two ambiguous correspondences, and that while the correspondence from Informacion to Hyperlink_Information is unique, the reverse is not. Note also that neither the monolingual nor the bilingual lexicon has been customized for this domain. For example, there is no entry in either lexicon for Hyperlink_Information. This unit has been assembled by general-purpose &quot;Captoid&quot; grammar rules. Similarly, lexical correspondences established for this unit are based on translations found for its individual components, there being no lexicon entry for the captoid as a whole.</Paragraph>
      <Paragraph position="1"> In the next phase, the alignment rules apply to create the alignment mappings depicted in Figure 1b as dotted lines.</Paragraph>
      <Paragraph position="2"> Rule 1 (Bidirectionally unique translation) applies in three places, creating alignment mappings between direccion and address, usted and you, and clic and click. These are the initial &quot;best&quot; alignments that provide the anchors from which we will work outwards to align the rest of the structure.</Paragraph>
      <Paragraph position="3"> Rule 3 (Translation + Parent) applies next to align the instance of hipervinculo that is the child of direccion to hyperlink, which is the child of address. We leverage a previously created alignment (direccion to address) and the structure of the logical form to resolve the ambiguity present at the lexical level.</Paragraph>
      <Paragraph position="4"> Rule 1 now applies (where previously it did not) to create a many-to-one mapping from informacion and hipervinculo to Hyperlink_Information. The uniqueness condition in this rule is now met because the ambiguous alternative was cleared away by the prior application of Rule 3.</Paragraph>
      <Paragraph position="5"> Rule 4 (Verb+Object to Verb) applies to roll up hacer with its object clic, since the latter is already aligned to a verb. This produces the many-to-one alignment of hacer and clic to</Paragraph>
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 Transfer mappings with context
</SectionTitle>
      <Paragraph position="0"> Each mapping created during alignment forms the core of a family of mappings emitted by the transfer mapping acquisition procedure. The alignment mapping by itself represents a minimal transfer mapping with no context. In addition, we emit multiple variants, each one expanding the core mapping with varying types and amounts of local context.</Paragraph>
      <Paragraph position="1"> We use linguistic constructs such as noun and verb phrases to provide the boundaries for the context we include. For example, the transfer mapping for an adjective is expanded to include the noun it modifies; the mapping for a modal verb is expanded to include the main verb; the mapping for a main verb is expanded to include its object; mappings for collocations of nouns are emitted individually and as a whole. Mappings may include &quot;wild card&quot; or under-specified nodes, with a part of speech, but no lemma, as shown in Figure 2.</Paragraph>
    </Section>
    <Section position="5" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.2 Alignment Post-processing
</SectionTitle>
      <Paragraph position="0"> After we have acquired transfer mappings from our entire training corpus, we compute frequencies for all mappings. We use these to resolve conflicting mappings, i.e. mappings where the source sides of the mapping are identical, but the target sides differ. Currently we resolve the conflict by simply picking the most frequent mapping. Note that this does not imply that we are committed to a single translation for every word across the corpus, since we emitted each mapping with different types and amounts of context (see section 4.1).</Paragraph>
      <Paragraph position="1"> Ideally at least one of these contexts serves to disambiguate the translation. The conflicts being resolved here are those mappings where the necessary context is not present.</Paragraph>
      <Paragraph position="2"> A drawback of this approach is that we rely on a priori linguistic heuristics to ensure that we have the right context. In future work, we plan to address this by iteratively searching for the context that optimally disambiguates (across the entire training corpus) between conflicting mappings.</Paragraph>
      <Paragraph position="3">  During post-processing we also apply a frequency threshold, keeping only mappings seen at least N times (where N is currently 2). This frequency threshold greatly improves the speed of the runtime system, with negligible impact on translation quality (see section 5.6).</Paragraph>
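      <Paragraph> The two post-processing steps described in this section, conflict resolution by frequency and the frequency threshold, can be sketched together as follows. This is an illustration only: representing each mapping as a (source side, target side) pair of strings, the function name postprocess, and the arbitrary tie-breaking between equally frequent targets are assumptions of this example.

```python
from collections import Counter, defaultdict


def postprocess(mappings, min_count=2):
    """Keep, per source side, the most frequent target seen min_count times.

    `mappings` is an iterable of (source_side, target_side) pairs, one per
    occurrence in the training corpus.
    """
    counts = Counter(mappings)
    by_source = defaultdict(list)
    for (source, target), count in counts.items():
        by_source[source].append((count, target))
    resolved = {}
    for source, candidates in by_source.items():
        # Conflict resolution: pick the most frequent target side.
        # Ties are broken arbitrarily (here, by string order).
        count, target = max(candidates)
        # Frequency threshold: drop mappings seen fewer than N times.
        if count >= min_count:
            resolved[source] = target
    return resolved


observed = (
    [("hacer clic", "click")] * 3
    + [("hacer clic", "make click")]
    + [("direccion", "address")] * 2
    + [("usted", "you")]
)
print(postprocess(observed))
```

    Because each core mapping is also emitted with several contexts, a source side that includes more context is counted separately from the bare mapping, which is how context-specific translations can survive this step.</Paragraph>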
    </Section>
  </Section>
</Paper>