<?xml version="1.0" standalone="yes"?>
<Paper uid="C04-1031">
  <Title>Word to word alignment strategies</Title>
  <Section position="2" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Word alignment is the task of identifying translational relations between words in parallel corpora with the aim of re-using them in natural language processing. Typical applications that make use of word alignment techniques are machine translation and multilingual lexicography. Several approaches have been proposed for the automatic alignment of words and phrases using statistical techniques and alignment heuristics, e.g. (Brown et al., 1993; Vogel et al., 1996; García-Varea et al., 2002; Ahrenberg et al., 1998; Tiedemann, 1999; Tufis and Barbu, 2002; Melamed, 2000). Word alignment usually includes links between so-called multi-word units (MWUs) in cases where lexical items cannot be split into separate words with appropriate translations in the other language. See, for example, the alignment between an English sentence and a Swedish sentence illustrated in figure 1. There are MWUs in both languages aligned to corresponding translations in the other language. The Swedish compound "mittplatsen" corresponds to three words in English ("the middle seat"), and the English verb "dislike" is translated into the Swedish particle verb "tycker om" (English: like), which has been negated using "inte".
[Figure 1 (caption truncated in the source): ... tion (Bellow, 1977) (the Bellow corpus). Swedish: "Jag tar mittplatsen, vilket jag inte tycker om, men det gör mig inte så mycket." English: "I take the middle seat, which I dislike, but I am not really put out."]</Paragraph>
    <Paragraph position="1"> Most approaches model word alignment as links between words in the source language and words in the target language, as indicated by the arrows in figure 1.</Paragraph>
    <Paragraph position="2"> However, in cases like the English expression "I am not really put out", which corresponds to the Swedish expression "det gör mig inte så mycket", there is no proper way of connecting single words with each other in order to express this relation. In some approaches such relations are constructed in the form of an exhaustive set of links between all word pairs included in both expressions (Melamed, 1998; Mihalcea and Pedersen, 2003). In other approaches complex expressions are identified in a pre-processing step in order to handle them as complex units in the same manner as single words in alignment (Smadja et al., 1996; Ahrenberg et al., 1998; Tiedemann, 1999).</Paragraph>
    <Paragraph position="3"> The one-to-one word linking approach seems to be very limited. However, single word links can be combined in order to describe links between multi-word units, as illustrated in figure 1. In this paper we investigate different alignment strategies using this approach (footnote 1: a similar study on statistical alignment models is included in (Och and Ney, 2003)). For this we apply clue alignment, which is introduced in the next section.</Paragraph>
    <Paragraph position="4"> 2 Word alignment with clues
The clue alignment approach has been presented in (Tiedemann, 2003). Alignment clues represent probabilistic indications of associations between lexical items collected from different sources.</Paragraph>
    <Paragraph position="5"> Declarative clues can be taken from linguistic resources such as bilingual dictionaries. They may also include pre-defined relations between lexical items based on certain features such as parts of speech. Estimated clues are derived from the parallel data using, for example, measures of co-occurrence (e.g. the Dice coefficient (Smadja et al., 1996)), statistical alignment models (e.g. IBM models from statistical machine translation (Brown et al., 1993)), or string similarity measures (e.g. the longest common sub-sequence ratio (Melamed, 1995)). They can also be learned from previously aligned training data using linguistic and contextual features associated with aligned items. One example of such learned clues is the relation between certain word classes, reflecting how strongly words belonging to these classes are translationally associated. In our experiments, for example, we will use clues that indicate relations between lexical items based on their part-of-speech tags and their positions in the sentence relative to each other. They are learned from automatically word-aligned training data.</Paragraph>
    <Paragraph position="6"> The clue alignment approach implements a way of combining association indicators on a word-to-word level. The combination of clues results in a two-dimensional clue matrix. The values in this matrix express the collected evidence of an association between word pairs in bitext segments taken from a parallel corpus.</Paragraph>
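The construction of a clue matrix can be sketched as below. The combination rule is an assumption made for illustration: the text above does not state how clues are combined, so this sketch treats each clue score as an independent probability and merges scores by probabilistic disjunction (p + q - p*q). The names combine and clue_matrix, and the representation of a clue as a dictionary keyed by word pairs, are hypothetical.

```python
def combine(p, q):
    """Probabilistic disjunction of two scores, assuming independence."""
    return p + q - p * q


def clue_matrix(src, trg, clues):
    """Build a len(src) x len(trg) matrix of combined clue scores.

    src, trg: lists of tokens of one bitext segment.
    clues: list of dicts mapping (source word, target word) to a score in [0, 1].
    """
    matrix = [[0.0] * len(trg) for _ in src]
    for clue in clues:
        for i, s in enumerate(src):
            for j, t in enumerate(trg):
                # Word pairs unknown to a clue contribute no evidence.
                score = clue.get((s, t), 0.0)
                matrix[i][j] = combine(matrix[i][j], score)
    return matrix
```

With this rule, two clues that each give a word pair a score of 0.5 yield a combined value of 0.75, so agreeing clues reinforce each other without the result exceeding 1.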
    <Paragraph position="7"> Word alignment is then the task of identifying the best links according to the associations indicated in the clue matrix. Several strategies for such an alignment are discussed in the following section.</Paragraph>
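As a point of reference for the link-selection step, the following sketch shows a deliberately simple greedy baseline. It is not one of the strategies of the paper, which are discussed in the following section: the one-to-one restriction, the threshold parameter, and the name greedy_links are assumptions introduced for illustration only.

```python
def greedy_links(matrix, threshold=0.1):
    """Pick links by descending score; each word is linked at most once.

    matrix: clue matrix as a list of rows (source words) of scores.
    Returns a list of (source index, target index) pairs.
    """
    cells = [
        (score, i, j)
        for i, row in enumerate(matrix)
        for j, score in enumerate(row)
        if score >= threshold
    ]
    # Highest score first; ties are broken by position for determinism.
    cells.sort(key=lambda c: (-c[0], c[1], c[2]))
    used_src, used_trg, links = set(), set(), []
    for score, i, j in cells:
        if i in used_src or j in used_trg:
            continue
        links.append((i, j))
        used_src.add(i)
        used_trg.add(j)
    return links
```

A one-to-one baseline of this kind cannot link a multi-word unit as a whole, which is the limitation that motivates combining single word links.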
  </Section>
</Paper>