File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/06/p06-1082_intro.xml
Size: 3,096 bytes
Last Modified: 2025-10-06 14:03:37
<?xml version="1.0" standalone="yes"?> <Paper uid="P06-1082"> <Title>Sydney, July 2006. ©2006 Association for Computational Linguistics Word Alignment in English-Hindi Parallel Corpus Using Recency-Vector Approach: Some Studies</Title> <Section position="3" start_page="0" end_page="649" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> Several approaches, including statistical techniques (Gale and Church, 1991; Brown et al., 1993), lexical techniques (Huang and Choi, 2000; Tiedemann, 2003) and hybrid techniques (Ahrenberg et al., 2000), have been pursued to design schemes for word alignment, which aims at establishing links between words of a source language and a target language in a parallel corpus. All these schemes rely heavily on rich linguistic resources, either in the form of large amounts of parallel text or of various language- and grammar-related tools, such as parsers, taggers and morphological analysers.</Paragraph> <Paragraph position="1"> The recency-vector-based approach has been proposed as an alternative strategy for word alignment. Approaches based on recency vectors typically consider the positions of a word in the corresponding texts rather than sentence boundaries. Two algorithms of this type can be found in Fung and McKeown (1994) and Somers (1998).</Paragraph> <Paragraph position="2"> The algorithms first compute the position vector $V_w$ for the word $w$ in the text. Typically, $V_w$ is of the form $\langle p_1, p_2, \ldots, p_k \rangle$, where each $p_i$ indicates a position of the word $w$ in a text $T$. A new vector $R_w$, called the recency vector, is computed from the position vector $V_w$ and is defined as $\langle p_2 - p_1, p_3 - p_2, \ldots, p_k - p_{k-1} \rangle$. In order to compute the alignment of a given word in the source language text, the recency vector of the word is compared with the recency vector of each target language word, and the similarity between them is measured by computing a matching cost associated with the recency vectors using dynamic programming. 
The target language word with the lowest cost is selected as the aligned word.</Paragraph> <Paragraph position="3"> The results reported in the above references show that the algorithms worked quite well in aligning words in parallel corpora of language pairs drawn from various European languages, Chinese and Japanese, taken pairwise. A precision of about 70% could be achieved using these algorithms.</Paragraph> <Paragraph position="4"> The major advantage of this approach is that it can work even on a relatively small dataset and does not rely on rich language resources.</Paragraph> <Paragraph position="5"> This advantage motivated us to study the effectiveness of these algorithms for aligning words in English-Hindi parallel texts. The corpus used for this work is described in Table 1. It was compiled manually from three different sources: children's storybooks, English-to-Hindi translation book material, and advertisements. We shall call the three corpora the Storybook corpus, the Sentence corpus and the Advertisement corpus, respectively.</Paragraph> </Section> </Paper>
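The recency-vector procedure outlined in the Introduction can be illustrated with a minimal Python sketch. It is not the implementation of any of the cited algorithms. The Introduction states only that a matching cost is computed by dynamic programming, so the cost used here (a dynamic-time-warping style recurrence over absolute differences) is an assumption, as are all function names and the toy token sequences.

```python
from typing import List, Optional, Sequence


def position_vector(tokens: Sequence[str], word: str) -> List[int]:
    """Return the 1-based positions at which `word` occurs in `tokens`."""
    return [i for i, tok in enumerate(tokens, start=1) if tok == word]


def recency_vector(positions: Sequence[int]) -> List[int]:
    """Return the differences between consecutive positions."""
    return [b - a for a, b in zip(positions, positions[1:])]


def matching_cost(r_src: Sequence[int], r_tgt: Sequence[int]) -> float:
    """Assumed cost: DTW-style dynamic programme over absolute differences.

    The source does not specify the cost function; this is one plausible choice.
    """
    n, m = len(r_src), len(r_tgt)
    inf = float("inf")
    if n == 0 or m == 0:
        # A word that occurs fewer than twice has no recency information.
        return inf
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(r_src[i - 1] - r_tgt[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def align_word(word: str, src_tokens: Sequence[str],
               tgt_tokens: Sequence[str]) -> Optional[str]:
    """Return the target word whose recency vector has the lowest cost, if any."""
    r_src = recency_vector(position_vector(src_tokens, word))
    best_word, best_cost = None, float("inf")
    for candidate in sorted(set(tgt_tokens)):
        r_tgt = recency_vector(position_vector(tgt_tokens, candidate))
        c = matching_cost(r_src, r_tgt)
        if c < best_cost:
            best_word, best_cost = candidate, c
    return best_word


# Toy, made-up token sequences (not taken from the corpora of this paper).
source = "a b a c a".split()
target = "x y x z x".split()
aligned = align_word("a", source, target)
```

In this toy example the word a occurs at positions 1, 3 and 5, so its recency vector is (2, 2). The target word x has the same recency vector, giving a cost of zero, while the words that occur only once have no recency vector and are never selected.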