File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/03/w03-1108_metho.xml
Size: 11,194 bytes
Last Modified: 2025-10-06 14:08:37
<?xml version="1.0" standalone="yes"?> <Paper uid="W03-1108"> <Title>Learning Bilingual Translations from Comparable Corpora to Cross-Language Information Retrieval: Hybrid Statistics-based and Linguistics-based Approach</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> Research on corpus-based approaches to machine translation (MT) has been on the rise, particularly because of their promise to provide bilingual terminology and enrich lexical resources such as bilingual dictionaries and thesauri. These approaches generally rely on large text corpora, which play an important role in Natural Language Processing (NLP) and Information Retrieval (IR). Moreover, non-aligned comparable corpora have attracted special interest for bilingual terminology acquisition and lexical resource enrichment (Dagan and Itai, 1994; Dejean et al., 2002; Diab and Finch, 2000; Fung, 2000; Koehn and Knight, 2002; Nakagawa, 2000; Peters and Picchi, 1995; Rapp, 1999; Shahzad et al., 1999; Tanaka and Iwasaki, 1996).</Paragraph> <Paragraph position="1"> Unlike parallel corpora, comparable corpora are collections of texts in two or more languages that can be contrasted because of their common features, such as topic, domain, authorship or time period. This property makes comparable corpora more abundant, less expensive and more accessible through the World Wide Web.</Paragraph> <Paragraph position="2"> In the present paper, we are concerned with exploiting scarce resources for bilingual terminology acquisition and with evaluating the acquired terminology in Cross-Language Information Retrieval (CLIR). CLIR consists of retrieving documents written in one language using queries written in another language. 
An application is conducted on NTCIR, a large-scale data collection, for the Japanese-English language pair.</Paragraph> <Paragraph position="3"> The remainder of the present paper is organized as follows: Section 2 presents the proposed two-stages approach for bilingual terminology acquisition from comparable corpora. Section 3 describes the integration of linguistic knowledge for pruning the translation candidates. Experiments and evaluations in CLIR are discussed in Section 4. Section 5 concludes the present paper.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Two-stages Comparable Corpora-based Approach </SectionTitle> <Paragraph position="0"> Our proposed approach to bilingual terminology acquisition from comparable corpora (Sadat et al., 2003; Sadat et al., 2003) is based on the assumption of similar collocation, i.e., if two words are mutual translations, then their most frequent collocates are likely to be mutual translations as well. 
Moreover, we apply this assumption in both directions of the corpora, i.e., we find translations of the source term in the target language corpus and also translations of the target terms in the source language corpus.</Paragraph> <Paragraph position="1"> The proposed two-stages approach for the acquisition, disambiguation and selection of bilingual terminology is described as follows: Bilingual terminology acquisition from source language to target language to yield a first translation model, represented by similarity $SIM_{S \rightarrow T}$.</Paragraph> <Paragraph position="2"> Bilingual terminology acquisition from target language to source language to yield a second translation model, represented by similarity $SIM_{T \rightarrow S}$.</Paragraph> <Paragraph position="3"> Merge the first and second models to yield a two-stages translation model, based on bi-directional comparable corpora and represented by similarity $SIM_{S \leftrightarrow T}$.</Paragraph> <Paragraph position="4"> We follow strategies from previous research (Dejean et al., 2002; Fung, 2000; Rapp, 1999) for the first and second translation models and propose a merging strategy for the two-stages translation model (Sadat et al., 2003).</Paragraph> <Paragraph position="5"> First, word frequencies and context word frequencies in surrounding positions (here, a three-word window) are computed following a statistics-based metric, the log-likelihood ratio (Dunning, 1993). Context vectors for each source term and each target term are constructed. Next, context vectors of the target words are translated using a preliminary bilingual dictionary. We consider all translation candidates, keeping the same context frequency value as the source term. This step requires a seed lexicon, which is expanded using the bootstrapping approach proposed in this paper. 
Similarity vectors are constructed for each pair of source term and target term using the cosine metric (Salton and McGill, 1983).</Paragraph> <Paragraph position="6"> Thus, similarity vectors $SIM_{S \rightarrow T}$ and $SIM_{T \rightarrow S}$ for the first and second models are constructed and merged for a bi-directional acquisition of bilingual terminology from source language to target language. The merging process keeps the common pairs of source term and target translation (s,t), i.e., those that appear in $SIM_{S \rightarrow T}$ as pairs (s,t) and also in $SIM_{T \rightarrow S}$ as pairs (t,s), to yield combined similarity vectors $SIM_{S \leftrightarrow T}$ for each pair (s,t). The similarity value of each pair in $SIM_{S \leftrightarrow T}$ is the product of its similarity values in $SIM_{S \rightarrow T}$ for (s,t) and in $SIM_{T \rightarrow S}$ for (t,s).</Paragraph> <Paragraph position="7"> Therefore, similarity vectors of the two-stages translation model are expressed as follows:</Paragraph> <Paragraph position="9"> $SIM_{S \leftrightarrow T}(s,t) = SIM_{S \rightarrow T}(s,t) \times SIM_{T \rightarrow S}(t,s)$ for each pair (s,t) that appears in both models.</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Linguistics-based Pruning </SectionTitle> <Paragraph position="0"> Combining linguistic and statistical methods is becoming increasingly common in computational linguistics, especially as more corpora become available (Klavans and Tzoukermann, 1996; Sadat et al., 2003). We propose to integrate linguistic concepts into the corpora-based translation model. Morphological knowledge such as Part-of-Speech (POS) tags and the context of terms could be valuable for filtering and pruning the extracted translation candidates. The objective of the linguistics-based pruning technique is to detect terms and their translations that are morphologically close enough, i.e., that have close or similar POS tags. 
The proposed approach selects a fixed number of equivalents from the set of extracted target translation alternatives that match the Part-of-Speech of the source term.</Paragraph> <Paragraph position="1"> POS tags are assigned to each source term (Japanese) via morphological analysis. Likewise, a target-language morphological analysis assigns POS tags to the translation candidates. We restricted the pruning technique to nouns, verbs, adjectives and adverbs, although other POS tags could be treated in a similar way. For the Japanese-English1 pair of languages, Japanese nouns (名詞) are compared to English nouns (NN) and Japanese verbs (動詞) to English verbs (VB). Japanese adverbs (副詞)</Paragraph> <Paragraph position="3"> are compared to English adverbs (RB) and adjectives (JJ), while Japanese adjectives (形容詞) are compared to English adverbs (RB) and adjectives (JJ). This is because most adverbs in Japanese are formed from adjectives. Thus, we select pairs of source term and target translation (s,t) whose POS tags match under these correspondences: noun with NN, verb with VB, and adverb or adjective with RB or JJ.</Paragraph> <Paragraph position="5"> Japanese foreign words (tagged FW) were considered as loanwords, i.e., technical terms and proper nouns imported from foreign languages; they were therefore not pruned with the proposed linguistics-based technique but could be treated via transliteration. The generated translation alternatives are sorted in decreasing order of similarity value. Ranks are assigned in increasing order, starting at 1 for the first item of the sorted list. 
A fixed number of top-ranked translation alternatives are selected and misleading candidates are discarded.</Paragraph> <Paragraph position="6"> In order to demonstrate the procedure of our translation model, we give an example in Japanese and explain how the English translations are extracted, disambiguated and selected and how the phrasal translation is constructed.</Paragraph> <Paragraph position="7"> Given a simple Japanese query 'アジア競技大会は、アジア最大のスポーツ競技会である'</Paragraph> <Paragraph position="9"> (ajia kyougi taikai wa, ajia saidai no supoutsu kyougikai de aru).</Paragraph> <Paragraph position="10"> After segmentation, removing stop words and keeping only content words (nouns, verbs, adverbs, adjectives and foreign words), the associated list of Japanese terms becomes 'アジア' (ajia), '競技' (kyougi), '大会' (taikai), '最大' (saidai), 'スポーツ' (supoutsu) and '会' (kai). 1 English POS tags: NN refers to a noun, VB to a verb, RB to an adverb and JJ to an adjective; Japanese POS tags: 名詞 refers to a noun, 動詞 to a verb, 副詞 to an adverb and 形容詞 to an adjective, with respect to their extensions.</Paragraph> <Paragraph position="12"> 
The combined translation model is applied to each source term of the associated list, and the top-ranked word translation alternatives are selected according to their similarity values as follows:</Paragraph> <Paragraph position="14"> 0.0588), (assembly, 0.0582), (dialogue, 0.0437), etc.} '最大' (saidai): {(general, 0.0459), (great, 0.0371), (famous, 0.0362), (global, 0.0329), (group, 0.032), (measure, 0.0271), (factor, 0.0268), etc.} 'スポーツ' (supoutsu): {(sport, 1.098), (union, 0.0399), (day, 0.0392), (international, 0.0375), etc.} '会' (kai): {(taikai, 0.0489), (great, 0.0442), (meeting, 0.0365), (gather, 0.0348), (person, 0.0312), etc.} The phrasal translation associated with the Japanese query is formed by selecting a fixed number of top-ranked translation alternatives for each term (here set to 3), as illustrated below: 'asia assembly city competition sport representative meeting tournament assembly general great famous sport union day taikai great meeting'. Linguistics-based pruning was applied to the Japanese terms and the extracted English translation alternatives. The ChaSen morphological analyzer for Japanese (Matsumoto et al., 1997) assigned the POS tag 名詞 (noun) to all Japanese terms.</Paragraph> <Paragraph position="16"> Therefore, English translation alternatives tagged as nouns (NN) by a morphological analyzer for English (Sekine, 2001) are selected, and translation candidates with POS tags other than NN are discarded. The selected translation alternatives for the Japanese noun '最大' (saidai) become 'group, measure, factor'. 
Likewise, the Japanese term '会' (kai) is associated with the English translations 'taikai, meeting, person'.</Paragraph> <Paragraph position="17"> The phrasal translation associated with the Japanese query after the linguistics-based pruning is illustrated as follows: 'asia assembly city competition sport representative meeting tournament assembly group measure factor sport union day taikai meeting person'.</Paragraph> <Paragraph position="18"> Re-scoring techniques could be applied to the phrasal translation in order to select the best translation alternatives among the extracted ones.</Paragraph> </Section></Paper>