<?xml version="1.0" standalone="yes"?>
<Paper uid="C94-1009">
  <Title>BUILDING AN MT DICTIONARY FROM PARALLEL TEXTS BASED ON LINGUISTIC AND STATISTICAL INFORMATION</Title>
  <Section position="3" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 INTRODUCTION
</SectionTitle>
    <Paragraph position="0"> Parallel texts (corpora) are useful resources for acquiring a variety of linguistic knowledge (Dangan, 1991; Matsumoto, 1993), especially for machine translation systems, which inherently require customization. Translation dictionaries are the most basic and powerful knowledge source for improving and customizing translation systems. Our research interest lies in the automatic generation of translation dictionaries from parallel texts. From this perspective, finding corresponding words or phrases in bilingual texts is the fundamental factor for accurate translation.</Paragraph>
    <Paragraph position="1"> Statistics-based processing has proven to be very powerful for aligning sentences and words in parallel corpora (Brown, 1991; Gale, 1993; Chen, 1993). Kupiec proposes an algorithm for finding noun phrases in bilingual corpora (Kupiec, 1993). In this algorithm, noun-phrase candidates are extracted from tagged and aligned parallel texts using a noun phrase recognizer and the correspondences of these noun phrases are calculated based on the EM algorithm.</Paragraph>
    <Paragraph position="2"> Accuracy of around 90% has been attained for the hundred highest ranking correspondences. Statistics-based processing is effective when a relatively large amount of parallel texts is available, i.e. when high frequencies are obtained.</Paragraph>
    <Paragraph position="3"> On the other hand, existing linguistic knowledge can be used for finding corresponding words or phrases in parallel texts. For example, possible target expressions for a source expression provided by a translation system (linguistic knowledge source) can be a key in searching for the corresponding expressions in a corpus (Nogami, 1991; Katoh, 1993). Yamamoto (1993) proposes a method for generating a translation dictionary from Japanese/English parallel texts. In this method, English and Japanese compound noun phrases are extracted from parallel texts and their correspondences are searched by matching their possible translations generated by the existing translation dictionary. However, acquirable noun phrases are limited by the linguistic generative power of the translation dictionary. Furthermore, this method utilizes no sentence alignment information, which can reduce errors in finding noun phrase correspondences.</Paragraph>
    <Paragraph position="4"> This paper proposes a new method for generating an MT dictionary from parallel texts. It utilizes both statistical and linguistic information to obtain corresponding words or phrases in parallel texts. By combining these two types of information, translation pairs which cannot be obtained by the above linguistic-based method can be extracted, and a highly accurate translation dictionary is generated from relatively small parallel texts.</Paragraph>
  </Section>
</Paper>