<?xml version="1.0" standalone="yes"?>
<Paper uid="C94-2175">
  <Title>Bilingual Text Matching using Bilingual Dictionary and Statistics</Title>
  <Section position="2" start_page="0" end_page="1076" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Bilingual (or parallel) texts are useful as resources of linguistic knowledge as well as in applications such as machine translation.</Paragraph>
    <Paragraph position="1"> One of the major approaches to analyzing bilingual texts is the statistical approach. The statistical approach involves the following: alignment of bilingual texts at the sentence level using statistical techniques (e.g. Brown, Lai and Mercer (1991), Gale and Church (1993), Chen (1993), and Kay and Röscheisen (1993)), statistical machine translation models (e.g.</Paragraph>
    <Paragraph position="2"> Brown, Cocke, Pietra, Pietra et al. (1990)), finding character-level / word-level / phrase-level correspondences from bilingual texts (e.g. Gale and Church (1991), Church (1993), and Kupiec (1993)), and word sense disambiguation for MT (e.g. Dagan, Itai and Schwall (1991)). In general, the statistical approach does not use existing hand-written bilingual dictionaries, and depends solely upon statistics. For example, sentence alignment of bilingual texts is performed just by measuring sentence lengths in words or in characters (Brown et al., 1991; Gale and Church, 1993), or by statistically estimating word-level correspondences (Chen, 1993; Kay and Röscheisen, 1993).</Paragraph>
    <Paragraph position="3"> The statistical approach analyzes unstructured sentences in bilingual texts, and it is claimed that the results are useful enough in real applications such as machine translation and word sense disambiguation.</Paragraph>
    <Paragraph position="4"> However, structured bilingual sentences are undoubtedly more informative and important for future natural language research. Structured bilingual or multilingual corpora serve as richer sources for extracting linguistic knowledge (Klavans and Tzoukermann, 1990; Sadler and Vendelmans, 1990; Kaji, Kida and Morimoto, 1992; Utsuro, Matsumoto and Nagao, 1992; Matsumoto, Ishimoto and Utsuro, 1993; Utsuro, Matsumoto and Nagao, 1993). These works differ markedly from the statistical approach in that they use word correspondence information available in hand-written bilingual dictionaries and try to extract structured linguistic knowledge such as structured translation patterns and case frames of verbs. For example, in Matsumoto et al. (1993), we proposed a method for finding structural matching of parallel sentences, making use of word-level similarities calculated from a bilingual dictionary and a thesaurus. Those structurally matched parallel sentences are then used as a source for acquiring lexical knowledge such as verbal case frames (Utsuro et al., 1992; Utsuro et al., 1993).</Paragraph>
    <Paragraph position="5"> With the aim of acquiring such structured linguistic knowledge, this paper describes a unified framework for bilingual text matching that combines existing hand-written bilingual dictionaries with statistical techniques. The process of bilingual text matching consists of two major steps: sentence alignment and structural matching of bilingual sentences. In both steps, we use word correspondence information that is either available in hand-written bilingual dictionaries or, when not included in them, estimated with statistical techniques.</Paragraph>
    <Paragraph position="6"> The reasons why we combine bilingual dictionaries and statistics are as follows. Statistical techniques are limited because 1) they require bilingual texts to be long enough for extracting useful statistics, while we need to acquire structured linguistic knowledge even from bilingual texts of about 100 sentences, and 2) even with bilingual texts long enough for statistical techniques, useful statistics cannot be extracted for low-frequency words. For reasons 1) and 2), the use of bilingual dictionaries is inevitable in our application. On the other hand, existing hand-written bilingual dictionaries are limited in that the available dictionaries cover only everyday words, and domain-specific on-line bilingual dictionaries are usually not available. Thus, statistical techniques are also inevitable for extracting domain-specific word correspondence information not included in existing bilingual dictionaries.</Paragraph>
    <Paragraph position="7"> At present, we are at the starting point of combining existing bilingual dictionaries and statistical techniques. Therefore, as statistical techniques for estimating word correspondences not included in bilingual dictionaries, we decided to adopt techniques as simple as possible, rather than techniques based on complex probabilistic translation models such as in Brown et al. (1990), Brown, Pietra, Pietra and Mercer (1993), and Chen (1993). What we adopt are simple co-occurrence-frequency-based techniques in Gale and Church (1991) and Kay and Röscheisen (1993). [Figure: overview of the framework. Japanese text and English text are each parsed (grammar, dictionary) into dependency structures; the structures are matched using word similarity from a bilingual dictionary, a thesaurus, and statistics; the output is lexical knowledge and translation patterns.]</Paragraph>
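    <Paragraph> As a rough illustration of what a co-occurrence-frequency-based estimate can look like, the sketch below scores candidate word pairs over already aligned sentence pairs. It is not the procedure of this paper or of the cited works: the Dice coefficient is swapped in here as one simple co-occurrence measure, and the function name dice_scores and the romanized toy sentences are hypothetical.

```python
from collections import Counter
from itertools import product


def dice_scores(aligned_pairs):
    """Score word pairs by the Dice coefficient of their sentence-level co-occurrence.

    aligned_pairs: iterable of (source_tokens, target_tokens), one tuple per
    aligned sentence pair. Each word is counted at most once per sentence.
    Only word pairs that co-occur at least once appear in the result.
    """
    src_freq, tgt_freq, pair_freq = Counter(), Counter(), Counter()
    for src_tokens, tgt_tokens in aligned_pairs:
        src_set, tgt_set = set(src_tokens), set(tgt_tokens)
        src_freq.update(src_set)
        tgt_freq.update(tgt_set)
        pair_freq.update(product(src_set, tgt_set))
    return {
        (s, t): 2 * c / (src_freq[s] + tgt_freq[t])
        for (s, t), c in pair_freq.items()
    }


# Toy data, invented for illustration only (assumption, not from the paper).
pairs = [
    (["inu", "ga", "hashiru"], ["the", "dog", "runs"]),
    (["inu", "ga", "neru"], ["the", "dog", "sleeps"]),
    (["neko", "ga", "neru"], ["the", "cat", "sleeps"]),
]
scores = dice_scores(pairs)
print(scores[("inu", "dog")])
print(scores[("inu", "sleeps")])
print(scores.get(("inu", "cat"), 0.0))
```

    On such a small sample many unrelated pairs also receive high scores, which mirrors the point made above that purely statistical estimates need enough text and say little about low-frequency words.</Paragraph>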
    <Paragraph position="8"> For sentence alignment, we also adopt quite a simple method based on the number of word correspondences, without any probabilistic translation models.</Paragraph>
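    <Paragraph> To make the idea of aligning sentences by the number of word correspondences concrete, here is a minimal sketch under stated assumptions. It is not the algorithm specified later in the paper: it assumes one-to-one sentence links with optional skips, a monotone order, and a lexicon given as a plain mapping from source words to sets of target words. The names count_correspondences and align and the toy data are hypothetical.

```python
def count_correspondences(src_tokens, tgt_tokens, lexicon):
    """Count source words that have at least one listed translation in the target sentence."""
    tgt_set = set(tgt_tokens)
    return sum(
        1 for w in src_tokens if not lexicon.get(w, set()).isdisjoint(tgt_set)
    )


def align(src_sents, tgt_sents, lexicon):
    """Monotone 1-1 alignment (sentences may be skipped) maximizing total correspondences."""
    n, m = len(src_sents), len(tgt_sents)
    # best[i][j]: best total score using the first i source and first j target sentences.
    best = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            linked = best[i - 1][j - 1] + count_correspondences(
                src_sents[i - 1], tgt_sents[j - 1], lexicon
            )
            best[i][j] = max(linked, best[i - 1][j], best[i][j - 1])
    # Trace back; a pair is linked only when linking strictly beats both skips.
    links = []
    i, j = n, m
    while i > 0 and j > 0:
        if best[i][j] == best[i - 1][j]:
            i -= 1
        elif best[i][j] == best[i][j - 1]:
            j -= 1
        else:
            links.append((i - 1, j - 1))
            i -= 1
            j -= 1
    return links[::-1]


# Toy data, invented for illustration only (assumption, not from the paper).
lexicon = {
    "inu": {"dog"},
    "hashiru": {"run", "runs"},
    "neko": {"cat"},
    "neru": {"sleep", "sleeps"},
}
src = [["inu", "ga", "hashiru"], ["hai"], ["neko", "ga", "neru"]]
tgt = [["the", "dog", "runs"], ["the", "cat", "sleeps"]]
links = align(src, tgt, lexicon)
print(links)
```

    The returned list holds zero-based (source index, target index) pairs; the middle source sentence has no correspondences and is left unlinked. In a combined setting, the lexicon could hold both dictionary entries and statistically estimated word pairs.</Paragraph>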
    <Paragraph position="9"> In the following sections, we describe the specifications of our bilingual text matching framework.</Paragraph>
  </Section>
</Paper>