<?xml version="1.0" standalone="yes"?>
<Paper uid="I05-2020">
  <Title>Effect of Domain-Specific Corpus in Compositional Translation Estimation for Technical Terms</Title>
  <Section position="3" start_page="114" end_page="116" type="metho">
    <SectionTitle>
2 Collecting a Domain/Topic Specific Corpus
</SectionTitle>
    <Paragraph position="0"> When collecting a domain/topic specific corpus of the language T, for each technical term x</Paragraph>
    <Paragraph position="2"> . Our search engine queries are designed so that documents which describe the technical term x</Paragraph>
    <Paragraph position="4"> are ranked high.</Paragraph>
    <Paragraph position="5"> For example, an online glossary is one such document. Note that the queries in English and those in Japanese do not correspond to each other. When collecting a Japanese corpus, the search engine &quot;goo&quot;</Paragraph>
    <Section position="1" start_page="114" end_page="115" type="sub_section">
      <SectionTitle>
3.1 Overview
</SectionTitle>
      <Paragraph position="0"> An example of compositional translation estimation for a Japanese technical term is shown in Figure 2.</Paragraph>
      <Paragraph position="2"/>
      <Paragraph position="4"> First, the Japanese technical term is decomposed into its constituents by consulting an existing bilingual lexicon and retrieving Japanese headwords. In this example, the decomposition yields the two cases &quot;a&quot; and &quot;b&quot; in Figure 2. Then, each constituent is translated into the target language, and a confidence score is assigned to the translation of each constituent. Finally, translation candidates are generated by concatenating the translations of those constituents without changing the word order. The confidence score of a translation candidate is defined as the product of the confidence scores of its constituents. When the translation candidates are validated using the domain/topic specific corpus, those which are not observed in the corpus are discarded.</Paragraph>
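      <Paragraph> The decomposition, translation and concatenation steps above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the implementation used in this paper: the source term is assumed to be already segmented into tokens, the toy lexicon with its romanized headwords and its confidence scores is hypothetical, and the names decompose and generate_candidates are introduced here for the example.

```python
from itertools import product
from math import prod

# Hypothetical toy lexicon: source headword -> list of (translation, confidence).
# Headwords (romanized) and scores are invented for illustration only.
LEXICON = {
    "ouyou": [("applied", 1.0), ("application", 0.5)],
    "koudou": [("behavior", 1.0)],
    "bunseki": [("analysis", 1.0)],
    "koudou bunseki": [("behavior analysis", 10.0)],
}


def decompose(tokens, lexicon):
    """Yield every split of tokens into consecutive lexicon headwords."""
    if not tokens:
        yield []
        return
    for end in range(1, len(tokens) + 1):
        head = " ".join(tokens[:end])
        if head in lexicon:
            for rest in decompose(tokens[end:], lexicon):
                yield [head] + rest


def generate_candidates(tokens, lexicon):
    """Return (candidate, score) pairs, one per sequence of translation pairs."""
    results = []
    for constituents in decompose(tokens, lexicon):
        options = [lexicon[c] for c in constituents]
        for choice in product(*options):
            # Concatenate the translations without changing the word order.
            candidate = " ".join(t for t, _ in choice)
            # Score of the sequence = product of the constituent scores.
            score = prod(s for _, s in choice)
            results.append((candidate, score))
    return results


candidates = generate_candidates(["ouyou", "koudou", "bunseki"], LEXICON)
```

      With this toy lexicon, the term has two decompositions, so the same candidate string can be produced by more than one sequence of translation pairs; how such scores are combined is defined in Section 3.3.</Paragraph>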
    </Section>
    <Section position="2" start_page="115" end_page="116" type="sub_section">
      <SectionTitle>
3.2 Compiling Bilingual Constituents Lexicons
</SectionTitle>
      <Paragraph position="0"> This section describes how to compile bilingual constituents lexicons from the translation pairs of the existing bilingual lexicon Eijiro. The underlying idea of augmenting the existing bilingual lexicon with bilingual constituents lexicons is illustrated by the example in Figure 3. Suppose that the existing bilingual lexicon does not include a translation pair for the single constituent &quot;applied&quot;, while it includes many compound translation pairs whose first English word is &quot;applied&quot; and whose first Japanese word is likewise shared. Here, as the existing bilingual lexicon, we use Eijiro (http://www.alc.co.jp/) together with the bilingual constituents lexicons compiled from the translation pairs of Eijiro as described in this section.</Paragraph>
      <Paragraph position="2"> In such a case, we align those translation pairs and estimate a bilingual constituent translation pair, which is then added to a bilingual constituents lexicon. More specifically, from the existing bilingual lexicon, we first collect the translation pairs whose English terms and Japanese terms both consist of two constituents into another lexicon P_2. We compile the &quot;bilingual constituents lexicon (prefix)&quot; from the first constituents of the translation pairs in P_2 and the &quot;bilingual constituents lexicon (suffix)&quot; from their second constituents. The numbers of entries in each language and the numbers of translation pairs in those lexicons are shown in Table 1. In our assessment, only 27% of the 667 translation pairs mentioned in Section 1 can be generated compositionally using Eijiro alone, while the rate rises to 49% using both Eijiro and the bilingual constituents lexicons. In our rough estimation, the upper bound of this rate is about 80%. The improvement from 49% to 80% could be achieved by extending the bilingual constituents lexicons and by introducing constituent reordering rules with prepositions into the process of compositional translation candidate generation.</Paragraph>
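      <Paragraph> The compilation of the prefix and suffix lexicons can be sketched as below. This is a simplified illustration under assumptions that are not taken from this paper: constituents are separated by single spaces on both sides, first constituents are aligned with first constituents and second with second, and the toy translation pairs (with romanized Japanese) are hypothetical rather than actual Eijiro entries. The function name compile_constituent_lexicons is introduced here.

```python
from collections import Counter

# Hypothetical toy translation pairs (English term, romanized Japanese term).
# Constituents are assumed to be separated by single spaces on both sides.
PAIRS = [
    ("applied mathematics", "ouyou suugaku"),
    ("applied science", "ouyou kagaku"),
    ("applied robot", "ouyou robotto"),
    ("computer science", "keisanki kagaku"),
    ("science", "kagaku"),  # one constituent only: not part of P_2
]


def compile_constituent_lexicons(pairs):
    """Return (prefix, suffix) Counters over aligned constituent pairs.

    Only pairs with exactly two constituents on both sides (the set P_2)
    contribute. Each Counter maps an (English, Japanese) constituent pair
    to its frequency in P_2.
    """
    prefix, suffix = Counter(), Counter()
    for english, japanese in pairs:
        e, j = english.split(), japanese.split()
        if len(e) == 2 and len(j) == 2:
            prefix[(e[0], j[0])] += 1
            suffix[(e[1], j[1])] += 1
    return prefix, suffix


prefix_lexicon, suffix_lexicon = compile_constituent_lexicons(PAIRS)
```

      In this sketch the frequency stored for each constituent pair is the kind of count that the preference ordering in Section 3.3 refers to as the frequency in P_2.</Paragraph>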
    </Section>
    <Section position="3" start_page="116" end_page="116" type="sub_section">
      <SectionTitle>
3.3 Score of Translation Pairs in the Lexicons
</SectionTitle>
      <Paragraph position="0"> This section introduces a confidence score of translation pairs in the various lexicons presented in the previous section. Here, we suppose that the translation pair &lt;s, t&gt; of terms s and t is used when estimating translation from the language of the term s to that of the term t. First, in this paper, we assume that translation pairs follow certain preference rules and can be ordered as below:  1. Translation pairs &lt;s, t&gt; in the existing bilingual lexicon Eijiro, where the term s consists of two or more constituents.</Paragraph>
      <Paragraph position="1"> 2. Translation pairs in the bilingual constituents lexicons whose frequencies in P_2 are high.</Paragraph>
      <Paragraph position="2"> 3. Translation pairs &lt;s, t&gt; in the existing bilingual lexicon Eijiro, where the term s consists of exactly one constituent.</Paragraph>
      <Paragraph position="3"> 4. Translation pairs in the bilingual constituents lexicons whose frequencies in P_2 are not high.</Paragraph>
      <Paragraph position="4"> As the definition of the confidence score q(&lt;s, t&gt; ) of a translation pair &lt;s, t&gt; , in this paper, we use the following:</Paragraph>
      <Paragraph position="6"> Here, we define the confidence score of a translation candidate as the product of the confidence scores of the translation pairs of its constituents.</Paragraph>
      <Paragraph position="8"> It remains necessary to examine empirically whether this definition of the confidence score is optimal. However, according to our rough qualitative examination, the results of the confidence scoring seem stable when no domain/topic specific corpus is used, even with minor tuning that incorporates certain parameters into the score.</Paragraph>
      <Paragraph position="11"> If a translation candidate is generated from more than one sequence of translation pairs, the score of the translation candidate is defined as the sum of the scores of those sequences.</Paragraph>
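      <Paragraph> The summation over sequences can be illustrated with a short Python sketch. It assumes that the score of each sequence has already been computed as the product of the confidence scores of its translation pairs; the function name combine_scores and the numeric scores are hypothetical and serve only as an example.

```python
from collections import defaultdict


def combine_scores(scored_sequences):
    """Sum the scores of all sequences that yield the same candidate.

    scored_sequences is an iterable of (candidate, sequence_score) pairs,
    where each sequence_score is assumed to be already computed as the
    product of the confidence scores of the translation pairs used.
    """
    totals = defaultdict(float)
    for candidate, score in scored_sequences:
        totals[candidate] += score
    # Rank the candidates by total score, highest first.
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Hypothetical sequence scores, for illustration only.
ranked = combine_scores([
    ("applied behavior analysis", 1.0),
    ("application behavior analysis", 0.5),
    ("applied behavior analysis", 10.0),
    ("application behavior analysis", 5.0),
])
```

      Summing rather than taking the maximum means that a candidate reachable through several decompositions is ranked above one that is supported by a single sequence with the same best score.</Paragraph>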
    </Section>
  </Section>
  <Section position="4" start_page="116" end_page="116" type="metho">
    <SectionTitle>
4 Translation Candidate Validation using a Domain/Topic Specific Corpus
</SectionTitle>
    <Paragraph position="0"> It is not clear whether the translation candidates generated by the method described in Section 3 are valid as English or Japanese terms, nor whether they belong to the domain/topic. Therefore, using a domain/topic specific corpus collected by the method described in Section 2, we examine whether the translation candidates are valid as English or Japanese terms and whether they belong to the domain/topic. In our validation method, given a ranked list of translation candidates, each candidate is checked against the corpus, and any candidate that is not observed in the corpus is removed from the list.</Paragraph>
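    <Paragraph> The validation step can be sketched as a simple filter over the ranked list. The matching criterion used here (a case-insensitive substring test against the corpus text) is an assumption made for the example, since the exact matching procedure is not specified above; the function name validate_candidates, the ranked list and the corpus snippet are likewise hypothetical.

```python
def validate_candidates(ranked_candidates, corpus_text):
    """Keep only the candidates that occur in the corpus, preserving rank."""
    # Assumption: a candidate counts as observed if it appears as a
    # case-insensitive substring of the corpus text.
    haystack = corpus_text.lower()
    return [
        (candidate, score)
        for candidate, score in ranked_candidates
        if candidate.lower() in haystack
    ]


# Hypothetical ranked list and corpus snippet, for illustration only.
ranked = [
    ("applied behavior analysis", 11.0),
    ("application behavior analysis", 5.5),
]
corpus = "Applied behavior analysis is covered in this online glossary."
validated = validate_candidates(ranked, corpus)
```

    Because the filter only removes entries, the relative order of the surviving candidates is the order given by their confidence scores.</Paragraph>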
  </Section>
</Paper>