<?xml version="1.0" standalone="yes"?>
<Paper uid="P04-1022">
  <Title>Collocation Translation Acquisition Using Monolingual Corpora</Title>
  <Section position="4" start_page="22" end_page="22" type="metho">
    <SectionTitle>
4 Collocation translation extraction from two monolingual corpora
</SectionTitle>
    <Paragraph position="0"> This section describes how to extract collocation translations from independent monolingual corpora. First, collocations are extracted from a monolingual triples database. Then, collocation translations are acquired using the triple translation model obtained in section 3.</Paragraph>
    <Section position="1" start_page="22" end_page="22" type="sub_section">
      <SectionTitle>
4.1 Monolingual collocation extraction
</SectionTitle>
      <Paragraph position="0"> As introduced in section 2, much work has been done on collocation extraction. Among the association measures proposed, the log likelihood ratio (LLR) has been shown to give better results (Dunning, 1993; Thanopoulos et al., 2002). In this paper, we use LLR as the metric for extracting collocations from a dependency triple database.</Paragraph>
      <Paragraph position="1"> For a given Chinese triple, the LLR score is computed from its occurrence counts, where N is the total count of all Chinese triples. Triples whose LLR values exceed a given threshold are taken as collocations. This syntax-based definition of collocation has the advantage that it can represent both adjacent and long-distance word associations. Here, we extract only the three main types of collocation mentioned in section 3.1.</Paragraph>
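      <Paragraph position="2"> The LLR scoring and thresholding described above can be sketched as follows. This is a minimal illustration that assumes the standard contingency-table form of Dunning's statistic. The function names, the (head, relation, dependent) tuple format, and the factor of 2 are assumptions made for the example and are not taken from the paper.

```python
import math
from collections import Counter


def xlogx(x):
    """Return x * log(x), with the convention 0 * log(0) = 0."""
    return x * math.log(x) if x > 0 else 0.0


def llr(a, b, c, d):
    """Log-likelihood ratio for a 2x2 contingency table (assumed standard form)."""
    n = a + b + c + d
    return 2.0 * (
        xlogx(a) + xlogx(b) + xlogx(c) + xlogx(d)
        - xlogx(a + b) - xlogx(a + c) - xlogx(b + d) - xlogx(c + d)
        + xlogx(n)
    )


def extract_collocations(triples, threshold):
    """Return {triple: score} for triples whose LLR exceeds the threshold.

    `triples` is a list of (head, relation, dependent) occurrences,
    a hypothetical stand-in for the dependency triple database.
    """
    triple_counts = Counter(triples)
    head_counts = Counter((w1, rel) for w1, rel, _ in triples)
    dep_counts = Counter((rel, w2) for _, rel, w2 in triples)
    n = len(triples)  # N: total count of all triple occurrences
    result = {}
    for (w1, rel, w2), a in triple_counts.items():
        b = head_counts[(w1, rel)] - a  # same head and relation, other dependent
        c = dep_counts[(rel, w2)] - a   # same relation and dependent, other head
        d = n - a - b - c               # all remaining triple occurrences
        score = llr(a, b, c, d)
        if score > threshold:
            result[(w1, rel, w2)] = score
    return result
```

The statistic is two-sided, so a triple that occurs much less often than expected can also score highly. A practical implementation may want to filter such cases separately.</Paragraph>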
    </Section>
    <Section position="2" start_page="22" end_page="22" type="sub_section">
      <SectionTitle>
4.2 Collocation translation extraction
</SectionTitle>
      <Paragraph position="0"> For the acquired collocations, we try to extract their translations from the other monolingual corpus using the triple translation model trained with the method proposed in section 3.</Paragraph>
      <Paragraph position="1"> Our objective is to acquire collocation translations as translation knowledge for a machine translation system, so only highly reliable collocation translations are extracted. Figure 2 describes the algorithm for Chinese-English collocation translation extraction. The best English triple candidate is extracted as the translation of a given Chinese collocation only if the Chinese collocation is, in turn, the best translation candidate of that English triple. The English triple itself is not necessarily a collocation. English collocation translations can be extracted in a similar way.</Paragraph>
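      <Paragraph position="2"> The bidirectional check described above can be sketched as follows. This is an illustrative outline only: the candidate generators and scoring functions are hypothetical callables standing in for the triple translation model of section 3, and all names are assumptions made for the example.

```python
def best_candidate(source, candidates, score):
    """Return the candidate with the highest score(source, candidate), or None."""
    return max(candidates, key=lambda cand: score(source, cand), default=None)


def extract_translations(collocations, e_candidates, c_candidates,
                         score_c2e, score_e2c):
    """Keep (Chinese, English) pairs that are each other's best candidate.

    e_candidates(c) and c_candidates(e) return candidate triples in the
    other language; score_c2e and score_e2c are hypothetical model scores.
    """
    pairs = {}
    for c_col in collocations:
        # Best English triple for the Chinese collocation.
        e_best = best_candidate(c_col, e_candidates(c_col), score_c2e)
        if e_best is None:
            continue
        # Translate back: best Chinese triple for that English triple.
        c_back = best_candidate(e_best, c_candidates(e_best), score_e2c)
        if c_back == c_col:
            pairs[c_col] = e_best
    return pairs
```

Requiring agreement in both directions trades recall for precision, which matches the stated goal of extracting only highly reliable translations.</Paragraph>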
    </Section>
    <Section position="3" start_page="22" end_page="22" type="sub_section">
      <SectionTitle>
4.3 Implementation of our approach
</SectionTitle>
      <Paragraph position="0"> Our English corpus is from the Wall Street Journal (1987-1992) and the Associated Press (1988-1990), and the Chinese corpus is from People's Daily (1980-1998). The two corpora are parsed using the NLPWin parser, and statistics for the resulting dependency triples are shown in tables 1 and 2. Token refers to the total number of triple occurrences and Type refers to the number of unique triples in the corpus. Statistics for the extracted Chinese collocations and the collocation translations are shown in Table 3.</Paragraph>
      <Paragraph position="1"> The NLPWin parser is a rule-based parser developed at Microsoft Research that parses several languages, including Chinese and English. Its output can be a phrase structure parse tree or a logical form represented with dependency triples.</Paragraph>
      <Paragraph position="2"> The translation dictionaries used in training and translation are combined from two dictionaries: HITDic and NLPWinDic. The final E-C dictionary contains 126,135 entries, and the C-E dictionary contains 91,275 entries.</Paragraph>
    </Section>
  </Section>
</Paper>