<?xml version="1.0" standalone="yes"?>
<Paper uid="W00-1219">
  <Title>Extraction of Chinese Compound Words - An Experimental Study on a Very Large Corpus</Title>
  <Section position="2" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Almost all techniques for statistical language processing, including those for speech recognition, machine translation and information retrieval, are based on words. Although word-based approaches work very well for Western languages, where words are well defined, they are difficult to apply to Chinese.</Paragraph>
    <Paragraph position="1"> Chinese sentences are written as character strings with no spaces between words. Words are therefore not explicitly marked in Chinese sentences, and there is no commonly accepted Chinese lexicon.</Paragraph>
    <Paragraph position="2"> Furthermore, since new compounds (words formed with at least two characters) are constantly created, it is impossible to list them exhaustively in a lexicon. Automatic extraction of compounds is therefore an important issue. Traditional extraction approaches used rules. However, compounds extracted in this way are not always desirable, so human effort is still required to find the preferred compounds in a large compound candidate list. [Footnote 1: This work was done while the author worked for Microsoft Research China as a visiting student.]</Paragraph>
    <Paragraph position="3"> Some statistical approaches to extracting Chinese compounds from corpora have also been proposed (Lee-Feng Chien 1997, WU Dekai and Xuanyin XIA 1995, Ming-Wen Wu and Keh-Yih Su 1993), but almost all of these experiments were based on relatively small corpora, so it is not clear whether the methods still work well with a large corpus.</Paragraph>
    <Paragraph position="4"> In this paper, we investigate statistical approaches to Chinese compound extraction from a very large corpus using two statistical features, namely mutual information and context dependency. This paper makes three main contributions. First, we apply our procedure to a very large corpus, whereas other experiments were based on small or medium-sized corpora. We show that better results can be obtained with a large corpus. Second, we examine how the results are influenced by parameter settings, including the mutual information and context dependency restrictions.</Paragraph>
    <Paragraph position="5"> It turns out that mutual information mainly affects precision, while context dependency affects the number of extracted items. Third, we test the usefulness of the extracted compounds for information retrieval (IR). Our IR experiments show that the new compounds have a positive effect on retrieval.</Paragraph>
    <Paragraph position="6"> The rest of this paper is structured as follows. In Section 2, we describe the techniques we used. In Section 3, we present several sets of experimental results. In Section 4, we outline related work and its results. Finally, we give our conclusions in Section 5.</Paragraph>
  </Section>
</Paper>