<?xml version="1.0" standalone="yes"?>
<Paper uid="W97-0316">
  <Title>Lexicon Effects on Chinese Information Retrieval</Title>
  <Section position="2" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> It is well known that a sentence in Chinese (or several other oriental languages) consists of a continuous string of 'characters' without delimiting white spaces to identify words. In Chinese, the characters are called ideographs. This makes machine processing of these languages difficult, since isolated words are needed for many purposes, such as linguistic analysis and machine translation. Automatically and correctly isolating the words in a sentence, a process called word segmentation, is therefore an important and necessary first step before other analysis can begin. Many researchers have proposed practical methods to resolve this problem (Nie et al., 1995; Wu and Tsang, 1995; Jin &amp; Chen, 1996; Ponte &amp; Croft, 1996; Sproat et al., 1996; Sun et al., 1997).</Paragraph>
    <Paragraph position="1"> Information retrieval (IR) deals with the problem of selecting documents relevant to a user need that is expressed in free text. The document collection is usually huge, on the order of gigabytes, and both queries and documents are unrestricted in domain and unpredictable.</Paragraph>
    <Paragraph position="2"> When one does IR in Chinese, with its peculiar lack of word delimiters, one would assume that accurate word segmentation is also a crucial first step before other processing can begin.</Paragraph>
    <Paragraph position="3"> However, in the recent 5th Text REtrieval Conference (TREC-5), where a fairly large-scale Chinese IR experiment was performed [Kwok and Grunfeld, 199x], we demonstrated that a simple word segmentation method, coupled with a powerful retrieval algorithm, is sufficient to provide quite good retrieval results. Moreover, experiments by others using an even simpler bigram representation of text (i.e.</Paragraph>
    <Paragraph position="4"> all consecutive overlapping pairs of characters), both within and outside the TREC environment, also produce good results [Ballerini et al., 199x; Buckley et al., 199x; Chien, 1995; Liang et al., 1996]. This is somewhat counter-intuitive, because the bigram method leads to an indexing feature space about three times as large as that of our segmentation (approximately 1.5 million vs. 0.5 million features), and one would expect many random, non-content matches between queries and documents that could adversely affect precision. Apparently, this is not so. Based on this observation, we made some adjustments to our lexicon, and we report experimental results on the effects of the lexicon on retrieval effectiveness.</Paragraph>
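    The bigram representation described above (all consecutive overlapping pairs of characters) can be sketched roughly as follows. This is a minimal Python illustration under stated assumptions, not the method of any of the cited systems: the function names char_bigrams, build_index and overlap_scores are hypothetical, and the score is a plain count of shared bigrams standing in for a real retrieval algorithm.

```python
def char_bigrams(text):
    """Return all consecutive overlapping two-character substrings of text."""
    # Assumption: whitespace is dropped so no bigram spans a space or line break.
    chars = [c for c in text if not c.isspace()]
    return ["".join(chars[i:i + 2]) for i in range(len(chars) - 1)]


def build_index(docs):
    """Map each bigram to the set of document ids that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for gram in char_bigrams(text):
            index.setdefault(gram, set()).add(doc_id)
    return index


def overlap_scores(query, index):
    """Count, per document, how many distinct query bigrams it contains.

    Toy score for illustration only; it is not a ranking formula from the paper.
    """
    scores = {}
    for gram in set(char_bigrams(query)):
        for doc_id in index.get(gram, ()):
            scores[doc_id] = scores.get(doc_id, 0) + 1
    return scores


if __name__ == "__main__":
    # Two made-up documents; no segmentation lexicon is needed.
    docs = {"d1": "中文信息检索", "d2": "信息处理"}
    index = build_index(docs)
    print(char_bigrams(docs["d1"]))
    print(overlap_scores("信息检索", index))
```

    In this toy example the first document yields the bigram 文信, which straddles two words and carries no content of its own. Features of this kind are what enlarge the bigram feature space and what one might expect to cause random matches.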
  </Section>
</Paper>