<?xml version="1.0" standalone="yes"?>
<Paper uid="W97-0316">
  <Title>Lexicon Effects on Chinese Information Retrieval</Title>
  <Section position="3" start_page="0" end_page="142" type="metho">
    <SectionTitle>
2 Short-Word Segmentation
</SectionTitle>
    <Paragraph position="0"> While word segmentation for linguistic analysis may aim at the longest string that carry a specific semantic content, this may not be ideal for IR because one then has to deal with the problem of partial string matching when a query term matches only part of a document term or vice versa. Instead, we aim at segmenting texts into short words of one to three characters long that function like English content terms. Our process is based on the following four steps A to D: A) facts - lookup on a manually created 2175-entry lexicon called L0. This is small, consisting of commonly used words of 1 to 3 characters, with some proper nouns of size 4. Each entry is tagged as 0 (useful: total 1337), 1 (stopword: 671), s (symbol: 88), 6 (numeric: 37), 4 (punctuation: 9), and 2 or 3 for the rules below. Other researchers have used lexicons of hundreds of thousands. We do not have such a large resource; besides, maintenance of such a list is not trivial. We try to remedy this via rules.</Paragraph>
    <Paragraph position="1"> Given an input string, we scan left to right and  perform longest matching when searching on the lexicon. Any match will result in breaking a sentence into smaller chunks of texts. Fig.lb shows the result of processing an original TREC query (Fig.la) after our lexicon lookup process.</Paragraph>
    <Paragraph position="2"> B) rules - for performing further segmentation on chunks. Words in any language are dynamic and one can never capture 'all' Chinese words in a lexicon for segmentation purposes. We attempt to identify some common language usage ad-hoc rules that can be employed to further split the chunks into short words. The rules that we use, together with their rationale and examples and counter-examples are described below: below (ex.7-13). When character p is tagged '2', we also try to identify common words where p is used as a word in the construct yp, and these are entered into the lexicon, yp may or may not be a stopword. This way a string like ..ypx.. would be split 'yp x' rather than 'y px', dictionary entries being of higher precedence. This rule works in many cases, but we believe that our list may be too long, and many words that have content (such as ex.14-15) are stopped.</Paragraph>
    <Paragraph position="3"> Rule 3: xQ, where Q currently has only 2 special characters, are stopwords for any x -- these are tagged '3' and is a complement to Rule 2 (see ex.16-19 and counter-examples ex.20-21).</Paragraph>
    <Paragraph position="4"> Rule D (for double): any two adjacent similar characters xx are considered stopwords -- this identifies double same characters that are often used as adjectives or adverbs that do not carry much content (see ex. 1-3 below). However, some Chinese names do use double same characters (ex.4) and we would 'stop' them wrong. Other cases such as 'Japan Honshu' (ex.5), 'U.S. Congress' (ex.6) requires splitting between the same two characters. In these cases we rely on 'Japan' or 'U.S.' being on the lexicon and identified first before applying this rule.</Paragraph>
    <Paragraph position="5">  (1) ~. daily (z) slowly (3) ~ every uhere Counter-Examples: (4) ~ person name (5) H~21-~I Japan Honshu (e) u.s. congress Rule 2: Px, where P is a small set of 31 special  characters, are stopwords for any x -- these characters are tagged '2' in our lexicon and examples are shown  (7) ~:~ a branch/stick of (8) --~ early (9) --~ together (18) (~,~)~ (this, that) kind (11) (~,~)~ (this, that) time (12) ~ consider to be (13) ~,,~.\]~. in earnest Counter-Examples: (14) --\[\] one country (15) ~ admit mistake  Rule E (for even): any remaining sequence of even number of characters are segmented two by two -- this arises from the observation that 70-80% of Chinese words are 2-characters long, and the rhythm of Chinese are often bi-syllable punctuated with monosyllables and tri-syllables. If one can identify where the single character words occur, the rest of the string quite often can be split as such when it is even. These single characters are often stopwords that hopefully are in our lexicon. Examples 22 to 26 below show chunks that are even, being surrounded by punctuation signs or stopwords. They will be segmented correctly. Examples 27 to 29 show counter-examples with even number of characters that do not obey this rule. In addition, numeric entries are also removed as stopwords although one can often detect a sequence of them and have it identified as a number.</Paragraph>
    <Paragraph position="6"> C) frequency filter - after a first pass through the test corpus via steps A and B, a list of candidate short-words will be generated with their frequency of occurrence. A threshold is used to extract the most commonly occurring ones. These are our new short-words that are 'data-mined' from the corpus itself. D) iteration - using the newly identified short-words of Step C all tagged useful for segmentation purposes, we expand our initial lexicon in step A and re-process the corpus. In theory, we could continue to iterate, but we have only done one round. With a frequency threshold value in Step C of 30, a final lexicon size of 15,234 called L01 was obtained.</Paragraph>
    <Paragraph position="7"> We believe the rules we use for Step B, though  simple, are useful. They naturally do not work always, but may work correctly often enough for IR purposes. Fig.lc shows the results of processing the TREC-5 query #28 based on these rules after Step A.</Paragraph>
    <Paragraph position="8"> Comparison with a manual short word segmentation of the set of 28 TREC-5 queries shows that we achieve 91.3% recall and 83% precision on average. It is possible that these queries are easy to segment. Our method of segmentation is certainly too approximate for other applications such as linguistic analysis, textto-speech, etc. For IR, where the purpose is to detect documents with high probability of relevance rather than exact matching of meaning and is a more forgiving environment, it may be adequate. Besides, one also has other tools in IR to remedy the situation. These are discussed below.</Paragraph>
  </Section>
  <Section position="4" start_page="142" end_page="143" type="metho">
    <SectionTitle>
3 The Retrieval Environment
</SectionTitle>
    <Paragraph position="0"> Our investigations are based on the TREC-5 Chinese collection of 24,988 Xinhua and 139,801 People's Daily news articles totaling about 170 MB. To guard against very long documents which can lead to outlier in frequency estimates, these are divided into subdocuments of about 475 characters in size ending on a paragraph boundary. This produces a total of 247,685 subdocuments which are segmented into short-words as described in Section 2. In addition, the single characters from each word of length two or greater are also used for indexing purposes to guard against wrong segmentation.</Paragraph>
    <Paragraph position="1"> Provided with the TREC-5 collection are 28 very long and rich Chinese topics, mostly on current affairs. They are processed like documents into queries. These topics representing user needs have also been manually judged with respect to the (most fruitful part of the) collection at NIST so that a set of relevant documents for each query is known. This allows retrieval results to be evaluated against known answers.</Paragraph>
    <Paragraph position="2"> For retrieval, we use our PIRCS (acronym for Probabilistic Indexing and Retrieval - Components -System) engine that has been documented elsewhere \[Kwok 1990,1995\] and has participated in the past five TREC experiments with admirable results \[see for example Kwok &amp; Grunfeld 1996\]. PIRCS is an automatic, learning-based IR system that is conceptualized as a 3-layer network and operates via activation spreading. It combines different probabilistic methods of retrieval that can account for local as well as global term usage evidence. Our strategy for ad-hoc retrieval involves two stages. The first is the initial retrieval where a raw query is used directly. The d best-ranked documents from this retrieval are then regarded as relevant without user judgment, and employed as feedback data to train the initial query term weights and to add new terms to the query - query expansion. This process has been called pseudo-feedback. This expanded query retrieval then provides the final result. This second retrieval in general can provide substantially better results than the initial if the initial retrieval is reasonable and has some relevants within the d best-ranked documents. The process is like having a dynamic thesaurus bringing in synonymous or related terms to enrich the raw query.</Paragraph>
    <Paragraph position="3"> As an example of a retrieval, we have shown in Table 1 comparing the TREC-5 Chinese experiment using bigram representation with our method of text segmentation in the PIRCS system. The table is a standard for the TREC evaluation. Precision is defined as the proportion of retrieved documents which are relevant, and recall that of relevant documents which are retrieved. In general when more documents are retrieved, precision falls as recall increases. It can be  seen that the two methods provide quite similar performance - bigram method ranks 2125 of the 2182 known relevant documents within the first 1000 retrieved for the 28 queries while the short-word method has about 5% less, at 2015. The latter has a slight edge in average precision (0.4516 vs 0.4477).</Paragraph>
    <Paragraph position="4"> Average precision is often used as a standard for comparison.</Paragraph>
    <Paragraph position="5"> The precision at different number of documents retrieved, a user-oriented measure, are also comparable in both cases.</Paragraph>
  </Section>
  <Section position="5" start_page="143" end_page="144" type="metho">
    <SectionTitle>
4 Lexicon Effects on Retrieval
</SectionTitle>
    <Paragraph position="0"> In bigram representation of text, no lexicon is used and many meaningless bigrams as well as many that are true stopwords are included. Yet they do not seem to affect retrieval effectiveness. We take this as a clue that stopword removal may not play an important role in Chinese IR and lead us to investigate its effect. We also like to see how lexicon size can affect retrieval.</Paragraph>
    <Paragraph position="1"> Usually one needs as large a dictionary as possible so that many segmentation patterns are available for the system to select the correct one.</Paragraph>
    <Paragraph position="2"> An entry in our lexicon list can serve the purpose of a segmentation marker or, in addition, for detection of stopwords. In our system stopwords can be determined in three ways based on: lexicon, rule or frequency threshold (statistical). The last category arises from Zipfian behavior of terms and is standard for IR processing: features with frequencies that are too high or too low have adverse effects on retrieval effectiveness and efficiency. This is done as a default, and is also performed for bigrams.</Paragraph>
    <Paragraph position="3"> Our lexicon-based stopwords consists of 671 entries in our list tagged as '1'. The major rule-based stopword removal is Rule 2, while others have minor effects because they occur much less often. A run through the collection shows that the number of times tag 1 and Rule 2 were exercised are about 1.9m and 2.1m.</Paragraph>
    <Paragraph position="4"> We have enabled Rules D and E, tags 0,3 and 4 to be effective for segmentation as a default, and perform experiments where the lexicon (tag 1 &amp; 6) and rule-based (Rule 2) stopword removal (and segmentation) can be activated or deactivated as follows: tag 1,6 Rule 2  For example, ExpTyp.2 means lexicon entries with tags 1,6 are used for segmentation only, while those obeying Rule 2 serve to segment and removed as well. An ExpTyp.5 will be explained later. Retrievals using lexicons of four different sizes with long and short versions of the TREC-5 queries were performed and evaluated.</Paragraph>
  </Section>
  <Section position="6" start_page="144" end_page="145" type="metho">
    <SectionTitle>
5 Results and Discussion
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="144" end_page="145" type="sub_section">
      <SectionTitle>
5.1 Long Queries
</SectionTitle>
      <Paragraph position="0"> Table 2 tabulates the precision and recall values averaged over 28 long queries using L0, the 2175-entry and L01, the 15234-entry lexicons. In ExpTyp. 1 under L0 for example, where tags 1 &amp; 6 as well as Rule 2 are in effect for segmentation only, an average precision of 0.455 and recall of relevants (at 1000 retrieved) of 2059 out of 2182 are achieved. On average close to 5.96 out of the first 10 retrieved documents are relevant. This is very good performance for a purely statistical retrieval system.</Paragraph>
      <Paragraph position="1"> It is also interesting to see that the small lexicon is sufficient to yield this good result. Indirectly, it shows that our rule-based segmentation (Rule D, E, 2) can define sufficiently good features for retrieval, and remedies our deficiency in lexicon size. When both tag 1,6 entries and Rule 2 are used for stopword removal (ExpTyp.4, L0), average precision remains practically the same at 0.457. Similarly for ExpTyps.2 &amp; 3 L0, where either Rule2 or tag 1,6 are used for stopword removal, effectiveness does not seem to alter much. Removal of tag 1,6 words however decreases the number of relevants slightly from 2060 to around 2040. It appears that the presence of stopwords have little effect on Chinese IR, just as noticed for bigrams. ExpTyp.5 L0 in Table 2 is included as a demonstration of the perils associated with stopword removal. It shows about a 2% drop in average precision as well as in relevants retrieved compared with ExpTyp.4 L0 due to bad result of one single query. Query #19 asks for documents on 'Project Hope', and the Chinese query is shown below. The  word 'hope' is often used in the context of 'We hope to/that..' or 'My hope is ..' and quite non-content bearing. It is not unreasonable to regard it as a stopword in both English and Chinese. However, for this query it is crucial. ExpTyp.5 L0 is done under the same circumstances as ExpTyp.4 L0 except that the word 'hope' is changed to be a stopword (tag 1). This query then practically accounts for all the adverse effect. Since the presence of stopwords has been shown to have a benign effect on Chinese retrieval, it appears advisable to keep them as indexing terms to guard against such unexpected results.</Paragraph>
      <Paragraph position="2"> In Table 2 under L01, we repeat the same experiments using our larger lexicon which is derived from the collection using L0 as the basis. It is seen that the larger lexicon improves average precision by about 1%, from around 0.456 to about 0.461.</Paragraph>
      <Paragraph position="3"> Otherwise, the two sets of experiments are qualitatively similar. Since retrieval is crucially dependent on how well the queries are processed, it appears that the 28 are well-prepared for retrieval using the original 2175-entry lexicon.</Paragraph>
      <Paragraph position="4"> Recently, we further augment our L0 to a larger initial lexicon L1 with 27,147 entries. This derives L11, a 42,822-entry lexicon from the collection based on our segmentation procedure. Results of repeating the retrieval experiments using these two larger lexicons are shown in Table 3. There is incremental improvements in average precision by using the larger lexicon: e.g. for ExpTyp.1, from 0.455 (L0) to 0.463 (Lll), about 2%. The removal of stopwords for Lll  but the peril of accidentally removing a crucial word remains, leading again to about 2% drop in effectiveness (ExpTyp.5 vs 4 L11).</Paragraph>
    </Section>
    <Section position="2" start_page="145" end_page="145" type="sub_section">
      <SectionTitle>
5.2 Short Queries
</SectionTitle>
      <Paragraph position="0"> It has been pointed out that the paragraph-size TREC queries are long and unrealistic because real-life queries are usually very short, like one or two words.</Paragraph>
      <Paragraph position="1"> One or two words, on the other hand, often do not supply sufficient clues to a retrieval engine. To study the effects of lexicons on short queries, we further perform retrievals using only the first sentence of each query that belongs to the 'title' section of an original topic. They average to a few short-words and we hope to see more pronounced effects. These results are shown in Table 4.</Paragraph>
      <Paragraph position="2"> As expected, retrieval effectiveness decreases substantially over 10% compared to the full length queries: from around 0.463 to 0.409 (ExpTyp.1 Lll, Tables 3&amp;4). The larger lexicon L11 also has an edge over L0 (average precision 0.409 vs 0.398 Table 4), and the use stopwords (ExpTyp.4 vs 1 Lll) can improve precision as for long queries, but the accidental removal of a crucial word can lead to a much bigger adverse effect of 6% drop in average precision (ExpTyp.5 vs ExpTyp.4). Especially hard hit is the number of relevants at 1000 retrieved, which decreases by 11% (1962 vs 1732). The reason for this pronounced effect is that when a query is short (like two words 'Project Hope') and a crucial word (' Hope') is removed, what is left for retrieval is practically useless. In long queries however, many other terms are still available to remedy the removed crucial word,</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="145" end_page="145" type="metho">
    <SectionTitle>
&amp; Lll.
</SectionTitle>
    <Paragraph position="0"> and the effect is less pronounced.</Paragraph>
  </Section>
class="xml-element"></Paper>