<?xml version="1.0" standalone="yes"?> <Paper uid="A92-1020"> <Title>A Corpus-Based Statistical Approach to Automatic Book Indexing</Title> <Section position="1" start_page="0" end_page="0" type="metho"> <SectionTitle> Sampo Research Institute Abstract </SectionTitle> <Paragraph position="0"> This paper reports on a new approach to the automatic generation of back-of-book indexes for Chinese books. Parsing at the level of complete sentential analysis is avoided because it is inefficient and because no Chinese grammar with sufficient coverage is available. Instead, a fundamental analysis particular to Chinese text, called word segmentation, is performed to break up the characters into a sequence of lexical units equivalent to words in English.</Paragraph> <Paragraph position="1"> The sequence of words then goes through part-of-speech tagging and noun phrase analysis. All these analyses are done using a corpus-based statistical algorithm. Experiments have shown satisfactory results.</Paragraph> <Paragraph position="2"> 1. Introduction. Preparing back-of-book indexes is of vital importance to the publishing industry but is a very labor-intensive task. Attempts have been made over the years to automate this procedure for the apparent benefits of cost savings, shorter preparation time, and the possibility of producing more complete and consistent indexes. Early work used the occurrence characteristics of content words [Borko, 1970]. Later, people came to realize that indexes are often multi-word terms and that their generation might involve more elaborate syntactic analysis at the phrasal or sentential level [Salton, 1988; Dillon and McDonald, 1983]. However, a fully syntactic approach [Salton, 1988] to this task has real problems with efficiency and coverage for unrestricted text. No viable automatic solution is currently in use.</Paragraph> <Paragraph position="3"> Indexing Chinese books involves another severe obstacle, namely the word segmentation problem. 
Chinese text consists of a sequence of characters which roughly correspond to letters in English. * This research was supported by the ROC National Science Council under Contract NSC 81-0408-E-007-529.</Paragraph> <Paragraph position="4"> However, there are no spaces to mark the beginning and end of a word as in English. Until recently, this problem was considered difficult to solve without elaborate syntactic and semantic analyses [Chen, 1988].</Paragraph> <Paragraph position="5"> Recent research advances may lead to the development of viable indexing methods for Chinese books. These include efficient and high-precision word segmentation methods for Chinese text [Chang et al., 1991; Sproat and Shih, 1990; Wang et al., 1990], statistical analyses of a Chinese corpus [Liu et al., 1975], large-scale electronic Chinese dictionaries with part-of-speech information [Chang et al., 1988; BDC, 1992], corpus-based statistical part-of-speech taggers [Church, 1988; DeRose, 1988; Beale, 1988], and phrasal and clausal analyzers [Church, 1988; Ejerhed, 1990].</Paragraph> </Section> <Section position="2" start_page="0" end_page="149" type="metho"> <SectionTitle> 2. Problem description </SectionTitle> <Paragraph position="0"> As pointed out in [Salton, 1988], back-of-book indexes may consist of more than one word and are derived from noun phrases. Given the text of a book, an indexing system must perform some kind of phrasal and statistical analysis to produce a list of candidate indexes and their occurrence statistics, from which indexes such as those in Figure 1 are generated. Figure 1 is an excerpt from the reconstructed indexes of a book on transformational grammar for Mandarin Chinese [Tang, 1977].</Paragraph> <Paragraph position="1"> Before phrasal analysis can be performed, the text must go through the more fundamental morphological and part-of-speech analyses. 
The morphological analysis of Chinese text is mainly a so-called word segmentation process, which segments a sequence of Chinese characters into a sequence of words. See Figure 2 for an illustration.</Paragraph> <Paragraph position="2"> The noun phrase generation process described in this paper is based on corpus-based statistical analysis and does not use an explicit syntactic representation.</Paragraph> <Paragraph position="3"> Examples of the noun phrases found are underlined in Figure 2.</Paragraph> <Paragraph position="4"> 3. Generating Indexes. Part-of-speech tag set: 1. V Verbs (Predicative); 2. NC Nouns; 3. NP Proper Names or Pronouns; 4. A Adjectives (Non-Predicative); 5. P Prepositions; 6. ADV Adverbs; 7. CJ Conjunctions; 8. D Determiners; 9. Q Quantifiers; 10. CL Classifiers; 11. LOC Locatives; 12. ASP Aspect Markers; 13. CTS Sentential Clitics; 14. CTN Noun Clitics; 15. CTM Modifier Clitics. The constraint satisfaction problem. The constraint satisfaction problem (CSP) involves the assignment of values to variables subject to a set of constraining relations. Examples of CSPs include map coloring, understanding line drawings, and scheduling [Dechter and Pearl, 1988]. The CSP with binary constraints can be defined as follows: Given a set of n variables X1, X2, ..., Xn and a set of binary constraints Kij, find all possible n-tuples (x1, x2, ..., xn) such that each n-tuple is an instantiation of the n variables satisfying (xi, xj) in Kij, for all Kij.</Paragraph> <Section position="1" start_page="147" end_page="148" type="sub_section"> <SectionTitle> 3.1. Word Segmentation Segmentation through Constraint Satisfaction </SectionTitle> <Paragraph position="0"> The word segmentation problem for Chinese can be simply stated as follows: Given a Chinese sentence, segment the sentence into words. 
For example, given the sentence [...], we are supposed to segment it into [ba/liuxianzhong/de/queshi/xiendong/cuo/le/fenxi], glossed as "Xian-Zhong Liu's exact action was given an analysis",</Paragraph> <Paragraph position="1"> where Liu is a surname and Xian-Zhong is a given name. In the following, we describe a method that extends our previous work on segmentation [Chang et al., 1991a] to handle surname-names [Chang et al., 1991b].</Paragraph> <Paragraph position="2"> Segmentation is solved as a constraint satisfaction problem.</Paragraph> <Paragraph position="3"> Segmentation as a Constraint Satisfaction Problem. The word segmentation problem can be cast as a CSP as follows: Suppose that we are given a sequence of Chinese characters (C1, C2, ..., Cn) and are to segment the sequence into subsequences of characters that are either words in the dictionary or surname-names. We can think of a solution to this segmentation problem as an assignment of break/continue (denoted by the symbols '>' and '=' respectively) to each place Xi between two adjacent characters Ci and Ci+1:</Paragraph> <Paragraph position="5"> subject to the constraint that the characters between two closest breaks correspond to either a Chinese word in the dictionary or a surname-name. (For convenience, we add two more places: one at the beginning, the other at the end.) So the set of constraints can be constructed as follows: For each sequence of characters Ci, ..., Cj (j >= i) that is a Chinese word in the dictionary or a surname-name, if j = i, then put (>,>) in Ki-1,i.</Paragraph> <Paragraph position="6"> if j > i, then put (>,=) in Ki-1,i, (=,=) in Ki,i+1, ..., and (=,>) in Kj-1,j.</Paragraph> <Paragraph position="7"> For example, consider again the following: The corresponding CSP is</Paragraph> <Paragraph position="9"> are either words in the dictionary or probable surname-names (hypothesized words).</Paragraph> <Paragraph position="10"> Typically, there will be more than one solution to this CSP. 
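The construction above can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions, not the original system: the function names build_constraints and solve, the toy lexicon, and the use of Latin letters in place of Chinese characters are all hypothetical. Solutions are found by brute-force enumeration, and each chunk is also checked against the lexicon as an extra safeguard, since pairwise constraints between adjacent places are a necessary condition only.

```python
from itertools import product


def build_constraints(chars, lexicon):
    """Build the binary constraint sets K[(p, p + 1)] over places 0..n.

    Place p (1 .. n-1) lies between chars[p-1] and chars[p]; places 0 and n
    are the two extra places added at the beginning and at the end.
    """
    n = len(chars)
    K = {(p, p + 1): set() for p in range(n)}
    for i in range(1, n + 1):          # 1-based start of a candidate word
        for j in range(i, n + 1):      # 1-based end of a candidate word
            if "".join(chars[i - 1:j]) not in lexicon:
                continue
            if j == i:                 # one-character word
                K[(i - 1, i)].add((">", ">"))
            else:                      # multi-character word
                K[(i - 1, i)].add((">", "="))
                for p in range(i, j - 1):
                    K[(p, p + 1)].add(("=", "="))
                K[(j - 1, j)].add(("=", ">"))
    return K


def solve(chars, lexicon):
    """Enumerate all break/continue assignments that satisfy every Kij."""
    n = len(chars)
    if n == 0:
        return [[]]
    K = build_constraints(chars, lexicon)
    solutions = []
    for X in product(">=", repeat=n + 1):   # brute force, toy sizes only
        if not all((X[p], X[p + 1]) in K[(p, p + 1)] for p in range(n)):
            continue
        # Cut the character sequence at every break after place 0.
        words, start = [], 0
        for p in range(1, n + 1):
            if X[p] == ">":
                words.append("".join(chars[start:p]))
                start = p
        # Safeguard (an assumption of this sketch): keep only real entries.
        if all(w in lexicon for w in words):
            solutions.append(words)
    return solutions


# Hypothetical toy input: three "characters" and a five-entry lexicon.
print(solve(list("ABC"), {"A", "B", "C", "AB", "BC"}))
# prints [['A', 'B', 'C'], ['A', 'BC'], ['AB', 'C']]
```

Even this three-character toy input admits three segmentations, which is why a criterion for ranking the solutions is needed.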
The most probable solution, i.e. the one with the highest product of the probabilities of its hypothesized words, is therefore chosen. Ordinary words are listed in the dictionary along with probabilities estimated from a general corpus [Liu et al., 1975]. As for proper names such as Chinese surname-names that are not listed in the dictionary, their probabilities are approximated using another corpus containing more than 18,000 names, as described in the following subsection.</Paragraph> <Paragraph position="11"> The Problem with Proper Names in Chinese Text. Proper nouns account for only about 2% of average Chinese text. However, according to a recent study on word segmentation [Chang et al., 1991a], they account for at least 50% of the errors made by a typical segmentation system. Moreover, proper names are often indexes.</Paragraph> <Paragraph position="12"> Therefore their correct segmentation is crucial to the automatic generation of back-of-book indexes.</Paragraph> <Paragraph position="13"> The difficulties involved in handling proper names are due to the following: (1) There is no orthographic marking comparable to capitalization in English. (2) Most characters in proper names also have other uses. This problem has therefore been held impossible to solve in the segmentation process, and it was suggested that proper names are best left untouched during segmentation, relying on syntactic and semantic analysis to resolve them when nothing can be made of the characters representing them [Chen, 1988]. Using the corpus-based statistical approach, we have shown that it is possible to identify most Chinese surname-names without using an explicit syntactic or semantic representation.</Paragraph> <Paragraph position="14"> Most surnames consist of a single character, and some rare ones of two characters (single-surnames and double-surnames). Names can be either one or two characters (single-names and double-names). 
Some characters are used for names more often than others. Currently, there are more double-names than single-names in Taiwan.</Paragraph> <Paragraph position="15"> The formation of hypothesized surname-names is triggered by the recognition of a surname. In the example above, Liu is one of some 300 surnames.</Paragraph> <Paragraph position="16"> Subsequently, we take the one character and the two characters after the surname as probable given names, in this case Xian and Xian-Zhong. A general corpus G and a surname-name corpus N are used to evaluate the probability of a surname-name. For instance, the probability of the most common kind of 3-character name (single-surname/double-name), such as Liu Xian-Zhong, is: P(Liu Xian-Zhong) = P(single-surname/double-name in G) x P(Liu being a surname in N) x P(Xian being the 1st character of a name in N) x P(Zhong being the 2nd character of a name in N). Names of other combinations can be handled similarly.</Paragraph> <Paragraph position="17"> The Algorithm. To sum up, the whole process of word segmentation with surname-name identification is as follows:</Paragraph> </Section> <Section position="2" start_page="148" end_page="149" type="sub_section"> <SectionTitle> 3.2. Part-of-speech Tagging </SectionTitle> <Paragraph position="0"> As far as we know, little research has been done on part-of-speech tagging for Chinese [Chang et al., 1988; Chen, 1991; Bai and Xia, 1991; BDC, 1992]. For English, there are at least three independently developed taggers [Church, 1988; DeRose, 1988; Beale, 1988]. We started out using an electronic dictionary [Chen, 1991; Chang et al., 1988] with a very elaborate part-of-speech system based on Chao's work [Chao, 1968]. Because it is difficult to get sufficient manually tagged data for a large tag set, we have since switched to another electronic dictionary with some 90,000 entries and a much smaller tag set. 
The dictionary, a bilingual (Chinese-English) one developed by Behavior Design Corporation, is used to obtain the list of possible parts of speech for each segmented word. Currently, the collocation probabilities of parts of speech are estimated from a manually tagged text of about 4,000 words.</Paragraph> </Section> <Section position="3" start_page="149" end_page="149" type="sub_section"> <SectionTitle> 3.3. Finding Noun Phrases </SectionTitle> <Paragraph position="0"> Instead of using a full-blown parser to find noun phrases, we first mark the noun phrases in the same text of about 4,000 words and compute the statistical characteristics of the categorial patterns of noun phrases. We then use these statistics in a stochastic algorithm for finding noun phrases, in a manner similar to [Church, 1988; Ejerhed, 1990].</Paragraph> <Paragraph position="1"> Extracting keywords from a noun phrase is somewhat heuristic, unlike the rigorous approach in [Salton, 1988], which uses the syntactic structure within the noun phrase.</Paragraph> </Section> </Section> <Section position="3" start_page="149" end_page="149" type="metho"> <SectionTitle> 4. The Experimental Results </SectionTitle> <Paragraph position="0"> The algorithm described in Section 3 is currently under development. The programs are written in C and ProFox and run on an IBM PC compatible machine. The segmentation, tagging, and NP identification parts are completed, while the statistical analysis of the occurrence of NPs is being implemented. The statistics used in the system consist of four parts: (S1) Appearance counts of 40,032 distinct words from a corpus of 1,000,000 words of Chinese text [Liu et al., 1975].</Paragraph> <Paragraph position="1"> names.</Paragraph> <Paragraph position="2"> The performance of the completed parts of the system is as follows: The hit rate of word segmentation is about 97% on average. 
For surname-names alone, we get an average hit rate of 90%, which eliminates about 40% of the errors produced by our previous segmentation system. About 98% of parts of speech are tagged correctly, and about 95% of the noun phrases are found successfully.</Paragraph> </Section> <Section position="4" start_page="149" end_page="149" type="metho"> <SectionTitle> 5. Concluding Remarks </SectionTitle> <Paragraph position="0"> The preliminary results that we have obtained seem very promising. The approach presented here does not rely on a fully developed Chinese grammar for syntactic analysis at the sentential level. Thus the efficiency of system development and index generation is reasonable, and the cost of building and maintaining such a system is acceptable. Currently, we are working on (1) handling translated names, (2) improving the hit rate of tagging and NP identification by using a larger and more accurately tagged and marked training corpus, and (3) completing the statistical analysis of the occurrence of noun phrases.</Paragraph> </Section> </Paper>