<?xml version="1.0" standalone="yes"?> <Paper uid="W04-1122"> <Title>An Integrated Method for Chinese Unknown Word Extraction</Title> <Section position="4" start_page="1" end_page="1" type="metho"> <SectionTitle> 2 Related research works </SectionTitle> <Paragraph position="0"> Offline unknown word extraction can be treated as a special kind of Automatic Term Extraction (ATE).</Paragraph> <Paragraph position="1"> There has been much research on ATE, and most successful systems are based on statistics. Many statistical metrics have been proposed, including point-wise mutual information (MI) (Church et al, 1990), mean and variance, hypothesis testing (t-test, chi-square test, etc.), log-likelihood ratio (LR) (Dunning, 1993), statistical language models (Tomokiyo et al, 2003), and so on. Point-wise MI is often used to find interesting bigrams (collocations). However, it is actually better to think of MI as a measure of independence than of dependence (Manning et al, 1999).</Paragraph> <Paragraph position="2"> LR is one of the most stable methods for ATE so far, and it is more appropriate for sparse data than other metrics. However, LR is still biased towards pairs of frequent words that are rarely adjacent, such as the pair (the, the) (Pantel et al, 2001). Moreover, the MI and LR metrics are difficult to extend to the extraction of multi-word terms.</Paragraph> <Paragraph position="3"> The relative frequency ratio (RFR) of terms between two different corpora can also be used to discover domain-oriented multi-word terms that are characteristic of one corpus when compared with another (Damerau, 1993). In this paper, RFR values between the source corpus and a background corpus are used to rank the final candidate list.</Paragraph> <Paragraph position="4"> There are also many hybrid methods that combine statistical metrics with linguistic knowledge, such as Part-of-Speech filters (Smadja, 1994), but POS filters are not appropriate for Chinese term extraction. Since all term extraction approaches need to access all possible patterns and find their frequencies of occurrence, a highly efficient data structure based on the PAT-tree (Chien, 1997), (Chien, 1998) and (Thian et al, 1999) has been widely used for this purpose. However, the PAT-tree still has considerable space overhead and is very expensive to construct. We therefore introduce an alternative data structure, the suffix array, with much less space overhead, to perform this task.</Paragraph> <Paragraph position="5"> In this paper, we propose a four-phase offline unknown word extraction method (a minimal end-to-end sketch follows this list): (a) Construct the suffix arrays of the source text and the background corpus.</Paragraph> <Paragraph position="6"> In this phase, suffix arrays sorted on the left-side and right-side contexts of each occurrence of a Chinese character are constructed; we call them the Left-index and Right-index respectively. (b) Extract frequent n-gram candidate terms. In this phase, we first extract the n-grams that appear more than once in different contexts, according to the Left-index and Right-index of the source text, into a Left-list and a Right-list respectively. Then we combine the Left-list with the Right-list, and take the n-grams that appear in both as candidates (the C-list, for short). We also compute the frequency, context-entropy and relative frequency ratio against the background corpus for each candidate in this phase. (c) Filter the candidates in the C-list with context-entropy and with boundary-verification coupled with the General Purpose Word Segmentation System (GPWS) (Lou et al, 2001). In this phase, we segment each sentence of the source text in which a candidate appears with GPWS, and eliminate the candidates that cross word boundaries. (d) Output the final terms ranked on relative frequency ratios.</Paragraph>
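<Paragraph> A minimal end-to-end sketch of the four phases in Python follows. All function and variable names are ours, chosen for illustration, not from the original system; build_suffix_array, repeated_ngrams and rfr are sketched in the corresponding sections below, while passes_entropy_filter and boundary_vote_ok stand in for the filters of Section 4 and are hypothetical here.

def extract_unknown_words(source_text, background_text, lexicon, top_n=100):
    # Phase (a): suffix array sorted on right context; sorting the reversed
    # text gives the ordering on left context (the Left-index).
    right_index = build_suffix_array(source_text)
    left_index = build_suffix_array(source_text[::-1])
    # Phase (b): n-grams repeated under both orderings become candidates.
    right_list = repeated_ngrams(source_text, right_index)
    left_list = {s[::-1] for s in repeated_ngrams(source_text[::-1], left_index)}
    c_list = right_list & left_list
    # Phase (c): filter on context-entropy, then verify word boundaries
    # with a general-purpose segmentor (GPWS in the paper).
    kept = [t for t in c_list if passes_entropy_filter(t, source_text)]
    kept = [t for t in kept if boundary_vote_ok(t, source_text, lexicon)]
    # Phase (d): rank by relative frequency ratio against the background.
    kept.sort(key=lambda t: rfr(t, source_text, background_text), reverse=True)
    return kept[:top_n]
</Paragraph>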
<Paragraph position="7"> The remainder of this paper is organized as follows: Section 3 describes the candidate term extraction approach based on the suffix array. Section 4 describes the candidate filtering approach based on context-entropy and on boundary-verification coupled with GPWS. Section 5 describes the relative frequency ratio and the output of the final list. Section 6 gives our experimental results, and Section 7 gives conclusions and future work.</Paragraph> <Paragraph position="8"> 3 Candidates extraction on Suffix array The suffix array (also known as the string PAT-array) (Manber et al, 1993) is a compact data structure that handles arbitrary-length strings and supports powerful on-line string search operations such as the ones supported by the PAT-tree, but with less space overhead.</Paragraph> <Paragraph position="9"> Definition 1. Let X = x_1 x_2 ... x_n be a string of length n. For the sake of left and right context sorting, we extend X by inserting two unique terminators ($, less than all of the characters) as sentinel symbols at both ends, i.e. x_0 = x_{n+1} = $. The suffix array of X is the array of the starting positions of all suffixes of X, sorted lexicographically; the LCP (longest common prefix) arrays on both sides are auxiliary data structures for speeding up string search.</Paragraph> <Paragraph position="14"> Figure 1 shows a simple suffix array sorted on left and right context, coupled with the LCP arrays respectively.</Paragraph> <Paragraph position="15"> We apply the sorting algorithm proposed by (Manber et al, 1993), which takes O(n log n) time in the worst case, to construct the suffix arrays, and sort all the suffix strings in UNICODE order.</Paragraph> <Paragraph position="16"> Figure 2 shows fragments of the suffix arrays of the test corpus Xiao Ao Jiang Hu in readable form. Sorted suffix arrays cluster all similar n-grams (of arbitrary length) into contiguous blocks, and the frequent string patterns, as the longest common prefixes (LCP) of adjacent strings, can be extracted by scanning through the suffix arrays sorted on left and right context respectively.</Paragraph> <Paragraph position="17"> From the fragment of the right-sorted suffix array in Figure 2, which starts at the position of the Chinese character &quot;Dong &quot;, we can extract the repeated n-grams &quot;Dong Fang Bu Bai Bu &quot;, &quot;Dong Fang Bu Bai Wei &quot;, &quot;Dong Fang Bu Bai Zhi &quot;, &quot;Dong Fang Bu Bai Ye &quot;, &quot;Dong Fang Bu Bai Reng Shi &quot;, &quot;Dong Fang Bu Bai &quot;, etc., in turn, and skip many substrings, such as &quot;Dong Fang &quot; and &quot;Dong Fang Bu &quot;, because they are not the LCP of adjacent suffix strings and only appear inside the upper string &quot;Dong Fang Bu Bai &quot; in all of their occurrences. We can apply the same technique to the left-sorted fragment, which starts at the position of the Chinese character &quot;Bai &quot;, and extract &quot;Shang Dong Fang Bu Bai &quot;, &quot;Yu Dong Fang Bu Bai &quot;, &quot;Wei Dong Fang Bu Bai &quot;, &quot;Mo Jiao Jiao Zhu Dong Fang Bu Bai &quot;, &quot;Jiao Zhu Dong Fang Bu Bai &quot;, &quot;Dong Fang Bu Bai &quot;, etc., as repeated n-grams, and skip many substrings, such as &quot;Bu Bai &quot; and &quot;Fang Bu Bai &quot;, for the same reason. To extract candidate terms, we scan through both the left and right suffix arrays and select all repeated n-grams into the Left-list and Right-list respectively, as the following sketch illustrates.</Paragraph>
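<Paragraph> The scan just described can be sketched in Python as follows. This is a simplified illustration with names of our own choosing: the suffix array is built with the built-in sort for clarity rather than with the O(n log n) algorithm of (Manber et al, 1993).

def build_suffix_array(text):
    # Indices of all suffixes of text, sorted in UNICODE code-point order.
    return sorted(range(len(text)), key=lambda i: text[i:])

def lcp_len(a, b):
    # Length of the longest common prefix of strings a and b.
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def repeated_ngrams(text, suffix_array):
    # Each LCP of two adjacent sorted suffixes is a repeated n-gram whose
    # occurrences diverge immediately afterwards, i.e. it is followed by at
    # least two distinct characters. Substrings such as "Dong Fang", which
    # occur only inside the longer repeat "Dong Fang Bu Bai", never appear
    # as such an LCP and are skipped automatically, as described above.
    ngrams = set()
    for i in range(len(suffix_array) - 1):
        a = text[suffix_array[i]:]
        b = text[suffix_array[i + 1]:]
        k = lcp_len(a, b)
        if k > 0:
            ngrams.add(a[:k])
    return ngrams

Applying repeated_ngrams to the text gives the Right-list; applying it to the reversed text and reversing each result gives the Left-list. </Paragraph>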
<Paragraph position="18"> The terms that appear in both lists can be treated as candidates (denoted by C-list). The extraction procedure can be carried out efficiently, coupled with the arrays of LCP lengths on both sides, via stack operations; the length and frequency of each candidate can also be computed in this procedure. For example, in Figure 2 the term &quot;Dong Fang Bu Bai &quot; appears in both the Left-list and the Right-list, and it is a good candidate. Yet the n-gram &quot;Dong Fang Bu Bai Ye &quot; is not a candidate: even though it does appear in the Right-list, it does not exist in our final Left-list (it always appears as a substring of its direct upper string &quot;Lian Dong Fang Bu Bai Ye &quot;, according to the right part of Figure 2).</Paragraph> </Section> <Section position="5" start_page="1" end_page="1" type="metho"> <SectionTitle> 4 Filter candidate terms </SectionTitle> <Paragraph position="0"> As shown in Table 1, not all the terms in the C-list extracted in Section 3 can be treated as significant terms, because of their incomplete lexical boundaries. There are two kinds of incomplete-boundary terms: (1) terms that are substrings of significant terms; (2) terms overlapping the boundaries of adjacent significant terms. In this section, we take measures, including a context-entropy test and boundary-verification coupled with a common segmentor (GPWS) and a general lexicon, to eliminate these invalid candidates respectively.</Paragraph> <Section position="1" start_page="1" end_page="1" type="sub_section"> <SectionTitle> 4.1 Measure on Context-entropy </SectionTitle> <Paragraph position="0"> According to our investigation, significant terms in a specific collection of texts are used frequently and in different contexts. On the other hand, a substring of a significant term occurs almost exclusively inside its corresponding upper string (that is, in a fixed context), even though it may occur frequently. In this part, we propose a metric, context-entropy, as a measure of this feature, and use it to filter out substrings of significant terms.</Paragraph> <Paragraph position="1"> Definition 2. Assume o is a candidate term in X, and let LC(o) and RC(o) be the sets of characters that appear immediately to the left and to the right of the occurrences of o in X. The left and right context-entropies of o in X are defined as LE(o) = - SUM_{a in LC(o)} p(a|o) log p(a|o) and RE(o) = - SUM_{b in RC(o)} p(b|o) log p(b|o), where p(a|o) = f(ao, X)/f(o, X), p(b|o) = f(ob, X)/f(o, X), and f(., X) denotes frequency of occurrence in X.</Paragraph> <Paragraph position="4"> Significant terms, which can be used in different contexts, get high values of context-entropy on both sides, while substrings that emerge almost exclusively because of their upper strings get comparatively low values. The 3rd and 4th columns of Table 1 show the values of context-entropy on both sides for a list of candidate terms. Many candidates that emerge almost exclusively because of their direct upper strings, such as &quot;Wo Xing &quot; (in &quot;Ren Wo Xing &quot; (person name)), &quot;Ren Wo &quot; (in &quot;Ren Wo Xing &quot; (person name)) and &quot;Wu Yue Jian &quot; (in &quot;Wu Yue Jian Pai &quot; (organization name)), appear in relatively fixed contexts and get much lower values of context-entropy on one or both sides.</Paragraph>
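<Paragraph> A sketch of this measure in Python, following Definition 2 (the function name is ours; for simplicity the context is taken to be the single adjacent character on each side):

import math
from collections import Counter

def context_entropy(term, text):
    # Count the characters immediately to the left and to the right of
    # every occurrence of term in text, then compute the entropy of each
    # context distribution.
    left, right = Counter(), Counter()
    pos = text.find(term)
    while pos != -1:
        if pos > 0:
            left[text[pos - 1]] += 1
        end = pos + len(term)
        if end < len(text):
            right[text[end]] += 1
        pos = text.find(term, pos + 1)

    def entropy(ctx):
        total = sum(ctx.values())
        return -sum(c / total * math.log(c / total) for c in ctx.values())

    return entropy(left), entropy(right)

A candidate such as &quot;Wo Xing &quot;, which is almost always preceded by &quot;Ren &quot;, gets a left entropy near zero, so thresholding the smaller of the two values filters out substrings of significant terms; the concrete threshold is not given in the paper. </Paragraph>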
</Section> <Section position="2" start_page="1" end_page="1" type="sub_section"> <SectionTitle> 4.2 Boundary-verification with GPWS </SectionTitle> <Paragraph position="0"> The candidate list includes all of the n-grams that appear in different contexts on both sides more than once. A unique feature of the Chinese writing system, that there are no delimiters between words, poses a big problem: many candidate terms are invalid because they overlap the boundaries of factual words, i.e. these candidates consist of fragments of adjacent words, such as &quot;Shan Pai &quot; (overlapping the boundary of the common word &quot;Hua Shan &quot; (Hua Mountain)) and &quot;Ling Hu Gong &quot; (overlapping the boundary of the common word &quot;Gong Zi &quot; (Sir)), as listed in Table 2. We eliminate these candidates by verifying their boundaries with a common segmentor (GPWS (Lou et al, 2001)) and a general lexicon (of 243,539 words).</Paragraph> <Paragraph position="1"> GPWS was built as a shared framework serving different Chinese information processing (CIP) applications. It has achieved very good performance and great adaptability across different application domains in disambiguation and in the identification of proper nouns (including Chinese person names, Chinese place names, transliterated foreign names, and organization and company names), high-frequency suffix phrases and numbers. In this part, we only use the utilities of GPWS to perform Maximum Match (MM) segmentation, which finds the boundaries of words in the lexicon; all unknown words (outside our lexicon) are thereby segmented into pieces. Coupled with GPWS, we propose a voting mechanism for boundary-verification as follows:

For each candidate term in C-list as term Begin
  Declare falseNum as integer for the number of invalid boundary-checks of term;
  Declare trueNum as integer for the number of valid boundary-checks of term;
  For each sentence, in which term appears, in the foreground corpus, as sent Begin
    Segment sent with GPWS;
    If the boundaries of term coincide with word boundaries in the segmentation of sent
    Then increase trueNum by 1;
    Else increase falseNum by 1;
  End
  If falseNum > trueNum Then eliminate term from C-list;
End

Assisted by the segmentor, we eliminate 38,697 of the 117,807 items in the C-list with a precision of 96.85%. Table 2 shows examples of candidates eliminated by boundary-verification with GPWS.</Paragraph> </Section> </Section> <Section position="6" start_page="1" end_page="1" type="metho"> <SectionTitle> 5 Relative frequency ratio against background corpus </SectionTitle> <Paragraph position="0"> The relative frequency ratio (RFR) is a useful measure for discovering characteristic linguistic phenomena of one corpus when compared with another (Damerau, 1993). The RFR of a term o in corpus X compared with another corpus Y, RFR(o; X, Y), simply compares the frequency of o in X (denoted f(o, X)) with that of o in Y (denoted f(o, Y)): RFR(o; X, Y) = f(o, X) / f(o, Y). RFR is based on the fact that significant terms appear frequently in a specific collection of texts (treated as the foreground corpus) but rarely, or not at all, in another quite different corpus (treated as the background corpus). The higher the RFR value of a term, the more informative the term is in the foreground corpus compared with the background one.</Paragraph> <Paragraph position="1"> However, the selection of the background corpus is an important problem. The degree of difference between foreground and background corpora is rather difficult to measure, and it affects the RFR values of terms. Commonly, a large and general corpus is treated as the background corpus for comparison. In this paper, for our foreground corpus (Xiao Ao Jiang Hu), we empirically select a group of novels by the same author, excluding Xiao Ao Jiang Hu, as the background corpus, for the following reasons: (a) The same author wrote all of the novels, foreground and background, so n-grams unique to the author's writing style will not stand out in the RFR values.</Paragraph> <Paragraph position="2"> (b) All of the novels are in the same category, so n-grams specific to this category will not stand out in the RFR values.</Paragraph>
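<Paragraph> A sketch of the RFR computation and the final ranking in Python (the normalization by corpus length and the +1 smoothing for terms unseen in the background corpus are our assumptions; the paper's formula uses raw frequencies):

def rfr(term, foreground, background):
    # RFR(term; X, Y) = f(term, X) / f(term, Y), here with counts
    # normalized by corpus size, and +1 on the background count so that
    # terms absent from the background keep a finite, very high ratio.
    f_fg = foreground.count(term) / len(foreground)
    f_bg = (background.count(term) + 1) / len(background)
    return f_fg / f_bg

def rank_by_rfr(c_list, foreground, background):
    # Phase (d): sort the filtered candidates in descending RFR order.
    return sorted(c_list, key=lambda t: rfr(t, foreground, background),
                  reverse=True)
</Paragraph>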
<Paragraph position="3"> Thus, most of the candidate terms with higher RFR values are more informative and more significant for the source novel.</Paragraph> <Paragraph position="4"> In the final phase, we sort all of the filtered candidate terms by RFR value in descending order, so that the front part of the final list achieves high extraction precision.</Paragraph> <Paragraph position="5"> The last column of Table 1 shows the RFR values of several candidates compared with our background corpus. Candidates such as &quot;Liao [?] &quot; and &quot;Ye Bu &quot;, which are frequent in both the foreground and background corpora, get much lower RFR values and are eliminated from our final top list.</Paragraph> </Section> <Section position="7" start_page="1" end_page="1" type="metho"> <SectionTitle> 6 Experimental result </SectionTitle> <Paragraph position="0"> We use the novel Xiao Ao Jiang Hu as the foreground corpus and the rest of the novels of Mr. JIN Yong as the background corpus; they total 983,134 and 7,551,555 characters respectively. Five graduate students read through the novel Xiao Ao Jiang Hu and manually selected 515 new terms (outside our lexicon) with exact meanings in the novel for the final test, as follows: (a) Proper nouns, such as person names &quot;Ling Hu Chong &quot;, &quot;Dong Fang Bu Bai &quot;, &quot;Ling Hu Da Ge &quot;; place names &quot;Hei Mu Ya &quot;, &quot;Si Guo Ya &quot;, &quot;Heng Shan Bie Yuan &quot;; and organization names &quot;Ri Yue Shen Jiao &quot;, &quot;Wu Yue Jian Pai &quot;, etc. (b) Normal nouns, such as &quot;Pi Xie Jian Pu &quot;, &quot;Xi Xing Da Fa &quot;, etc.</Paragraph> <Paragraph position="1"> (c) Others, such as &quot;Ju Dou &quot;, &quot;Liang Bu &quot;, etc. By our method, we extract 117,807 candidates from this novel. Table 3 shows the results after filtering with context-entropy on both sides and boundary-verification, at different total extracted numbers. We also compared our integrated method with the traditional LR measure. At lower total extracted numbers, LR surpasses our method in unknown-word recall, but is in turn surpassed by our method at higher numbers. As for precision, our method always stays ahead.</Paragraph> <Paragraph position="2"> We also notice that both methods have rather low extraction precision. To retrieve terms with more certainty, we rank the entire final list by RFR value in the final phase, so that the most significant terms come at the front of the ranked list.</Paragraph> <Paragraph position="3"> Table 4 shows the top 12 terms of the final list, and Figure 3 shows the performance of our method at different top levels when the final list is ranked by RFR values.</Paragraph> </Section> </Paper>