<?xml version="1.0" standalone="yes"?> <Paper uid="C96-1039"> <Title>Identification and Classification of Proper Nouns in Chinese Texts</Title> <Section position="2" start_page="0" end_page="226" type="intro"> <SectionTitle> 1. Introduction </SectionTitle> <Paragraph position="0"> A Chinese sentence is composed of a string of characters without any word boundaries, so segmenting Chinese sentences is indispensable in Chinese language processing (Chen, 1990; Chen, 1994). Many word segmentation techniques (Chen & Liu, 1992; Chiang et al., 1992; Sproat & Shih, 1990) have been developed. However, the resolution of unknown words, i.e., those words not in the dictionaries, forms the bottleneck. Some papers (Fung & Wu, 1994; Wang et al., 1994) based on Smadja's paradigm (1993) learned an aided dictionary from a corpus to reduce the possibility of unknown words. Chang et al. (1992) proposed a method to extract Chinese personal names from an 11,000-word corpus, and reported 91.87% precision and 80.67% recall. Wang et al. (1992) recognized unregistered names on the basis of titles and a surname-driven rule. Lin et al. (1993) presented a model to tackle a very restrictive form of unknown words. Sproat et al. (1994) considered Chinese personal names and transliterations of foreign words. Their performance was 61.83% precision and 80.99% recall on a 12,000-Chinese-character corpus.</Paragraph> <Paragraph position="1"> This paper deals with three kinds of proper nouns: Chinese personal names, transliterated personal names and organization names. We not only tell whether an unknown word is a proper noun, but also assign it a suitable semantic feature. In other words, '~?~4~ ~' (George Bush) will have a feature of male transliterated personal name when it is identified. Such a rigid treatment will be helpful for further applications such as anaphora resolution (Chen, 1992), sentence alignment (Chen & Chen, 1994; Chen & Wu, 1995), etc. 
Section 2 describes the training corpora and the testing corpus we used. Sections 3, 4 and 5 propose the identification and classification methods for Chinese personal names, transliterated personal names and organization names, respectively. Section 6 presents two applications. Section 7 gives concluding remarks.</Paragraph> <Paragraph position="2"> 2. Training Corpora and Testing Corpus The proposed methods in this paper integrate rule-based and statistics-based models, so training corpora are needed. To test the performance of language models, a good testing corpus is also necessary. This section introduces all the corpora that are used in the following sections. The NTU balanced corpus, which follows the standard of the LOB corpus (Johansson, 1986), is the first training corpus. It is segmented by a word segmentation system and then checked manually. In total, this corpus has 113,647 words and 191,173 characters.</Paragraph> <Paragraph position="3"> The second training corpus is extracted from three newspaper corpora (China Times, Liberty Times News and United Daily News). It is segmented by a word segmentation system only, without manual checking. Although segmentation errors may exist, this corpus is 23.2 times larger than the NTU balanced corpus, so we can obtain more reliable word association pairs.</Paragraph> <Paragraph position="4"> The third training corpus is a transliterated personal name corpus. There are 2,692 transliterated personal names, including 1,414 male names and 1,278 female names. These transliterated personal names are selected from the book &quot;English Names For You&quot; (Huang, 1992). The last training corpus is a Chinese personal name corpus. It has 219,738 Chinese personal names and 661,512 characters.</Paragraph> <Paragraph position="5"> Finally, the testing corpus is introduced. We randomly select six different sections from a newspaper corpus (Liberty Times News). The contents are different from the second training corpus. 
The following shows the statistics of the testing corpus: (a) the political section: There are many items of news about the legislature. It includes 23,695 words and 36,059 characters.</Paragraph> <Paragraph position="6"> (b) the social section: There are many items of news about police and offenders. It includes 61,846 words and 90,011 characters.</Paragraph> <Paragraph position="7"> (c) the entertainment section: There are many items of news about TV stars, programs, and so on. It includes 38,234 words and 55,459 characters.</Paragraph> <Paragraph position="8"> (d) the international section: It contains many items of foreign news and has 19,049 words and 29,331 characters.</Paragraph> <Paragraph position="9"> (e) the economic section: Many items of news about the stock market, money, and so on are recorded. It includes 39,008 words and 54,124 characters.</Paragraph> <Paragraph position="10"> (f) the sports section: All items of news concern sports. It includes 36,971 words and 54,124 characters.</Paragraph> <Paragraph position="11"> Every section has its own characteristics. In the political section, there are many titles. In the social section and the entertainment section, there are many Chinese personal names and organization names. In the international section, transliterated personal names outnumber the other two kinds. In the economic section, stock companies often appear. In the sports section, there are many Chinese personal names and transliterated personal names. Because the proper nouns are usually segmented into single characters, they will interfere with one another during identification and classification.</Paragraph> <Section position="1" start_page="222" end_page="222" type="sub_section"> <SectionTitle> 3. Chinese Personal Names 3.1 Structure of Personal Names </SectionTitle> <Paragraph position="0"> Chinese personal names are composed of surnames and names. 
Most Chinese surnames are a single character; a few rare ones are two characters.</Paragraph> <Paragraph position="1"> The following shows three different types: (a) Single character like '~', }~', '~' and '-9:-'. (b) Two characters like '~\[~' and' k'(('.</Paragraph> <Paragraph position="2"> (c) Two surnames together like 'JI,~'.</Paragraph> <Paragraph position="3"> Most names are two characters and a few rare ones are one character. Theoretically, any character can serve as a name; names are not drawn from a fixed set. Thus the length of Chinese personal names ranges from 2 to 6 characters.</Paragraph> </Section> <Section position="2" start_page="222" end_page="222" type="sub_section"> <SectionTitle> 3.2 Strategies 3.2.1 Segmentation before Identification </SectionTitle> <Paragraph position="0"> Input text has to be segmented roughly beforehand.</Paragraph> <Paragraph position="1"> This is because, without pre-segmentation, many characters have a high probability of being part of a Chinese personal name. Consider the example '~!,)~l~,@~l+j (@~l~...'. The character '(i@' has a high score as a surname, and in that context '~)~' is easily taken as a name. If the input text is not segmented beforehand, it is easy to regard '(q~.~J~' as a Chinese personal name. In a purely statistical model this type of error is difficult to avoid, but pre-segmentation catches it easily.</Paragraph> </Section> <Section position="3" start_page="222" end_page="224" type="sub_section"> <SectionTitle> 3.2.2 Variation of a Character </SectionTitle> <Paragraph position="0"> How to calculate the score of a candidate is an important issue in this identification system. The paper (Chang et al., 1992) proposes the following formula:</Paragraph> <Paragraph position="2"> This formula has a drawback, i.e., it does not consider the probability that a character belongs to other words rather than being a surname. Take the two characters '{,~,~' and 'llil\[' as an example. The character '{'~j.' 
can form '{~', '1&quot;,1/~ ', 'l&quot;~/-iiTfi ', and many other words. In contrast, the character' I@i' forms only one word, '@\[1~', which is rare.</Paragraph> <Paragraph position="3"> The difference shows that the former is more likely than the latter to be used in other words. The above formula assigns the same score to '@/:-~' and '@\[(-~', when '{'0C/' and '111~' have the same frequency as names. Intuitively, '{~lj.~'-'.' does not look like a name, but 'ItlJU~.' does. Thus 'tliI(' should have a higher score than '{~l', and the variation of a character should be considered in the formula. In our model, the variation of characters is learned from the NTU balanced corpus.</Paragraph> <Paragraph position="4"> Equation (2) defines the original formula. The formula used to calculate P(Ci) is similar to Equation (1). When the variation of a character is considered, Equation (3) is formulated. The variation of a character is measured by the inverse of the frequency with which the character appears in other words. Equation (4) is simplified from Equation (3).</Paragraph> <Paragraph position="5"> where Ci is a character in the input sentence, P(Ci) is the probability that Ci is a surname or a name, #Ci is the frequency with which Ci is a surname or a name, and &Ci is the frequency with which Ci is contained in other words.</Paragraph> <Paragraph position="6"> For different types of surnames, different models are adopted.</Paragraph> <Paragraph position="7"> Because surnames with two characters are always surnames, Model (b) neglects the score of the surname part. Models (a) and (c) have two score functions. This avoids the problem of a very high surname score. Consider the string '1~ ;'it( &quot;)j~J ' \[&quot;.4:J (J~ @<-&quot;. Because of the high scores of the characters '1~' and 'S,', '\[~N~} f',' f:, -NI', ' t:,~:J - 2&quot; and 'J{~ -~' may be identified according to Equation (5). Equation (6) screens out the impossible candidates. 
The above three models can be extended to single-character names. When a candidate cannot pass the threshold, its last character is cut off and the remaining part is tried again. The threshold is different from the original one. Thresholds are trained from the Chinese personal name corpus. We calculate the score of every Chinese personal name in the corpus using the above formulas. The scores for each formula are sorted, and the score that lies below 99% of the personal names is taken as the threshold for this formula. That is, 99% of the training data can pass the threshold.</Paragraph> <Paragraph position="8"> Text provides many useful clues at three different levels: character, sentence and paragraph. The baseline model forms the first level, i.e., the character level. The following subsections present other clues. Of these, gender is also a clue at the character level; title, mutual information and punctuation marks come from the sentence level; the paragraph information is recorded in a cache.</Paragraph> <Paragraph position="9"> 3.2.4.1 Clue 1: Title The first is title. Wang et al. (1992) propose a model based on titles. When a title appears before (after) a candidate, the candidate is probably a personal name. For example, '~.-~)NI~, ' and ' ~,~ ~)'~ :~, ~. ~\]('. However, there are many counterexamples, e.g., ~!~,~ \[ J ' ~'~ ' ; (~ ~&quot;-4 :~'~ ~ ' f'J &quot; s': Y'.,. Thus we cannot be sure whether the characters surrounding a title form a personal name. Even so, title is still a useful clue. It can help determine the boundary of a name. In the example '~.J~J!~'~;~;:i-qj...',' ~-IDJ!SJ'~I~, '' is identified incorrectly. When a title is included in this example, i.e., '~JDJ!~!I~',;I~'~;IJ... ', the error does not occur. In summary, if a title appears, a special bonus is given to the candidate. 3.2.4.2 Clue 2: Mutual Information Chinese personal names are not always composed of single characters. 
For example, the name part of the sentence 'l~i~rl)Jv~Jlai~lc'/;~'r~';jtlfj' is a word. Telling whether a word is a content word or a name is indispensable. Mutual information (Church & Hanks, 1990) provides a measure of word association. The words surrounding a word candidate are checked. When there exists a strong relationship, the word candidate has a high probability of being a content word. In the example '1~ ~!J:~l'/ ' ,~<~Jl~?_~.,.', the two words &quot;C I!!:&quot; and '~.~ \[tl' have high mutual information, so 'lI~, i 1~' is not a personal name. Three newspaper corpora (total size is about 2.6 million words) are used to train the word association.</Paragraph> <Paragraph position="10"> 3.2.4.3 Clue 3: Punctuation Marks Personal names usually appear at the head or the tail of a sentence. A candidate is given an extra bonus when it is found in one of these two places. Candidates surrounding the caesura mark, a Chinese-specific punctuation mark, are treated in a similar way. If some words around this punctuation mark are personal names, the others are given a bonus.</Paragraph> <Paragraph position="11"> 3.2.4.4 Clue 4: Gender There is a special custom in Chinese. A married woman may place her husband's surname before her own surname. This forms the type 3 personal name mentioned in Section 3.1 (two surnames together). Because a surname may also be used as a name, e.g., '7/' in the personal name ~'~'/~Jt~ and in the personal name ,~,~r/,-, v,. c~ ~,,~,,, the candidates with two possible surnames do not always belong to type 3. The gender information, i.e., that a type 3 name always belongs to a female, helps us disambiguate the type of personal names. Some Chinese characters have high scores for male names and some for female names. The following lists some typical examples: male: ~, ,~, ~,&quot;, ~t~,-1t(, ~j~, JC/J(~, 9~i, JI(, (~l,)~; female: ~i~, J~, I~, ~l l, ~:~, ~;, {}::, '{~L ~Y, )J:, -+-/ We count how often the characters of a name occur in male and in female names, and compare these two scores. 
If the former is larger than the latter, then it is a masculine name. Otherwise, it is a feminine name. 3.2.4.5 Clue 5: Cache A personal name may appear more than once in a paragraph. This phenomenon is useful during identification. We use a cache to store the identified candidates, and reset the cache when the next paragraph is considered. Four cases arise when the cache is used: (a) C1C2C3 and C1C2C4 are in the cache, and C1C2 is correct.</Paragraph> <Paragraph position="12"> (b) C1C2C3 and C1C2C4 are in the cache, and both are correct.</Paragraph> <Paragraph position="13"> (c) C1C2C3 and C1C2 are in the cache, and C1C2C3 is correct.</Paragraph> <Paragraph position="14"> (d) C1C2C3 and C1C2 are in the cache, and C1C2 is correct.</Paragraph> <Paragraph position="15"> Here Ci denotes a Chinese character. It is obvious that case (a) contradicts case (b). Consider the string ' 5J~J!i}'i;~,'J,)-}:jl','lif~:'j: ~l;~'. A personal name '~! ID\]J?;;\[~, '' is recognized. When another string '}<-I~lJl})~. ~lt/Jxq':C/lJI,.';'_~')'l'lili~,t~' is input, '~,~llllfA)~' and 'Tlldx>l &quot;:' are identified. Then we find that the two strings '/}'i.J~\]}~)~, ' and ' +'iJ~\]J'.~;a:'lY/' are similar. Here case (a) is correct. However, case (b) also appears very often in newspapers. For example, 'l~lL,kT~ &quot; J~l~J'gfJ~iH~ hi I, ~3...L Two personal names, 'li\[\]/k:~'~/ and 'lT\[~\]Kgt?' are identified. In examples like '~.~t~\[<gif~ &quot; ~...' and '... ~.~ o ', two candidates '~.~;{~t~, ' and' }'#~' will be identified. These belong to case (d). Consider the last examples '11',(,1 i*l 1~,1:<..' and '~,(, i~l l~g, rii:j.dz... '. Two candidates '~,({~ i' and '~',(,~&quot; i~l I' will be identified too. Now, case (c) is adopted.</Paragraph> <Paragraph position="16"> In our treatment, a weight is assigned to each entry in the cache. An entry which has a clear right boundary has a high weight. Title and punctuation are clues for the boundary. 
For those similar pairs which have different weights, the entry having the higher weight is selected. If both have high weights, both are chosen. When both have low weights, the score of the second character of the name part is critical. It determines whether the character is kept or deleted.</Paragraph> </Section> <Section position="4" start_page="224" end_page="225" type="sub_section"> <SectionTitle> 3.3 Experiments and Discussions </SectionTitle> <Paragraph position="0"> Table 1 summarizes the identification results of Chinese personal names. The total number of Chinese personal names in each section is listed in Column 2.</Paragraph> <Paragraph position="1"> Column 3 shows the precision and the recall of the baseline model. The overall performance is good except for section 4 (the international section) and section 5 (the economic section). The remaining columns demonstrate the change of performance after the clues discussed in Section 3.2 are considered incrementally. If the name part of a candidate is a word, word association is used to measure the relationship between the surrounding words. The increase of the precision in Column 4 verifies this idea. Theoretically, it should not decrease the recall. After checking the result, we find that some unreasonable word associations come from the training corpus. Recall that it is generated by a rough word segmentation system without manual checking. The next clue is punctuation.</Paragraph> <Paragraph position="2"> The idea is that candidates at the beginning or at the end of sentences are more likely to be personal names than candidates in other places. It helps some candidates with lower scores to pass the threshold, but it cannot prevent incorrect candidates from passing the threshold. Thus, the performance fluctuates. Then, title is considered. The increase of the recall shows that title works well. But it decreases the precision too. From the variation of the performance, we know that cache is powerful. 
Both the recall and the precision increase.</Paragraph> <Paragraph position="3"> Finally, gender is added. It is used when two successive characters are candidates of surnames.</Paragraph> <Paragraph position="4"> In other words, it focuses on type 3 personal names.</Paragraph> <Paragraph position="5"> Almost all type 3 personal names are identified correctly. Because this type of personal name is rare in the testing newspaper corpus, the variation is not large. Table 1 shows that our model is good except for section 4 and section 5. There are many proper nouns in the international section, and almost none of them are included in the dictionary.</Paragraph> <Paragraph position="6"> All unknown words disturb one another in segmentation. For example, ' q!ilIC/~J~0,' is a country name. It is divided into three single characters by our word segmentation system. From the viewpoint of personal name identification, it is easy to regard ' lI,&quot;r~J~ff as a Chinese personal name. Another source of errors is foreign names. Some of them are similar to Chinese personal names, e.g., '~'~!Ui}/i:' and ' <t&jE'.</Paragraph> <Paragraph position="7"> A similar problem occurs in the economic section.</Paragraph> <Paragraph position="8"> There are many company names, and some of them are similar to Chinese personal names. The company name '~\]~\]i)t' is a typical example. In summary, there are three major types of error. The first is foreign names: they are identified as proper nouns correctly, but are assigned wrong features. About 20% of errors belong to this type. The second type results from rare surnames, which are not included in the surname table. Some rare surnames are not real surnames. They are just artists' stage names. Nearly 14% of errors come from this type. The other errors include place names, organization names, and so on.</Paragraph> <Paragraph position="9"> 4. 
Transliterated Personal Names</Paragraph> </Section> <Section position="5" start_page="225" end_page="225" type="sub_section"> <SectionTitle> 4.1 Structure of Personal Names </SectionTitle> <Paragraph position="0"> Compared with the identification of Chinese personal names, the identification of transliterated personal names has the following difficulties: (a) No specific clue like surnames in Chinese personal names to trigger the identification system.</Paragraph> <Paragraph position="1"> (b) No restriction on the length of a transliterated personal name. It may be composed of a single character or more, e.g., '5%', ' ~'i'\]d', &quot;~'i ~Y', '~.f ~':'}~-' and 'd\[if,~ ~\]~1\]' (c) No large-scale transliterated personal name corpus.</Paragraph> <Paragraph position="2"> (d) Ambiguity in classification.</Paragraph> <Paragraph position="3"> For example, 'J~)~' may denote a city or a former American president.</Paragraph> </Section> <Section position="6" start_page="225" end_page="225" type="sub_section"> <SectionTitle> 4.2 Strategies </SectionTitle> <Paragraph position="0"> Almost all foreign names are rendered by transliteration, not by translation, and the basis of transliteration is the pronunciation of the foreign name. Pronunciation is composed of syllables and tones. The major difference in pronunciation between Chinese and English lies in the syllables. The style of syllabic order is specific in transliteration. Consider an example.</Paragraph> <Paragraph position="1"> The transliterated personal name ,~\]z~}. has syllables 'Y ~',7-- ~zv T-- ((~'. Such a syllabic order is rare in Chinese, but is not unusual for a transliterated string. In other words, the syllabic orders of transliterated strings and general Chinese strings are not similar. Besides, a transliterated name consists of a string of single characters after segmentation. That is, these characters cannot be combined into words. 
However, the unrestricted length of transliterated names and the homophones in Chinese create the need for a very large training corpus. The following sections show how to modify the basic idea if a large-scale corpus is not available.</Paragraph> <Paragraph position="2"> When a foreign name is transliterated, the selection of homophones is restrictive. Consider an example shown below: Richard Macs ~ll\[~l)~ ~,~, ,~ The strings following the English names have the same pronunciations. The first is usually adopted, and the second is never used. This shows that the characters used in transliteration are selected from a particular character set. In our model, a total of 483 characters are learned from our transliterated personal name corpus. They play a role similar to that of surnames in the identification of Chinese personal names. If all the characters in a string belong to this set, i.e., they satisfy the character condition, the string is regarded as a candidate.</Paragraph> <Paragraph position="3"> Because of the unrestricted length of transliterated names, how to identify their boundary is a problem. Of course, the titles and punctuation used in the last section can be adopted too. But they do not always appear in the text. Thus another clue should be found.</Paragraph> <Paragraph position="4"> Syllable order may be a clue. Examples like ' ~r~', '~'J,,' and ' ~', which meet the character condition, do not look like transliterated names because their pronunciations are not like foreign names. If there is a large enough transliterated name corpus, the syllable orders can be learned.</Paragraph> <Paragraph position="5"> However, our transliterated corpus contains only 2,692 personal names. Thus only the first and the last characters are considered. For each candidate, we check the syllable of the first (the last) character. If the syllable does not belong to the training corpus, the character is deleted. 
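The trimming step just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function name trim_candidate, the lookup table syllable_of, and the sets first_syllables and last_syllables (syllables seen in the first and in the last position of training names) are hypothetical, and the three-character example uses made-up romanized syllables.

```python
def trim_candidate(chars, syllable_of, first_syllables, last_syllables):
    """Drop edge characters whose syllable is unattested at that edge.

    All names here are hypothetical; the source only says that a first
    (last) character is deleted when its syllable does not occur in the
    training corpus, and that the remainder is treated the same way.
    """
    start, end = 0, len(chars)
    # Delete leading characters whose syllable never starts a training name.
    while end > start and syllable_of.get(chars[start]) not in first_syllables:
        start += 1
    # Delete trailing characters whose syllable never ends a training name.
    while end > start and syllable_of.get(chars[end - 1]) not in last_syllables:
        end -= 1
    return chars[start:end]


# Toy data, assumed for illustration only.
syllable_of = {"的": "de", "布": "bu", "希": "xi"}
trimmed = trim_candidate("的布希", syllable_of, {"bu"}, {"xi"})
```

With the toy data, the leading character is removed because its syllable is not in the assumed set of first-position syllables, and the two remaining characters are kept as the candidate.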
The remaining characters are treated in a similar way.</Paragraph> <Paragraph position="6"> As mentioned in Section 3.2.3, the frequency with which a character is part of a personal name is important information. The concept may be used here. However, only a large-scale transliterated personal name corpus can give reliable statistical data. Given our small training corpus, the range of application of this information should be narrowed down. We apply it only to candidates of length 2. This is because it is easy to satisfy the character condition for candidates of the shortest length. For each candidate which has only two characters, we compute the frequency of these two characters to see if it is larger than a threshold. If it is not, the candidate is eliminated. The threshold is determined in a way similar to Section 3.2.3.</Paragraph> </Section> <Section position="7" start_page="225" end_page="226" type="sub_section"> <SectionTitle> 4.3 Experiments and Discussions </SectionTitle> <Paragraph position="0"> The identification system scans a segmented sentence from left to right. It finds the character string that meets the character condition, syllable condition and frequency condition. Table 2 shows that the precision and the recall are both good for sections 3 and 4, i.e., the entertainment and the international sections. However, sections 2 and 5 (the social and the economic sections) have bad precision. The average recall tells us that the trigger to the identification system is useful. The reasons why the recall is not good enough are: some transliterated personal names (e.g., 'C/,~Oi~'j ~' and '~ D{$~') look like Chinese personal names, and the identification of Chinese personal names is done before that of transliterated personal names.</Paragraph> <Paragraph position="1"> Although they are correctly identified as personal names, they are assigned wrong features. 
Similarly, transliterated nouns like popular brands of automobiles ('7\[~\[!~,'i:' and &quot;l\[~,)l~|~j'), Chinese proper nouns (' ~I\] ~' ' ~.J:x)~' and 'C/t'~ II~') and Chinese personal names ('~ I:~l\]') look like transliterated personal names. That decreases the precision.</Paragraph> <Paragraph position="2"> Besides these types of nouns, boundary errors affect the precision too. To separate out the errors due to classification, we ran another experiment. If the identified results are not classified, the average precision is 81.46% and the average recall is 91.22%.</Paragraph> </Section> </Section> </Paper>