<?xml version="1.0" standalone="yes"?>
<Paper uid="W02-1817">
  <Title>Automatic Recognition of Chinese Unknown Words Based on Roles Tagging</Title>
  <Section position="1" start_page="2" end_page="2" type="abstr">
    <SectionTitle>
Abstract
</SectionTitle>
    <Paragraph position="0"> This paper presents a unified solution, based on the idea of &amp;quot;roles tagging&amp;quot;, to the complicated problem of Chinese unknown word recognition. In our approach, an unknown word is identified according to its component tokens and context tokens. To capture the functions of tokens, we use the concept of roles. Roles are tagged by applying the Viterbi algorithm in the fashion of a POS tagger. In the resulting most probable role sequence, all eligible unknown words are recognized through maximum pattern matching. We achieve excellent precision and recall rates, especially for person names and transliterations. The results and experiments with our system ICTCLAS show that our approach based on roles tagging is simple yet effective.</Paragraph>
    <Paragraph position="1"> Keywords: Chinese unknown words recognition, roles tagging, word segmentation, Viterbi algorithm.</Paragraph>
    <Paragraph position="2"> Introduction It is well known that word segmentation is a prerequisite to Chinese information processing.</Paragraph>
    <Paragraph position="3"> Previous research on word segmentation has made great progress. However, performance on unknown words remains unsatisfactory. In general, any lexicon is limited and cannot cover all the words in real texts or speech. According to our statistics on a 2,305,896-character news corpus from the People's Daily, about 1.19% of the words are unknown. These words are difficult to recall, and they often greatly reduce the recognition rate of known words close to them. For example, the sentence &amp;quot; Bu Chang Sun Jia Zheng Zai Gong Zuo . &amp;quot; (Pronunciation: &amp;quot;Bu Zhang Sun Jia Zheng Zai Gong Zuo.&amp;quot;) has two valid segmentations: &amp;quot;Bu Chang / Sun Jia Zheng / Zai /Gong Zuo &amp;quot; (The minister Sun Jiazheng is at work) and &amp;quot;Bu Chang /Sun Jia / Zheng Zai /Gong Zuo &amp;quot; (The minister Sun Jia now is at work). &amp;quot;Sun Jia Zheng &amp;quot; is a person name in the former, while &amp;quot;Sun Jia &amp;quot; is another name in the latter. Meanwhile, the string &amp;quot;Sun Jia Zheng Zai &amp;quot; leads to overlapping ambiguity and brings a collision between the unknown word &amp;quot; Sun Jia Zheng &amp;quot; (Sun Jiazheng) and &amp;quot; Zheng Zai &amp;quot; (zheng zai; now). What's more, the recognition precision rates for person names, place names, and transliterations are 91.26%, 69.12%, and 82.83%, respectively, while their recall rates are just 68.77%, 60.47%, and 78.29%, respectively (data from official testing in 1999) [Liu (1999)]. In a word, unknown word recognition has become one of the biggest stumbling blocks on the way to Chinese lexical analysis. A proper solution is important and urgent.</Paragraph>
    <Paragraph position="4"> Various approaches have been taken to Chinese unknown word recognition. They can be broadly categorized into &amp;quot;one-for-one&amp;quot;, &amp;quot;one-for-several&amp;quot; and &amp;quot;one-for-all&amp;quot; according to the number of categories of unknown words they can recognize. One-for-one solutions solve a particular problem, such as person name recognition [Song (1993); Ji (2001)], place name recognition [Tan (1999)] and transliteration recognition [Sun (1993)]. Similarly, one-for-several approaches provide one solution for several specific categories of unknown words [Lv (2001); Luo (2001)]. One-for-all solutions, as far as we know, have not yet become practicable [Chen (1999); He (2001)].</Paragraph>
    <Paragraph position="5"> Although currently practicable methods can achieve high precision or recall in some special cases, they have inherent deficiencies. First of all, the rules applied are mostly summarized by linguists through painstaking study of all kinds of huge &amp;quot;special name libraries&amp;quot; [Luo (2001)]. This is time-consuming, expensive and inflexible. The categories of unknown words are diverse and the number of such words is huge. With the rapid development of the Internet, this situation is becoming more and more serious. Therefore, it is very difficult to summarize simple yet thorough rules about their compositions and contexts.</Paragraph>
    <Paragraph position="6"> Secondly, the recognition process cannot be activated until some &amp;quot;indicator&amp;quot; tokens are scanned in. For instance, possible surnames or titles often trigger person name recognition on the following two or more characters. In the case of place name recognition, postfixes such as &amp;quot; Xian &amp;quot; (county) and &amp;quot; Shi &amp;quot; (city) activate recognition on the preceding characters. What's more, these methods tend to work only on monosyllabic tokens, which are obvious fragments after tokenization [Luo (2001); Lv (2001)]. This risks losing many unknown words that lack explicit features. Furthermore, this trigger mechanism cannot resolve ambiguity. For example, the unknown word &amp;quot;Fang Lin Shan &amp;quot; (Fang Lin Shan) may be a person name &amp;quot;Fang / Lin Shan &amp;quot; (Fang Linshan) or a place name &amp;quot;Fang Lin / Shan &amp;quot; (Fanglin Mountain). This paper presents a one-for-all approach based on roles tagging that avoids such problems.</Paragraph>
    <Paragraph position="7"> The process is: tag the tokens produced by word segmentation with the most probable roles, then perform unknown word recognition on the role sequence. The mechanism of roles tagging is just like that of a small and simple Part-Of-Speech tagger.</Paragraph>
    <Paragraph position="8"> The paper is organized as follows: In section 2, we will describe the approach in general.</Paragraph>
    <Paragraph position="9"> Following that, we will present the solution in practice. In the final part, we provide recognition experiments using roles-tagging methods. The result and possible problems are discussed as well.</Paragraph>
    <Paragraph position="10"> 1 Unknown words recognition based on roles tagging</Paragraph>
    <Section position="1" start_page="2" end_page="2" type="sub_section">
      <SectionTitle>
1.1 Lexical roles of unknown words
</SectionTitle>
      <Paragraph position="0"> Unknown words are often made up of distinctive components, most of which are monosyllabic characters or short words. In addition, there are regular relations between unknown words and their locality, especially their left and right context. In everyday writing and speech, a Chinese person name usually comprises a one- or two-character surname and a following given name of one or two characters, like &amp;quot;Xiao Jian Qun &amp;quot; (Xiao Jianqun) and &amp;quot; Zhu Ge Liang &amp;quot; (Zhu-Ge Liang). The preceding words are mostly titles, occupations or some conjunctive words, such as &amp;quot;Jing Li &amp;quot; (Manager), &amp;quot; Si Ji &amp;quot; (Driver) and &amp;quot; Dui &amp;quot; (To). The following words tend to be verbs such as &amp;quot;Shuo &amp;quot; (to say) and &amp;quot;Biao Shi &amp;quot; (to express). Similar components, contexts and relations can be discovered in place names, transliterations, organization names, and other types of unknown words.</Paragraph>
      <Paragraph position="1"> We define unknown word roles with respect to the varied internal components, the preceding and succeeding contexts, and other tokens in a particular sentence. Various roles are extracted according to their functions in the forming of different unknown words. The person name roles and transliteration roles sets are shown in Tables 1a and 1b, respectively. Using the role set for person names, the token sequence &amp;quot;Guan / Nei / Chen Lie / Zhou / En / Lai / He / Deng / Ying / Chao Sheng / Qian / Shi Yong / Guo / De / Wu Pin /&amp;quot; (What Zhou Enlai and Deng Yingchao used before death is presented in the museum) will be tagged as &amp;quot;Guan /A</Paragraph>
    </Section>
    <Section position="2" start_page="2" end_page="2" type="sub_section">
      <SectionTitle>
1.2 Roles tagging and unknown words recognition
</SectionTitle>
      <Paragraph position="0"> On the one hand, a sentence includes words playing different roles for a particular category of unknown words; on the other hand, such words can be recognized after identifying their role sequence. That is: tag the tokens produced by word segmentation with the most probable role sequence, then recognize unknown words by maximum pattern matching on the final role sequence.</Paragraph>
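The first stage of this process can be sketched in code. Below is a minimal, illustrative Viterbi role tagger working in log-space; the role labels, smoothing constant, and probability tables it expects are hypothetical stand-ins, not the paper's trained parameters.

```python
import math

def viterbi(tokens, roles, start_p, trans_p, emit_p):
    """Return the most probable role sequence for `tokens` (log-space Viterbi)."""
    # best[i][r] = (score of the best path ending in role r at position i, backpointer)
    best = [{r: (math.log(start_p.get(r, 1e-9))
                 + math.log(emit_p.get(r, {}).get(tokens[0], 1e-9)), None)
             for r in roles}]
    for i in range(1, len(tokens)):
        layer = {}
        for r in roles:
            # pick the predecessor role q maximizing the extended path score
            prev, score = max(
                ((q, best[i - 1][q][0]
                  + math.log(trans_p.get(q, {}).get(r, 1e-9))
                  + math.log(emit_p.get(r, {}).get(tokens[i], 1e-9)))
                 for q in roles),
                key=lambda qs: qs[1])
            layer[r] = (score, prev)
        best.append(layer)
    # backtrace from the highest-scoring final role
    last = max(roles, key=lambda r: best[-1][r][0])
    path = [last]
    for i in range(len(tokens) - 1, 0, -1):
        path.append(best[i][path[-1]][1])
    return list(reversed(path))
```

With toy tables in which &amp;quot;B&amp;quot;, &amp;quot;C&amp;quot;, &amp;quot;D&amp;quot; mark surname, first and second given-name characters, tagging the tokens of a name like &amp;quot;Zhou / En / Lai&amp;quot; yields the role sequence B, C, D.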
      <Paragraph position="1"> Roles tagging is similar to Part-Of-Speech tagging. Our tagging process is based on the Viterbi algorithm [Rabiner and Juang (1989)], which selects the optimum with maximum probability from all possible tag sequences. The methodology and its derivation are given below. Suppose that T is the token sequence after word segmentation and R is a role sequence for T. We take the role sequence R# with the maximum probability as the best choice. That is:</Paragraph>
      <Paragraph position="3"> According to the Bayes equation, we can get:</Paragraph>
      <Paragraph position="5"> For a particular token sequence, P(T) is a constant. So we can get E3 based on E1 and E2:</Paragraph>
      <Paragraph position="7"> We may consider T as the observation sequence and R as the state sequence hidden behind the observations. We now introduce the Hidden Markov Model [Rabiner and Juang (1986)] to resolve this typical problem. With equation E5, we can find the most probable role sequence; it is a simple application of the Viterbi algorithm.</Paragraph>
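The equations E1-E5 referenced in this derivation did not survive extraction. Based on the surrounding text, the standard derivation they describe can be reconstructed as follows (a reconstruction, not the paper's original typesetting):

```latex
% E1: choose the role sequence maximizing the conditional probability
R^{\#} = \operatorname*{argmax}_{R} P(R \mid T)
% E2: the Bayes equation
P(R \mid T) = \frac{P(R)\, P(T \mid R)}{P(T)}
% E3: P(T) is constant for a given token sequence T
R^{\#} = \operatorname*{argmax}_{R} P(R)\, P(T \mid R)
% E4: HMM assumptions (role bigrams; each token emitted by its role)
P(R)\, P(T \mid R) \approx \prod_{i=1}^{m} p(r_i \mid r_{i-1})\, p(t_i \mid r_i)
% E5: the final Viterbi objective
R^{\#} = \operatorname*{argmax}_{R} \prod_{i=1}^{m} p(r_i \mid r_{i-1})\, p(t_i \mid r_i)
```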
      <Paragraph position="8"> The final recognition through maximum pattern matching is performed not on the original text but on the role sequence. The person patterns are {BBCD, BBE, BBZ, BCD, BE, BG, BXD, BZ, CD, FB, Y, XD}. Before matching, we split tokens whose roles are like &amp;quot;U&amp;quot; or &amp;quot;V&amp;quot; (which indicate that the related token is generated by internal components and the outside contexts of unknown words) into two proper parts. Such processing can recall more unknown words and reduce overlapping collisions. For the above sample sentence, the final role sequence after splitting is &amp;quot;AAKBCDMBCDLAAAAAA&amp;quot;. Therefore, we can identify the possible person names &amp;quot;Zhou En Lai &amp;quot; and &amp;quot;Deng Ying Chao &amp;quot; according to the recognition pattern &amp;quot;BCD&amp;quot;.</Paragraph>
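The maximum pattern matching step can be sketched as a longest-match-first scan over the role string, using the person patterns listed above; the greedy consume-on-match strategy is an assumption about how "maximum" matching is realized.

```python
# Person-name patterns from the text, tried longest-first at each position.
PERSON_PATTERNS = sorted(
    ["BBCD", "BBE", "BBZ", "BCD", "BE", "BG", "BXD", "BZ", "CD", "FB", "Y", "XD"],
    key=len, reverse=True)

def match_persons(role_seq):
    """Return (start, pattern) pairs for maximal person-name matches in role_seq."""
    hits, i = [], 0
    while i < len(role_seq):
        for p in PERSON_PATTERNS:
            if role_seq.startswith(p, i):
                hits.append((i, p))
                i += len(p)  # consume the matched span so matches cannot overlap
                break
        else:
            i += 1  # no pattern starts here; advance one role
    return hits
```

On the sample role sequence "AAKBCDMBCDLAAAAAA" this yields two "BCD" matches, at positions 3 and 7, corresponding to the two person names in the example sentence.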
    </Section>
    <Section position="3" start_page="2" end_page="2" type="sub_section">
      <SectionTitle>
1.3 Automatic acquisition of roles knowledge
</SectionTitle>
      <Paragraph position="0"> As described in E5, the tag sequence R# is decided by two kinds of factors: the role transition probability p(r_i|r_i-1) and the emission probability p(t_i|r_i), i.e. the probability of a token t_i given the condition of being tagged with the role r_i. Both provide useful lexical knowledge for tagging and final recognition. According to the law of large numbers, if the training corpus is large enough, we can acquire the roles knowledge as follows:</Paragraph>
      <Paragraph position="8"> These probabilities are extracted from the corpus through a training process. The training corpus comes from one month of news from the People's Daily, 2,305,896 Chinese characters, which was manually checked after word segmentation and POS tagging (it can be downloaded at icl.pku.edu.cn, the homepage of the Institute of Computational Linguistics, Peking University).</Paragraph>
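The training step amounts to relative-frequency (maximum likelihood) estimation of the two probability tables from a role-tagged corpus. A minimal sketch follows; the corpus format (lists of (token, role) pairs) and the labels used in the example are illustrative assumptions, not the paper's actual data layout.

```python
from collections import Counter, defaultdict

def train(tagged_sentences):
    """Estimate p(r_i | r_i-1) and p(t_i | r_i) by relative frequency
    from sentences given as lists of (token, role) pairs."""
    trans, emit = defaultdict(Counter), defaultdict(Counter)
    for sent in tagged_sentences:
        for token, role in sent:
            emit[role][token] += 1           # count each token under its role
        for (_, prev), (_, cur) in zip(sent, sent[1:]):
            trans[prev][cur] += 1            # count adjacent role bigrams
    def normalize(table):
        out = {}
        for key, counts in table.items():
            total = sum(counts.values())
            out[key] = {x: n / total for x, n in counts.items()}
        return out
    return normalize(trans), normalize(emit)
```

The resulting tables plug directly into a Viterbi tagger; in practice unseen token/role pairs would also need smoothing, which this sketch omits.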
      <Paragraph position="9"> However, the corpus is tagged with a Part-Of-Speech tag set. Before training, the original POS tags must be converted to the proper roles by analysing every token in each sentence.</Paragraph>
    </Section>
  </Section>
</Paper>