<?xml version="1.0" standalone="yes"?> <Paper uid="P03-2039"> <Title>Chinese Unknown Word Identification Using Character-based Tagging and Chunking</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Proposed Method </SectionTitle> <Paragraph position="0"> We now describe the three steps in turn.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 Morphological Analysis </SectionTitle> <Paragraph position="0"> ChaSen is a widely used morphological analyzer for Japanese text (Matsumoto et al., 2002). It achieves over 97% precision on newspaper articles. We assume that Chinese shares characteristics with Japanese to a certain extent, as both languages use semantically heavily loaded characters, i.e. kanji in Japanese and hanzi in Chinese.</Paragraph> <Paragraph position="1"> Based on this assumption, a model designed for Japanese may perform well enough on Chinese. The morphological analyzer is based on Hidden Markov Models; its goal is to find the word and POS sequence that maximizes the probability. Details can be found in (Matsumoto et al., 2002).</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.2 Character Based Features </SectionTitle> <Paragraph position="0"> Character-based features allow the chunker to detect unknown words more effectively, especially when unknown words overlap known words.</Paragraph> <Paragraph position="1"> For example, ChaSen segments the phrase &quot;a2 a3a5a4a7a6a5a8 . . . &quot; (Deng Yingchao before death) into &quot;a2 /a3 /a4a9a6 /a8 /. . . &quot; (Deng Ying before next life). With word-based features, it is impossible to detect the unknown person name &quot;a2 a3a10a4 &quot;, because the chunker will not break up the word &quot;a4a11a6 &quot; (next life). 
Breaking words into characters enables the chunker to look at characters individually and to identify the unknown person name above.</Paragraph> <Paragraph position="2"> The POS tag from the output of morphological analysis is subcategorized to include the position of the character in the word. The list of positions is shown in Table 1. For example, if a word contains three characters, the first character is tagged &lt;POS&gt;-B, the second &lt;POS&gt;-I and the third &lt;POS&gt;-E. A single-character word is tagged &lt;POS&gt;-S.</Paragraph> <Paragraph position="3"> Character types can also be used as features for chunking. However, the only such information at our disposal is whether a character can be a family name. The set of characters used for transliteration may also be useful for retrieving transliterated names.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.3 Chunking with Support Vector Machine </SectionTitle> <Paragraph position="0"> We use a Support Vector Machines-based chunker, YamCha (Kudo and Matsumoto, 2001), to extract unknown words from the output of the morphological analysis. The chunker uses a polynomial kernel of degree 2. Please refer to the cited paper for details. Basically, we classify the characters into 3 categories: B (beginning of a chunk), I (inside a chunk) and O (outside a chunk). A chunk is considered an unknown word in this case. We can parse a sentence either forward, from the beginning of the sentence, or backward, from its end. There are always some relationships between unknown words and their contexts in the sentence. We use two characters on each side (left and right) as the context window for chunking.</Paragraph> <Paragraph position="1"> Figure 1 illustrates a snapshot of the chunking process. 
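The character-level representation of Section 2.2 and the chunking setup of Section 2.3 can be sketched as follows. These are hypothetical illustrative helpers, not part of ChaSen or YamCha; they assume the morphological output is available as a list of (word, POS) pairs.

```python
def char_features(words):
    # Convert (word, POS) output into one feature tuple per character,
    # using the position subtags of Table 1:
    # B (first), I (middle), E (last), S (single-character word).
    feats = []
    for word, pos in words:
        if len(word) == 1:
            feats.append((word, pos + "-S"))
        else:
            feats.append((word[0], pos + "-B"))
            for ch in word[1:-1]:
                feats.append((ch, pos + "-I"))
            feats.append((word[-1], pos + "-E"))
    return feats

def context_window(feats, i, size=2):
    # Features fed to the chunker at position i: the character itself
    # plus `size` characters of context on each side, as in Figure 1.
    # Slots beyond the sentence boundary are padded with None.
    padded = [None] * size + list(feats) + [None] * size
    return padded[i : i + 2 * size + 1]

def backward(feats):
    # Backward parsing simply chunks the reversed character sequence,
    # so sentence-final characters are examined first.
    return list(reversed(feats))
```

For example, `char_features([("ab", "n"), ("c", "v")])` yields the tags `n-B`, `n-E`, `v-S` for the three characters.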
During forward parsing, to infer the unknown word tag &quot;I&quot; at position i, the chunker uses the features appearing in the solid box. The reverse is done in backward parsing.</Paragraph> </Section> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Experiments </SectionTitle> <Paragraph position="0"> We conducted an open test experiment. One month of news articles from the People's Daily (1998) was used as the corpus. It contains about 300,000 words (about 1,000,000 characters) annotated with 39 POS tags. The corpus was randomly divided into 2 parts, with a training/testing size ratio of 4:1.</Paragraph> <Paragraph position="1"> All person names and organization names were deleted from the dictionary for extraction. There were 4,690 person names and 2,871 organization names in the corpus. For general unknown words, all words that occurred only once in the corpus were deleted from the dictionary and treated as unknown words. 12,730 unknown words were created under this condition.</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Results </SectionTitle> <Paragraph position="0"> We now present the results of our experiments in terms of recall, precision and F-measure, as usual in such experiments.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 Person Name Extraction </SectionTitle> <Paragraph position="0"> Table 2 shows the results of person name extraction.</Paragraph> <Paragraph position="1"> The accuracy of person name retrieval was quite satisfactory. We could also extract names overlapping with the next known word. For example, for the sequence &quot;a2 /Ng a3 /Ag a4a14a6 /v a8 /f a15a14a16 /v a17 /v&quot; [Figure 1 residue: columns Position, Char., POS (best), Family Name, Chunk]</Paragraph> <Paragraph position="3"> the name could be extracted even though part of it belongs to the known word &quot;a4a11a6 &quot;. 
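As an aside, the recall, precision and F-measure reported throughout this section can be computed as in the following sketch (`prf` is a hypothetical helper; the counts are illustrative, not taken from our tables):

```python
def prf(num_correct, num_extracted, num_gold):
    # precision: fraction of extracted words that are correct
    # recall: fraction of gold unknown words that were found
    # F-measure: harmonic mean of precision and recall
    p = num_correct / num_extracted
    r = num_correct / num_gold
    f = 2 * p * r / (p + r)
    return p, r, f
```

For instance, 60 correct extractions out of 80 proposed, against 120 gold words, give a precision of 0.75, a recall of 0.50, and an F-measure of 0.60.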
It could also identify transliterated foreign names such as &quot;a27a29a28a29a30 &quot;. Furthermore, if we have the information that a character is a possible family-name character, it helps to increase the accuracy of the system, as the last two rows of Table 2 show.</Paragraph> <Paragraph position="4"> Some person names could not be extracted, as in the sequence &quot;a40 /a a41 /q a42 /d a43 /d a44 a45 /a&quot; (Lao Zhang is still very positive). In this example, &quot;a40a29a41a46a42 &quot; was extracted as a person name, but the correct name is only &quot;a40a10a41 &quot;. This is because the character following the unknown word is a monosyllabic word, so it is more likely to be joined with the unknown word into one chunk.</Paragraph> <Paragraph position="5"> Another example is &quot;a47 /q a21a46a41 /v a48 /n a49 /n&quot; (The owner Zhang Baojun), where the family name &quot;a41 &quot; has been joined with the known word &quot;a21a50a41 &quot; (suggest) before it. Therefore, the person name &quot;a41a26a48a14a49 &quot; was not extracted (the correct segmentation should be &quot;a47a51a21 /n a41a26a48a14a49 /nr&quot;).</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.2 Organization Name Extraction </SectionTitle> <Paragraph position="0"> Table 3 shows the results for organization name extraction. Organization names are best extracted by backward parsing. This may be explained by the fact that, in Chinese, the last part of an organization name is usually a keyword indicating that it is an organization, such as &quot;a52a50a53 &quot; (company), &quot;a54a7a55 &quot; (group), or &quot;a56a50a57 &quot; (organization). By parsing the sentence backward, these keywords are examined first and are more likely to be identified.</Paragraph> <Paragraph position="1"> There are quite a number of organization names that could not be identified. 
For example, &quot;a58a60a59a26a61a26a62</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.3 Unknown Words Extraction in General </SectionTitle> <Paragraph position="0"> As mentioned above, we deleted all words that occur only once from the dictionary to artificially create unknown words. These &quot;unknown words&quot; included common nouns, verbs, numbers, etc. The results of this experiment are shown in Table 4.</Paragraph> <Paragraph position="1"> In general, around 60% accuracy (F-measure) was achieved for unknown word detection, and backward parsing seems to do slightly better than forward parsing. To confirm that character-based chunking is better than word-based chunking, we carried out an experiment with word-based chunking as well.</Paragraph> <Paragraph position="2"> The results showed that character-based chunking yields better results than word-based chunking. The F-measures (word based vs character based) are 81.28 vs 84.69 for person name extraction, 67.88 vs 70.40 for organization names, and 56.96 vs 61.00 for general unknown words.</Paragraph> </Section> </Section> <Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 6 Comparison with Other Works </SectionTitle> <Paragraph position="0"> There are basically two approaches to extracting unknown words: statistical and rule-based. In this section, we compare our results with previously reported work.</Paragraph> <Paragraph position="1"> (Chen and Ma, 2002) present an approach that automatically generates morphological rules and statistical rules from a training corpus. They use a very large corpus to generate the rules, so the generated rules can represent patterns of unknown words well. 
Since we use a different corpus for our experiments, a direct comparison is difficult.</Paragraph> <Paragraph position="2"> They report a precision of 89% and a recall of 68% for all unknown word types. This is better than our system, which achieves only 65% precision and 58% recall.</Paragraph> <Paragraph position="3"> In (Shen et al., 1997), local statistical information is used to identify the location of unknown words.</Paragraph> <Paragraph position="4"> They assume that the frequency of occurrences of an unknown word is normally high within a fixed-size cache. They also investigated the relationship between the size of the cache and its performance.</Paragraph> <Paragraph position="5"> They report that the larger the cache, the higher the recall, but this is not the case for precision. They report a recall of 54.9%, less than the 58.43% we achieved.</Paragraph> <Paragraph position="6"> (Zhang et al., 2002) suggest a method based on role tagging for unknown word recognition. Their method is also based on Markov Models. Our method is closest to the role tagging idea, as the latter is also a kind of character-based tagging. The extension in our method is that we first perform morphological analysis and then use SVM-based chunking for unknown word extraction. In their paper, they report an F-measure of 79.30% in an open test environment for person name extraction. Our method seems better, with an F-measure of 86.78% for person name extraction (for both Chinese and foreign names).</Paragraph> </Section> </Paper>