<?xml version="1.0" standalone="yes"?> <Paper uid="C04-1066"> <Title>Japanese Unknown Word Identification by Character-based Chunking</Title> <Section position="6" start_page="7" end_page="8" type="concl"> <SectionTitle> 5 Summary and Future Direction </SectionTitle>
<Paragraph position="0"> We introduced a character-based chunking method for general unknown word identification in Japanese texts. Our method is based on cascading a morphological analyzer and a chunker, and it can identify unknown words regardless of their occurrence frequencies.</Paragraph>
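To make the cascade concrete, the following is a minimal sketch, not the authors' implementation: an analyzer's token stream is flattened into per-character features, and B/I/O tags from a chunker are collected into unknown word candidates. The analyzer output and the chunker's tag predictions are mocked with toy data; in practice they would come from tools such as ChaSen and an SVM-based chunker, and all function names here are illustrative.

    # Cascade sketch: morphological analysis -> character-based chunking.

    def characters_with_features(tokens):
        """Flatten analyzer tokens into (char, position-in-token, POS) triples."""
        chars = []
        for surface, pos in tokens:
            for i, ch in enumerate(surface):
                position = "B" if i == 0 else "I"  # character position in token
                chars.append((ch, position, pos))
        return chars

    def extract_unknown_words(chars, bio_tags):
        """Collect character spans tagged B/I as unknown word candidates."""
        words, current = [], ""
        for (ch, _, _), tag in zip(chars, bio_tags):
            if tag == "B":
                if current:
                    words.append(current)
                current = ch
            elif tag == "I" and current:
                current += ch
            else:
                if current:
                    words.append(current)
                current = ""
        if current:
            words.append(current)
        return words

    # Mocked analyzer output for a sentence containing an unknown word.
    tokens = [("これ", "pronoun"), ("は", "particle"),
              ("グーグル", "unknown"), ("です", "copula")]
    chars = characters_with_features(tokens)
    # A trained chunker would predict these tags from the character features.
    tags = ["O", "O", "O", "B", "I", "I", "I", "O", "O"]
    print(extract_unknown_words(chars, tags))  # ['グーグル']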
<Paragraph position="1"> Our future work needs to include POS guessing for the identified words. One might argue that, once the word boundaries are identified, the POS guessing methods used for European languages can be applied (Brants, 2000; Nakagawa, 2001). In our preliminary experiments on POS guessing, both SVM and Maximum Entropy models with contextual information achieve 93% accuracy under a coarse-grained POS set evaluation, but reach only around 65% under a fine-grained POS set evaluation.</Paragraph>
<Paragraph position="2"> The poor result may be due to the &quot;possibility-based POS tagset&quot;. The tagset is not necessarily friendly to statistical morphological analyzer development, but it is widely used in Japanese corpus annotation. In this scheme, the fine-grained POS Verbal Noun means that a word can be used both as a verbal noun (with a following verb) and as a general noun (without one). It is therefore difficult to estimate the POS Verbal Noun when the word appears in a context without a verb. We are currently pursuing research to better estimate fine-grained POS under the possibility-based POS tagset.</Paragraph>
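As a hedged illustration of such a POS guesser, the sketch below trains a Maximum Entropy model (logistic regression) over simple contextual features to predict a coarse-grained POS for an identified unknown word. The feature set and toy training examples are placeholders, not the authors' configuration.

    # POS guessing sketch: MaxEnt classifier over simple contextual features.
    from sklearn.feature_extraction import DictVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    def context_features(word, prev_word, next_word):
        return {
            "first_char": word[0],
            "last_char": word[-1],   # suffix cue for POS
            "prev": prev_word,       # left context word
            "next": next_word,       # right context word
            "length": str(len(word)),
        }

    # Toy data: (unknown word, left context, right context) -> coarse POS.
    train = [(("グーグル", "は", "で"), "noun"),
             (("ググる", "を", "た"), "verb"),
             (("エモい", "が", "ね"), "adjective"),
             (("ヤフー", "の", "に"), "noun")]
    X = [context_features(*x) for x, _ in train]
    y = [pos for _, pos in train]

    model = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))
    model.fit(X, y)
    print(model.predict([context_features("ライン", "は", "で")]))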
<Paragraph position="3"> Unknown words are out-of-vocabulary (hereafter OOV) words, so their definition depends on the base lexicon. We investigate the relationship between the base lexicon size and the number of OOV words, examining how a reduction of the lexicon size affects the OOV rate in a corpus. When we reduce the size of the lexicon, we remove words in increasing order of their frequency in the corpus.</Paragraph>
<Paragraph position="4"> As a substitute for corpus frequencies, we use hit counts from a web search engine, with goo as the search engine and IPADIC (Asahara and Matsumoto, 2002) as the base lexicon. Figure 3 shows the distribution of hit counts: the x-axis is the number of hits returned by the search engine, and the y-axis is the number of words that receive that many hits. The curve is distorted at 100 hits, at which round-off begins. We reduce the lexicon size according to the number of hits; the rate of known words in the corpus decreases along with the lexicon size. Figure 4 shows the rate of known words in the RWCP text corpus (Real World Computing Partnership, 1998). The x-axis is the threshold on the hit count: when a word's hit count is less than the threshold, we regard the word as unknown. The left y-axis is the number of known words in the corpus, and the right y-axis is the rate of known words. Note that when a word's hit count is 0, we do not remove the word from the lexicon, because the word may be a stop word of the web search engine. When we remove words with fewer than 1,000 hits from the lexicon, the lexicon shrinks to 1/3 of its original size and the OOV rate is 1%; when we remove words with fewer than 10,000 hits, the lexicon shrinks to 1/6 and the OOV rate is 3.5%. We use these two data sets, namely the lexicons and the corresponding definitions of out-of-vocabulary words, for the evaluations in Sections 3.1 and 3.2.</Paragraph>
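The thresholding procedure can be summarized in a short sketch. The hit-count table below is a toy stand-in for the goo counts over IPADIC, and the helper names are hypothetical.

    # Lexicon reduction sketch: threshold entries by web hit count,
    # then measure the OOV rate on a corpus.

    def reduce_lexicon(hit_counts, threshold):
        """Keep entries whose hit count reaches the threshold.

        Entries with 0 hits are also kept, since a zero count may only
        mean the word is a stop word of the search engine.
        """
        return {w for w, hits in hit_counts.items()
                if hits >= threshold or hits == 0}

    def oov_rate(corpus_tokens, lexicon):
        """Fraction of corpus tokens not covered by the reduced lexicon."""
        unknown = sum(1 for w in corpus_tokens if w not in lexicon)
        return unknown / len(corpus_tokens)

    # Toy hit counts; "は" has 0 hits because the engine treats it as a stop word.
    hit_counts = {"犬": 2_000_000, "は": 0, "歩く": 800_000,
                  "蟋蟀": 400, "彳む": 30}
    corpus = ["犬", "は", "歩く", "蟋蟀", "犬", "彳む"]
    for threshold in (1_000, 10_000):
        lexicon = reduce_lexicon(hit_counts, threshold)
        print(threshold, sorted(lexicon), round(oov_rate(corpus, lexicon), 2))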
</Section> </Paper>