<?xml version="1.0" standalone="yes"?>
<Paper uid="C04-1066">
  <Title>Japanese Unknown Word Identification by Character-based Chunking</Title>
  <Section position="5" start_page="2" end_page="7" type="relat">
    <SectionTitle>
4 Related Work
</SectionTitle>
    <Paragraph position="0"> Mori (1996) presents a statistical method based on n-gram model for unknown word identification. The method estimates how likely the input string is to be a word. The method cannot cover low frequency unknown words. Their method achieves 87.4% precision and 73.2% recall by token, 57.1% precision and 69.1% recall by type  on EDR corpus. Ikeya (2000) presents a method to find unknown word boundaries for strings composed by only kanji characters. The  The evaluation of their method depends on the threshold of the confidence F min in their definition. We refer the precision and recall at F min =0.25.</Paragraph>
    <Paragraph position="1"> method also uses the likelihood based on n-gram model. Their method achieves 62.8 (F-Measure) for two kanji character words and 18.2 (F-Measure) for three kanji character words in newspapers domain. Nagata (1999) classifies unknown word types based on the character type combination in an unknown word. They define likelihood for each combination. The context POS information is also used. The method achieves 42.0% recall and 66.4% precision on EDR corpus  .</Paragraph>
    <Paragraph position="2"> Uchimoto (2001) presents Maximum Entropy based methods. They extract all strings less than six characters as the word candidates. Then, they do morphological analysis based on words in lexicon and extracted strings. They use Kyoto University text corpus (Version 2) (Kurohashi and Nagao, 1997) as the text and JUMAN dictionary (Version 3.61) (Kurohashi and Nagao, 1999) as the base lexicon null  . The recall of Uchimoto's method is 82.4% (1,138/1,381) with major POS estimation. We also perform nearly same experiment  . The result of our method is 48.8% precision and 36.2% recall (293/809) with the same training data (newspaper articles from Jan. 1 to Jan. 8, 1995) and test data (articles on Jan. 9, 1995). When we use all of the corpus excluding the test data, the result is 53.7% precision and 42.7% recall (345/809).</Paragraph>
    <Paragraph position="3"> Uchimoto (2003) also adopts their method for CSJ Corpus (Maekawa et al. 2000)  . They present that the recall for short words on the corpus is 55.7% (928/1,667) (without POS information). We try to perform the same experiment. However, we cannot get same version of the corpus. Then, we use CSJ Corpus - Monitor Edition (2002). It only contains short word by the definition of the National Institute of Japanese Language. 80 % of the corpus is used for training and the rest 20 % is for test. The result is 68.4% precision and 61.1% recall (810/1,326)  .</Paragraph>
    <Paragraph position="4">  They do not assume any base lexicon. Base lexicon size 45,027 words (composed by only the words in the corpus), training corpus size 100,000 sentences, test corpus size 100,000 sentences. Unknown words are defined by single occurrence words in the corpus.</Paragraph>
    <Paragraph position="5">  Base lexicon size 180,000 words, training corpus size 7,958 sentences, test corpus size 1,246 sentences OOV (out-ofvocabulary) rate 17.7%. Unknown words are defined by single occurrence words in the corpus.</Paragraph>
    <Paragraph position="6">  The difference is the definition of unknown words. Whereas they define unknown words by the possible word form frequency, we define ones by the stem form frequency.  Training corpus size 678,649 tokens, 83,819 utterances, test corpus size 185,573 tokens, 20,955 utterances OOV rate 0.71%. Single occurence word by the stem form is defined as the unknown word.</Paragraph>
    <Paragraph position="7"> Note, the version of the corpus and the definition of unknown word are different between Uchimoto's one and ours.</Paragraph>
    <Paragraph position="8"> The difference of the result may come from the word unit definition. The word unit in Kyoto University Corpus is longer than the word unit in RWCP text Corpus and the short word of CSJ Corpus. Though our method is good at shorter unknown words, the method is poor at longer words including compounds.</Paragraph>
    <Paragraph position="9"> For Chinese language, Chen (2002) introduces a method using statistical methods and human-aided rules. Their method achieves 89% precision and 68% recall on CKIP lexicon. Zhang (2002) shows a method with role (position) tagging on characters in sentences. Their tagging method is based on Markov model. The role tagging resembles our method in that it is a character-based tagging. Their method achieves 69.88% presicion and 91.65% recall for the Chinese person names recognition in the People's Daily. Goh (2003) also uses a character-based position tagging method by support vector machines. Their method achieves 63.8% precision and 58.4% recall for the Chinese general unknown words in the People's Daily. Our method is one variation of the Goh's method with redundant outputs of a morphological analysis.</Paragraph>
  </Section>
class="xml-element"></Paper>