<?xml version="1.0" standalone="yes"?>
<Paper uid="C04-1066">
  <Title>Japanese Unknown Word Identification by Character-based Chunking</Title>
  <Section position="4" start_page="2" end_page="2" type="evalu">
    <SectionTitle>
3 Experiments and Evaluation
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="2" end_page="2" type="sub_section">
      <SectionTitle>
3.1 Experiments for measuring Recall
</SectionTitle>
      <Paragraph position="0"> Firstly, we evaluate recall of our method. We use RWCP text corpus (Real World Computing Partnership, 1998) as the gold standard and IPADIC (Version 2.6.3) (Asahara and Matsumoto, 2002) as the base lexicon. We set up two data sets based on the hit number of a web search engine which is shown in Appendix A. Table 3 shows the two data sets.</Paragraph>
      <Paragraph position="1"> Words with lower hit number than the threshold are regarded as unknown. We evaluate how many unknown words in the corpus are identified.</Paragraph>
      <Paragraph position="2">  We perform five fold cross validation and average the five results. We carefully separate the data into the training and test data. The training and test data do not share any unknown word. We evaluate recall and precision on both token and type as follows:  The experiment is conducted only for recall, since it is difficult to make fair judgment of precision in this setting. The accuracy is estimated by the word segmentation defined in the corpus. Nevertheless, there are ambiguities of word segmentation in the corpus. For example, while &amp;quot;NG(Kyoto University)&amp;quot; is defined as one word in a corpus, &amp;quot;G U/G(Osaka University)&amp;quot; is defined as two words in the same corpus. Our analyzer identifies &amp;quot;GUG &amp;quot; as one word based on generalization of &amp;quot;NG &amp;quot;. Then, it will be judged as false in this experiment. We make fairer precision evaluation in the next section. However, since several related works make evaluation in this setting, we also present pre- null &amp;quot;PN&amp;quot; stands for &amp;quot;Proper Noun&amp;quot; Data Set B, forward direction. Shown POSs are higher than the rank 11th by the token sizes. example, an experimental setting &amp;quot;A/for&amp;quot; stands for the data set A with a forward direction chunking, while &amp;quot;A/Back&amp;quot; stands for the data set A with a backward direction chunking. Since there is no significant difference between token and type, our method can detect both high and low frequency words in the corpus. Table 5 shows the recall of each POS in the setting data set B and forward direction chunking. While the recall is slightly poor for the words which include compounds such as organization names and case particle collocations, it achieves high scores for the words which include no compounds such as person names. There are typical errors of conjugational words such as verbs and adjectives which are caused by ambiguities between conjugational suffixes and auxiliary verbs.</Paragraph>
    </Section>
    <Section position="2" start_page="2" end_page="2" type="sub_section">
      <SectionTitle>
3.2 Experiments for measuring Precision
</SectionTitle>
      <Paragraph position="0"> Secondly, we evaluate precision of our method manually. We perform unknown word identification on newspaper articles and patent texts.</Paragraph>
      <Paragraph position="1">  Firstly, we examine unknown word identification experiment in newspaper articles. We use articles of Mainichi Shinbun in January 1999 (116,863 sentences). Note that, the model is made by RWCP text corpus, which consists of articles of Mainichi Shinbun in 1994 (about 35,000 sentences).</Paragraph>
      <Paragraph position="2"> We evaluate the models by the number of identified words and precisions. The number of identified words are counted in both token and type. To estimate the precision, 1,000 samples are selected at random with the surrounding context and are showed in KWIC (KeyWord in Context) format.</Paragraph>
      <Paragraph position="3"> One human judge checks the samples. When the selected string can be used as a word, we regard it as a correct answer. The precision is the percentage of correct answers over extracted candidates.</Paragraph>
      <Paragraph position="4"> Concerning with compound words, we reject the words which do not match any constituent of the dependency structure of the largest compound word.</Paragraph>
      <Paragraph position="5"> Figure 2 illustrates judgment for compound words.</Paragraph>
      <Paragraph position="6"> In this example, we permit &amp;quot;y(overseas study)&amp;quot;. However, we reject &amp;quot;y8(short-term overseas)&amp;quot; since it does not compose any constitutent in the compound word.</Paragraph>
      <Paragraph position="7">  We make two models: Model A is composed by data set A in Table 3 and model B is composed by data set B. We make two settings for the direction of chunking, forward (from BOS to EOS) and backward (from EOS to BOS).</Paragraph>
      <Paragraph position="8"> Table 6 shows the precision for newspaper articles. It shows that our method achieves around 95% precision in both models. There is almost no difference in the several settings of the direction and the contextual feature.</Paragraph>
    </Section>
    <Section position="3" start_page="2" end_page="2" type="sub_section">
      <SectionTitle>
Patent Texts
</SectionTitle>
      <Paragraph position="0"> We also examine word identification experiment with patent texts. We use patent texts (25,084 sentences), which are OCR recognized. We evaluate models by the number of extracted words and precisions as in the preceding experiment. In this experiments, the extracted tokens may contain errors of the OCR reader. Thus, we define three categories for the judgment: Correct, Wrong and OCR Error.</Paragraph>
      <Paragraph position="1"> We use the rate of three categories for evaluation.</Paragraph>
      <Paragraph position="2"> Note that, our method does not categorize the outputs into Correct and OCR Error.</Paragraph>
      <Paragraph position="3"> Table 7 shows the precision for patent texts. The backward direction of chunking gets better score than the forward one. Since suffixes are critical clues for the long word identification, the backward direction is effective for this task.</Paragraph>
    </Section>
    <Section position="4" start_page="2" end_page="2" type="sub_section">
      <SectionTitle>
3.3 Word Segmentation Accuracy
</SectionTitle>
      <Paragraph position="0"> Thirdly, we evaluate how our method improves word segmentation accuracy. In the preceding experiments, we do chunking with tags in Table 2.</Paragraph>
      <Paragraph position="1"> We can do word segmentation with unknown word processing by annotating B and I tags to known words and rejecting O tag. RWCP text corpus and IPADIC are used for the experiment. We define single occurrence words as unknown words in the corpus. 50% of the corpus (unknown words/all words= 8,274/461,137) is reserved for Markov Model estimation. 40% of the corpus (7,485/368,587) is used for chunking model estimation. 10% of the corpus (1,637/92,222) is used for evaluation. As the base-line model for comparison, we make simple Markov Model using 50% and 90% of the corpus. The results of Table 8 show that the unknown word processing improves word segmentation accuracy.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>