<?xml version="1.0" standalone="yes"?>
<Paper uid="I05-2001">
  <Title>A Classification-based Algorithm for Consistency Check of Part-of-Speech Tagging for Chinese Corpora</Title>
  <Section position="4" start_page="4" end_page="4" type="metho">
    <SectionTitle>
AB and AC
</SectionTitle>
    <Paragraph position="0"> When using our sampled 1M-word training corpus to conduct a closed test, we found that the consistency check precision changes significantly with the values of AB and AC. Figure 2 shows the trend as AB varies from 0.1 to 0.9. We used</Paragraph>
  </Section>
  <Section position="5" start_page="4" end_page="5" type="metho">
    <SectionTitle>
3 Consistency Check of POS Tagging
</SectionTitle>
    <Paragraph position="0"> Our consistency check algorithm is based on the classification of the context vectors of multi-category words. Specifically, we first classify the context vectors of each multi-category word in the training corpus, and then conduct the consistency check of POS tagging based on the classification results.</Paragraph>
    <Section position="1" start_page="4" end_page="4" type="sub_section">
      <SectionTitle>
3.1 Similarity between Context Vectors of
Multi-category Words
</SectionTitle>
      <Paragraph position="0"> After constructing context vectors for all multi-category words from their context windows and POS tagging sequences, we define the similarity of two context vectors as the Euclidean distance between them.</Paragraph>
      <Paragraph position="1"> Dist(X, Y) = sqrt( (X_1 - Y_1)^2 + (X_2 - Y_2)^2 + ... + (X_n - Y_n)^2 ), where X and Y are two arbitrary context vectors of n dimensions.</Paragraph>
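This similarity measure can be sketched in a few lines; the vector values below are illustrative, not taken from the paper.

```python
import math

def euclidean_distance(x, y):
    """Euclidean distance between two context vectors of equal dimension."""
    assert len(x) == len(y), "vectors must have the same dimension"
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

# Two illustrative 3-dimensional context vectors (made-up values).
a = [1.0, 4.0, 2.0]
b = [4.0, 0.0, 2.0]
print(euclidean_distance(a, b))  # 5.0
```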
    </Section>
    <Section position="2" start_page="4" end_page="4" type="sub_section">
      <SectionTitle>
3.2 k-NN Classification Algorithm
</SectionTitle>
      <Paragraph position="0"> Classification is the process of assigning objects to classes. In this paper, we use a popular classification method: the k-NN algorithm.</Paragraph>
      <Paragraph position="1"> Suppose we have c classes of labeled samples. The idea of the k-NN algorithm is as follows: for each unlabeled object x, compute the distances between x and all samples whose classes are known, and select the k samples (the k nearest neighbors) with the smallest distances. The object x is then assigned to the class that contains the most samples among its k nearest neighbors.</Paragraph>
      <Paragraph position="2"> We now formally define the discriminant function and the discriminant rule. Suppose k</Paragraph>
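The k-NN rule described above can be sketched as follows; the toy vectors and the POS labels "n"/"v" are invented for the example.

```python
from collections import Counter
import math

def euclidean(x, y):
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def knn_classify(x, samples, k):
    """Assign x to the majority class among its k nearest labeled samples.

    samples: list of (vector, class_label) pairs.
    """
    # Sort the labeled samples by distance to x and keep the k closest.
    nearest = sorted(samples, key=lambda s: euclidean(x, s[0]))[:k]
    labels = Counter(label for _, label in nearest)
    return labels.most_common(1)[0][0]

# Toy example with two classes of 2-d vectors (illustrative values).
samples = [([0.0, 0.0], "n"), ([0.1, 0.2], "n"),
           ([1.0, 1.0], "v"), ([0.9, 1.1], "v")]
print(knn_classify([0.2, 0.1], samples, k=3))  # n
```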
    </Section>
    <Section position="3" start_page="4" end_page="4" type="sub_section">
      <SectionTitle>
3.3 Consistency Check Algorithm
</SectionTitle>
      <Paragraph position="0"> In this section, we describe the steps of our classification-based consistency check algorithm in detail.</Paragraph>
      <Paragraph position="1"> Step 1: Randomly sample sentences containing multi-category words and check their POS tagging manually. For each multi-category word, classify the context vectors of the sampled POS tagging sequences so that context vectors sharing the same POS for the multi-category word belong to the same class.</Paragraph>
      <Paragraph position="2"> Step 2: Given a context vector x of a multi-category word w, calculate the distances between x and all context vectors in the training corpus that contain the multi-category word w, and select the k context vectors with the smallest distances. Step 3: According to the k-NN algorithm, check the classes of the k nearest context vectors and classify the vector x.</Paragraph>
      <Paragraph position="3"> Step 4: Compare the POS of the multi-category word w in the class to which the k-NN algorithm assigns x with the POS tag of w. If they are the same, the POS tagging of the multi-category word w is considered consistent; otherwise, it is inconsistent.</Paragraph>
      <Paragraph position="4"> The major disadvantage of this algorithm is the difficulty of selecting the value of k. If k is too small, the classification result is unstable; on the other hand, if k is too big, the classification deviation increases.</Paragraph>
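Steps 2-4 above can be sketched as a single check; the vectors, POS labels, and helper names are invented for illustration, and Step 1's manual checking is assumed to have already produced the labeled context vectors.

```python
from collections import Counter
import math

def euclidean(x, y):
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def check_consistency(vector, tag, labeled_vectors, k):
    """Classify `vector` with k-NN over the labeled context vectors of the
    same multi-category word (Steps 2-3), then compare the majority class's
    POS with the word's actual tag (Step 4).

    labeled_vectors: (context_vector, pos_of_class) pairs from Step 1.
    Returns True when the tagging is judged consistent.
    """
    nearest = sorted(labeled_vectors, key=lambda s: euclidean(vector, s[0]))[:k]
    majority_pos = Counter(pos for _, pos in nearest).most_common(1)[0][0]
    return majority_pos == tag

# Illustrative labeled vectors for one multi-category word (made-up data).
labeled = [([0.0, 0.1], "n"), ([0.1, 0.0], "n"), ([1.0, 0.9], "v")]
print(check_consistency([0.05, 0.05], "n", labeled, k=3))  # True
```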
    </Section>
    <Section position="4" start_page="4" end_page="5" type="sub_section">
      <SectionTitle>
3.4 Selecting k in the Classification Algorithm
</SectionTitle>
      <Paragraph position="0"> Figure 3 shows the consistency check precision values obtained with various k values in the k-NN algorithm. The precision values are closed test results on our 1M-word training corpus, and were obtained by using AB = 0.4 and AC = 0.6 in the context vector model.
Table 1. Consistency check results:
Test corpora | Test type | Number of multi-category words | Number of true inconsistencies | Number of identified inconsistencies | Recall (%) | Precision (%)
1M-word | closed | 127,210 | 1,147 | 1,219 (156) | 92.67 | 87.20
500K-word | open | 64,467 | 579 | 583 (86) | 85.84 | 85.24</Paragraph>
      <Paragraph position="1"> Figure 3. Consistency check precision with different values of k in the k-NN algorithm.</Paragraph>
      <Paragraph position="2"> As shown in Figure 3, when k continues to increase from 6, the precision remains the same; when k reaches 9, the precision starts declining. Our experiments with other AB and AC values also show similar trends. Hence, we chose k = 6 in this paper.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="5" end_page="5" type="metho">
    <SectionTitle>
4 Experimental Results
</SectionTitle>
    <Paragraph position="0"> We evaluated our consistency check algorithm on our 1.5M-word corpus (including 1M-word training corpus) and conducted open and closed tests.</Paragraph>
    <Paragraph position="1"> The results are shown in Table 1.</Paragraph>
    <Paragraph position="2"> The experimental results show two interesting trends. First, the precision and recall of our consistency check algorithm are 87.20% and 92.67% in the closed test, respectively, and 85.24% and 85.84% in the open test. Compared to Zhang et al. (2004), the precision of consistency check is improved by 2-3%, and the recall is improved by 10%. These results indicate that our context vector model is a substantial improvement over the one used in Zhang et al. (2004). Second, thanks to the large improvement in recall, our consistency check algorithm can, to some extent, catch small-probability events in POS tagging.</Paragraph>
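Assuming the parenthesized counts in Table 1 are the falsely identified inconsistencies (an assumption; the table does not label them), the precision and recall figures can be reproduced up to rounding:

```python
def precision_recall(true_total, identified_total, false_identified):
    """Precision and recall, reading Table 1 as: `identified_total` flagged
    inconsistencies, of which `false_identified` (the parenthesized count,
    assumed here to mean false alarms) are incorrect."""
    correct = identified_total - false_identified
    precision = 100.0 * correct / identified_total
    recall = 100.0 * correct / true_total
    return round(precision, 2), round(recall, 2)

# Closed-test row of Table 1: 1,147 true, 1,219 identified (156 false).
print(precision_recall(1147, 1219, 156))  # (87.2, 92.68)
```

This matches the reported 87.20% precision and 92.67% recall up to rounding, and the open-test row (579 true, 583 identified, 86 false) likewise yields approximately 85.25% precision and 85.84% recall.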
  </Section>
</Paper>