<?xml version="1.0" standalone="yes"?>
<Paper uid="C00-2141">
  <Title>Local context templates for Chinese constituent boundary prediction</Title>
  <Section position="3" start_page="976" end_page="979" type="evalu">
    <SectionTitle>
5. Experimental results
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="976" end_page="977" type="sub_section">
      <SectionTitle>
5.1 Training and test data
</SectionTitle>
      <Paragraph position="0"> The training data were extracted from two different parts of an annotated Chinese corpus: 1) The small Chinese treebank developed at Peking University (Zhou, 1996b), which consists of sentences extracted from two kinds of Chinese texts: (a) test sets for Chinese-English machine translation systems, and (b) Singapore primary school textbooks.</Paragraph>
      <Paragraph position="1"> 2) The test suite treebank being developed at Tsinghua University (Zhou and Sun, 1999), which consists of about 10,000 representative Chinese sentences extracted from a large-scale balanced Chinese corpus with about 2,000,000 Chinese characters.</Paragraph>
      <Paragraph position="2"> The test data were extracted from articles in the People's Daily and manually annotated with correct constituent boundary tags. They were divided into two parts: 1) The ordinary sentences.</Paragraph>
      <Paragraph position="3"> 2) The sentences with keywords for conjunction structures (such as conjunctions or the special punctuation 'DunHao', the Chinese enumeration comma). These can be used to test the performance of our prediction algorithm on complex conjunction structures. Table 2 shows some basic statistics of the training and test data. Only sentences with more than one word were used for training and testing.</Paragraph>
    </Section>
    <Section position="2" start_page="977" end_page="977" type="sub_section">
      <SectionTitle>
5.2 The learned templates
</SectionTitle>
      <Paragraph position="0"> After the three-stage learning procedure, we obtained four kinds of local context templates. Table 3 shows their distribution data, where the section 'Type' lists the distribution of the different kinds of LCTs and the section 'Token' lists the distribution of the total words (i.e., tokens) covered by the LCTs. In the columns 'PTs' and 'Ratio', the slash '/' was used to separate the PTs with total frequency thresholds 0 and 3.</Paragraph>
      <Paragraph position="1"> More than 66% of the words in the training corpus can be covered by the unigram and bigram POS projected templates. Therefore, only about one third of the tokens were used for training the trigram templates.</Paragraph>
      <Paragraph position="2"> Although the type distribution of the trigram templates shows a tendency toward data sparseness (more than 70% of the trigram projected templates have a total frequency of less than 3), the useful trigram templates (TF&gt;3) still cover about 70% of the tokens learned. Therefore, we can expect that they will play an important role in constituent boundary prediction on an open test set.</Paragraph>
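      <Paragraph> The frequency cut described above (keeping only templates whose total frequency meets a threshold) can be sketched as follows; the data layout, the function name, and the toy sentences are illustrative assumptions, not the paper's implementation:

```python
from collections import Counter

def learn_projected_templates(tagged_sents, n, threshold=3):
    """Collect POS n-gram projected templates and keep only those
    whose total frequency (TF) meets the threshold."""
    counts = Counter()
    for sent in tagged_sents:                  # sent: list of (word, POS) pairs
        pos_seq = [pos for _, pos in sent]
        for i in range(len(pos_seq) - n + 1):
            counts[tuple(pos_seq[i:i + n])] += 1
    return {tpl: tf for tpl, tf in counts.items() if tf >= threshold}

sents = [[("w1", "n"), ("w2", "v"), ("w3", "n")],
         [("w4", "n"), ("w5", "v"), ("w6", "n")]]
print(learn_projected_templates(sents, 2, threshold=2))
```

Raising the threshold trades coverage for reliability, which is why the sparse trigram templates are filtered more aggressively than the unigram and bigram ones.</Paragraph>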
    </Section>
    <Section position="3" start_page="977" end_page="978" type="sub_section">
      <SectionTitle>
5.3 Prediction results
</SectionTitle>
      <Paragraph position="0"> In order to evaluate the performance of the constituent boundary prediction algorithm, the following measures were used: 1) The cost time (CT) of the kernel functions (CPU: Celeron 366, RAM: 64M). 2) Prediction precision (PP) = number of words with correct BPs (CortBP) / total word number (TWN). For the words with a single BP output, the correctness condition is: Annotated BP = Predicted BP. For the words with multiple BP outputs, the correctness condition is: Annotated BP ∈ Predicted BP set. The prediction results on the two test sets are shown in Table 4 and Table 5, whose first columns list the different template combinations used in the algorithm. In the columns 'CortBP' and 'PP', the slash '/' was used to separate the results of the single and multiple BP outputs.</Paragraph>
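      <Paragraph> The two correctness conditions above can be sketched as a small scoring routine; the tag encoding and the names are hypothetical, not taken from the paper:

```python
def prediction_precision(gold, predicted):
    """PP = CortBP / TWN.  A single-output word carries an int and must
    equal the annotated BP; a multiple-output word carries a set and is
    correct when the annotated BP falls inside it."""
    correct = 0
    for g, p in zip(gold, predicted):
        if isinstance(p, set):
            correct += g in p          # multiple BP outputs
        else:
            correct += g == p          # single BP output
    return correct / len(gold)

gold = [0, 1, 2, 1]
pred = [0, 1, {0, 2}, 0]               # third word has multiple outputs
print(prediction_precision(gold, pred))  # 3 of 4 correct -> 0.75
```
</Paragraph>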
      <Paragraph position="1"> After analyzing the experimental results, we found: 1) The POS information in the local context is very important for constituent boundary prediction. After adding the bigram and trigram POS templates, the prediction accuracy increased by about 9% and 3% respectively. But the character number information shows a lower boundary restriction capability: its application only results in a slight increase of precision in single-output mode and a slight decrease in multiple-output mode.</Paragraph>
      <Paragraph position="2">  2) Most of the prediction errors can be attributed to special structures in the sentences, such as conjunction structures (CSs) or collocation structures. Due to the long-distance dependencies among them, it is very difficult to assign the correct boundary tags to the words in these structures according to the local context templates alone. The lower overall precision on test set 2 (about 2% lower than on test set 1) also indicates the boundary prediction difficulties posed by conjunction structures, because there are more CSs in test set 2 than in test set 1.</Paragraph>
      <Paragraph position="3"> 3) The accuracy of the multiple-output results is about 2% better than that of the single-output results, while the words with multiple boundary tags constitute only about 10% of the total words predicted. Therefore, the multiple-output mode shows a good trade-off between precision and redundancy. It can be used as the best preprocessing data for a subsequent syntactic parser.</Paragraph>
      <Paragraph position="4">  4) The maximal ratio of the words set by projected templates can reach 80%, which guarantees the higher overall precision (Table 5: experimental results of test set 2). 5) The algorithm shows high efficiency: it can process about 6,000 words per second (CPU: Celeron 366, RAM: 64M).</Paragraph>
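      <Paragraph> The paper does not spell out how the multiple BP outputs are produced; one common scheme, shown here purely as an assumption, is to emit every boundary tag whose score lies within a margin of the best one:

```python
def boundary_outputs(scores, margin=0.1):
    """Emit every boundary tag whose score is within `margin` of the best;
    one survivor means single-output mode, several mean multiple-output."""
    best = max(scores.values())
    return {tag for tag, s in scores.items() if best - s <= margin}

print(boundary_outputs({0: 0.50, 1: 0.45, 2: 0.05}))  # close scores -> {0, 1}
print(boundary_outputs({0: 0.90, 1: 0.08, 2: 0.02}))  # clear winner -> {0}
```

A small margin keeps the redundancy low, consistent with the observation that only about 10% of the words receive multiple tags.</Paragraph>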
    </Section>
    <Section position="4" start_page="978" end_page="979" type="sub_section">
      <SectionTitle>
5.4 Comparison with other work
</SectionTitle>
      <Paragraph position="0"> Zhou(1996) proposed a constituent boundary prediction algorithm based on a hidden Markov model (HMM). The Viterbi algorithm was used to find the best boundary path B':</Paragraph>
      <Paragraph position="2"> where the local POS probability P(c_i | b_i) was computed by a backing-off model and the bigram parameters f(b_{i-1}, t_i, b_i) and f(b_i, t_i, b_{i+1}). To compare its performance with our algorithm, the trigram (POS and POS+CN) information was added to its backing-off model. Table 6 and Table 7 show the prediction results of the HMM-based algorithm, based on the same parameters learned from training set 1 and The performance of the LCT-based algorithm surpassed the HMM-based algorithm in accuracy (by about 1%) and efficiency (by about 10 times).</Paragraph>
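      <Paragraph> The Viterbi search over boundary tags can be sketched as below; the emission and transition tables are toy assumptions standing in for the backing-off parameters, not Zhou's actual model:

```python
import math

def viterbi_boundaries(pos_seq, tags, emit, trans):
    """Find the best boundary-tag path under a simple HMM:
    maximize the sum of log P(pos_i | b_i) + log P(b_i | b_{i-1})."""
    V = [{b: math.log(emit[(pos_seq[0], b)]) for b in tags}]
    back = []
    for pos in pos_seq[1:]:
        row, ptr = {}, {}
        for b in tags:
            prev = max(tags, key=lambda p: V[-1][p] + math.log(trans[(p, b)]))
            row[b] = (V[-1][prev] + math.log(trans[(prev, b)])
                      + math.log(emit[(pos, b)]))
            ptr[b] = prev
        V.append(row)
        back.append(ptr)
    last = max(V[-1], key=V[-1].get)   # best final tag
    path = [last]
    for ptr in reversed(back):         # follow back-pointers
        path.append(ptr[path[-1]])
    return path[::-1]

emit = {("n", 0): 0.7, ("n", 1): 0.3, ("v", 0): 0.2, ("v", 1): 0.8}
trans = {(0, 0): 0.5, (0, 1): 0.5, (1, 0): 0.6, (1, 1): 0.4}
print(viterbi_boundaries(["n", "v", "n"], [0, 1], emit, trans))  # [0, 1, 0]
```

In contrast, the LCT-based algorithm needs no global path search, which is one source of its roughly tenfold speed advantage.</Paragraph>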
      <Paragraph position="3"> Another similar work is Sun(1999). The difference lies in the definition of the constituent boundary tags: he defined them between the word pair w_i and w_{i+1}, not for each word. Using the HMM and the Viterbi algorithm, his algorithm showed similar performance to Zhou(1996) (using</Paragraph>
    </Section>
  </Section>
</Paper>