<?xml version="1.0" standalone="yes"?>
<Paper uid="W97-0104">
  <Title>A Statistics-Based Chinese Parser</Title>
  <Section position="7" start_page="5" end_page="11" type="concl">
    <SectionTitle>
4 Experimental results
</SectionTitle>
    <Paragraph position="0"> Because no annotated Chinese corpus was available, we built a small Chinese treebank for training and evaluating the parser. It consists of sentences extracted from two sources: (1) a test set for Chinese-English machine translation systems (Text A) and (2) Singapore primary school textbooks on the Chinese language (Text B). Table 1 shows the basic statistics of these two parts of the treebank.</Paragraph>
    <Paragraph position="1"> The treebank is then divided into a training set of 4777 sentences and a test set of 796 sentences according to a balanced sampling principle. Figure 4 shows the distributions of sentence length in the training and test sets. In addition, all sentences in the treebank are classified into two sets by the number of words (including punctuation) they contain: the simple sentence set, in which every sentence has no more than 20 words, and the complex sentence set, in which every sentence has more than 20 words. Comparing the parser's performance on these two types of sentences gives a more complete picture of its behavior. Table 2 shows the distribution of simple and complex sentences in the training and test sets.</Paragraph>
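The simple/complex split described above can be sketched in a few lines of Python. This is only an illustration: the function name `split_by_length` and the representation of a sentence as a list of word tokens (punctuation included) are assumptions, not part of the original system.

```python
from typing import Dict, List

# Threshold stated in the text: a simple sentence has no more than 20 words.
MAX_SIMPLE_LENGTH = 20


def split_by_length(sentences: List[List[str]]) -> Dict[str, List[List[str]]]:
    """Group tokenized sentences into 'simple' and 'complex' sets.

    Each sentence is assumed to be a list of word tokens in which
    punctuation marks count as words (hypothetical representation).
    """
    groups: Dict[str, List[List[str]]] = {"simple": [], "complex": []}
    for words in sentences:
        key = "complex" if len(words) > MAX_SIMPLE_LENGTH else "simple"
        groups[key].append(words)
    return groups


# Toy input: sentences of 20, 21 and 3 placeholder tokens.
toy = [["w"] * 20, ["w"] * 21, ["w"] * 3]
groups = split_by_length(toy)
```

A sentence of exactly 20 words falls in the simple set, matching the "no more than 20 words" wording.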
    <Paragraph position="2"> To evaluate the performance of the current Chinese parser, we use the following measures: 1) Matched precision (MP) = (number of correct matched constituents in proposed parse) / (number of matched constituents in proposed parse). 2) Matched recall (MR) = (number of correct matched constituents in proposed parse) / (number of constituents in treebank parse). 3) Crossing brackets (CBs) = number of constituents which violate constituent boundaries with a constituent in the treebank parse.</Paragraph>
    <Paragraph position="3"> The above measures are similar to the PARSEVAL measures defined in [Bla91]. Here, for a matched constituent to be 'correct', it must have the same boundary locations as a constituent in the treebank parse.</Paragraph>
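As a minimal sketch of the three bracket measures above (the precision ratio over the proposed parse, the corresponding ratio over the treebank parse, and crossing brackets), the following Python computes them for one sentence. Several things here are assumptions made for illustration: constituents are represented as unlabeled (start, end) word spans with an exclusive end, the denominator of the precision ratio is taken to be all constituents in the proposed parse, and the names `crosses` and `bracket_scores` are hypothetical.

```python
from typing import Set, Tuple

Span = Tuple[int, int]  # (start, end) word offsets, end exclusive (assumed)


def crosses(p: Span, g: Span) -> bool:
    """True if the two spans overlap without one containing the other."""
    (a, b), (c, d) = p, g
    return (c > a and b > c and d > b) or (a > c and d > a and b > d)


def bracket_scores(proposed: Set[Span], gold: Set[Span]):
    """Return (precision ratio, recall-style ratio, crossing brackets)."""
    # A proposed constituent is 'correct' when a treebank constituent
    # has exactly the same boundaries.
    correct = proposed.intersection(gold)
    mp = len(correct) / len(proposed) if proposed else 0.0
    mr = len(correct) / len(gold) if gold else 0.0
    # Count proposed constituents that cross at least one treebank constituent.
    cbs = sum(1 for p in proposed if any(crosses(p, g) for g in gold))
    return mp, mr, cbs


# Toy five-word sentence (made-up spans, not data from the paper).
gold = {(0, 5), (0, 2), (2, 5), (3, 5)}
proposed = {(0, 5), (0, 2), (2, 4), (1, 3)}
mp, mr, cbs = bracket_scores(proposed, gold)
```

In this toy case two of the four proposed spans match treebank spans exactly, and the other two each cross a treebank constituent.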
    <Paragraph position="4"> 4) Boundary prediction precision (BPP) = (number of words with correct constituent boundary prediction) / (number of words in the sentence). 5) Labeled precision (LP) = (number of correctly labeled constituents in proposed parse) / (number of correct matched constituents in proposed parse). 6) Sentence parsing ratio (SPR) = (number of sentences having a proposed parse by parser) / (number of input sentences). Table 3 shows the experimental results. On an 80 MHz 486 personal computer with 16 megabytes of RAM, the parser can parse about 1.38 sentences per second.</Paragraph>
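The remaining three measures (boundary prediction precision, labeled precision, and the sentence parsing ratio) can be sketched in the same way. This is an illustrative sketch under stated assumptions: each word carries a single boundary tag (the tag names "L", "M", "R" are made up), a parse is a mapping from (start, end) span to constituent label, a sentence without a parse is represented as None, and all function names are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Span = Tuple[int, int]  # (start, end) word offsets, end exclusive (assumed)


def boundary_prediction_precision(predicted: List[str], gold: List[str]) -> float:
    """Fraction of words whose predicted boundary tag equals the gold tag."""
    if len(predicted) != len(gold):
        raise ValueError("tag sequences must have the same length")
    hits = sum(1 for p, g in zip(predicted, gold) if p == g)
    return hits / len(gold) if gold else 0.0


def labeled_precision(proposed: Dict[Span, str], gold: Dict[Span, str]) -> float:
    """Among correctly matched spans, the fraction that also carry the gold label."""
    matched = [span for span in proposed if span in gold]
    if not matched:
        return 0.0
    labeled = sum(1 for span in matched if proposed[span] == gold[span])
    return labeled / len(matched)


def sentence_parsing_ratio(parses: List[Optional[object]]) -> float:
    """Fraction of input sentences for which the parser proposed a parse."""
    parsed = sum(1 for p in parses if p is not None)
    return parsed / len(parses) if parses else 0.0


# Toy inputs (made up for illustration, not data from the paper).
bpp = boundary_prediction_precision(["L", "M", "R", "L", "R"],
                                    ["L", "M", "R", "M", "R"])
lp = labeled_precision({(0, 5): "S", (0, 2): "NP", (2, 5): "NP"},
                       {(0, 5): "S", (0, 2): "NP", (2, 5): "VP", (3, 5): "NP"})
spr = sentence_parsing_ratio([{"tree": 1}, None, {"tree": 2}, {"tree": 3}])
```

Note that the labeled measure is normalized by the correctly matched constituents only, as in the definition above, so it isolates labeling errors from bracketing errors.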
    <Paragraph position="5">  as the widely accepted POS tagging, is based on the following premises: (a) Most constituent boundaries in a Chinese sentence can be predicted from local word and POS information. (b) The parsing complexity can be reduced based on constituent boundary prediction. (2) The proof of the complete matching principle and the application of matching restriction schemes guarantee the soundness and efficiency of the matching algorithm.</Paragraph>
    <Paragraph position="6"> (3) Using SCFG rules as the main source of disambiguation knowledge reduces the hard work of manually developing a complex and detailed disambiguation rule base.</Paragraph>
    <Paragraph position="7"> Although the experimental results are encouraging, the algorithm can be improved in several ways. Unsupervised training methods for SCFG rules, such as the inside-outside algorithm [LY90] and its improved variants ([PS92], [SYW95]), should be tried in the absence of large-scale Chinese treebanks. The disambiguation model could also be extended to capture context-sensitive statistics [CC94] and word statistics ([EC95], [Coi96]).</Paragraph>
  </Section>
</Paper>