<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-1101">
  <Title>Segmentation of Chinese Long Sentences Using Commas</Title>
  <Section position="7" start_page="5" end_page="7" type="evalu">
    <SectionTitle>
5 Experiments
</SectionTitle>
    <Paragraph position="0"> For training and testing, we use the Chinese Penn Treebank 2.0 corpus with 10-fold validation. First, using the bracket information, we extract the type (inter-clause comma or intra-clause comma) of each comma, as defined above. The extracted information is used as the gold-standard answer for training and testing.</Paragraph>
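The labeling step above can be sketched as follows. This is a hedged illustration only: the tree encoding and the choice of clause-level tags (IP, CP) are assumptions about the Penn Chinese Treebank annotation, not the paper's exact procedure.

```python
# Hedged sketch: recovering comma types from bracketed parses.
# Assumption: a comma is inter-clause when one of its sibling
# constituents carries a clause-level tag (IP, CP).

def label_commas(tree, clause_tags=("IP", "CP")):
    """tree: nested (tag, children) tuples; a leaf is (pos_tag, word),
    where word is a string. Returns 'inter'/'intra' labels, one per
    comma, in left-to-right order."""
    labels = []

    def walk(node):
        tag, children = node
        if isinstance(children, str):            # leaf: nothing below it
            return
        sibling_tags = [child[0] for child in children]
        for child in children:
            child_tag, body = child
            if isinstance(body, str):
                if body == "，":                  # the Chinese comma
                    kind = ("inter"
                            if any(t in clause_tags for t in sibling_tags)
                            else "intra")
                    labels.append(kind)
            else:
                walk(child)

    walk(tree)
    return labels
```

For example, a comma whose siblings are two IP constituents is labeled inter-clause, while a comma inside a coordinated NP is labeled intra-clause.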
    <Paragraph position="1"> We extract the feature vector for each comma, and use support vector machines (SVM) to perform the classification work.</Paragraph>
    <Paragraph position="2"> Performance is evaluated with four types of measures: precision, recall, and F-measure (Fb=1) for inter-clause and intra-clause commas respectively, and total accuracy. Each evaluation measure is calculated as follows.</Paragraph>
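The formulas did not survive extraction; under the standard definitions (an assumption, stated here for the inter-clause class, with the intra-clause case symmetric):

```latex
P_{\text{inter}} = \frac{\#\,\text{commas correctly classified as inter-clause}}{\#\,\text{commas classified as inter-clause}},
\qquad
R_{\text{inter}} = \frac{\#\,\text{commas correctly classified as inter-clause}}{\#\,\text{inter-clause commas in the gold standard}},
```
```latex
F_{\beta=1} = \frac{2PR}{P+R},
\qquad
\text{total accuracy} = \frac{\#\,\text{correctly classified commas}}{\#\,\text{all commas}}.
```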
    <Section position="1" start_page="5" end_page="5" type="sub_section">
      <SectionTitle>
5.1 Classification Using SVM
</SectionTitle>
      <Paragraph position="0"> Support vector machines (SVMs) are binary classifiers based on the maximum-margin strategy introduced by Vapnik (1995). For many classification tasks, SVMs deliver state-of-the-art performance.</Paragraph>
      <Paragraph position="2"> There are two advantages in using SVMs for classification: (1) high generalization performance in high-dimensional feature spaces.</Paragraph>
      <Paragraph position="3"> (2) Learning with combinations of multiple features is possible via various kernel functions. Because of these characteristics, many researchers use SVMs for natural language processing and obtain satisfactory experimental results (Yamada, 2003).</Paragraph>
      <Paragraph position="4"> In our experiments, we use SVMlight (Joachims, 1999) as the classification tool.</Paragraph>
    </Section>
    <Section position="2" start_page="5" end_page="6" type="sub_section">
      <SectionTitle>
5.2 Experimental Results
</SectionTitle>
      <Paragraph position="0"> First, we use the entire left segment and right segment as the input window. Table 6 gives the performance with different kernel functions. The RBF kernel with g=1.5 yields the best performance, so we use this kernel in all following experiments.</Paragraph>
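The kernel setting can be illustrated as below. This is a hedged sketch: the paper trains with SVMlight, and scikit-learn's SVC merely stands in here; only the kernel choice (RBF, g=1.5) comes from the paper, while the toy feature vectors and labels are invented for the illustration.

```python
# Hedged stand-in for SVMlight: scikit-learn SVC with the paper's
# kernel setting (RBF, gamma = 1.5). Features and labels are toy data.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 8))                 # toy comma feature vectors
y = (X[:, 0] + X[:, 1] > 0).astype(int)      # toy inter(1) / intra(0) labels

clf = SVC(kernel="rbf", gamma=1.5)           # RBF kernel, gamma = 1.5
clf.fit(X, y)
train_acc = clf.score(X, y)
```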
      <Paragraph position="1"> Next, we perform several experiments on how the selection of word window affects performance.</Paragraph>
      <Paragraph position="2"> First, we select the 3 words adjoining the comma in each of the left and right segments, indicated as win-3 in Table 7.</Paragraph>
      <Paragraph position="3">  Inter-clause comma precision is abbreviated as inter-P; similarly, inter-R denotes inter-clause comma recall, and so on.</Paragraph>
      <Paragraph position="4"> Second, we select the first 2 words and last 3 words of the left segment and the first 3 and last 2 words of the right segment, indicated as win 2-3 in Table 7. Finally, we use the part-of-speech sequence as input. As the experimental results show, the part-of-speech sequence is not a good feature; features carrying clause-relevant information give better output. We also find that the first-2/last-3 word window obtains the best total precision, better than using the entire left and right segments. From this, we conclude that the words at the beginning and end of a segment reveal its clausal information more effectively than the other words in the segment.</Paragraph>
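The win 2-3 window can be sketched as follows. The padding token and the exact feature layout are assumptions; only the window sizes (first 2 and last 3 words of the left segment, first 3 and last 2 of the right) come from the text.

```python
# Hedged sketch of the "win 2-3" word window around a comma.
PAD = "<pad>"  # assumed padding token for short segments

def take(words, n_head, n_tail):
    """First n_head and last n_tail words, padded when the segment is short."""
    head = [words[i] if i < len(words) else PAD for i in range(n_head)]
    tail = [words[-(n_tail - i)] if (n_tail - i) <= len(words) else PAD
            for i in range(n_tail)]
    return head + tail

def win_2_3(left_words, right_words):
    """win 2-3 features: first 2 / last 3 of the left segment,
    first 3 / last 2 of the right segment."""
    return take(left_words, 2, 3) + take(right_words, 3, 2)
```

A segment shorter than the window is padded rather than dropped, so every comma yields a fixed-length feature vector.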
    </Section>
    <Section position="3" start_page="6" end_page="6" type="sub_section">
      <SectionTitle>
5.3 Comparison of Parsing Accuracy with and without Segmentation Model
</SectionTitle>
      <Paragraph position="0"> The next experiment tests how the segmentation model contributes to parsing performance. We use a Chinese dependency parser implemented with the architecture presented by Kim (2001).</Paragraph>
      <Paragraph position="1"> After integrating the segmentation model, the parsing procedure is as follows: (1) part-of-speech tagging; (2) long-sentence segmentation by comma; (3) parsing based on the segmentation.</Paragraph>
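The three-step procedure can be sketched as below. This is a hedged outline only: `tag`, `classify_comma`, and `parse_segment` are hypothetical stand-ins for the paper's POS tagger, comma classifier, and dependency parser.

```python
# Hedged sketch of the integrated parsing pipeline.
# tag / classify_comma / parse_segment are hypothetical stand-ins.

def parse_long_sentence(words, tag, classify_comma, parse_segment):
    tagged = tag(words)                          # 1. part-of-speech tagging
    segments, current = [], []
    for i, (word, pos) in enumerate(tagged):     # 2. split at inter-clause commas
        current.append((word, pos))
        if word == "，" and classify_comma(tagged, i) == "inter":
            segments.append(current)
            current = []
    if current:
        segments.append(current)
    return [parse_segment(seg) for seg in segments]  # 3. parse each segment
```

Only commas classified as inter-clause trigger a split, so intra-clause commas stay inside a segment and are handled by the parser itself.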
      <Paragraph position="3">  Table 9 compares the results of the original parser with those of the parser integrated with the segmentation model.</Paragraph>
    </Section>
    <Section position="4" start_page="6" end_page="7" type="sub_section">
      <SectionTitle>
5.4 Comparison with Related Work
</SectionTitle>
      <Paragraph position="0"> Shiuan and Ann's (1996) system obtains clues for segmenting a complex English sentence by disambiguating link words, including the comma. Their approach of finding the segmentation point by analyzing the specific role of the comma in the sentence is similar to ours.</Paragraph>
      <Paragraph position="1"> However, our system differs from theirs as follows: (1) Shiuan and Ann's system sieves out just two roles for the comma, while ours analyzes the complete range of comma usages.</Paragraph>
      <Paragraph position="2"> (2) Shiuan and Ann's system also analyzes clausal conjunctions or subordinating prepositions as segmentation points.</Paragraph>
      <Paragraph position="3"> Although the language analyzed is different, and the training and testing data also differ, the motivation of the two systems is the same. In addition, both systems are evaluated by integration with the original parser. The average accuracy of comma disambiguation in Shiuan and Ann's system is 93.3%, which is higher than ours by 6.2%. However, for parsing accuracy, Shiuan and Ann's system improves by 4% (an error reduction of 21.2%), while ours improves by 9.6%.</Paragraph>
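The quoted error-reduction figure follows from the standard relative measure; as a consistency check using only the numbers above:

```latex
\text{error reduction} = \frac{E_{\text{before}} - E_{\text{after}}}{E_{\text{before}}},
\qquad
\frac{4}{E_{\text{before}}} = 0.212 \;\Rightarrow\; E_{\text{before}} \approx 18.9\%,
```

i.e., a 4-point absolute gain corresponds to a 21.2% error reduction when the baseline parser's error rate is roughly 18.9%.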
      <Paragraph position="4">  The evaluation measures are as defined in Kim (2001).</Paragraph>
    </Section>
  </Section>
</Paper>