<?xml version="1.0" standalone="yes"?>
<Paper uid="P06-3014">
  <Title>Parsing and Subcategorization Data</Title>
  <Section position="6" start_page="82" end_page="83" type="concl">
    <SectionTitle>
4 Conclusions and Future Work
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="82" end_page="82" type="sub_section">
      <SectionTitle>
4.1 Use of Parser's Output
</SectionTitle>
      <Paragraph position="0"> In this paper, we have shown that statistical parsers do not necessarily perform worse when dealing with spoken language.</Paragraph>
      <Paragraph position="1"> The conventional accuracy metrics for parsing (LR/LP) should not be taken as the only metrics for determining the feasibility of applying statistical parsers to spoken language. It is necessary to consider what information we want to extract from parsers' output and how we intend to use it.</Paragraph>
      <Paragraph position="2"> 1. Extraction of SCFs from Corpora: This task usually proceeds in two stages: (i) Use statistical parsers to generate SCCs. (ii) Apply statistical tests such as the Binomial Hypothesis Test (Brent, 1993) and the log-likelihood ratio score (Dunning, 1993) to the SCCs to filter out false SCCs on the basis of their reliability and likelihood. Our experiments show that the SCCs generated for spoken language are as accurate as those generated for written language, which suggests that it is feasible to apply the current technology for automatically extracting SCFs from corpora to spoken language.</Paragraph>
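As a concrete illustration of stage (ii), the sketch below scores SCC counts with Dunning's log-likelihood ratio and discards unreliable frames. The function names, the count-dictionary format, and the 3.84 cutoff (chi-squared, 1 d.f., p &lt; 0.05) are our illustrative assumptions, not the exact setup used in the experiments reported here.

```python
from math import log

def _ll(k, n, p):
    # Binomial log-likelihood; 0 * log(0) is treated as 0.
    ll = 0.0
    if k > 0:
        ll += k * log(p)
    if n - k > 0:
        ll += (n - k) * log(1 - p)
    return ll

def llr_score(k1, n1, k2, n2):
    """Dunning (1993) log-likelihood ratio comparing two binomial samples:
    k1 occurrences of an SCC among n1 frames for a verb, vs. k2
    occurrences among n2 frames for all other verbs."""
    p1, p2 = k1 / n1, k2 / n2
    p = (k1 + k2) / (n1 + n2)
    return 2 * (_ll(k1, n1, p1) + _ll(k2, n2, p2)
                - _ll(k1, n1, p) - _ll(k2, n2, p))

def filter_sccs(counts, threshold=3.84):
    """Keep (verb, scc) pairs whose LLR exceeds the chi-squared critical
    value at p < 0.05 with one degree of freedom (a common choice)."""
    total = sum(counts.values())
    verb_totals = {}
    for (verb, _scc), k in counts.items():
        verb_totals[verb] = verb_totals.get(verb, 0) + k
    kept = {}
    for (verb, scc), k1 in counts.items():
        n1 = verb_totals[verb]
        k2 = sum(k for (v, s), k in counts.items() if s == scc and v != verb)
        n2 = total - n1
        if n2 > 0 and llr_score(k1, n1, k2, n2) > threshold:
            kept[(verb, scc)] = k1
    return kept
```

A frame attested once for a verb whose other frames dominate the counts scores well below the cutoff and is filtered out, which is the intended effect of the reliability test.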
      <Paragraph position="3"> 2. Semantic Role Labeling: This task usually operates on parsers' output, and the number of dependents of each verb that are correctly retrieved by the parser clearly affects its accuracy. Our experiments show that the parser achieves a much lower accuracy in retrieving dependents from spoken language than from written language. This suggests that a lower accuracy is likely for semantic role labeling performed on spoken language; we are not aware that this has yet been tried.</Paragraph>
    </Section>
    <Section position="2" start_page="82" end_page="83" type="sub_section">
      <SectionTitle>
4.2 Punctuation and Speech Transcription
Practice
</SectionTitle>
      <Paragraph position="0"> Both our experiments and Roark's show that parsing accuracy measured by LR/LP decreases more sharply for WSJ than for Switchboard after punctuation is removed from the training and test data. In spoken language, commas are largely used to delimit disfluency elements. As noted in Engel et al. (2002), statistical parsers usually condition the probability of a constituent on the types of its neighboring constituents. The way commas are used in speech transcription appears to increase the range of neighboring constituents, thus fragmenting the data and making it less reliable. In written texts, on the other hand, commas serve as more reliable cues for parsers to identify phrasal and clausal boundaries.</Paragraph>
      <Paragraph position="1"> In addition, our experiment demonstrates that punctuation does not help much with the extraction of SCCs from spoken language: removing punctuation from both the training and test data results in a decrease of less than 0.3% in SR/SP. Furthermore, removing punctuation from both the training and test data actually slightly improves the performance of Bikel's parser in retrieving dependents from spoken language. All these results suggest that adding punctuation in speech transcription is of little help to statistical parsers, at least for the three state-of-the-art parsers considered here (Collins, 1999; Charniak, 2000; Bikel, 2004). There may be other good reasons for someone building a Switchboard-like corpus to provide punctuation, but there is no need to do so simply to help parsers.</Paragraph>
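The punctuation-removal preprocessing described above can be sketched as follows; we assume Penn Treebank punctuation POS tags and a simple (word, tag) representation, and the function name `strip_punctuation` is our own illustrative choice.

```python
# Punctuation POS tags in the Penn Treebank tag set.
PUNCT_TAGS = {",", ".", ":", "``", "''", "-LRB-", "-RRB-"}

def strip_punctuation(tagged_sent):
    """Drop punctuation tokens from a list of (word, tag) pairs,
    mirroring the preprocessing applied to training and test data."""
    return [(w, t) for (w, t) in tagged_sent if t not in PUNCT_TAGS]
```

Applying this to both the training and the test side keeps the experimental conditions matched, which is how the comparison above is set up.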
      <Paragraph position="2"> However, segmenting utterances into individual units is necessary because statistical parsers require sentence boundaries to be clearly delimited.</Paragraph>
      <Paragraph position="3"> Current statistical parsers are unable to handle an input string consisting of two sentences. For example, when presented with an input string as in (1) and (2), even with the two sentences separated by a period (1), Bikel's parser wrongly treats the second sentence as a sentential complement of the main verb like in the first sentence. As a result, the extractor generates an SCC NP-S for like, which is incorrect. The parser returns the same parse after we remove the period (2) and parse the string again.</Paragraph>
      <Paragraph position="4"> (1) I like the long hair. It was back in high school.</Paragraph>
      <Paragraph position="5"> (2) I like the long hair It was back in high school. Hence, while adding punctuation when transcribing a Switchboard-like corpus is of little help to statistical parsers, segmenting utterances into individual units is crucial. In future work, we plan to develop a system capable of automatically segmenting speech utterances into individual units.</Paragraph>
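To make the segmentation problem concrete, here is a deliberately naive baseline (not the system we plan to build): it splits a transcript only at sentence-final punctuation, so it handles example (1) but, as example (2) shows, does nothing for unpunctuated speech transcripts, which is exactly the case a trained segmenter must address. The function name `segment_utterance` is our own illustrative choice.

```python
import re

def segment_utterance(text):
    """Naive baseline: split a transcript at whitespace that follows
    sentence-final punctuation. Transcripts without punctuation come
    back as a single unsegmented unit."""
    parts = re.split(r"(?<=[.?!])\s+", text.strip())
    return [p for p in parts if p]
```

On example (1) this yields the two units the parser needs; on example (2) it returns the whole string unchanged, reproducing the failure mode discussed above.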
    </Section>
  </Section>
</Paper>