<?xml version="1.0" standalone="yes"?>
<Paper uid="W01-0706">
  <Title>Exploring Evidence for Shallow Parsing</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Experimental Design
</SectionTitle>
    <Paragraph position="0"> In order to run a fair comparison between full parsers and shallow parsers -- which could produce quite different outputs -- we have chosen the task of identifying the phrase structure of a sentence. This structure can be easily extracted from the outcome of a full parser and a shallow parser can be trained specifically on this task.</Paragraph>
    <Paragraph position="1"> There is no agreement on how to define phrases in sentences. The definition could depend on downstream applications and could range from simple syntactic patterns to message units people use in conversations. For the purpose of this study, we chose to use two different definitions.</Paragraph>
    <Paragraph position="2"> Both can be formally defined and they reflect different levels of shallow parsing patterns.</Paragraph>
    <Paragraph position="3"> The first is the one used in the chunking competition in CoNLL-2000 (Tjong Kim Sang and Buchholz, 2000). In this case, a full parse tree is represented in a flat form, producing a representation as in the example above. The goal in this case is therefore to accurately predict a collection of a2a3a2 different types of phrases. The chunk types are based on the syntactic category part of the bracket label in the Treebank. Roughly, a chunk contains everything to the left of and including the syntactic head of the constituent of the same name. The phrases are: adjective phrase (ADJP), adverb phrase (ADVP), conjunction phrase (CONJP), interjection phrase (INTJ), list marker (LST), noun phrase (NP), preposition phrase (PP), particle (PRT), subordinated clause (SBAR), unlike coordinated phrase (UCP), verb phrase (VP). (See details in (Tjong Kim Sang and Buchholz, 2000).) The second definition used is that of atomic phrases. An atomic phrase represents the most basic phrase with no nested sub-phrases. For example, in the parse tree,</Paragraph>
    <Paragraph position="5"> Pierre Vinken, 61 years, the board, a nonexecutive director and Nov.</Paragraph>
    <Paragraph position="6"> 29 are atomic phrases while other higher-level phrases are not. That is, an atomic phrase denotes a tightly coupled message unit which is just above the level of single words.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1 Parsers
</SectionTitle>
      <Paragraph position="0"> We perform our comparison using two state-of-the-art parsers. For the full parser, we use the one developed by Michael Collins (Collins, 1996; Collins, 1997) -- one of the most accurate full parsers around. It represents a full parse tree as a set of basic phrases and a set of dependency relationships between them. Statistical learning techniques are used to compute the probabilities of these phrases and of candidate dependency relations occurring in that sentence. After that, it will choose the candidate parse tree with the highest probability as output. The experiments use the version that was trained (by Collins) on sections 02-21 of the Penn Treebank. The reported results for the full parse tree (on section 23) are recall/precision of 88.1/87.5 (Collins, 1997).</Paragraph>
      <Paragraph position="1"> The shallow parser used is the SNoW-based CSCL parser (Punyakanok and Roth, 2001; Munoz et al., 1999). SNoW (Carleson et al., 1999; Roth, 1998) is a multi-class classifier that is specifically tailored for learning in domains in which the potential number of information sources (features) taking part in decisions is very large, of which NLP is a principal example. It works by learning a sparse network of linear functions over a pre-defined or incrementally learned feature space. Typically, SNoW is used as a classifier, and predicts using a winner-take-all mechanism over the activation value of the target classes. However, in addition to the prediction, it provides a reliable confidence level in the prediction, which enables its use in an inference algorithm that combines predictors to produce a coherent inference. Indeed, in CSCL (constraint satisfaction with classifiers), SNoW is used to learn several different classifiers - each detects the beginning or end of a phrase of some type (noun phrase, verb phrase, etc.). The outcomes of these classifiers are then combined in a way that satisfies some constraints - non-overlapping constraints in this case - using an efficient constraint satisfaction mechanism that makes use of the confidence in the classifier's outcomes.</Paragraph>
      <Paragraph position="2"> Since earlier versions of the SNoW based CSCL were used only to identify single phrases (Punyakanok and Roth, 2001; Munoz et al., 1999) and never to identify a collection of several phrases at the same time, as we do here, we also trained and tested it under the exact conditions of CoNLL-2000 (Tjong Kim Sang and Buchholz, 2000) to compare it to other shallow parsers. Table 1 shows that it ranks among the top shallow parsers evaluated there 1.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.2 Data
</SectionTitle>
      <Paragraph position="0"> Training was done on the Penn Treebank (Marcus et al., 1993) Wall Street Journal data, sections 02-21. To train the CSCL shallow parser we had first to convert the WSJ data to a flat format that directly provides the phrase annotations. This is done using the &amp;quot;Chunklink&amp;quot; program provided for CoNLL-2000 (Tjong Kim Sang and Buchholz, 2000).</Paragraph>
      <Paragraph position="1"> Testing was done on two types of data. For the first experiment, we used the WSJ section 00 (which contains about 45,000 tokens and 23,500 phrases). The goal here was simply to evaluate the full parser and the shallow parser on text that is similar to the one they were trained on.</Paragraph>
      <Paragraph position="2"> 1We note that some of the variations in the results are due to variations in experimental methodology rather than parser's quality. For example, in [KM00], rather than learning a classifier for each of the a10a11a10 different phrases, a discriminator is learned for each of the a12a14a13a16a15</Paragraph>
      <Paragraph position="4"> phrase pairs which, statistically, yields better results. [Hal00] also uses a18 different parsers and reports the results of some voting mechanism on top of these.</Paragraph>
      <Paragraph position="5"> Our robustness test (section 3.2) makes use of section 4 in the Switchboard (SWB) data (which contains about 57,000 tokens and 17,000 phrases), taken from Treebank 3. The Switchboard data contains conversation records transcribed from phone calls. The goal here was two fold. First, to evaluate the parsers on a data source that is different from the training source. More importantly, the goal was to evaluate the parsers on low quality data and observe the absolute performance as well as relative degradation in performance. null The following sentence is a typical example of the SWB data.</Paragraph>
      <Paragraph position="7"> The fact that it has some missing words, repeated words and frequent interruptions makes it a suitable data to test robustness of parsers.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.3 Representation
</SectionTitle>
      <Paragraph position="0"> We had to do some work in order to unify the input and output representations for both parsers.</Paragraph>
      <Paragraph position="1"> Both parsers take sentences annotated with POS tags as their input. We used the POS tags in the WSJ and converted both the WSJ and the SWB data into the parsers' slightly different input formats. We also had to convert the outcomes of the parsers in order to evaluate them in a fair way.</Paragraph>
      <Paragraph position="2"> We choose CoNLL-2000's chunking format as our standard output format and converted Collins' parser outcome into this format.</Paragraph>
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.4 Performance Measure
</SectionTitle>
      <Paragraph position="0"> The results are reported in terms of precision, recall, and a5a19a6a21a20a23a22a25a24 a2a27a26 as defined below:  We have used the evaluation procedure of CoNLL-2000 to produce the results below. Although we do not report significance results here, note that all experiments were done on tens of thousands of instances and clearly all differences and ratios measured are statistically significant.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Experimental Results
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Performance
</SectionTitle>
      <Paragraph position="0"> We start by reporting the results in which we compare the full parser and the shallow parser on the &amp;quot;clean&amp;quot; WSJ data. Table 2 shows the results on identifying all phrases -- chunking in CoNLL-2000 (Tjong Kim Sang and Buchholz, 2000) terminology. The results show that for the tasks of identifying phrases, learning directly, as done by the shallow parser outperforms the outcome from the full parser.</Paragraph>
      <Paragraph position="1">  fication (chunking) for the full and the shallow parser on the WSJ data. Results are shown for an (weighted) average of 11 types of phrases as well as for two of the most common phrases, NP and  Next, we compared the performance of the parsers on the task of identifying atomic phrases2. Here, again, the shallow parser exhibits significantly better performance. Table 3 shows the results of extracting atomic phrases.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Robustness
</SectionTitle>
      <Paragraph position="0"> Next we present the results of evaluating the robustness of the parsers on lower quality data. Table 4 describes the results of evaluating the same parsers as above, (both trained as before on the 2As a side note -- the fact that the same program could be trained to recognize patterns of different level in such an easy way, only by changing the annotations of the training data, could also be viewed as an advantage of the shallow parsing paradigm.</Paragraph>
      <Paragraph position="1">  identification on the WSJ data. Results are shown for an (weighted) average of 11 types of phrases as well as for the most common phrase, NP. VP occurs very infrequently as an atomic phrase.</Paragraph>
      <Paragraph position="2">  dent that on this data the difference between the performance of the two parsers is even more significant. null  call for phrase identification (chunking) on the Switchboard data. Results are shown for an (weighted) average of 11 types of phrases as well as for two of the most common phrases, NP, VP.  This is shown more clearly in Table 5 which compares the relative degradation in performance each of the parsers suffers when moving from the WSJ to the SWB data (Table 2 vs. Table 4). While the performances of both parsers goes down when they are tested on the SWB, relative to the WSJ performance, it is clear that the shallow parser's performance degrades more gracefully. These results clearly indicate the higher-level robustness of the shallow parser.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.3 Discussion
</SectionTitle>
      <Paragraph position="0"> Analyzing the results shown above is outside the scope of this short abstract. We will only provide one example that might shed some light on the reasons for the more significant degradation in the results of the full parser. Table 6 exhibits the results of chunking as given by Collins' parser. The four columns are the original words, POS tags, and the phrases -- encoded using the BIO scheme  (B- beginning of phrase; I- inside the phrase; Ooutside the phrase) -- with the true annotation and Collins' annotation.</Paragraph>
      <Paragraph position="1"> The mistakes in the phrase identification (e.g., in &amp;quot;word processing applications&amp;quot;) seem to be a result of assuming, perhaps due to the &amp;quot;um&amp;quot; and additional punctuation marks, that this is a separate sentence, rather than a phrase. Under this assumption, the full parser tries to make it a complete sentence and decides that &amp;quot;processing&amp;quot; is a &amp;quot;verb&amp;quot; in the parsing result. This seems to be a typical example for mistakes made by the full parser.</Paragraph>
    </Section>
  </Section>
</Paper>