<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-0845">
  <Title>Semantic Role Labeling with Boosting, SVMs, Maximum Entropy, SNOW, and Decision Lists</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Experimental Features
</SectionTitle>
    <Paragraph position="0"> This section describes the features that were used for the SRL task. Since the non-restricted SRL task is essentially a classification task, each parse constituent that was known to correspond to a frame element was considered to be a sample.</Paragraph>
    <Paragraph position="1"> The features that we used for each sample have been previously shown to be helpful for the SRL task (Gildea and Jurafsky, 2002). Some of these features can be obtained directly from the Framenet annotations: The name of the frame.</Paragraph>
    <Paragraph position="2"> The lexical unit of the sentence -- i.e. the lexical identity of the target word in the sentence. The general part-of-speech tag of the target word.</Paragraph>
    <Paragraph position="3"> The &amp;quot;phrase type&amp;quot; of the constituent -- i.e. the syntactic category (e.g. NP, VP) that the constituent falls into.</Paragraph>
    <Paragraph position="4"> The &amp;quot;grammatical function&amp;quot; (e.g. subject, object, modifier, etc) of the constituent, with respect to the target word.</Paragraph>
    <Paragraph position="5"> The position (e.g. before, after) of the constituent, with respect to the target word.</Paragraph>
    <Paragraph position="6"> In addition to the above features, we also extracted a set of features which required the use of some statistical NLP tools: Transitivity and voice of the target word --The sentence was first part-of-speech tagged and chunked with the fnTBL transformation-based learning tools (Ngai and Florian, 2001). Simple heuristics were then used to deduce the transitivity voice of the target word.</Paragraph>
    <Paragraph position="7"> Head word (and its part-of-speech tag) of the constituent -- After POS tagging, a syntactic parser (Collins, 1997) was then used to obtain the parse tree for the sentence. The head word (and the POS tag of the head word) of</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
Association for Computational Linguistics
</SectionTitle>
      <Paragraph position="0"> for the Semantic Analysis of Text, Barcelona, Spain, July 2004 SENSEVAL-3: Third International Workshop on the Evaluation of Systems the syntactic parse constituent whose span corresponded most closely to the candidate constituent was then assumed to be the head word of the candidate constituent.</Paragraph>
      <Paragraph position="1"> The resulting training data set consisted of 51,366 constituent samples with a total of 151 frame element types. These ranged from &amp;quot;Descriptor&amp;quot; (3520 constituents) to &amp;quot;Baggage&amp;quot; and &amp;quot;Carrier&amp;quot; (1 constituent each). This training data was randomly partitioned into a 80/20 &amp;quot;development training&amp;quot; and &amp;quot;validation&amp;quot; set.</Paragraph>
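A minimal sketch of the random 80/20 partition; the helper name and fixed seed are choices of this example, not the paper's:

```python
import random

def split_dev_validation(samples, dev_fraction=0.8, seed=0):
    """Randomly partition samples into devtrain and validation sets."""
    rng = random.Random(seed)
    shuffled = samples[:]            # copy so the original order is preserved
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * dev_fraction)
    return shuffled[:cut], shuffled[cut:]

# e.g. 51,366 constituent samples -> roughly 41,092 devtrain / 10,274 validation
devtrain, validation = split_dev_validation(list(range(51366)))
print(len(devtrain), len(validation))
```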
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Methodology
</SectionTitle>
    <Paragraph position="0"> The previous section described the features that were extracted for each constituent. This section will describe the experiment methodology as well as the learning systems used to construct the models. null Our systems had originally been trained on the entire development training (devtrain) set, generating one global model per system. However, on closer examination of the task, it quickly became evident that distinguishing between 151 possible outcomes was a difficult task for any system. It was also not clear that there was going to be a lot of information that could be generalized across frame types. We therefore partitioned the data by frame, so that one model would be trained for each frame. (This was also the approach taken by (Gildea and Jurafsky, 2002).) Some of our individual systems tried both approaches; the results are compared in the following subsections. For comparison purposes, a baseline model was constructed by simply classifying all constituents with the most frequently-seen (in the training set) frame element for the frame.</Paragraph>
    <Paragraph position="1"> In total, five individual systems were trained for the SRL task, and four ensemble models were generated by using various combinations of the individual systems. With one exception, all of the individual systems were constructed using off-the-shelf machine learning software. The following subsections describe each system; however, it should be noted that some of the individual systems were not officially entered as competing systems; therefore, their scores are not listed in the final rankings.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Boosting
</SectionTitle>
      <Paragraph position="0"> The most successful of our individual systems is based on boosting, a powerful machine learning algorithm which has been shown to achieve good results on NLP problems in the past. Our system was constructed around the Boostexter soft- null ments boosting on top of decision stumps (decision trees of one level), and was originally designed for text classification. The same system also participated in the Senseval-3 lexical sample tasks for Chinese and English, as well as the Multilingual lexical sample task (Carpuat et al., 2004).</Paragraph>
      <Paragraph position="1"> Table 1 compares the results of training one single overall boosting model (Single) versus training separate models for each frame (Frame). It can be seen that training frame-specific models produces a small improvement over the single model. The frame-specific model was used in all of the ensemble systems, and was also entered into the competition as an individual system (hkpust-boost).</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Support Vector Machines
</SectionTitle>
      <Paragraph position="0"> The second of our individual systems was based on support vector machines, and implemented using the TinySVM software package (Boser et al., 1992).</Paragraph>
      <Paragraph position="1"> Since SVMs are binary classifiers, we used a one-against-all method to reduce the SRL task to a binary classification problem. One model is constructed for each possible frame element and the task of the model is to decide, for a given constituent, whether it should be classified with that frame element. Since it is possible for all the binary classifiers to decide on &amp;quot;NOT-&lt;element&gt;&amp;quot;, the model is effectively allowed to pass on samples that it is not confident about. This results in a very precise model, but unfortunately at a significant hit to recall.</Paragraph>
      <Paragraph position="2"> A number of kernel parameter settings were investigated, and the best performance was achieved with a polynomial kernel of degree 4. The rest of the parameters were left at the default values. Table 2 shows the results of the best SVM model on the validation set. This model participated in the all of the ensemble systems, and was also entered into the competition as an individual system.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.3 Maximum Entropy
</SectionTitle>
      <Paragraph position="0"> The third of our individual systems was based on the maximum entropy model, and implemented on top of the YASMET package (Och, 2002). Like the boosting model, the maximum entropy system also participated in the Senseval-3 lexical sample tasks for Chinese and English, as well as the Multilingual lexical sample task (Carpuat et al., 2004).</Paragraph>
      <Paragraph position="1"> Our maximum entropy models can be classified into two main approaches. Both approaches used the frame-partitioned data. The more conventional approach (&amp;quot;multi&amp;quot;) then trained one model per frame; that model would be responsible for classifying a constituent belonging to that frame with one of several possible frame elements. The second approach (binary) used the same approach as the SVM models, and trained one binary one-against-all classifier for each frame type-frame element combination. (Unlike the boosting models, a single maximum entropy model could not be trained for all possible frame types and elements, since YASMET crashed on the sheer size of the feature space.)</Paragraph>
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
Results
</SectionTitle>
      <Paragraph position="0"> Table 3 shows the results for the maximum entropy models. As would have been expected, the binary model achieves very high levels of precision, but at considerable expense of recall. Both systems were eventually used in the some of the ensemble models but were not submitted as individual contestants. null</Paragraph>
    </Section>
    <Section position="5" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.4 SNOW
</SectionTitle>
      <Paragraph position="0"> The fourth of our individual systems is based on SNOW -- Sparse Network Of Winnows (Mu'noz et al., 1999).</Paragraph>
      <Paragraph position="1"> The development approach for the SNOW models was similar to that of the boosting models. Two main model types were generated: one which generated a single overall model for all the possible frame elements, and one which generated one model per frame type. Due to a bug in the coding which was not discovered until the last minute, however, the results for the frame-separated model were invalidated. The single model system was eventually used in some of the ensemble systems, but not entered as an official contestant. Table 4 shows the results.</Paragraph>
    </Section>
    <Section position="6" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.5 Decision Lists
</SectionTitle>
      <Paragraph position="0"> The final individual system was a decision list implementation contributed from the Swarthmore College team (Wicentowski et al., 2004), which participated in some of the lexical sample tasks.</Paragraph>
      <Paragraph position="1"> The Swarthmore team followed the frame-separated approach in building the decision list models. Table 5 shows the result on the validation set. This system participated in some of the final ensemble systems as well as being an official participant (hkpust-swat-dl).</Paragraph>
    </Section>
    <Section position="7" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.6 Ensemble Systems
</SectionTitle>
      <Paragraph position="0"> Classifier combination, where the results of different models are combined in some way to make a new model, has been well studied in the literature.</Paragraph>
      <Paragraph position="1"> A successful combined classifier can result in the combined model outperforming the best base models, as the advantages of one model make up for the shortcomings of another.</Paragraph>
      <Paragraph position="2"> Classifier combination is most successful when the base models are biased differently. That condition applies to our set of base models, and it was reasonable to make an attempt at combining them.</Paragraph>
      <Paragraph position="3"> Since the performances of our systems spanned a large range, we did not want to use a simple majority vote in creating the combined system. Rather, we used a set of heuristics which trusted the most precise systems (the SVM and the binary maximum entropy) when they made a prediction, or a combination of the others when they did not.</Paragraph>
      <Paragraph position="4"> Table 6 shows the results of the top-scoring combined systems which were entered as official contestants. As expected, the best of our combined systems outperformed the best base model.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Test Set Results
</SectionTitle>
    <Paragraph position="0"> Table 7 shows the test set results for all systems which participated in some way in the official competition, either as part of a combined system or as an individual contestant.</Paragraph>
    <Paragraph position="1"> Model Prec. Recall Attempted svm, boosting, maxent (binary) (hkpolyust-all(a)) 0.874 0.867 99.2% boosting (hkpolyust-boost) 0.859 0.852 0.846% svm, boosting, maxent (binary), DL (hkpolyust-swat(a)) 0.902 0.849 94.1% svm, boosting, maxent (binary), DL, snow (hkpolyust-swat(b)) 0.908 0.846 93.2% svm, boosting, maxent (multi), DL, snow (hkpolyust-all(b)) 0.905 0.846 93.5%  The top-performing system is the combined system that uses the SVM, boosting and the binary implementation of maximum entropy. Of the individual systems, boosting performs the best, even outperforming 3 of the combined systems. The SVM suffers from its high-precision approach, as does the binary implementation of maximum entropy. The rest of the systems fall somewhere in between.</Paragraph>
  </Section>
class="xml-element"></Paper>