<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-0817">
  <Title>Semantic Role Labelling with Similarity-Based Generalization Using EM-based Clustering</Title>
  <Section position="7" start_page="0" end_page="0" type="concl">
    <SectionTitle>
6 Results and Discussion
</SectionTitle>
    <Paragraph position="0"> We first give the final results of our systems on the test set according to the official evaluation software.</Paragraph>
    <Paragraph position="1"> Then we discuss detailed results on a development set we randomly extracted from the training data.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
6.1 Final Results
</SectionTitle>
      <Paragraph position="0"> We submitted the results of two models. One was produced using the maximum entropy learner, including all features of Sec. 3 and with the three most helpful generalisation techniques (EM head lemma, EM path, and Peripherals). For the second model we used the MBL learner trained on all features, with no additional training data1. The performance of the two models is shown in Table 1.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
6.2 Detailed Results
</SectionTitle>
      <Paragraph position="0"> For a detailed evaluation, we randomly split off 10% of the training data to form development sets. In this section, we report results of two such splits to take chance variation into account.</Paragraph>
      <Paragraph position="1"> For time reasons, this detailed evaluation was performed using our own evaluation software, which is based on our internal constituent-based representation. This software gives the same tendencies (improvements / deteriorations) as the official software, but absolute values differ; so we restrict ourselves to reporting relative figures.</Paragraph>
      <Paragraph position="2"> Basis for Comparison. All following models are compared against a set of basic models trained on all features of Sec. 3. Table 2 gives the results for these models, using our own scoring software.</Paragraph>
      <Paragraph position="3"> Contribution of Features. We computed the contribution of individual features by leaving out each feature in turn. Table 3 shows the results, averaging  over the two splits. The features that contributed most to the performance were the same for both learners: the label assigned by the EM-based model, the phrase type, and whether the path had been seen to lead to a frame element. The relative position to the target helped in one MBL and one Maxent run. Interestingly, the Maxent learner profits from the probability with which the EM-based model assigns its label, while MBL does not.</Paragraph>
      <Paragraph position="4"> Generalisation. To measure the effect of each of the similarity measures listed in Sec. 5, we tested them individually using the Maximum Entropy learner with all features.</Paragraph>
      <Paragraph position="5"> As mentioned above, training instances of one frame were generalised and then added to the training instances of another, retaining only part of the features in the generalisation. Table 4 shows the features retained for each similarity measure, as well as the number of additional instances generated, summed over all frames. We empirically determined the optimal parameter values as: For FN-h (sem) and FN-h (syn), 1 level in the hierarchy; for EM head, a weight threshold of a81 a7a25a86a16a87 , and for EM path, a weight threshold of a81 a7</Paragraph>
      <Paragraph position="7"> Table 5 gives the improvements made over the baseline through adding data gained by each FN hierarchy (sem): a88 10,000 instances head lemma FN hierarchy (syn): a88 10,000 instances phrase type, path, prep., path seen, is subcategorised, voice, target POS Peripherals: a88 55,000 instances head lemma, phrase type, path, prep., path seen, is subcategorised, voice, target POS EM head: a88 1,000,000 instances head lemma EM path: a88 433,000 instances phrase type, mother phrase type, path, path length, prep., path seen, is subcategorised, voice, target POS  generalisation strategy. Results are shown in points F-score and individually for both training/development splits. EM-based clustering proved to be helpful, showing both the highest single improvement (EM path) and the highest consistent improvement (EM head), while all other generalisations show mixed results.</Paragraph>
      <Paragraph position="8"> Combining the three most promising generalisation techniques (Peripherals, EM head, and EM path) led to an improvement of 0.7 points F-score for split 1 and 1.1 points F-score for split 2.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
6.3 Discussion
</SectionTitle>
      <Paragraph position="0"> Feature quality. The features that improved the learners' performance most are EM-based label, phrase type and the &amp;quot;path seen as FE&amp;quot;. The other features did not show much impact for us. The Maxent learner was negatively affected by sentence-level features such as the subcat frame and &amp;quot;is subcategorised&amp;quot;. null Comparing the learners. In a comparable basic setting (all features, no generalisation), the Memory-Based learner easily outperforms the Max-ent learner, according to our scoring scheme. However, the official scoring scheme determines the Memory-based learner's performance at more than 10 points F-score below the Maxent learner. We intend to run the Memory-based learner with generalisation data for a more comprehensive comparison. Generalisation. Gildea and Jurafsky (2002) report an improvement of 1.6% through generalisation, which is roughly comparable to our figures. The two strategies share the common idea of exploiting role similarities, but the realisations are converse: Gildea and Jurafsky manually compact similar frame elements into 18 abstract, frameindependent roles, whereas we keep the roles frame-specific but augment the training data for each by automatically discovered similarities.</Paragraph>
      <Paragraph position="1"> One reason for the disappointing performance of the FrameNet hierarchy-based generalisation strategies may be simply the amount of data, as shown by Table 4: FN-h (sem) and FN-h (syn) each only yield 10,000 additional instances as compared to around 1,000,000 for EM head. That the reliability of the results roughly seems to go up with the number of additional instances generated (Peripherals: ca. 50,000, EM-Path: ca. 400,000) fits this argumentation well.</Paragraph>
      <Paragraph position="2"> The input to the EM path clusters is a tuple of the path, target voice and preposition information.</Paragraph>
      <Paragraph position="3"> In the resulting model, generalisation over voice worked well, yielding clusters containing both active and passive alternations of similar frame elements. However, prepositions were distributed more arbitrarily. While this may indicate problems of clustering with more structured forms of input, it may also just be a consequence of noisy input, as the preposition feature has not had much impact either on the learners' performance.</Paragraph>
      <Paragraph position="4"> The EM head strategy adds large amounts of head lemma instances, which probably alleviates the sparse data problem that makes the head lemma feature virtually useless. Another way of capitalising on this type of information would be to use the FN hierarchy generalisation to derive more input for EM-based clustering and see if this indirect use of generalisation still improves semantic role assignment. Interestingly, the EM head strategy and the EM-based clustering feature, both geared at solving the same sparse data problem, do not cancel each other out. In future work, we will try to combine the EM head strategy with the FrameNet hierarchy to derive more input for the clustering model to see if this can improve the present generalisation results.</Paragraph>
      <Paragraph position="5"> Comparison with CoNLL. We recently studied semantic role labelling in the context of the CoNLL shared task (Baldewein et al., 2004). The two key differences to this study were that the semantic roles in question were PropBank roles and that only shallow information was available. Our system there showed two main differences to the current system: the overall level of accuracy was lower, and EM-based clustering did not improve the performance. While the performance difference is evidently a consequence of only shallow information being available, it remains an interesting open question why EM-based clustering could improve one system, but not the other.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>