<?xml version="1.0" standalone="yes"?>
<Paper uid="M92-1005">
  <Title>TEMPLATE Lazy Merger</Title>
  <Section position="5" start_page="71" end_page="75" type="evalu">
    <SectionTitle>
1.5 Results
</SectionTitle>
    <Paragraph position="0"> As discussed in the preceding section, we expected the single-sentence messages to show less template overgeneration than the multi-sentence messages (1ST vs. 1MT and NST vs. 2MT).</Paragraph>
    <Paragraph position="1"> However, exactly the opposite occurred: the median overgeneration score (ALL-TEMPLATES, all systems) for 1ST was 77%, compared to 48% for 1MT (and, though not directly comparable, 56% for 2MT; see footnote 6). These relative results held for the top 8 systems as well. These results are shown in Figure 5; the stripe indicates the median, the dark region encompasses the middle two quartiles, and the brackets indicate the range of the data. Outliers are plotted as additional lines. The overall results are summarized in Table 1. We conclude that problems in relevance filtering for the 1ST messages vastly overshadowed any effect of lazy merger problems.</Paragraph>
    <Paragraph position="2">  Footnote 6: Note that we could not include the overgeneration result for the NST set, because these values were measured on partial messages, invalidating all scores other than MATCHED-ONLY.</Paragraph>
    <Paragraph position="3">  The other hypothesis associated with lazy merger was missing slot fills, measured on the MATCHED-ONLY data (which allows us to use all four test subsets). Table 2 shows &quot;undergeneration&quot; for these four test sets, where undergeneration is defined as Missing/Possible. In this case, the results are consistent with our hypothesis of lazy merger. However, it turns out that they are equally consistent with another hypothesis, namely that the number of missing slot fills will be correlated with the number of possible slots per template. Since templates generated from a single clause are typically much more sparse than templates generated from multiple clauses, this appears to be at least as good an explanation of the observed results. The second row of Table 2 shows the average number of slot fills for each class. Note that NST has the lowest undergeneration score and the fewest slot fills, followed by 1ST, then 1MT, and finally 2MT.</Paragraph>
    <Paragraph position="4">  For greedy merger, we hypothesized that multi-template messages would show more missing templates, as well as more spurious and incorrect slot fills (comparing 1ST to NST and 1MT to 2MT). Again, the NST test subset could not be used in looking at spurious templates. Comparing 1MT to 2MT, the results were as expected: 1MT had 51% undergeneration (Missing/Possible using the ALL-TEMPLATES figures), and 2MT had 59%, averaged over all of the systems; the difference was more pronounced for the top 8 systems (1MT undergeneration was 38%, 2MT was 49%). The 1ST results were 54% (40% for the top 8 systems), higher than 1MT, perhaps due to losing some templates because of faulty relevance filtering. These figures are shown in  The second prediction about greedy merger concerned incorrect slot fills, resulting from combining fills from two different clauses. This was calculated by dividing the number of incorrect fills by the number of actual fills, for the MATCHED-ONLY measure. Here the results were negative. The average over all systems showed 1ST equal to NST and 1MT greater than 2MT.</Paragraph>
    <Paragraph position="5"> For the top 8 systems, the difference between 1MT and 2MT disappeared as well. The dominant effect was that the multi-sentence per template sets (1MT, 2MT) had more than twice the number of incorrect fills compared to the single-sentence per template sets (1ST, NST); the figures are given in Table 4. It is unclear how to interpret these results, except to note that there were twice as many fills generated for the 1MT and 2MT sets (10 per template, on average) as for the 1ST and NST sets (around 5 fills per template).</Paragraph>
    <Paragraph position="6"> Finally, we predicted that the 1ST subset would be the easiest, and the 2MT set the hardest overall, measured in terms of the F-score. Here, the effects of the poor performance on the 1ST set were quite striking. For example, Figure 6 shows a plot of the F-score for 1ST vs. the F-score for the whole of TST3. Only 3 systems (Hughes, BBN, NYU) did better on 1ST than on TST3 as  On the other hand, if we plot F-scores for 1MT against F-scores for TST3, the distribution is much more even (see Figure 7). In general, most systems scored substantially better on the 1MT set (39% F-score on ALL-TEMPLATES) than on the 1ST set (28%), contrary to the predictions. However, the score on 1MT was higher than the score on 2MT, as predicted (39% vs. 29%). There was a somewhat smaller effect for the top 8 systems, shown in Table 5 below. Figure 8 shows graphically the relationship of the ALL-TEMPLATES F-scores for the top 8 systems. Five of the eight systems do much better on 1MT, while the other three systems do slightly worse.</Paragraph>
    <Paragraph position="7"> The overall results of these tests are summarized in Figure 9.</Paragraph>
    <Section position="1" start_page="74" end_page="75" type="sub_section">
      <SectionTitle>
1.6 Conclusions
</SectionTitle>
      <Paragraph position="0"> We can draw several conclusions from this experiment. First, the 1ST message subset turned out to be quite anomalous. It was harder than the 1MT set, as seen in the F-scores, as well as in the overgeneration results. This is most likely attributable to a relevance filtering problem.</Paragraph>
      <Paragraph position="1"> The 1ST messages were peculiar in that the single relevant sentence was embedded in a message that was generally focused on something else; the relevant event was only mentioned as background, or in passing. Understandably, the systems had trouble picking out the one relevant sentence amidst a text of otherwise irrelevant information.</Paragraph>
      <Paragraph position="2"> The second finding is that the 2MT subset was indeed harder than the 1MT set; six of the top 8 systems did worse on 2MT than on 1MT, as measured by the ALL-TEMPLATES F-score. It seems possible that at least some of this may be due to greedy merger problems, a reading supported by the somewhat greater template undergeneration for 2MT relative to 1MT.</Paragraph>
      <Paragraph position="3"> Next, a surprising result was the relative consistency of the behavior of the various systems with respect to the message subsets. In general, most results held regardless of whether the results were obtained by averaging across all systems, or over just the top 8 systems. Given the enormous variation in system maturity and performance, this is quite surprising, and leads to the hypothesis that some messages may simply be harder than others, across all systems.</Paragraph>
      <Paragraph position="4"> Finally, at least anecdotally, many systems reported instances of both these problems. It may be that the effects of these discourse-level problems were masked at times by other problems (relevance filtering, for example). Nonetheless, we can conclude that lazy merger and greedy merger are real problems in discourse processing.</Paragraph>
      <Paragraph position="5"> The results of this test suggest several further research directions and possible future adjunct tests. First, the problem of distinguishing between relevant and irrelevant information caused significant performance degradation, as evidenced by the difference between F-scores for MATCHED-ONLY and F-scores for ALL-TEMPLATES. This should be investigated further, possibly by looking at system performance on the irrelevant messages as well.</Paragraph>
      <Paragraph position="6"> Second, it may be worth investigating some measure of the relative difficulty across messages, for example, by computing performance statistics across messages rather than across systems.</Paragraph>
      <Paragraph position="7"> We would expect to see significant variation in these scores, and this might lead us to understand better what constitutes a hard message. Apparently, subset 1ST constituted such a set.</Paragraph>
      <Paragraph position="8"> Third, this paper analyzed the results averaged over systems, with no attempt to compare individual systems. The question remains as to whether these measures will provide some useful diagnostics or insights to individual system developers, although that investigation was beyond the scope of this paper.</Paragraph>
      <Paragraph position="9"> In conclusion, this adjunct test was admittedly crude, with too few messages and many uncontrolled variables. Nonetheless, the test provided new and unexpected insights into some variables affecting system performance. In addition, the adjunct test methodology adopted here is of interest because the test was carried out simply by rescoring various subsets of the original test, thus avoiding the need to conduct a separate test with different input. Also, it was primarily a &quot;within system&quot; test; that is, each system was compared to itself, rather than to other sites. For these reasons, this methodology is worth exploring in the design of future adjunct tests.</Paragraph>
    </Section>
  </Section>
</Paper>