<?xml version="1.0" standalone="yes"?>
<Paper uid="M92-1005">
  <Title>TEMPLATE Lazy Merger</Title>
  <Section position="1" start_page="0" end_page="0" type="metho">
    <SectionTitle>
AN ADJUNCT TEST FOR
DISCOURSE PROCESSING IN MUC-4
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
1.1 Goal of the Adjunct Test
</SectionTitle>
      <Paragraph position="0"> The motivation for this adjunct test came from an exploratory study done by Beth Sundheim during MUC-3. This study showed a degradation in correctness of message processing as the information distribution in the message became more complex, that is, as slot fills were drawn from larger portions of the message and required more discourse processing to extract the information and reassemble it correctly in the required template(s). The study also suggested that systems did worse on messages requiring multiple templates than on single-template messages.</Paragraph>
      <Paragraph position="1"> These observations led us to define the MUC-4 adjunct test to examine two hypotheses related to discourse complexity and expected system performance:</Paragraph>
    </Section>
  </Section>
  <Section position="2" start_page="0" end_page="0" type="metho">
    <SectionTitle>
* The Source Complexity Hypothesis
</SectionTitle>
    <Paragraph position="0"> The more complex the distribution of the source information for filling a given slot or template (the more sentences, and the more widely separated the sentences), the more difficult it will be to process the message correctly.</Paragraph>
  </Section>
  <Section position="3" start_page="0" end_page="69" type="metho">
    <SectionTitle>
* The Output Complexity Hypothesis
</SectionTitle>
    <Paragraph position="0"> The more complex the output (in terms of number of templates), the harder it will be to process the message correctly.</Paragraph>
    <Paragraph position="1"> We began with the assumption that most systems use some variant of the following stages in creating templates:
 1. Relevance filtering to weed out irrelevant portions of a message and flag relevant sentences;
 2. Sentence-level processing to extract information from individual units (clauses, sentences);
 3. Discourse processing to establish co-reference and to merge coreferential events;
 4. Template generation from the underlying sets of events, mapping events into templates.
 In designing the adjunct test, our goal was to focus on the third stage, discourse processing, and to design a test that would measure differences in system performance relative to the complexity of the required discourse processing tasks. However, in complex systems such as these, it is extremely difficult to isolate one stage of processing for testing. There are many things that can cause failure aside from discourse processing: failure to detect relevant events, failure to understand the individual sentence or clause, failure to map the information correctly into the template. Indeed, as discussed below, effects due to faulty relevance filtering masked some of the discourse issues of interest. Nonetheless, the results provide some unexpected and interesting insights into what may cause some messages to be more difficult to process than others.</Paragraph>
    <Section position="1" start_page="67" end_page="67" type="sub_section">
      <SectionTitle>
1.2 To Merge or Not To Merge
</SectionTitle>
      <Paragraph position="0"> In order to design a test, we focused on the event merger problem: deciding whether two clauses describe a single event or distinct events. We can distinguish two possible types of error: Lazy Merger: Two clauses describe a single event and should be merged (at the template level), but the system fails to merge them (see Figure 1). This problem can occur any time a template requires more than one clause to fill the template correctly. Typically, lazy merger results in spurious templates (overgeneration at the template level); it may also result in missing slot fills.</Paragraph>
      <Paragraph position="1"> Greedy Merger: Two clauses describe two different events and should not be merged, but the system merges them. This can happen in particular when a message requires the generation of multiple templates (see Figure 2). Greedy merger typically results in missing templates and possibly in incorrect or spurious slot fills.</Paragraph>
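      <Paragraph> The two error types can be stated as a small decision rule. The following is a minimal illustrative sketch, not part of the original test materials; the function name merger_error and its two boolean inputs (whether the answer key treats the clauses as one event, and whether the system merged them) are hypothetical.

```python
def merger_error(same_event: bool, system_merged: bool) -> str:
    """Classify one merge decision about a pair of clauses.

    same_event:    True if the two clauses describe a single event
                   (hypothetical gold-standard judgment).
    system_merged: True if the system merged the two clauses.
    """
    if same_event and not system_merged:
        # Should have been merged but was not: tends to yield spurious templates.
        return "lazy merger"
    if system_merged and not same_event:
        # Should have stayed separate but was merged: tends to yield missing templates.
        return "greedy merger"
    return "correct"
```

 Under these assumptions, merger_error(True, False) is a lazy merger and merger_error(False, True) is a greedy merger; the other two combinations are correct decisions.</Paragraph>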
    </Section>
    <Section position="2" start_page="67" end_page="69" type="sub_section">
      <SectionTitle>
1.3 Experimental Design
</SectionTitle>
      <Paragraph position="0"> In order to investigate problems caused by lazy merger and greedy merger, we defined two conditions: single sentence vs. multi-sentence source for a template, to test for lazy merger; and single template vs. multi-template output, to test for greedy merger. The cross product of these conditions defines four message subsets (see Figure 3):</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="69" end_page="71" type="metho">
    <SectionTitle>
* 2MT Messages
</SectionTitle>
    <Paragraph position="0"> Generate two templates, each requiring multiple sentences to fill. These messages should be the hardest set, since they will be subject to both lazy merger and greedy merger problems. They should show lazy merger problems relative to the NST set and greedy merger problems relative to the 1MT set.</Paragraph>
    <Paragraph position="1"> We then examined the TST3 message set and found messages to populate each subset.</Paragraph>
    <Paragraph position="2"> The adjunct test thus required four separate scoring runs, one for each subset. A total of 23 messages were involved, 4-8 messages and 6-10 templates per subset (see Appendix 1 for the test set composition). Messages containing optional templates were rejected, and of course irrelevant messages did not fit into any test set. In general, messages that were "mixed" also did not fit into any subset.</Paragraph>
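    <Paragraph> The assignment of messages to the four subsets can be sketched as follows. This is a hedged illustration only: the function assign_subset and its input (the number of source sentences needed for each required template) are hypothetical, and treating NST as "two or more single-sentence templates" and 2MT as "exactly two multi-sentence templates" is an assumption drawn from the subset names. Rejection of messages with optional templates is not modeled.

```python
from typing import List, Optional


def assign_subset(sentences_per_template: List[int]) -> Optional[str]:
    """Return the subset label for a message, or None if it fits no subset.

    sentences_per_template holds, for each required template, the number of
    source sentences needed to fill it (a hypothetical annotation).
    """
    if not all(count >= 1 for count in sentences_per_template):
        raise ValueError("each template needs at least one source sentence")
    n_templates = len(sentences_per_template)
    if n_templates == 0:
        return None  # irrelevant message: no template is required
    single = [count == 1 for count in sentences_per_template]
    if all(single):
        # Every template comes from one sentence.
        return "1ST" if n_templates == 1 else "NST"
    if any(single):
        return None  # "mixed" message: some single-sentence, some multi-sentence
    if n_templates == 1:
        return "1MT"
    if n_templates == 2:
        return "2MT"
    return None  # assumption: more than two multi-sentence templates is out of scope
```

 For example, assign_subset([2, 4]) returns "2MT", while assign_subset([1, 3]) returns None because the message is mixed.</Paragraph>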
    <Paragraph position="3"> Unfortunately, it turned out there were problems with this methodology. The first problem was that there were few instances of templates meeting these specifications, other than the 1MT set. In particular, there were few multi-template messages where all templates were derived from only a single sentence (the NST set). To try to preserve this set, we compromised by scoring just those templates within each message that were generated from single sentences, which in turn meant that we could not use the MATCHED-SPURIOUS or ALL-TEMPLATES measures, since these require scoring all of the templates associated with a given message.</Paragraph>
    <Paragraph position="4"> The second problem had to do with the single-sentence, single-template messages (the 1ST set). It turned out that these messages were rare, and quite different in character from the more common 1MT messages, which generated a template from multiple sentences. Clearly, the 1ST subset posed a particularly hard problem in terms of relevance filtering: how to process the one relevant sentence in the message, in the face of the "noise" of the rest of the message. For this reason, the results on 1ST turned out to be more about relevance filtering than about discourse processing. This is discussed in more detail below.</Paragraph>
    <Section position="1" start_page="69" end_page="71" type="sub_section">
      <SectionTitle>
1.4 Measuring Lazy Merger and Greedy Merger
</SectionTitle>
      <Paragraph position="0"> Using these four message subsets, we then asked how lazy merger and greedy merger would affect the various scores reported by the scoring program. The effects included both slot-level effects (missing slot fills, incorrect or spurious slot fills within the expected template) and template-level effects (spurious templates, missing templates). Slot-level effects could be measured in terms of the MATCHED-ONLY calculations. Missing templates could be measured using the MATCHED-MISSING (or ALL-TEMPLATES) metrics, and spurious templates in terms of the MATCHED-SPURIOUS (or ALL-TEMPLATES) metrics.</Paragraph>
      <Paragraph position="1"> We expected lazy merger to produce extra templates, measured as overgeneration in the MATCHED-SPURIOUS metric (footnote 3). Lazy merger should also lead to missing slot fills, where information from the second event should have been folded into the template but instead led to generation of a new template. This could be measured by slot-level undergeneration, defined as Missing/Possible using the MATCHED-ONLY metric.</Paragraph>
      <Paragraph position="2"> Footnote 3 (opening words missing in the source): ... generation minus MATCHED-ONLY overgeneration. Since MATCHED-ONLY overgeneration measures slot-level overgeneration, the difference would separate out only the template-level overgeneration. However, in the measurements below, the ALL-TEMPLATES metric alone was used.</Paragraph>
      <Paragraph position="3"> Since lazy merger problems arise when multiple clauses/sentences contain information, redundancy might offset some of these problems. If the same piece of information were to occur in several places, this would increase the probability of recall on that slot. This might also have an effect on precision, by increasing the number of correctly filled slots relative to those filled incorrectly.</Paragraph>
      <Paragraph position="4"> Greedy merger could result in lower recall at the template level, because it would produce too few templates, each with too much information in it (spurious or incorrect fills). The missing templates would cause undergeneration, namely a lower ratio of filled slots to possible slots in the MATCHED-MISSING or ALL-TEMPLATES measures, and a corresponding decrease in recall.</Paragraph>
      <Paragraph position="5"> Greedy merger could also result in incorrect fills, when fills from two clauses are incorrectly combined in a single slot. This could be measured by the number of incorrect slot fills over the number of actual fills in the MATCHED-ONLY data.</Paragraph>
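      <Paragraph> The two slot-level ratios named in the text, undergeneration (Missing/Possible) and incorrect fills over actual fills, can be computed as in the sketch below. The function names are hypothetical, the counts in the example are invented for illustration, and returning 0.0 for an empty denominator is an assumption rather than documented scorer behavior.

```python
def ratio(numerator: int, denominator: int) -> float:
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    # Assumption: an empty denominator is reported as 0.0 rather than an error.
    return numerator / denominator if denominator else 0.0


def undergeneration(missing: int, possible: int) -> float:
    """Slot-level undergeneration: missing fills over possible fills."""
    return ratio(missing, possible)


def incorrect_rate(incorrect: int, actual: int) -> float:
    """Incorrect slot fills over actual (system-produced) fills."""
    return ratio(incorrect, actual)


# Invented counts, for illustration only.
example_under = undergeneration(missing=3, possible=12)
example_incorrect = incorrect_rate(incorrect=2, actual=8)
```

 With these invented counts both ratios come out to 0.25; in the adjunct test the inputs would be taken from the MATCHED-ONLY rows of the score report.</Paragraph>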
      <Paragraph position="6"> Failure to filter irrelevant clauses could affect all the results by providing additional events which could be made into (spurious) templates or merged incorrectly. Spurious templates cause overgeneration and loss of precision (measured in MATCHED-SPURIOUS or ALL-TEMPLATES), and incorrect merger of events can cause spurious or incorrect slot fills (lower precision and possibly lower recall in MATCHED-ONLY).</Paragraph>
      <Paragraph position="7"> Figure 4 illustrates the relation of the four test subsets, and the hypothesized findings. Note that we compare sets 1ST vs. 1MT and NST vs. 2MT for issues of lazy merger; and sets 1ST vs. NST and 1MT vs. 2MT for greedy merger. Finally, we expect 1ST to show higher precision and recall (higher F-score) than 2MT.</Paragraph>
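      <Paragraph> The comparison design can be written down as data plus a simple check. This is an illustrative sketch under stated assumptions: the names COMPARISONS, f_score and hypothesis_holds are hypothetical, the F-score is taken to be the usual combination of precision and recall with weight beta, and listing the subset expected to be easier first in each pair is our reading of the hypotheses.

```python
from typing import Dict, List, Tuple

# Subset pairs compared for each merger problem, as listed in the text.
# Assumption: the first member of each pair is the one expected to be easier.
COMPARISONS: Dict[str, List[Tuple[str, str]]] = {
    "lazy merger": [("1ST", "1MT"), ("NST", "2MT")],
    "greedy merger": [("1ST", "NST"), ("1MT", "2MT")],
}


def f_score(precision: float, recall: float, beta: float = 1.0) -> float:
    """Combine precision and recall; beta = 1.0 weights them equally."""
    if precision == 0.0 and recall == 0.0:
        return 0.0  # avoid division by zero when nothing was correct
    b2 = beta * beta
    return (b2 + 1.0) * precision * recall / (b2 * precision + recall)


def hypothesis_holds(scores: Dict[str, float], easier: str, harder: str) -> bool:
    """True when the subset expected to be easier scores strictly higher."""
    return scores[easier] > scores[harder]
```

 Given a mapping from subset label to F-score for one system, each pair in COMPARISONS can be passed to hypothesis_holds, and the final expectation in the text corresponds to hypothesis_holds(scores, "1ST", "2MT").</Paragraph>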
    </Section>
  </Section>
</Paper>