<?xml version="1.0" standalone="yes"?> <Paper uid="A00-1039"> <Title>Unsupervised Discovery of Scenario-Level Patterns for Information Extraction</Title> <Section position="7" start_page="285" end_page="287" type="evalu"> <SectionTitle> 4 Results </SectionTitle> <Paragraph position="0"> An objective measure of goodness of a pattern is not trivial to establish, since the patterns cannot be used for extraction directly, without being properly incorporated into the knowledge base. Thus, the discovery procedure does not lend itself easily to MUC-style evaluations, since a pattern lacks information about which events it induces and which slots its arguments should fill.</Paragraph> <Paragraph position="1"> However, it is possible to apply some objective measures of performance. One way we evaluated the system is by noting that in addition to growing the pattern set, the procedure also grows the relevance of documents. The latter can be objectively evaluated. We used a test corpus of 100 MUC-6 formal training documents (which were included in the main development corpus of about 6000 documents) plus another 150 documents picked at random from the main corpus and judged by hand. These judgements constituted the ground truth and were used only for evaluation (not in the discovery procedure).</Paragraph> <Section position="1" start_page="285" end_page="285" type="sub_section"> <SectionTitle> 4.1 Text Filtering </SectionTitle> <Paragraph position="0"> Figure 1 shows the recall/precision measures with respect to the test corpus of 250 documents, over a span of 60 generations, starting with the seed set in table 3.4. The seed patterns matched 184 of the 5963 documents, yielding an initial recall of .11 and precision of .93; by the last generation the procedure searched through 982 documents with non-zero relevance, and ended with .80 precision and .78 recall. This facet of the discovery procedure is closely related to the MUC &quot;text-filtering&quot; sub-task, where the systems are judged at the level of documents rather than event slots. It is interesting to compare the results with other MUC-6 participants, shown anonymously in figure 2. Considering recall and precision separately, the discovery procedure attains values comparable to those achieved by some of the participants, all of which were either heavily-supervised or manually coded systems. It is important to bear in mind that the discovery procedure had no benefit of training material, or any information beyond the seed pattern set.</Paragraph> <Paragraph position="2"> [Figure 1: recall and precision on the test corpus over successive generations of discovery, Management Succession scenario.] </Paragraph> </Section> <Section position="2" start_page="285" end_page="285" type="sub_section"> <SectionTitle> 4.2 Choice of Test Corpus </SectionTitle> <Paragraph position="0"> Figure 2 shows two evaluations of our discovery procedure, tested against the original MUC-6 corpus of 100 documents, and against our test corpus, which consists of an additional 150 documents judged manually.
The two plots in the figure show a slight difference in results, indicating that in some sense the MUC corpus was more &quot;random&quot;, or that our expanded corpus was somewhat skewed in favor of more common patterns that the system is able to find more easily.</Paragraph> </Section> <Section position="3" start_page="285" end_page="286" type="sub_section"> <SectionTitle> 4.3 Choice of Evaluation Metric </SectionTitle> <Paragraph position="0"> The graphs shown in Figures 1 and 2 are based on an &quot;objective&quot; measure we adopted during the experiments. This is the same measure of relevance used internally by the discovery procedure on each iteration (relative to the &quot;truth&quot; of the relevance scores of the previous iteration), and is not quite the standard measure used for text filtering in IR. According to this measure, the system gets a score for each document based on the relevance which it assigned to that document.</Paragraph> <Paragraph position="1"> Thus, if the system assigned relevance of X percent to a relevant document, it received only X percent on the recall score for classifying that document correctly. Similarly, if the system assigned relevance Y to an irrelevant document, it was penalized only for the mis-classified Y percent on the precision score. To make our results more comparable to those of other MUC competitors, we chose a cut-off point and forced the system to make a binary relevance decision on each document. The cut-off of 0.5 seemed optimal from empirical observations. Figure 3 shows a noticeable improvement in scores when using our continuous, &quot;objective&quot; measure vs. the cut-off measure, with the entire graph essentially translated to the right for a gain of almost 10 percentage points of recall.</Paragraph> </Section>
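A plausible formalization of this graded scoring, in our own notation (rel(d) denotes the relevance the procedure assigns to document d, R the set of ground-truth relevant documents; the normalization over all scored documents is an assumption on our part, not stated above):

% A relevant document contributes only rel(d) to recall; an irrelevant one
% costs rel(d) against precision (assumed normalizations).
\[
  \mathrm{recall}_{\mathrm{graded}} = \frac{\sum_{d \in R} \mathrm{rel}(d)}{|R|},
  \qquad
  \mathrm{precision}_{\mathrm{graded}} = \frac{\sum_{d \in R} \mathrm{rel}(d)}{\sum_{d} \mathrm{rel}(d)}
\]
% Binary variant used for comparison with other MUC systems: threshold at 0.5.
\[
  S = \{\, d : \mathrm{rel}(d) \ge 0.5 \,\}, \qquad
  \mathrm{recall} = \frac{|S \cap R|}{|R|}, \qquad
  \mathrm{precision} = \frac{|S \cap R|}{|S|}
\]

Under this reading, a truly relevant document assigned relevance 0.7 contributes 0.7 rather than 1 to the recall numerator, matching the X-percent description above; the thresholded variant corresponds to the 0.5 cut-off measure compared in Figure 3.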
<Section position="4" start_page="286" end_page="287" type="sub_section"> <SectionTitle> 4.4 Evaluating Patterns </SectionTitle> <Paragraph position="0"> Another effective, if simple, measure of performance is to count how many of the patterns the procedure found, comparing them with those used by an extraction engine which was manually constructed for the same task. Our MUC-6 system used approximately 75 clause-level patterns, with 30 distinct verbal heads. In one conservative experiment, we observed that the discovery procedure found 17 of these verbs, or 57%. However, it also found at least 8 verbs the manual system lacked, which seemed relevant to the scenario:
company-bring-person-[as+officer]
person-come-[to+company]-[as+officer]
person-rejoin-company-[as+officer]
person-{return, continue, remain, stay}-[as+officer]
person-pursue-interest
(The bracketed constituents are outside of the central SVO triplet; they are included here for clarity.) At the risk of igniting a philosophical debate over what is or is not relevant to a scenario, we note that the first four of these verbs are evidently essential to the scenario in the strictest definition, since they imply changes of post. The next three are &quot;staying&quot; verbs, and are actually also needed, since the higher-level inferences required for tracking events and long-range merging over documents require knowledge of persons occupying posts, rather than only assuming or leaving them. The most curious one is &quot;person-pursue-interest&quot;; surprisingly, it too is useful, even in the strictest MUC sense, cf. (MUC, 1995). Systems are judged on filling a slot called &quot;other-organization&quot;, indicating from or to which company the person came or went.</Paragraph> <Paragraph position="1"> This pattern is consistently used in text to indicate that the person left to pursue other, undisclosed interests, the knowledge of which relieves the system of the need to seek other information in order to fill this slot. This is to say that strict evaluation is elusive here.</Paragraph> </Section> </Section> </Paper>