<?xml version="1.0" standalone="yes"?> <Paper uid="C04-1018"> <Title>Playing the Telephone Game: Determining the Hierarchical Structure of Perspective and Speech Expressions</Title> <Section position="6" start_page="0" end_page="0" type="evalu"> <SectionTitle> 5 Results and Discussion </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.1 Evaluation </SectionTitle> <Paragraph position="0"> How do we evaluate the performance of an automatic method of determining the hierarchical structure of pse's? Lin (1995) proposes a method for evaluating dependency parses: the score for a sentence is the fraction of correct parent links identified, and the score for the corpus is the average sentence score. Formally, the score for a corpus is \[ \mathrm{score} = \frac{1}{|S|} \sum_{s \in S} \frac{\bigl|\{\, pse \in \mathit{Non\_writer\_pse's}(s) : \mathit{autopar}(pse) = \mathit{parent}(pse) \,\}\bigr|}{\bigl|\mathit{Non\_writer\_pse's}(s)\bigr|} \] where S is the set of all sentences in the corpus, Non_writer_pse's(s) is the set of non-writer pse's in sentence s, parent(pse) is the correct parent of pse, and autopar(pse) is the automatically identified parent of pse.</Paragraph> <Paragraph position="1"> 11The annotators also performed coreference resolution on sources.</Paragraph> <Paragraph position="2"> 12Under certain circumstances, such as paragraph-long quotes, the writer of a sentence will not be the same as the writer of a document. In such sentences, the NRRC corpus contains additional pse's for any other sources besides the writer of the document. Since we are concerned in this work only with one sentence at a time, we discard all such implicit pse's besides the writer of the sentence. Also, in a few cases, more than one pse in a sentence was marked as having the writer as its source. We believe this to be an error and so discarded all but one writer pse.</Paragraph> <Paragraph position="3"> Table 3 (columns: metric, size, heurOne, heurTwo, decTree): &quot;dep&quot; is the dependency score, &quot;perf&quot; is the fraction of sentences whose structure was identified perfectly, and &quot;bin&quot; is the performance of the binary classifier (broken down for positive and negative instances). &quot;Size&quot; is the number of sentences or pse pairs.</Paragraph> <Paragraph position="4"> We also present results using two other (related) metrics. The &quot;perf&quot; metric measures the fraction of sentences whose structure is determined entirely correctly (i.e., &quot;perf&quot;ectly). &quot;Bin&quot; is the accuracy of the binary classifier (with a 0.5 threshold) on the instances created from the test corpus; we also report its performance on positive and negative instances separately.</Paragraph>
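<Paragraph position="5"> To make these metrics concrete, here is a minimal Python sketch (our illustration, not code from the paper; the PSE container and its field names are assumptions):
```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass(frozen=True)
class PSE:
    """A perspective or speech expression with gold and predicted parents."""
    text: str
    parent: Optional["PSE"] = None   # correct parent (None only for the writer's pse)
    autopar: Optional["PSE"] = None  # automatically identified parent

def sentence_score(non_writer_pses: List[PSE]) -> float:
    """Fraction of non-writer pse's whose predicted parent is correct.

    Assumes the sentence contains at least one non-writer pse.
    """
    correct = sum(1 for p in non_writer_pses if p.autopar is p.parent)
    return correct / len(non_writer_pses)

def corpus_score(sentences: List[List[PSE]]) -> float:
    """Lin-style corpus score: the average of the per-sentence scores."""
    return sum(sentence_score(s) for s in sentences) / len(sentences)

def perf(sentences: List[List[PSE]]) -> float:
    """'perf': fraction of sentences whose structure is entirely correct."""
    return sum(sentence_score(s) == 1.0 for s in sentences) / len(sentences)
```
</Paragraph>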
</Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.2 Results </SectionTitle> <Paragraph position="0"> We compare the learning-based approach (decTree) to the heuristic-based approaches introduced in Section 3: heurOne assumes that all pse's are attached to the writer's implicit pse; heurTwo is the parse-based heuristic that relies solely on the dominance relation.13</Paragraph> <Paragraph position="1"> We use 10-fold cross-validation on the evaluation data to generate training and test data (although the heuristics, of course, do not require training).</Paragraph> <Paragraph position="2"> The results of the decision tree method and the two heuristics are presented in Table 3.</Paragraph> <Paragraph position="3"> 13That is, heurTwo attaches a pse to the pse most immediately dominating it in the dependency tree. If no other pse dominates it, the pse is attached to the writer's pse.</Paragraph>
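<Paragraph position="4"> For concreteness, the two baselines can be sketched as follows (our illustration, not code from the paper; the `ancestors` interface for walking up the dependency parse is an assumption):
```python
def heur_one(pses, writer_pse):
    """heurOne: attach every pse to the writer's implicit pse."""
    return {pse: writer_pse for pse in pses}

def heur_two(pses, writer_pse, ancestors):
    """heurTwo: attach each pse to the pse most immediately dominating it
    in the dependency parse; if no other pse dominates it, fall back to
    the writer's pse. `ancestors(pse)` is assumed to yield the words that
    dominate `pse`, nearest first.
    """
    parents = {}
    pse_set = set(pses)
    for pse in pses:
        parents[pse] = writer_pse  # default: no dominating pse found
        for node in ancestors(pse):
            if node in pse_set:
                parents[pse] = node  # nearest dominating pse wins
                break
    return parents
```
</Paragraph>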
</Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.3 Discussion </SectionTitle> <Paragraph position="0"> Encouragingly, our machine learning method uniformly and significantly14 outperforms the two heuristic methods, on all metrics and on sentences with any number of pse's. The difference is most striking in the &quot;perf&quot; metric, which is perhaps the most intuitive. Also, the syntax-based heuristic (heurTwo) significantly15 outperforms heurOne, confirming our intuition that syntax is important for this task.</Paragraph> <Paragraph position="1"> Since the binary classifier sees many more negative instances than positive ones, it is unsurprising that its performance is much better on negative instances. This suggests that we might benefit from machine learning methods for dealing with unbalanced datasets. Examining the errors of the machine learning system on the development set, we see that for half of the pse's with erroneously identified parents, the chosen parent is either the writer's pse or a pse like &quot;said&quot; in sentences 4 and 5 that has scope over the entire sentence. For example: 7. &quot;Our concern is whether persons used to the role of policy implementors can objectively assess and critique executive policies which impinge on human rights,&quot; said Ramdas.</Paragraph> <Paragraph position="2"> Our model chose the parent of &quot;assess and critique&quot; to be &quot;said&quot; rather than &quot;concern.&quot; We also see from Table 4 that the model performs more poorly on sentences with more pse's. We believe this reflects a weakness in our decision to combine binary decisions: the model has learned that, in general, a &quot;said&quot; or writer's pse (near the root of the structure) is likely to be the parent, while it sees many fewer examples of pse's such as &quot;concern&quot; that lie in the middle of the tree.</Paragraph> <Paragraph position="3"> Although we have ignored the distinction throughout this paper, error analysis suggests that speech event pse's behave differently from private state pse's with respect to how closely syntax reflects their hierarchical structure. It may behoove us to add features that allow the model to take this into account. Other sources of error include erroneous sentence boundary detection, parenthetical statements (which the parser does not treat correctly for our purposes), other parse errors, partial quotations, and some errors in the annotation.</Paragraph> <Paragraph position="4"> Examining the learned trees is difficult because of their size, but looking at one tree to depth three reveals a fairly intuitive model. Ignoring the probabilities, the tree decides that pse_parent is the parent of pse_target if and only if pse_parent is the writer's pse (and pse_target is not in quotation marks), or pse_parent is the word &quot;said.&quot; For all the trees learned, the root feature was either the writer-pse test or the partial-parse-based domination feature.</Paragraph> <Paragraph position="5"> 14p < 0.01, using an approximate randomization test with 9,999 trials. See (Eisner, 1996, page 17) and (Chinchor et al., 1993, pages 430-433) for descriptions of this method.</Paragraph> <Paragraph position="6"> 15Using the same test as above, p < 0.01, except for the performance on sentences with more than 5 pse's, where p < 0.02 because of the small amount of data.</Paragraph>
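<Paragraph position="7"> Read as code, this depth-three rule amounts to the following predicate (a paraphrase for illustration only; the attribute names are ours, not the paper's features):
```python
def is_parent(pse_parent, pse_target):
    """The decision encoded by the learned tree at depth three,
    ignoring the attached probabilities."""
    if pse_parent.is_writer and not pse_target.in_quotes:
        return True                    # writer's pse claims anything outside quotes
    return pse_parent.text == "said"   # "said" takes scope over the rest
```
</Paragraph> </Section> </Section> </Paper>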