<?xml version="1.0" standalone="yes"?>
<Paper uid="N06-1006">
  <Title>Learning to recognize features of valid textual entailments</Title>
  <Section position="8" start_page="45" end_page="46" type="evalu">
    <SectionTitle>
5 Evaluation
</SectionTitle>
    <Paragraph position="0"> We present results based on the First PASCAL RTE Challenge, which used a development set containing 567 pairs and a test set containing 800 pairs. The data sets are balanced to contain equal numbers of yes and no answers. The RTE Challenge recommended two evaluation metrics: raw accuracy and confidence weighted score (CWS). The CWS is computed as follows: for each positive integer k up to the size of the test set, we compute accuracy over the k most confident predictions. The CWS is then the average, over k, of these partial accuracies. Like raw accuracy, it lies in the interval [0, 1], but it will exceed raw accuracy to the degree that predictions are well-calibrated.</Paragraph>
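The CWS computation described above can be sketched in a few lines of Python (a minimal illustration of the metric as defined in the text; the function name and input format are our own):

```python
def cws(predictions):
    """Confidence weighted score (CWS) as defined for the First
    PASCAL RTE Challenge.

    predictions: (confidence, correct) pairs, where correct is True
    if the system's yes/no answer matches the gold label.
    """
    # Rank predictions from most to least confident.
    ranked = sorted(predictions, key=lambda p: p[0], reverse=True)
    correct_so_far, partial_sum = 0, 0.0
    for k, (_, correct) in enumerate(ranked, start=1):
        correct_so_far += int(correct)
        partial_sum += correct_so_far / k  # accuracy over the top k
    # Average the top-k accuracies over all k.
    return partial_sum / len(ranked)
```

If the most confident predictions are also the most accurate, CWS exceeds raw accuracy; if confidence is anticorrelated with correctness, CWS falls below it.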
    <Paragraph position="1"> Several characteristics of the RTE problems should be emphasized. Examples are derived from a broad variety of sources, including newswire; therefore systems must be domain-independent. The inferences required are, from a human perspective, fairly superficial: no long chains of reasoning are involved. However, there are &quot;trick&quot; questions expressly designed to foil simplistic techniques. The definition of entailment is informal and approximate: whether a competent speaker with basic knowledge of the world would typically infer the hypothesis from the text. Entailments will certainly depend on linguistic knowledge, and may also depend on world knowledge; however, the scope of required world knowledge is left unspecified.4</Paragraph>
    <Paragraph position="2"> Despite the informality of the problem definition, human judges exhibit very good agreement on the RTE task, with an agreement rate of 91-96% (Dagan et al., 2005). In principle, then, the upper bound for machine performance is quite high. In practice, however, the RTE task is exceedingly difficult for computers. Participants in the first PASCAL RTE workshop reported accuracy from 49% to 59%, and CWS from 50.0% to 69.0% (Dagan et al., 2005).</Paragraph>
    <Paragraph position="3"> Table 2 shows results for a range of systems and testing conditions. We report accuracy and CWS on each RTE data set. The baseline for all experiments is random guessing, which attains 50% accuracy in expectation on these balanced data sets. We show comparable results from recent systems based on lexical similarity (Jijkoun and de Rijke, 2005), graph alignment (Haghighi et al., 2005), weighted abduction (Raina et al., 2005), and a mixed system including theorem proving (Bos and Markert, 2005).</Paragraph>
    <Paragraph position="4"> We then show results for our system under several different training regimes. The row labeled &quot;alignment only&quot; describes experiments in which all features except the alignment score are turned off. We predict entailment just in case the alignment score exceeds a threshold which is optimized on development data. &quot;Hand-tuning&quot; describes experiments in which all features are on, but no training occurs; rather, weights are set by hand, according to human intuition. Finally, &quot;learning&quot; describes experiments in which all features are on, and feature weights are trained on the development data. (Footnote 4: Each RTE problem is also tagged as belonging to one of seven tasks. Previous work (Raina et al., 2005) has shown that conditioning on task can significantly improve accuracy. In this work, however, we ignore the task variable, and none of the results shown in table 2 reflect optimization by task.)</Paragraph>
    <Paragraph position="5"> The figures reported for development data performance therefore reflect overfitting; while such results are not a fair measure of overall performance, they can help us assess the adequacy of our feature set: if our features have failed to capture relevant aspects of the problem, we should expect poor performance even when overfitting. It is therefore encouraging to see CWS above 70%. Finally, the figures reported for test data performance are the fairest basis for comparison. These are significantly better than our results for alignment only (Fisher's exact test, p &lt; 0.05), indicating that we gain real value from our features. However, the gain over comparable results from other teams is not significant at the p &lt; 0.05 level.</Paragraph>
    <Paragraph position="6"> A curious observation is that the results for hand-tuned weights are as good as or better than the results for learned weights. A possible explanation runs as follows. Most of the features represent high-level patterns which arise only occasionally. Because the training data contains only a few hundred examples, many features are active in just a handful of instances; their learned weights are therefore quite noisy. Indeed, a feature which is expected to favor entailment may even wind up with a negative weight: the modal feature weak yes is an example.</Paragraph>
    <Paragraph position="7"> As shown in table 3, the learned weight for this feature was strongly negative -- but this resulted from a single training example in which the feature was active but the hypothesis was not entailed. In such cases, we shouldn't expect good generalization to test data, and human intuition about the &quot;value&quot; of specific features may be more reliable.</Paragraph>
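This failure mode is easy to reproduce with a toy logistic-regression-style learner (entirely hypothetical code and data, for illustration only): a feature that fires in a single negative training example is driven to a negative weight, whatever our intuition says it should indicate.

```python
import math

def train_logistic(data, lr=0.5, epochs=500):
    """Batch gradient descent on the logistic loss.

    data: list of (feature_vector, label) pairs with labels in {0, 1}.
    """
    dim = len(data[0][0])
    w = [0.0] * dim
    for _ in range(epochs):
        grad = [0.0] * dim
        for x, y in data:
            z = sum(wi * xi for wi, xi in zip(w, x))
            p = 1.0 / (1.0 + math.exp(-z))
            for i, xi in enumerate(x):
                grad[i] += (p - y) * xi
        w = [wi - lr * g / len(data) for wi, g in zip(w, grad)]
    return w

# Feature layout: [bias, rare].  The rare feature is active in exactly
# one example, which happens to be labeled "no entailment" (0), so its
# gradient always pushes its weight down and it ends up negative.
data = ([([1.0, 0.0], 1)] * 5
        + [([1.0, 0.0], 0)] * 4
        + [([1.0, 1.0], 0)])
w_bias, w_rare = train_logistic(data)
```

With only one active instance, the learned weight reflects that instance's label and little else, which is the noise the paragraph above describes.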
    <Paragraph position="8"> Table 3 shows the values learned for selected feature weights. As expected, the features added adjunct in all context, modal yes, and text is factive were all found to be strong indicators of entailment, while date insert, date modifier insert, widening from text to hyp all indicate lack of entailment. Interestingly, text has neg marker and text &amp; hyp diff polarity were also found to disfavor entailment; while this outcome is sensible, it was not anticipated or designed.</Paragraph>
  </Section>
</Paper>