<?xml version="1.0" standalone="yes"?>
<Paper uid="N06-2047">
  <Title>Engineering Management The Chinese University of Hong Kong</Title>
  <Section position="6" start_page="186" end_page="187" type="evalu">
    <SectionTitle>
5 Experimental Results
</SectionTitle>
    <Paragraph position="0"> We selected the features used in Quarc (Riloff and Thelen, 2000) to establish the reference performance level. In our experiments, the 24 rules in Quarc are transferred to ME features. For example, the rule: If contains(Q,{start, begin}) and contains(S,{start, begin, since, year}) Then Score(S) += 20 becomes the binary feature: fj(x,y) = 1 (0 &lt; j &lt; 25) if Q is a when question that contains start or begin and C contains start, begin, since or year; fj(x,y) = 0 otherwise.</Paragraph>
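The rule-to-feature conversion above can be sketched as a binary indicator function. A minimal sketch, assuming questions and candidate sentences arrive as lowercased token lists (the representation here is illustrative, not the paper's code):

```python
def quarc_feature_j(question, candidate):
    """Binary ME feature derived from the Quarc WHEN rule quoted above.

    Fires (returns 1) iff the question is a WHEN question containing
    'start' or 'begin', and the candidate sentence contains 'start',
    'begin', 'since' or 'year'. Inputs: lowercased token lists.
    """
    q_triggers = {"start", "begin"}
    c_triggers = {"start", "begin", "since", "year"}
    is_when = bool(question) and question[0] == "when"
    if is_when and q_triggers & set(question) and c_triggers & set(candidate):
        return 1
    return 0
```

Each of the 24 Quarc rules would be transferred to one such indicator; the ME model then learns a weight for each feature instead of using the handcrafted score (here, +20).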
    <Paragraph position="1"> In addition to the Quarc features, we resolved five pronouns (he, him, his, she and her) in the stories based on the annotation in the corpora. The result of using Quarc features in the ME framework is 38.3% HumSent accuracy on the Remedia test set.</Paragraph>
    <Paragraph position="2"> This is lower than the 40% obtained by our re-implementation of Quarc with its handcrafted scores. A possible explanation is that handcrafted scores are more reliable than ME-estimated weights, since humans can generalize scores even from sparse data.</Paragraph>
    <Paragraph position="3"> Therefore, we refined our reference performance level by combining the ME models (MEM) and handcrafted models (HCM). Suppose the score of a question-answer pair is score(Q, Ci); the conditional probability that Ci answers Q in HCM is:</Paragraph>
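The equation paragraph itself is not present here. A plausible form, assuming the handcrafted scores are simply normalized over the candidate sentences Cj of the story (an assumption, not the paper's stated formula), is:

\[ \mathrm{HCM}(Q, C_i) = \frac{\mathrm{score}(Q, C_i)}{\sum_{j} \mathrm{score}(Q, C_j)} \]

This yields a probability comparable to the MEM output, which is what the interpolation in the next paragraph requires.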
    <Paragraph position="5"> We combined the probabilities from MEM and HCM in the following manner: score′(Q, Ci) = αMEM(Q, Ci) + (1 − α)HCM(Q, Ci). To obtain the optimal α, we partitioned the training set into four bins. The ME models are trained on three of the bins; the optimal α is determined on the remaining bin. By trying different bin combinations and different α such that 0 &lt; α &lt; 1 with interval 0.1, we obtained the average optimal α = 0.15 and 0.9 from the Remedia and ChungHwa training sets respectively. Our baseline used the combined ME models and handcrafted models to achieve 40.3% and 70.6% HumSent accuracy on the Remedia and ChungHwa test sets respectively.</Paragraph>
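The interpolation and the grid search over α can be sketched as follows. This is a toy stand-in for the paper's four-bin procedure: the data layout (per-question candidate probabilities plus a gold index) and all function names are illustrative assumptions.

```python
def combined_score(alpha, mem, hcm):
    """Linear interpolation of the two model probabilities:
    score'(Q, Ci) = alpha * MEM(Q, Ci) + (1 - alpha) * HCM(Q, Ci)."""
    return alpha * mem + (1 - alpha) * hcm


def best_alpha(dev_set, grid=None):
    """Pick the alpha that maximizes HumSent-style accuracy on a
    held-out bin.  dev_set: list of (mem_probs, hcm_probs, gold_index)
    triples, one per question, where mem_probs/hcm_probs hold one
    probability per candidate answer sentence."""
    if grid is None:
        grid = [round(0.1 * k, 1) for k in range(1, 10)]  # 0.1 .. 0.9

    def accuracy(alpha):
        hits = 0
        for mem_probs, hcm_probs, gold in dev_set:
            scores = [combined_score(alpha, m, h)
                      for m, h in zip(mem_probs, hcm_probs)]
            if scores.index(max(scores)) == gold:
                hits += 1
        return hits / len(dev_set)

    return max(grid, key=accuracy)
```

Averaging `best_alpha` over the different train/held-out bin combinations gives the averaged optimal α reported in the text (0.15 for Remedia, 0.9 for ChungHwa).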
    <Paragraph position="6"> We set up our experiments such that the linguistic features are applied incrementally: (i) first, we use only POS tags of matching words between questions and candidate answer sentences; (ii) then we add POS tags of the matching dependencies; (iii) we apply only GR features from MINIPAR; (iv) all features are used. These four feature sets are denoted +wp, +wp+dp, +mini and +wp+dp+mini respectively. The results are shown in Figure 3 for the Remedia and ChungHwa test sets.</Paragraph>
    <Paragraph position="7"> At the significance level 0.05, a pairwise t-test (over every question) on the statistical significance of the improvements gives p-values of 0.009 and 0.025 for the Remedia and ChungHwa test sets respectively. The deep syntactic features thus significantly improve performance over the baseline system on both test sets.</Paragraph>
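The per-question pairwise test above can be sketched as a paired t-statistic over per-question correctness indicators (1 if a system answers the question correctly, else 0). A minimal stdlib-only sketch; the reported p-values would then come from the t-distribution with n − 1 degrees of freedom (e.g. via scipy.stats.ttest_rel), which is omitted here:

```python
from math import sqrt
from statistics import mean, stdev


def paired_t_statistic(sys_scores, base_scores):
    """Paired t-statistic over per-question scores of two systems.

    sys_scores/base_scores: equal-length lists of per-question values
    (e.g. 0/1 correctness).  Returns t = mean(d) / (stdev(d) / sqrt(n))
    where d are the per-question score differences.
    """
    diffs = [s - b for s, b in zip(sys_scores, base_scores)]
    n = len(diffs)
    return mean(diffs) / (stdev(diffs) / sqrt(n))
```

Pairing by question controls for per-question difficulty, which is why the test is run "for every question" rather than on the aggregate accuracies alone.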
  </Section>
</Paper>