Automatic learning of textual entailments with cross-pair similarities

3 Challenges in learning from examples

In the introductory section we have shown that, to carry out automatic learning from examples, we need to define a cross-pair similarity measure. Its definition is not straightforward, as it should detect whether two pairs (T′, H′) and (T″, H″) realize the same rewrite rules. This measure should consider pairs similar when: (1) T′ and H′ are structurally similar to T″ and H″, respectively, and (2) the lexical relations within the pair (T′, H′) are compatible with those in (T″, H″). Typically, T and H show a certain degree of overlap; thus, lexical relations (e.g., between the same words) determine word movements from T to H (or vice versa). This is important to model the syntactic/lexical similarity between example pairs. Indeed, if we encode such movements in the syntactic parse trees of texts and hypotheses, we can use interesting similarity measures defined for syntactic parsing, e.g., the tree kernel devised in (Collins and Duffy, 2002).

To consider structural and lexical-relation similarity, we augment syntactic trees with placeholders which identify linked words. In more detail:

- We detect links between words w_t in T that are equal, similar, or semantically dependent on words w_h in H. We call the pairs (w_t, w_h) anchors, and we associate them with placeholders. For example, in Fig. 1 the placeholder 2″ indicates the (companies, companies) anchor between T1 and H1. This allows us to derive the word movements between text and hypothesis.

- We align the trees of the two texts T′ and T″, as well as the trees of the two hypotheses H′ and H″, by considering the word movements. We find a correct mapping between the placeholders of the two hypotheses H′ and H″ and apply it to the tree of H″ to substitute its placeholders. The same mapping is used to substitute the placeholders in T″. This mapping should maximize the structural similarity between the four trees, considering that placeholders augment the node labels. Hence, the cross-pair similarity computation is reduced to a tree similarity computation (a minimal sketch of this substitution step is given below).

The above steps define an effective cross-pair similarity that can be applied to the example of Fig. 1: T3 and H3 are quite different from T1 and H1, but we can rely on the structural properties expressed by their bold subtrees. These are more similar to the subtrees of T1 and H1 than to those of T1 and H2, respectively. Indeed, H1 and H3 share the production NP → DT JJ NN NNS, while H2 and H3 do not. Consequently, to decide whether (T3, H3) is a valid entailment, we should rely on the decision made for (T1, H1). Note also that the dashed lines connecting placeholders of two texts (hypotheses) indicate structurally equivalent nodes. For instance, the dashed line between 3 and b links the main verbs both in the texts T1 and T3 and in the hypotheses H1 and H3. After substituting 3 with b and 2 with a, we can detect whether T1 and T3 share the bold subtree S → NP 2 VP 3. Since this subtree is also shared by H1 and H3, the words within the pair (T1, H1) are correlated similarly to the words in (T3, H3).
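The substitution machinery can be made concrete with a short sketch. It is a minimal illustration, not the authors' implementation: trees are encoded as (label, children) tuples and placeholders as label suffixes, both of which are assumptions of this example.

```python
# A parse tree as a (label, children) tuple; anchored nodes carry their
# placeholder as a suffix of the label, e.g. "NP-2" or "VP-b".
def substitute(tree, c):
    """t(S, c): return the tree with every placeholder renamed via c."""
    label, children = tree
    if "-" in label:
        core, ph = label.rsplit("-", 1)
        label = core + "-" + c.get(ph, ph)  # unmapped placeholders survive
    return (label, [substitute(child, c) for child in children])

# Align H3's placeholder scheme {a, b} to H1's scheme {2, 3}:
h3 = ("S", [("NP-a", [("DT-a", []), ("NNS-a", [])]), ("VP-b", [("VBP-b", [])])])
print(substitute(h3, {"a": "2", "b": "3"}))
# ('S', [('NP-2', [('DT-2', []), ('NNS-2', [])]), ('VP-3', [('VBP-3', [])])])
```

After this relabeling, the trees of the two pairs can be compared directly by any tree similarity function, which is exactly the reduction described above.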
The above example emphasizes that we need to derive the best mapping between placeholder sets. It can be obtained as follows: let A′ and A″ be the placeholders of (T′, H′) and (T″, H″), respectively; without loss of generality we assume |A′| ≥ |A″|, and we align a subset of A′ to A″. The best alignment is the one that maximizes the syntactic and lexical overlap of the two subtrees induced by the aligned sets of anchors.

More precisely, let C be the set of all bijective mappings from a′ ⊆ A′, with |a′| = |A″|, to A″; an element c ∈ C is a substitution function. We define the best alignment as the one determined by

c_max = argmax_{c ∈ C} [ K_T(t(H′, c), t(H″, i)) + K_T(t(T′, c), t(T″, i)) ]   (1)

where (a) t(S, c) returns the syntactic tree of the hypothesis (text) S with placeholders replaced by means of the substitution c, (b) i is the identity substitution, and (c) K_T(t1, t2) is a function that measures the similarity between the two trees t1 and t2 (for more details see Sec. 4.2). For example, the c_max between (T1, H1) and (T3, H3) is {(2′, a′), (2″, a″), (3, b), (4, c)}.

4 Similarity Models

In this section we describe how anchors are found at the level of a single pair (T, H) (Sec. 4.1). The anchoring process directly yields an intra-pair similarity that can be used as a baseline approach or in combination with the cross-pair similarity. The latter is implemented with tree kernel functions over syntactic structures (Sec. 4.2).

4.1 Anchoring and Lexical Similarity

The algorithm that we designed to find the anchors is based on similarity functions between words or more complex expressions. Our approach is in line with many other works (e.g., (Corley and Mihalcea, 2005; Glickman et al., 2005)).

Given the sets of content words (verbs, nouns, adjectives, and adverbs) W_T and W_H of the two sentences T and H, respectively, the set of anchors A ⊆ W_T × W_H is built using a similarity measure between two words, sim_w(w_t, w_h). Each element (w_t, w_h) ∈ A is such that sim_w(w_t, w_h) > 0 and w_t is maximally similar to w_h, i.e., sim_w(w_t, w_h) = max_{w′_t ∈ W_T} sim_w(w′_t, w_h). According to these properties, elements in W_H can participate in more than one anchor and, conversely, more than one element in W_H can be linked to a single element w ∈ W_T.

The similarity sim_w(w_t, w_h) can be defined using different indicators and resources. First of all, two words are maximally similar if they have the same surface form, w_t = w_h. Second, we can use one of the WordNet (Miller, 1995) similarities, indicated with d(l_w, l_w′) (in line with what was done in (Corley and Mihalcea, 2005)), and different relations between words, such as the lexical entailment between verbs (Ent) and the derivational relation between words (Der). Finally, we use the edit distance measure lev(w_t, w_h) to capture the similarity between words that are missed by the previous analyses because of misspellings or derivational forms not coded in WordNet.
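The anchoring step can be sketched as follows. This is a toy rendering under stated assumptions: the real measure consults WordNet, the Ent and Der relations, and lev, all of which are collapsed here into surface identity plus a string-similarity fallback, and the 0.8 threshold is illustrative.

```python
import difflib

def sim_w(wt: str, wh: str) -> float:
    """Toy word similarity: surface identity, then a string-similarity
    fallback standing in for the WordNet/Ent/Der/lev resources."""
    if wt == wh:
        return 1.0
    ratio = difflib.SequenceMatcher(None, wt, wh).ratio()
    return ratio if ratio > 0.8 else 0.0  # illustrative threshold

def anchors(w_t: list[str], w_h: list[str]) -> set[tuple[str, str]]:
    """Anchor each content word of H to the maximally similar words of T."""
    a = set()
    for wh in w_h:
        scores = {wt: sim_w(wt, wh) for wt in w_t}
        best = max(scores.values(), default=0.0)
        if best > 0.0:  # keep every wt that ties for the maximum
            a |= {(wt, wh) for wt, s in scores.items() if s == best}
    return a

# Links 'dividend' to 'dividends' via the string fallback:
print(anchors(["companies", "pay", "dividends"],
              ["companies", "pay", "dividend"]))
```

Each element of the returned set is an anchor (w_t, w_h), which would then receive a placeholder shared by the two trees.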
Combining the above indicators: given the syntactic category c_w ∈ {noun, verb, adjective, adverb} and the lemmatized form l_w of a word w, the similarity measure sim_w(w, w′) between two words w and w′ (Eq. 2) assigns the maximal score when the words match by surface form, by one of the lexical relations (Ent, Der), or by edit distance, and backs off to the WordNet similarity d(l_w, l_w′) for words of the same syntactic category.

It is worth noticing that the above measure is not a pure similarity measure, as it includes the entailment relation, which does not represent synonymy or similarity between verbs. To emphasize the contribution of each resource, in the experimental section we will compare Eq. 2 with versions that exclude some of the word relations.

The above word similarity measure can be used to compute the similarity between T and H. In line with (Corley and Mihalcea, 2005), we define

s_1(T, H) = ( Σ_{w ∈ W_H} max_{w′ ∈ W_T} sim_w(w′, w) · idf(w) ) / ( Σ_{w ∈ W_H} idf(w) )   (3)

where idf(w) is the inverse document frequency of the word w. For the sake of comparison, we also consider the corresponding more classical version that does not apply the inverse document frequency:

s_2(T, H) = ( Σ_{w ∈ W_H} max_{w′ ∈ W_T} sim_w(w′, w) ) / |W_H|   (4)

From the above intra-pair similarities, s_1 and s_2, we can obtain baseline cross-pair similarities based only on lexical information:

K_i((T′, H′), (T″, H″)) = s_i(T′, H′) × s_i(T″, H″)   (5)

where i ∈ {1, 2}. In the next section we define a novel cross-pair similarity that takes into account syntactic evidence by means of tree kernel functions.

4.2 Cross-pair syntactic kernels

Section 3 has shown that, to measure the syntactic similarity between two pairs (T′, H′) and (T″, H″), we should capture the number of common subtrees between texts and hypotheses that share the same anchoring scheme. The best alignment between anchor sets, i.e., the best substitution c_max, can be found with Eq. 1. As the corresponding maximum quantifies the alignment degree, we can define a cross-pair similarity as follows:

K_s((T′, H′), (T″, H″)) = max_{c ∈ C} [ K_T(t(H′, c), t(H″, i)) + K_T(t(T′, c), t(T″, i)) ]   (6)

where as K_T(t1, t2) we use the tree kernel function defined in (Collins and Duffy, 2002). This evaluates the number of subtrees shared by t1 and t2, thus defining an implicit substructure space.

Formally, given a subtree space F = {f_1, f_2, ..., f_|F|}, the indicator function I_i(n) is equal to 1 if the target f_i is rooted at node n and 0 otherwise. A tree kernel function over t1 and t2 is

K_T(t1, t2) = Σ_{n1 ∈ N_{t1}} Σ_{n2 ∈ N_{t2}} Δ(n1, n2)

where N_{t1} and N_{t2} are the sets of nodes of t1 and t2, respectively. In turn,

Δ(n1, n2) = Σ_{i=1}^{|F|} λ^{l(f_i)} I_i(n1) I_i(n2)

where 0 ≤ λ ≤ 1 and l(f_i) is the number of levels of the subtree f_i. Thus λ^{l(f_i)} assigns a lower weight to larger fragments. When λ = 1, Δ is equal to the number of common fragments rooted at nodes n1 and n2. As described in (Collins and Duffy, 2002), Δ can be computed in O(|N_{t1}| × |N_{t2}|).
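For concreteness, here is a compact sketch of the Collins–Duffy recursion behind Δ, again over the (label, children) tuples used above. It is a didactic version under assumptions: λ = 0.4 is an arbitrary choice, and the recursion is not memoized, so it does not achieve the O(|N_{t1}| × |N_{t2}|) bound just mentioned.

```python
LAMBDA = 0.4  # decay factor, 0 <= lambda <= 1 (arbitrary for the demo)

def production(n):
    """A node's production: its label plus the ordered child labels."""
    label, children = n
    return (label, tuple(c[0] for c in children))

def preterminal(n):
    return all(not c[1] for c in n[1])  # every child is a leaf (a word)

def nodes(t):
    """All internal nodes; words themselves are not fragment roots."""
    if not t[1]:
        return []
    return [t] + [m for c in t[1] for m in nodes(c)]

def delta(n1, n2):
    """Weighted count of common fragments rooted at n1 and n2."""
    if not n1[1] or not n2[1] or production(n1) != production(n2):
        return 0.0
    if preterminal(n1):
        return LAMBDA
    out = LAMBDA
    for c1, c2 in zip(n1[1], n2[1]):  # same production => same child count
        out *= 1.0 + delta(c1, c2)
    return out

def k_t(t1, t2):
    """K_T(t1, t2): sum of delta over all pairs of nodes."""
    return sum(delta(n1, n2) for n1 in nodes(t1) for n2 in nodes(t2))

h = ("S", [("NP", [("DT", [("all", [])]), ("NNS", [("companies", [])])]),
           ("VP", [("VBP", [("pay", [])])])])
print(round(k_t(h, h), 3))  # self-similarity of a toy tree
```

With placeholder-augmented labels (e.g., "NP-2"), two nodes match only when their placeholders also match, which is precisely how the anchoring scheme is injected into the kernel.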
The K_T function has been proven to be a valid kernel, i.e., its associated Gram matrix is positive semidefinite. Some basic operations on kernel functions, e.g., the sum, are closed with respect to the set of valid kernels. Thus, if the maximum preserved this property, Eq. 6 would be a valid kernel and we could use it in kernel-based machines like SVMs. Unfortunately, a counterexample illustrated in (Boughorbel et al., 2004) shows that the max function does not produce valid kernels in general.

However, we observe that: (1) K_s((T′, H′), (T″, H″)) is a symmetric function, since the set of transformations C is always computed with respect to the pair that has the largest anchor set; (2) in (Haasdonk, 2005) it is shown that when kernel functions are not positive semidefinite, SVMs still solve a data separation problem, in pseudo-Euclidean spaces. The drawback is that the solution may be only a local optimum. Therefore, we can experiment with Eq. 6 in SVMs and observe whether the empirical results are satisfactory. Section 6 shows that the solutions found with Eq. 6 produce higher accuracy than previous automatic textual entailment recognition approaches.

5 Refining cross-pair syntactic similarity

In the previous section we defined the intra-pair and the cross-pair similarities. The former poses no relevant implementation issues, whereas the latter should be optimized to favor its applicability with SVMs. Improving Eq. 6 depends on three factors: (1) its computational complexity; (2) a correct marking of tree nodes with placeholders; and (3) the pruning of irrelevant information in large syntactic trees.

5.1 Controlling the computational cost

The computational cost of the cross-pair similarity between two tree pairs (Eq. 6) depends on the size of C. This is combinatorial in the sizes of A′ and A″: |C| = (|A′| choose |A″|) · |A″|! = |A′|! / (|A′| − |A″|)! if |A″| ≤ |A′|. Thus we should keep the sizes of A′ and A″ reasonably small (a brute-force rendering of this search is sketched at the end of this subsection).

To reduce the number of placeholders, we consider the notion of chunk defined in (Abney, 1996), i.e., non-recursive kernels of noun, verb, adjective, and adverb phrases. When placeholders are in a single chunk, both in the text and in the hypothesis, we assign them the same name. For example, Fig. 1 shows the placeholders 2′ and 2″, which are replaced by the placeholder 2. The placeholder reduction procedure also offers the possibility of resolving the ambiguity still present in the anchor set A (see Sec. 4.1): a way to eliminate ambiguous anchors is to select the ones that reduce the final number of placeholders.
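The search for c_max (Eq. 1) can be written down directly, which also makes the cost above tangible. This sketch assumes the substitute and k_t helpers from the previous sketches and plain lists of placeholder names; it is an illustration, not the authors' code.

```python
from itertools import permutations

def c_max(pair1, pair2):
    """Brute-force Eq. 1 over two (text_tree, hyp_tree, placeholders) pairs.

    Assumes pair1 has at least as many placeholders as pair2, per Sec. 3;
    enumerates all |A'|!/(|A'| - |A''|)! injective substitutions.
    """
    t1, h1, a1 = pair1
    t2, h2, a2 = pair2
    best_score, best_c = float("-inf"), None
    for chosen in permutations(a1, len(a2)):  # ordered subsets of A'
        c = dict(zip(chosen, a2))             # substitution A' -> A''
        score = k_t(substitute(h1, c), h2) + k_t(substitute(t1, c), t2)
        if score > best_score:
            best_score, best_c = score, c
    return best_c, best_score
```

With |A′| = 10 and |A″| = 5 this already enumerates 10!/5! = 30,240 substitutions, which is why the chunk-based placeholder reduction just described is essential in practice.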
5.2 Augmenting tree nodes with placeholders

Anchors are mainly used to extract the relevant syntactic subtrees between pairs of text and hypothesis. We also use them to characterize the syntactic information expressed by such subtrees. Indeed, Eq. 6 depends on the number of common subtrees between two pairs, and such subtrees are matched only when they have the same node labels. Thus, to keep track of the argument movements, we augment the node labels with placeholders. The more placeholders two hypotheses (texts) match, the larger the number of their common substructures, i.e., the higher the similarity. It is therefore very important where placeholders are inserted.

For example, the sentences in the pair (T1, H1) have related subjects, 2, and related main verbs, 3. The same occurs in the sentences of the pair (T3, H3), with a and b respectively. To obtain such node marking, the placeholders are propagated up the syntactic tree, from the leaves to the target nodes, according to the heads of the constituents (to increase the generalization capacity of the tree kernel function, we choose not to assign any placeholder to the leaves themselves). The example of Fig. 1 shows that the placeholder 0 climbs up to the node governing all the NPs.

5.3 Pruning irrelevant information in large text trees

Often only a portion of the parse trees is relevant to detect entailments. For instance, let us consider the following pair from the RTE 2005 corpus:

T: "Ron Gainsford, chief executive of the TSI, said: 'It is a major concern to us that parents could be unwittingly exposing their children to the risk of sun damage, thinking they are better protected than they actually are.'"
H: "Ron Gainsford is the chief executive of the TSI."

Only the bold part of T supports the implication; the rest is useless and even misleading: if we used it to compute the similarity, it would reduce the importance of the relevant part. Moreover, as we normalize the syntactic tree kernel K_T with respect to the size of the two trees, we need to focus only on the part relevant to the implication.

The anchored leaves are good indicators of the relevant parts, but some other parts may be very relevant too. For example, the function word not plays an important role. Another example is given by the words insurance in H1 and mountain in H3 (see Fig. 1): they support the implications T1 ⇒ H1 and T3 ⇒ H3, just as cash supports T1 ⇏ H2. By removing these words and the related structures, we could not determine the correct implications of the first two pairs and the incorrect implication of the third. Thus, we keep all the words that are immediately related to relevant constituents.

The reduction procedure can be formally expressed as follows: given a syntactic tree t, the set of its nodes N(t), and a set of anchors, we build a tree t′ with all the nodes N′ that are anchors or ancestors of any anchor. Moreover, we add to t′ the leaf nodes of the original tree t that are direct children of nodes in N′. We apply this procedure only to the syntactic trees of texts, before the computation of the kernel function.
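The pruning procedure has a direct recursive rendering. The sketch below keeps a node if it is an anchor or has an anchor somewhere beneath it, then re-attaches leaf children of kept nodes; trees and the placeholder-suffix convention are those of the earlier sketches, and testing anchors on labels is an illustrative simplification.

```python
def prune(tree, is_anchor):
    """Return the pruned tree, or None if the subtree contains no anchor."""
    label, children = tree
    pruned = [prune(c, is_anchor) for c in children]
    if not is_anchor(label) and all(p is None for p in pruned):
        return None  # neither an anchor nor an ancestor of one
    kept = []
    for child, p in zip(children, pruned):
        if p is not None:
            kept.append(p)       # anchor or ancestor of an anchor
        elif not child[1]:
            kept.append(child)   # leaf that is a direct child of a kept node
    return (label, kept)

t = ("S", [("NP", [("NNP-1", [("Ron", [])])]),
           ("VP", [("VBD", [("said", [])]),
                   ("SBAR", [("S", [("NP", [("PRP", [("It", [])])])])])])])
print(prune(t, lambda label: "-" in label))
# ('S', [('NP', [('NNP-1', [('Ron', [])])])])  -- the quoted clause is gone
```

As stated above, only the text trees would be pruned this way; hypothesis trees are left intact.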
6 Experimental investigation

The aim of the experiments is twofold: we show that (a) entailment recognition rules can be learned from examples and (b) our kernel functions over syntactic structures are effective for deriving syntactic properties. These goals can be achieved by comparing the different intra-pair and cross-pair similarity measures.

6.1 Experimental settings

For the experiments, we used the Recognizing Textual Entailment Challenge data sets, which we name as follows:

- D1 and T1, and D2 and T2, are the development and test sets of the first (Dagan et al., 2005) and second (Bar Haim et al., 2006) challenges, respectively. D1 contains 567 examples, whereas T1, D2, and T2 all have the same size, i.e., 800 training/testing instances. Positive examples constitute 50% of the data.

- ALL is the union of D1, D2, and T1, which we also split 70%-30%. This set is useful to test whether we can learn entailments from the data prepared for the two different challenges.

- D2(50%)′ and D2(50%)″ form a random split of D2. The data sets of the two competitions may be quite different, so we created this homogeneous split.

We also used the following resources:

- The Charniak parser (Charniak, 2000) and the morpha lemmatiser (Minnen et al., 2001) to carry out the syntactic and morphological analysis.

- WordNet 2.0 (Miller, 1995) to extract both the verbs in entailment (the Ent set) and the derivationally related words (the Der set).

- The wn::similarity package (Pedersen et al., 2004) to compute the Jiang&Conrath (J&C) distance (Jiang and Conrath, 1997) as in (Corley and Mihalcea, 2005). This is one of the best performing measures and provides a similarity score in the [0,1] interval. We used it to implement the d(l_w, l_w′) function.

- A selected portion of the British National Corpus to compute the inverse document frequency (idf). We assigned the maximum idf to words not found in the BNC.

- SVM-light-TK (Moschitti, 2006), which encodes the basic tree kernel function K_T in SVM-light (Joachims, 1999). We used this software to implement K_s (Eq. 6), K_1, K_2 (Eq. 5), and the K_s + K_i kernels; the latter combine our new kernel with the traditional approaches (i ∈ {1, 2}). A sketch of how such precomputed kernels can be plugged into an SVM follows.
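As an illustration with generic tooling (the paper itself uses SVM-light-TK, so scikit-learn here is purely an assumption of this sketch), any of the kernels above can be fed to an SVM through a precomputed Gram matrix:

```python
import numpy as np
from sklearn.svm import SVC

def gram(pairs_a, pairs_b, k_pair):
    """Gram matrix G[i, j] = k_pair(pairs_a[i], pairs_b[j])."""
    return np.array([[k_pair(a, b) for b in pairs_b] for a in pairs_a])

def train_and_predict(train_pairs, y, test_pairs, k_pair):
    """k_pair would be, e.g., Ks + Ki over anchored (T, H) tree pairs."""
    clf = SVC(kernel="precomputed")
    clf.fit(gram(train_pairs, train_pairs, k_pair), y)  # n_train x n_train
    return clf.predict(gram(test_pairs, train_pairs, k_pair))
```

Since K_s is not guaranteed to be positive semidefinite, the solver may return only a local optimum, which is exactly the trade-off discussed in Sec. 4.2 (Haasdonk, 2005).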
6.2 Results and analysis

Table 1 reports the results of the different similarity kernels on the different training and test splits described in the previous section. The table is organized as follows: the first 5 rows (Experiment Settings) report the intra-pair similarity measures defined in Section 4.1, the 6th row refers to the idf similarity metric alone, and the following two rows report the cross-pair similarity carried out with Eq. 6 with (Synt Trees with placeholders) and without (Only Synt Trees) augmenting the trees with placeholders, respectively. Each column in the Experiment Settings indicates a different intra-pair similarity measure built by means of a combination of the basic similarity approaches; the combinations are specified with a check sign. For example, column 5 refers to a model using the surface word form similarity, the d(l_w, l_w′) similarity, and the idf.

[Table 1: Experiment Settings and accuracies of the intra-pair and cross-pair similarity models]

The next 5 rows show the accuracy on the data sets and splits used for the experiments, and the following row reports the average and standard deviation over the previous 5 results. Finally, the last two rows report the accuracy on the ALL dataset split 70%/30%, and on the whole ALL dataset used for training with T2 for testing.

From the table we note the following aspects:

- First, the lexical-based distance kernels K_1 and K_2 (Eq. 5) show accuracy significantly higher than the random baseline, i.e., 50%. In all the datasets (except the first one), the sim_w(T, H) similarity based on lexical overlap (first column) provides an accuracy essentially similar to that of the best lexical-based distance method.

- Second, the dataset "Train:D1-Test:T1" allows us to compare our models with those of the first RTE challenge (Dagan et al., 2005). The accuracy reported for the best systems, i.e., 58.6% (Glickman et al., 2005; Bayer et al., 2005), is not significantly different from the result obtained with K_1 using the idf.

- Third, the dramatic improvement observed in (Corley and Mihalcea, 2005) on the dataset "Train:D1-Test:T1" is due to the idf rather than to the use of the J&C similarity (second vs. third column). Using J&C together with the idf decreases the accuracy of the idf alone.

- Next, our approach (last column) is significantly better than all the other methods, as it provides the best result for each combination of training and test sets. On the "Train:D1-Test:T1" test set, it exceeds the accuracy of the current state-of-the-art models (Glickman et al., 2005; Bayer et al., 2005) by about 4.4 absolute percentage points (63% vs. 58.6%), and that of our best lexical similarity measure by 4 points. Comparing the averages over all datasets, our system improves on all the other methods by at least 3 absolute percentage points.

- Finally, the accuracy produced by Synt Trees with placeholders is higher than that obtained with Only Synt Trees. Thus, the use of placeholders is fundamental for automatically learning entailments from examples.
Hereafter we show some instances selected from the first experiment, "Train:T1-Test:D1". They were correctly classified by our overall model (last column) and misclassified by the models in the seventh and eighth columns. The first is an example in entailment, of which only a fragment of the text is recoverable:

T ⇒ H
T: "[...]ducer in the world, was once a supporter of Osama bin Laden and his associates who led attacks against the [...]"

T ⇒ H
T: "Harvey Weinstein, the co-chairman of Miramax, who was instrumental in popularizing both independent and foreign films with broad audiences, agrees."
H: "Harvey Weinstein is the co-chairman of Miramax."

The rewrite rule is: "X, Y, ..." implies "X is Y". This rule is also described in (Hearst, 1992). A more interesting rule relates the following two sentences, which are not in entailment:

T ⇏ H (id: 2045)
T: "Mrs. Lane, who has been a Director since 1989, is Special Assistant to the Board of Trustees and to the President of Stanford University."
H: "Mrs. Lane is the president of Stanford University."

This pair was correctly classified using instances like the following:

T ⇏ H (id: 2044)
T: "Jacqueline B. Wender is Assistant to the President of Stanford University."
H: "Jacqueline B. Wender is the President of Stanford University."

T ⇏ H (id: 2069)
T: "Grieving father Christopher Yavelow hopes to deliver one million letters to the queen of Holland to bring his children home."
H: "Christopher Yavelow is the queen of Holland."

Here, the implicit rule is: "X (VP (V ...) (NP (to Y) ...))" does not imply "X is Y".