<?xml version="1.0" standalone="yes"?> <Paper uid="P03-1005"> <Title>Hierarchical Directed Acyclic Graph Kernel: Methods for Structured Natural Language Data</Title>
<Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Convolution Kernels </SectionTitle>
<Paragraph position="0"> Convolution Kernels were proposed as a concept of kernels for discrete structures. This framework defines a kernel function between input objects by applying convolution "sub-kernels", which are kernels for the decompositions (parts) of the objects.</Paragraph>
<Paragraph position="1"> Let $D$ be a positive integer and $X, X_1, \ldots, X_D$ be nonempty, separable metric spaces. This paper focuses on the special case in which $X, X_1, \ldots, X_D$ are countable sets. We start with $x \in X$ as a composite structure and $\vec{x} = x_1, \ldots, x_D$ as its "parts", where $x_d \in X_d$. $R$ is defined as a relation on the set $X_1 \times \cdots \times X_D \times X$ such that $R(\vec{x}, x)$ holds iff $\vec{x}$ are the parts of $x$; $R^{-1}(x) = \{\vec{x} \mid R(\vec{x}, x)\}$ denotes the set of decompositions of $x$.</Paragraph>
<Paragraph position="3"> Suppose $x, y \in X$, let $\vec{x} = x_1, \ldots, x_D$ be the parts of $x$, and let $\vec{y} = y_1, \ldots, y_D$ be the parts of $y$. Then, the similarity $K(x, y)$ between $x$ and $y$ is defined as the following generalized convolution:
$$K(x, y) = \sum_{\vec{x} \in R^{-1}(x)} \; \sum_{\vec{y} \in R^{-1}(y)} \; \prod_{d=1}^{D} K_d(x_d, y_d). \quad (1)$$</Paragraph>
<Paragraph position="5"> We note that Convolution Kernels are abstract concepts, and that instances of them are determined by the definition of the sub-kernel $K_d(x_d, y_d)$. The Tree Kernel (Collins and Duffy, 2001) and the String Subsequence Kernel (SSK) (Lodhi et al., 2002), developed in the NLP field, are examples of Convolution Kernel instances.</Paragraph>
<Paragraph position="6"> An explicit definition of both the Tree Kernel and SSK can be written as
$$K(x, y) = \langle \phi(x), \phi(y) \rangle = \sum_{i=1}^{m} \phi_i(x)\, \phi_i(y). \quad (2)$$
Conceptually, we enumerate all sub-structures occurring in $x$ and $y$, where $m$ represents the total number of possible sub-structures in the objects. $\phi$, the feature mapping from the sample space to the feature space, is given by $\phi(x) = (\phi_1(x), \ldots, \phi_m(x))$.</Paragraph>
<Paragraph position="8"> In the case of the Tree Kernel, $x$ and $y$ are trees.</Paragraph>
<Paragraph position="9"> The Tree Kernel computes the number of common subtrees in the two trees $x$ and $y$; $\phi_i(x)$ is defined as the number of occurrences of the $i$'th enumerated subtree in tree $x$.</Paragraph>
<Paragraph position="10"> In the case of SSK, the input objects $x$ and $y$ are string sequences, and the kernel function computes the sum of the occurrences of the $i$'th common subsequence, $\phi_i(x)$, weighted according to the length of the subsequence. Both kernels can be computed in polynomial time through efficient recursive calculation; see equation (1). Our proposed method uses the framework of Convolution Kernels.</Paragraph> </Section>
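As a deliberately simplified illustration of the explicit form in equation (2), the following sketch (our own, not from the paper) counts common sub-structures through explicit feature maps, using contiguous substrings as a stand-in for the subtrees of the Tree Kernel and the subsequences of SSK:

```python
from collections import Counter

def substructures(s):
    """Enumerate all contiguous substrings of s as a stand-in for 'parts'."""
    return Counter(s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1))

def convolution_kernel(x, y):
    """K(x, y) = sum_i phi_i(x) * phi_i(y): inner product of sub-structure counts."""
    phi_x, phi_y = substructures(x), substructures(y)
    return sum(cnt * phi_y[sub] for sub, cnt in phi_x.items())

print(convolution_kernel("abc", "abd"))  # common substrings 'a', 'b', 'ab' -> 3
```

The recursive formulations cited above avoid enumerating the feature space explicitly; this sketch only illustrates what quantity is being computed.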
<Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 HDAG Kernel </SectionTitle>
<Paragraph position="0"/>
<Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Definition of HDAG </SectionTitle>
<Paragraph position="0"> This paper defines an HDAG as a Directed Acyclic Graph (DAG) with hierarchical structures; that is, certain nodes contain DAGs within themselves.</Paragraph>
<Paragraph position="1"> In basic NLP tasks, chunking and parsing are used to analyze the text semantically or grammatically.</Paragraph>
<Paragraph position="2"> There are several levels of chunks, such as phrases, named entities and sentences, and these are bound by relation structures, such as dependency structure, anaphora, and coreference. HDAG is designed to enable the representation of all of these structures inside texts: hierarchical structures for chunks and DAG structures for the relations between chunks. We believe this richer representation is extremely useful for improving the performance of similarity measures between texts and, moreover, of learning and clustering tasks in NLP application areas.</Paragraph>
<Paragraph position="3"> Figure 1 shows an example of the text structures that can be handled by HDAG. Figure 2 contains simple examples of HDAGs that elucidate the calculation of similarity.</Paragraph>
<Paragraph position="4"> As shown in Figures 1 and 2, nodes are allowed to have zero or more attributes, because nodes in texts usually have several kinds of attributes. For example, attributes include words, part-of-speech tags, semantic information such as WordNet senses, and named-entity classes.</Paragraph> </Section>
<Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 Definition of HDAG Kernel </SectionTitle>
<Paragraph position="0"> First of all, we define the sets of nodes in HDAGs $G_1$ and $G_2$ as $Q$ and $R$, respectively; $q$ and $r$ represent nodes in the graphs, defined as $\{q_i \mid q_i \in Q,\ i = 1, \ldots, |Q|\}$ and $\{r_j \mid r_j \in R,\ j = 1, \ldots, |R|\}$, respectively. We use an expression such as $q_a \to q_b \to q_c$ to represent the path from $q_a$ to $q_c$ through $q_b$.</Paragraph>
<Paragraph position="3"> We define an "attribute sequence" as a sequence of attributes extracted from the nodes included in a sub-path. An attribute sequence is expressed as 'A-B' or 'A-(C-B)', where ( ) represents a chunk. As a basic example of the extraction of attribute sequences from a sub-path, a sub-path $r_i \to r_j$ in Figure 2, whose nodes carry the attributes {e, N} and {b, V}, contains the four attribute sequences 'e-b', 'e-V', 'N-b' and 'N-V', which are the combinations of all attributes in $r_i$ and $r_j$. Section 3.3 explains in detail the method of extracting attribute sequences from sub-paths.</Paragraph>
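To illustrate how a sub-path expands into attribute sequences, here is a small sketch (an illustration of ours, not the paper's code) that takes one attribute from each node on a sub-path and emits every combination, reproducing the 'e-b', 'e-V', 'N-b', 'N-V' example above:

```python
from itertools import product

def attribute_sequences(subpath):
    """subpath: list of attribute sets, one per node on the sub-path.
    Returns every attribute sequence obtained by picking one attribute per node."""
    return ['-'.join(combo) for combo in product(*subpath)]

# Two-node sub-path: the source node carries {'e', 'N'}, the sink carries {'b', 'V'}.
print(sorted(attribute_sequences([['e', 'N'], ['b', 'V']])))
# ['N-V', 'N-b', 'e-V', 'e-b']
```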
<Paragraph position="4"> Next, we define "terminated nodes" as those that do not contain any graph, such as $q_2$ and $q_5$; "non-terminated nodes" are those that do, such as $r_1$ and $r_3$. Since HDAGs treat not only exact matching of sub-structures but also approximate matching, we allow node skips according to a decay factor $\lambda$ ($0 < \lambda \le 1$) when extracting attribute sequences from the sub-paths. This framework makes the similarity evaluation robust: similar sub-structures still contribute to the similarity value, in contrast to exact matching, which never rewards merely similar sub-structures. Next, we define the parameter $n$ ($n = 1, 2, \ldots$) as the number of attributes combined in an attribute sequence. When calculating similarity, we consider only combination lengths of up to $n$.</Paragraph>
<Paragraph position="5"> Given the above discussion, the feature vector of an HDAG is written as $\phi(G) = (\phi_1(G), \ldots, \phi_m(G))$, where $\phi$ represents the explicit feature mapping of the HDAG and $m$ represents the number of all possible attribute combinations of up to $n$ attributes. The value of $\phi_i(G)$ is the number of occurrences of the $i$'th attribute sequence in HDAG $G$; each attribute sequence is weighted according to the node skips. The similarity between HDAGs, which is the definition of the HDAG Kernel, follows equation (2), where the input objects $x$ and $y$ are $G_1$ and $G_2$, respectively. According to this approach, the HDAG Kernel calculates the inner product of the common attribute sequences, weighted according to their node skips and occurrences, between the two HDAGs $G_1$ and $G_2$.</Paragraph>
<Paragraph position="8"> We note that, in general, if the dimension of the feature space becomes very high or approaches infinity, it becomes computationally infeasible to generate the feature vector $\phi(G)$ explicitly. To improve the reader's understanding of what the HDAG Kernel calculates, before we introduce our efficient calculation method, the next section details the attribute sequences that become elements of the feature vector when the calculation is made explicit.</Paragraph> </Section>
<Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.3 Attribute Sequences: The Elements of the Feature Vector </SectionTitle>
<Paragraph position="0"> We describe the details of the attribute sequences that are elements of the feature vector of the HDAG Kernel, using $G_1$ and $G_2$ in Figure 2.</Paragraph>
<Paragraph position="1"> The framework of node skip: We denote the explicit representation of a node skip by "$*$". The attribute sequences in a sub-path under a "node skip" are written as, e.g., 'a-$*$-c'. It costs $\lambda$ to skip a terminated node. The cost of skipping a non-terminated node is the same as that of skipping all the graphs inside the non-terminated node.</Paragraph>
<Paragraph position="2"> We introduce decay functions $\mu_\lambda(q)$, $\alpha_\lambda(q)$ and $\beta_\lambda(q)$, all based on the decay factor $\lambda$. $\mu_\lambda(q)$ represents the cost of the node skip of $q$. For example, $\mu_\lambda(q_1) = 2\lambda^2$ represents the cost of the node skips over $q_2 \to q_3$ and over $q_4 \to q_3$, while $\mu_\lambda(q_2) = \lambda$ is the cost of just the node skip of $q_2$. $\alpha_\lambda(q)$ represents the sum of the multiplied node-skip costs of all of the nodes that have a path to $q$; $\alpha_\lambda(q_3) = 2\lambda$ is the sum of the costs of both $q_2$ and $q_4$, which have a path to $q_3$, and $\alpha_\lambda(q_1) = 1$ ($= \lambda^0$). $\beta_\lambda(q)$ represents the sum of the multiplied node-skip costs of all the nodes that $q$ has a path to; $\beta_\lambda(q_2) = \lambda$ represents the cost of the node skip of $q_3$, to which $q_2$ has a path.</Paragraph>
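The following toy sketch illustrates the node-skip weighting for a flat sub-path, under the simplifying assumption that each skipped terminated node contributes one factor of $\lambda$; the hierarchical case, where skipping a non-terminated node costs skipping everything inside it, is omitted here:

```python
from itertools import product

LAMBDA = 0.5  # decay factor, 0 < lambda <= 1

def weighted_sequences(path):
    """path: list of attribute sets for terminated nodes along a sub-path.
    Each intermediate node may be skipped at cost LAMBDA; source and sink are kept.
    Returns {attribute_sequence: total weight}."""
    weights = {}
    inner = path[1:-1]
    # choose, for every inner node, whether to keep (True) or skip (False) it
    for keep in product([True, False], repeat=len(inner)):
        kept = [path[0]] + [n for n, k in zip(inner, keep) if k] + [path[-1]]
        w = LAMBDA ** keep.count(False)          # one lambda per skipped node
        for combo in product(*kept):             # one attribute per kept node
            seq = '-'.join(combo)
            weights[seq] = weights.get(seq, 0.0) + w
    return weights

# sub-path q_a -> q_b -> q_c with attributes {'a'}, {'b'}, {'c'}
print(weighted_sequences([['a'], ['b'], ['c']]))
# {'a-b-c': 1.0, 'a-c': 0.5} -- 'a-c' arises by skipping the middle node at cost lambda
```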
<Paragraph position="3"> We define the attributes of a non-terminated node as the combinations of all attribute sequences inside it, including the node skip. Table 1 shows the attribute sequences and values of $q_1$ and $r_3$ (one row per sub-path, with its attribute sequences and their values).</Paragraph>
<Paragraph position="4"> Details of the elements in the feature vector: The elements of the feature vector are not distinguished by node skips. This means that 'A-$*$-B-C' is the same element as 'A-B-C', and 'A-$*$-$*$-B-C' and 'A-$*$-B-$*$-C' are also the same element as 'A-B-C'. Considering the hierarchical structure, it is natural to assume that '(N-$*$)-(d)-a' and '(N-$*$)-(($*$-d)-a)' are different elements. However, in the framework of the node skip and the attributes of the non-terminated node, '(N-$*$)-($*$)-a' and '(N-$*$)-(($*$-$*$)-a)' are treated as the same element. This framework achieves approximate matching of the structure automatically. The HDAG Kernel judges, for all pairs of attributes in each attribute sequence, whether the pair lies inside or outside the same chunk. If all pairs of attributes in two attribute sequences are in the same condition, inside or outside the chunk, then the attribute sequences are judged to be the same element.</Paragraph>
<Paragraph position="5"> Table 2 shows the similarity, that is, the values of the common elements of $\phi(G_1)$ and $\phi(G_2)$ and their products, when the feature vectors are explicitly represented. Table 2 includes, for example, the common elements 'a' (2 in $G_1$, 1 in $G_2$; product 2) and 'b', 'c', 'd' (each 1 in $G_1$ and 1 in $G_2$; product 1). We only show the common elements that appear in both $G_1$ and $G_2$, since the number of elements that appear in only $G_1$ or only $G_2$ becomes very large.</Paragraph>
<Paragraph position="6"> Note that, as shown in Table 2, the attribute sequences of a non-terminated node itself are not counted as features of the graph. This is due to the use of the hierarchical structure: the attribute sequences of a non-terminated node come from the combinations of the attributes in the terminated nodes. In the case of $q_1$, the attribute sequence 'N-$*$' comes from 'N' in $q_2$. If we treated both 'N-$*$' in $q_1$ and 'N' in $q_2$, we would evaluate the attribute sequence 'N' in $q_2$ twice. That is why the similarity value in Table 2 contains neither 'c-$*$' in $q_1$ nor '(c-$*$)-$*$' in $r_3$; see Table 1.</Paragraph> </Section>
<Section position="4" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.4 Calculation </SectionTitle>
<Paragraph position="0"> First, we determine $C_n(q, r)$, which returns the sum of the common attribute sequences of the $n$-combinations of attributes between nodes $q$ and $r$.</Paragraph>
<Paragraph position="1"> $a(q, r)$ returns the number of common attributes of nodes $q$ and $r$, not including the attributes of nodes inside $q$ and $r$. We define the function $in(q)$ as returning the set of nodes inside a non-terminated node $q$; $in(q) = \emptyset$ means that node $q$ is a terminated node. For example, $in(q_1) = \{q_2, q_3, q_4\}$ and $in(q_2) = \emptyset$.</Paragraph>
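For concreteness, below is a minimal, hypothetical encoding of an HDAG node; it only demonstrates how $in(q)$ and the set of nodes linking into a node (used in the recursion that follows) can be read off such a structure. The node names mirror the example above, while the attributes are placeholders:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    attributes: set = field(default_factory=set)
    inside: list = field(default_factory=list)   # nodes contained in this node, i.e. in(q)
    preds: list = field(default_factory=list)    # nodes with a direct link to this node

def in_(q):
    """in(q): names of nodes inside a non-terminated node q; empty for a terminated node."""
    return {n.name for n in q.inside}

q2, q3, q4 = Node('q2', {'A'}), Node('q3', {'B'}), Node('q4', {'C'})
q3.preds = [q2, q4]                              # q2 and q4 have direct links to q3
q1 = Node('q1', set(), inside=[q2, q3, q4])      # q1 is non-terminated: it contains a DAG

print(in_(q1))                        # e.g. {'q2', 'q3', 'q4'} (set order may vary)
print(in_(q2))                        # set()
print({p.name for p in q3.preds})     # {'q2', 'q4'}
```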
<Paragraph position="2"> We also define auxiliary functions over pairs of nodes, written $J_n(q, r)$ and $J'_n(q, r)$, and let $pred(q)$ denote the set of nodes that have direct links to node $q$; $pred(q) = \emptyset$ means that no nodes have direct links to $q$. For example, $pred(q_3) = \{q_2, q_4\}$.</Paragraph>
<Paragraph position="4"> Next, we define $K_n(q, r)$ as representing the sum of the common attribute sequences that are the $n$-combinations of attributes extracted from the sub-paths whose sinks are $q$ and $r$, respectively.</Paragraph>
<Paragraph position="5"> According to equation (13), given the recursive definition of $K_n(q, r)$, the similarity between two HDAGs can be calculated in $O(n|Q||R|)$ time.</Paragraph> </Section>
<Section position="5" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.5 Efficient Calculation Method </SectionTitle>
<Paragraph position="0"> We will now elucidate an efficient processing algorithm. First, as a pre-process, the nodes are sorted under the following condition: all nodes that have a path to the focused node, and all nodes in the graph inside the focused node, must be placed before the focused node. We can always obtain at least one such ordering, since we are dealing with an HDAG. In the case of $G_1$, one such order is $\langle q_2, q_4, q_3, q_1, q_5, q_6, q_7 \rangle$. We can rewrite the recursive calculation formula as "for loops" if we follow the sorted order. Figure 3 shows the algorithm of the HDAG Kernel. A dynamic programming technique is used to compute the HDAG Kernel very efficiently: when following the sorted order, the values needed to calculate the focused pair of nodes have already been calculated earlier. We can fill in the table by following the order of the nodes from left to right and top to bottom.</Paragraph>
<Paragraph position="1"> We normalize the computed kernels before their use within the learning algorithms. The normalization corresponds to the standard unit-norm normalization of examples in the feature space: $\hat{K}(x, y) = K(x, y) / \sqrt{K(x, x)\, K(y, y)}$.</Paragraph> </Section> </Section>
<Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Experiments </SectionTitle>
<Paragraph position="0"> We evaluated the performance of the proposed method in actual applications of NLP; the data sets are written in Japanese.</Paragraph>
<Paragraph position="1"> We compared HDAG and DAG (the latter with no hierarchical structure) to the String Subsequence Kernel (SSK) applied to word sequences, the Dependency Structure Kernel (DSK, a case of the Tree Kernel), and the Cosine measure for feature vectors consisting of the occurrences of attributes (BOA), as well as the same as BOA but using only the attributes of nouns and unknown words (BOA').</Paragraph>
<Paragraph position="2"> We expanded SSK and DSK to improve their overall performance in the experiments; we denote the expanded versions as SSK' and DSK', respectively. The original SSK treats only exact-$n$ string combinations based on parameter $n$; for SSK' we consider string combinations of lengths up to $n$. The original DSK was constructed specifically for use with parse trees; we expanded it to handle $n$-combinations of nodes and free ordering in child-node matching.</Paragraph>
<Paragraph position="3"> Figure 4 shows the input objects for each evaluated kernel: (a) for HDAG, (b) for DAG and DSK', and (c) for SSK'. Note that, though DAG and DSK' treat the same input objects, their kernel calculation methods differ, as do the return values.</Paragraph>
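All of the kernels compared in these experiments are normalized before use, as described in Section 3.5. A minimal sketch of that unit-norm normalization for a precomputed kernel matrix (our own illustration, assuming NumPy) is:

```python
import numpy as np

def normalize_kernel(K):
    """Unit-norm normalization: K'(x, y) = K(x, y) / sqrt(K(x, x) * K(y, y))."""
    d = np.sqrt(np.diag(K))
    return K / np.outer(d, d)

K = np.array([[4.0, 2.0],
              [2.0, 9.0]])
print(normalize_kernel(K))   # diagonal becomes 1.0; off-diagonal 2 / (2 * 3) = 0.333...
```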
<Paragraph position="4"> We used the words and the semantic information of "Goi-taikei" (Ikehara et al., 1997), which is similar to WordNet in English, as the attributes of the nodes. The chunks and their relations in the texts were analyzed by cabocha (Kudo and Matsumoto, 2002), and named entities were analyzed by the method of Isozaki and Kazawa (2002).</Paragraph>
<Paragraph position="5"> We tested each $n$-combination case, changing the parameter $\lambda$ from 0.1 to 0.9 in steps of 0.1. Only the best performance achieved over parameter $\lambda$ is shown for each case.</Paragraph>
<Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 Performance as a Similarity Measure: Question Classification </SectionTitle>
<Paragraph position="0"> We used the 1011 questions of NTCIR-QAC1 and the 2000 questions of the CRL-QA data. We assigned them to 148 question types based on the CRL-QA data.</Paragraph>
<Paragraph position="1"> We evaluated classification performance in the following steps. First, we extracted one question from the data. Second, we calculated the similarity between the extracted question and all the other questions. Third, we ranked the questions in descending order of similarity. Finally, we evaluated performance as a similarity measure by Mean Reciprocal Rank (MRR) (Voorhees and Tice, 1999) based on the question types of the ranked questions.</Paragraph>
<Paragraph position="2"> Table 3 shows the results of this experiment.</Paragraph> </Section>
<Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> Sentence Alignment </SectionTitle>
<Paragraph position="0"> The data set (Hirao et al., 2003), taken from the "Mainichi Shinbun", was formed into abstract sentences that were manually aligned to sentences in the "Yomiuri Shinbun" according to the meaning of the sentences (whether they said the same thing).</Paragraph>
<Paragraph position="1"> This experiment was conducted as follows. First, we extracted one abstract sentence from the "Mainichi Shinbun" data set. Second, we calculated the similarity between the extracted sentence and the sentences in the "Yomiuri Shinbun" data set. Third, we ranked the sentences in the "Yomiuri Shinbun" in descending order of the calculated similarity values. Finally, we evaluated performance as a similarity measure using the MRR measure.</Paragraph>
<Paragraph position="2"> Table 4 shows the results of this experiment.</Paragraph> </Section>
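Both similarity-measure experiments score the ranked lists with Mean Reciprocal Rank; a small sketch of the metric as described by the procedure above (ours, not the paper's evaluation script) is:

```python
def mean_reciprocal_rank(ranked_labels, gold_labels):
    """ranked_labels[k] is the label list ranked by similarity for query k;
    the reciprocal rank is 1 / (position of the first item sharing the gold label)."""
    total = 0.0
    for ranking, gold in zip(ranked_labels, gold_labels):
        rr = 0.0
        for pos, label in enumerate(ranking, start=1):
            if label == gold:
                rr = 1.0 / pos
                break
        total += rr
    return total / len(gold_labels)

# Two queries: the correct type appears at rank 1 and rank 3 -> MRR = (1 + 1/3) / 2
print(mean_reciprocal_rank([['LOC', 'PER'], ['PER', 'ORG', 'LOC']],
                           ['LOC', 'LOC']))   # 0.666...
```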
<Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.2 Performance as a Kernel Function: Question Classification </SectionTitle>
<Paragraph position="0"> We evaluated the performance of the comparison methods as kernel functions in a machine learning approach to question classification. We chose SVM as the kernel-based learning algorithm, since it produces state-of-the-art performance in several NLP tasks.</Paragraph>
<Paragraph position="1"> We used the same data set as in the previous experiments, with the following difference: if a question type had fewer than ten questions, we moved its entries into the upper question type, as defined in the CRL-QA data, to provide enough training samples for each question type. We used one-vs-rest as the multi-class classification method and selected the highest-scoring question type. In the case of BOA and BOA', we used the polynomial kernel (Vapnik, 1995) to take attribute combinations into account.</Paragraph>
<Paragraph position="2"> Table 5 shows the average accuracy for each question as evaluated by 5-fold cross-validation.</Paragraph> </Section> </Section> </Paper>