<?xml version="1.0" standalone="yes"?> <Paper uid="I05-2023"> <Title>Improved-Edit-Distance Kernel for Chinese Relation Extraction</Title> <Section position="4" start_page="132" end_page="132" type="metho"> <SectionTitle> 2 Kernel-based Machine Learning </SectionTitle> <Paragraph position="0"> Most machine learning methods represent an object as a feature vector. They are well-known feature-based learning methods.</Paragraph> <Paragraph position="1"> Kernel methods (Cristianini and Shawe-Taylor, 2000) are an attractive alternative to feature-based methods. The kernel methods retain the original representation of objects and use the object only via computing a kernel function between a pair of objects. As we know, a kernel function is a similarity function satisfying certain properties. There are a number of learning algorithms that can operate using only the dot product of examples. We call them kernel machines. For instance, the Perceptron learning algorithm (Cristianini and Shawe-Taylor, 2000), Support Vector Machine (SVM) (Vapnik, 1998) and so on.</Paragraph> </Section> <Section position="5" start_page="132" end_page="133" type="metho"> <SectionTitle> 3 Relation Extraction Problem </SectionTitle> <Paragraph position="0"> We regard the RE problem as a classification learning problem. We only consider the relation between two entities in a sentence and no relations across sentences. For example, the sen-</Paragraph> <Section position="1" start_page="132" end_page="132" type="sub_section"> <SectionTitle> 3.1 Feature-based Methods </SectionTitle> <Paragraph position="0"> The feature-based methods have to transform the context into features. Expert knowledge is required for deciding which elements or their combinations thereof are good features. Usually these features' values are binary (0 or 1).</Paragraph> <Paragraph position="1"> The feature-based methods will cost lots of labor to find suitable features for a particular application field. Another problem is that we can either select only the local features with a small window or we will have to spend much more training and test time. At the same time, the feature-based methods will not use the combination of these features. null</Paragraph> </Section> <Section position="2" start_page="132" end_page="133" type="sub_section"> <SectionTitle> 3.2 Kernel-based Methods </SectionTitle> <Paragraph position="0"> Different from the feature-based methods, kernel-based methods do not require much labor on extracting the suitable features. As explained in the introduction to Section 2, we retain the original string form of objects and consider the similarity function between two objects. For the problem of the person-affiliation relation extraction, the objects are the context around people and organization with a fixed window size w. It means that we get w words around each entity as the samples in the classification problem. Again considering be written as &quot;n ORGPEOb&quot; Through the objects transformed from the original texts, we can calculate the similarity between any two objects by using the kernel (similarity) function.</Paragraph> <Paragraph position="1"> For the Chinese relation extraction problem, we must consider the semantic similarity between words and the structure of strings while computing similarity. Therefore we must consider the kernel function which has a good similarity measure. 
<Paragraph position="1"> For the Chinese relation extraction problem, we must consider both the semantic similarity between words and the structure of the strings when computing similarity. Therefore we need a kernel function that provides a good similarity measure. The existing methods for computing the similarity between two strings are the same-word based method (Nirenburg et al., 1993), the thesaurus based method (Qin et al., 2003), the Edit-Distance method (Ristad and Yianilos, 1998), and the statistical method (Chatterjee, 2001). The same-word based method cannot handle synonyms. The thesaurus based method overcomes this difficulty but does not consider the structure of the text. The Edit-Distance method uses the structure of the text, but it has the same problem with the replacement of synonyms. The statistical method needs large corpora of similar text and is therefore difficult to use in realistic applications.</Paragraph> <Paragraph position="2"> For the reasons described above, we propose a novel Improved-Edit-Distance (IED) method for calculating the similarity between two Chinese strings.</Paragraph> </Section> </Section> <Section position="6" start_page="133" end_page="134" type="metho"> <SectionTitle> 4 IED Kernel Method </SectionTitle> <Paragraph position="0"> Like other kernel methods, the IED kernel method has two components: the kernel function and the kernel machine. As the kernel function, we use the IED method to calculate the semantic similarity between two Chinese strings. As the kernel machine, we tested the Voted Perceptron in dual form and an SVM with a customized kernel. In the following subsections, we introduce the kernel function (the IED method) and the kernel machines.</Paragraph> <Section position="1" start_page="133" end_page="134" type="sub_section"> <SectionTitle> 4.1 Improved-Edit-Distance </SectionTitle> <Paragraph position="0"> Before introducing IED, we give a brief review of the classical Edit-Distance method (Ristad and Yianilos, 1998).</Paragraph> <Paragraph position="1"> The edit distance between two strings is defined as the minimum number of edit operations necessary to transform one string into the other. There are three edit operations: Insert, Delete, and Replace. For example, in Figure 1(a), the character-level edit distance between "(like apples)" and "(like bananas)" is 4, as indicated by the four dotted lines.</Paragraph>
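<Paragraph> For reference, the classical edit distance just reviewed can be computed with the standard dynamic program below (a minimal Python sketch, not code from the paper):

    def edit_distance(s, t):
        """Classical edit distance with unit-cost Insert, Delete, and Replace."""
        m, n = len(s), len(t)
        # dp[i][j] holds the distance between s[:i] and t[:j].
        dp = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(m + 1):
            dp[i][0] = i                              # delete all of s[:i]
        for j in range(n + 1):
            dp[0][j] = j                              # insert all of t[:j]
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                replace = 0 if s[i - 1] == t[j - 1] else 1
                dp[i][j] = min(dp[i - 1][j] + 1,      # Delete
                               dp[i][j - 1] + 1,      # Insert
                               dp[i - 1][j - 1] + replace)
        return dp[m][n]

    # edit_distance("kitten", "sitting") == 3

Applied to Chinese strings character by character, this is the measure whose shortcomings are discussed next.
</Paragraph>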
<Paragraph position="2"> This way of computing the edit distance between two Chinese strings does not reflect the actual situation. First, the Edit-Distance method measures similarity at the level of single Chinese characters, but in Chinese most single characters carry no concrete meaning on their own and cannot express the meanings of words. Second, the cost of the Replace operation should differ between word pairs; for example, replacing "(love)" with "(like)" should have a small cost, because the two words are synonyms. Finally, inserting a few words into a string should not change its meaning very much; for example, "(like apples)" and "(like sweet apples)" are very similar.</Paragraph> <Paragraph position="3"> Based on these observations, we propose the IED method for computing the similarity between two Chinese strings. It uses Chinese words, rather than characters, as the basic unit of measurement. By using a thesaurus, the similarity between two Chinese words can be computed, and at the same time the cost of the Insert operation is reduced.</Paragraph> <Paragraph position="4"> Here, we use CiLin (Mei et al., 1996) as the thesaurus resource for computing the similarity between two Chinese words. In CiLin, the semantics of words are divided into High, Middle, and Low classes, describing a semantic system from general to specific; each word thus carries one or more three-level semantic codes.</Paragraph> <Paragraph position="5"> The semantic distance between word A and word B is defined as Dist(A, B) = min dist(a, b) over all a in A and b in B, where A and B are the semantic code sets of word A and word B respectively. The distance between two semantic codes a and b is dist(a, b) = 2 / (3 - d), where d is the class at which the two codes first differ; if the two codes are identical, the semantic distance is 0.</Paragraph> <Paragraph position="6"> According to Table 1, we can then define the cost of each edit operation in IED; see Table 2, where "→ ∅" denotes the Delete operation.</Paragraph> <Paragraph position="7"> With the redefined operation costs, the computation of the IED between the two example strings of Figure 1(a) is shown in Figure 1(b), where the Replace cost of "(love)" → "(like)" is 0.5 and that of "(apples)" → "(bananas)" is 0.7. The IED cost is thus 1.2. Compared with the classical edit distance, the IED cost reflects the actual situation much better.</Paragraph> <Paragraph position="8"> We compute the IED by dynamic programming, in the same way as the classical edit distance.</Paragraph> <Paragraph position="9"> To compare two strings we convert the distance value into a similarity. Empirically, the maximal similarity is set to 10, and the similarity of two Chinese strings is 10 minus their improved edit distance.</Paragraph> </Section>
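<Paragraph> The word-level dynamic program of Section 4.1 can be sketched as follows (a minimal Python illustration; the concrete cost values and the word-similarity function are placeholders standing in for the CiLin-based costs of Tables 1 and 2, not the paper's exact figures):

    # Hypothetical stand-ins for the CiLin-based costs of Tables 1 and 2.
    INSERT_COST = 0.4   # reduced: a few inserted words barely change the meaning
    DELETE_COST = 1.0

    def ied(x, y, sem_dist):
        """Improved Edit Distance between two word-segmented strings.

        x, y: lists of Chinese words
        sem_dist: semantic distance between two words, e.g. Dist(A, B)
                  computed from their CiLin codes (0 for identical words,
                  small for synonyms)
        """
        m, n = len(x), len(y)
        dp = [[0.0] * (n + 1) for _ in range(m + 1)]
        for i in range(1, m + 1):
            dp[i][0] = dp[i - 1][0] + DELETE_COST
        for j in range(1, n + 1):
            dp[0][j] = dp[0][j - 1] + INSERT_COST
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                dp[i][j] = min(dp[i - 1][j] + DELETE_COST,      # Delete
                               dp[i][j - 1] + INSERT_COST,      # Insert
                               dp[i - 1][j - 1] + sem_dist(x[i - 1], y[j - 1]))  # Replace
        return dp[m][n]

    def ied_kernel(x, y, sem_dist, max_sim=10.0):
        """Similarity used as the kernel value: 10 minus the IED (Section 4.1)."""
        return max_sim - ied(x, y, sem_dist)
</Paragraph>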
<Section position="2" start_page="134" end_page="134" type="sub_section"> <SectionTitle> 4.2 Kernel Machines </SectionTitle> <Paragraph position="0"> We use the Voted Perceptron and SVM algorithms as the kernel machines here.</Paragraph> <Paragraph position="1"> The Voted Perceptron algorithm is described in (Freund and Schapire, 1998). As the implementation of the SVM method, we used SVMlight (Joachims, 1998), which supports custom kernels; in our experiments we simply plugged in the IED kernel function as the custom kernel.</Paragraph>
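<Paragraph> To make the kernel-machine side concrete, here is a minimal dual-form perceptron sketch in Python. For brevity it is the plain perceptron rather than the Voted Perceptron actually used in the experiments (the voted variant additionally stores the intermediate hypotheses and lets them vote); the point is that the learner touches the data only through kernel values:

    def train_kernel_perceptron(X, y, kernel, epochs=10):
        """Dual-form perceptron: the data enter only through kernel values.

        X: training objects (e.g. the context strings of Section 3.2)
        y: labels, each -1 or +1
        kernel: similarity function, e.g. ied_kernel
        Returns the dual coefficients alpha.
        """
        n = len(X)
        alpha = [0] * n
        # Precompute the Gram matrix, since kernel evaluations are costly.
        K = [[kernel(X[i], X[j]) for j in range(n)] for i in range(n)]
        for _ in range(epochs):
            for i in range(n):
                s = sum(alpha[j] * y[j] * K[j][i] for j in range(n))
                if not s * y[i] > 0:      # mistake on example i
                    alpha[i] += 1         # strengthen example i
        return alpha

    def predict(x, X, y, alpha, kernel):
        """Classify a new object using only kernel evaluations."""
        s = sum(alpha[j] * y[j] * kernel(X[j], x) for j in range(len(X)))
        return 1 if s > 0 else -1
</Paragraph>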
</Section> </Section> <Section position="7" start_page="134" end_page="135" type="metho"> <SectionTitle> 5 Experiments and Results </SectionTitle> <Paragraph position="0"> In this section, we show how to extract the person-affiliation relation from text and give some experimental results. It is relatively straightforward to extend the IED kernel method to other RE problems.</Paragraph> <Paragraph position="1"> The corpus for our experiments comes from Beijing Youth Daily. We annotated about 500 news articles with PEO and ORG named entities and selected 4,200 sentences (examples) containing PEO-ORG pairs, as described in Section 3. There are about 1,200 positive and 3,000 negative examples. We took about 2,500 random examples as training data and the remaining roughly 1,700 examples as test data.</Paragraph> <Section position="1" start_page="134" end_page="135" type="sub_section"> <SectionTitle> 5.1 Influence of Window Size in Kernel Methods </SectionTitle> <Paragraph position="0"> Table 3 shows how the performance of the IED kernel method varies with the window size w. Here the Voted Perceptron is used as the kernel machine.</Paragraph> <Paragraph position="1"> Our experimental results show that the IED kernel method achieves its best performance, with the highest F-Score, at window size w = 2. As w grows, Precision becomes higher; with smaller w, Recall becomes higher.</Paragraph> </Section> <Section position="2" start_page="135" end_page="135" type="sub_section"> <SectionTitle> 5.2 Comparison between Feature and Kernel Methods </SectionTitle> <Paragraph position="0"> For the feature-based implementation, we use the words around the PEO and ORG entities together with their POS tags, again within window size w (see Section 3). All examples can thus be transformed into feature vectors. We used the Regularized Winnow learning algorithm (Zhang, 2001) to train on the training data and predict on the test data.</Paragraph> <Paragraph position="1"> The experimental results show that the feature-based method also performs best at w = 2.</Paragraph> <Paragraph position="2"> Table 4 compares the performance of the feature-based and kernel-based methods.</Paragraph> <Paragraph position="3"> Figure 2 displays how the F-Score of the different methods varies with the training data size.</Paragraph> <Paragraph position="4"> From Table 4 and Figure 2, we can see that the IED kernel methods perform better than the feature-based methods on the person-affiliation relation extraction problem.</Paragraph> <Paragraph position="5"> Figure 2 also shows that the Voted Perceptron comes close to, but does not quite match, the performance of the SVM on the RE problem. However, the Voted Perceptron saves significantly on computation time and programming effort.</Paragraph> </Section> </Section> <Section position="8" start_page="135" end_page="136" type="metho"> <SectionTitle> 6 Discussion </SectionTitle> <Paragraph position="0"> Our experimental results show that both the kernel-based and the feature-based methods achieve their best performance, with the highest F-Score, at window size w = 2. This indicates that for the relation extraction problem, the two words around the entities are the most significant ones. Moreover, as w grows, Precision becomes higher, and as w shrinks, Recall becomes higher.</Paragraph> <Paragraph position="1"> From Table 4 and Figure 2, we can see that the IED kernel method performs very well on person-affiliation relation extraction. Furthermore, it needs no expensive feature selection stage, unlike the feature-based methods. Because the IED kernel method uses the semantic similarity between words, it generalizes better, and it requires far fewer examples than the feature-based methods to achieve the same performance.</Paragraph> <Paragraph position="2"> For example, consider the test sentence "... IBM ..." (Chairman Hu Jintao met the CEO of IBM Corporation). The feature-based method judges the Hu Jintao-IBM pair to be a person-affiliation relation, because the context around Hu Jintao and IBM resembles the typical context of that relation. The IED kernel method, however, makes the correct judgment based on the structural information, and thus achieves higher precision on such cases. At the same time, because the IED kernel method generalizes over synonyms, its recall does not decrease very much.</Paragraph> <Paragraph position="3"> Speed is a practical problem in applying kernel-based methods: kernel-based classifiers are relatively slow compared to feature-based classifiers, mainly because computing the kernel (similarity) function takes much time. Improving the efficiency of the kernel computation is therefore a key problem.</Paragraph> </Section> </Paper>