<?xml version="1.0" standalone="yes"?> <Paper uid="W00-1210"> <Title>A trainable method for extracting Chinese entity names and their relations</Title> <Section position="2" start_page="0" end_page="70" type="intro"> <SectionTitle> 1. Introduction </SectionTitle> <Paragraph position="0"> Entity names and their relations form the main content of a document. By grasping the entity names and their relations in a document, we can understand the document to some extent.</Paragraph> <Paragraph position="1"> In the field of information extraction, much work has been done to automatically learn patterns \[1\] from training corpora in order to extract entity names and their relations from English documents. For Chinese, however, researchers primarily use man-made rules and keyword sets to identify entity names \[2\]. Building rules is often complex, error-prone and time-consuming, and usually requires detailed knowledge of system internals. Some researchers also use statistical methods \[3\], but the training needs a large amount of human-annotated data and only local context information can be used. With this in view, we have sought a more efficient and flexible trainable method to resolve this problem.</Paragraph> <Paragraph position="2"> Section 2 first gives a general outline of the trainable method we have defined to extract Chinese entity names and their relations, then describes person name extraction, entity name classification and relation extraction in detail. Section 3 describes the preliminary experimental results of our method. Section 4 contains our remarks and discussion on possible extensions of the proposed work.</Paragraph> <Paragraph position="3"> 2. General Outline of the Method We view the problem of extracting Chinese entity names and their relations as a series of classification problems, such as person name boundary classification, entity name classification, noun phrase boundary classification and relation classification. 
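To make this reduction concrete, here is a minimal sketch (invented names, not the system's actual code): once classifiers can label name boundaries, extraction amounts to reading spans off their decisions. The boundary predicates below are toy stand-ins for trained classifiers.

```python
# Sketch: reduce name extraction to boundary classification.
# is_begin / is_end stand in for trained boundary classifiers.

def extract_spans(words, is_begin, is_end):
    """Return word spans whose begin/end boundaries the classifiers accept."""
    spans = []
    start = None
    for i in range(len(words)):
        if start is None and is_begin(words, i):
            start = i
        if start is not None and is_end(words, i):
            spans.append(words[start:i + 1])
            start = None
    return spans

# Toy predicates: a span starts at "Zhang" and ends at "Wei".
words = ["yesterday", "Zhang", "Wei", "visited", "Beijing"]
spans = extract_spans(words,
                      is_begin=lambda ws, i: ws[i] == "Zhang",
                      is_end=lambda ws, i: ws[i] == "Wei")
print(spans)  # [['Zhang', 'Wei']]
```

Any of the classification problems below (person name boundaries, NP boundaries, entity classes, relation classes) plugs into the same pattern.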
If those classification problems are solved, then we will be able to solve the related extraction problems. For example, if we can correctly classify (or identify) the beginning and ending boundaries of a person name appearing in a document, we will be able to extract the person name.</Paragraph> <Paragraph position="4"> The process can be divided into two stages. The first stage is the learning process, in which several classifiers are built from the training data. The second stage is the extracting process, in which Chinese entity names and their relations are extracted using the learned classifiers. The learning algorithm used in the learning process is memory-based learning (MBL) \[4\]. MBL is a classification-based supervised learning approach. The approach has been named differently in a variety of contexts, such as similarity-based, example-based, analogical, case-based, instance-based, and lazy learning. We selected MBL as our learning algorithm because it is well suited to domains with a large number of features from heterogeneous sources and can remember exceptional and low-frequency cases that are useful to extrapolate from \[5\]. In addition, we can customize the learner using different weighting functions according to linguistic bias.</Paragraph> <Paragraph position="5"> The main steps for the learning process are: base) from the parsed training data. Four training sets are extracted for different tasks, each related to Chinese person names, entity names, noun phrases, or relations between entity names in the training data.</Paragraph> <Paragraph position="6"> The main features used in an example can be local context features, e.g. dependency relation features, or global context features, e.g. the features of a word in the whole document, or surface linguistic features, e.g. 
character features and word features, or deep linguistic features like semantic features.</Paragraph> <Paragraph position="7"> Step 4: Use the MBL algorithm to obtain an IG-Tree \[4\] for each of the four training sets. An IG-Tree is a compressed representation of a training set that can be processed quickly in the classification process. In our case, the resulting IG-Trees are PersonName-IG-Tree, EntityName-IG-Tree, NP-IG-Tree, and Relation-IG-Tree.</Paragraph> <Paragraph position="8"> The main steps for the extracting process are: Step 1: Segment, tag and partially parse the input Chinese documents.</Paragraph> <Paragraph position="9"> Step 2: Identify Chinese person names using PersonName-IG-Tree.</Paragraph> <Paragraph position="10"> Step 3: Identify Chinese organization names using the same method as described in \[2\]. Step 4: Identify other entity names (location, time, number) using the same method as described in \[2\].</Paragraph> <Paragraph position="11"> Step 5: Identify Chinese noun phrases (NP chunking) using NP-IG-Tree.</Paragraph> <Paragraph position="12"> Step 6: Use the entity names and noun phrases extracted to perform partial parsing again and fix the parsing errors.</Paragraph> <Paragraph position="13"> Step 7: Use EntityName-IG-Tree to classify the extracted noun phrases. This step identifies entity names that were missed in the previous steps.</Paragraph> <Paragraph position="14"> Step 8: Use Relation-IG-Tree to identify relations between the extracted entity names. For a better understanding of the algorithm, we will describe in detail the person name extraction, the entity name classification, and the relation extraction in the next subsections. 
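To illustrate how an IG-Tree compresses a training set, here is a much simplified sketch of the idea from \[4\] (not the actual implementation; all names and the toy data are invented): examples are stored in a trie whose levels follow a fixed feature order, each node keeps the majority class as a default, and classification backs off to that default as soon as an unseen feature value is met.

```python
# Simplified IG-Tree sketch: a trie over a fixed feature order with
# per-node default classes for back-off.
from collections import Counter

def build_igtree(examples, feature_order):
    """examples: list of (feature_dict, cls) pairs."""
    node = {"default": Counter(c for _, c in examples).most_common(1)[0][0],
            "children": {}}
    if not feature_order or len({c for _, c in examples}) == 1:
        return node  # fully compressed: one class, no branching needed
    feat, rest = feature_order[0], feature_order[1:]
    by_value = {}
    for ex in examples:
        by_value.setdefault(ex[0][feat], []).append(ex)
    for value, sub in by_value.items():
        node["children"][(feat, value)] = build_igtree(sub, rest)
    return node

def classify(tree, features, feature_order):
    for feat in feature_order:
        child = tree["children"].get((feat, features.get(feat)))
        if child is None:
            break  # unseen value: back off to the stored default class
        tree = child
    return tree["default"]

train = [({"next_word": "said"}, "Person-Name"),
         ({"next_word": "Ltd."}, "Not-Person-Name")]
order = ["next_word"]
tree = build_igtree(train, order)
print(classify(tree, {"next_word": "said"}, order))  # Person-Name
```

In the real system the feature order would be determined by information gain, which is what makes the tree both compact and fast to traverse at classification time.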
Please note that we are not going to discuss NP chunking further since it is beyond the main theme of this paper.</Paragraph> <Section position="1" start_page="66" end_page="68" type="sub_section"> <SectionTitle> 2.1 Person Name extraction </SectionTitle> <Paragraph position="0"> Chinese person names can be divided into two categories: local Chinese person names, which consist of Chinese surnames and given names, and transliterated person names, which are sound translations of foreign names.</Paragraph> <Paragraph position="1"> The length of a local person name ranges from 2 to 6 characters, while the length of a transliterated person name is unrestricted.</Paragraph> <Paragraph position="2"> After segmentation, person names are usually divided into several words. The task is to extract the word sequences that are person name components. With this in view, we convert the person name extraction problem into an equivalent classification problem, i.e. classifying the word sequences in the segmentation results into two classes, namely Person-Name and Not-Person-Name. To classify a word sequence we use a number of features.</Paragraph> <Paragraph position="3"> (1) Word features: the beginning word/tag of the sequence, the ending word/tag of the sequence.</Paragraph> <Paragraph position="4"> (2) Local context features: the n-th (n<=3) word/tag before/after the sequence; the verb before/after the sequence.</Paragraph> <Paragraph position="5"> (3) Context dependency features: the dependency relations of the word sequence and the dependency relations of the first word before/after the sequence. 
The main dependency relations include verb-object (the relation between a verb and its noun object), subject-verb (the relation between a verb and its subject), subject-adj (the relation between an adjective and its subject), adv-verb (the relation between a verb and its adverbial modifier), adv-adj (the relation between an adjective and its adverbial modifier), and modifier-head (the relation between a noun and its modifier; in Chinese, the modifier can be an adjective, a noun, a verb or another phrase).</Paragraph> <Paragraph position="6"> The word features and local context features can be directly extracted from the parsing results of the training data. The extraction of dependency features needs more explanation. We employ the collocation information obtained from a large corpus \[6\] to help the Chinese partial parser do the parsing. In most cases, dependency relations can be taken directly from the parsing results. However, there are some instances in which the parser does not function well, resulting in flat parsing trees. Under these circumstances, we resort to some simple heuristics, such as linear order, for the dependency relations, thus making our method robust enough to extract most of the dependency relations.</Paragraph> <Paragraph position="7"> To make the learning process more efficient, we use Boolean features in the training set, so every feature described above is translated into several Boolean features. For example, for the feature 1th-Next-Word (the first word after the word sequence), its values are the top 500 words (ordered by frequency) that can appear after a person name. We translate it into 500 features, each named 1th-Next-Word-XX, in which XX is one of the 500 words. A feature value of 1 means that XX appears next to the word sequence in the instance. 
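The Boolean translation can be sketched as follows (a hedged illustration; the feature names and the tiny vocabulary are invented, and the real system uses a top-500 vocabulary per feature): each categorical feature expands into one Boolean feature per vocabulary value, and only the active features need to be stored.

```python
# Sketch: translate categorical features into Boolean features, stored
# sparsely as the set of feature names whose value is 1.

def booleanize(features, vocabularies):
    """features: {name: value}; vocabularies: {name: allowed values}.
    Returns the sparse set of active Boolean feature names."""
    active = set()
    for name, value in features.items():
        if value in vocabularies.get(name, ()):
            active.add(f"{name}-{value}")
    return active

# Toy stand-in for the top-500 word list of the 1th-Next-Word feature.
vocab = {"1th-Next-Word": {"said", "reported", "pointed-out"}}
print(booleanize({"1th-Next-Word": "said"}, vocab))
# {'1th-Next-Word-said'}
```

The sparse-set representation is one simple way to realize the "sparse array" storage mentioned below: absent features are implicitly 0 and cost nothing.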
The translated examples have several thousand Boolean features, which would be a big challenge for machine learning algorithms like C4.5 and CN2, but for MBL this is not a big problem.</Paragraph> <Paragraph position="8"> Furthermore, we use a sparse array representation to make the storage requirement much lower.</Paragraph> <Paragraph position="9"> For every word sequence in the training data that meets one of the following three requirements, we extract that word sequence, including its class and all the features described above: (1) Begin with a surname, plus 1 or 2 characters.</Paragraph> <Paragraph position="10"> (2) Begin with two surnames, plus 1 or 2 characters.</Paragraph> <Paragraph position="11"> (3) Begin with a character included in the first-character set of transliterated person names (extracted from the training data), plus several characters. The name must not span a normal word (one included in a list of the 5000 most frequently used Chinese words), because such normal words rarely occur in a transliterated name.</Paragraph> <Paragraph position="12"> For example, if in the training data three words &quot;W1 W2 W3&quot; are annotated as a person name, then we will extract a Person-</Paragraph> <Paragraph position="14"> After all the examples are extracted from the training data, they are fed to the MBL learner to get the PersonName-IG-Tree.</Paragraph> <Paragraph position="15"> In the extracting process, we do the same as in the learning process to extract all examples, but the class of every example is unknown to us in advance. With the PersonName-IG-Tree, we can derive the class of every example, and then all word sequences classified as Person-Name are extracted.</Paragraph> <Paragraph position="16"> When a person name appears more than once in a document, we can rely on a cache mechanism, similar to the one described in \[2\], to resolve ambiguous cases. For example, suppose a person name &quot;~:J~3~&quot; appears more than once in a document. 
We first erroneously extract &quot;~!~gj.~&quot; from the sentence &quot;~i~1~-~ ~ 5~ ~-~ ~--'q&quot;;~ ~i~ ~'t~ I~!1 ~i~ ~ 131 &quot;, then correctly extract &quot;~31U' from another sentence &quot;~3,~-~--:~&quot;. Now &quot;~:\]~gj-~&quot; and &quot;~!5~&quot; are both in the cache, and the cache mechanism will be able to correct the first error based on the heuristic that, if an extracted person name is a substring of another extracted person name and they do not appear in one sentence, then only the substring is the correct person name.</Paragraph> <Paragraph position="17"> To better understand the process of extracting person names, we describe a few intuitive examples below.</Paragraph> <Paragraph position="18"> person name candidates for extraction.</Paragraph> <Paragraph position="19"> Because no dependency relations are found for the next word &quot;-~-&quot; and most training examples with this feature are classified as Not-Person-Name in the training data, &quot;~:\]~.~&quot; is also classified as Not-Person-Name. Both the previous word -,~,~.~:.~.~e~ ~:rm,, and the next word &quot;$.,k~&quot; are positive evidence that &quot;&quot;~!~..~_&quot; should be classified as a person name, so our algorithm correctly extracted it from this sentence.</Paragraph> <Paragraph position="20"> (2) Sample 2 The sentence: '&quot;~,:i~lJ~\]~.il~l\]&quot; The segmentation result: &quot;.~:~ :~lJ Our algorithm correctly classified &quot;\]t~Jl~ll&quot; as not a person name. 
The reason is that in the training data, all word sequences whose previous word is &quot;~&quot; are not person names.</Paragraph> <Paragraph position="21"> These examples show that our method performs disambiguation well thanks to the MBL learner's capability of catching exceptions.</Paragraph> </Section> <Section position="2" start_page="68" end_page="69" type="sub_section"> <SectionTitle> 2.2 Entity Name Classification </SectionTitle> <Paragraph position="0"> The task of entity name classification is to classify the given noun phrases into several categories, such as organization name, product name, location, etc., as well as person names that were missed in the previous extraction.</Paragraph> <Paragraph position="1"> In addition to the features used in person name extraction, more features are needed.</Paragraph> <Paragraph position="2"> Some of these features are equivalent to the features used in Crystal \[8\], such as subject-auxiliary-noun, e.g. the relation between &quot;I~I~I~'~N\] '' and &quot;~.~&quot; in the sentence Some features are specific to Chinese.</Paragraph> <Paragraph position="3"> Semantic features are also included to make the learned classifier more powerful. The semantic features of a word can be taken from a widely used Chinese thesaurus \[7\] that classifies Chinese words into 12 broad categories, 94 middle categories, and 1428 small categories. There are about 70 thousand words in the thesaurus.</Paragraph> <Paragraph position="4"> Unlike other inductive learning systems in the field of information extraction, such as Crystal, we use a general machine learning algorithm to do the learning. The most relevant earlier work is the experiment described in \[8\] using the machine learning algorithm C4.5. Though their experiment showed that the performance of the C4.5-based method was comparable to Crystal, they abandoned this method due to the time complexity of C4.5 when dealing with a large number of features. 
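The thesaurus-based semantic features described above can be sketched as a simple lookup (a toy illustration with invented category codes, not the actual thesaurus \[7\], which has 12 broad, 94 middle and 1428 small categories over about 70 thousand words): each word contributes its category codes at every level as features.

```python
# Sketch: derive semantic features from a (toy) thesaurus lookup.
# Category codes here are invented stand-ins for the real thesaurus codes.

THESAURUS = {
    "father": ("A", "A12", "A12-03"),    # broad, middle, small category
    "company": ("D", "D05", "D05-11"),
}

def semantic_features(word):
    cats = THESAURUS.get(word)
    if cats is None:
        return set()  # out-of-thesaurus word: no semantic features
    return {f"sem-{c}" for c in cats}

print(sorted(semantic_features("father")))
# ['sem-A', 'sem-A12', 'sem-A12-03']
```

Keeping all three levels lets the learner generalize at whichever granularity the training data supports.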
MBL is similar to C4.5 in that both are general machine learning algorithms. They differ in that MBL is a lazy learning algorithm that keeps all training data in memory and only abstracts at classification time, by extrapolating a class from the most similar items in memory; therefore, its time complexity is much lower than that of C4.5, especially when the training data contain a large number of features and examples. Actually, in Soderland's analysis \[8\], MBL's time complexity is only slightly greater than that of Crystal. Though instance-based algorithms like MBL may require a large amount of memory, the advanced hardware available today can overcome this problem.</Paragraph> <Paragraph position="5"> A sparse vector representation will also lower the memory requirement. Taking all this into consideration, MBL is well suited to entity name extraction and relation extraction. Furthermore, the simple instance representation and weighting function make the MBL-based method more flexible and extensible. Any useful features can be added to the system without any modification to the algorithm. In Crystal, however, adding more features may affect the correctness of the weighting function used in finding similar examples. Our method can also employ global features, i.e. features beyond the sentence level. We treat an NP and all its occurrences (including its anaphoric references) in one text as one single example, and all context words that are in some dependency relation to this NP as this example's features. Thus, we can resolve more complicated cases than Crystal. For example, if an NP is in a subj-verb relation with the verb &quot;~,~&quot;, it can be a person name or an organization name. But if we know all the verbs that have subj-verb relations with this NP, then we will know the exact class this NP belongs to.</Paragraph> <Paragraph position="6"> The steps for entity name classification are similar to the steps in person name extraction. 
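The lazy, memory-based classification described above can be sketched as follows (a minimal nearest-neighbour illustration in the spirit of MBL \[4, 5\], not the actual learner; the examples, features and weights are invented): all training instances stay in memory, and a new case takes the class of its most similar stored instance under a feature-weighted overlap metric.

```python
# Sketch: memory-based (lazy) classification with a weighted overlap metric.

def similarity(a, b, weights):
    """Sum the weights of the features on which the two instances agree."""
    return sum(w for f, w in weights.items() if a.get(f) == b.get(f))

def mbl_classify(memory, case, weights):
    """memory: list of (feature_dict, cls); return the nearest neighbour's class."""
    return max(memory, key=lambda ex: similarity(ex[0], case, weights))[1]

memory = [({"verb": "said", "sem": "human"}, "person"),
          ({"verb": "produced", "sem": "org"}, "organization")]
weights = {"verb": 2.0, "sem": 1.0}  # e.g. information-gain weights
print(mbl_classify(memory, {"verb": "said", "sem": "org"}, weights))
# person
```

Because all abstraction is deferred to classification time, exceptional and low-frequency cases stay in memory verbatim, which is exactly the property exploited in the disambiguation examples of Section 2.1.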
Our method is quite impressive in that it can learn a lot of context features to classify the entity names, e.g. it correctly classifies &quot;/~ 1\]~&quot; in &quot;~ t\]~ l~C/J.~.gj-~&quot; (Qiming's father) as a person name. Such a person name cannot be recognized in person name extraction because it does not begin with a surname or a first character of transliterated person names.</Paragraph> </Section> <Section position="3" start_page="69" end_page="70" type="sub_section"> <SectionTitle> 2.3 Relation extraction </SectionTitle> <Paragraph position="0"> This task is to identify relation classes between entity names. Our current classes include employee-of, location-of, product-of, and no-relation. The relations we can extract are by no means restricted to this set.</Paragraph> <Paragraph position="1"> We can expand the set if training data are provided.</Paragraph> <Paragraph position="2"> The features for this task include the features used in Soderland's experiment \[8\]. These features are equivalent to the syntactic-lexical or syntactic-semantic constraints used in Crystal. A feature name begins with the name of the syntactic position (SUBJ, OBJ, PP-OBJ etc.), followed by the name of the constraint and the actual term or class name. For example, &quot;\]l~,~,~,~ '' in the subject position would include the features: SUBJ-Terms-\]\[~l~ i, SUBJ-Terms-~,~ SUBJ-Mod-Terms-Ii~l~ // the terms in the modifier of the subject SUBJ-Head-Terms-,e~, ~ SUBJ-Classes-Employee // the semantic categories of the subject SUBJ-Mod-Classes-Organization SUBJ-Head-Classes-Organization More features are introduced in our method, such as the linear order of the entity names, the word(s) between the entity names, the relative position of the entity names (in one sentence or in neighboring sentences), etc. 
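The extra relation features just listed can be sketched as follows (an illustration with invented feature names, not the system's actual feature set): given two entity mentions, we record their linear order, whether they share a sentence, and the words between them when they do.

```python
# Sketch: build relation features for a pair of entity mentions.
# Each mention is (start, end, sentence_id) over a flat word list.

def relation_features(sent_words, e1, e2):
    (s1, t1, sid1), (s2, t2, sid2) = e1, e2
    feats = {"e1-before-e2": sid1 < sid2 or (sid1 == sid2 and s1 < s2),
             "same-sentence": sid1 == sid2}
    if sid1 == sid2 and t1 <= s2:
        for w in sent_words[t1:s2]:  # words between the two mentions
            feats[f"between-{w}"] = True
    return feats

words = ["Acme", "produces", "widgets"]
f = relation_features(words, (0, 1, 0), (2, 3, 0))
print(f)
# {'e1-before-e2': True, 'same-sentence': True, 'between-produces': True}
```

A feature like between-produces is exactly the kind of evidence that lets the learner separate product-of pairs from no-relation pairs, and same-sentence supports the cross-sentence employee-of case discussed below.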
These features make our method more robust than Crystal.</Paragraph> <Paragraph position="3"> For every two related entity names in the training data, we identify a training example and extract it. After all the examples are extracted from the training data, they are fed to the MBL learner to get the Relation-IG-Tree. In the extracting process, we do the same as in the learning process to extract all pairs of entity names. Then, using the Relation-IG-Tree, we can derive the relation between every pair of entity names.</Paragraph> <Paragraph position="4"> To better understand the process of relation extraction, we describe a couple of examples below.</Paragraph> <Paragraph position="5"> In the entity extraction, we have extracted &quot;~l~l~l~&quot; as a company name and &quot;IT~lj~'f@~.~,:~-&quot; as a product name. In the training data, some training examples have similar sentence patterns, e.g. &quot;Company Name (~J/;~) ...Product Name ~J~\]~&quot;, and most of the time there is a product-of relation between the two entity names.</Paragraph> <Paragraph position="6"> Based on this evidence, a product-of relation can be identified between &quot;~\]:~1~&quot; and &quot;IT~i~'~t:~-~-&quot;. (2) Sample 2 The input text: ~':t~;,~/~t~j~J2~,~'~k~t~ ~l~.~..,,~. :;i~:t~_, ~L~tIJ!:~I,~TCL~I~\[\] In the entity extraction, we have extracted &quot;~:t:~&quot; as a person name and &quot;TCL~,~I~&quot; as a company name. Now we want to test whether these two entity names have an employee-of relation. As can be seen in the training data, if a person name and a company name appear in neighboring sentences, and no other person names or company names are found in between, they tend to have an employee-of relation. Based on this evidence, an employee-of relation can be identified between &quot;~::\[:~&quot; and &quot;TCL~\[\]&quot;. 
Current systems, such as Crystal, would find this case difficult to resolve because the two entity names appear in different sentences.</Paragraph> </Section> </Section> </Paper>