
<?xml version="1.0" standalone="yes"?>
<Paper uid="P04-1053">
  <Title>Discovering Relations among Named Entities from Large Corpora</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Relation Discovery
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Overview
</SectionTitle>
      <Paragraph position="0"> We propose a new approach to relation discovery from large text corpora. Our approach is based on 2A research and evaluation program in information extraction organized by the U.S. Government.</Paragraph>
      <Paragraph position="1"> context based clustering of pairs of entities. We assume that pairs of entities occurring in similar context can be clustered and that each pair in a cluster is an instance of the same relation. Relations between entities are discovered through this clustering process. In cases where the contexts linking a pair of entities express multiple relations, we expect that the pair of entities either would not be clustered at all, or would be placed in a cluster corresponding to its most frequently expressed relation, because its contexts would not be sufficiently similar to contexts for less frequent relations. We assume that useful relations will be frequently mentioned in large corpora. Conversely, relations mentioned once or twice are not likely to be important.</Paragraph>
      <Paragraph position="2"> Our basic idea is as follows:  1. tagging named entities in text corpora 2. getting co-occurrence pairs of named entities and their context 3. measuring context similarities among pairs of named entities 4. making clusters of pairs of named entities 5. labeling each cluster of pairs of named entities  We show an example in Figure 1. First, we find the pair of ORGANIZATIONs (ORG) A and B, and the pair of ORGANIZATIONs (ORG) C and D, after we run the named entity tagger on our newspaper corpus. We collect all instances of the pair A and B occurring within a certain distance of one another. Then, we accumulate the context words intervening between A and B, such as &amp;quot;be offer to buy&amp;quot;, &amp;quot;be negotiate to acquire&amp;quot;.3 In same way, we also accumulate context words intervening between C and D. If the set of contexts of A and B and those of C and D are similar, these two pairs are placed into the same cluster. A - B and C - D would be in the same relation, in this case, merger and acquisition (M&amp;A). That is, we could discover the relation between these ORGANIZATIONs.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Named entity tagging
</SectionTitle>
      <Paragraph position="0"> Our proposed method is fully unsupervised. We do not need richly annotated corpora or any initial manually selected seeds. Instead of them, we use a named entity (NE) tagger. Recently developed named entity taggers work quite well and extract named entities from text at a practically usable 3We collect the base forms of words which are stemmed by a POS tagger (Sekine, 2001). But verb past participles are distinguished from other verb forms in order to distinguish the passive voice from the active voice.</Paragraph>
      <Paragraph position="1">  level. In addition, the set of types of named entities has been extended by several research groups. For example, Sekine proposed 150 types of named entities (Sekine et al., 2002). Extending the range of NE types would lead to more effective relation discovery. If the type ORGANIZATION could be divided into subtypes, COMPANY, MILITARY, GOVERN-MENT and so on, the discovery procedure could detect more specific relations such as those between COMPANY and COMPANY.</Paragraph>
      <Paragraph position="2"> We use an extended named entity tagger (Sekine, 2001) in order to detect useful relations between extended named entities.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.3 NE pairs and context
</SectionTitle>
      <Paragraph position="0"> We define the co-occurrence of NE pairs as follows: two named entities are considered to co-occur if they appear within the same sentence and are separated by at most N intervening words.</Paragraph>
      <Paragraph position="1"> We collect the intervening words between two named entities for each co-occurrence. These words, which are stemmed, could be regarded as the context of the pair of named entities. Different orders of occurrence of the named entities are also considered as different contexts. For example, a0a2a1a4a3a5a3a5a3a6a0a8a7 and a0a4a7a2a3a5a3a5a3a6a0a9a1 are collected as different contexts, where a0a10a1 and a0a8a7 represent named entities. Less frequent pairs of NEs should be eliminated because they might be less reliable in learning relations. So we have set a frequency threshold to remove those pairs.</Paragraph>
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.4 Context similarity among NE pairs
</SectionTitle>
      <Paragraph position="0"> We adopt a vector space model and cosine similarity in order to calculate the similarities between the set of contexts of NE pairs. We only compare NE pairs which have the same NE types, e.g., one PERSON - GPE pair and another PERSON - GPE pair. We define a domain as a pair of named entity types, e.g., the PERSON-GPE domain. For example, we have to detect relations between PERSON and GPE in the PERSON-GPE domain.</Paragraph>
      <Paragraph position="1"> Before making context vectors, we eliminate stop words, words in parallel expressions, and expressions peculiar to particular source documents (examples of these are given below), because these expressions would introduce noise in calculating similarities. null A context vector for each NE pair consists of the bag of words formed from all intervening words from all co-occurrences of two named entities. Each word of a context vector is weighed by tf*idf, the product of term frequency and inverse document frequency. Term frequency is the number of occurrences of a word in the collected context words. The order of co-occurrence of the named entities is also considered. If a word a11a13a12 occurred a14 times in context a0a9a1a4a3a5a3a5a3a6a0a4a7 and a15 times in context a0a16a7a2a3a5a3a5a3a6a0a2a1 , the term frequency a0a2a1a2a12 of the word a11 a12 is defined as a14a4a3 a15 , where a0a9a1 and a0a8a7 are named entities. We think that this term frequency of a word in different orders would be effective to detect the direction of a relation if the arguments of a relation have the same NE types. Document frequency is the number of documents which include the word.</Paragraph>
      <Paragraph position="2"> If the norm a5 a6a7a5 of the context vector a6 is extremely small due to a lack of content words, the cosine similarity between the vector and others might be unreliable. So, we also define a norm threshold in advance to eliminate short context vectors.</Paragraph>
      <Paragraph position="3"> The cosine similarity a8a10a9a12a11a14a13a16a15 a0a18a17a20a19a22a21 between context vectors a6 and a23 is calculated by the following formula. null</Paragraph>
      <Paragraph position="5"> Cosine similarity varies from a32 to a3 a32 . A cosine similarity of a32 would mean these NE pairs have exactly the same context words with the NEs appearing predominantly in the same order, and a cosine similarity of a3 a32 would mean these NE pairs have exactly the same context words with the NEs appearing predominantly in reverse order.</Paragraph>
    </Section>
    <Section position="5" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.5 Clustering NE pairs
</SectionTitle>
      <Paragraph position="0"> After we calculate the similarity among context vectors of NE pairs, we make clusters of NE pairs based on the similarity. We do not know how many clusters we should make in advance, so we adopt hierarchical clustering. Many clustering methods were proposed for hierarchical clustering, but we adopt complete linkage because it is conservative in making clusters. The distance between clusters is taken to be the distance of the furthest nodes between clusters in complete linkage.</Paragraph>
    </Section>
    <Section position="6" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.6 Labeling clusters
</SectionTitle>
      <Paragraph position="0"> If most of the NE pairs in the same cluster had words in common, the common words would represent the characterization of the cluster. In other words, we can regard the common words as the characterization of a particular relation.</Paragraph>
      <Paragraph position="1"> We simply count the frequency of the common words in all combinations of the NE pairs in the same cluster. The frequencies are normalized by the number of combinations. The frequent common words in a cluster would become the label of the cluster, i.e. they would become the label of the relation, if the cluster would consist of the NE pairs in the same relation.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Experiments
</SectionTitle>
    <Paragraph position="0"> We experimented with one year of The New York Times (1995) as our corpus to verify our proposed method. We determined three parameters for thresholds and identified the patterns for parallel expressions and expressions peculiar to The New York Times as ignorable context. We set the maximum context word length to 5 words and set the frequency threshold of co-occurring NE pairs to 30 empirically. We also used the patterns, &amp;quot;,.*,&amp;quot;, &amp;quot;and&amp;quot; and &amp;quot;or&amp;quot; for parallel expressions, and the pattern &amp;quot;) --&amp;quot; (used in datelines at the beginning of articles) as peculiar to The New York Times. In our experiment, the norm threshold was set to 10.</Paragraph>
    <Paragraph position="1"> We also used stop words when context vectors are made. The stop words include symbols and words which occurred under 3 times as infrequent words and those which occurred over 100,000 times as highly frequent words.</Paragraph>
    <Paragraph position="2"> We applied our proposed method to The New York Times 1995, identified the NE pairs satisfying our criteria, and extracted the NE pairs along with their intervening words as our data set. In order to evaluate the relations detected automatically, we analyzed the data set manually and identified the relations for two different domains. One was the PERSON-GPE (PER-GPE) domain. We obtained 177 distinct NE pairs and classified them into 38 classes (relations) manually. The other was the COMPANY-COMPANY (COM-COM) domain. We got 65 distinct NE pairs and classified them into 10 classes manually. However, the types of both arguments of a relation are the same in the COM-COM domain. So the COM-COM domain includes symmetrical relations as well as asymmetrical relations. For the latter, we have to distinguish the different orders of arguments. We show the types of classes and the number in each class in Table 1. The errors in NE tagging were eliminated to evaluate our method correctly.</Paragraph>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 Evaluation
</SectionTitle>
    <Paragraph position="0"> We evaluated separately the placement of the NE pairs into clusters and the assignment of labels to these clusters. In the first step, we evaluated clusters consisting of two or more pairs. For each cluster, we determined the relation (R) of the cluster as the most frequently represented relation; we call this the major relation of the cluster. NE pairs with relation R in a cluster whose major relation was R were counted as correct; the correct pair count, a33a35a34a37a36a39a38a2a38a41a40a42a34a44a43 , is defined as the total number of correct pairs in all clusters. Other NE pairs in the cluster were counted as incorrect; the incorrect pair count, a33 a12a46a45a47a34a37a36a39a38a2a38a41a40a42a34a44a43 , is also defined as the total number of incorrect pairs in all clusters. We evaluated clusters based on Recall,  sures as follows.</Paragraph>
    <Paragraph position="1"> Recall (R) How many correct pairs are detected out of all the key pairs? The key pair count, a33a1a0 a40a3a2 , is defined as the total number of pairs manually classified in clusters of two or more pairs.</Paragraph>
    <Paragraph position="2"> Recall is defined as follows:</Paragraph>
    <Paragraph position="4"> among the pairs clustered automatically? Precision is defined as follows:</Paragraph>
    <Paragraph position="6"> tion of recall and precision according to the following formula:</Paragraph>
    <Paragraph position="8"> These values vary depending on the threshold of cosine similarity. As the threshold is decreased, the clusters gradually merge, finally forming one big cluster. We show the results of complete linkage clustering for the PERSON-GPE (PER-GPE) domain in Figure 2 and for the COMPANY-COMPANY (COM-COM) domain in Figure 3. With these metrics, precision fell as the threshold of cosine similarity was lowered. Recall increased until the threshold was almost 0, at which point it fell because the total number of correct pairs in the remaining few big clusters decreased. The best F-measure was 82 in the PER-GPE domain, 77 in the COM-COM domain. In both domains, the best F-measure was found near 0 cosine similarity. Generally, it is difficult to determine the threshold of similarity in advance. Since the best threshold of cosine similarity was almost same in the two domains, we fixed the cosine threshold at a single value just above zero for both domains for simplicity.</Paragraph>
    <Paragraph position="9"> We also investigated each cluster with the threshold of cosine similarity just above 0. We got 34  ing the threshold of cosine similarity in complete linkage clustering for the PERSON-GPE domain  PER-GPE clusters and 15 COM-COM clusters. We show the F-measure, recall and precision at this cosine threshold in both domains in Table 2. We got 80 F-measure in the PER-GPE domain and 75 F-measure in the COM-COM domain. These values were very close to the best F-measure.</Paragraph>
    <Paragraph position="10"> Then, we evaluated the labeling of clusters of NE pairs. We show the larger clusters for each domain, along with the ratio of the number of pairs bearing the major relation to the total number of pairs in each cluster, on the left in Table 3. (As noted above, the major relation is the most frequently represented relation in the cluster.) We also show the most frequent common words and their relative frequency in each cluster on the right in Table 3. If two NE pairs in a cluster share a particular context word, we consider these pairs to be linked (with respect to this word). The relative frequency for a word is the number of such links, relative to the maximal possible number of links (a33 a17 a33 a3 a32 a21a1a0 a10 for a cluster of a33 pairs). If the relative frequency is a32 a3a3a2 , the word is shared by all NE pairs. Although we obtained some meaningful relations in small clusters, we have omitted the small clusters because the common words in such small clusters might be unreliable. We found that all large clusters had appropriate relations and that the common words which occurred frequently in those clusters accurately represented the relations. In other words, the frequent common words could be regarded as suitable labels for the relations.</Paragraph>
  </Section>
  <Section position="7" start_page="0" end_page="0" type="metho">
    <SectionTitle>
6 Discussion
</SectionTitle>
    <Paragraph position="0"> The results of our experiments revealed good performance. The performance was a little higher in the PER-GPE domain than in the COM-COM domain, perhaps because there were more NE pairs with high cosine similarity in the PER-GPE domain than in the COM-COM domain. However, the graphs in both domains were similar, in particular when the cosine similarity was under 0.2.</Paragraph>
    <Paragraph position="1"> We would like to discuss the differences between the two domains and the following aspects of our unsupervised method for discovering the relations: a4 properties of relations a4 appropriate context word length a4 selecting best clustering method a4 covering less frequent pairs We address each of these points in turn.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
6.1 Properties of relations
</SectionTitle>
      <Paragraph position="0"> We found that the COM-COM domain was more difficult to judge than the PER-GPE domain due to the similarities of relations. For example, the pair of companies in M&amp;A relation might also subsequently appear in the parent relation.</Paragraph>
      <Paragraph position="1"> Asymmetric properties caused additional difficulties in the COM-COM domain, because most relations have directions. We have to recognize the direction of relations, a5a7a6 a8 vs. a8 a6 a5 , to distinguish, for example, &amp;quot;A is parent company of B&amp;quot; and &amp;quot;B is parent company of A&amp;quot;. In determining the similarities between the NE pairs A and B and the NE pairs C and D, we must calculate both the similarity a5a9a6a10a8 with a11a12a6a10a13 and the similarity a5a14a6a15a8 with a13a12a6a16a11 . Sometimes the wrong correspondence ends up being favored. This kind of error was observed in 2 out of the 15 clusters, due to the fact that words happened to be shared by NE pairs aligned in the wrong direction more than in right direction. null</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
6.2 Context word length
</SectionTitle>
      <Paragraph position="0"> The main reason for undetected or mis-clustered NE pairs in both domains is the absence of common words in the pairs' context which explicitly represent the particular relations. Mis-clustered NE pairs were clustered based on another common word which occurred by accident. If the maximum context length were longer than the limit of 5 words which we set in the experiments, we could detect additional common words, but the noise would also increase. In our experiments, we used only the words between the two NEs. Although the outer context words (preceding the first NE or following the second NE) may be helpful, extending the context in this way will have to be carefully evaluated. It is future work to determine the best context word length.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
6.3 Clustering method
</SectionTitle>
      <Paragraph position="0"> We tried single linkage and average linkage as well as complete linkage for making clusters. Complete linkage was the best clustering method because it yielded the highest F-measure. Furthermore, for the other two clustering methods, the threshold of cosine similarity producing the best F-measure was different in the two domains. In contrast, for complete linkage the optimal threshold was almost the same in the two domains. The best threshold of cosine similarity in complete linkage was determined to be just above 0; when this threshold reaches 0, the F-measure drops suddenly because the pairs need not share any words. A threshold just above 0 means that each combination of NE pairs in the same cluster shares at least one word in common -- and most of these common words were pertinent to the relations. We consider that this is relevant to context word length. We used a relatively small maximum context word length - 5 words - making it less likely that noise words appear in common across different relations. The combination of complete linkage and small context word length proved useful for relation discovery.</Paragraph>
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
6.4 Less frequent pairs
</SectionTitle>
      <Paragraph position="0"> As we set the frequency threshold of NE co-occurrence to 30, we will miss the less frequent NE pairs. Some of those pairs might be in valuable relations. For the less frequent NE pairs, since the context varieties would be small and the norms of context vectors would be too short, it is difficult to reliably classify the relation based on those pairs. One way of addressing this defect would be through bootstrapping. The problem of bootstrapping is how to select initial seeds; we could resolve this problem with our proposed method. NE pairs which have many context words in common in each cluster could be promising seeds. Once these seeds have been established, additional, lower-frequency NE pairs could be added to these clusters based on more relaxed keyword-overlap criteria.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>
Download Original XML