<?xml version="1.0" standalone="yes"?>
<Paper uid="N06-1039">
  <Title>Hurricane Date (Affected Place) Articles</Title>
  <Section position="3" start_page="304" end_page="306" type="intro">
    <SectionTitle>
2 Basic Idea
</SectionTitle>
    <Paragraph position="0"> In Unrestricted Relation Discovery, the discovery process (i.e. creating new tables) can be formulated as a clustering task. The key idea is to cluster a set of articles that contain entities bearing a similar relation to each other in such a way that we can construct a table where the entities that play the same role are placed in the same column.</Paragraph>
    <Paragraph position="1"> Suppose that there are two articles, A and B, and both report hurricane-related news. Article A contains two entities, &quot;Katrina&quot; and &quot;New Orleans&quot;, and article B contains &quot;Longwang&quot; and &quot;Taiwan&quot;. These entities are recognized by a Named Entity (NE) tagger. We want to discover a relation among them. First, we introduce a notion called a &quot;basic pattern&quot; to form a relation. A basic pattern is a part of the text that is syntactically connected to an entity. Some examples are &quot;X is hit&quot; or &quot;Y's residents&quot;. Figure 1 shows several basic patterns connected to the entities &quot;Katrina&quot; and &quot;New Orleans&quot; in article A. Similarly, we obtain the basic patterns for article B. Now, in Figure 2, both entities &quot;Katrina&quot; and &quot;Longwang&quot; have the basic pattern &quot;headed&quot; in common. In this case, we connect these two entities to each other. Furthermore, there is also a common basic pattern &quot;was-hit&quot; shared by &quot;New Orleans&quot; and &quot;Taiwan&quot;. Now we have found two sets of entities that can be placed in correspondence at the same time. What does this mean? We can infer that both entity sets (&quot;Katrina&quot;-&quot;New Orleans&quot; and &quot;Longwang&quot;-&quot;Taiwan&quot;) represent a certain relation that has something in common: a hurricane name and the place it affected. By finding multiple parallel correspondences between two articles, we can estimate the similarity of their relations.</Paragraph>
    <Paragraph position="2"> Generally, in a clustering task, one groups items by finding similar pairs. After finding a pair of articles that have a similar relation, we can bring them into the same cluster. In this case, we cluster articles by using their basic patterns as features. However, each basic pattern is still connected to its entity so that we can extract the name from it. We can consider a basic pattern to represent something like the &quot;role&quot; of its entity. In this example, the entities that had &quot;headed&quot; as a basic pattern are hurricanes, and the entities that had &quot;was-hit&quot; as a basic pattern are the places they affected. By using basic patterns, we can align the entities into the corresponding column that represents a certain role in the relation. From this example, we create a two-by-two table, where each column represents the role of its entities and each row represents a different article, as shown at the bottom of Figure 2.</Paragraph>
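    <Paragraph> The alignment described above can be illustrated with a minimal sketch. It is not the implementation used in this work: the function names, the dictionary representation of an article, and the extra patterns &quot;hit&quot; and &quot;residents&quot; are assumptions made only for illustration. Two entities are linked when their basic pattern sets overlap, and each link becomes one column of the table.

```python
from itertools import product


def align_entities(article_a, article_b):
    """Pair entities from two articles that share at least one basic pattern.

    Each article is a dict mapping an entity name to its set of basic
    patterns (a hypothetical representation chosen for this sketch).
    Returns a list of (entity_a, entity_b, shared_patterns) tuples.
    """
    links = []
    pairs = product(article_a.items(), article_b.items())
    for (ent_a, pats_a), (ent_b, pats_b) in pairs:
        shared = pats_a.intersection(pats_b)
        if shared:
            links.append((ent_a, ent_b, shared))
    return links


def build_table(article_a, article_b):
    """Build a table with one column per link and one row per article."""
    links = align_entities(article_a, article_b)
    # Label each column with one of the shared patterns (the "role").
    header = [sorted(shared)[0] for _, _, shared in links]
    row_a = [ent_a for ent_a, _, _ in links]
    row_b = [ent_b for _, ent_b, _ in links]
    return header, [row_a, row_b]


# Toy pattern sets; only "headed" and "was-hit" come from the running example.
article_a = {"Katrina": {"headed", "hit"}, "New Orleans": {"was-hit", "residents"}}
article_b = {"Longwang": {"headed"}, "Taiwan": {"was-hit"}}

header, rows = build_table(article_a, article_b)
print(header)
for row in rows:
    print(row)
```

    With these toy inputs the sketch yields the two-by-two table of the running example: a &quot;headed&quot; column holding the hurricane names and a &quot;was-hit&quot; column holding the affected places.</Paragraph>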
    <Paragraph position="3"> We can extend this table by finding another article in the same manner. In this way, we gradually extend a table while retaining a relation among its columns.</Paragraph>
    <Paragraph position="4"> In this example, the obtained table is just what an IE system (whose task is to find a hurricane name and the affected place) would create.</Paragraph>
    <Paragraph position="5"> However, these articles might also include other things, which could represent different relations. For example, the governments might call for help or some casualties might have been reported. To obtain such relations, we need to choose different entities from the articles. Several existing works have tried to extract a certain type of relation by manually choosing different pairs of entities (Brin, 1998; Ravichandran and Hovy, 2002).</Paragraph>
    <Paragraph position="6"> Hasegawa et al. (2004) tried to extract multiple relations by choosing entity types. We assume that we can find such relations by trying all possible combinations from a set of entities we have chosen in advance; some combinations might represent a hurricane and government relation, and others might represent a place and its casualties. To ensure that an article can have several different relations, we let each article belong to several different clusters.</Paragraph>
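    <Paragraph> As a rough illustration of trying all possible combinations, the following sketch enumerates candidate entity sets from a list chosen in advance. The size limits, the function name, and the entity &quot;FEMA&quot; are assumptions introduced for this example; the text does not specify how large a combination may be.

```python
from itertools import combinations


def candidate_entity_sets(entities, min_size=2, max_size=3):
    """Enumerate every combination of the pre-selected entities.

    Each combination is a candidate for one relation. The size bounds
    are assumptions of this sketch, not values taken from the paper.
    """
    candidates = []
    for size in range(min_size, max_size + 1):
        candidates.extend(combinations(entities, size))
    return candidates


# "FEMA" is a hypothetical third entity added for illustration.
entities = ["Katrina", "New Orleans", "FEMA"]
candidates = candidate_entity_sets(entities)
for candidate in candidates:
    print(candidate)
```

    Because each candidate set is clustered separately, the same article can end up in several clusters, one for each relation it expresses.</Paragraph>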
    <Paragraph position="7"> In a real-world situation, using only basic patterns sometimes gives undesired results. For example, &quot;(President) Bush flew to Texas&quot; and &quot;(Hurricane) Katrina flew to New Orleans&quot; both have the basic pattern &quot;flew to&quot; in common, so &quot;Bush&quot; and &quot;Katrina&quot; would be put into the same column. But we want to separate them into different tables. To alleviate this problem, we put an additional restriction on clustering. We use a bag-of-words approach to discriminate two articles: if the word-based similarity between two articles is too small, we do not bring them together into the same cluster (i.e. table). We exclude names from the similarity calculation at this stage because we want to link articles about the same type of event, not the same instance. In addition, we use the frequency of each basic pattern to compute the similarity of relations, since basic patterns like &quot;say&quot; or &quot;have&quot; appear in almost every article and it is dangerous to rely on such expressions.</Paragraph>
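    <Paragraph> The two safeguards above can be sketched as follows. This is an illustration under stated assumptions rather than the actual method: cosine similarity, the threshold of 0.3, and the inverse-document-frequency weight are all choices made for this example, since the text only says that word-based similarity and pattern frequency are used.

```python
import math
from collections import Counter


def bag_of_words(tokens, names):
    """Count word occurrences, skipping entity names so that articles
    are compared by event type rather than by event instance."""
    return Counter(t.lower() for t in tokens if t not in names)


def cosine(bag_a, bag_b):
    """Cosine similarity between two word-count bags (an assumed measure)."""
    dot = sum(bag_a[w] * bag_b[w] for w in bag_a if w in bag_b)
    norm_a = math.sqrt(sum(c * c for c in bag_a.values()))
    norm_b = math.sqrt(sum(c * c for c in bag_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def may_share_table(tokens_a, tokens_b, names, threshold=0.3):
    """Allow two articles into the same table only if their word-based
    similarity reaches the threshold (0.3 is an arbitrary placeholder)."""
    bag_a = bag_of_words(tokens_a, names)
    bag_b = bag_of_words(tokens_b, names)
    return cosine(bag_a, bag_b) >= threshold


def pattern_weight(pattern, doc_freq, num_docs):
    """Assumed IDF-style weight: zero for a pattern found in every article."""
    return math.log(num_docs / doc_freq[pattern])


names = {"Katrina", "Longwang"}
same_type = may_share_table(
    ["Katrina", "hit", "the", "coast"],
    ["Longwang", "hit", "the", "coast"],
    names,
)
doc_freq = {"say": 10, "headed": 2}  # toy counts over 10 articles
print(same_type, pattern_weight("say", doc_freq, 10), pattern_weight("headed", doc_freq, 10))
```

    In this toy setting the two hurricane sentences pass the gate once the names are removed, and a pattern such as &quot;say&quot; that occurs in every article receives zero weight, so it cannot drive the similarity of relations.</Paragraph>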
    <Section position="1" start_page="305" end_page="306" type="sub_section">
      <SectionTitle>
Increasing Basic Patterns
</SectionTitle>
      <Paragraph position="0"> In the above explanation, we have assumed that we can obtain enough basic patterns from an article.</Paragraph>
      <Paragraph position="1"> However, the actual number of basic patterns that one can find from a single article is usually not enough, because the number of sentences is rather small in comparison to the variation of expressions.</Paragraph>
      <Paragraph position="2"> So having two articles that have multiple basic patterns in common is very unlikely. We increase the number of articles for obtaining basic patterns by using a cluster of comparable articles that report the same event instead of a single article. We call this cluster of articles a &quot;basic cluster.&quot; Using basic clusters instead of single articles also helps to increase the redundancy of data. We can give more confidence to repeated basic patterns.</Paragraph>
      <Paragraph position="3"> Note that the notion of &quot;basic cluster&quot; is different from the clusters used for creating tables explained above. In the following sections, a cluster for creating a table is called a &quot;metacluster,&quot; because this is a cluster of basic clusters. A basic cluster consists of a set of articles that report the same event which happens at a certain time, and a metacluster consists of a set of events that contain the same relation over a certain period.</Paragraph>
      <Paragraph position="4"> We try to increase the number of articles in a basic cluster by looking at multiple news sources simultaneously. We use a clustering algorithm that uses a vector-space-model to obtain basic clusters. Then we apply cross-document coreference resolution to connect entities of different articles within a basic cluster. This way, we can increase the number of basic patterns connected to each entity. Also, it allows us to give a weight to entities. We calculate their weights using the number of occurrences within a cluster and their position within an article. These entities are used to obtain basic patterns later.</Paragraph>
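      <Paragraph> A minimal sketch of such an entity weighting is given below. The exact formula is not stated in this section, so the one used here (each mention contributes the reciprocal of one plus its sentence index, so earlier mentions count more) is purely an assumption, as are the function name and the input layout. Entities are assumed to have already been merged by cross-document coreference resolution.

```python
from collections import defaultdict


def entity_weights(basic_cluster):
    """Weight entities by how often and how early they are mentioned.

    basic_cluster: list of articles; each article is a list of sentences;
    each sentence is a list of canonical entity names (after coreference).
    The 1 / (1 + sentence index) contribution is an assumed formula.
    """
    weights = defaultdict(float)
    for sentences in basic_cluster:
        for index, entities in enumerate(sentences):
            for entity in entities:
                weights[entity] += 1.0 / (1 + index)
    return dict(weights)


# Two toy articles reporting the same event.
basic_cluster = [
    [["Katrina", "New Orleans"], ["Katrina"]],
    [["Katrina"], ["New Orleans"]],
]
weights = entity_weights(basic_cluster)
print(weights)
```

      Entities that are mentioned often and near the top of articles receive the largest weights, which matches the intuition that they are central to the event.</Paragraph>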
      <Paragraph position="5"> We also use a parser and tree normalizer to generate basic patterns. The format of basic patterns is crucial to performance. We think a basic pattern should be somewhat specific, since each pattern should capture an entity with some relevant context. But at the same time a basic pattern should be general enough to reduce data sparseness. We choose a predicate-argument structure as a natural solution for this problem. Compared to traditional constituent trees, a predicate-argument structure is a higher-level representation of sentences that has gained wide acceptance from the natural language community recently. In this paper we used a logical feature structure called GLARF proposed by Meyers et al. (2001a). A GLARF converter takes a syntactic tree as an input and augments it with several features. Figure 3 shows a sample GLARF structure obtained from the sentence &quot;Katrina hit Louisiana's coast.&quot; We used GLARF for two reasons: first, unlike traditional constituent parsers, GLARF has an ability to regularize several linguistic phenomena such as participial constructions and coordination. This allows us to handle this syntactic variety in a uniform way. Second, an output structure can be easily converted into a directed graph that represents the relationship between each word, without losing significant information from the original sentence. Compared to an ordinary constituent tree, it is easier to extract syntactic relationships. In the next section, we discuss how we used this structure to generate basic patterns.</Paragraph>
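      <Paragraph> To make the directed-graph view concrete, the sketch below represents the sample sentence as labeled edges and reads off the patterns attached to an entity. It does not reproduce actual GLARF output: the edge triples, the role labels &quot;SBJ&quot;, &quot;OBJ&quot; and &quot;POSS&quot;, and the &quot;head:role&quot; pattern format are hypothetical and chosen only for illustration.

```python
def basic_patterns(edges, entity):
    """Return a 'head:role' string for every edge that points at the entity.

    edges: list of (head, role, dependent) triples forming a directed
    graph over the words of a sentence (a hypothetical encoding).
    """
    return sorted(
        f"{head}:{role}"
        for head, role, dependent in edges
        if dependent == entity
    )


# Hand-written edges for "Katrina hit Louisiana's coast."
edges = [
    ("hit", "SBJ", "Katrina"),
    ("hit", "OBJ", "coast"),
    ("coast", "POSS", "Louisiana"),
]

katrina_patterns = basic_patterns(edges, "Katrina")
louisiana_patterns = basic_patterns(edges, "Louisiana")
print(katrina_patterns, louisiana_patterns)
```

      Because every word is connected to its head by a labeled edge, collecting the context of an entity reduces to a simple filter over the edge list, which is harder to express over an ordinary constituent tree.</Paragraph>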
    </Section>
  </Section>
</Paper>