<?xml version="1.0" standalone="yes"?>
<Paper uid="N06-1039">
  <Title>Hurricane Date (Affected Place) Articles</Title>
  <Section position="4" start_page="306" end_page="308" type="metho">
    <SectionTitle>
3 Implementation
</SectionTitle>
    <Paragraph position="0"> The overall process to generate basic patterns and discover relations from unannotated news articles is shown in Figure 4. Theoretically this could be a straight pipeline, but due to the nature of the implementation we process some stages separately and combine them in the later stage. In the following subsection, we explain each component.</Paragraph>
    <Section position="1" start_page="306" end_page="307" type="sub_section">
      <SectionTitle>
3.1 Web Crawling and Basic Clustering
</SectionTitle>
      <Paragraph position="0"> First of all, we need a lot of news articles from multiple news sources. We created a simple web crawler that extract the main texts from web pages. We observed that the crawler can correctly take the main texts from about 90% of the pages from each news site. We ran the crawler every day on several news sites. Then we applied a simple clustering algorithm to the obtained articles in order to find a set of arti- null cles that talk about exactly the same news and form a basic cluster.</Paragraph>
      <Paragraph position="1"> We eliminate stop words and stem all the other words, then compute the similarity between two articles by using a bag-of-words approach. In news articles, a sentence that appears in the beginning of an article is usually more important than the others. So we preserved the word order to take into account the location of each sentence. First we computed a word vector from each article:</Paragraph>
      <Paragraph position="3"> (A) is a vector element of word w in article A, IDF(w) is the inverse document frequency of word w, and POS(w,A) is a list of w's positions in the article. avgwords is the average number of words for all articles. Then we calculated the cosine value of each pair of vectors:</Paragraph>
      <Paragraph position="5"> We computed the similarity of all possible pairs of articles from the same day, and selected the pairs  whose similarity exceeded a certain threshold (0.65 in this experiment) to form a basic cluster.</Paragraph>
    </Section>
    <Section position="2" start_page="307" end_page="307" type="sub_section">
      <SectionTitle>
3.2 Parsing and GLARFing
</SectionTitle>
      <Paragraph position="0"> After getting a set of basic clusters, we pass them to an existing statistical parser (Charniak, 2000) and rule-based tree normalizer to obtain a GLARF structure for each sentence in every article. The current implementation of a GLARF converter gives about 75% F-score using parser output. For the details of GLARF representation and its conversion, see Meyers et al. (2001b).</Paragraph>
    </Section>
    <Section position="3" start_page="307" end_page="307" type="sub_section">
      <SectionTitle>
3.3 NE Tagging and Coreference Resolution
</SectionTitle>
      <Paragraph position="0"> In parallel with parsing and GLARFing, we also apply NE tagging and coreference resolution for each article in a basic cluster. We used an HMM-based NE tagger whose performance is about 85% in Fscore. This NE tagger produces ACE-type Named  document coreference resolution for each article, we connect the entities among different articles in the same basic cluster to obtain cross-document coreference entities with simple string matching.</Paragraph>
    </Section>
    <Section position="4" start_page="307" end_page="308" type="sub_section">
      <SectionTitle>
3.4 Basic Pattern Generation
</SectionTitle>
      <Paragraph position="0"> After getting a GLARF structure for each sentence and a set of documents whose entities are tagged and connected to each other, we merge the two outputs and create a big network of GLARF structures whose nodes are interconnected across different sentences/articles. Now we can generate basic patterns for each entity. First, we compute the weight for each cross-document entity E in a certain basic cluster as follows:</Paragraph>
      <Paragraph position="2"> where e [?] E is an entity within one article and mentions(e) and firstsent(e) are the number of mentions of entity e in a document and the position  of the sentence where entity e first appeared, respectively. C is a constant value which was 0.5 in this experiment. To reduce combinatorial complexity, we took only the five most highly weighted entities from each basic cluster to generate basic patterns. We observed these five entities can cover major relations that are reported in a basic cluster.</Paragraph>
      <Paragraph position="3"> Next, we obtain basic patterns from the GLARF structures. We used only the first ten sentences in each article for getting basic patterns, as most important facts are usually written in the first few sentences of a news article. Figure 5 shows all the basic patterns obtained from the sentence &amp;quot;Katrina hit Louisiana's coast.&amp;quot; The shaded nodes &amp;quot;Katrina&amp;quot; and &amp;quot;Louisiana&amp;quot; are entities from which each basic pattern originates. We take a path of GLARF nodes from each entity node until it reaches any predicative node: noun, verb, or adjective in this case. Since the nodes &amp;quot;hit&amp;quot; and &amp;quot;coast&amp;quot; can be predicates in this example, we obtain three unique paths &amp;quot;Louisiana+T-POS:coast (Louisiana's coast)&amp;quot;, &amp;quot;Katrina+SBJ:hit (Katrina hit something)&amp;quot;, and &amp;quot;Katrina+SBJ:hit-OBJ:coast (Katrina hit some coast)&amp;quot;.</Paragraph>
      <Paragraph position="4"> To increase the specificity of patterns, we generate extra basic patterns by adding a node that is immediately connected to a predicative node. (From this example, we generate two basic patterns: &amp;quot;hit&amp;quot; and &amp;quot;hit-coast&amp;quot; from the &amp;quot;Katrina&amp;quot; node.) Notice that in a GLARF structure, the type of each argument such as subject or object is preserved in an edge even if we extract a single path of a graph. Now, we replace both entities &amp;quot;Katrina&amp;quot; and &amp;quot;Louisiana&amp;quot; with variables  based on their NE tags and obtain parameterized patterns: &amp;quot;GPE+T-POS:coast (Louisiana's coast)&amp;quot;, &amp;quot;PER+SBJ:hit (Katrina hit something)&amp;quot;, and &amp;quot;PER+SBJ:hit-OBJ:coast (Katrina hit some coast)&amp;quot;.</Paragraph>
      <Paragraph position="5"> After taking all the basic patterns from every basic cluster, we compute the Inverse Cluster Frequency (ICF) of each unique basic pattern. ICF is similar to the Inverse Document Frequency (IDF) of words, which is used to calculate the weight of each basic pattern for metaclustering.</Paragraph>
    </Section>
    <Section position="5" start_page="308" end_page="308" type="sub_section">
      <SectionTitle>
3.5 Metaclustering
</SectionTitle>
      <Paragraph position="0"> Finally, we can perform metaclustering to obtain tables. We compute the similarity between each basic cluster pair, as seen in Figure 6. X  , respectively. We examine all possible mappings of relations (parallel mappings of multiple entities) from both basic clusters, and find all the mappings M whose similarity score exceeds a certain threshold. wordsim(c  ) is the bag-of-words similarity of two clusters. As a weighting function</Paragraph>
      <Paragraph position="2"> clusters that include p all clusters ) We then sort the similarities of all possible pairs of basic clusters, and try to build a metacluster by taking the most strongly connected pair first. Note that in this process we may assign one basic cluster to several different metaclusters. When a link is found between two basic clusters that were already assigned to a metacluster, we try to put them into all the existing metaclusters it belongs to. However, we allow a basic cluster to be added only if it can fill all the columns in that table. In other words, the first two basic clusters (i.e. an initial two-row table) determines its columns and therefore define the relation of that table.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="308" end_page="308" type="metho">
    <SectionTitle>
4 Experiment and Evaluation
</SectionTitle>
    <Paragraph position="0"> We used twelve newspapers published mainly in the U.S. We collected their articles over two months (from Sep. 21, 2005 - Nov. 27, 2005). We obtained 643,767 basic patterns and 7,990 unique types. Then we applied metaclustering to these basic clusters  and obtained 302 metaclusters (tables). We then removed duplicated rows and took only the tables that had 3 or more rows. Finally we had 101 tables. The total number the of articles and clusters we used are shown in Table 2.</Paragraph>
    <Section position="1" start_page="308" end_page="308" type="sub_section">
      <SectionTitle>
4.1 Evaluation Method
</SectionTitle>
      <Paragraph position="0"> We evaluated the obtained tables as follows. For each row in a table, we added a summary of the source articles that were used to extract the relation. Then for each table, an evaluator looks into every row and its source article, and tries to come up with a sentence that explains the relation among its columns. The description should be as specific as possible. If at least half of the rows can fit the explanation, the table is considered &amp;quot;consistent.&amp;quot; For each consistent table, the evaluator wrote down the sentence using variable names ($1, $2, ...) to refer to its columns. Finally, we counted the number of consistent tables. We also counted how many rows in each table can fit the explanation.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>