<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-0208">
  <Title>Learning Domain-Specific Information Extraction Patterns from the Web</Title>
  <Section position="5" start_page="66" end_page="67" type="metho">
    <SectionTitle>
3 Learning IE Patterns from a Fixed Training Set
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="66" end_page="67" type="sub_section">
      <SectionTitle/>
      <Paragraph position="0"> As our baseline system, we created an IE system for the MUC-4 terrorism domain using the AutoSlog-TS extraction pattern learning system (Riloff, 1996; Riloff and Phillips, 2004), which is freely available for research use. AutoSlog-TS is a weakly supervised learner that requires two sets of texts for training: texts that are relevant to the domain and texts that are irrelevant to the domain. The MUC-4 data includes relevance judgments (implicit in the answer keys), which we used to partition our training set into relevant and irrelevant subsets.</Paragraph>
      <Paragraph position="1"> AutoSlog-TS' learning process has two phases.</Paragraph>
      <Paragraph position="2"> In the first phase, syntactic patterns are applied to the training corpus in an exhaustive fashion, so that extraction patterns are generated for (literally) every lexical instantiation of the patterns that appears in the corpus. For example, the syntactic pattern "&lt;subj&gt; PassVP" would generate extraction patterns for all verbs that appear in the corpus in a passive voice construction. The subject of the verb will be extracted. In the terrorism domain, some of these extraction patterns might be: "&lt;subj&gt; PassVP(murdered)" and "&lt;subj&gt; PassVP(bombed)". These would match sentences such as "the mayor was murdered" and "the embassy and hotel were bombed". Figure 1 shows the 17 types of extraction patterns that AutoSlog-TS currently generates. PassVP refers to passive voice verb phrases (VPs), ActVP refers to active voice VPs, InfVP refers to infinitive VPs, and AuxVP refers to VPs where the main verb is a form of "to be" or "to have". Subjects (subj), direct objects (dobj), PP objects (np), and possessives can be extracted by the patterns.</Paragraph>
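      <Paragraph>
As a minimal illustrative sketch of this first phase (not AutoSlog-TS itself: the input format, with each clause pre-annotated as a (subject, verb, voice) triple, and the function name are our own simplifications), the exhaustive instantiation of the passive-verb template could look like:

# Sketch of phase 1: instantiate the "subj PassVP" template for every
# passive-voice verb observed in the corpus. A real system would obtain the
# clause structure from a shallow parser.
def instantiate_passive_patterns(clauses):
    patterns = set()
    for subject, verb, voice in clauses:
        if voice == "passive":
            # e.g. "the mayor was murdered" yields the pattern PassVP(murdered),
            # which extracts its subject NP ("the mayor")
            patterns.add(("subj", "PassVP", verb))
    return patterns

clauses = [("the mayor", "murdered", "passive"),
           ("the embassy and hotel", "bombed", "passive"),
           ("the group", "claimed", "active")]
print(instantiate_passive_patterns(clauses))
# {('subj', 'PassVP', 'murdered'), ('subj', 'PassVP', 'bombed')}
      </Paragraph>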
      <Paragraph position="3"> In the second phase, AutoSlog-TS applies all of the generated extraction patterns to the training corpus and gathers statistics for how often each pattern occurs in relevant versus irrelevant texts.</Paragraph>
      <Paragraph position="4"> The extraction patterns are subsequently ranked based on their association with the domain, and then a person manually reviews the patterns, deciding which ones to keep[1] and assigning thematic roles to them. We manually defined selectional restrictions for each slot type (victim and target) and then automatically added these to each pattern when the role was assigned. [Footnote 1: Typically, many patterns are strongly associated with the domain but will not extract information that is relevant to the IE task. For example, in this work we only care about patterns that will extract victims and targets. Patterns that extract other types of information are not of interest.]</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="67" end_page="67" type="metho">
    <SectionTitle>
IE patterns
</SectionTitle>
    <Paragraph position="1"> On our training set, AutoSlog-TS generated 40,553 distinct extraction patterns. A person manually reviewed all of the extraction patterns that had a score ≥ 0.951 and frequency ≥ 3. This score corresponds to AutoSlog-TS' RlogF metric, described in (Riloff, 1996). The lowest ranked patterns that passed our thresholds had at least 3 relevant extractions out of 5 total extractions. In all, 2,808 patterns passed the thresholds. The reviewer ultimately decided that 396 of the patterns were useful for the MUC-4 IE task, of which 291 were useful for extracting victims and targets.</Paragraph>
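    <Paragraph>
As a quick sanity check of the threshold quoted above (assuming the standard RlogF definition from Riloff (1996), i.e. the relevance rate multiplied by the log of the relevant frequency), a pattern with 3 relevant extractions out of 5 total scores exactly at the cutoff:

import math

relevant, total = 3, 5
# RlogF(pattern) = (relevant / total) * log2(relevant)
rlogf = (relevant / total) * math.log2(relevant)
print(round(rlogf, 3))   # 0.951, matching the reported score threshold
    </Paragraph>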
  </Section>
  <Section position="7" start_page="67" end_page="68" type="metho">
    <SectionTitle>
4 Data Collection
</SectionTitle>
    <Paragraph position="0"> In this research, our goal is to automatically learn IE patterns from a large, domain-independent text collection, such as the Web. The billions of freely available documents on the World Wide Web, and its ever-growing size, make the Web a potential source of data for many corpus-based natural language processing tasks. Indeed, many researchers have recently tapped the Web as a data source for improving performance on NLP tasks (e.g., Resnik (1999), Ravichandran and Hovy (2002), Keller and Lapata (2003)). Despite these successes, collecting data from the Web poses several problems: web pages contain material that is not free text, such as advertisements, embedded scripts, tables, and captions; the documents cover many genres, and it is not easy to identify documents of a particular genre or domain; and most of the documents are in HTML, so some amount of processing is required to extract the free text. In the following subsections we describe the process of collecting a corpus of terrorism-related CNN news articles from the Web.</Paragraph>
    <Section position="1" start_page="67" end_page="68" type="sub_section">
      <SectionTitle>
4.1 Collecting Domain-Specific Texts
</SectionTitle>
      <Paragraph position="0"> Our goal was to automatically identify and collect a set of documents that are similar in domain to the MUC-4 terrorism text collection. To create such a corpus, we issued hand-crafted queries to a search engine, constructed so that the majority of the returned documents would be terrorism-related. Each query consisted of two parts: (1) the name of a terrorist organization, and (2) a word or phrase describing a terrorist action (such as bombed, kidnapped, etc.).</Paragraph>
      <Paragraph position="1"> The following lists of 5 terrorist organizations and 16 terrorist actions were used to create search engine queries:
Terrorist organizations: Al Qaeda, ELN, FARC, HAMAS, IRA
Terrorist actions: assassinated, assassination, blew up, bombed, bombing, bombs, explosion, hijacked, hijacking, injured, kidnapped, kidnapping, killed, murder, suicide bomber, wounded.</Paragraph>
      <Paragraph position="2"> We created a total of 80 different queries representing each possible combination of a terrorist organization and a terrorist action.</Paragraph>
      <Paragraph position="3"> We used the Google search engine, via the freely available Google API, to locate the texts on the Web. To ensure that we retrieved only CNN news articles, we restricted the search to the domain "cnn.com" by adding the "site:" option to each of the queries. We also restricted the search to English language documents by initializing the API with the lang_en option. We deleted documents whose URLs contained the word "transcript" because most of these were transcriptions of CNN's TV shows and were stylistically very different from written text. We ran the 80 queries twice, once in December 2005 and once in April 2005, which produced 3,496 documents and 3,309 documents, respectively.</Paragraph>
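      <Paragraph>
For concreteness, the 80 site-restricted queries and the URL filtering described above could be assembled roughly as follows (a sketch only: the actual Google API call and the lang_en restriction are omitted, and treating the terms as quoted phrases is our assumption):

from itertools import product

ORGANIZATIONS = ["Al Qaeda", "ELN", "FARC", "HAMAS", "IRA"]
ACTIONS = ["assassinated", "assassination", "blew up", "bombed", "bombing",
           "bombs", "explosion", "hijacked", "hijacking", "injured",
           "kidnapped", "kidnapping", "killed", "murder", "suicide bomber",
           "wounded"]

# Every combination of an organization and an action, restricted to cnn.com.
queries = ['site:cnn.com "{}" "{}"'.format(org, act)
           for org, act in product(ORGANIZATIONS, ACTIONS)]
assert len(queries) == 80

def keep_url(url):
    # Transcripts of TV shows are stylistically unlike written news text.
    return "transcript" not in url.lower()
      </Paragraph>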
      <Paragraph position="4"> After removing duplicate articles, we were left  with a total of 6,182 potentially relevant terrorism articles.</Paragraph>
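      <Paragraph>
The paper does not specify how duplicates were detected; one simple possibility (purely an assumption for illustration) is exact-match hashing of the downloaded article text:

import hashlib

def deduplicate(articles):
    # articles: list of (url, text) pairs; keep the first copy of each identical text
    seen, unique = set(), []
    for url, text in articles:
        digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append((url, text))
    return unique
      </Paragraph>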
    </Section>
    <Section position="2" start_page="68" end_page="68" type="sub_section">
      <SectionTitle>
4.2 Processing the Texts
</SectionTitle>
      <Paragraph position="0"> The downloaded documents were all HTML documents containing HTML tags and JavaScript intermingled with the news text. The CNN web pages typically also contained advertisements, text for navigating the website, headlines, and links to other stories. All of these things could be problematic for our information extraction system, which was designed to process narrative text using a shallow parser. Thus, simply deleting all HTML tags on the page would not have given us natural language sentences. Instead, we took advantage of the uniformity of the CNN web pages to "clean" them and extract just the sentences corresponding to the news story.</Paragraph>
      <Paragraph position="1"> We used a tool called HTMLParser to parse the HTML code, and then deleted all nodes in the HTML parse trees corresponding to tables, comments, and embedded scripts (such as JavaScript or VBScript). The system automatically extracted news text starting from the headline (embedded in an H1 HTML element) and inferred the end of the article text using a set of textual clues such as "Feedback:", "Copyright 2005", "contributed to this report", etc. In case of any ambiguity, all of the text on the web page was extracted.</Paragraph>
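      <Paragraph>
The cleaning steps above used the HTMLParser tool; a rough Python analogue (our sketch, using BeautifulSoup instead of HTMLParser, with the end-of-article clues taken from the text above) might look like:

from bs4 import BeautifulSoup, Comment

END_CLUES = ("Feedback:", "Copyright 2005", "contributed to this report")

def extract_story(html):
    soup = BeautifulSoup(html, "html.parser")
    # Drop tables, embedded scripts/styles, and comments, as described above.
    for node in soup.find_all(["table", "script", "style"]):
        node.decompose()
    for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
        comment.extract()
    text = soup.get_text("\n")
    # Start from the headline (the H1 element), if one is present.
    h1 = soup.find("h1")
    if h1 is not None:
        start = text.find(h1.get_text().strip())
        if start != -1:
            text = text[start:]
    # Cut at the first end-of-article clue; otherwise keep all the text,
    # mirroring the "when in doubt, keep everything" behaviour described above.
    for clue in END_CLUES:
        end = text.find(clue)
        if end != -1:
            text = text[:end]
            break
    return text
      </Paragraph>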
      <Paragraph position="2"> The size of the text documents ranged from 0 bytes to 255 kilobytes. The empty documents were due to dead links that the search engine had indexed at an earlier time, but which no longer existed. Some extremely small documents also resulted from web pages that had virtually no free text on them, so only a few words remained after the HTML had been stripped. Consequently, we removed all documents less than 10 bytes in size. Upon inspection, we found that many of the largest documents were political articles, such as political party platforms and transcriptions of political speeches, which contained only brief references to terrorist events. To prevent the large documents from skewing the corpus, we also deleted all documents over 10 kilobytes in size. At the end of this process we were left with a CNN terrorism news corpus of 5,618 documents, with an average length of about 648 words per document. In the rest of the paper we will refer to these texts as "the CNN terrorism web pages".</Paragraph>
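      <Paragraph>
A minimal sketch of the size filter described above (thresholds as stated; the helper name and the 1,024-byte kilobyte convention are our assumptions):

import os

def keep_document(path):
    # Discard near-empty pages (dead links, no free text) and very large pages
    # (speeches, party platforms) that would skew the corpus.
    size = os.path.getsize(path)
    return size in range(10, 10 * 1024 + 1)   # at least 10 bytes, at most 10 KB
      </Paragraph>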
    </Section>
  </Section>
  <Section position="8" start_page="68" end_page="69" type="metho">
    <SectionTitle>
5 Learning Domain-Specific IE Patterns
from Web Pages
</SectionTitle>
    <Paragraph position="0"> Having created a large domain-specific corpus from the Web, we are faced with the problem of identifying the useful extraction patterns from these new texts. Our basic approach is to use the patterns learned from the fixed training set as seed patterns to identify sentences in the CNN terrorism web pages that describe a terrorist event. We hypothesized that extraction patterns occurring in the same sentence as a seed pattern are likely to be associated with terrorism.</Paragraph>
    <Paragraph position="1"> Our process for learning new domain-specific IE patterns has two phases, which are described in the following sections. Section 5.1 describes how we produce a ranked list of candidate extraction patterns from the CNN terrorism web pages. Section 5.2 explains how we filter these patterns based on the semantic affinity of their extractions, which is a measure of the tendency of the pattern to extract entities of a desired semantic category.</Paragraph>
    <Section position="1" start_page="68" end_page="69" type="sub_section">
      <SectionTitle>
5.1 Identifying Candidate Patterns
</SectionTitle>
      <Paragraph position="0"> The first goal was to identify extraction patterns that were relevant to our domain: terrorist events.</Paragraph>
      <Paragraph position="1"> We began by exhaustively generating every possible extraction pattern that occurred in our CNN terrorism web pages. We applied the AutoSlog-TS system (Riloff, 1996) to the web pages to automatically generate all lexical instantiations of patterns in the corpus. Collectively, the resulting patterns were capable of extracting every noun phrase in the CNN collection. In all, 147,712 unique extraction patterns were created as a result of this process.</Paragraph>
      <Paragraph position="2"> Next, we computed the statistical correlation of each extraction pattern with the seed patterns based on the frequency of their occurrence in the same sentence. IE patterns that never occurred in the same sentence as a seed pattern were discarded. We used Pointwise Mutual Information (PMI) (Manning and Schütze, 1999; Banerjee and Pedersen, 2003) as the measure of statistical correlation. Intuitively, an extraction pattern that occurs more often than chance in the same sentence as a seed pattern will have a high PMI score.</Paragraph>
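      <Paragraph>
A minimal sketch of the sentence-level PMI computation (our own formulation: the pattern matching itself is not reproduced, each sentence is assumed to be represented as the set of patterns it matches, and probabilities are estimated from sentence counts):

import math
from collections import Counter

def pmi_scores(sentences, seed_patterns):
    # sentences: list of sets, each holding the extraction patterns matched in that sentence
    n = len(sentences)
    pattern_count = Counter()
    joint_count = Counter()   # pattern occurring in a sentence that also contains a seed
    seed_sentence_count = 0
    for patterns in sentences:
        has_seed = any(p in seed_patterns for p in patterns)
        if has_seed:
            seed_sentence_count += 1
        for p in patterns:
            pattern_count[p] += 1
            if has_seed:
                joint_count[p] += 1
    scores = {}
    for p, joint in joint_count.items():
        # PMI = log2( P(pattern, seed) / (P(pattern) * P(seed)) )
        scores[p] = math.log2((joint / n) /
                              ((pattern_count[p] / n) * (seed_sentence_count / n)))
    return scores   # patterns that never co-occur with a seed are absent, i.e. discarded
      </Paragraph>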
      <Paragraph position="3"> The 147,712 extraction patterns acquired from the CNN terrorism web pages were then ranked by their PMI correlation to the seed patterns. Table 1 lists the patterns most highly correlated with the terrorism seed patterns. Many of these patterns do seem to be related to terrorism, but many of them are not useful to our IE task (for this paper, identifying the victims and physical targets of a terrorist attack). For example, the pattern "explode after &lt;np&gt;" will not extract victims or physical targets, while the pattern "sympathizers of &lt;np&gt;" may extract people, but they would not be the victims of an attack. In the next section, we explain how we filter and re-rank these candidate patterns to identify the ones that are directly useful to our IE task.</Paragraph>
    </Section>
    <Section position="2" start_page="69" end_page="69" type="sub_section">
      <SectionTitle>
5.2 Filtering Patterns based upon their
Semantic Affinity
</SectionTitle>
      <Paragraph position="0"> Our next goal is to filter out the patterns that are not useful for our IE task, and to automatically assign the correct slot type (victim or target) to the ones that are relevant. To automatically determine the mapping between extractions and slots, we define a measure called semantic affinity. The semantic affinity of an extraction pattern to a semantic category is a measure of its tendency to extract NPs belonging to that semantic category.</Paragraph>
      <Paragraph position="1"> This measure serves two purposes:  (a) It allows us to filter out candidate patterns that do not have a strong semantic affinity to our categories of interest.</Paragraph>
      <Paragraph position="2"> (b) It allows us to define a mapping between the  extractions of the candidate patterns and the desired slot types.</Paragraph>
      <Paragraph position="3"> We computed the semantic affinity of each candidate extraction pattern with respect to six semantic categories: target, victim, perpetrator, organization, weapon, and other. Targets and victims are our categories of interest. Perpetrators, organizations, and weapons are common semantic classes in this domain which could be "distractors". The other category is a catch-all to represent all other semantic classes. To identify the semantic class of each noun phrase, we used the Sundance package (Riloff and Phillips, 2004), which is a freely available shallow parser that uses dictionaries to assign semantic classes to words and phrases.</Paragraph>
      <Paragraph position="4"> We counted the frequencies of the semantic categories extracted by each candidate pattern and applied the RLogF measure used by AutoSlog-TS (Riloff, 1996) to rank the patterns based on their affinity for the target and victim semantic classes. For example, the semantic affinity of an extraction pattern for the target semantic class would be calculated as:</Paragraph>
      <Paragraph position="5"> semantic_affinity_target(pattern) = (f_target / f_all) * log2(f_target) </Paragraph>
      <Paragraph position="6"> where f_target is the number of target semantic class extractions and f_all = f_target + f_victim + f_perp + f_org + f_weapon + f_other. This is essentially a probability P(target) weighted by the log of the frequency.</Paragraph>
      <Paragraph position="7"> We then used two criteria to remove patterns that are not strongly associated with a desired semantic category. If the semantic affinity of a pattern for category C was (1) greater than a threshold, and (2) greater than its affinity for the other category, then the pattern was deemed to have a semantic affinity for category C. Note that we intentionally allow for a pattern to have an affinity for more than one semantic category (except for the catch-all other class) because this is fairly common in practice. For example, the pattern "attack on &lt;np&gt;" frequently extracts both targets (e.g., "an attack on the U.S. embassy") and victims (e.g., "an attack on the mayor of Bogota"). Our hope is that such a pattern would receive a high semantic affinity ranking for both categories.</Paragraph>
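      <Paragraph>
A sketch of the affinity computation and the two filtering criteria (the category names follow the text above; the per-pattern category counts are assumed to come from the Sundance semantic tags, and the threshold value used here is an arbitrary example, not the one used in the experiments):

import math

CATEGORIES = ["target", "victim", "perp", "org", "weapon", "other"]

def semantic_affinity(counts, category):
    # counts: dict mapping each semantic category to the number of NPs of that
    # category extracted by the pattern; RLogF-style score P(category) * log2(freq)
    f_cat = counts.get(category, 0)
    f_all = sum(counts.get(c, 0) for c in CATEGORIES)
    if f_cat == 0 or f_all == 0:
        return 0.0
    return (f_cat / f_all) * math.log2(f_cat)

def has_affinity(counts, category, threshold):
    # Criterion (1): affinity above the threshold;
    # criterion (2): affinity greater than the affinity for the catch-all "other" class.
    score = semantic_affinity(counts, category)
    return score > threshold and score > semantic_affinity(counts, "other")

# e.g. a pattern that extracted 40 targets, 8 victims, and 2 other NPs:
counts = {"target": 40, "victim": 8, "other": 2}
print(has_affinity(counts, "target", threshold=1.0))   # True
print(has_affinity(counts, "victim", threshold=1.0))   # False
      </Paragraph>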
      <Paragraph position="8"> Table 2 shows the top 10 high frequency (freq ≥ 50) patterns that were judged to have a strong semantic affinity for the target and victim categories. There are clearly some incorrect entries (e.g., "&lt;subj&gt; fired missiles" is more likely to identify perpetrators than targets), but most of the patterns are indeed good extractors for the desired categories. For example, "fired into &lt;np&gt;", "went off in &lt;np&gt;", and "car bomb near &lt;np&gt;" are all good patterns for identifying targets of a terrorist attack. In general, the semantic affinity measure seemed to do a reasonably good job of filtering patterns that are not relevant to our task, and identifying patterns that are useful for extracting victims and targets.</Paragraph>
    </Section>
  </Section>
</Paper>