<?xml version="1.0" standalone="yes"?>
<Paper uid="W95-0112">
  <Title>Automatically Acquiring Conceptual Patterns Without an Annotated Corpus</Title>
  <Section position="3" start_page="148" end_page="149" type="metho">
    <SectionTitle>
2 Information Extraction
</SectionTitle>
    <Paragraph position="0"> Information extraction (IE) is a natural language processing task that involves extracting predefined types of information from text. Information extraction systems are domain-specific because they extract facts about a specific domain and typically ignore information that is not relevant to the domain.: For example, an information extraction system for the terrorism domain might extract the names of perpetrators, victims, physical targets, and weapons associated with terrorist events mentioned in a text. The information extraction task has received a lot of attention recently as a result of the message understanding conferences (MUCs) \[MUC-5 Proceedings, 1993; MUC-4 Proceedings, 1992; MUC-3 Proceedings, 1991\].</Paragraph>
    <Paragraph position="1"> The systems described in this paper use a conceptual sentence analyzer called CIRCUS \[Lehnert, 1991\]. CIRCUS extracts information using domain-specific structures called concept nodes.</Paragraph>
    <Paragraph position="2"> Each concept node is triggered by a keyword, but is activated only in certain linguistic contexts.</Paragraph>
    <Paragraph position="3"> For example, a concept node called $murder-passive$ is triggered by the verb &amp;quot;murdered&amp;quot; but activated only when the verb appears in a passive construction. Therefore this concept node would be activated by phrases such as &amp;quot;X was murdered&amp;quot;, &amp;quot;X and Y were murdered&amp;quot;, and &amp;quot;X has been murdered.&amp;quot; The subject of the verb is extracted as the victim of the murder. Figure 1 shows a sample sentence and the instantiated concept node produced by CIRCUS.</Paragraph>
    <Paragraph position="4"> Sentence: Three peasants were murdered.</Paragraph>
    <Paragraph position="5"> $murder-passive$ victim = &amp;quot;three peasants&amp;quot;  A similar concept node called $murder-active$ recognizes active forms of the verb &amp;quot;murdered&amp;quot;, such as &amp;quot;terrorists murdered three peasants.&amp;quot; This concept node is also triggered by the verb &amp;quot;murdered&amp;quot;, bat is activated only when the verb appears in an active construction. In this case, the subject of the verb is extracted as the perpetrator of the murder.</Paragraph>
    <Paragraph position="6"> CIRCUS relies entirely on its dictionary of concept nodes to extract information, so it is crucial to have a good concept node dictionary for a domain. However, building a concept node dictionary by hand is tedious and time-consuming. We estimate that it took approximately 1500 person-hours to construct a concept node dictionary by hand for the MUC-4 terrorism domain \[Lehnert et al., 1992\]. Subsequently, we developed a system called AutoSlog that can build concept node dictionaries automatically using an annotated training corpus. The next section describes the original version of AutoSlog as well as the new version, AutoSlog-TS, that generates concept node dictionaries automatically using only a preclassified training corpus.</Paragraph>
  </Section>
  <Section position="4" start_page="149" end_page="154" type="metho">
    <SectionTitle>
3 Automated Dictionary Construction for Information Extraction
</SectionTitle>
    <Paragraph position="0"> tion A major knowledge-engineering bottleneck for information extraction (IE) systems is the process of constructing a dictionary of appropriate extraction patterns. A few systems have been developed recently to build dictionaries for information extraction automatically, such as AutoSlog \[Riloff, 1993\] and PALKA \[Kim and Moldovan, 1993\]. These systems generate extraction patterns automatically using a set of associated answer keys or an annotated training corpus. In this section, we describe the original AutoSlog system for automated dictionary construction and then present AutoSlog-TS, a variant of AutoSlog that does not rely on text annotations.</Paragraph>
    <Section position="1" start_page="149" end_page="150" type="sub_section">
      <SectionTitle>
3.1 AutoSlog: Automated Dictionary Construction Using Text Annotations
</SectionTitle>
      <Paragraph position="0"> The guiding principle behind AutoSlog is that most role relationships can be identified by local linguistic context surrounding a phrase. For example, consider the sentence &amp;quot;John Smith was kidnapped by three armed men.&amp;quot; To identify &amp;quot;John Smith&amp;quot; as the victim of a kidnapping, we must recognize that he is the subject of the passive verb &amp;quot;kidnapped.&amp;quot; Similarly, to identify &amp;quot;three armed men&amp;quot; as the perpetrators, we must recognize that &amp;quot;three armed men&amp;quot; is the object of the preposition &amp;quot;by&amp;quot; and attaches to the verb &amp;quot;kidnapped.&amp;quot; It is impossible to look at an isolated noun phrase such as &amp;quot;John Smith&amp;quot; and determine whether he is a perpetrator or a victim without considering local context.</Paragraph>
      <Paragraph position="1">  AutoSlog uses simple domain-independ.ent linguistic rules to create extraction patterns for a given set of noun phrases in a text corpus. Figure 2 shows the steps involved in dictionary construction. As input, AutoSlog requires a set of annotated texts in which the noun phrases that need to be extracted have been tagged. 1 For each &amp;quot;targeted&amp;quot; noun phrase, AutoSlog finds the sentence in which it was tagged 2 and passes the sentence to CIRCUS for syntactic analysis.</Paragraph>
      <Paragraph position="2"> 1Alternatively, a set of answer keys that list the relevant noun phrases (e.g., the MUC-4 answer keys) could be used (e.g., see \[Riloff, 1993\]).</Paragraph>
      <Paragraph position="4"> CIRCUS separates each sentence into clauses and identifies the subject, verb, direct object and prepositional phrases in each clause. AutoSlog then determines which clause contains the targeted noun phrase and whether it is a subject, direct object, or prepositional phrase.</Paragraph>
      <Paragraph position="5"> Next, AutoSlog uses a small set of heuristics to infer which other words in the sentence identify the role of the noun phrase. If the targeted noun phrase is the subject or direct object of a clause then AutoSlog infers that the verb defines the role of the noun phrase. AutoSlog uses several rules to recognize different verb forms. In the subject case, consider the sentence &amp;quot;John Smith killed two people&amp;quot; and the targeted noun phrase &amp;quot;John Smith&amp;quot; tagged as a perpetrator. AutoSlog generates a concept node that is triggered by the verb &amp;quot;killed&amp;quot; and activated when the verb appears in an active construction; the resulting concept node recognizes the pattern &amp;quot;X killed&amp;quot; and extracts X as a perpetrator. Given the sentence &amp;quot;John Smith was killed&amp;quot; with &amp;quot;John Smith&amp;quot; tagged as a victim, AutoSlog generates a concept node that recognizes the pattern &amp;quot;X was killed&amp;quot; and extracts X as a victim.</Paragraph>
      <Paragraph position="6"> In the direct object case, the sentence &amp;quot;the armed men killed John Smith&amp;quot; produces the pattern &amp;quot;killed X.&amp;quot; If the targeted noun phrase is in a prepositional phrase, then AutoSlog uses a simple pp--attachment algorithm to attach the prepositional phrase to a previous verb or noun in the sentence which is then used as a trigger word for a concept node. For example, &amp;quot;the men were killed in Bogota by John Smith&amp;quot; produces the pattern &amp;quot;killed by X.&amp;quot; It should be noted that, although we are using a simple phrase-like notation for the patterns, they are actually concept nodes activated by an NLP system so the words do not have to be strictly adjacent in the text.</Paragraph>
    </Section>
    <Section position="2" start_page="150" end_page="151" type="sub_section">
      <SectionTitle>
Linguistic Pattern Example
</SectionTitle>
      <Paragraph position="0"> 1. &lt;subject&gt; active-verb &lt;perpetrator&gt; bombed 2. &lt;subject&gt; active-verb direct-object 3 &lt;perpetrator&gt; claimed responsibility 3. &lt;subject&gt; passive-verb 4. &lt;subject&gt; verb infinitive 5. &lt;sUbject&gt; auxiliary noun 6. active-verb &lt;direct-object&gt; 7. paSsive-verb &lt;direct-object&gt; 4 8. infinitive &lt;direct-object&gt; 9. verb infinitive &lt;dlrect-object&gt; 10. gerund &lt;direct-object&gt; 11. noun auxiliary &lt;dlrect-object&gt; 12. noun preposition &lt;noun-phrase&gt; 13. active-verb preposition &lt;noun-phrase&gt; 14. passive-verb preposition &lt;noun-phrase&gt; 15. infinitive preposition &lt;noun-phrase&gt; 3 &lt;victim&gt; was murdered &lt;perpetrator&gt; attempted to ki_..H &lt;victim&gt; was victim bombed &lt;target&gt; killed &lt;victim&gt; to kill &lt;victim&gt; threatened to attack &lt;target&gt; killing &lt;victim&gt;  The set of heuristics used by AutoSlog is shown in Figure 3. The heuristics are divided into three categories depending upon where the targeted noun phrase is found. The location is indicated by the bracketed item (subject, direct-object, noun-phrase in a PP). The other words represent the s~rrounding context used to construct a concept node. The examples in the right-hand column show instantiated patterns for which AutoSlog generated concept nodes based on the  general pattern on the left. The underlined word represents the trigger word, the bracketed item represents the type of information that will be extracted by the concept node, and the remaining words represent the required context.</Paragraph>
      <Paragraph position="1"> In previous experiments, we used AutoSlog to construct a dictionary for the MUC-4 terrorism domain using 772 relevant texts from the MUC-4 corpus. AutoSlog created 1237 concept node definitions, but many of these concept nodes represented general expressions that will not reliably extract relevant information. Therefore, we introduced a human-in-the-loop to weed out the unreliable definitions. A person manually reviewed all 1237 definitions and retained 450 of them for the final dictionary. The resulting dictionary achieved 98% of the performance of a dictionary that was hand-crafted for the MUC-4 terrorism domain \[Riloff, 1993\]. One of the main differences between AutoSlog and previous lexical acquisition systems is that AutoSlog creates new definitions entirely from scratch. In contrast, previous language learning systems (e.g., \[Jacobs and Zernik, 1988; Carbonell, 1979; Granger, 1977\]) create new definitions based on the definitions of other known words in the context. That is, they assume that some definitions already exist and use those definitions to create new ones. The structures created by AutoSlog are also considerably different than the lexical definitions created by most systems, although the PALKA system \[Kim and Moldovan, 1993\] creates similar extraction patterns. The main difference between PALKA and AutoSlog is that PALKA is given the set of keywords associated with each concept (essentially its &amp;quot;trigger words&amp;quot;) and then learns to generalize the patterns surrounding the keywords. In contrast, AutoSlog infers the trigger words and patterns on its own but does not generalize them.</Paragraph>
    </Section>
    <Section position="3" start_page="151" end_page="154" type="sub_section">
      <SectionTitle>
3.2 AutoSlog-TS: Automated Dictionary Construction Without Text Annotations
</SectionTitle>
      <Paragraph position="0"> tions As described in the previous section, AutoSlog requires an annotated training corpus in which the noun phrases that should be extracted have been tagged. Creating an annotated corpus is much easier than building a dictionary by hand. However, the annotation process is not trivial. It may take days or even weeks for a domain expert to annotate several hundred texts. 5 But perhaps even more importantly, the annotation process is not always well-defined; in many cases, it is not clear which portions of a text should be annotated. Complex noun phrases (e.g., conjunctions, appositives, prepositional phrases) are often confusing for annotators. Should the entire noun phrase be tagged or just the head noun? Should modifiers be included? Should prepositional phrases be included? Conjuncts and appositives? These issues are not only frustrating for a user, but can have serious consequences for the system. A noun phrase that is incorrectly annotated often produces an undesirable extraction pattern or produces no extraction pattern at all.</Paragraph>
      <Paragraph position="1"> To bypass the need for an annotated corpus, we created a new version of AutoSlog that does not rely on text annotations..The new system, Autoslog-TS, can be run exhaustively on an untagged but preclassified corpus. None of the words or phrases in the texts need to be tagged, but each text must be classified as either relevant or irrelevant to the targeted domain. 6 Figure 4 shows the steps involved in dictionary construction. The process breaks down into two stages:  classification tasks, the irrelevant texts should reflect the types of texts that will need to be distinguished from relevant texts. For example, many of the irrelevant texts in the MUC-4 corpus describe military actions so the resulting AutoSlog-TS dictionary is especially well-suited for discriminating texts describing military incidents from those describing terrorist incidents.</Paragraph>
      <Paragraph position="2">  . Given a corpus of preclassified texts, a sentence analyzer (CIRCUS) is applied to each sentence to identify all of the noun phrases in the sentence. For example, in Figure 4, two noun phrases are identified: &amp;quot;The World Trade Center&amp;quot; and &amp;quot;terrorists.&amp;quot; . For each noun phrase, the system determines whether the noun phrase was a subject, direct object, or prepositional phrase based on the syntactic analysis produced by the sentence analyzer.</Paragraph>
      <Paragraph position="3"> . All of the appropriate heuristics are fired. For example, in Figure 4 &amp;quot;The World Trade Center&amp;quot; was identified as the subject of the sentence so all of the subject patterns are fired (patterns #1-5 in Figure 3). Pattern #3 is the only one that is satisfied, so a single concept node is generated that recognizes the pattern &amp;quot;X was bombed.&amp;quot; It is possible for multiple heuristics to fire; for example, patterns #1 and #2 may both fire if the targeted noun phrase is the subject of an active verb and takes a direct-object.</Paragraph>
      <Paragraph position="4"> After processing the training texts, we have a huge collection of concept nodes. The second stage involves collecting statistics to determine which concept nodes represent domain-specific expressions.  Stage 2: Statistically Filtering the Concept Nodes 1. All of the newly generated concept nodes are loaded into the system and the training corpus is run through the sentence analyzer again. This time, however, the concept nodes are activated during sentence processing.</Paragraph>
      <Paragraph position="5">  2. Statistics are computed to determine how often each concept node was activated in relevant  texts and how often it was activated in irrelevant texts. We calculate the relevancy rate of each concept node (i.e., the number of occurrences in relevant texts divided by the total number of occurrences), and the frequency of each concept node (i.e., the total number of times it was activated in the corpus).</Paragraph>
      <Paragraph position="6"> After Stage 1, we have a large set of concept node definitions that, collectivelyl can extract virtually 7 every noun phrase in the corpus. Most of the concept nodes represent general phrases that are likely to occur in a wide variety of texts (e.g., &amp;quot;X saw&amp;quot;). However, some of the concept nodes represent domain-specific patterns (e.g., &amp;quot;X was bombed&amp;quot;). Stage 2 is designed to identify these concept nodes automatically under the assumption that most of them will have high relevancy rates. In other words, if we sort the concept nodes by relevancy rates then the domain-specific patterns should float to the top.</Paragraph>
      <Paragraph position="7"> One of the side effects of this approach is that the statistics provide feedback on which heuristics are most appropriate. In previous work with AutoSlog, we found that some domains require longer extraction patterns than others \[Riloff, 1994\]. In particular, we found that simple verb forms usually suffice as extraction patterns in the terrorism domain (e.g., &amp;quot;X was killed&amp;quot;). But in the joint ventures domain, good extraction patterns often require both verbs and nouns (e.g., &amp;quot;X formed venture&amp;quot; is better than &amp;quot;X formed&amp;quot;). For this reason, we found it necessary to run AutoSlog with slightly different rule sets in these domains. In contrast, AutoSlog-TS simply allows all applicable heuristics to fire s , often producing multiple extraction patterns of varying lengths, and lets the statistics ultimately decide which ones work the best. For example, &amp;quot;X formed&amp;quot; would presumably have a much lower relevancy rate than &amp;quot;X formed venture&amp;quot; in the joint ventures domain. The original version of AutoSlog could have applied multiple heuristics as well, but its dictionary had to be manually filtered so it was preferable to keep the dictionary small. Since AutoSlog-TS uses statistical filtering, we don't have to worry as much about the number of concept nodes generated and therefore don't need separate rule sets.</Paragraph>
      <Paragraph position="8"> However, determining which concept nodes are ultimately &amp;quot;useful&amp;quot; depends on how one intends to use them. We are interested in using the concept nodes for two tasks: information extraction and text classification. These tasks place different demands on the concept node dictionary. A good dictionary for information extraction should contain patterns that provide broad coverage of the domain. In general, useful patterns fall into one of two categories: (a) patterns that frequently extract relevant information and rarely extract irrelevant information or (b) patterns that frequently extract relevant information but often extract irrelevant information as well. Patterns of type (a) should have high relevancy rates. Patterns of type (b) are more difficult to identify but will occur with high frequency in relevant texts. Section 4.2 presents experiments with concept node filtering techniques for the information extraction task.</Paragraph>
      <Paragraph position="9"> A good dictionary for text classification should contain patterns that frequently occur in relevant texts but rarely occur in irrelevant texts. These patterns represent expressions that are highly indicative of the domain and are therefore useful for classifying new texts, AutoSlog-TS was motivated by a text classification algorithm called the relevancy signatures algorithm \[Riloff and Lehnert, 1994\]. This algorithm applies CIRCUS to a preclassified training corpus and com-TMost, but not all, noun phrases will yield a concept node. AutoSlog's heuristics sometimes fail to produce a concept node when the verb is weak (e.g., forms of &amp;quot;to be&amp;quot;), when the linguistic context does match any of the heuristics, or when CIRCUS produces a faulty sentence analysis.</Paragraph>
      <Paragraph position="10">  putes statistics to identify which signatures occur much more frequently in relevant texts than irrelevant texts (i.e., have a high relevancy rate). A signature consists of a concept node paired with the word that triggered it, although in the experiments presented in this paper there is a one-to-one correspondence between concept nodes and signatures. 9 The relevancy signatures algorithm essentially identifies concept nodes that have a high relevancy rate and uses them to classify new texts. Therefore, the AutoSlog-TS dictionary and statistics can be fed directly into the text classification algorithm. We present text classification results with AutoSlog-TS in the next section.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="154" end_page="159" type="metho">
    <SectionTitle>
4 Experiments in the Terrorism Domain
</SectionTitle>
    <Paragraph position="0"> We conducted a series of experiments with AutoSlog-TS to evaluate how well it performs on a text classification task, and to assess the viability of using it for information extraction tasks. First, we describe text classification results for the MUC-4 terrorism domain. Second, we present data that suggests how the dictionary can be filtered automatically for information extraction.</Paragraph>
    <Section position="1" start_page="154" end_page="156" type="sub_section">
      <SectionTitle>
4.1 Text Classification Experiments
</SectionTitle>
      <Paragraph position="0"> In the first experiment, we applied AutoSlog-TS to 1500 texts 1deg from the MUC-4 corpus, which has been preclassified for the domain of Latin American terrorism. Roughly 50% of the texts are classified as relevant. AutoSlog-TS produced a dictionary of 32,345 unique concept nodes. To reduce the set of patterns down to a manageable size, we eliminated all concept nodes that were proposed exactly once, under the assumption that a pattern encountered only once is unlikely to be of much value. AutoSlog-TS often proposes the same pattern multiple times and keeps track of how often each pattern is proposed. After frequency filtering, the AutoSlog-TS dictionary contained 11,225 unique concept nodes.</Paragraph>
      <Paragraph position="1"> We then ran CIRCUS over the same set of texts using the new concept node dictionary. For each text, we kept track of the concept nodes that were activated. We expect each concept node to be activated at least once, because these texts were used to create the concept node definitions, n This data was handed off to the relevancy signatures algorithm which generates signatures for each text (by pairing each concept node with the word that triggered it), and calculates statistics for each signature to identify how often it appeared in relevant texts versus irrelevant texts. The relevancy signatures algorithm uses a relevancy threshold R to identify the most relevant signatures and a frequency threshold M to eliminate signatures that were seen only a few times during training.</Paragraph>
      <Paragraph position="2"> Signatures that pass both thresholds are labeled as relevancy signatures and are used to classify new texts.</Paragraph>
      <Paragraph position="3"> Finally, we evaluated the system by classifying two blind sets of 100 texts each, the TST3 and TST4 test sets from the MUC-4 corpus. Each new text was processed by CIRCUS and classified as relevant if it generated a relevancy signature. We compared these results with results produced  nA concept node may be activated by CIRCUS more often than it is proposed by AutoSlog-TS. For example, consider the phrase I &amp;quot;the murder in Bogota by terrorists.&amp;quot; To extract &amp;quot;terrorists&amp;quot;, AutoSlog-TS uses a pp-attachment algorithm which should attach the PP to the noun &amp;quot;murder.&amp;quot; However, it often makes mistakes and might attach the PP to the noun &amp;quot;Bogota.&amp;quot; In this case, AutoSlog-TS would not propose the pattern &amp;quot;murder by X&amp;quot; even though it appears in the text.</Paragraph>
      <Paragraph position="4">  by the hand-crafted MUC-4 dictionary. We ran each system 120 times using a variety of threshold settings: R was varied from 70 to 95 in increments of five, and M was varied from 1 to 20 in increments of one. Both text classification systems were trained on the same set of 1500 texts and were identical except that they used different concept node dictionaries. Figures 5 and 6 show the scatterplots.</Paragraph>
      <Paragraph position="5">  The AutoSlog-TS dictionary performed comparably to the hand-crafted dictionary on both test sets. On TST4, the AutoSlog-TS dictionary actually achieved higher precision than the hand-crafted dictionary for recall levels &lt; 60%, and produced several data points that achieved 100% precision (the hand-crafted dictionary did not produce any). However, we see a trade-off at higher recall levels. The AutoSlog-TS dictionary achieved higher recall (up to 100%), which makes sense considering that the AutoSlog-TS dictionary is much bigger than the hand-crafted dictionary.</Paragraph>
      <Paragraph position="6"> But the hand-crafted dictionary achieved higher precision at recall levels above 60-65%. This is probably because the hand-crafted dictionary was filtered manually, which ensures that all of its concept nodes are relevant to the domain (although not all are useful as classifiers). In contrast, the AutoSlog-TS dictionary was not filtered manually so the statistics are solely responsible for separating the relevant concept nodes from the irrelevant ones. To achieve high recall, the threshold values must be low which allows some irrelevant patterns to pass threshold and cause erroneous classifications.</Paragraph>
      <Paragraph position="7"> Overall, the text classification results from AutoSlog-TS are very encouraging. The AutoSlog-TS dictionary produced results comparable to a hand-crafted dictionary on both test sets and even surpassed the precision scores of the hand-crafted dictionary on TST4. Furthermore, the entire text classification system is constructed automatically using only a preclassified training corpus, and no text annotations or manual filtering of any kind.</Paragraph>
    </Section>
    <Section position="2" start_page="156" end_page="159" type="sub_section">
      <SectionTitle>
4.2 Comparative Dictionary Experiments
</SectionTitle>
      <Paragraph position="0"> We were also interested in gathering data to suggest how the AutoSlog-TS dictionary could be filtered automatically to produce an effective dictionary for information extraction. As we indicated in Section 3.2, a dictionary for text classification requires patterns that can discriminate between relevant and irrelevant texts. In contrast, a dictionary for information extraction requires patterns that will extract relevant information, but they may also extract irrelevant information. For example, in the terrorism domain, it is essential to have a pattern for the expression &amp;quot;X was killed&amp;quot; because people are frequently killed in terrorist attacks. However, this pattern is also likely to appear in texts that describe other types of incidents, such as accidents and military actions.</Paragraph>
      <Paragraph position="1"> First, we collected data to compare the AutoSlog-TS dictionary with a dictionary produced by the original Version of AutoSlog. The AutoSlog dictionary was generated using an annotated corpus and was subsequently filtered by a person, so it relied on two levels of human effort. The AutoSlog dictionary contains 428 unique concept node patterns 12, which were all deemed to be relevant by a person. The AutoSlog-TS dictionary contains 32,345 unique patterns of which 398 intersect with the AutoSlog dictionary33 We experimented with automatic filtering techniques based on two criteria: frequency and relevancy. For frequency filtering, we simply removed all concept nodes that were proposed by AutoSlog-TS less than N times. For example, N=2 eliminated all concept nodes that were proposed exactly once and reduced the size of the dictionary from 32,345 to 11,225. Figure 7 shows the intersections between the AutoSlog-TS dictionary and the AutoSlog dictionary after frequency 12The dictionary actually contains 450 concept nodes but some concept nodes represent the same pattern to extract different types of objects. For example, the pattern &amp;quot;X was attacked&amp;quot; is used to extract both victims and physical targets.</Paragraph>
      <Paragraph position="2"> lain theory, AutoSlog-TS should have generated all of the patterns that were generated by AutoSlog. However, AutoSlog-TS uses a slightly different version of CIRCUS and a new pp-attachment algorithm (see \[Riloff, 1994\]).  filtering. It is interesting to note that approximately half of the concept nodes in the AutoSlog dictionary were proposed fewer than 5 times by AutoSlog-TS. This implies that roughly half of the concept nodes in the AutoSlog dictionary occurred infrequently and probably had little impact on the overall performance of the information extraction system. 14 One of the problems with manual filtering is that it is difficult for a person to know whether a pattern will occur frequently or infrequently in future texts. As a result, people tend to retain many patterns that are not likely to be encountered very often.</Paragraph>
      <Paragraph position="3">  For relevancy filtering, we retained only the concept nodes that had &gt; N% correlation with relevant texts. For example, N--80 means that we retained a concept node if &gt; 80% of its occurrences were in relevant texts. Figure 8 shows the intersections between the dictionaries after relevancy filtering. Not surprisingly, most of the concept nodes in the AutoSlog dictionary had at least a 50% relevancy rate. However, the number of concept nodes drops off rapidly at higher relevancy rates. Again, this is not surprising because many useful extraction patterns will be common in both relevant and irrelevant texts.</Paragraph>
      <Paragraph position="4"> Finally, we filtered the AutoSlog-TS dictionary using both relevancy and frequency filtering (N=5) to get a rough idea of how many concept node definitions will be useful for information extraction. Figure 9 shows the size of the resulting dictionaries after filtering. The number of concept nodes drops off dramatically from 32,345 to 4,169 after frequency filtering alone. There is a roughly linear relationship between the relevancy rate and the number of concept nodes retained. It seems relatively safe to assume that concept nodes with a relevancy rate below 50% are not highly associated with the domain, and that concept nodes with a total frequency &lt; 5 are probably not going to be encountered often. Using these two threshold values, we can reduce the size of the dictionary down to 1870 definitions. This dictionary is much more manageable in size 14This is consistent with earlier results which showed that a relatively small set of concept nodes typically do most of the work \[RilotT, 1994\].</Paragraph>
      <Paragraph position="5">  and could be easily reviewed by a person to separate the good definitions from the bad ones. 15 If for no other reason, a human would be required to assign semantic labels to each definition so that the system can identify the type of information that is extracted. Furthermore, the AutoSlog-TS dictionary should contain a higher percentage of relevant definitions that the original AutoSlog dictionary. Since the AutoSlog-TS dictionary has been prefiltered for both frequency and relevancy, many concept nodes that represent uncommon phrases or general expressions have already been removed.</Paragraph>
      <Paragraph position="6"> Because AutoSlog-TS is not constrained to consider only the annotated portions of the corpus, it found many good patterns that AutoSlog did not. For example, AutoSlog-TS produced 158 concept nodes that have a relevancy rate &gt; 90% and frequency &gt; 5. Only 45 of these concept nodes were in the original AutoSlog dictionary. Figure 10 shows a sample of some of the new concept nodes that represent patterns associated with terrorism. 16 was assassinated in X assassination in X X ordered assassination was captured by X capture of X X managed to escape was exploded in X damage in X X expressed solidarity was injured by X headquarters of X perpetrated on X was kidnapped in X targets of X hurled at X was perpetrated on X went_off on X carried_out X was shot in X X blamed suspected X was shot_to_death on X X defused to protest X X was hit X injured to arrest X Figure 10: Patterns found by AutoSlog-TS but not by AutoSlog These results suggest that combining domain-independent linguistic rules with simple filtering techniques is a promising approach for automatically creating dictionaries of extraction patterns. Although it may still be necessary for a human to review the resulting patterns to build an information extraction system, this approach eliminates the need for text annotations and relies only on preclassified texts.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>