<?xml version="1.0" standalone="yes"?> <Paper uid="P06-2086"> <Title>URES : an Unsupervised Web Relation Extraction System</Title> <Section position="5" start_page="667" end_page="671" type="metho"> <SectionTitle> 3 Description of URES </SectionTitle> <Paragraph position="0"> The goal of URES is to extract instances of relations from the Web without human supervision.</Paragraph> <Paragraph position="1"> Accordingly, the input of the system is limited to (reasonably short) definitions of the target relations. The output of the system is a large list of relation instances, ordered by confidence. The system consists of several largely independent components. The Sentence Gatherer generates (e.g., downloads from the Web) a large set of sentences that may contain target instances. The Pattern Learner uses a small number of known seed instances to learn likely patterns of relation occurrences. The Sentence Classifier filters the set of sentences, removing those that are unlikely to contain instances of the target relations. The Instance Extractor extracts the attributes of the instances from the sentences and generates the output of the system.</Paragraph> <Section position="1" start_page="668" end_page="668" type="sub_section"> <SectionTitle> 3.1 Sentence Gatherer </SectionTitle> <Paragraph position="0"> The Sentence Gatherer is currently implemented in a very simple way. It gets a set of keywords as input and downloads all documents that contain at least one of those keywords. From the documents, it extracts all sentences that contain at least one of the keywords.</Paragraph> <Paragraph position="1"> The keywords for a relation are words that are indicative of instances of the relation. The keywords are given to the system as part of the relation definition, and their number is usually small. For instance, the set of keywords for Acquisition in our experiments contains two words - &quot;acquired&quot; and &quot;acquisition&quot;. Additional keywords (such as &quot;acquire&quot;, &quot;purchased&quot;, and &quot;hostile takeover&quot;) can be added automatically by using WordNet (Miller 1995).</Paragraph> </Section> <Section position="2" start_page="668" end_page="670" type="sub_section"> <SectionTitle> 3.2 Pattern Learner </SectionTitle> <Paragraph position="0"> The task of the Pattern Learner is to learn the patterns of occurrence of relation instances. This is an inherently supervised task, because at least some occurrences must be known in order to find patterns among them. Consequently, the input to the Pattern Learner includes a small set (10-15 instances) of known instances for each target relation. Our system assumes that the seeds are a part of the target relation definition.</Paragraph> <Paragraph position="1"> However, the seeds need not be created manually. Instead, they can be taken from the top-scoring results of a high-precision low-recall unsupervised extraction system, such as KnowItAll. The seeds for our experiments were produced in exactly this way.</Paragraph> <Paragraph position="2"> The Pattern Learner proceeds as follows: first, the gathered sentences that contain the seed instances are used to generate the positive and negative sets. From those sets the patterns are learned. Then the patterns are post-processed and filtered. We shall now describe those steps in detail.</Paragraph> <Paragraph position="3"> Preparing the positive and negative sets The positive set of a predicate (the terms predicate and relation are interchangeable in our work) consists of sentences that contain a known instance of the predicate, with the instance attributes changed to &quot;<AttrN>&quot;, where N is the attribute index. For example, assuming there is a seed instance Acquisition(Oracle, PeopleSoft), the sentence The Antitrust Division of the U.S. Department of Justice evaluated the likely competitive effects of Oracle's proposed acquisition of PeopleSoft.</Paragraph> <Paragraph position="4"> will be changed to The Antitrust Division... ...of <Attr1>'s proposed acquisition of <Attr2>.</Paragraph> <Paragraph position="5"> The positive set of a predicate P is generated straightforwardly, using substring search.</Paragraph> <Paragraph position="6"> The negative set of a predicate consists of similarly modified sentences that contain known false instances of the predicate. We build the negative set as a union of two subsets. The first subset is generated from the sentences in the positive set by changing the assignment of one or both attributes to some other suitable entity. In the first mode of operation, when only a shallow parser is available, any suitable noun phrase can be assigned to an attribute. Continuing the example above, the following sentences will be included in the negative set: <Attr1> of <Attr2> evaluated the likely...</Paragraph> <Paragraph position="7"> <Attr2> of the U.S. ... ...acquisition of <Attr1>.</Paragraph> <Paragraph position="8"> etc.</Paragraph> <Paragraph position="9"> In the second mode of operation, when an NER system is available, only entities of the correct type get assigned to an attribute.</Paragraph> <Paragraph position="10"> The other subset of the negative set contains all sentences produced in a similar way from the positive sentences of all other target predicates. We assume without loss of generality that the predicates being extracted simultaneously are all disjoint. In addition, the definition of each predicate indicates whether the predicate is symmetric (like &quot;merger&quot;) or antisymmetric (like &quot;acquisition&quot;). In the former case, the sentences produced by exchanging the attributes in positive sentences are placed into the positive set, and in the latter case into the negative set of the predicate.</Paragraph> <Paragraph position="11"> The following pseudo code shows the process of generating the positive and negative sets in detail:
Let S be the set of gathered sentences.
For each predicate P
    For each s ∈ S containing a word from Keywords(P)
        For each seed instance of P whose attributes all occur in s
            Add s, with the attribute values replaced by <AttrN> slots, to the positive set of P
        Add the correspondingly modified false-instance variants of s to the negative set of P</Paragraph>
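As a concrete illustration of the substitution step, the following is a minimal Python sketch, assuming that seeds are tuples of attribute strings and that candidate entities (noun phrases or typed entities, depending on the mode) have already been identified; the function and variable names are illustrative, not part of URES:

from itertools import permutations

def make_positive(sentence, seed):
    # Replace the seed attribute values with <AttrN> slots via substring search.
    if not all(attr in sentence for attr in seed):
        return None
    for i, attr in enumerate(seed, start=1):
        sentence = sentence.replace(attr, f"<Attr{i}>")
    return sentence

def make_negatives(sentence, seed, candidates):
    # Reassign the slots to other entities found in the sentence; for an
    # antisymmetric predicate the exchanged seed attributes are negatives too.
    entities = [c for c in candidates if c in sentence]
    negatives = []
    for a1, a2 in permutations(entities, 2):
        if (a1, a2) == tuple(seed):
            continue  # skip the known true instance
        negatives.append(sentence.replace(a1, "<Attr1>").replace(a2, "<Attr2>"))
    return negatives

sent = ("The Antitrust Division of the U.S. Department of Justice evaluated the "
        "likely competitive effects of Oracle's proposed acquisition of PeopleSoft.")
print(make_positive(sent, ("Oracle", "PeopleSoft")))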
<Paragraph position="12"> The patterns for a predicate are then learned by generalizing pairs of sentences from the positive set of the predicate. The Generalize function generates a pattern that is the best (according to the objective function defined below) generalization of its two arguments. The following pseudo code shows the process of generating the patterns:
For each predicate P
    For each pair of sentences from the positive set of P
        Add the generalization of the pair to the set of candidate patterns of P
The patterns are sequences of tokens, skips (denoted *), limited skips (denoted *?) and slots. The tokens can match only themselves, the skips match zero or more arbitrary tokens, and the slots match instance attributes. The limited skips match zero or more arbitrary tokens that must not belong to entities of the same types as the predicate attributes. The Generalize function takes two patterns (note that sentences in the positive and negative sets are patterns without skips) and generates the least (most specific) common generalization of both. The function performs a dynamic programming search for the best match between the two patterns (the Optimal String Alignment algorithm), with the cost of a match defined as the sum of the costs of the matches of all elements. We use the following costs: two identical elements match at cost 0, a token matches a skip or an empty space at cost 10, a skip matches an empty space at cost 2, and different kinds of skips match at cost 3. All other combinations have infinite cost. After the best match is found, it is converted into a pattern by copying the matched identical elements and adding skips where non-identical elements are matched. For example, the best match between two positive sentences may be found at a total cost of 80 and converted into the following pattern (assuming the NER mode):
*? *? this *? *? , <Attr1> *? acquired <Attr2> *? *
which becomes, after combining adjacent skips,
*? this *? , <Attr1> *? acquired <Attr2> *
Note that the generalization algorithm allows patterns with other kinds of elements besides skips, such as CapitalWord, Number, CapitalizedSequence, etc. As long as the costs and results of matches are properly defined, the Generalize function is able to find the best generalization of any two patterns. However, in the present work we stick with the simplest pattern definition, as described above.</Paragraph>
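A minimal Python sketch of this alignment-based generalization, assuming patterns are lists of strings with "*" and "*?" for skips and "<AttrN>" for slots; for simplicity, every non-identical match is generalized here to an unlimited skip, whereas the actual system also produces limited skips:

INF = float("inf")
SKIPS = {"*", "*?"}

def gap_cost(e):
    # A skip matches an empty space at cost 2; a token or slot at cost 10.
    return 2 if e in SKIPS else 10

def match_cost(a, b):
    if a == b:
        return 0    # identical elements
    if a in SKIPS and b in SKIPS:
        return 3    # different kinds of skips
    if (a in SKIPS) != (b in SKIPS):
        return 10   # a token matches a skip at cost 10
    return INF      # two different tokens never match

def generalize(p1, p2):
    n, m = len(p1), len(p2)
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0
    for i in range(n + 1):   # forward dynamic programming pass
        for j in range(m + 1):
            if i < n:
                cost[i + 1][j] = min(cost[i + 1][j], cost[i][j] + gap_cost(p1[i]))
            if j < m:
                cost[i][j + 1] = min(cost[i][j + 1], cost[i][j] + gap_cost(p2[j]))
            if i < n and j < m:
                cost[i + 1][j + 1] = min(cost[i + 1][j + 1],
                                         cost[i][j] + match_cost(p1[i], p2[j]))
    out, i, j = [], n, m     # backtrace the cheapest alignment
    while i > 0 or j > 0:
        if (i > 0 and j > 0
                and cost[i][j] == cost[i - 1][j - 1] + match_cost(p1[i - 1], p2[j - 1])):
            # Copy matched identical elements; generalize mismatches to a skip.
            out.append(p1[i - 1] if p1[i - 1] == p2[j - 1] else "*")
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + gap_cost(p1[i - 1]):
            out.append("*")
            i -= 1
        else:
            out.append("*")
            j -= 1
    out.reverse()
    combined = []            # combine adjacent skips
    for e in out:
        if not (e in SKIPS and combined and combined[-1] in SKIPS):
            combined.append(e)
    return combined

print(generalize("this , <Attr1> acquired <Attr2>".split(),
                 "this week , <Attr1> acquired <Attr2> .".split()))
# ['this', '*', ',', '<Attr1>', 'acquired', '<Attr2>', '*']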
<Paragraph position="13"> Post-processing, filtering, and scoring The number of patterns generated at the previous step is very large. Post-processing and filtering reduce this number, keeping the most useful patterns and removing the overly specific and irrelevant ones.</Paragraph> <Paragraph position="14"> First, we remove from the patterns all &quot;stop words&quot; surrounded by skips on both sides, such as the word &quot;this&quot; in the last pattern of the previous subsection. Such words do not add to the discriminative power of patterns and only needlessly reduce pattern recall. The list of stop words includes all functional and very common English words, as well as punctuation marks. Note that stop words are removed only if they are surrounded by skips, because when they are adjacent to slots or to non-stop words they often convey valuable information.</Paragraph> <Paragraph position="15"> After this step, the pattern above becomes
*? , <Attr1> *? acquired <Attr2> *
In the next step of filtering, we remove all patterns that do not contain relevant words. For each predicate, the list of relevant words is automatically generated from WordNet by following all links to depth at most 2, starting from the predicate keywords. For example, a pattern containing the word &quot;purchased&quot; will be kept, because &quot;purchased&quot; can be reached from &quot;acquisition&quot; via synonym and derivation links.</Paragraph>
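A sketch of this expansion using NLTK's WordNet interface; the choice of NLTK, the particular link types followed, and the depth counting are our assumptions (the original used WordNet directly):

from nltk.corpus import wordnet as wn  # requires: nltk.download("wordnet")

def neighbors(synset):
    # One step along common WordNet links; derivation links are attached
    # to lemmas, so they are mapped back to synsets.
    related = set(synset.hypernyms() + synset.hyponyms() +
                  synset.similar_tos() + synset.also_sees() + synset.verb_groups())
    for lemma in synset.lemmas():
        related.update(d.synset() for d in lemma.derivationally_related_forms())
    return related

def relevant_words(keywords, depth=2):
    # Collect all lemma names reachable from the keywords within `depth` links.
    seen = {s for kw in keywords for s in wn.synsets(kw)}
    frontier = set(seen)
    for _ in range(depth):
        frontier = {n for s in frontier for n in neighbors(s)} - seen
        seen |= frontier
    return {lemma.name().replace("_", " ") for s in seen for lemma in s.lemmas()}

print(sorted(relevant_words(["acquired", "acquisition"]))[:10])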
<Paragraph position="16"> The final (optional) filtering step removes all patterns that contain slots surrounded by skips on both sides, keeping only the patterns whose slots are adjacent to tokens or to sentence boundaries. Since both the shallow parser and the NER system that we use are far from perfect, they often place the entity boundaries incorrectly. Using only patterns with anchored slots significantly improves the precision of the whole system. In our experiments we compare the performance of anchored and unanchored patterns. The filtered patterns are then scored by their performance on the positive and negative sets. Currently we use a simple scoring method - the score of a pattern is the number of positive matches divided by the number of negative matches plus one:</Paragraph> <Paragraph position="17"> Score(P) = PositiveMatches(P) / (NegativeMatches(P) + 1) </Paragraph> <Paragraph position="18"> This formula is purely empirical and produces reasonable results. A threshold is applied to the set of patterns, and all patterns scoring less than the threshold (currently set to 6) are discarded.</Paragraph> </Section> <Section position="3" start_page="670" end_page="670" type="sub_section"> <SectionTitle> 3.3 Sentence Classifier </SectionTitle> <Paragraph position="0"> The task of the Sentence Classifier is to filter out from the large pool of sentences produced by the Sentence Gatherer those sentences that do not contain target predicate instances. In the current version of our system, this is done only in order to reduce the number of sentences that need to be processed by the Instance Extractor. Therefore, at this stage we simply remove the sentences that do not match any of the regular expressions generated from the patterns. The regular expressions are generated from the patterns by replacing slots with skips.</Paragraph> </Section>
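For illustration, a minimal sketch of this conversion under the list-of-strings pattern encoding used earlier; the helper name and the exact regular expression shape are our assumptions:

import re

def pattern_to_regex(pattern):
    # Slots and skips both become unrestricted gaps; tokens are matched literally.
    parts = []
    for element in pattern:
        if element in ("*", "*?") or element.startswith("<Attr"):
            parts.append(".*")
        else:
            parts.append(re.escape(element))
    return re.compile(r"\s*".join(parts))

classifier = pattern_to_regex(["*?", ",", "<Attr1>", "*?", "acquired", "<Attr2>", "*"])
print(bool(classifier.search("Yesterday , Oracle finally acquired PeopleSoft .")))  # True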
<Section position="4" start_page="670" end_page="671" type="sub_section"> <SectionTitle> 3.4 Instance Extractor </SectionTitle> <Paragraph position="0"> The task of the Instance Extractor is to apply the patterns generated by the Pattern Learner to the sentences passed through by the Sentence Classifier. However, the patterns cannot be matched directly to the sentences, because the patterns only define placeholders for the instance attributes and cannot by themselves extract the values of the attributes.</Paragraph> <Paragraph position="1"> We currently have two different ways to solve this problem - using a general-purpose shallow parser, which is able to recognize noun phrases and their heads, or using an information extraction system called TEG (Rosenfeld, Feldman et al. 2004), together with a trained grammar able to recognize entities of the types of the predicates' attributes. We shall briefly describe the two modes of operation.</Paragraph> <Paragraph position="2"> Shallow Parser mode In the first mode of operation, the predicates may define attributes of two different types: ProperName and CommonNP. We assume that the values of the ProperName type are always heads of proper noun phrases, and that the values of the CommonNP type are simple common noun phrases (with possible proper noun modifiers, e.g. &quot;the Kodak camera&quot;).</Paragraph> <Paragraph position="3"> We use a Java-written shallow parser from the OpenNLP (http://opennlp.sourceforge.net/) package. Each sentence is tokenized, tagged with parts of speech, and tagged with noun phrase boundaries. The pattern matching and extraction are then straightforward.</Paragraph> <Paragraph position="4"> TEG mode In the second mode of operation, we use TEG, a general-purpose hybrid rule-based and statistical IE system, able to extract entities and relations at the sentence level. It is adapted to a domain by writing a suitable set of rules and training them on an annotated corpus. The TEG rule language is a straightforward extension of a context-free grammar syntax. A complete set of rules is compiled into a PCFG (Probabilistic Context-Free Grammar), which is then trained on the training corpus.</Paragraph> <Paragraph position="5"> Some of the nonterminals inside the TEG grammar can be marked as target concepts.</Paragraph> <Paragraph position="6"> Wherever such a nonterminal occurs in a final parse of a sentence, TEG generates an output label. The target concept rules may specify some of their parts as attributes. The concept is then considered to be a relation, with the values of the attributes determined by the concept parse. Concepts without attributes are entities.</Paragraph> <Paragraph position="7"> For the TEG-based instance extractor we utilize the NER ruleset of TEG and an internal training corpus called INC, as described in (Rosenfeld, Feldman et al. 2004). The ruleset defines a grammar with a set of concepts for Person, Location, and Organization entities. In addition, the grammar defines a generic Noun-Phrase concept, which can be used for capturing the entities that do not belong to any of the entity types above.</Paragraph> <Paragraph position="8"> In order to do the extraction, the patterns generated by the Pattern Learner are converted to the TEG syntax and added to the pre-built NER grammar. This produces a grammar that is able to extract relations. The grammar is trained on the automatically labeled positive set from the Pattern Learner. The resulting trained model is applied to the sets of sentences produced by the Sentence Classifier.</Paragraph> </Section> </Section> </Paper>