A Context Pattern Induction Method for Named Entity Extraction

2 Context Pattern Induction

The overall method for inducing entity context patterns and extending entity lists is as follows:

1. Let E = seed set, T = text corpus.
2. Find the contexts C of entities in E in the corpus T (Section 2.1).
3. Select trigger words from C (Section 2.2).
4. For each trigger word, induce a pattern automaton (Section 2.3).
5. Use the induced patterns P to extract more entities E' (Section 3).
6. Rank P and E' (Section 3.1).
7. If needed, add high-scoring entities in E' to E and return to step 2. Otherwise, terminate with the patterns P and the extended entity list E ∪ E' as results.

2.1 Extracting Context

Starting with the seed list, we first find occurrences of seed entities in the unlabeled data. For each such occurrence, we extract a fixed number W (the context window size) of tokens immediately preceding and immediately following the matched entity. As we are only interested in modeling the context here, we replace all entity tokens by the single token -ENT-. This token now represents a slot in which an entity can occur. Examples of extracted entity contexts are shown in Table 1. In the work presented in this paper, seeds are entity instances (e.g., Google is a seed for the organization category).

Table 1: Examples of extracted entity contexts.
    increased expression of -ENT- in vad mice
    the expression of -ENT- mrna was greater
    expression of the -ENT- gene in mouse

The set of extracted contexts is denoted by C. The next step is to automatically induce high-precision patterns containing the token -ENT- from such extracted contexts.
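To make the context-extraction step above concrete, here is a minimal sketch in Python. It assumes whitespace-tokenized text and case-insensitive matching of (possibly multi-token) seeds; the names extract_contexts, ENT, and window are illustrative and not from the paper.

```python
from typing import List, Set

ENT = "-ENT-"

def extract_contexts(tokens: List[str], seeds: Set[str], window: int) -> List[List[str]]:
    """Return context windows around seed occurrences, with the matched
    seed replaced by the single slot token -ENT- (illustrative sketch)."""
    seed_seqs = [s.lower().split() for s in seeds]
    contexts = []
    i = 0
    while i < len(tokens):
        matched = None
        for seq in seed_seqs:
            if [t.lower() for t in tokens[i:i + len(seq)]] == seq:
                matched = seq
                break
        if matched:
            left = tokens[max(0, i - window):i]
            right = tokens[i + len(matched):i + len(matched) + window]
            contexts.append(left + [ENT] + right)
            i += len(matched)
        else:
            i += 1
    return contexts

# Example with W = 3 and a hypothetical gene-name seed:
toks = "increased expression of Bcl-2 in vad mice".split()
print(extract_contexts(toks, {"Bcl-2"}, window=3))
# -> [['increased', 'expression', 'of', '-ENT-', 'in', 'vad', 'mice']]
```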
2.2 Trigger Word Selection

To induce patterns, we need to determine their starts. It is reasonable to assume that some tokens are more specific to particular entity classes than others. For example, in the contexts shown above, expression can be one such word for gene names. Whenever one comes across such a token in text, the probability of finding an entity of the corresponding class in its vicinity is high. We call such starting tokens trigger words; trigger words mark the beginning of a pattern. Note that simply selecting the first token of each extracted context may not be a good way to select trigger words: in such a scheme, we would have to vary W to search for useful pattern starts. Instead of that brute-force technique, we propose an automatic way of selecting trigger words.

A good set of trigger words is very important for the quality of the induced patterns. Ideally, we want a trigger word to be both frequent in and specific to the extracted contexts.

We use a term-weighting method to rank candidate trigger words from entity contexts. IDF (inverse document frequency) was used in our experiments, but any other suitable term-weighting scheme may work comparably. The IDF weight $f_w$ for a word $w$ occurring in a corpus is given by

    $f_w = \log\left(\frac{N}{n_w}\right)$

where $N$ is the total number of documents in the corpus and $n_w$ is the number of documents containing $w$. Now, for each context segment c ∈ C, we select a dominating word $d_c$ given by

    $d_c = \arg\max_{w \in c} f_w$

There is exactly one dominating word for each c ∈ C. All dominating words for contexts in C form a multiset M. Let $m_w$ be the multiplicity of the dominating word $w$ in M. We sort M by decreasing $m_w$ and select the top n tokens from this list as potential trigger words.

Selection criteria based on dominating word frequency work better than criteria based on simple term weight, because words with high term weight may be rare in the extracted contexts and yet would be misleadingly selected for pattern induction. This is avoided by instead using the within-context frequency of dominating words, as we do here.
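The trigger-word selection just described can be sketched as follows. The sketch assumes the corpus is available as tokenized documents for computing IDF and skips the slot token -ENT- when picking dominating words; the names idf_weights, select_trigger_words, and top_n are illustrative.

```python
import math
from collections import Counter
from typing import Dict, Iterable, List

def idf_weights(documents: Iterable[List[str]]) -> Dict[str, float]:
    """IDF weight f_w = log(N / n_w) for every word in the corpus."""
    docs = list(documents)
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    return {w: math.log(n / df[w]) for w in df}

def select_trigger_words(contexts: List[List[str]], f: Dict[str, float],
                         top_n: int) -> List[str]:
    """Pick the top_n words that most often dominate a context."""
    dominating = Counter()
    for c in contexts:
        candidates = [w for w in c if w in f and w != "-ENT-"]
        if not candidates:
            continue
        d_c = max(candidates, key=lambda w: f[w])  # dominating word of context c
        dominating[d_c] += 1                       # multiplicity m_w in multiset M
    return [w for w, _ in dominating.most_common(top_n)]
```

In this sketch the document frequencies are computed over the same corpus from which the contexts are drawn, matching the IDF definition above.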
2.3 Automata Induction

Rather than using individual contexts directly, we summarize them into automata that capture the most significant regularities of the contexts sharing a given trigger word. This construction allows us to determine the relative importance of different context features using a variant of the forward-backward algorithm from HMMs.

For each trigger word, we list the contexts starting with that word. For example, with "expression" as the trigger word, the contexts in Table 1 are reduced to those in Table 2. Since "expression" is a left-context trigger word, only one token to the right of -ENT- is retained: the predictive context lies to the left of the slot -ENT-, and a single token is kept on the right to mark the slot's right boundary. To model predictive right contexts, the token string can be reversed and the same techniques applied to the reversed string.

Table 2: Contexts from Table 1 that start with the trigger word "expression".
    expression of -ENT- in
    expression of -ENT- mrna
    expression of the -ENT- gene

Similar context sets are prepared for each trigger word. The context set for each trigger word is then summarized by a pattern automaton with transitions that match the trigger word and also the wildcard -ENT-. We expect such automata to model the position of the entity slot in context and to help us extract more entities of the same class with high precision. We use a simple form of grammar induction to learn the pattern automata. Grammar induction techniques have previously been explored for information extraction (IE) and related tasks; for instance, Freitag (1997) used grammatical inference to improve precision in IE tasks.

Context segments are short and typically do not involve recursive structures. Therefore, we chose to use 1-reversible automata to represent sets of contexts. An automaton A is k-reversible iff (1) A is deterministic and (2) A^r is deterministic with k tokens of lookahead, where A^r is the automaton obtained by reversing the transitions of A. Wrapper induction using k-reversible grammars is discussed by Chidlovskii (2000).

In the 1-reversible automaton induced for each trigger word, all transitions labeled by a given token go to the same state, which is identified with that token. Figure 1 shows a fragment of a 1-reversible automaton. Solan et al. (2005) describe a similar automaton construction, but they allow multiple transitions between states to distinguish among sentences. Each transition e = (v, w) in a 1-reversible automaton A corresponds to a bigram vw in the contexts used to create A. We therefore assign each transition the probability

    $P(w \mid v) = \frac{C(v, w)}{\sum_{w'} C(v, w')}$

where C(v, w) is the number of occurrences of the bigram vw in the contexts for the trigger word. With this construction, we ensure that words are credited in proportion to their frequency in the contexts. The automaton may overgenerate, but that potentially helps generalization.

The initially induced automata need to be pruned to remove transitions with weak evidence, so as to increase match precision. The simplest pruning method is to set a count threshold c below which transitions are removed. However, this is a poor method. Consider state 10 in the automaton of Figure 2, with c = 20. Transitions (10, 11) and (10, 12) will both be pruned: C(10, 12) ≪ c, but C(10, 11) only just falls short of c. Yet the transition counts suggest that the sequence "the -ENT-" is very common, so it is not desirable to prune (10, 11). Using a local threshold may thus lead to overpruning.

(Figure 2: fragment of an induced 1-reversible automaton; transition counts are shown in parentheses.)

We would like instead to keep transitions that are used in relatively many probable paths through the automaton. The probability of a path p is $P(p) = \prod_{(v,w) \in p} P(w \mid v)$. The posterior probability of a transition (v, w) is then

    $P(v, w) = \frac{\sum_{p \ni (v,w)} P(p)}{\sum_{p} P(p)}$

which can be computed efficiently by the forward-backward algorithm (Rabiner, 1989). We can now remove transitions leaving state v whose posterior probability is lower than $p_v = k \cdot \max_w P(v, w)$, where $0 < k \le 1$ controls the degree of pruning, with higher k forcing more pruning. All induced and pruned automata are trimmed to remove unreachable states.
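The automaton construction and posterior-based pruning can be sketched as below. This is a simplified illustration under several assumptions: states are identified with tokens, an artificial start state precedes the trigger word, and, instead of the forward-backward computation used in the paper, the sketch enumerates accepting paths up to a fixed length to obtain the same posterior quantity. All names are illustrative.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

START = "<s>"   # artificial start state placed before the trigger word

def induce_automaton(contexts: List[List[str]]):
    """Build a 1-reversible automaton over the contexts of one trigger word.
    States are identified with tokens, so every occurrence of a token maps to
    the same state; transition probabilities follow the bigram counts."""
    counts: Dict[str, Dict[str, int]] = defaultdict(lambda: defaultdict(int))
    finals: Set[str] = set()
    for c in contexts:
        prev = START
        for tok in c:
            counts[prev][tok] += 1
            prev = tok
        finals.add(prev)          # the last token of each context is accepting
    probs: Dict[str, Dict[str, float]] = {}
    for v, succ in counts.items():
        total = sum(succ.values())
        probs[v] = {w: n / total for w, n in succ.items()}
    return probs, finals

def edge_posteriors(probs, finals, max_len: int) -> Dict[Tuple[str, str], float]:
    """Posterior probability of each transition: total probability of the
    accepting paths (up to max_len transitions) that use it, normalised by
    the total probability of all accepting paths."""
    paths = []                    # (path probability, list of edges)

    def expand(state, prob, edges):
        if state in finals and edges:
            paths.append((prob, list(edges)))
        if len(edges) >= max_len:
            return
        for nxt, p in probs.get(state, {}).items():
            expand(nxt, prob * p, edges + [(state, nxt)])

    expand(START, 1.0, [])
    total = sum(p for p, _ in paths)
    post = defaultdict(float)
    for p, edges in paths:
        for e in set(edges):
            post[e] += p / total
    return post

def prune(probs, post, k: float):
    """Remove transitions leaving state v whose posterior is below
    k * max_w P(v, w), with 0 < k <= 1 controlling the degree of pruning."""
    pruned = {}
    for v, succ in probs.items():
        best = max(post[(v, w)] for w in succ)
        kept = {w: p for w, p in succ.items() if post[(v, w)] >= k * best}
        if kept:
            pruned[v] = kept
    return pruned
```

The paper's forward-backward computation obtains the same edge posteriors far more efficiently; the bounded path enumeration here is only for readability.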
3 Automata as Extractor

Each automaton induced using the method described in Section 2.3 represents a set of high-precision patterns that start with a given trigger word. By scanning unlabeled data with these patterns, we can extract text segments that can fill the slot token -ENT-. For example, assume the induced pattern is "analyst at -ENT- and" and the scanned text is "He is an analyst at the University of California and ...". Scanning this text with the pattern tells us that the segment "the University of California" can substitute for -ENT-. This extracted segment is a candidate extracted entity.

We now need to decide whether to retain all tokens inside a candidate extraction or to purge some of them, such as "the" in the example above. One way to handle this problem is to build a language model of content tokens and retain only the maximum-likelihood token sequence. In the current work, however, we use the following heuristic, which worked well in practice. Each token in the extracted text segment is labeled either keep (K) or droppable (D). By default, a token is labeled K; it is labeled D if it satisfies one of the droppable criteria. In the experiments reported in this paper, a token is droppable if it appears in a stopword list, is not capitalized, or is a number.

Once the tokens of a candidate extraction are labeled using this heuristic, the longest token sequence matching the regular expression K[DK]*K is retained and considered a final extraction; if there is only one K token, that token alone is retained. In the example above, the tokens are labeled "the/D University/K of/D California/K", and the extracted entity is "University of California".

To handle run-away extractions, we can set a domain-dependent hard limit on the number of tokens that can be matched with -ENT-. This stems from the intuition that useful extractions are not very long; for example, it is rare for a person name to be longer than five tokens.
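Here is a minimal sketch of the keep/drop heuristic and the K[DK]*K retention rule, assuming a small illustrative stopword list; the names label, trim_extraction, and STOPWORDS are not from the paper.

```python
import re
from typing import List, Optional

STOPWORDS = {"the", "a", "an", "of", "and", "at", "in"}   # illustrative list only

def label(token: str) -> str:
    """K = keep, D = droppable (stopword, non-capitalized, or a number)."""
    if token.lower() in STOPWORDS or token.islower() or token.isdigit():
        return "D"
    return "K"

def trim_extraction(tokens: List[str]) -> Optional[str]:
    """Keep the longest token span matching K[DK]*K, or a single K token."""
    labels = "".join(label(t) for t in tokens)
    spans = [m.span() for m in re.finditer(r"K[DK]*K|K", labels)]
    if not spans:
        return None
    start, end = max(spans, key=lambda s: s[1] - s[0])
    return " ".join(tokens[start:end])

print(trim_extraction("the University of California".split()))
# -> "University of California"
```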
3.1 Ranking Patterns and Entities

Using the method described above, patterns and the entities they extract from unlabeled data are paired. But both patterns and extractions vary in quality, so we need a method for ranking both. This is difficult given that we have no negative labeled data: seed entities are the only positive instances available.

Related previous work has tried to address this problem. Agichtein and Gravano (2000) seek to extract relations, so their pattern evaluation strategy treats one of the attributes of an extracted tuple as a key; a tuple is judged a positive or negative match for a pattern depending on whether other extracted values are associated with the same key. Unfortunately, this method is not applicable to entity extraction.

The pattern evaluation mechanism used here is similar in spirit to those of Etzioni et al. (2005) and Lin et al. (2003). With seeds for multiple classes available, we treat seed instances of one class as negative instances for the other classes. A pattern is penalized if it extracts entities that belong to the seed lists of the other classes. Let pos(p) and neg(p) be, respectively, the number of distinct positive and negative seeds extracted by pattern p. In contrast to the previous work mentioned above, we do not combine pos(p) and neg(p) into a single accuracy value. Instead, we discard every pattern p with a positive neg(p) value, as well as every pattern whose count of distinct positive seed extractions is below a threshold $e_{pattern}$. This scoring is very conservative.

There are several motivations for such conservative scoring. First, we are more interested in precision than recall: we believe that with massive corpora, a large number of entity instances can be extracted anyway, and high-accuracy extraction lets us use the extracted entities reliably in subsequent tasks without any human evaluation (see Section 4.3). Second, in the absence of more sophisticated pattern evaluation schemes (which we are investigating; see Section 6), we feel it is best to heavily penalize any pattern that extracts even a single negative instance.

Let G be the set of patterns retained by the filtering scheme described above, and let I(e, p) be an indicator function that takes the value 1 when entity e is extracted by pattern p and 0 otherwise. The score of e is

    $S(e) = \sum_{p \in G} I(e, p)$

The whole process can be iterated by adding to the seed list those extracted entities whose score is greater than or equal to a threshold $e_{entity}$.
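The pattern filtering and entity scoring described above can be sketched as follows, assuming extractions maps each pattern to the set of distinct strings it extracted from the unlabeled data; e_pattern corresponds to the threshold of the same name, and all function names are illustrative.

```python
from collections import defaultdict
from typing import Dict, Set

def filter_patterns(extractions: Dict[str, Set[str]],
                    pos_seeds: Set[str],
                    neg_seeds: Set[str],
                    e_pattern: int) -> Set[str]:
    """Keep patterns that extract no negative seeds (seeds of other classes)
    and at least e_pattern distinct positive seeds."""
    kept = set()
    for pattern, entities in extractions.items():
        pos = len(entities & pos_seeds)
        neg = len(entities & neg_seeds)
        if neg == 0 and pos >= e_pattern:
            kept.add(pattern)
    return kept

def score_entities(extractions: Dict[str, Set[str]],
                   kept_patterns: Set[str]) -> Dict[str, int]:
    """S(e) = number of retained patterns that extract entity e."""
    scores = defaultdict(int)
    for pattern in kept_patterns:
        for entity in extractions[pattern]:
            scores[entity] += 1
    return dict(scores)
```

Entities whose score meets the e_entity threshold can then be added to the seed list before the next iteration of the overall method.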