<?xml version="1.0" standalone="yes"?> <Paper uid="P06-1102"> <Title>Names and Similarities on the Web: Fact Extraction in the Fast Lane</Title> <Section position="4" start_page="809" end_page="811" type="metho"> <SectionTitle> 2 Similarities for Pattern Acquisition </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="809" end_page="810" type="sub_section"> <SectionTitle> 2.1 Generalization via Word Similarities </SectionTitle> <Paragraph position="0"> The extraction patterns are acquired by matching the pairs of phrases from the seed set into document sentences. The patterns consist of contiguous sequences of sentence terms, but otherwise differ from the types of patterns proposed in earlier work in two respects. First, the terms of a pattern are either regular words or, for higher generality, any word from a class of similar words. Second, the amount of textual context encoded in a pattern is limited to the sequence of terms between (i.e., infix) the pair of phrases from a seed fact that could be matched in a document sentence, thus excluding any context to the left (i.e., prefix) and to the right (i.e., postfix) of the seed.</Paragraph> <Paragraph position="1"> The pattern shown at the top of Figure 2, which contains the sequence [CL1 born CL2 00 .], illustrates the use of classes of distributionally similar words within extraction patterns. The first word class in the sequence, CL1, consists of words such as {was, is, could}, whereas the second class includes {February, April, June, Aug., November} and other similar words. The classes of words are computed on the fly over all sequences of terms in the extracted patterns, on top of a large set of pairwise similarities among words (Lin, 1998) extracted in advance from around 50 million news articles indexed by the Google search engine over three years. All digits in both patterns and sentences are replaced with a common marker, such that any two numerical values with the same number of digits will overlap during matching.</Paragraph> <Paragraph position="2"> Many methods have been proposed to compute distributional similarity between words, e.g., (Hindle, 1990), (Pereira et al., 1993), (Grefenstette, 1994) and (Lin, 1998). Almost all of the methods represent a word by a feature vector, where each feature corresponds to a type of context in which the word appeared. They differ in how the feature vectors are constructed and how the similarity between two feature vectors is computed.</Paragraph> <Paragraph position="3"> In our approach, we define the features of a word w to be the set of words that occurred within a small window of w in a large corpus. The context window of an instance of w consists of the closest non-stopword on each side of w and the stop-words in between. The value of a feature wprime is defined as the pointwise mutual information between wprime and w: PMI(wprime, w) = [?]log( P(w,wprime)P(w)P(wprime)). The similarity between two different words w1 and w2, S(w1, w2), is then computed as the cosine of the angle between their feature vectors.</Paragraph> <Paragraph position="4"> While the previous approaches to distributional similarity have only applied to words, we applied the same technique to proper names as well as words. The following are some example similar words and phrases with their similarities, as obtained from the Google News corpus: To our knowledge, the only previous study that embeds similarities into the acquisition of extraction patterns is (Stevenson and Greenwood, 2005). 
<Paragraph position="5"> To our knowledge, the only previous study that embeds similarities into the acquisition of extraction patterns is (Stevenson and Greenwood, 2005). The authors present a method for computing pairwise similarity scores among large sets of potential syntactic (subject-verb-object) patterns, in order to detect centroids of mutually similar patterns. Because it assumes that the underlying text collection is syntactically parsed in order to generate the potential patterns in the first place, the method is impractical on Web-scale collections. Two patterns, e.g., chairman-resign and CEO-quit, are similar to each other if their components are present in an external hand-built ontology (i.e., WordNet) and the similarity among the components over the ontology is high. Since general-purpose ontologies, and WordNet in particular, contain many classes (e.g., chairman and CEO) but very few instances such as Osasuna or Crewe, patterns containing an instance rather than a class will not be found to be similar to one another. In comparison, classes and instances are equally useful in our method for generalizing patterns for fact extraction: we merge basic patterns into generalized patterns regardless of whether the similar words belong, as classes or instances, to any external ontology.</Paragraph> </Section> <Section position="2" start_page="810" end_page="811" type="sub_section"> <SectionTitle> 2.2 Generalization via Infix-Only Patterns </SectionTitle> <Paragraph position="0"> By giving up the contextual constraints imposed by the prefix and postfix, infix-only patterns represent the most aggressive type of extraction patterns that still use contiguous sequences of terms.</Paragraph> <Paragraph position="1"> In the absence of the prefix and postfix, the outer boundaries of the fact are computed separately for the beginning of the first (left) phrase and the end of the second (right) phrase of the candidate fact. For generality, the computation relies only on the part-of-speech tags of the current seed set. Starting forward from the right extremity of the infix, we collect a growing sequence of terms whose part-of-speech tags are [P1+ P2+ .. Pn+], where the notation Pi+ represents one or more consecutive occurrences of the part-of-speech tag Pi. The sequence [P1 P2 .. Pn] must be exactly the sequence of part-of-speech tags from the right side of one of the seed facts. The point where the sequence can no longer be grown defines the boundary of the fact. A similar procedure is applied backwards, starting from the left extremity of the infix. An infix-only pattern produces a candidate fact from a sentence only if an acceptable sequence is found both to the left and to the right of the infix.</Paragraph>
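The boundary-growing step can be sketched as follows, assuming sentences arrive as (token, POS-tag) pairs. The handling of repeated identical tags (e.g., a seed side tagged [NNP NNP]) reflects one plausible reading of the [P1+ P2+ .. Pn+] notation, not the authors' code.

```python
def grow_right(tagged, start, seed_tags):
    """Grow a token sequence to the right of the infix whose POS tags match
    [P1+ P2+ .. Pn+], where seed_tags = [P1 P2 .. Pn] is the tag sequence of
    the right side of a seed fact. Returns the matched tokens, or None."""
    tokens, i = [], start
    for k, tag in enumerate(seed_tags):
        if i >= len(tagged) or tagged[i][1] != tag:
            return None                      # a required tag cannot be matched
        tokens.append(tagged[i][0]); i += 1
        # Absorb further tokens of the same tag, unless an identical tag is
        # still pending in seed_tags (e.g., [NNP NNP] needs two separate tokens).
        while (i < len(tagged) and tagged[i][1] == tag
               and not (k + 1 < len(seed_tags) and seed_tags[k + 1] == tag)):
            tokens.append(tagged[i][0]); i += 1
    return tokens

def grow_left(tagged, end, seed_tags):
    """Mirror procedure, growing backwards from the left extremity of the
    infix; `end` is the index of the first infix token."""
    rev = grow_right(list(reversed(tagged[:end])), 0, list(reversed(seed_tags)))
    return list(reversed(rev)) if rev is not None else None

# Hypothetical tagged sentence, digits already replaced by the common marker;
# the infix "was born in" spans indices 2..4.
sent = [("John", "NNP"), ("Lennon", "NNP"), ("was", "VBD"),
        ("born", "VBN"), ("in", "IN"), ("0000", "CD"), (".", ".")]
assert grow_left(sent, 2, ["NNP", "NNP"]) == ["John", "Lennon"]
assert grow_right(sent, 5, ["CD"]) == ["0000"]
assert grow_left(sent[1:], 1, ["NNP", "NNP"]) is None  # one NNP does not match
```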
<Paragraph position="2"> Figure 2 illustrates the process on the infix-only pattern mentioned earlier and one seed fact. The part-of-speech tags for the seed fact are [NNP NNP] for the left side and [CD] for the right side. The infix occurs in all sentences. However, matching the part-of-speech tags of the sentence sequences to the left and right of the infix against the part-of-speech tags of the seed fact succeeds only for the last three sentences. It fails for the first sentence S1 to the left of the infix, because [.. NNP] (for Vega) does not match [NNP NNP]. It also fails for the second sentence S2 on both the left and the right side of the infix, since [.. NN] (for poet) does not match [NNP NNP], and [JJ ..] (for several) does not match [CD].</Paragraph> </Section> </Section> <Section position="5" start_page="811" end_page="812" type="metho"> <SectionTitle> 3 Similarities for Validation and Ranking </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="811" end_page="811" type="sub_section"> <SectionTitle> 3.1 Revisiting Standard Ranking Criteria </SectionTitle> <Paragraph position="0"> Because some of the acquired extraction patterns are too generic or wrong, all approaches to iterative acquisition place a strong emphasis on the choice of criteria for ranking. Previous literature quasi-unanimously assesses the quality of each candidate fact based on the number and quality of the patterns that extract the candidate fact (more is better), and on the number of seed facts extracted by the same patterns (again, more is better) (Agichtein and Gravano, 2000; Thelen and Riloff, 2002; Lita and Carbonell, 2004). However, our experiments with many variations of previously proposed scoring functions suggest that they have limited applicability in large-scale fact extraction, for two main reasons. The first is that it is impractical to perform hundreds of acquisition iterations on terabytes of text; instead, one needs to grow the seed set aggressively in each iteration. Previous scoring functions were implicitly designed for cautious acquisition strategies (Collins and Singer, 1999), which expand the seed set very slowly across consecutive iterations.</Paragraph> <Paragraph position="1"> In that case, it makes sense to single out a small number of the best candidates from among those available. Comparatively, when 10,000 or more candidate facts need to be added to a seed set of 10 seeds as early as the first iteration, it is difficult to distinguish the quality of extraction patterns based, for instance, only on the percentage of the seed set that they extract. The second reason is the noisy nature of the Web. A substantial number of factors can and do conspire to produce worst-case extraction scenarios on the Web.</Paragraph> <Paragraph position="2"> Patterns of apparently high quality turn out to produce a large quantity of erroneous "facts" such as (A-League, 1997), but also the more interesting (Jethro Tull, 1947) shown earlier in Figure 2, or (Web Site David, 1960) or (New York, 1831). As for extraction patterns of average or lower quality, they naturally lead to even more spurious extractions.</Paragraph> </Section> <Section position="2" start_page="811" end_page="811" type="sub_section"> <SectionTitle> 3.2 Ranking of Extraction Patterns </SectionTitle> <Paragraph position="0"> The intuition behind our criteria for ranking generalized patterns is that patterns of higher precision tend to contain words that are indicative of the relation being mined. Thus, a pattern is more likely to produce good candidate facts if its infix contains the word language or spoken when extracting Language-SpokenIn-Country facts, or the word capital when extracting City-CapitalOf-Country relations. In each acquisition iteration, the scoring of patterns is a two-pass procedure. The first pass computes the normalized frequencies of all words, excluding stopwords, over the entire set of extraction patterns. The computation applies separately to the prefix, infix and postfix of the patterns. In the second pass, the score of an extraction pattern is determined by the words with the highest frequency scores in its prefix, infix and postfix, as computed in the first pass and adjusted for the relative distance to the start and end of the infix.</Paragraph>
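A sketch of the two-pass scoring procedure follows. The paper does not give the exact distance adjustment, so the exponential decay below is an assumption, as is the representation of a pattern as a dict of token lists.

```python
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "in", "to", "and"}  # illustrative, as before

def first_pass(patterns):
    """Pass 1: normalized frequencies of all non-stopwords, computed
    separately over the prefixes, infixes and postfixes of all patterns."""
    freqs = {}
    for part in ("prefix", "infix", "postfix"):
        counts = Counter(w for p in patterns for w in p[part]
                         if w not in STOPWORDS)
        total = sum(counts.values()) or 1
        freqs[part] = {w: c / total for w, c in counts.items()}
    return freqs

def score_pattern(pattern, freqs, decay=0.9):
    """Pass 2: score a pattern by the highest-frequency word in each part,
    discounted by that word's distance to the infix boundary. The
    multiplicative `decay` is an illustrative choice."""
    score = 0.0
    for part in ("prefix", "infix", "postfix"):
        tokens = pattern[part]
        best = 0.0
        for i, w in enumerate(tokens):
            if w not in freqs[part]:            # stopwords have no frequency score
                continue
            if part == "prefix":
                dist = len(tokens) - 1 - i          # distance to the infix start
            elif part == "postfix":
                dist = i                            # distance to the infix end
            else:
                dist = min(i, len(tokens) - 1 - i)  # nearer infix extremity
            best = max(best, freqs[part][w] * decay ** dist)
        score += best
    return score
```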
</Section> <Section position="3" start_page="811" end_page="812" type="sub_section"> <SectionTitle> 3.3 Ranking of Candidate Facts </SectionTitle> <Paragraph position="0"> Figure 3 introduces a new scheme for assessing the quality of the candidate facts, based on the computation of similarity scores for each candidate relative to the set of seed facts. A candidate fact, e.g., (Richard Steele, 1672), is similar to the seed set if both of its phrases, i.e., Richard Steele and 1672, are similar to the corresponding phrases (John Lennon or Stephen Foster in the case of Richard Steele) from the seed facts. For a phrase of a candidate fact to be assigned a non-default (non-minimum) score, each of its words must be similar to one or more words situated at the same positions in the seed facts. This is the case for the first five candidate facts in Figure 3. For example, the first word Richard from one of the candidate facts is similar to the first word John from one of the seed facts. Concurrently, the last word Steele from the same phrase is similar to Foster from another seed fact. Therefore Richard Steele is similar to the seed facts. The score of a phrase containing N words is:</Paragraph> <Paragraph position="1"> Score(phrase) = C1 x (1/N) x sum_{i=1..N} Sim_i, if Sim_i > 0 for all i; C2, otherwise.</Paragraph> <Paragraph position="2"> where Sim_i is the similarity of the component word at position i in the phrase, and C1 and C2 are scaling constants such that C2 ≪ C1. Thus, the similarity score of a candidate fact aggregates individual word-to-word similarity scores, first for the left side and then for the right side of the candidate fact. In turn, the similarity score of a component word Sim_i is higher if: a) the computed word-to-word similarity scores are higher relative to words at the same position i in the seeds; and b) the component word is similar to words from more than one seed fact.</Paragraph>
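Read as pseudocode, the scoring of one side of a candidate fact might look like the sketch below. Here word_sim stands for the distributional similarity of Section 2.1, and the values of C1 and C2 are illustrative placeholders; the paper requires only that C2 be much smaller than C1.

```python
def phrase_score(candidate, seed_phrases, word_sim, C1=1.0, C2=0.01):
    """Score one side of a candidate fact against the same side of the seed
    facts. Every word must be similar to a word at the same position in at
    least one seed phrase; otherwise the default (minimum) score C2 is
    returned. Similarity to words from several seeds raises the score."""
    if not candidate:
        return C2
    total = 0.0
    for i, w in enumerate(candidate):
        sims = []
        for seed in seed_phrases:
            if i < len(seed):
                s = word_sim(w, seed[i])
                if s > 0:
                    sims.append(s)
        if not sims:
            return C2            # some word has no similar seed word
        total += sum(sims)       # aggregate over all matching seeds
    return C1 * total / len(candidate)

# Example: the left side of (Richard Steele, 1672) against two seed facts.
# score = phrase_score(["Richard", "Steele"],
#                      [["John", "Lennon"], ["Stephen", "Foster"]], word_sim)
```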
<Paragraph position="3"> The similarity scores are one of the features in a linear combination that induces a ranking over the candidate facts. Three other domain-independent features contribute to the final ranking: a) a phrase completeness score, computed statistically over the entire set of candidate facts, which demotes candidate facts if either of their two sides is likely to be incomplete (e.g., Mary Lou vs. Mary Lou Retton, or John F. vs. John F. Kennedy); b) the average PageRank value over all documents from which the candidate fact is extracted; and c) the pattern-based scores of the candidate fact. The latter feature converts the scores of the patterns extracting the candidate fact into a score for the candidate fact. For this purpose, it considers a fixed-length window of words around each match of the candidate fact in some sentence from the text collection. This is equivalent to analyzing all sentence contexts from which a candidate fact can be extracted.</Paragraph> <Paragraph position="4"> For each window, the word with the highest frequency score, as computed in the first pass of the procedure for scoring the patterns, determines the score of the candidate fact in that context. The overall pattern-based score of a candidate fact is the sum of its scores over all contexts of occurrence, normalized by the frequency of occurrence of the candidate over all sentences.</Paragraph> <Paragraph position="5"> Besides inducing a ranking over the candidate facts, the similarity scores also serve as a validation filter: any candidate that is not similar to the seed set can be filtered out. For instance, the elimination of (Jethro Tull, 1947) is a side effect of verifying that Tull is not similar to any of the last-position words from phrases in the seed set.</Paragraph> </Section> </Section> </Paper>