<?xml version="1.0" standalone="yes"?> <Paper uid="P06-1015"> <Title>Espresso: Leveraging Generic Patterns for Automatically Harvesting Semantic Relations</Title> <Section position="4" start_page="113" end_page="113" type="intro"> <SectionTitle> 2 Relevant Work </SectionTitle> <Paragraph position="0"> To date, most research on relation harvesting has focused on is-a and part-of. Approaches fall into two categories: pattern- and clustering-based.</Paragraph> <Paragraph position="1"> Most common are pattern-based approaches.</Paragraph> <Paragraph position="2"> Hearst (1992) pioneered using patterns to extract hyponym (is-a) relations. After manually building three lexico-syntactic patterns, Hearst sketched a bootstrapping algorithm to learn more patterns from instances, which has served as the model for most subsequent pattern-based algorithms.</Paragraph> <Paragraph position="3"> Berland and Charniak (1999) proposed a system for part-of relation extraction, based on the approach of Hearst (1992). Seed instances are used to infer linguistic patterns, which are in turn used to extract new instances. While this study introduces statistical measures to evaluate instance quality, it remains vulnerable to data sparseness and is limited to one-word terms.</Paragraph> <Paragraph position="4"> Improving upon Berland and Charniak (1999), Girju et al. (2006) employ machine learning algorithms and WordNet (Fellbaum 1998) to disambiguate part-of generic patterns like &quot;X's Y&quot; and &quot;X of Y&quot;. This study is the first extensive attempt to make use of generic patterns. To discard incorrect instances, they learn WordNet-based selectional restrictions, like &quot;X(scene#4)'s Y(movie#1)&quot;. While this approach makes great strides in improving precision and recall, it requires heavy supervision through manual semantic annotations.</Paragraph> <Paragraph position="5"> Ravichandran and Hovy (2002) focus on scaling relation extraction to the Web. 
A simple and effective algorithm is proposed to infer surface patterns from a small set of instance seeds by extracting substrings relating seeds in corpus sentences. The approach gives good results on specific relations such as birthdates; however, it has low precision on generic ones like is-a and part-of. Pantel et al. (2004) proposed a similar, highly scalable approach, based on an edit-distance technique, to learn lexico-POS patterns, showing both good performance and efficiency. Espresso uses a similar approach to infer patterns, but we make use of generic patterns and apply refining techniques to deal with a wide variety of relations.</Paragraph> <Paragraph position="6"> Other pattern-based algorithms include those of Riloff and Shepherd (1997), who used a semi-automatic method for discovering similar words using a few seed examples, KnowItAll (Etzioni et al.</Paragraph> <Paragraph position="7"> 2005), which performs large-scale extraction of facts from the Web, Mann (2002), who used part-of-speech patterns to extract a subset of is-a relations involving proper nouns, and Downey et al.</Paragraph> <Paragraph position="8"> (2005), who formalized the problem of relation extraction in a coherent and effective combinatorial model that is shown to outperform previous probabilistic frameworks.</Paragraph> <Paragraph position="9"> Clustering approaches have so far been applied only to is-a extraction. These methods use clustering algorithms to group words according to their meanings in text, label the clusters using their members' lexical or syntactic dependencies, and then extract an is-a relation between each cluster member and the cluster label. Caraballo (1999) made the first attempt, which used conjunction and apposition features to build noun clusters. 
Recently, Pantel and Ravichandran (2004) extended this approach by making use of all syntactic dependency features for each noun.</Paragraph> <Paragraph position="10"> The advantage of clustering approaches is that they permit algorithms to identify is-a relations that do not explicitly appear in text; however, they generally fail to produce coherent clusters from fewer than 100 million words and are therefore unreliable for small corpora.</Paragraph> </Section> </Paper>