<?xml version="1.0" standalone="yes"?> <Paper uid="N04-1041"> <Title>Automatically Labeling Semantic Classes</Title> <Section position="3" start_page="0" end_page="0" type="intro"> <SectionTitle> 2 Previous Work </SectionTitle> <Paragraph position="0"> There have been several approaches to automatically discovering lexico-semantic information from text (Hearst 1992; Riloff and Shepherd 1997; Riloff and Jones 1999; Berland and Charniak 1999; Pantel and Lin 2002; Fleischman et al. 2003; Girju et al. 2003). One approach constructs automatic thesauri by computing the similarity between words based on their distribution in a corpus (Hindle 1990; Lin 1998). The output of these programs is, for each word, a ranked list of similar words. For example, Lin's approach outputs the following top-20 similar words of orange: (D) peach, grapefruit, yellow, lemon, pink, avocado, tangerine, banana, purple, Santa Ana, strawberry, tomato, red, pineapple, pear, apricot, apple, green, citrus, mango. A common problem of such lists is that they do not discriminate between the senses of polysemous words.</Paragraph> <Paragraph position="1"> For example, in (D), the color and fruit senses of orange are mixed up.</Paragraph> <Paragraph position="2"> Lin and Pantel (2001) proposed a clustering algorithm, UNICON, which generates similar lists but discriminates between senses of words. Later, Pantel and Lin (2002) improved the precision and recall of UNICON clusters with CBC (Clustering by Committee). Using sets of representative elements called committees, CBC discovers cluster centroids that unambiguously describe the members of a possible class. The algorithm initially discovers committees that are well scattered in the similarity space. It then proceeds by assigning elements to their most similar committees. After assigning an element to a cluster, CBC removes the features that overlap with that cluster from the element before assigning it to another cluster. 
This allows CBC to discover the less frequent senses of a word and to avoid discovering duplicate senses.</Paragraph> <Paragraph position="3"> CBC discovered both the color sense of orange, as shown in list (C) of Section 1, and the fruit sense shown below: (E) peach, pear, apricot, strawberry, banana, mango, melon, apple, pineapple, cherry, plum, lemon, grapefruit, orange, berry, raspberry, blueberry, kiwi, ...</Paragraph> <Paragraph position="4"> There have also been several approaches to discovering hyponym (is-a) relationships from text. Hearst (1992) used seven lexico-syntactic patterns, for example "such NP as {NP,}*{(or|and)} NP" and "NP {, NP}*{,} or other NP". Berland and Charniak (1999) used similar pattern-based techniques and other heuristics to extract meronymy (part-whole) relations. They reported a precision of about 55% on a corpus of 100,000 words. Girju, Badulescu and Moldovan (2003) improved upon this work by using a machine learning filter. Mann (2002) and Fleischman et al. (2003) used part-of-speech patterns to extract a subset of hyponym relations involving proper nouns.</Paragraph> </Section> </Paper>