File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/04/n04-1010_intro.xml
Size: 5,133 bytes
Last Modified: 2025-10-06 14:02:15
<?xml version="1.0" standalone="yes"?> <Paper uid="N04-1010"> <Title>Acquiring Hyponymy Relations from Web Documents</Title> <Section position="2" start_page="0" end_page="0" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> The goal of this work is to automatically acquire hyponymy relations for a wide range of words or phrases from HTML documents on the WWW. Unlike previous attempts (Hearst, 1992; Caraballo, 1999; Imasumi, 2001; Fleischman et al., 2003; Morin and Jacquemin, 2003; Ando et al., 2003), we do not rely on particular lexicosyntactic patterns. Such patterns occur relatively infrequently, and many words or phrases never appear in them even when we look at a large number of texts. Searching for other clues that indicate hyponymy relations is thus a worthwhile effort. We try to acquire hyponymy relations by combining three different types of clues that are obtainable for a wide range of words or phrases. The first type of clue is inclusion in itemizations or lists found in typical HTML documents on the WWW. The second consists of statistical measures such as the document frequency (df) and the inverse document frequency (idf), which are popular in the IR literature. 
The third is verb-noun co-occurrence in normal corpora.</Paragraph> <Paragraph position="1"> In our acquisition, we made the following assumptions.</Paragraph> <Paragraph position="2"> Assumption A: Expressions included in the same itemization or listing in an HTML document are likely to have a common hypernym.</Paragraph> <Paragraph position="3"> Assumption B: Given a set of hyponyms that have a common hypernym, the hypernym appears in many documents that include the hyponyms.</Paragraph> <Paragraph position="4"> Assumption C: Hyponyms and their hypernyms are semantically similar.</Paragraph> <Paragraph position="5"> Our acquisition process computes a common hypernym for expressions in the same itemization. It proceeds as follows. First, we download a large number of HTML documents from the WWW and extract sets of natural language expressions that are listed in the same itemized region of a document. Consider the itemization in Fig. 1. We extract the set of expressions {Toyota, Honda, Nissan} from it. By Assumption A, we can treat these expressions as candidate hyponyms that have a common hypernym such as &quot;company&quot;. We call such expressions in the same itemization hyponym candidates. In particular, the set of hyponym candidates extracted from a single itemization or list is called a hyponym candidate set (HCS). For the example document, we would treat Toyota, Honda, and Nissan as hyponym candidates, and regard them as members of the same HCS.</Paragraph> <Paragraph position="6"> We then use an existing search engine to download documents that include at least one hyponym candidate, and select the noun that appears in those documents and has the largest score. The score is designed so that words appearing in many of the downloaded documents are ranked highly, in accordance with Assumption B. 
We call the selected noun a hypernym candidate for the given hyponym candidates.</Paragraph> <Paragraph position="7"> Note that if we download documents including &quot;Toyota&quot; or &quot;Honda&quot;, many will include the word &quot;company&quot;, which is a true hypernym of Toyota. However, words that are not hypernyms but are closely associated with Toyota or Honda (e.g., &quot;price&quot;) will also appear in many of the downloaded documents. The next step of our procedure is designed to exclude such non-hypernyms according to Assumption C. We compute the similarity between the hypernym candidate and the hyponym candidates in an HCS, and eliminate the HCS and its hypernym candidate from the output if they are not semantically similar. For instance, if the previous step of our procedure produces &quot;price&quot; as a hypernym candidate for Toyota and Honda, then the hypernym candidate and the hyponym candidates are removed from the output. We show empirically that this helps to improve overall precision. Finally, we further refine the computed hypernym candidates by using additional heuristic rules. Though we admit that these rules are rather ad hoc, they worked well in our experiments.</Paragraph> <Paragraph position="8"> We have tested the effectiveness of our method through a series of experiments in which we used HTML documents downloaded from actual web sites. We observed that our method can find a significant number of hypernyms that at least some alternative hypernym acquisition procedures cannot acquire, at least when only a rather small amount of text is available.</Paragraph> <Paragraph position="9"> In this paper, Section 2 describes our acquisition algorithm. Section 3 presents our experimental results, which we obtained using Japanese HTML documents, and Section 4 discusses the benefits of our method based on a comparison with alternative methods.</Paragraph> </Section> </Paper>