<?xml version="1.0" standalone="yes"?>
<Paper uid="W93-0106">
  <Title>Customizing a Lexicon to Better Suit a Computational Task</Title>
  <Section position="4" start_page="0" end_page="55" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Much effort is being applied to the creation of lexicons and the acquisition of semantic and syntactic attributes of the lexical items that comprise them, e.g, \[1\], \[4\],\[7\],\[8\], \[11\], \[16\], \[18\], \[20\]. However, a lexicon as given may not suit the requirements of a particular computational task. Because lexicons are expensive to build, rather than create new ones from scratch, it is preferable to adjust existing ones to meet an application's needs. In this paper we describe such an effort: we add associational information to a hierarchically structured lexicon in order to better serve a text labeling task.</Paragraph>
    <Paragraph position="1"> An algorithm for partitioning a full-length expository text into a sequence of subtopieal discussions is described in \[9\]. Once the partitioning is done, we need to assign labels 1 indicating what the subtopical discussions are about, for the purposes of information retrieval and hypertext navigation. One way to label texts, when working within a limited domain of discourse, is to start with a pre-defined set of topics and specify the word contexts that indicate the topics of interest (e.g., \[10\]). Another way, assuming that a large collection of pre-labeled texts exists, is to use statistics to automatically infer which lexical items indicate which labels (e.g., \[12\]). In contrast, we are interested in assigning labels to general, domain-independent text, without benefit of pre-classified texts. In all three cases, a lexicon that specifies which lexical items correspond to which topics is required. The topic labeling method we use is statistical and thus requires a large number of representative lexical items for each category.</Paragraph>
    <Paragraph position="2"> The starting point for our lexicon is WordNet \[13\], which is readily available online and provides a large repository of English lexical items. WordNet 2 is composed of synsets,  structures containing sets of terms with synonymous meanings, thus allowing a distinction to be made between different senses of homographs. Associated with each synset is a list of relations that the synset participates in. One of these, in the noun dataset, is the hyponymy relation (and its inverse, hypernymy), roughly glossed as the &amp;quot;ISA&amp;quot; relation. This relation imposes a hierarchical structure on the synsets, indicating how to generalize from a subordinate term to a superordinate one, and vice versa. 3 This is a very useful kind of information for many tasks, such as reasoning with generalizations and assigning probabilities to grammatical relations \[17\].</Paragraph>
    <Paragraph position="3"> We would like to adjust this lexicon in two ways in order to facilitate the label assignment task. The first is to collapse the fine-grained hierarchical structure into a set of coarse but semantically-related categories. These categories will provide the lexical evidence for the topic labels. (After the label is assigned, the hierarchical structure can be reintroduced.) Once the hierarchy has been converted into categories, we can augment the categories with new lexical items culled from free :text corpora, in order to further improve the labeling task.</Paragraph>
    <Paragraph position="4"> The second way we would like to adjust the lexicon is to combine categories from distant parts of the hierarchy. In particular, we are interested in finding groupings of terms that contribute to a frame or schema-like representation \[14\]; this can be achieved by finding associational lexical relations among the existing taxonymic relations. For example, WordNet has the following synsets: &amp;quot;athletic game&amp;quot; (hyponyms: baseball, tennis), &amp;quot;sports implement&amp;quot; (hyponyms: bat, racquet), and &amp;quot;tract, piece of land&amp;quot; (hyponyms: baseball_diamond, court), none of which are closely related in the hierarchy. We would like to automatically find relations among categories headed by synsets like these. (In Version 1.3, the WordNet encoders have placed some associational links among these categories, but still only some of the desired connections appear.) In other words, we would like to derive links among schematically related parts of the hierarchy, where these links reflect the text genre on which text processing is to be done.</Paragraph>
    <Paragraph position="5"> \[19\] describes a method called WordSpace that represents lexical items according to how semantically close they are to one another, based on evidence from a large text corpus.</Paragraph>
    <Paragraph position="6"> We propose combining this term-similarity information with the hierarchical information already available in WordNet to create structured associational information.</Paragraph>
    <Paragraph position="7"> In the next section we describe the algorithm for compressing the WordNet hierarchy into a set of categories. This is followed by a discussion of how these categories are to be used and why they need to be improved. Section 4 describes the first improvement technique: including new, related terms from a corpus, and Section 5 describes the second improvement technique: bringing disparate categories together to form schematic groupings while retaining the given hierarchical structure. Section 6 concludes the paper.</Paragraph>
  </Section>
class="xml-element"></Paper>