<?xml version="1.0" standalone="yes"?>
<Paper uid="N04-4030">
  <Title>Nearly-Automated Metadata Hierarchy Creation</Title>
  <Section position="2" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Human-readable hierarchies of category metadata are needed for a wide range of information-centric applications, including information architectures for web sites (Rosenfeld and Morville, 2002) and metadata for browsing image and document collections (Yee et al., 2003).</Paragraph>
    <Paragraph position="1"> In the information architecture community, methods for creation of content-oriented metadata tend to be almost entirely manual (Rosenfeld and Morville, 2002).</Paragraph>
    <Paragraph position="2"> The standard procedure is to gather lists of terms from existing resources, and organize them by selecting, merging and augmenting the term lists to produce a set of hierarchical category labels. Usually the metadata categories are used as labels which are assigned manually to the items in the collection.</Paragraph>
    <Paragraph position="3"> We advocate instead a nearly-automated approach to building hierarchical subject category metadata, where suggestions for metadata terms are automatically generated and grouped into hierarchies and then presented to information architects for limited pruning and editing.</Paragraph>
    <Paragraph position="4"> To be truly useful, these suggested groupings should be close to the final product; if the results are too scattered, a simple list of the most well-distributed terms is probably more useful (a similar phenomenon is seen in machine-aided translation systems (Church and Hovy, 1993)).</Paragraph>
    <Paragraph position="5"> More specifically, we aim to develop algorithms for generating category sets that (a) are intuitive to the target audience who will be browsing a web site or collection, (b) reflect the contents of the collection, and (c) allow for (nearly) automated assignment of the categories to the items in the collection.</Paragraph>
    <Paragraph position="6"> For a category system to be intuitive, modern information science practice holds that it should consist of a set of IS-A (hypernym) hierarchies (see footnote 1), from which multiple labels can be selected and assigned to an item, following the tenets of faceted classification (Rosenfeld and Morville, 2002; Yee et al., 2003). For example, a medical journal article will often simultaneously have terms assigned to it from anatomy, disease, and drug category hierarchies. Furthermore, usability studies suggest that the hierarchies should be neither overly deep nor overly wide, and preferably should have a concave structure (meaning broader at the root and leaves, narrower in the middle) (Bernard, 2002).</Paragraph>
    <Paragraph position="7"> Previous work on automated methods has primarily focused on using clustering techniques, which have the advantage of being automated and data-driven. However, a major problem with clustering is that the groupings show terms that are associated with one another, rather than hierarchical parent-child relations. Studies indicate that users prefer organized categories over associational clusters (Chen et al., 1998; Pratt et al., 1999).</Paragraph>
    <Paragraph position="8"> We have tested several approaches, including K-means clustering, subsumption (Sanderson and Croft, 1999), computing lexical co-occurrences (Schutze, 1993) and building on the WordNet lexical hierarchy (Fellbaum, 1998). We have found that the last of these produces by far the most intuitive groupings for creating a re-usable, human-readable category structure. Although the idea of using a resource like WordNet for this type of application seems rather obvious, to our knowledge it has not been used to create subject-oriented metadata for browsing.</Paragraph>
    <Paragraph position="9"> This may be in part because WordNet is very large and its word senses are assumed to be too fine-grained (Mihalcea and Moldovan, 2001), or because its structure is assumed to be inappropriate. [Footnote 1: Part-of (meronymy) relations are also intuitive, but are not considered here.]</Paragraph>
    <Paragraph position="10"> However, we have found that, for some collections, we can produce a structure that is close to the target goals by combining three elements: the assumption that a small amount of hand-editing will follow the automated processing, a bottom-up approach that extracts those parts of the hypernym hierarchy that are relevant to the collection, and a compression algorithm that simplifies the hierarchical structure.</Paragraph>
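    <Paragraph> To make the bottom-up extraction and compression steps concrete, the following is a minimal sketch and not the algorithm evaluated in this paper. It assumes a small hand-written hypernym map (HYPERNYM, a hypothetical stand-in for WordNet's IS-A links) and one plausible compression rule: splice out any node that has exactly one child and is not itself a collection term. The function names extract_subtree and compress are illustrative only.

```python
from collections import defaultdict

# Hypothetical toy hypernym map (child -> parent). It stands in for
# WordNet's IS-A links and is not real WordNet data. It must be acyclic.
HYPERNYM = {
    "tuna": "bony fish",
    "bony fish": "fish",
    "salmon": "food fish",
    "food fish": "fish",
    "fish": "vertebrate",
    "sparrow": "bird",
    "robin": "bird",
    "bird": "vertebrate",
    "vertebrate": "animal",
    "animal": "entity",
}


def extract_subtree(terms, hypernym):
    """Bottom-up step: keep only the hypernym paths that lead up from the terms."""
    children = defaultdict(set)
    for term in terms:
        node = term
        while node in hypernym:  # climb until the root is reached
            parent = hypernym[node]
            children[parent].add(node)
            node = parent
    return children


def compress(children, node, keep):
    """Assumed compression rule: skip single-child nodes that are not in `keep`.

    Returns (label, subtree), where subtree maps each child label to its own subtree.
    """
    kids = sorted(children.get(node, ()))
    while len(kids) == 1 and node not in keep:
        node = kids[0]  # replace the node by its only child
        kids = sorted(children.get(node, ()))
    subtree = {}
    for kid in kids:
        label, sub = compress(children, kid, keep)
        subtree[label] = sub
    return node, subtree


terms = ["tuna", "salmon", "sparrow", "robin"]
root, tree = compress(extract_subtree(terms, HYPERNYM), "entity", set(terms))
print(root, tree)
```

In this toy example the chain entity, animal, vertebrate collapses to the single label vertebrate, and the intermediate senses "bony fish" and "food fish" disappear, leaving the two categories fish and bird directly above the collection terms. An information architect could then prune or rename the remaining labels by hand.</Paragraph>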
    <Paragraph position="11"> Below we describe related work, the method for converting WordNet into a more usable form, and the results of using the algorithm on a test collection.</Paragraph>
  </Section>
</Paper>