<?xml version="1.0" standalone="yes"?>
<Paper uid="A00-1038">
  <Title>Large-scale Controlled Vocabulary Indexing for Named Entities</Title>
  <Section position="3" start_page="276" end_page="277" type="relat">
    <SectionTitle>
2 Related Work
</SectionTitle>
    <Paragraph position="0"> The Carnegie Group's Text Categorization Shell (TCS) (Hayes, 1992) uses shallow knowledge engineering techniques to categorize documents with respect to large sets of predefined topics. Each topic requires the development of a rule set that includes terms, contextual information, weighting, if-then rules and other pattern matching operations.</Paragraph>
    <Paragraph position="1"> Rule development was initially a manual, iterative process, although Hayes (1992) discusses the intent to explore ways to automate it. Reported TCS accuracy is high: one application deployed at Reuters achieved 94% recall and 84% precision, and other reported tests achieved recall and precision rates of 90% or better.</Paragraph>
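For reference, the recall and precision figures quoted throughout this section can be computed from counts of correct, missed, and spurious assignments. The following is a minimal sketch in Python; the function name and the counts are illustrative assumptions, not data from any of the cited evaluations.

```python
def recall_precision(true_pos: int, false_neg: int, false_pos: int) -> tuple[float, float]:
    """Return (recall, precision) from raw assignment counts.

    true_pos:  categories assigned by the system that are correct
    false_neg: correct categories the system failed to assign
    false_pos: categories the system assigned that are incorrect
    """
    relevant = true_pos + false_neg   # everything that should have been assigned
    assigned = true_pos + false_pos   # everything the system actually assigned
    recall = true_pos / relevant if relevant else 0.0
    precision = true_pos / assigned if assigned else 0.0
    return recall, precision


# Hypothetical counts, chosen only to illustrate the arithmetic.
recall, precision = recall_precision(true_pos=94, false_neg=6, false_pos=18)
print(f"recall={recall:.2f} precision={precision:.2f}")
```

High recall with lower precision, as in this example, means the system finds most of the relevant categories but also makes a noticeable number of spurious assignments.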
    <Paragraph position="2"> SRA's NameTag (Krupka, 1995) uses a pattern recognition process that combines a rule base with lexicons to identify and extract targeted token classes in text, such as company, people and place names. It achieved 96% recall and 97% precision when tested on Wall Street Journal news articles at MUC-6. Aone et al. (1997) describe a NameTag-based application for indexing English and Japanese language texts. NameTag addresses a key weakness of approaches that use predefined topic sets: their coverage is inherently limited to those topics that have been explicitly defined. NameTag can recognize any number of companies, or entities of other domains, whose names have structure that the rules can recognize. (Not all entity domains have structure that pattern recognition processes can exploit; product names are one example.)</Paragraph>
    <Paragraph position="3"> One problem for pattern recognition approaches stems from our requirement to assign CVTs. Pattern recognition approaches extract patterns such as company names as they appear in the text. Limited coreference resolution may link variant forms of names with one another to support choosing the best variant as a &quot;semi-controlled&quot; vocabulary term, but this does not allow for the assignment of true primary and secondary CVTs. SRA has attempted to address this through its Name Resolver function, which reconciles extracted names with an authority file, but the authority file in turn limits the scope of coverage for CVTs to those that are defined and maintained in it. The system must also go beyond straight recognition in order to distinguish between documents with major references to the targeted entities and documents with lesser or passing references.</Paragraph>
    <Paragraph position="4"> SRA's NameTag addresses this with the calculation of relevance scores for each set of linked variant forms.</Paragraph>
    <Paragraph position="5"> Preliminary research suggests that recognizing named entities in data and queries may lead to a significant improvement in retrieval quality (Thompson &amp; Dozier, 1999). Such an approach may complement Entity Indexing, but it does not yet meet the controlled vocabulary indexing and accuracy requirements for Entity Indexing.</Paragraph>
    <Paragraph position="6"> Our own Term-based Topic Identification (TTI) system (Leigh, 1991) combines knowledge engineering with limited learning in support of document categorization and indexing by CVTs. We have used TTI since 1990 to support a number of topically-related news or legal document collections. Categories are defined through topic definitions. A definition includes terms to look up, term weights, term frequency thresholds, document selection scoring thresholds, one or more CVTs, and source-specific document structure information.</Paragraph>
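To make the components of a topic definition concrete, the sketch below models one in Python and applies it to a document. It is only an illustration under stated assumptions: the field names, the additive weight-times-frequency scoring rule, and the sample topic are hypothetical and are not taken from the system described above. Source-specific document structure information is omitted, and only single-word terms are handled.

```python
import re
from dataclasses import dataclass, field


@dataclass
class TopicDefinition:
    # Hypothetical field names mirroring the components listed in the text.
    cvts: list[str]                      # one or more controlled vocabulary terms
    term_weights: dict[str, float]       # terms to look up, with their weights
    term_min_freq: dict[str, int] = field(default_factory=dict)  # per-term frequency thresholds
    score_threshold: float = 1.0         # document selection scoring threshold


def score_document(text: str, topic: TopicDefinition) -> float:
    """Sum weight * frequency over terms that meet their frequency threshold.

    The additive scoring rule is an assumption made for illustration.
    """
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    score = 0.0
    for term, weight in topic.term_weights.items():
        freq = tokens.count(term)
        # A term with no explicit threshold only needs to occur once.
        if freq >= topic.term_min_freq.get(term, 1):
            score += weight * freq
    return score


def assign_cvts(text: str, topic: TopicDefinition) -> list[str]:
    """Return the topic's CVTs if the document score reaches the threshold."""
    if score_document(text, topic) >= topic.score_threshold:
        return list(topic.cvts)
    return []


# Hypothetical topic definition and document.
topic = TopicDefinition(
    cvts=["MERGERS AND ACQUISITIONS"],
    term_weights={"merger": 2.0, "acquisition": 2.0, "shareholders": 0.5},
    term_min_freq={"merger": 2},
    score_threshold=4.0,
)
doc = "The merger was approved. Shareholders backed the merger and the acquisition."
print(assign_cvts(doc, topic))
```

Keeping the per-term frequency threshold separate from the document-level scoring threshold lets a definition ignore isolated mentions of an ambiguous term while still requiring enough overall evidence before a CVT is assigned.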
    <Paragraph position="7"> Although creating TTI topic definitions is primarily an iterative, manual task, limited regression-based supervised learning tools are available to help identify functionally redundant terms and to suggest term weights, frequency thresholds and scoring thresholds. When these tools have been used in building topic definitions, recall and precision have topped 90% in almost all tests.</Paragraph>
  </Section>
</Paper>