File Information

File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/00/a00-1024_intro.xml

Size: 2,684 bytes

Last Modified: 2025-10-06 14:00:42

<?xml version="1.0" standalone="yes"?>
<Paper uid="A00-1024">
  <Title>Categorizing Unknown Words: Using Decision Trees to Identify Names and Misspellings</Title>
  <Section position="2" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> In any real-world use, a Natural Language Processing (NLP) system will encounter words that are not in its lexicon, which we term 'unknown words'. Unknown words are problematic because an NLP system will perform well only if it recognizes the words that it is meant to analyze or translate: the more words a system does not recognize, the more the system's performance will degrade. Even when unknown words are infrequent, they can have a disproportionate effect on system quality. For example, Min (1996) found that while only 0.6% of words in 300 e-mails were misspelled, this meant that 12% of the sentences contained an error (discussed in (Min and Wilson, 1998)).</Paragraph>
    <Paragraph position="1"> Words may be unknown for many reasons: the word may be a proper name, a misspelling, an abbreviation, a number, a morphological variant of a known word (e.g. recleared), or simply missing from the dictionary. The first step in dealing with an unknown word is to identify its class: whether it is a misspelling, a proper name, an abbreviation, etc. Once the class is known, the proper action can be taken: misspellings can be corrected, abbreviations can be expanded, and so on, as deemed necessary by the particular text processing application. In this paper we introduce a system for categorizing unknown words. The system is based on a multi-component architecture in which each component is responsible for identifying one category of unknown words. The main focus of this paper is the components that identify names and spelling errors.</Paragraph>
    <Paragraph position="2"> Both components use a decision tree architecture to combine multiple types of evidence about the unknown word. Results from the two components are combined using a weighted voting procedure. The system is evaluated using data from live closed captions, a genre replete with a wide variety of unknown words.</Paragraph>
    <Paragraph position="3"> This paper is organized as follows. In section 2 we outline the overall architecture of the unknown word categorizer. The name identifier and the misspelling identifier are introduced in section 3. Performance and evaluation issues are discussed in section 4. Section 5 considers portability issues. Section 6 compares the current system with relevant preceding research. Concluding comments can be found in section 6.</Paragraph>
  </Section>
</Paper>