<?xml version="1.0" standalone="yes"?>
<Paper uid="P02-1060">
  <Title>Named Entity Recognition using an HMM-based Chunk Tagger</Title>
  <Section position="2" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Named Entity (NE) Recognition (NER) is the task of classifying every word in a document into one of a set of predefined categories or &quot;none-of-the-above&quot;. In the taxonomy of computational linguistics tasks, it falls under the domain of &quot;information extraction&quot;, which extracts specific kinds of information from documents, as opposed to the more general task of &quot;document management&quot;, which seeks to extract all of the information found in a document.</Paragraph>
    <Paragraph position="1"> Since entity names form the main content of a document, NER is a very important step toward more intelligent information extraction and management. The atomic elements of information extraction -- indeed, of language as a whole -- can be regarded as the &quot;who&quot;, &quot;where&quot; and &quot;how much&quot; in a sentence. NER performs what is known as surface parsing, delimiting the sequences of tokens that answer these important questions. NER can also be used as the first step in a chain of processors: a next level of processing could relate two or more NEs, or perhaps even give semantics to that relationship using a verb. In this way, further processing could discover the &quot;what&quot; and &quot;how&quot; of a sentence or body of text.</Paragraph>
    <Paragraph position="2"> While NER is a relatively simple task and it is fairly easy to build a system with reasonable performance, a large number of ambiguous cases still make it difficult to attain human-level performance.</Paragraph>
    <Paragraph position="3"> There has been a considerable amount of work on the NER problem, which aims to address many of these ambiguity, robustness and portability issues. During the last decade, NER has drawn increasing attention through the NE tasks [Chinchor95a] [Chinchor98a] in the MUCs [MUC6] [MUC7], where person names, location names, organization names, dates, times, percentages and money amounts are to be delimited in text using SGML mark-up.</Paragraph>
    <Paragraph position="4"> Previous approaches have typically used manually constructed finite state patterns, which attempt to match against a sequence of words in much the same way as a general regular expression matcher. Typical systems are Univ. of Sheffield's  NER. These systems are mainly rule-based.</Paragraph>
    <Paragraph position="5"> However, rule-based approaches lack the ability to cope with the problems of robustness and portability. Each new source of text requires significant tweaking of the rules to maintain optimal performance, and the maintenance costs can be quite steep.</Paragraph>
    <!-- Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, July 2002, pp. 473-480. -->
    <Paragraph position="6"> The current trend in NER is to use the machine-learning approach, which is more attractive in that it is trainable and adaptable, and the maintenance of a machine-learning system is much cheaper than that of a rule-based one. The representative machine-learning approaches used in NER are HMM (BBN's IdentiFinder in [Miller+98]  In addition, a variant of Eric Brill's transformation-based rules [Brill95] has been applied to the problem [Aberdeen+95]. Among these approaches, the evaluation performance of the HMM is higher than that of the others. The main reason may be its better ability to capture the locality of the phenomena that indicate names in text. Moreover, the HMM seems to be used more and more in NE recognition because of the efficiency of the Viterbi algorithm [Viterbi67] used in decoding the NE-class state sequence. However, the performance of a machine-learning system is always poorer than that of a rule-based one by about 2% [Chinchor95b] [Chinchor98b]. This may be because current machine-learning approaches capture the important evidence behind the NER problem much less effectively than the human experts who handcraft the rules, although machine-learning approaches always provide important statistical information that is not available to human experts. As defined in [McDonald96], there are two kinds of evidence that can be used in NER to solve the ambiguity, robustness and portability problems described above. The first is the internal evidence found within the word and/or word string itself, while the second is the external evidence gathered from its context. In order to effectively apply and integrate internal and external evidence, we present an NER system using an HMM.
The approach behind our NER system is based on the HMM-based chunk tagger in text chunking, which was ranked the best individual system [Zhou+00a] [Zhou+00b] in CoNLL'2000 [Tjong+00]. Here, an NE is regarded as a chunk, named &quot;NE-Chunk&quot;. To date, our system has been successfully trained and applied in English NER. To our knowledge, our system outperforms any published machine-learning system. Moreover, our system even outperforms any published rule-based system.</Paragraph>
    <Paragraph position="7"> The layout of this paper is as follows. Section 2 describes the HMM and its application in NER: the HMM-based chunk tagger. Section 3 explains the word feature used to capture both the internal and external evidence. Section 4 describes the back-off schemes used to tackle the sparseness problem. Section 5 gives the experimental results of our system. Section 6 contains our remarks and possible extensions of the proposed work.</Paragraph>
  </Section>
</Paper>