<?xml version="1.0" standalone="yes"?>
<Paper uid="C02-1088">
  <Title>Unsupervised Named Entity Classification Models and their Ensembles</Title>
  <Section position="2" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Named entity extraction is an important step for various applications in natural language processing. It involves identifying named entities in text and classifying them into types such as person, organization, location, time expression, numeric expression, and so on (Sekine and Eriguchi, 2000).</Paragraph>
    <Paragraph position="1"> One might think that named entities can be classified easily using dictionaries because most named entities are proper nouns, but this view is mistaken. New proper nouns are created continuously as time passes, so it is impossible to add all of them to a dictionary. Even when named entities are registered in a dictionary, it is not easy to decide their senses. They exhibit semantic (sense) ambiguity: a proper noun can have different senses depending on the context (Wacholder et al., 1997). For example, 'United States' refers either to a geographical area or to the political body that governs this area. Such semantic ambiguity occurs frequently in Korean (Seon et al., 2001). Let us illustrate this.</Paragraph>
    <Paragraph position="3"> In the above examples, 'KAIST' belongs to different categories even though it is followed by the same postposition, 'e-seo'. The classification of named entities in Korean is a little more difficult than in English.</Paragraph>
    <Paragraph position="4"> There are two main approaches to classifying named entities. The first employs hand-crafted rules. Maintaining the rules is costly because rules and dictionaries have to be changed for each application. The second is a supervised learning approach, which employs statistical methods.</Paragraph>
    <Paragraph position="5"> As the supervised approach is more robust and requires less human intervention, several statistical methods based on a hidden Markov model (Bikel et al., 1997), a Maximum Entropy model (Borthwick et al., 1998) and a Decision Tree model (Bechet et al.</Paragraph>
    <Paragraph position="6"> 2000) have been studied. The supervised learning approach requires a hand-tagged training corpus, and it cannot achieve good performance without a large amount of data because of the data sparseness problem. For example, Borthwick (1999) reported a performance of 83.45% in Precision and 77.42% in F-measure for identifying and classifying the</Paragraph>
  </Section>
</Paper>