<?xml version="1.0" standalone="yes"?>
<Paper uid="M98-1017">
  <Title>DESCRIPTION OF THE NTU SYSTEM USED FOR MET2</Title>
  <Section position="3" start_page="0" end_page="0" type="intro">
    <SectionTitle>
INTRODUCTION
</SectionTitle>
    <Paragraph position="0"> People, affairs, time, places and things are five basic entities in a document. Once we capture these fundamental entities, we can understand a document to some degree. The Natural Language Processing Laboratory (NLPL) in the Department of Computer Science and Information Engineering (CSIE), National Taiwan University (NTU), began studying the named entity extraction problem in 1993. At first, we focused on the extraction of Chinese person names, transliterated person names [1] and organization names [2]. The training and testing data in these experiments were selected from three Taiwan newspaper corpora (China Times, Liberty Times News and United Daily News). At the 16th International Conference on Computational Linguistics, Chen and Lee [3] reported that the precision and recall rates for the extraction of Chinese person names, transliterated person names and organization names were (88.04%, 92.56%), (50.62%, 71.93%) and (61.79%, 54.50%), respectively.</Paragraph>
    <Paragraph position="1"> We have applied these results to several applications. Chen and Wu [4] considered person names as one of the clues in sentence alignment. Chen and Lee [3] showed their application to anaphora resolution. Chen and Bian [5] proposed a method to construct white pages for Internet/Intranet users automatically. We extract information from World Wide Web documents, including proper nouns, e-mail addresses and home page URLs, and find the relationships among these data. Chen, Ding and Tsai [6,7] dealt with proper noun extraction for information retrieval.</Paragraph>
    <Paragraph position="2"> In MUC-7 and MET-2, we participated in the named entity extraction tasks for both English and Chinese. We extended our previous work on this problem to cover more named entity types, such as locations, date/time expressions, and monetary and percentage expressions. Several issues had to be addressed during this extension.</Paragraph>
    <Paragraph position="3"> One of the major differences between Chinese and English language processing is that segmentation is required for Chinese. That is, we have to identify word boundaries in Chinese sentences beforehand, which makes Chinese named entity extraction more challenging. Besides, the vocabularies and the Chinese character coding sets used in Taiwan and in China are not the same. The documents adopted in MET-2 were selected from newspapers in China, so we had to transform simplified Chinese characters in the GB coding set into traditional Chinese characters in the Big-5 coding set before testing. A word that is known may become unknown as a result of this transformation. For example, the character &quot;GFA&quot; in &quot;G87GFA&quot; (early morning) is used in traditional Chinese. In simplified Chinese, however, &quot;G44&quot; is used instead, and it is also a legal traditional Chinese character with a different meaning. As a result, the mapping from GB to Big-5 yields &quot;G87G44&quot;, which is an unknown word according to our dictionary. The different vocabularies of China and Taiwan also result in different segmentations.</Paragraph>
    <Paragraph position="4"> This paper is organized as follows. Section 2 illustrates the flow of named entity extraction and gives the summary scores of our team in the MET-2 formal run. Sections 3, 4 and 5 propose methods to extract the names of people, organizations and locations. Section 6 deals with the remaining named entities, i.e., date/time expressions and monetary and percentage expressions. At the end of each section, we discuss the sources of errors in the formal run. Section 7 presents concluding remarks.</Paragraph>
  </Section>
</Paper>