<?xml version="1.0" standalone="yes"?> <Paper uid="W00-1203"> <Title>Knowledge Extraction for Identification of Chinese Organization Names</Title> <Section position="2" start_page="0" end_page="15" type="intro"> <SectionTitle> 1. INTRODUCTION </SectionTitle> <Paragraph position="0"> Unknown words cause difficulties in natural language processing. The word set of a natural language is open-ended.</Paragraph> <Paragraph position="1"> There is no way of collecting every word of a language, since new words are continually created to express new concepts and to name new inventions, newborn babies, and new organizations. Identifying new words in a text is therefore among the most challenging tasks in natural language processing. This is especially true for Chinese.</Paragraph> <Paragraph position="2"> Each Chinese morpheme (usually a single character) carries meaning, and most are polysemous. New words are easily constructed by combining morphemes, and their meanings are the semantic composition of their component morphemes. However, there are also semantically non-compositional compounds, such as proper names. In Chinese text, there are no blanks to mark word boundaries, and no inflectional or capitalization markers to denote the syntactic or semantic types of new words. Hence unknown word identification for Chinese has become one of the most difficult and demanding research topics.</Paragraph> <Paragraph position="3"> There are many different types of unknown words, and each has different morpho-syntactic and morpho-semantic structures. In principle, their syntactic and semantic categories can be determined from their content and contextual information, but many difficult problems have to be solved. First of all, it is not possible to find a uniform representational schema and categorization algorithm to handle the different types of unknown words, because their morpho-syntactic structures differ. 
Second, the clues for identifying different types of unknown words also differ. For instance, the identification of Chinese personal names relies heavily on surnames, which form a limited set of characters. Statistical methods are commonly used for identifying proper names (Chen & Lee 1996, Chang et al. 1994, Sun et al. 1994). The identification of general compounds relies more on the morphemes and the semantic relations between them (Chen & Chen 2000). The third difficulty is ambiguity, including structural, syntactic, and semantic ambiguity. For instance, a morpheme (character or word) usually has multiple meanings and syntactic categories, and may function as either a common word or a proper name.</Paragraph> <Paragraph position="4"> Therefore ambiguity resolution becomes one of the major tasks.</Paragraph> <Paragraph position="5"> In this paper we focus on the identification of organization names.</Paragraph> <Paragraph position="6"> Identifying organization names is considered harder than identifying other types of unknown words, because there are few morpho-syntactic and morpho-semantic clues to indicate an organization name. There is no significant preference in the selection of morphemes or characters, or in their semantics, so neither gives a clue toward identification. For instance, the Chinese name of Microsoft (glossed 'micro-soft') has the character-by-character (morpheme-by-morpheme) translation 'slightly soft', and there is no marker, such as capitalization, to indicate that it is a proper name. The only reliable clue is contextual information. An organization's full name usually occurs at its first mention, unless it is a well-known organization. A full name contains a proper name and an organization type, as in the name glossed 'Acer Computer-Company'. The organization type thus becomes the major clue for identifying a new organization name. 
However, shorter abbreviated names are usually used, which may a) omit part of the organization type, for instance 'Acer Computer' or 'Acer Company', b) omit the organization type entirely, for instance 'Acer', or c) be an abbreviation, for instance the form glossed 'global-electric' (Acer-computer). The task therefore involves not only identifying organization names in their different forms but also finding their meaning equivalence classes. To achieve this goal, a system needs knowledge of 1) the proper names of organizations, 2) the different lines of business, and 3) the different organization types. Unfortunately, there are no well-prepared knowledge sources containing this information. Therefore a knowledge extraction model is proposed to extract this knowledge from the dictionary and domain texts.</Paragraph> </Section> </Paper>