<?xml version="1.0" standalone="yes"?>
<Paper uid="C96-1039">
  <Title>Identification and Classification of Proper Nouns in Chinese Texts</Title>
  <Section position="3" start_page="226" end_page="227" type="metho">
    <SectionTitle>
5. Organization Names
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="226" end_page="226" type="sub_section">
      <SectionTitle>
5.1 Structures of Organization Names
</SectionTitle>
      <Paragraph position="0"> Structures of organization names are more complex than those of personal names. Some organization names are composed of proper nouns and content words. For example, '[illegible]' is made up of the place name '[illegible]' and the content word '[illegible]'. A personal name can also be combined with a content word to form an organization name, e.g., '[illegible]'. Some organization names look like personal names, e.g., '[illegible]'. Some organization names are composed of several related words. For example, '[illegible]' contains four words: '[illegible]', '[illegible]', '[illegible]' and '[illegible]'. Several single-character words can also form an organization name, e.g., '[illegible]'. Some organization names have nested structures.</Paragraph>
      <Paragraph position="1"> Consider the string '[illegible]'. The group '[illegible]' is a part of the committee '[illegible]', and the committee itself is a part of '[illegible]'. Such complex structures make identification of organization names very difficult.</Paragraph>
      <Paragraph position="2"> Basically, a complete organization name can be divided into two parts: name and keyword. In the example '[illegible]', '[illegible]' is a name and '[illegible]' is a keyword. Many words can serve as names, but only a fixed set of words can be regarded as keywords. Thus, the keyword is an important clue for identifying organizations. However, there are still several difficult problems. First, a keyword is usually a common content word, so it is not easy to tell its use as a keyword from its use as an ordinary content word. Second, a keyword may appear in abbreviated form. For example, '[illegible]' is an incomplete keyword of '[illegible]'. Third, the keyword may be omitted completely, as in '[illegible]' (Acer). The following shows two rough classifications and discusses their features.</Paragraph>
      <Paragraph position="3">  (1) Complete organization names (a) Structure: This type of organization name is usually composed of proper nouns and keywords.</Paragraph>
      <Paragraph position="4"> (b) Length: Some organization names are very long, so it is hard to decide their length.</Paragraph>
      <Paragraph position="5"> Fortunately, only some keywords, such as '[illegible]', '[illegible]', '[illegible]', '[illegible]', and so on, have this problem. (c) Ambiguity: Some organization names with keywords are still ambiguous, for example '[illegible]' and '[illegible]'. They usually denote reading matter rather than organizations. However, in some contexts, e.g., "[illegible]" and "[illegible]", they should be interpreted as organizations.</Paragraph>
      <Paragraph position="6">  (2) Incomplete organization names (a) Structure: These organization names often omit their keywords.</Paragraph>
      <Paragraph position="7"> (b) Ambiguity: The abbreviated organization names may be ambiguous. For example, '[illegible]', '[illegible]', '[illegible]' and '[illegible]' are famous sports teams in Taiwan or in the U.S.A.; however, they are also general content words.</Paragraph>
    </Section>
    <Section position="2" start_page="226" end_page="227" type="sub_section">
      <SectionTitle>
5.2 Strategies
</SectionTitle>
      <Paragraph position="0"> This section introduces some strategies used in the identification. A keyword is a good indicator for an identification system. It plays a role similar to that of surnames. A keyword indicates not only the possible occurrence of an organization name, but also its right boundary. We scan each sentence from left to right to find keywords. Because a keyword is a general content word, we need other strategies to determine its exact meaning. These strategies can also detect the left boundary if there is an organization name.</Paragraph>
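      <Paragraph> The left-to-right keyword scan described above can be sketched as follows. This is only a minimal illustration, assuming the sentence is already segmented into tokens; the keyword set and the function name find_keyword_positions are hypothetical and do not come from the paper. Each returned index marks a candidate right boundary that still has to be confirmed by the other strategies.
<![CDATA[
```python
from typing import List, Set

# Hypothetical keyword set for illustration only; the paper's actual
# keyword list is not reproduced in this section.
KEYWORDS: Set[str] = {"公司", "大學", "銀行"}


def find_keyword_positions(tokens: List[str], keywords: Set[str] = KEYWORDS) -> List[int]:
    """Scan a segmented sentence left to right and return the index of
    every token that is a possible organization keyword."""
    return [i for i, tok in enumerate(tokens) if tok in keywords]


# "He teaches at Taiwan University": the keyword sits at index 3.
tokens = ["他", "在", "台灣", "大學", "教書"]
print(find_keyword_positions(tokens))  # [3]
```
]]>
      </Paragraph>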
      <Paragraph position="1"> A prefix is a good marker of a possible left boundary, for example '[illegible]' (National), '[illegible]' (Provincial), '[illegible]' (Private), and so on. The name part of an organization may be formed by single characters or by words. These two cases are discussed as follows.</Paragraph>
      <Paragraph position="2"> (a) single characters After segmentation, there may be a sequence of single characters preceding a possible keyword. A character may be able to exist independently, that is, it is a single-character word. In this case, the content word is not a keyword, so no organization name is found. If these characters cannot exist independently, they form the name part of an organization. The left boundary of the organization is determined by the following rule: we add single characters to the name part until a word is met.</Paragraph>
      <Paragraph position="3"> (b) word(s) Here, a word is composed of at least two characters. If the word preceding the possible keyword is a place name or a personal name, then the word forms the name part of an organization. Otherwise, we use a word association model to determine the left boundary. The assumption is that the words composing a name part usually have strong relationships. The mutual information mentioned in Section 3.2.4.2 is also used to measure the relationship between two words.</Paragraph>
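      <Paragraph> The two left-boundary cases above can be sketched as follows. This is one possible reading of the rules, not the paper's implementation. Several points are assumptions: the set free_chars of characters that can stand alone as words, the reading that any free-standing character in the run rejects the candidate, the use of the standard pointwise mutual information (the formula of Section 3.2.4.2 is not shown here and may differ), and the threshold that decides when two words are strongly related. The place-name and personal-name check of case (b) is omitted.
<![CDATA[
```python
import math
from typing import Callable, List, Optional, Set


def left_boundary_single_chars(
    tokens: List[str], kw_index: int, free_chars: Set[str]
) -> Optional[int]:
    """Case (a): absorb single characters to the left of the keyword
    until a multi-character word (or the sentence start) is met.
    Returns the start index of the name part, or None if no name is found."""
    start = kw_index
    while start > 0 and len(tokens[start - 1]) == 1:
        if tokens[start - 1] in free_chars:
            # Assumed reading: a free-standing single-character word
            # means the content word is not acting as a keyword.
            return None
        start -= 1
    # No single characters were absorbed: case (a) does not apply.
    return None if start == kw_index else start


def mutual_information(pair_count: int, x_count: int, y_count: int, total: int) -> float:
    """Standard pointwise mutual information, log2 of P(x,y) / (P(x) P(y)),
    estimated from raw counts (an assumption, see the lead-in)."""
    p_xy = pair_count / total
    p_x = x_count / total
    p_y = y_count / total
    return math.log2(p_xy / (p_x * p_y))


def left_boundary_words(
    tokens: List[str],
    kw_index: int,
    mi: Callable[[str, str], float],
    threshold: float,
) -> int:
    """Case (b): start from the word before the keyword and keep moving
    left while adjacent words are strongly associated."""
    if kw_index == 0:
        return 0  # nothing precedes the keyword
    start = kw_index - 1
    while start > 0 and mi(tokens[start - 1], tokens[start]) >= threshold:
        start -= 1
    return start


print(left_boundary_single_chars(["參觀", "宏", "碁", "公司"], 3, {"在"}))  # 1
print(mutual_information(8, 16, 32, 1024))  # 4.0
```
]]>
      </Paragraph>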
      <Paragraph position="4"> Part of speech is useful for determining the left boundary of an organization. The categories of verbs are very typical. The name part of an organization cannot extend beyond a transitive verb. If a transitive verb precedes a possible keyword, then no organization name is found. Numerals and classifiers are also helpful. For example, '[illegible]' (company) in '[illegible]' (three companies ...) is not a keyword because of the preceding numeral and classifier. Because a tagger is not involved before identification, the part of speech of a word is determined wholly by lexical probability.</Paragraph>
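      <Paragraph> The part-of-speech filter above can be sketched as follows. This is a minimal illustration under stated assumptions: the lexicon, its probabilities, and the tag names (VT for transitive verb, NUM for numeral, CL for classifier) are all hypothetical. Following the text, each word receives its most probable tag from the lexicon alone, without a context-sensitive tagger.
<![CDATA[
```python
from typing import Dict, List

# Hypothetical lexicon mapping each word to tag probabilities.
LEXICON: Dict[str, Dict[str, float]] = {
    "三": {"NUM": 1.0},
    "家": {"CL": 0.7, "N": 0.3},
    "參觀": {"VT": 0.9, "N": 0.1},
    "公司": {"N": 1.0},
}

# Tags that rule out a keyword reading when they directly precede it.
BLOCKING_TAGS = {"VT", "NUM", "CL"}


def most_likely_tag(word: str, lexicon: Dict[str, Dict[str, float]] = LEXICON) -> str:
    """Pick the tag with the highest lexical probability; UNK if unseen."""
    tags = lexicon.get(word)
    if not tags:
        return "UNK"
    return max(tags, key=tags.get)


def keyword_blocked(tokens: List[str], kw_index: int) -> bool:
    """Return True if the word right before the candidate keyword is a
    transitive verb, numeral, or classifier."""
    if kw_index == 0:
        return False
    return most_likely_tag(tokens[kw_index - 1]) in BLOCKING_TAGS


print(keyword_blocked(["三", "家", "公司"], 2))  # True  ("three companies")
print(keyword_blocked(["台灣", "公司"], 1))      # False (unknown word, not blocked)
```
]]>
      </Paragraph>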
    </Section>
    <Section position="3" start_page="227" end_page="227" type="sub_section">
      <SectionTitle>
5.3 Experiments and Discussions
</SectionTitle>
      <Paragraph position="0"> Table 3 shows the precision and the recall for every section. Section 4 (The International Section) has better precision and recall than the other sections. Most errors result from organization names without keywords, e.g., '[illegible]', '[illegible]', '[illegible]', '[illegible]', and so on. Even when keywords appear, e.g., '[illegible]' and '[illegible]', organization names do not always exist. Besides erroneous candidates and organization names without keywords, incorrect left boundaries are also a problem. Consider the examples '[illegible]' and '[illegible]'. In the first, '[illegible]' should not be included; in the second, the word '[illegible]' is lost.</Paragraph>
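      <Paragraph> For reference, precision and recall can be computed as sketched below, assuming the usual definitions: precision is the share of proposed names that are correct, and recall is the share of actual names that are found. The counts in the example are invented for illustration and are not taken from Table 3.
<![CDATA[
```python
from typing import Tuple


def precision_recall(num_correct: int, num_proposed: int, num_actual: int) -> Tuple[float, float]:
    """Return (precision, recall) from raw counts.

    num_correct:  proposed names that are truly organization names
    num_proposed: all names the system proposed
    num_actual:   all organization names in the reference data
    """
    precision = num_correct / num_proposed if num_proposed else 0.0
    recall = num_correct / num_actual if num_actual else 0.0
    return precision, recall


# Invented counts: 3 correct out of 4 proposed, with 6 names in the reference.
print(precision_recall(3, 4, 6))  # (0.75, 0.5)
```
]]>
      </Paragraph>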
    </Section>
  </Section>
  <Section position="4" start_page="227" end_page="228" type="metho">
    <SectionTitle>
6. Applications
</SectionTitle>
    <Paragraph position="0"> The semantic classification of proper nouns is useful in many applications. Here, anaphora resolution and sentence alignment are presented. In general, a pronoun often refers to the nearest proper noun (Chen, 1992), but this is not always true. The following shows a counterexample: The first pronoun '[illegible]' (he) refers to the personal name '[illegible]'. It is a normal example. The second pronoun '[illegible]' (he) refers to the same person, but the nearest personal name is '[illegible]' rather than '[illegible]'. If we know the gender of every personal name, then it is easy to tell which person is referred to. In the above example, the gender of the Chinese pronouns '[illegible]' (he) and '[illegible]' (she) is masculine and feminine, respectively; the persons '[illegible]' and '[illegible]' are male and female, respectively. Therefore, the correct referential relationships can be established. In the experiment on gender assignment, 3/4 of the Chinese personal name corpus is used as training data, and the remaining 1/4 is used for testing. The correct rate is 89%. Sentence alignment (Chen &amp; Chen, 1994) is important in setting up a bilingual corpus. Personal names are one of the important clues. Their use in aligning English-Chinese text is shown in (Chen &amp; Wu, 1995).</Paragraph>
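    <Paragraph> The gender-based resolution described above can be sketched as follows. This is a minimal illustration, not the paper's procedure: the personal names, their gender labels, and the function name resolve_pronoun are hypothetical. The pronoun is linked to the nearest preceding personal name whose gender agrees with it, rather than simply to the nearest name.
<![CDATA[
```python
from typing import Dict, List, Optional

# Gender of the third-person pronouns: 他 (he) and 她 (she).
PRONOUN_GENDER: Dict[str, str] = {"他": "male", "她": "female"}


def resolve_pronoun(
    pronoun: str, preceding_names: List[str], name_gender: Dict[str, str]
) -> Optional[str]:
    """Return the nearest preceding name whose gender matches the pronoun.

    preceding_names is ordered as in the text, so the last item is the
    nearest candidate. Returns None if no name agrees in gender."""
    wanted = PRONOUN_GENDER.get(pronoun)
    for name in reversed(preceding_names):
        if wanted is None or name_gender.get(name) == wanted:
            return name
    return None


# Hypothetical names and gender labels for illustration.
genders = {"王大明": "male", "李美玲": "female"}
names_in_order = ["王大明", "李美玲"]

# The nearest name is 李美玲, but the masculine pronoun skips it.
print(resolve_pronoun("他", names_in_order, genders))  # 王大明
print(resolve_pronoun("她", names_in_order, genders))  # 李美玲
```
]]>
    </Paragraph>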
  </Section>
  <Section position="5" start_page="228" end_page="228" type="metho">
    <SectionTitle>
7. Concluding Remarks
</SectionTitle>
    <Paragraph position="0"> This paper proposes various strategies to identify and classify Chinese proper nouns. The performance evaluation criterion is very strict: not only must the proper nouns be identified, but suitable features must also be assigned. The performance (precision, recall) for the identification of Chinese personal names, transliterated personal names and organization names is (88.04%, 92.56%), (50.62%, 71.93%) and (61.79%, 54.50%), respectively.</Paragraph>
    <Paragraph position="1"> When the criterion is loosened a little, i.e., Chinese personal names and transliterated personal names are regarded as one category, the performance is (81.46%, 91.22%). Compared with the approaches of (Sproat et al., 1994; Fung &amp; Wu, 1994; Wang et al., 1994), we deal with more types of proper nouns and we have better performance.</Paragraph>
    <Paragraph position="2"> Some difficult problems should be tackled in the future. Foreign proper nouns may be transformed partly by transliteration and partly by translation. For example, "George Town" is transformed into '[illegible]'. The character '[illegible]' (town) results from translation and '[illegible]' (George) comes from transliteration. This problem is interesting and worth resolving. The performance of identification of organization names is not good enough, especially for organization names without keywords. It should be investigated further.</Paragraph>
  </Section>
</Paper>