<?xml version="1.0" standalone="yes"?>
<Paper uid="W00-1203">
  <Title>Knowledge Extraction for Identification of Chinese Organization Names</Title>
  <Section position="3" start_page="15" end_page="15" type="metho">
    <SectionTitle>
2. STRUCTURES OF ORGANIZATION
NAMES
</SectionTitle>
    <Paragraph position="0"> As mentioned in the previous section, an organization name has no rigid structure.</Paragraph>
    <Paragraph position="1"> Roughly speaking, an organization name is composed of two major components. The first part is the proper name and the second part is the organization type. The second part contains the major keywords that lead toward the identification of an organization, since organization types such as 'company', 'foundation', 'group', and 'enterprise' tell what kind of organization it is. For a company, the line of business usually accompanies the keyword 'company' to be more informative, for instance 'food company', 'computer company', and 'investment consultant company'; in most cases, however, the keyword 'company' is omitted, as in 'President food'.</Paragraph>
    <Paragraph position="2"> Sometimes the line of business and the organization type combine into a single word, such as 'middle school', 'bank', and 'hospital'. Observing the structure of organization names suggests that, once a complete list of organization types has been prepared, it is not hard to identify organizations by their full names. The only complication is that abbreviated names occur more frequently than full names. The identifier 'company' is usually omitted in real text. The line of business then becomes the major identifier for a company, and many business lines are common words, such as 'food', 'computer', and 'cement'. Therefore it is necessary to distinguish a common compound from a company name, for example 'health food' vs. 'President food' and 'personal computer' vs. 'Acer computer'. Although such strings are two-way ambiguous, they usually have only one preferred reading.</Paragraph>
    <Paragraph position="3"> In conclusion, the types and the proper names of organizations are the major clues that lead toward the identification of organizations. In addition, it is better to have a list of well-known organization names, so that well-known company names such as 'Microsoft' can be identified immediately. Most of the knowledge preparation should be done offline. The prepared knowledge is then used for online identification of newly coined organizations.</Paragraph>
    <Paragraph position="4"> The equivalent classes of the well-known organizations are also classified by a similarity-based approach.</Paragraph>
  </Section>
  <Section position="4" start_page="15" end_page="16" type="metho">
    <SectionTitle>
3. KNOWLEDGE EXTRACTION
</SectionTitle>
    <Paragraph position="0"> There are two knowledge sources: the CKIP Chinese lexicon and Chinese text from the WWW. The lexicon provides a partial list of important organizations, and the information extracted from them forms the initial knowledge of the identification system. The texts from the WWW implicitly provide ample new organization names. The problem is how to extract some, if not all, of them from the texts. Once we have a list of organization names, the proper names and the organization types can be extracted by analyzing the morphological structures of the organization names. However, an effective morphological analyzer depends on knowledge of the organization types, and lists of organization types are not yet available.</Paragraph>
    <Paragraph position="1"> As mentioned before, a complete organization name has two parts. The first part is the proper name and the second part is the organization type. The number of different proper names is unlimited, whereas the number of different organization types is limited. This property will be utilized to separate the variable part, i.e. the proper name, from the constant part, i.e. the organization type, in organization names.</Paragraph>
    <Paragraph position="2">  The number of organization names in the lexicon is very limited, since only the important organizations in the common domain are collected. Therefore the initial knowledge extracted from the lexicon is also very limited. To make the knowledge sources more adequate, a vast number of new organization names should be extracted from corpora of different domains. Unfortunately, none of the existing corpora has its organization names tagged. Therefore we design a semi-automatic method to extract high-frequency organization names from text corpora.</Paragraph>
    <Paragraph position="3"> The locality of keyword occurrences in a text is utilized for keyword extraction. Once an organization name occurs in a text, it is very likely to recur in the same text. This recurrence property has been utilized to extract keywords or key-phrases from text (Chien 1999, Fung 1998, Smadja 1993). However, not all keywords are organization names. The knowledge extracted from the lexicon, i.e. the list of organization types, is the initial knowledge for identifying organization names. In addition to this initial knowledge, the structural property of organization names is also utilized to classify extracted keywords into organization names and non-organization names. The extraction process is repeated to extract new organizations and hence new organization types. The more knowledge is extracted, the more accurate the organization identification becomes.</Paragraph>
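The recurrence property can be illustrated with a minimal Python sketch. It is not the keyword extraction method of the cited works; it only assumes that a candidate keyword is any character n-gram occurring at least twice in a single text. The function name recurring_ngrams, its parameters, and the toy document (Latin letters standing in for Chinese characters) are hypothetical.

```python
from collections import Counter


def recurring_ngrams(text, min_len=2, max_len=5, min_count=2):
    """Return character n-grams that occur at least min_count times in one text."""
    counts = Counter()
    for n in range(min_len, max_len + 1):
        # Slide a window of n characters over the text.
        for start in range(len(text) - n + 1):
            counts[text[start:start + n]] += 1
    return {gram: c for gram, c in counts.items() if c >= min_count}


# Toy document: Latin letters stand in for Chinese characters.
doc = "ABCDxyABCDzw"
candidates = recurring_ngrams(doc)
```

Here the string ABCD recurs and is kept as a candidate, while xy occurs once and is dropped. A real system would still need to filter the many overlapping substrings that such a naive count produces.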
    <Section position="1" start_page="16" end_page="16" type="sub_section">
      <SectionTitle>
3.1 Morphological Analysis for Organization
Names
</SectionTitle>
      <Paragraph position="0"> There are 1391 words in the CKIP lexicon classified as organizations. Table 1 shows some examples.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="16" end_page="17" type="metho">
    <SectionTitle>
CKIP dictionary
</SectionTitle>
    <Paragraph position="0"> As we observed, the morphological structure of an organization name is usually a compound of a proper name and an organization type. The organization type may be a compound of a line of business and a type, for instance 'computer company', 'bank', and 'middle school', or simply a line of business, for instance 'food', 'computer', and 'cement'. The proper names are variables, since each organization type may have many different institutions with different names. The types are constants.</Paragraph>
    <Paragraph position="1"> A limited number of constants combine with many different proper names to form different organization names.</Paragraph>
    <Paragraph position="2"> Therefore extracting the organization types is equivalent to extracting the high-frequency ending morphemes. Table 2 shows the top 20 high-frequency ending morphemes extracted from the 1391 organization names; indeed they are organization types.</Paragraph>
    <Paragraph position="4"> ranked with their occurrence frequencies extracted from the 1391 organization names</Paragraph>
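Counting ending morphemes can be sketched in Python as follows. The source does not say how morpheme boundaries were determined, so this sketch assumes that every suffix of one to three characters is counted as a candidate ending morpheme. The function name ending_morpheme_counts and the toy names (Latin letters standing in for Chinese characters) are hypothetical.

```python
from collections import Counter


def ending_morpheme_counts(names, max_suffix_len=3):
    """Count every suffix of length 1..max_suffix_len over a list of names."""
    counts = Counter()
    for name in names:
        # Leave at least one character for the proper-name part.
        longest = min(max_suffix_len, len(name) - 1)
        for k in range(1, longest + 1):
            counts[name[-k:]] += 1
    return counts


# Toy names: three end in "cb", one ends in "h".
names = ["PQcb", "RScb", "TUcb", "VWh"]
counts = ending_morpheme_counts(names)
# Keep the endings shared by at least two names.
frequent = [m for m, c in counts.most_common() if c >= 2]
```

Endings shared by several names (here b and cb) rise to the top of the count, while endings tied to a single name stay at frequency one.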
    <Section position="1" start_page="16" end_page="17" type="sub_section">
      <SectionTitle>
3.2 Automatic Extraction of Organization
Names
</SectionTitle>
      <Paragraph position="0"> A Web spider can collect text from each domain through the WWW. A keyword extraction technique is then applied to the domain texts to retrieve possible keywords.</Paragraph>
      <Paragraph position="1"> The keyword set includes organization names, personal names, general compounds, and erroneous extractions; most of them are not organization names. The available list of organization types could serve as the source of knowledge for identifying organization candidates. However, such a method identifies only organizations of known types and so yields only new proper names; it does not identify new types of organizations. Therefore we use a new method that extracts organization names by using their structural property.</Paragraph>
      <Paragraph position="2"> Extraction Algorithm for Organization Types: Step 1. Use a Web spider to collect Chinese texts of a fixed domain, such as finance and business, from the WWW.</Paragraph>
      <Paragraph position="3"> Step 2. Extract high-frequency keywords from the texts (Smadja 1993, Chang &amp; Su 1997, Chien 1999).</Paragraph>
      <Paragraph position="4"> Step 3. Each keyword of length 3, 4, or 5 is divided into two parts, X and Y, where X is a candidate proper name and Y is a candidate organization type. X is the initial two characters of the keyword and Y is the remaining characters. (Since most proper names of organizations have two characters, we can extract organization types of lengths 1, 2, and 3 from the three groups of keywords of lengths 3, 4, and 5 respectively.) Extract the organization type Y if, for some keywords X+Y, the following conditions hold. a) X satisfies one of the following cases. 1. X is not in the lexicon, i.e. X is an unknown word.</Paragraph>
      <Paragraph position="5"> 2. X has the category Nb or Nc, i.e. it is a known proper name (Nb) or a location name (Nc).</Paragraph>
      <Paragraph position="6"> b) For each Y assumed to be an organization type, there must be more than n different X such that X+Y is in the extracted keyword list. In practice, the threshold value n was set to 2.</Paragraph>
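Step 3 can be sketched in Python under stated assumptions. The lexicon is modeled as a hypothetical dictionary from words to sets of category tags, and a word missing from it counts as unknown. The text says "more than n different X", but the reported tables list types seen with exactly two names at n = 2, so this sketch assumes the effective test is "at least min_names different X". The function name extract_org_types and the toy data (Latin letters standing in for Chinese characters) are illustrative only.

```python
from collections import defaultdict


def extract_org_types(keywords, lexicon, min_names=2):
    """Split each keyword into X (first two characters) and Y (the rest).

    lexicon maps a known word to its set of category tags; a word absent
    from the lexicon is treated as an unknown word.
    """
    proper_names_by_type = defaultdict(set)
    for kw in set(keywords):
        if len(kw) not in (3, 4, 5):
            continue
        x, y = kw[:2], kw[2:]
        tags = lexicon.get(x)
        # Condition a): X is unknown, or is a proper name (Nb) or place name (Nc).
        if tags is None or tags.intersection({"Nb", "Nc"}):
            proper_names_by_type[y].add(x)
    # Condition b), as assumed here: keep Y if at least min_names different X precede it.
    return {y: xs for y, xs in proper_names_by_type.items() if len(xs) >= min_names}


# Toy data: "su" is a known common word (tag "A"), so "supc" is rejected.
keywords = ["ABpc", "CDpc", "EFpc", "supc", "GHb", "IJb"]
lexicon = {"su": {"A"}, "CD": {"Nb"}, "EF": {"Nc"}}
types_strict = extract_org_types(keywords, lexicon, min_names=3)
types_loose = extract_org_types(keywords, lexicon, min_names=2)
```

With the stricter setting only pc survives, supported by three different proper names; the looser setting also admits b, which is the kind of low-support candidate that would be checked manually in a first iteration.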
      <Paragraph position="7"> In general, Chinese company names, like most proper names, are not common words (they are unknown words). Sometimes they are place names (Nc), but rarely are they common nouns, adjectives, or verbs. Therefore condition a) of step 3 is set to avoid too many false alarms, such as 'super computer' being considered a company name. The reason for condition b) is that each organization type Y should be shared by many different organizations, such as 'Acer computer', 'Leo computer', and 'Blue-sky computer'. The implementation shows that different threshold values n give different precision and recall for identification. For the first iteration of knowledge extraction, we suggest favoring a higher recall rate: set the threshold low and manually select the final list of organization types. For later automatic knowledge extraction, higher threshold values are suggested in order to increase precision.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="17" end_page="19" type="metho">
    <SectionTitle>
4. EXPERIMENT RESULTS
</SectionTitle>
    <Paragraph position="0"> The knowledge extraction process for Chinese organization names is carried out in stages. At the first stage, the words marked with the semantic category of organization were retrieved from the CKIP dictionary. There are 1391 such words. As mentioned in Section 3.1, a pseudo morphological analysis was carried out to find the high-frequency ending morphemes, since the structure of an organization name is a composition X+Y, where X is a proper name and Y is an organization type. There are 546 different ending morphemes. The high-frequency ending morphemes are exactly the morphemes for common organization types.</Paragraph>
    <Paragraph position="1"> Many of them are monosyllabic words and they are polysemous, as shown in Table 1.</Paragraph>
    <Paragraph position="2"> For future identification, a disambiguation process has to be carried out for those polysemous ending morphemes (Chen &amp; Chen 2000). The extracted morphemes and the list of organizations form the first collection of organization types. At the second stage, we try to extract new organization names from texts of different domains. Each domain has many new organization types. For instance, in the domain of finance and business there are many company names whose organization types are completely different word strings from those in the list extracted at the first stage.</Paragraph>
    <Paragraph position="3"> The algorithm shown in Section 3.2 was carried out. At step 1, 31787 news texts of the finance and business domain were collected from http://www.cnyes.com.</Paragraph>
    <Paragraph position="4"> At step 2, 40675 keywords were extracted from the news corpus. At step 3, organization names were identified and the organization types were extracted. With the threshold value n = 2, 92 types were extracted, of which 83 are correct organization types; the precision is 90%. With the threshold set to 3, only 56 types were extracted and all of them happen to be correct. The precision increased to 100%, but of course the recall rate dropped. We do not know the exact recall rate, since there are too many keywords in the training set.</Paragraph>
    <Paragraph position="5"> However, the recall rate is not important, since the whole knowledge extraction process is a repeated process. The knowledge extraction procedure should be applied repeatedly to different sets of text, and at each iteration more information will be extracted. Hence precision is much more important than recall. The knowledge sources for future identification of organizations are the accumulated lists of the organization names, the proper names of organizations, and the organization types.</Paragraph>
    <Paragraph position="6"> Table 3 contains the organization types extracted with the threshold value n=3. The organization types are classified by their lengths and sorted by their frequencies of use. Table 4 contains the extracted organization types that are associated with exactly two different names; its last line shows the erroneous extractions. Among the newly extracted organization types, only 23 are already in the old list.</Paragraph>
    <Paragraph position="7">  associated with two different names and the last lines show the error extractions.</Paragraph>
    <Section position="1" start_page="18" end_page="18" type="sub_section">
      <SectionTitle>
4.1 Strategies for On-line Identification of
Organization Names
</SectionTitle>
      <Paragraph position="0"> The knowledge about organizations extracted from the dictionaries and domain texts is used to identify organization names during on-line sentence processing. During word segmentation, an organization name is either identified immediately (if it is a known organization name) or segmented into two segments X+Y or several segments (x1+x2+...+xn)+Y, where X is a proper name and Y is the organization type. When the proper name X is a new word, it is segmented into shorter segments (x1+x2+...+xn). To simplify the experiment, we assume that the proper name X is either a word of category Nb (i.e. proper names) or Nc (i.e. place names) or a two-character unknown word. For the identification experiment, a corpus was extracted from T.V. news (http://www.ttv.com.tw). The patterns X+Y were searched for in the testing corpus. 117 different organizations were identified. Among them 56 are known organizations, i.e. they are in the organization name list. The other 61 were identified by the composition X+Y, and 52 of them are correct.</Paragraph>
      <Paragraph position="1"> This gives a precision of 52/61=85% for identifying new names. The overall precision is 108/117=92%, counting the 56 known organizations together with the 52 correct new ones.</Paragraph>
      <Paragraph position="2"> The knowledge-based approach to identifying organization names seems very promising. It outperforms the precision of 61.79% and recall of 54.50% reported in (Chen &amp; Lee 1996), even though the experiment was carried out while the knowledge extraction process was still in its initial stage. We expect the performance of the algorithm to improve as the knowledge extraction process continues.</Paragraph>
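The X+Y identification strategy described in this subsection can be sketched in Python. The sketch works on a single preselected candidate string rather than inside a word segmenter, which is a simplification. The function name identify_org, the lexicon layout (a dictionary from words to sets of category tags), and the toy data are hypothetical.

```python
def identify_org(candidate, known_orgs, org_types, lexicon):
    """Classify a candidate string as 'known', 'new', or None.

    A candidate is 'new' when it splits as X+Y with Y a known organization
    type and X either tagged Nb or Nc in the lexicon or a two-character
    unknown word, mirroring the simplifying assumption in the text.
    """
    if candidate in known_orgs:
        return "known"
    for split in range(1, len(candidate)):
        x, y = candidate[:split], candidate[split:]
        if y not in org_types:
            continue
        tags = lexicon.get(x)
        if tags is None:
            # Unknown word: accepted only when it has exactly two characters.
            if len(x) == 2:
                return "new"
        elif tags.intersection({"Nb", "Nc"}):
            return "new"
    return None


known_orgs = {"MSsoft"}
org_types = {"pc", "bank"}
lexicon = {"Taipei": {"Nc"}, "su": {"A"}}
labels = [identify_org(c, known_orgs, org_types, lexicon)
          for c in ["MSsoft", "Taipeibank", "ABpc", "supc", "ABCpc"]]
```

In this toy run the first candidate is found in the known list, the next two are accepted as new names (a place name plus a type, and a two-character unknown word plus a type), and the last two are rejected because X is a common word or an unknown word of the wrong length.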
    </Section>
    <Section position="2" start_page="18" end_page="19" type="sub_section">
      <SectionTitle>
4.2 Automatic Extraction of Name
Equivalent Classes
</SectionTitle>
      <Paragraph position="0"> Abbreviated names occur very frequently in real text, especially in the stock market domain. From observing abbreviated names, the heuristic rules for abbreviating a company name can be summarized as follows.</Paragraph>
      <Paragraph position="1"> Abbreviation rule: If the proper name of a company is unique, then the proper name is taken as its abbreviation, such as 'Microsoft'. Otherwise the abbreviation is a compound of key characters from part of its proper name and part of its line of business, as in the abbreviation of 'China petroleum'.</Paragraph>
      <Paragraph position="2"> An experiment was carried out to find the full names of the abbreviations of company names shown in the price table of the Taiwan stock market. The purposes of this experiment are a) to find the equivalent classes of company names and b) to get some idea of the recall rate of the current knowledge extraction process.</Paragraph>
      <Paragraph position="3"> The matching process between the abbreviations and the extracted organization name lists is as follows.</Paragraph>
      <Paragraph position="4"> 1. Match each abbreviation against the organization names in the organization name list, and find all the organization names containing the abbreviation.</Paragraph>
      <Paragraph position="5"> 2. Rank the matched organization names according to the following criteria.</Paragraph>
      <Paragraph position="6"> The first rank: The proper name of the organization exactly matches the abbreviation.</Paragraph>
      <Paragraph position="7"> The second rank: The abbreviation is a compound of key characters from part of the proper name and part of the line of business of the matched organization name.</Paragraph>
      <Paragraph position="8"> If several candidates have the same rank, they are ranked according to their frequencies in the training corpus.</Paragraph>
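The matching and ranking steps above can be sketched in Python. Several details are assumptions rather than statements from the source: an organization is modeled as a (proper name, line of business) pair, "containing the abbreviation" is read as containing its characters in order, and the names is_subsequence, rank_of, and match_abbreviation are hypothetical. The toy data uses Latin letters in place of Chinese characters.

```python
def is_subsequence(short, long):
    """True if the characters of short appear in long in the same order."""
    remaining = iter(long)
    return all(ch in remaining for ch in short)


def rank_of(abbr, proper, business):
    """Return 1 or 2 per the two ranking criteria, or None if neither applies."""
    if abbr == proper:
        return 1
    # Try every way to split the abbreviation into a proper-name part
    # and a line-of-business part.
    for cut in range(1, len(abbr)):
        head, tail = abbr[:cut], abbr[cut:]
        if is_subsequence(head, proper) and is_subsequence(tail, business):
            return 2
    return None


def match_abbreviation(abbr, orgs, freq):
    """orgs: list of (proper, business) pairs; freq: corpus counts by pair."""
    scored = []
    for org in orgs:
        rank = rank_of(abbr, org[0], org[1])
        if rank is not None:
            # Lower rank first; within a rank, higher frequency first.
            scored.append((rank, -freq.get(org, 0), org))
    return [org for _, _, org in sorted(scored)]


orgs = [("Micro", "soft"), ("China", "petroleum"), ("Chile", "pears"), ("CP", "foods")]
freq = {("China", "petroleum"): 10, ("Chile", "pears"): 3}
ranked_cp = match_abbreviation("Cp", orgs, freq)
ranked_micro = match_abbreviation("Micro", orgs, freq)
```

For the toy abbreviation Cp, two organizations qualify at the second rank and the more frequent one is listed first; for Micro, the exact proper-name match is the only candidate.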
      <Paragraph position="9">  There are 471 abbreviated company names in the price list of the stock market, and 302 of them have matched candidates. Each abbreviation may match many different organization names. The recall rate for the top-ranked candidate is 282/471=60%. The precision of the top-ranked candidate is 282/302=93%. Table 5 shows some of the results.</Paragraph>
      <Paragraph position="10"> Abbr. Candidates arranged in the order of their ranks  and the matched candidates (the correct answer is highlighted by the boldface characters)</Paragraph>
    </Section>
  </Section>
</Paper>