<?xml version="1.0" standalone="yes"?> <Paper uid="M98-1017"> <Title>DESCRIPTION OF THE NTU SYSTEM USED FOR MET2</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> FLOW OF NAMED ENTITY EXTRACTION </SectionTitle> <Paragraph position="0"> The following shows the flow of named entity extraction in the MET-2 formal run.</Paragraph> <Paragraph position="1"> (1) Transform Chinese texts in GB codes into texts in Big-5 codes. (2) Segment Chinese texts into a sequence of tokens.</Paragraph> <Paragraph position="2"> (3) Identify named people.</Paragraph> <Paragraph position="3"> (4) Identify named organizations.</Paragraph> <Paragraph position="4"> (5) Identify named locations.</Paragraph> <Paragraph position="5"> (6) Use an n-gram model to identify named organizations/locations. (7) Identify the rest of the named expressions. (8) Transform the results in Big-5 codes into results in GB codes. Steps (1) and (2) form the preprocessing of the named entity extraction tasks. As mentioned in Section 1, the Big-5 traditional character set and the GB simplified character set are adopted in Taiwan and in China, respectively. Our system is developed on the basis of Big-5 codes, so the official MET-2 documents must be transformed into Big-5 codes. Characters used in both the simplified and the traditional character sets always result in mapping errors. For example, G0AG66 vs. G0AGD5, G4BGDB vs. G4BG67, G1FG4C vs. G1FGD6, G4EG2D vs. G4EG9E, G8EGEB vs. G88GEB, G57GC3G79 vs. GF9GC3G79, G6AG93GFA vs. G8EG93GFA, GC6G0BGCBG30 vs. GC6G0BGCBG51, GEAGB4 vs. GEAG94, G8EG24 vs. G8EGA1, and so on.</Paragraph> <Paragraph position="6"> A Chinese sentence is composed of a sequence of characters without any word boundaries. Step (2) tries to identify words on the basis of a dictionary and segmentation strategies. We list all the possible words by dictionary look-up, and then resolve ambiguities by segmentation strategies. 
Our dictionary is trained from the CKIP corpus [8], whose articles are collected from Taiwan newspapers, magazines, and so on. The vocabulary used in MET-2 documents may differ from the vocabulary trained from Taiwan corpora, so more unknown words are introduced. For example, &quot;G8AGB7GACG74&quot; vs. &quot;G8AGB7GACGBF&quot;, &quot;GE3G46&quot; vs. &quot;GE3GB0&quot;, &quot;GA2 G42GB5&quot; vs. &quot;G5CG42GB5&quot;, &quot;GA4G5C&quot; vs. &quot;GA4G64G5C&quot;, and so on. This interferes with named entity extraction because named entities are often unknown words too.</Paragraph> <Paragraph position="7"> Table 1 summarizes our team's results in the MET-2 formal run. The F-measures in terms of P&amp;R, 2P&amp;R, and P&amp;2R are 79.61%, 77.88% and 81.42%, respectively. The recall rate and the precision rate (object scores) for the extraction of name, time and number expressions are (85%, 79%), (91%, 98%) and (95%, 85%), respectively. We will discuss the major errors for each type of named entity.</Paragraph> </Section> <Section position="5" start_page="0" end_page="12" type="metho"> <SectionTitle> NAMED PEOPLE EXTRACTION </SectionTitle> <Paragraph position="0"> The naming methods are totally different for Chinese person names and transliterated person names. The following two subsections deal with each of them.</Paragraph> <Paragraph position="1"> Identification of Chinese Person Names Chinese person names are composed of surnames and names. Most Chinese surnames are a single character; a few rare ones are two characters. The following shows three different types: (1) Single character like 'GEA', 'GA9', 'GC1' and 'GEC'.</Paragraph> <Paragraph position="2"> (2) Two characters like 'GB2GD1' and 'G99GCB'.</Paragraph> <Paragraph position="3"> (3) Two surnames together like 'G26GB1'.</Paragraph> <Paragraph position="4"> Most names are two characters and some rare ones are single characters. 
Theoretically, any character can be used in a name; names are not drawn from a fixed set. Thus the length of Chinese person names ranges from 2 to 6 characters.</Paragraph> <Paragraph position="5"> Three kinds of recognition strategies are adopted: (1) name-formulation rules, (2) context clues, e.g., titles, positions, speech-act verbs, and so on, and (3) cache. Name-formulation rules form the baseline model, which proposes possible candidates. The context clues add extra scores to the candidates. Cache records the occurrences of all the possible candidates in a paragraph. If a candidate appears more than once, it has a high tendency to be a person name. The following illustrates each strategy in detail.</Paragraph> <Paragraph position="6"> Name-formulation rules are trained from a person name corpus in Taiwan [9]. It contains 1 million Chinese person names. Each entry contains a surname, a name and the sex. During training, we divide the corpus into two partitions according to the sex of the persons. In our method, we postulate that the formulation of names is different for males and females. At first, we get 598 surnames from this 1M person name corpus, and then compute the probabilities of these characters being surnames. Of these, surnames of very low frequency like &quot; G79&quot;, &quot;G4E&quot;, etc., are removed from this set to avoid too many false alarms. Only 541 surnames are left, and these are used to trigger the person name identification system. Next, the probability of a Chinese character being the first character (the second character) of a name is computed for males and females, separately. The following models are adopted to select the possible candidates. We consider the above three types of surnames.</Paragraph> <Paragraph position="7"> Model 1. 
Single character denote the first and the second surnames,</Paragraph> <Paragraph position="9"> to be a surname or a name.</Paragraph> <Paragraph position="10"> For different types of surnames, different models are adopted. 
Because two-character surnames are always surnames, Model 2 ignores the score of the surname part. Both Models 1 and 3 consider the score of the surname. We compute the probabilities using the female and male training tables, respectively. In Models (1) and (2), either the male score or the female score must be greater than the thresholds. In Model (3), the person name must denote a female. In this case, the probability of being female must be greater than the probability of being male. The above three models can be extended to single-character names. When a candidate cannot pass the thresholds, its last character is cut off and the remaining string is tried again. Thresholds are trained from the 1-million person name corpus. We let 99% of the training data pass the thresholds.</Paragraph> <Paragraph position="11"> Besides the baseline model, titles, positions and special verbs are important local clues. When a title such as 'G6BG84' (President) appears before (after) a string, the string is probably a person name. There are 476 titles in our database. Person names usually appear at the head or the tail of a sentence. Persons may be accompanied by speech-act verbs like &quot;G1EG39&quot;, &quot;GDC&quot;, &quot;G93G37&quot;, etc. For these cases, extra scores are added to help strings pass the thresholds.</Paragraph> <Paragraph position="12"> Finally, we present a global clue. A person name may appear more than once in a document. We use a cache to store the identified candidates and reset the cache when the next document is considered. There are four cases shown below when cache is used: clues for boundary. For those similar pairs that have different weights, the entry having high weight is selected. If both have high weights, both are chosen. When both have low weights, the score of the second character of a name part is critical. 
It determines if the character is kept or deleted.</Paragraph> <Paragraph position="13"> Identification of Transliterated Person Names Transliterated person names denote foreigners. Compared with Chinese person names, the length of The transliterated names trained from MET data are regarded as a built-in name set.</Paragraph> <Paragraph position="14"> (2) character condition Two special character sets are retrieved from MET training data, Hornby [10] and Huang [11]. The first character of transliterated names must belong to a 280-character set, and the remaining characters must appear in a 411-character set. The character condition is a loose restriction. The string that satisfies the character condition may denote a location, a building, an address, etc. It should be employed with other clues (refer to (3)-(5)).</Paragraph> <Paragraph position="15"> they are used at the first time.</Paragraph> <Paragraph position="16"> (5) special verbs Persons always appear with some special verbs like &quot;G1EGBB&quot;, &quot;G5DG99&quot;, and so on. Thus the same set of verbs used in Chinese person names are also used for transliterated person names.</Paragraph> <Paragraph position="17"> Besides the above strategies, a complete transliterated person name is composed of first name, middle name and last name. For example, GCAG94GF0G43GEFGFAG15G54GEFGCAG51GEFGB2G28G05G42. The first, middle and last names are connected by a dot.</Paragraph> <Paragraph position="18"> Cache mechanism is also helpful in the identification of transliterated names. A candidate that satisfies the character condition and one of the clues will be placed in the cache. At the second time, the clues may disappear, but we can recover the transliterated person name by checking cache. The following shows an example: ... G26GE1G56G5BGCB ...GEB... 
G26GE1G56GD9GB8 ...</Paragraph> <Paragraph position="19"> Title does not show up, when the name is mentioned again.</Paragraph> <Paragraph position="20"> The summary report in Table 1 shows the recall rate and the precision for person names are 91% and 74%, respectively. The major errors are listed below: (1) segmentation In our treatment, segmentation is done before named entity extraction. Part of person names may be regarded as words during segmentation. The following show some examples. The characters &quot;GEAGC5&quot;, &quot;G26GC1&quot;, and &quot;G67G1F&quot; are common content words.</Paragraph> <Paragraph position="22"> In this case, the person name is missed.</Paragraph> <Paragraph position="23"> (2) surname set and character set Those characters not listed in surname set are not considered as surnames, so that they cannot trigger our identification system. The characters &quot;G2C&quot; and &quot;G15&quot; in person names &quot;G2CG02G3C&quot; and &quot;G15 GE1G0D&quot; are typical examples. Similarly, if the character of a transliterated person name does not belong to the predefined character set, the character will be neglected. For example, &quot;GCF&quot; in &quot;G43G05 GCFG0AGF2&quot; is not listed in the character set, and the scope error happens.</Paragraph> <Paragraph position="24"> Titles are important clues for the identification of transliterated person names. Even if a transliterated name satisfies the character condition, it is not identified without title. 
The name &quot;G43 GDB&quot; in the string &quot;G6BG8AG43GDB&quot; is missed because &quot;G6BG8A&quot; is not listed in our title set.</Paragraph> </Section> <Section position="6" start_page="12" end_page="12" type="metho"> <SectionTitle> (6) Japanese names </SectionTitle> <Paragraph position="0"> The current version cannot deal with Japanese names like &quot;G8DGC4G76GFEGAE&quot;.</Paragraph> </Section> <Section position="7" start_page="12" end_page="12" type="metho"> <SectionTitle> NAMED ORGANIZATION EXTRACTION </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="12" end_page="12" type="sub_section"> <SectionTitle> Extraction Algorithm </SectionTitle> <Paragraph position="0"> The structure of organization names is more complex than that of person names. Basically, a complete organization name can be divided into two parts, i.e., name and keyword. The following specifies the rules we adopted to formulate its structure.</Paragraph> <Paragraph position="1"> e.g., GA9G6AG84GF7 GF7GE3 In current version, we collect 776 organization names and 1059 organization name keywords. Transliterated person names and location names in the above rules still have to satisfy the character condition mentioned in last section. However, the character set is trained from transliterated person name corpus. It may not be suitable for location names. Consider an example &quot;GDDG95GA8GE2G0AG66G42G2EGEF&quot;. &quot;GDDG95GA8 GE2&quot;, which is a lake in China, is not a transliterated name. The characters &quot;G95&quot; and &quot;GA8&quot; do not belong to the character set. Here, we utilize the feature of multiple occurrences of organization names in a document and propose n-gram model to deal with this problem. Although cache mechanism and n-gram use the same feature, i.e., multiple occurrences, their concepts are totally different. For organization names, we are not sure when a pattern should be put into cache because its left boundary is hard to decide. 
In our n-gram model, we select those patterns that meet the following criteria: (1) It must consist of a name and an organization name keyword.</Paragraph> <Paragraph position="2"> (2) Its length must be greater than 2 words.</Paragraph> <Paragraph position="3"> (3) It does not cross a sentence boundary or any punctuation mark.</Paragraph> <Paragraph position="4"> (4) It must occur at least two times.</Paragraph> <Paragraph position="5"> Table 1 shows that the recall rate and the precision rate for the extraction of organization names are 78% and 85%, respectively. The following shows the error analysis.</Paragraph> <Paragraph position="6"> (1) more than two content words between name and keyword In the current version, we accept only two interference words. Thus, the string &quot;GC4G66 G38G7A G1EGCB G2D G52 GDDG50&quot; is not recognized.</Paragraph> <Paragraph position="7"> (2) absence of keywords Keywords are important indicators of the right boundary. The string &quot;GFAG35G5BG1AG9CG4AG2F&quot; lacks a keyword, so it is missed.</Paragraph> <Paragraph position="8"> Currently, we have 45 location keywords. The following shows some examples: 'GB5', 'GC4GFF', 'GDDG58', 'G27G3E', 'G27G42', 'G27G36', 'G27GFB', 'G41GD5', 'G41G51', 'G68', 'G68GC4GFF', etc. There are 16,442 built-in location names in the current version. For the treatment of location names without keywords, we also introduce some locative verbs like 'G67G35', 'GF2GEA', and so on. The objects following this kind of verb may be location names. For example, in the string &quot;G5DGEAGFAG58G27GA5&quot;, &quot;GFAG58G27GA5&quot; will be identified. Cache is also useful. For example, assume 'GFAG15GD3GA6G68' is recognized as a location name and placed in the cache. When 'GFAG15GD3GA6' appears, it will be identified as a location name even if the location name keyword is omitted. 
The n-gram model is also employed to recover those names that do not meet the character condition.</Paragraph> <Paragraph position="9"> Table 1 shows that the recall rate and the precision rate in this part are 78% and 69%, respectively. The performance is worse than that for named people and named organizations. The major types of errors are shown below.</Paragraph> <Paragraph position="10"> (1) character set The characters &quot;G13&quot; and &quot;GD5&quot; in the string &quot;G13G74GD5G53&quot; do not belong to our transliterated character set. Actually, it denotes a Japanese location name.</Paragraph> <Paragraph position="11"> (2) wrong keyword The character &quot;GF4&quot; is an organization keyword. Thus the string &quot;G6EG4AGE6G24GF4&quot; is mistakenly regarded as an organization name.</Paragraph> <Paragraph position="12"> (3) common content words The words such as &quot;GF3GD1&quot;, &quot;GA9G7A&quot;, etc., are common content words. We do not give them special tags.</Paragraph> </Section> </Section> <Section position="8" start_page="12" end_page="12" type="metho"> <SectionTitle> (4) single-character locations </SectionTitle> <Paragraph position="0"> The single-character locations such as &quot;GC4&quot;, &quot;G09&quot;, and so on, are missed during recognition. A total of 57 errors are of this type.</Paragraph> <Paragraph position="1"> (5) interference words between name part and keywords There are words between the name part and the keywords. For example, &quot;GA4G64G44GC7GF1GC4GFF&quot; and &quot;GADG54G23 GD6GDBG6FG1EGCBG48&quot;. 
Here the words &quot;GC7GF1&quot; and &quot;GDBG6F&quot; are common words in China newspaper, but seldom used in Taiwan.</Paragraph> </Section> <Section position="9" start_page="12" end_page="12" type="metho"> <SectionTitle> OTHER ENTITY EXTRACTION </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="12" end_page="12" type="sub_section"> <SectionTitle> Extraction Algorithm </SectionTitle> <Paragraph position="0"> We use grammar rules to capture the remaining entities, including date/time expressions and monetary and percentage expressions. The following shows the specification of each type of expressions. Each rule is accompanied with an example.</Paragraph> <Paragraph position="1"> Rule-based approach is simple. We can add, delete and modify rules quickly without modifying the identification programs. However, the above rules cannot capture ambiguous cases. For example, &quot;G1D&quot; may mean the address number (e.g., GC4GB5G58G87G1D) or the date number (e.g., G96G0BG87G1D). Augmented grammar rules are needed to introduce constraints to check if the extracted entity can fit into the context. The summary report in Table 1 shows that the recall rate and the precision rate for date expression, time expression, monetary expression and percentage expression are (94%, 88%), (98%, 70%), (98%, 98%) and (83%, 98%), respectively. The major errors are shown as follows: (1) propagation errors Because we employ a segmentation system to identify basic tokens before entity extraction, some words like &quot;G48GD5&quot;, &quot;GD5G4C&quot;, etc., are regarded as terms. In this way, &quot;GD5&quot; is always missed. Similarly, named people are extracted before date expressions. The errors resulting from the previous steps propagate to the next steps. 
Consider the following example.</Paragraph> <Paragraph position="2"> G351998GFAGCAG34GD8GA6 ...</Paragraph> <Paragraph position="3"> The named people extraction procedure regards &quot;GFAGCAG34GD8GA6&quot; as a transliterated name. After that, &quot;1998GFA&quot; is missed because the date unit is absent.</Paragraph> <Paragraph position="4"> (2) absent date units In some sentences, the date unit &quot;GFA&quot; does not appear, so that &quot;G82G87G87GDB&quot; is missed. In some examples like &quot;G87G0BG93&quot;, the date unit should appear but it is absent. Thus it is also not captured. (3) absent keywords Some keywords are not listed. For example, &quot;GACGA5GF4G0BGCB&quot;, &quot;G36GD5&quot;, and so on. Thus, for &quot;G99GE8 GACGA5GF4G0BGCB8GF158GE0&quot; and &quot;1960GFAG36GD5&quot;, only some fragments, e.g., &quot;G99GE8&quot;, &quot;8GF158GE0&quot;, and &quot;1960GFA&quot; are identified.</Paragraph> <Paragraph position="5"> (4) rule coverage Patterns like &quot;GD5GECG2BG76GFA&quot; are not considered in this version, thus they are missed. Similarly, the percentage expressions like &quot;2/3&quot;, &quot;G89G93GE0GC7G82&quot;, &quot;G24GDBGE0GC7G82&quot;, and so on, are not represented in our grammar.</Paragraph> <Paragraph position="6"> (5) ambiguity Some characters like &quot;GF1&quot; can be used in time and monetary expressions. Expression &quot;G93G89GF1G85 G85G30G04GD8&quot; is divided into two parts: &quot;G93G89GF1&quot; and &quot;G85G85G30G04GD8&quot;. Similarly, the strings &quot;G93GE0&quot; and &quot;G82G0B&quot; are words. In our pipelined model, &quot;G87GF1G93GE0&quot; and &quot;G97GE8G82G0B&quot; will be missed.</Paragraph> </Section> </Section> </Paper>