<?xml version="1.0" standalone="yes"?>
<Paper uid="W97-0115">
  <Title>Statistical Acquisition of Terminology Dictionary*</Title>
  <Section position="3" start_page="143" end_page="145" type="metho">
    <SectionTitle>
2. New Word Extraction
</SectionTitle>
    <Paragraph position="0"> A Chinese word is usually composed of no more than 4 Chinese characters. Most of the words are uni-grams, hi-grams, tri-grams and 4-grams. Uni-grams only consist of one character, and most of them are common words and then can be found in universal dictionaries. The number of n-grams with n&gt;4 is very small, and the occurrence of most of them is rare. Among the 9000 most frequently used words, far below 1% of them are longer than 4 characters \[ 9 \] . In addition, most of these words are idioms or terminologies, then can be extracted in the phrase generation phase. Therefore, in this section, only bi-grams, tri-grams and 4-grams are taken imo consideration.</Paragraph>
    <Paragraph position="1"> Now consider two neighboring characters A and B. We call these two characters as a bi-gram candidate. They belong to either the same word, or two neighboring words. We can intuitively suppose that the two characters are more correlate to each other when they belong to the same word. Therefore, we may choose a statistic to measure the correlation coefficient of neighboring characters, then use this statistic to judge the probability that they belong to the same word.</Paragraph>
    <Paragraph position="2"> The correlation coefficient could be measured by several methods, such as co-occurrence frequency, mutual information, generalized likelihood estimation, chi-square test, Dice coefficient. Among them, chi-square test needs special attention. First, it is closely related to the binomial distribution model of text. Second, the computation is quite simple. Experiment in section 5 also showed that it could lead to better performance. Following is the detailed description of this method.</Paragraph>
    <Paragraph position="3"> Compare each bi-gram (4, B) candidate to every two neighboring characters ( C,, C,-1) in the text sequence C-- ( CIC:'&amp;quot;C,C,-z &amp;quot;&amp;quot;Cn ), where n is the size of the text, and record the comparison results. Thus there are four types of results altogether:  n. 1 vl. 2 Vii ill If the characters A and B occur independently, then we would expect P(AB)=P(A) XP(B), where P(ABJ is the probability of A and B occurring next to each other; P(A) is the probability of A, P(B) is the probability of B. To test the null hypothesis P(ABJ=P(A) XP(B), we compute the chi-square statistic:</Paragraph>
    <Paragraph position="5"> The above equation can be simplified as: Z 2 = n(n. x n= - n,2 x n22) 2 nt X/q2.Xn.t X n.2 We define the correlation coefficient of characters A and B to be the value of chi-square test. Those bi-gram candidates with correlation coefficient smaller than a pre-defined threshold are considered to occur randomly and should be discarded. Others are sorted according to their correlation coefficient in descending order.</Paragraph>
    <Paragraph position="6"> Tri-gram and 4-gram candidates are processed in the same way. To compute the correlation coefficient of all tri-grams, we shouldn't set the null hypothesis to P(ABC)=P(A) XP(B) XP(C), otherwise we would be faced with the critical problem of data sparseness and then get unreliable and vulnerable results. In alternate, we just look a tri-gram as the combination ofa bi-gram and a character, then calculate their correlation coefficient. Similarly, a 4-gram can be looked either as the combination ofa tri-gram and a character, or two bi-grams.</Paragraph>
    <Paragraph position="7"> The rest of bi-gram, tri-gram, 4-gram candidates constitute 3 separate tables. In these tables, many candidates are available in the universal dictionary, others are potential words. These potential words are carefully examined by skillful computer professionals, and many of them are accepted and then appended to the dictionary in order to improve segmentation precision. These words are called new words. Human intervention is still inevitable, since statistical methods not only generate useful, but also noisy words. Thresholds can be applied to limit this effect, but &amp;quot;an't eliminate it.</Paragraph>
    <Section position="1" start_page="144" end_page="145" type="sub_section">
      <SectionTitle>
3. Terminology Word Extraction
</SectionTitle>
      <Paragraph position="0"> rminology words are divided into two subsets and treated respectively. Most of them have s in the universal dictionary. These words should be extracted from the new word tables.</Paragraph>
      <Paragraph position="1"> number of new words is limited, and most of new words are domain specific words qnologies and proper names, this work is also done manually.</Paragraph>
      <Paragraph position="2"> &amp;quot;minologies are available in the universal dictionary. They are either frequently used  words, such as &amp;quot;i=t'~ ( computer )&amp;quot; and &amp;quot;~.~ ( network )&amp;quot;, or have meanings outside of science areas, such as &amp;quot;f'~tL~ ( agent )&amp;quot; and &amp;quot; ~.~.~ ( procedure )&amp;quot;. These words are also extracted in statistical method.</Paragraph>
      <Paragraph position="3"> If a word is a terminology, then it probably occurs more often in related domain corpus than normal. Let Pc(W) be the frequency of word W in domain corpus, P,(W) be the normal frequency of W. If Pc(W)&gt;&gt;P,(W), W is extracted and further examined by professionals, otherwise it is discarded. In the following experiment, this formula is replaced with Pc(W) &gt; T2 * P,(W), where T2 is a threshold. Similar method could be found in Zhou95 \[ 15\] .</Paragraph>
      <Paragraph position="4"> To gather all word frequency information in a specific domain, the domain corpus should be first segmented with the augmented dictionary. The normal frequency could be obtained either from a balanced on-line frequency dictionary or a universal corpus. Since on-line frequency dictionary is not available for us, another universal corpus is used. For those words which appear in the domain corpus, but don't appear in the universal corpus, P, is approximately replaced with the average frequency of all words.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="145" end_page="145" type="metho">
    <SectionTitle>
4. Terminology Phrase Generation
</SectionTitle>
    <Paragraph position="0"> Terminology phrases are word pairs composed of terminology words and other words.</Paragraph>
    <Paragraph position="1"> Current research only concerns word pairs. Terminology phrases are generated in three steps.</Paragraph>
    <Paragraph position="2"> At the first step, all the candidate phrases are extracted. The whole corpus is segmented with the augmented dictionary in advance. A small window is put over each terminology word appearing in the text sequence. Candidate terminology phrases are those word pairs which are composed of one terminology word and another word inside this terminology's border window.</Paragraph>
    <Paragraph position="3"> Those word pair's with too low frequencies are filtered out.</Paragraph>
    <Paragraph position="4"> Whether a word pair is a phrase is measured by its weight. At the next step, most of candidates are also filtered out if their weights are too small. A word pair's weight is mainly decided by its correlation coefficient. In addition, two heuristic rules are adopted to modify the weights:  introduced for this reason. This table contains more than 1000 Chinese function words, such as &amp;quot;~ (of)&amp;quot;and&amp;quot;~ (be)&amp;quot;.</Paragraph>
    <Paragraph position="5"> At the last step, all the remaining word pairs are manually examined. Those accepted phrases as well as terminologies words compose the final terminology dictionary.</Paragraph>
  </Section>
  <Section position="5" start_page="145" end_page="389" type="metho">
    <SectionTitle>
5. Implementation and Results
</SectionTitle>
    <Paragraph position="0"> Two corpora were chosen for this research. One is a Computer World corpus (CW). It is composed of all articles of the newspaper &amp;quot;Computer World ( ~t'~'LIJ~L~- )&amp;quot; from 1990 to 1994. The 100M bytes corpus contains more than 40M Chinese characters. The other is a universal corpus -- XinHua news ( ~.~.~+- ~.,kK~ ) corpus (XN). It contains more than 8,000 I</Paragraph>
    <Paragraph position="2"> news articles with 10M bytes of text.</Paragraph>
    <Paragraph position="3"> CW corpus contains many computer terminologies, most of which just appeared in last two decades. Therefore, only a small number of them have entries in universal dictionaries. XN corpus also contains many new words, but the number is much smaller.</Paragraph>
    <Paragraph position="4"> To collect new words, each article was scanned and all the bi-gram, tri-gram and 4-gram candidates with frequency greater than threshold T\] were extracted ( for CW corpus, Tr=4, for XN corpus, T~=2 ). In addition, some shorter candidates were actually parts of longer ones, and couldn't exist independently. For example, every time &amp;quot;~31~rL&amp;quot; was seen in the text, it followed &amp;quot;i~'; every time &amp;quot;l~&amp;quot; was seen, it was followed by &amp;quot;~:&amp;quot;. So &amp;quot;~g~L&amp;quot; and &amp;quot;\[~g&amp;quot; are only parts of longer candidates &amp;quot;~ ( computer )&amp;quot; and &amp;quot;1~: (Afghanistan)&amp;quot;. Thus they should be removed from candidate tables.</Paragraph>
    <Paragraph position="5"> The remaining candidates were sorted by their correlation coefficient in descending order.</Paragraph>
    <Paragraph position="6"> Those candidates on the top of the table have higher probability to be real words. To evaluate the computing methods, we may consider the distribution in the candidate table of those words available in the dictionary. These words are called as available words. Let D be the sorted candidate table, DS be a sub.table of D starting from the beginning of D. Two evaluation standards precision and recall were defined as follows:  becomes higher, MI is better than others. Since only top of the table should be further examined manually, CHI method was chosen.</Paragraph>
    <Paragraph position="7"> Figure 3 demonstrates the Recall-Precision curves of two corpora using CHI method.</Paragraph>
    <Paragraph position="8"> Although XN corpus is only one tenth of CW in size, it gains better results. This result can be attributed to the fact that XN corpus contains less new words.</Paragraph>
    <Paragraph position="9"> There are more than 400,000 bi-gram candidates in CW corpus. Among them, 17,779 are available words. Only 61,584 candidates have frequencies greater than Ti(Ti=4), including 7,089 available words. These candidates compose the bi-gram candidate table. New words are extracted from the top 16% of this table. Among these 9,856 high-rank candidates, 4,041 are available in the dictionary, which amount to 57% of all the available words in the whole table. The remaining 5,815 were potential new words and then further examined by computer professionals. Finally, 1,699 were accepted. Similar results were obtained from tri-gram and 4-gram candidates. A little more differently, the proportion of available words in tri-grarn and 4-gram candidate tables is much smaller than in hi-gram table. Therefore, new words were only extracted from the top 4% tri-grams and the top 2% 4-grams. The quantities of accepted tri-grams and 4-grams is also smaller than that of bi-grarns. Table 2 presents the vocabulary distribution of CW corpus. Among the whole vocabulary, more than 10% are extracted new words. Later the recall and precision were recalculated using the augmented dictionary. Figure 4 demonstrates the Recall-Precision curves of Computer World corpus using original dictionary and augmented dictionary respectively. We can find that the precision is significantly improved aRer new words were appended.</Paragraph>
    <Paragraph position="11"> To extract terminology words from new words, all new words were manually examined and put to any of three categories: terminology words, proper names and other domain specific words, or to say, those words which are related to this domain to some degree, but cannot be considered as terminology of this domain, for example: ~eg:~:~ ( cable 'IV ) and computer domain. Table 3 shows the distribution of new words. Table 4 presents some example words with highest correlation coefficient. From table 3 and table 4 we can see, about one fourth of new words are terminology words; another one fourth are proper names; the rest are other domain specific words. Those words with highest correlation coefficient are almost terminology words and proper names. In addition, many tri-grams are proper names, because most of Chinese names are composed of 3 characters. Since Chinese name recognition is also an complex problem in Chinese real text processing, this method can also be utilized to recognize names.</Paragraph>
    <Paragraph position="12">  proper names  To extract terminologies from the original universal dictionary, the frequency of each of the 25,277 words in CW corpus was compared to the frequency in XN corpus. The threshold of T2 was set to 3. only 1,938 words' frequencies in CW corpus were three times higher than in XN and then satisfied this threshold limitation. These words were further categorized manually. The categorization results are demonstrated in table 5.</Paragraph>
    <Paragraph position="13">  We can find terminologies extracted from the universal dictionary are much fewer than those extracted from new words: of the 1,938 words, only 323 were accepted finally. In addition, to make sure only a small portion of terminology words had been missed, 1,000 words were randomly selected from the rest 23,329 words and only 4 were found to be terminologies. This helped to explained that most of the terminology words in the universal dictionary had been extracted.</Paragraph>
    <Paragraph position="14"> Terminology phrases were later extracted from the combination of 1,034 terminology words and their neighboring words within a distance of +-3. There are altogether 35,178 phrase candidates with frequency greater than a threshold T3 (here T3ffi3). Random sampling showed that 30% of them are acceptable terminology phrases. These candidates' weights were computed in the method introduced in section 4. Then they were sorted in descending weight order. Figure 5 shows the approximate recall-precision curve of terminology phrase extraction. The reason for approximate evaluation was that it was impossible to manually examine all 35,178 terminology phrases, therefore only randomly selected 3,000 candidates were examined. From figure 5, we can find that the performance of phrase extraction wasn't as good as that of word extraction. This phenomenon can be explained by the fact that some highly associated candidates still couldn't compose terminology phrases. Most of these pseudo phrases can be divided into two classes:  Terminology phrases were extracted from the top 20% ( with precision of about 50%) terminology phrase candidates, these candidates were examined manually and 3,471 were accepted. These 3,471 phrases as well as the 1,034 words compose our computer terminology dictionary. Table 6 presents some example terminologies with high rank.</Paragraph>
    <Paragraph position="15"> 100 pieces of article of 72K bytes were randomly selected to test the coverage of this terminology dictionary. A simple automatic pattern matching program was used to identify terminologies and 1,174 occurrences of terminologies were spotted. This identification procedure was also done by several graduate students major in computer science. The automatic recognition</Paragraph>
    <Paragraph position="17"> results were compared to the union set of three experimenters. 89.5% of all terminologies found by experimenters were als0 found by the program. And 73.9% of all the program output was judged to be correct. The relatively lower precision can be attributed to the fact that some terminologies, especially those available in the original dictionary, have meaning outside computer domain. In large scale natural langnage processing applications where context</Paragraph>
  </Section>
class="xml-element"></Paper>
Download Original XML