<?xml version="1.0" standalone="yes"?> <Paper uid="W97-0115"> <Title>Statistical Acquisition of Terminology Dictionary*</Title> <Section position="2" start_page="0" end_page="143" type="abstr"> <SectionTitle> 1. Introduction </SectionTitle> <Paragraph position="0"> Terminologies are specialized words and compound words used in a particular domain, such as computer science. They are extensively used in scientific articles. Previous research had shown that about 25% of the words in science abstract were technical words \[ 6 \]. Therefore, the ability to automatic identification of terminology could greatly aid any domain related natural language processing applications, such as automatic indexing, information retrieval and document categorization. For example, automatic indexing is the foundation of many other relevant tasks. It needs to automatically identify those words which most appropriately reflect a text's theme. Since terminologies are highly relevant to the text's domain, they are proved to be much valuable index words. Even in more universal applications such as semantic analysis and translation, terminologies also play important roles, and therefore require special treatment.</Paragraph> <Paragraph position="1"> Unfortunately, the identification of terminology is a hard work. Most terminologies don't have entries in universal dictionaries. In addition, terminology dictionaries are hilly variable in the coverage. For example, computer science dictionaries' coverage of computer science terminology ranged from 24% to 66% \[ 6\] .</Paragraph> <Paragraph position="2"> * This paper is supported by Chinese Natural Science Foundation and high technology 863 project.</Paragraph> <Paragraph position="4"> With regard to Chinese, the identification procedure is even more difficult. First, there are scarcely any available machine readable Chinese dictionaries for specialized domains. Therefore, the generation of terminology dictionary would inevitably require a great deal of tedious and time consuming manual work. Second, in most Indo-European languages, even a word couldn't be found in the dictionary, it still could be separated by the spaces between it and neighboring words; however, Chinese is written in character sequences, with no delimiters between successive words. Hence the first step of Chinese information processing is necessarily to segment the character sequences into word sequences. The main knowledge base of segmentation is the dictionary. However, most of the terminologies couldn't be found in the dictionary.</Paragraph> <Paragraph position="5"> Therefore, before further processing, those domain specific words which are unavailable in the dictionary should be extracted and added to it. This procedure is called new word extraction.</Paragraph> <Paragraph position="6"> Due to the availability of large scale on-line real text, corpus based natural language research has become one of the focuses of computational linguistics. Among all the corpus l~ased researches, some of them are quite similar to the work reported here, including sublanguage vocabulary identification \[ 6 \] , automatic suggestion of significant terminology \[ 15 \] , identification and translation of technical terminology\[ 3 \], automatic extraction of terminology \[4\] . For example, Haas introduced a method for automatic identification of snblanguage vocabulary words. 
<Paragraph position="8"> There is also much valuable work in China, especially on new word extraction from Chinese text. Wang Kai-zhu presented a statistical method to extract possible words from texts, in which the weights of possible words were calculated from their frequency and length information [13]. Zhang Shu-wu presented a strategy that uses co-occurrence frequencies to collect new words [14]. Pascale Fung extended CXtract, a tool originally designed for extracting English compounds, to collect new words in order to improve segmentation precision [9].</Paragraph>
<Paragraph position="9"> Due to the distinct characteristics of Chinese, there is still no systematic approach to generating practical and relatively complete Chinese terminology dictionaries from on-line corpora. In this paper, a semi-automatic approach is developed to extract technical words and phrases from corpora. The approach integrates new word collection, terminology word extraction, and terminology phrase generation, and it can significantly reduce the manual effort needed to build a terminology dictionary. First, domain-specific words that cannot be found in the universal dictionary are identified. Second, terminology words are extracted from these new words as well as from the universal dictionary. Then compound words that combine terminology words with other words are generated.</Paragraph>
<Paragraph position="10"> The remaining sections are organized as follows: Section 2 introduces the identification of domain-specific words; Section 3 describes how to extract terminology words from the universal dictionary; Section 4 presents the method for terminology phrase extraction; Section 5 provides detailed experimental results; and the final section gives concluding remarks.</Paragraph>
</Section>
</Paper>