File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/94/c94-1096_metho.xml
Size: 16,539 bytes
Last Modified: 2025-10-06 14:13:42
<?xml version="1.0" standalone="yes"?> <Paper uid="C94-1096"> <Title>AN IBM-PC ENVIRONMENT FOR CHINESE CORPUS ANALYSIS</Title>
<Section position="1" start_page="0" end_page="0" type="metho"> <SectionTitle> AN IBM-PC ENVIRONMENT FOR CHINESE CORPUS ANALYSIS </SectionTitle> <Paragraph position="0"/> </Section>
<Section position="2" start_page="0" end_page="0" type="metho"> <SectionTitle> ABSTRACT </SectionTitle>
<Paragraph position="0"> This paper describes a set of computer programs for Chinese corpus analysis. These programs include (1) extraction of different characters, bigrams and words; (2) word segmentation based on bigrams, maximal-matching and the combined technique; (3) identification of special terms; (4) Chinese concordancing; (5) compiling collocation statistics; and (6) evaluation utilities. These programs run on the IBM-PC, and batch programs co-ordinate their use.</Paragraph> </Section>
<Section position="3" start_page="0" end_page="584" type="metho"> <SectionTitle> I. INTRODUCTION </SectionTitle>
<Paragraph position="0"> Corpus analysis utilities have been developed and are widely available for English. For example, the Oxford Concordance Program is available for over 10 kinds of mainframe computer (Hockey and Martin, 1987), and the Longman mini-concordancer (Tribble and Jones, 1990) is available for sale. Further enhancements of these utilities include compiling collocation statistics (Smadja, 1993) and semi-automatic glossary construction (Tong, 1993). Current research has focused on bilingual corpora (Gale and Church, 1993), with the alignment of parallel text becoming an important technical problem. However, there has been little development of corpus analysis tools for Chinese. Since Chinese computing has only become generally available in the last ten years, analysis utilities for Chinese are not widely available. Although no integrated environment is available for Chinese corpus analysis, many specific analysis programs have been reported in the literature (Kit et al., 1989; Tong et al., 1993; Chang and Chen, 1993; Zhou et al., 1993). A Chinese concordance program and a character-list extraction program are freely available from a Singapore (FTP) network site (Guo and Liu, 1992). However, these programs run on SUN workstations, while many users, particularly non-computing experts, interact with an IBM-PC in Chinese rather than with a SUN workstation.</Paragraph>
<Paragraph position="1"> The rapid advance of microcomputers has mitigated many storage and processing-speed problems.</Paragraph>
<Paragraph position="2"> As for storage, hard disk capacity can reach as high as 340M bytes, which is adequate in comparison with the demands of a corpus (8M bytes for the PH corpus) and a dictionary (10M bytes). Using a 486 processor, the processing speed is acceptable if the user expects data to be analyzed overnight, similar to submitting a batch job to a mainframe computer. For example, the utilities we are developing ranked around 42,000 words in a few minutes and produced about one hundred lines of keyword-in-context in a few seconds for a 4-million-character Chinese corpus.</Paragraph>
<Paragraph position="3"> This paper describes our effort to develop corpus analysis programs for Chinese. The programs are written in Turbo C++ and implemented on an IBM-PC (486) with a 120M byte hard disk. The programs are divided into several types: a. format conversion programs (norm.exe, phseg.exe, wform.exe); b. extraction of characters, bigrams and words (exsega.exe, exsegmi.exe, bigram.exe, miana.exe, worddh.exe, wranka.exe, wlranka.exe); c. word segmentation programs (bisegl.exe, whash.exe, bimaxn.exe); d. concordancing programs (kwic.exe, kwicw.exe); e. collocation statistics programs (cxtract.exe, cxtractw.exe, cxstat.exe, cxstatw.exe); f. general (evaluation) programs (wcomp.exe, segperf.exe). To run these analysis utilities, a Chinese computing environment called Eten must be set up; otherwise Chinese characters cannot be displayed or entered. Since there are many different Chinese characters (about 13,000) compared with Western languages, each Chinese character is specified by two bytes instead of one. However, many documents include both single-byte characters and two-byte Chinese characters. Thus, the conversion program, norm.exe, is used to convert all the single-byte characters (i.e. the letters A..Z and a..z, the digits 0..9, the space, the underscore and the usual ASCII punctuation marks such as :, ;, $, %, ^, &, *, parentheses, -, +, =, \, /, <, >, braces and quotes) into their corresponding two-byte equivalents, for simplicity. For example, the single-byte character &quot;a&quot; is converted to its two-byte (full-width) equivalent. This program also changes the document into a clause or phrase format, using the -e option, where a new line is inserted after a punctuation mark (e.g. a comma or full stop).</Paragraph>
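To make the conversion concrete, a minimal sketch is given below. It is not the authors' norm.exe: it maps printable ASCII to the Unicode full-width forms (code point + 0xFEE0, with U+3000 for the space) rather than to the Big5/Eten two-byte codes the IBM-PC environment would have used, and it imitates the -e option by starting a new line after a converted comma or full stop.

#include <iostream>
#include <string>

// Append one Unicode code point to a UTF-8 string (BMP code points are enough here).
static void put_utf8(char32_t cp, std::string& out) {
    if (cp < 0x80) { out += static_cast<char>(cp); }
    else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

int main() {
    std::string line, out;
    while (std::getline(std::cin, line)) {
        for (unsigned char c : line) {
            if (c >= 0x80) { out += static_cast<char>(c); continue; }  // leave existing multi-byte text alone
            if (c == ' ')  { put_utf8(0x3000, out); continue; }        // full-width (ideographic) space
            if (c >= 0x21 && c <= 0x7E)
                put_utf8(char32_t(c) + 0xFEE0, out);                   // '!'..'~' -> full-width forms
            else
                out += static_cast<char>(c);
            if (c == ',' || c == '.') out += '\n';                     // clause/phrase format (-e style)
        }
        out += '\n';
    }
    std::cout << out;
}

Running this sketch over a mixed ASCII/Chinese file leaves the Chinese bytes untouched and produces one clause per line.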
<Paragraph position="5"> If the text is already segmented into words by space or "/" markers, it is possible to change or delete these markers using the -s option. Once the document is converted into two-byte format using norm.exe, the other utilities can be used. Batch programs can be written to use these utilities. For example, the following batch program extracts different characters, performs bigram segmentation, extracts different words and obtains only the top 10% of the extracted words for compiling keyword-in-context lists and collocation statistics.</Paragraph>
<Paragraph position="6">
norm -t %1 -o corp.tmp -e 2 -s 5          /* 1-byte to 2-byte; phrase format; delete space */
exsega -t corp.tmp -w 2 -b 0              /* extract different characters and bigrams */
bigram -m 10                              /* sort bigrams and extract top 10% */
bisegl                                    /* segment using the top 10% bigrams */
worddh                                    /* extract different words */
wranka -m 10                              /* sort and extract top 10% words */
kwic -t corp.tmp -k words.cut > kwic.lst  /* concordancing on the top 10% words */
cxtract                                   /* extract different characters from contexts */
cxstat                                    /* compile collocation statistics */
</Paragraph> </Section>
<Section position="4" start_page="584" end_page="585" type="metho"> <SectionTitle> II. EXTRACTION PROGRAMS </SectionTitle>
<Paragraph position="0"> The extraction programs assume that the text is not segmented. Thus, norm.exe should be used to remove markers from segmented text.</Paragraph>
<Paragraph position="1"> The programs exsega.exe and exsegmi.exe extract different characters and their co-occurring characters, stored in cfreq.tmp (Fig 2) and bifile.tmp/mifile.tmp respectively. The first program obtains the co-occurrence frequencies, while the second obtains the mutual information. By default, the programs do not count punctuation, but this can be overridden using the -a option. The different characters can be supplemented with information about their frequencies, percentages and cumulative percentages if the -w option is set to 2. Fig 2 shows the different characters extracted from the Hong Kong Basic Law, ranked by their frequencies; the first number is the frequency, followed by the percentage, cumulative percentage and rank number. By default, all the different characters are stored. However, sometimes only the most frequently or infrequently occurring characters are interesting candidates for further investigation (e.g. concordancing).</Paragraph>
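As an illustration of what exsega.exe and exsegmi.exe compute, the following sketch (not the released programs, which work on the two-byte Eten encoding rather than UTF-8) counts character unigrams and adjacent-character bigrams and scores each bigram by its co-occurrence frequency and by the usual pointwise mutual information MI(x, y) = log2(P(x, y) / (P(x)P(y))).

#include <cmath>
#include <cstdint>
#include <iostream>
#include <iterator>
#include <map>
#include <string>
#include <utility>

// Minimal UTF-8 decoder (BMP only) -- stands in for the two-byte decoding
// the original programs would have done.
static std::u32string decode_utf8(const std::string& s) {
    std::u32string out;
    for (std::size_t i = 0; i < s.size();) {
        unsigned char c = static_cast<unsigned char>(s[i]);
        if (c < 0x80) { out.push_back(c); i += 1; }
        else if (c < 0xE0 && i + 1 < s.size()) {
            out.push_back(((c & 0x1F) << 6) | (static_cast<unsigned char>(s[i + 1]) & 0x3F));
            i += 2;
        } else if (i + 2 < s.size()) {
            out.push_back(((c & 0x0F) << 12) |
                          ((static_cast<unsigned char>(s[i + 1]) & 0x3F) << 6) |
                          (static_cast<unsigned char>(s[i + 2]) & 0x3F));
            i += 3;
        } else break;
    }
    return out;
}

int main() {
    std::string raw((std::istreambuf_iterator<char>(std::cin)),
                    std::istreambuf_iterator<char>());
    std::u32string text = decode_utf8(raw);

    std::map<char32_t, long> uni;                             // character frequencies (cf. cfreq.tmp)
    std::map<std::pair<char32_t, char32_t>, long> bi;         // adjacent-character bigrams (cf. bifile/mifile)
    for (std::size_t i = 0; i < text.size(); ++i) {
        ++uni[text[i]];
        if (i + 1 < text.size()) ++bi[{text[i], text[i + 1]}];
    }

    double n  = static_cast<double>(text.size());
    double nb = text.size() > 1 ? static_cast<double>(text.size() - 1) : 1.0;
    for (const auto& [bg, f] : bi) {
        double pxy = f / nb;
        double px  = uni[bg.first] / n, py = uni[bg.second] / n;
        double mi  = std::log2(pxy / (px * py));              // mutual information
        std::cout << std::hex << static_cast<unsigned>(bg.first) << ' '
                  << static_cast<unsigned>(bg.second) << std::dec
                  << '\t' << f << '\t' << mi << '\n';         // code points, CF and MI
    }
}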
<Paragraph position="2"> The user can select characters by their frequencies (i.e. the -f and -g options), the top or bottom N% (i.e. the -m and -n options), their ranks (i.e. the -r and -s options) and by frequencies above the mean plus two standard deviations (Smadja, 1993) (i.e. the -z option). By default, the extracted bigrams have frequencies above unity, but this can be overridden using the -b option. The stored bigrams can be sorted according to their frequencies or their mutual information in descending order using bigram.exe and miana.exe, respectively. The sorted bigrams are stored in bifile.rnk or mifile.rnk. The user can select different bigrams using the options available for the exseg programs (i.e. the -f, -g, -m, -n, -r, -s and -z options). Both programs give the frequency distribution of the bigram frequencies and of the log of their frequencies. The selected bigrams are useful for detecting compound nouns or for word segmentation (Zhang et al., 1992).</Paragraph>
<Paragraph position="3"> Given text segmented by "/" markers (space markers can be converted using norm.exe), worddh.exe can extract all the different words from the text and compute word frequencies. The program extracted 42,613 words from the PH corpus. There is no limit to the number of different words that it can extract, but it needs some disk space to hold temporary files. The extracted words are stored in words.lst, and they are sorted in descending order of frequency using wranka.exe. In addition, wlranka.exe sorts the extracted words firstly by word length and secondly by their frequencies. This is particularly useful for examining compound nouns, technical terms and translated words, as they tend to be long. Furthermore, the segmentation program, whash.exe, needs the words to be ordered by their length.</Paragraph>
<Paragraph position="4"> III. WORD SEGMENTATION PROGRAMS Unlike English, Chinese words are not delimited by markers such as spaces, although Chinese clauses are easily identified (Fig 1). Many segmentation programs have been proposed (Chiang et al., 1993; Fan and Tsai, 1988). We have re-implemented the maximal-matching technique (Kit et al., 1989) using a word list, L, because it is simple to program and achieves one of the best segmentation performances (1-2% error rate). However, the segmentation accuracy degrades significantly (to a 15% error rate in (Luk, 1993)) when the text has many compound nouns and technical terms, since the accuracy depends on the coverage of L. A word segmentation program using bigrams, as well as one combining bigrams and maximal-matching, was subsequently developed.</Paragraph>
<Paragraph position="5"> The basic idea of maximal-matching is to match the input clause from left to right against entries in the given word list, L. If there is more than one match, the longest entry is selected. The process then iterates on the remainder of the clause, after the matched longest entry has been removed, until the whole clause has been matched. Apart from maximal-matching, whash.exe divides and outputs the text in the clause format (Fig 2). The file that holds the word list can be specified using the -b option and the text using the -t option. The word list should rank the words, firstly, by their length in descending order (use wlranka) and, secondly, by their frequencies. Usually, the segmented clauses are displayed on the screen for visual inspection, after which the output can be redirected using the > operator (MS-DOS 5.0). The current whash.exe program can hold around 20,000 Chinese words in main memory for segmentation, but this is not large enough for a general Chinese dictionary (Fu, 1987), which has about 54,000 entries.</Paragraph>
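A minimal sketch of the maximal-matching step described above is given below; it is an illustration rather than the authors' whash.exe (which also handles the two-byte encoding and the clause format), and the word list and clause in main() are toy data.

#include <algorithm>
#include <iostream>
#include <set>
#include <string>
#include <vector>

// Segment one clause by repeatedly taking the longest prefix found in the
// word list L; unmatched characters fall back to single-character words.
std::vector<std::u32string> maximal_match(const std::u32string& clause,
                                          const std::set<std::u32string>& L,
                                          std::size_t max_word_len = 8) {
    std::vector<std::u32string> words;
    std::size_t i = 0;
    while (i < clause.size()) {
        std::size_t best  = 1;                                   // single-character fallback
        std::size_t limit = std::min(max_word_len, clause.size() - i);
        for (std::size_t len = limit; len >= 2; --len)           // longest match first
            if (L.count(clause.substr(i, len))) { best = len; break; }
        words.push_back(clause.substr(i, best));
        i += best;
    }
    return words;
}

int main() {
    std::set<std::u32string> L = {U"中文", U"分詞", U"中文分詞", U"程式"};  // toy word list
    for (const auto& w : maximal_match(U"中文分詞程式", L))
        std::cout << w.size() << "-character word\n";            // prints a 4- and a 2-character word
}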
<Paragraph position="6"> The bigram technique does not need any dictionary for segmentation. This technique needs a set of bigrams extracted from the text or from a general corpus. Typically, the top 10% of the bigrams are captured and ranked according to their co-occurrence frequencies (CF) or mutual information (MI). This is due to the fact that if the distributions of CF and MI are normal, then the top 10% corresponds to the 10% significance level. The distribution of MI typically does appear normal, but that of CF does not. The top N% of bigrams are stored in either bifile.cut or mifile.cut. The bigram segmentation program, bisegl.exe, loads the bigrams using the -b option. A segmentation marker is placed between two characters in the text if the bigram of these two adjacent characters does not appear in bifile.cut or mifile.cut. This segmentation is the same as performing nearest-neighbour clustering of substrings (Luk, 1993). The program detected many non-words, depending on N.</Paragraph>
<Paragraph position="7"> However, the number of non-words is significantly reduced if we restrict examination to only the top N% (say 10) of the frequently occurring words.</Paragraph>
<Paragraph position="8"> Both the maximal-matching and bigram techniques were combined, in order to detect words not in the word list and to reduce the number of non-words detected (Luk, 1993). Maximal-matching is carried out first, and the bigram technique is then used to combine consecutive single-character words in the segmented text, since words not in L are usually broken into smaller pieces by maximal-matching. The test data show that the combined technique reduced the error rate by 33% and detected 33% of the desired words not in L. The combined technique is written as a batch program as follows:</Paragraph>
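The batch listing itself did not survive in this extract, so it is not reproduced here. As an illustration of the combining step only (not the authors' code and not the missing listing), the following sketch merges consecutive single-character words left by maximal-matching whenever the character bigram they form is among the selected frequent bigrams; the segmented words and the frequent-bigram set in main() are hypothetical toy data.

#include <iostream>
#include <set>
#include <string>
#include <vector>

// After maximal-matching, merge two consecutive single-character words when
// the character bigram they form is among the frequent bigrams (the role
// played by bifile.cut / mifile.cut). Only pairs are merged in this sketch.
std::vector<std::u32string> combine_singles(const std::vector<std::u32string>& seg,
                                            const std::set<std::u32string>& frequent_bigrams) {
    std::vector<std::u32string> out;
    for (const auto& w : seg) {
        if (!out.empty() && out.back().size() == 1 && w.size() == 1 &&
            frequent_bigrams.count(out.back() + w))
            out.back() += w;                  // combine the two single-character words
        else
            out.push_back(w);
    }
    return out;
}

int main() {
    // Toy maximal-matching output for "香港基本法": "香", "港", "基本", "法".
    std::vector<std::u32string> seg = {U"香", U"港", U"基本", U"法"};
    std::set<std::u32string> top = {U"香港"};                 // hypothetical frequent bigram
    for (const auto& w : combine_singles(seg, top))
        std::cout << w.size() << ' ';                         // prints "2 2 1"
    std::cout << '\n';
}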
</Section>
<Section position="5" start_page="585" end_page="585" type="metho"> <SectionTitle> IV. CONCORDANCE PROGRAMS </SectionTitle>
<Paragraph position="0"> We modified the concordance program by Guo and Liu (1992), since that program assumed that the main memory can hold the entire corpus or text. Instead, the modified program loads a portion called a page into main memory and performs matching to find the appropriate contexts. The page size can be changed using the -p option, but we found that the program operates well at -p 10000 (which is the default size).</Paragraph>
<Paragraph position="1"> The modified programs, kwic.exe and kwicw.exe, can process files of size just over 2G bytes, which is much bigger than the hard disk.</Paragraph>
<Paragraph position="3"> Note that the line numbers are in the left-most positions and the keyword is delimited by "<" and ">".</Paragraph>
<Paragraph position="4"> A keyword file must be specified using the -k option, and each keyword should be terminated by "/".</Paragraph>
<Paragraph position="5"> The number of characters in the left and right contexts can be specified in bytes, using the -l and -r options respectively. If -n 0 is specified, then line numbers will appear on the left. There are additional options for indexing in the original concordance programs, but these options are not important in the current implementation.</Paragraph>
<Paragraph position="6"> The kwicw.exe program deals with segmented text. Here, the -l and -r options specify the number of words in the left and right contexts.</Paragraph>
<Paragraph position="7"> The length of each context (approx. 1000 characters allocated) can hold 20 words, assuming that each word has 24 characters.</Paragraph> </Section>
<Section position="6" start_page="585" end_page="586" type="metho"> <SectionTitle> V. COMPILING COLLOCATION STATISTICS </SectionTitle>
<Paragraph position="0"> Collocation statistics (Fig 4) refer to the frequencies of the different words or characters at different positions in the contexts of a keyword. These frequencies are useful for detecting significant collocations in English, but they are tedious and error-prone to compile by hand. We have also written programs to compile these statistics for Chinese, but factorial analysis (Biber, 1993) still remains to be implemented.</Paragraph>
<Paragraph position="1"> Chinese concordancing is carried out first to extract the relevant contexts. The output of the concordancing should be stored in kwic.lst. Then, cxtract.exe will extract all the different words in the contexts, using an FSM to decode the kwic format. The program sorts these words according to their frequency of occurrence in the context. The different words are stored in cxtract.crk, and the user can select candidates using options as in exsega.exe. Next, cxstat.exe compiles the frequencies of these different words at different positions in the contexts. The statistics are stored in cxtract.sta. For segmented text, kwicw.exe, cxtractw.exe and cxstatw.exe are used instead.</Paragraph>
<Paragraph position="2"> In Fig 4, the words found in the contexts are displayed on the left, and the square brackets show the frequency of occurrence in the context of the keyword. The angle brackets indicate the position of the keyword.</Paragraph>
<Paragraph position="3"> Unlike Smadja (1993), the keyword may be part of a Chinese word. Thus, the program can compile statistics about different prefixes, suffixes or stems of a Chinese word. This is particularly interesting for investigating translated terms and compound nouns.</Paragraph> </Section> </Paper>