File Information

File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/abstr/94/c94-1088_abstr.xml

Size: 12,260 bytes

Last Modified: 2025-10-06 13:48:04

<?xml version="1.0" standalone="yes"?>
<Paper uid="C94-1088">
  <Title>Character-based Collocation for Mandarin Chinese</Title>
  <Section position="2" start_page="0" end_page="542" type="abstr">
    <SectionTitle>
PROJECT NOTE: LARGE TEXT CORPORA II. Background: Corpus and Computational Platform
I. Introduction
</SectionTitle>
    <Paragraph position="0"> Collocation has been established as an essential tool in computational linguistics (Church and Mercer 1993). In addition, various collocational programs have proven to be indispensable in the automatic acquisition of lexical information (e.g. Sinclair 1991, and Biber 1993).</Paragraph>
    <Paragraph position="1"> Since words are the natural and undisputed units in available text corpora, virtually all the current collocational programs are word-based. However, there are languages where texts do not conventionally mark words, such as Chinese. Unless a large tagged corpus is available, a word-based collocation system in these languages faces the following inevitable difficulties.</Paragraph>
    <Paragraph position="2"> First, hand-segmentation of a large corpus is tedious and financially nearly impossible. Second, automatic segmentation programs can neither identify words not listed in the lexicon nor correctly segment all those which are listed. Third, estimation of lexical probability relies on word-frequency counts based on the inaccurate results of automatic segmentation; thus the deviation tends to be greater than standard tolerance.</Paragraph>
    <Paragraph position="3"> Text corpora without wordbreaks, nevertheless, also have their advantages. Take Chinese for example: the basic units of text corpora are zi4 'character', a fairly faithful representation of the morphemic level of the language. In other words, if we take Chinese text This collocation system is developed on the 20 million character modern Chinese corpus at Academia Sinica (Huang and Chen 1992, Huang In Press). This corpus is composed mostly of newspaper texts. It is estimated to have 14 million words. Following the industrial standard in Taiwan, our collocation system can deal with any corpora encoded in BIG-5 code. The program is developed under a UNIX environment on an HP workstation. It should, however, be portable to any UNIX machine with a compatible Chinese solution. The collocation system is currently used in research by more than 10 linguists affiliated with the Chinese Knowledge Information Processing (CKIP) group at Academia Sinica. It is also open to any visiting scholar for on-site use.</Paragraph>
    <Paragraph position="4"> III. Overall Design of the System There are two major modules in the collocation system: one deals directly with unsegmented texts and the other incorporates automatic segmentation before collocation. The two modules share the pre-process of the KWIC search module, which allows user-specified linguistic patterns (Huang and Chen 1992). They also share three common routines to detect character collocation, to identify possible collocation words through N-grams, and to contextually filter texts with user-specified strings.</Paragraph>
    <Paragraph position="5">  The overall design of the system is schematically represented in diagram 1.</Paragraph>
    <Paragraph position="6">  There are three collocational tools available in this system without segmenting the texts into words. First, character collocation allows automatic acquisition of sub-lexical information, such as the conditions on morpho-lexical rules. This is attested by the studies on the notion of word in the mental lexicon reported in Huang et al. (1993), and the generalizations of productive derivational rules in Mandarin offered in Hong et al. (1992). Take note that when applying KWIC search to the corpus, a user has the freedom to specify a key that is a single character, a multi-character string, or even a discontinuous string of characters. These character strings may or may not be words. Thus the extracted collocational relation is not simply between characters. It can also be between characters and either a simplex word, a compound, or a phrase. The collocational relation in our system is measured and represented by both Mutual Information (Church and Hanks 1990) and frequency. The user can choose to sort and rank the collocates by either criterion. S/he can also specify a threshold value for either criterion. Usually, the most effective method is to use a frequency threshold and Mutual Information ranking (Huang In Press). In addition to the measures of correlation, the distribution of the collocates is also indicated in terms of positions relative to the key and frequency of occurrences at each position for each collocating character.</Paragraph>
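The character collocation step described above can be illustrated with a minimal sketch. It is not the authors' program: the function name `char_collocates`, its parameters, and the choice to normalise counts by corpus length are assumptions made for illustration. It applies a frequency threshold first, then ranks the surviving collocates by Mutual Information, and also records how often each collocate appears at each position relative to the key.

```python
import math
from collections import Counter


def char_collocates(text, key, window=5, min_freq=2):
    """Rank characters found near `key` by mutual information (MI).

    Returns (ranked, position_freq): `ranked` is a list of
    (character, co-occurrence frequency, MI) tuples, and `position_freq`
    maps (character, offset relative to the key) to a count.
    """
    n = len(text)
    char_freq = Counter(text)
    # KWIC step: every start index where the key string occurs.
    starts = [i for i in range(n - len(key) + 1) if text.startswith(key, i)]
    pair_freq = Counter()
    position_freq = Counter()
    for s in starts:
        for off in range(1, window + 1):
            left, right = s - off, s + len(key) - 1 + off
            if left >= 0:
                pair_freq[text[left]] += 1
                position_freq[(text[left], -off)] += 1
            if n > right:
                pair_freq[text[right]] += 1
                position_freq[(text[right], off)] += 1
    ranked = []
    for ch, f_xy in pair_freq.items():
        if f_xy >= min_freq:  # frequency threshold first
            # MI = log2(P(x, y) / (P(x) * P(y))), all counts normalised by n
            mi = math.log2(f_xy * n / (len(starts) * char_freq[ch]))
            ranked.append((ch, f_xy, mi))
    ranked.sort(key=lambda item: (-item[2], item[0]))  # then rank by MI
    return ranked, position_freq
```

Because the key is matched as a plain string, it may be a single character or a multi-character string; a discontinuous key would need a pattern matcher, which this sketch leaves out.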
    <Paragraph position="7"> Second, lexical information can also be derived from this collocational system regardless of its lack of demarcation of lexical items. This is achieved through a simple Markov model. Once the KWIC search has extracted the relevant contexts, a simple N-gram routine can be performed on the context(s) specified by the user. Depending on the purpose of the study and the size of relevant texts, the length of the target sequence as well as the threshold number can be specified. For instance, a linguist may want to look for all two or three character sequences that occur over 5 times after a key verb. This would likely turn out a list of possible arguments (i.e.</Paragraph>
    <Paragraph position="8"> syntactic words) for that verb. Hence lexical information such as the semantic restriction of the predicates on its post-arguments can be indirectly extracted. In our system, the user is allowed to iterate the N-gram search by designating different contexts and string length (N). The following is an example of collocation without segmentation. Huang et al. (1994) argue that Mandarin light verbs select the verbs they nominalize. This is supported by the N-gram collocation results in diagram 2. The collocation is extracted from a 20 million character corpus and the collocation window is 5 characters to the right of the key word. It shows that the verb jin4xing2 typically nominalizes a process verb.</Paragraph>
    <Paragraph position="9">  Last, the user can specify a character string in the context as a filter. The most useful application is to specify a string that forms a syntactic word. This is a technique commonly used to resolve categorical or sense ambiguities. Combining both N-gram search and string filtering, frequency-based word collocation is achieved without segmentation.</Paragraph>
    <Paragraph position="10"> V. Collocation After Segmentation When lexical or phrasal relation is the focus of the study, the above collocation module may sometimes be inadequate. In this case, we will need to apply the automatic segmentation/tagging program such that we can acquire information involving word pairs as well as grammatical categories. The automatic segmentation procedure is a revised version of the program reported in Chen and Liu (1992). The on-line lexicon is the CKIP lexicon of more than 80 thousand entries (Chen 1994).</Paragraph>
    <Paragraph position="11"> We did not automatically segment and tag the whole corpus for very good reasons. First, without a correctly tagged corpus, no statistically-based tagger can perform satisfactorily yet.</Paragraph>
    <Paragraph position="12"> Second, there is no practical way to recover incorrectly identified words. That is, when the automatic tagger takes a character from a target word to form an inappropriate word with a neighboring character, that target word is lost and cannot be identified in this context. Thus, it will be linguistically more felicitous to allow KWIC to identify all matching strings and allow filtering of incorrectly matched words in later steps.</Paragraph>
    <Paragraph position="13"> Last, segmented texts restrict the available collocation information exclusively to the word level. For instance, not only will morpheme-morpheme collocation not be available, neither can correlations between a morpheme and a word be extracted.</Paragraph>
    <Paragraph position="14"> In contrast, when optional segmentation is performed on-line on the result of KWIC search, the collocational system can be applied to any electronic text corpora with minimal pre-processing. This current approach also allows us to mix sub-lexical, lexical, and extra-lexical conditions according to our research need.</Paragraph>
    <Paragraph position="15"> Even though the post-segmentation module shares three routines with the module without segmentation, they do differ non-trivially in their applications. First, the character collocation module is basically the same. The additional step of segmentation excludes accidental string matches. For instance, with qu4shi4 'to pass away' as the keyword, KWIC may extract the incorrect context 'ta1 qu4 shi4jie4 ge4 di4 lu3 xing2'. This error in identifying word boundaries can be easily avoided when the text is correctly segmented. In this case, the correct segmentation is 'ta1 qu4 shi4jie4 ge4di4 lu3xing2 (s/he go world everywhere travel)'. Second, N-gram in this module now can include both sequences of characters and sequences of words.</Paragraph>
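How segmentation excludes an accidental string match can be shown with a small sketch. It assumes the segmenter returns a list of words and accepts a key match only when both of its ends fall on word boundaries; the function name `aligned_matches` is hypothetical, and romanized syllables stand in for Chinese characters.

```python
def aligned_matches(words, key):
    """Return start offsets where `key` matches on word boundaries."""
    text = "".join(words)
    # Collect every offset at which a word starts or ends.
    boundaries = {0}
    offset = 0
    for w in words:
        offset += len(w)
        boundaries.add(offset)
    hits = []
    start = text.find(key)
    while start != -1:
        # Keep the match only if it neither starts nor ends inside a word.
        if start in boundaries and start + len(key) in boundaries:
            hits.append(start)
        start = text.find(key, start + 1)
    return hits
```

With the segmentation from the paragraph above, `aligned_matches(["ta1", "qu4", "shi4jie4", "ge4di4", "lu3xing2"], "qu4shi4")` returns an empty list, because the match would end inside shi4jie4. For a made-up context such as `["ta1", "qu4shi4", "le5"]`, the same key is accepted at offset 3.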
    <Paragraph position="16"> Two additional tools directly utilize grammatical tags.</Paragraph>
    <Paragraph position="17"> The first one is the computing of the distribution of grammatical categories in the context. The second is contextual filtering in terms of grammatical categories. One caution needs to be mentioned here. As mentioned earlier, we do not have a highly reliable automatic tagger yet because there is no dependably tagged large Chinese corpus. Hence our automatic segmentation program looks up the categories of the words but does not attempt to resolve ambiguity. Since categorically ambiguous words make up only around 20% of the texts (Chen and Liu 1992, Chen et al. In Preparation), keeping all possible tags seems to be an acceptable compromise for the moment. But this also means that a user must be on the lookout for possible errors caused by multiple tags. Our system allows the user to view the categorical distribution of the whole context, as well as to focus on a smaller context and specific categories. Diagram 3 shows the categorical collocation of the head of the post-verbal argument of huo4de2 'to get/receive.' We obtained this information by first performing the discontinuous KWIC on huo4de2 and the relative clause head marker de. After segmentation and collocation, we restrict the display to the first position to the right of de, and to the two major categories of N and V. The result shows that this verb typically takes subclasses of common noun and (nominalized) transitive verbs as arguments.</Paragraph>
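The categorical distribution described above, with every tag of an ambiguous word retained, might be computed along the following lines. This is a sketch under stated assumptions: the lexicon is modelled as a plain dictionary from word to a list of category labels, the function name `category_distribution` and the label "UNKNOWN" are hypothetical, and the coarse labels N and V stand in for the finer subclasses of the actual lexicon.

```python
from collections import Counter


def category_distribution(contexts, lexicon, marker, offset=1, keep=None):
    """Count the categories of the word `offset` positions from `marker`.

    `contexts` is a list of segmented contexts (lists of words).
    An ambiguous word adds one count to each of its categories,
    because no disambiguation is attempted.
    """
    dist = Counter()
    for words in contexts:
        for i, w in enumerate(words):
            if w != marker:
                continue
            j = i + offset
            if j >= 0 and len(words) > j:
                # Words missing from the lexicon get a placeholder label.
                for tag in lexicon.get(words[j], ["UNKNOWN"]):
                    if keep is None or tag in keep:
                        dist[tag] += 1
    return dist
```

Since an ambiguous word is counted once per category, the totals can exceed the number of contexts; this is the kind of distortion from multiple tags that the user has to watch for.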
    <Paragraph position="18">  Last, the word-based collocation system is the part of our system that will take the most processing capacity.</Paragraph>
    <Paragraph position="19"> This is also the only part of our system that is still being tested at this moment. Word frequencies of our corpus have already been calculated and stored. The automatically segmented word-based collocation module should be available for linguistic research within weeks.</Paragraph>
    <Paragraph position="20"> VI. Conclusion In this paper, we described a collocation system that works on text corpora without word marks. This system has the advantage of extracting sub-lexical information. This is also particularly useful in studying Chinese language corpora since sociological words are distinct from syntactic words in Chinese (Chao 1968). Thus in linguistic and literary computing, it is often necessary to formulate generalizations based on zi4, the sociological word. The techniques reported in this paper should also find applications in two aspects of future computational linguistic research. First, they can be applied to other language text corpora for extraction of sub-lexical collocation. Second, they can be applied to text corpora which do not come with clear word demarcation, including corpora in languages in which sociological words and syntactic words do not coincide and spoken corpora.</Paragraph>
  </Section>
</Paper>