<?xml version="1.0" standalone="yes"?> <Paper uid="W06-2403"> <Title>Automatic Extraction of Chinese Multiword Expressions with a Statistical Tool</Title> <Section position="4" start_page="18" end_page="21" type="metho"> <SectionTitle> 3 Automatic Identification and Extraction of Chinese MWEs </SectionTitle> <Paragraph position="0"> In order to test the feasibility of automatic identification and extraction of Chinese MWEs on a large scale, we used an existing statistical tool built for English and a Chinese corpus built at CCID. A CCID tool was used for tokenising and POS-tagging the Chinese corpus, and the result was thoroughly checked by hand by Chinese experts at CCID. In this paper, we aim to evaluate this existing tool from two perspectives: a) its performance on MWE extraction, and b) its performance on a language other than English. In the following sections, we describe our experiment in detail and discuss the main issues that arose during it.</Paragraph> <Section position="1" start_page="18" end_page="19" type="sub_section"> <SectionTitle> 3.1 MWE extraction tool </SectionTitle> <Paragraph position="0"> The tool we used for the experiment exploits statistical collocational information between near-context words (Piao et al., 2005). It first collects collocates within a given scanning window, and then searches for MWEs using the collocational information as a statistical dictionary. For a reasonably large corpus, the collocational information can be extracted on the fly from the corpus being processed, so the process is fully automatic. To search for MWEs in a small amount of text, such as a few sentences, the tool needs to be trained on other corpus data in advance.</Paragraph> <Paragraph position="1"> With regard to the statistical measure of collocation, several formulae are available, including mutual information and log-likelihood. Our past experience shows that log-likelihood provides an efficient metric for corpus data of moderate size, so it is used in our experiment. It is calculated as follows (Scott, 2001).</Paragraph> <Paragraph position="2"> For a given pair of words X and Y and a search window W, let a be the number of windows in which X and Y co-occur, let b be the number of windows in which only X occurs, let c be the number of windows in which only Y occurs, and let d be the number of windows in which neither of them occurs; then</Paragraph> <Paragraph position="3"> $LL = 2\big(a\ln a + b\ln b + c\ln c + d\ln d - (a+b)\ln(a+b) - (a+c)\ln(a+c) - (b+d)\ln(b+d) - (c+d)\ln(c+d) + N\ln N\big)$, where $N = a+b+c+d$.</Paragraph> <Paragraph position="4"> In addition to the log-likelihood, the t-score is used to filter out insignificant co-occurring word pairs (Fung and Church, 1994). It is calculated as follows: $t = \frac{a - (a+b)(a+c)/N}{\sqrt{a}}$. In order to filter out weak collocates, a threshold is often used, i.e. in the collocation extraction stage, any pair of items producing a word affinity score lower than a given threshold is excluded from the MWE searching process. Furthermore, in order to avoid the noise caused by function words and some extremely frequent words, a stop word list is used to filter such words out of the process.</Paragraph>
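To make these two measures concrete, the sketch below computes them directly from the window counts a, b, c and d defined above. It is a minimal illustration in Python; the function names and the use of natural logarithms are our own choices rather than details taken from the tool.

```python
import math

def log_likelihood(a, b, c, d):
    """Log-likelihood for a 2x2 contingency table of window counts:
    a = windows containing X and Y, b = X only, c = Y only, d = neither."""
    n = a + b + c + d
    def xlogx(x):
        # x * ln(x), using the convention 0 * ln(0) = 0
        return x * math.log(x) if x > 0 else 0.0
    return 2 * (xlogx(a) + xlogx(b) + xlogx(c) + xlogx(d)
                - xlogx(a + b) - xlogx(a + c)
                - xlogx(b + d) - xlogx(c + d)
                + xlogx(n))

def t_score(a, b, c, d):
    """t-score comparing the observed joint count a with the count expected
    if X and Y were independent; used to discard insignificant pairs."""
    n = a + b + c + d
    expected = (a + b) * (a + c) / n
    return (a - expected) / math.sqrt(a) if a > 0 else 0.0

# A strongly associated pair scores far above the thresholds discussed below.
print(round(log_likelihood(50, 200, 150, 99600), 2))
print(round(t_score(50, 200, 150, 99600), 2))
```

Pairs scoring below the chosen thresholds are simply dropped from the statistical dictionary before the MWE search begins.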
<Paragraph position="5"> If the corpus data is POS-tagged, some simple POS patterns can be used to filter certain syntactic patterns from the candidates. This can be implemented either as an internal part of the process or as a post-process. In our case, such pattern filters are mostly applied to the output of the MWE searching tool, in order to keep the tool as language-independent as possible.</Paragraph> <Paragraph position="6"> Consequently, for our experiment, the major adjustment to the tool was to add a Chinese stop word list. Because the tool is based on Unicode, the stop words of different languages can be kept in a single file, avoiding any need to adjust the program itself. Unless the languages involved happen to share words with the same form, this practice is safe and reliable. In our particular case, English and Chinese use widely different character sets, so the practice works well.</Paragraph> <Paragraph position="7"> Another language-specific adjustment was the use of a Chinese POS-pattern filter for selecting various patterns of candidate MWEs (see Table 6). As pointed out previously, it was implemented as a simple pattern-matching program that is separate from the MWE tool itself, hence minimising the modification needed to port the tool from English to Chinese. A major advantage of this tool is its capability of identifying MWEs of various lengths which are generally representative of the given topic or domain. Furthermore, for English it was found effective in extracting domain-specific multi-word terms and expressions which are not included in manually compiled lexicons and dictionaries. Indeed, due to the open-ended nature of such MWEs, any manually compiled lexicon, however large it may be, is unlikely to cover them exhaustively. The tool is also efficient in finding newly emerging MWEs, particularly technical terms, that reflect changes in the real world.</Paragraph> </Section> <Section position="2" start_page="19" end_page="20" type="sub_section"> <SectionTitle> 3.2 Experiment </SectionTitle> <Paragraph position="0"> In this experiment, our main aim was to examine the feasibility of the practical application of the MWE tool as a component of an MT system; we therefore used test data from domains in which translation services are in strong demand. We selected Chinese corpus data of approximately 696,000 tokenised words (including punctuation marks) covering the topics of food, transportation, tourism, sports (including the Olympics) and business.</Paragraph> <Paragraph position="1"> In our experiment, we processed the texts from the different topics together. These topics are related to each other under the themes of entertainment and business. We therefore assume that, by mixing the data together, we can examine the performance of the MWE tool on data from a broad range of related domains. We expect that the different features of texts from different domains will have a certain impact on the result, but the examination of such impact is beyond the scope of this paper.</Paragraph> <Paragraph position="2"> As mentioned earlier, the Chinese word tokeniser and POS tagger used in our experiment was developed at CCID. It is an efficient tool, with an accuracy of 98% for word tokenisation and 95% for POS annotation. It employs a part-of-speech tagset of 15 categories, shown in Table 2. Although this is not a finely grained tagset, it meets the need for creating POS pattern filters for MWE extraction.</Paragraph>
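To illustrate how even a coarse tagset can drive such pattern filters, the sketch below checks a candidate's tag sequence against a few patterns of the kind used later in the evaluation (adjective + noun, noun + noun, adverb + verb). The tag symbols (a, n, f, v) and the candidate format are assumptions made for the example, not the actual CCID tagset or the tool's output format.

```python
import re

# Assumed tag symbols: n = noun, a = adjective, f = adverb, v = verb.
PATTERNS = {
    "AN": re.compile(r"^a n$"),     # adjective + noun
    "NN": re.compile(r"^n( n)+$"),  # noun + noun (two or more nouns)
    "FV": re.compile(r"^f v$"),     # adverb + verb
}

def match_pattern(candidate):
    """candidate: list of (word, tag) pairs for one MWE candidate.
    Returns the name of the first matching pattern, or None."""
    tag_sequence = " ".join(tag for _, tag in candidate)
    for name, pattern in PATTERNS.items():
        if pattern.match(tag_sequence):
            return name
    return None

# A noun + noun candidate such as Wang Qiu Yun Dong (tennis sport) is kept.
print(match_pattern([("Wang Qiu", "n"), ("Yun Dong", "n")]))  # -> "NN"
```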
<Paragraph position="3"> Since function words were found to cause noise in the process of MWE identification, a Chinese stop word list was compiled. First, a word frequency list was extracted. Next, the top items were examined and 70 closed-class words were selected for the stop word list. When the program searches for MWEs, these words are ignored.</Paragraph> <Paragraph position="4"> The threshold of word affinity strength is another issue to be addressed. In this experiment, we used log-likelihood to measure the strength of collocation between word pairs. Generally, a log-likelihood score of 6.6 (p < 0.01, or 99% confidence) is recommended as the threshold (Rayson et al., 2004), but it was found to produce too many false candidates in our case. Based on our initial trials, we used a higher threshold of 30, i.e. any word pair producing a log-likelihood score lower than this value is ignored in the MWE searching process. Furthermore, for the sake of the reliability of the statistical score, a frequency threshold of five was used when extracting collocates to filter out low-frequency pairs, i.e. word pairs with frequencies lower than five were ignored.</Paragraph> <Paragraph position="5"> An interesting issue for us in this experiment is the impact of the length of the collocation search window on MWE identification. For this purpose, we tested two search window lengths, 2 and 3, and compared the results obtained with them. Our initial hypothesis was that the shorter window length would produce higher precision, while the longer window length would sacrifice precision but boost MWE coverage.</Paragraph>
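Putting these settings together, a minimal sketch of the collocation-extraction step might look as follows. Only the window lengths (2 and 3), the log-likelihood threshold of 30, the pair-frequency threshold of 5 and the stop word list come from the experiment described above; the class and function names, and the exact windowing scheme, are assumptions made for illustration.

```python
from collections import Counter
from dataclasses import dataclass, field

@dataclass
class CollocationConfig:
    window_length: int = 2          # 2 or 3, the two settings compared here
    ll_threshold: float = 30.0      # raised from the usual 6.6 (p < 0.01)
    min_pair_freq: int = 5          # discard low-frequency word pairs
    stop_words: set = field(default_factory=set)  # e.g. the 70 closed-class words

def extract_collocates(tokens, cfg, ll_score):
    """tokens: the tokenised corpus as a list of words.
    ll_score: a callable returning the log-likelihood for a word pair
    (e.g. built on the window counts a, b, c, d shown earlier).
    Returns the 'statistical dictionary' of retained collocate pairs."""
    pair_freq = Counter()
    for i, w in enumerate(tokens):
        if w in cfg.stop_words:
            continue
        # pair w with the words following it inside the search window
        for v in tokens[i + 1 : i + cfg.window_length]:
            if v not in cfg.stop_words:
                pair_freq[(w, v)] += 1
    retained = {}
    for pair, freq in pair_freq.items():
        if freq < cfg.min_pair_freq:
            continue
        score = ll_score(pair)
        if score >= cfg.ll_threshold:
            retained[pair] = score
    return retained
```

The retained pairs then act as the statistical dictionary against which longer word sequences are searched for MWEs.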
<Paragraph position="6"> The output of the tool was manually checked by Chinese experts at CCID, including cross-checking to guarantee the reliability of the results. There were some MWE candidates on which disagreements arose; in such cases, the candidate was counted as false. Furthermore, in order to estimate the recall, the experts manually identified MWEs in the whole test corpus, so that the output of the automatic tool could be compared against it. In the following section, we present a detailed report on our evaluation of the MWE tool.</Paragraph> </Section> <Section position="3" start_page="20" end_page="21" type="sub_section"> <SectionTitle> 3.3 Evaluation </SectionTitle> <Paragraph position="0"> We first evaluated the overall precision of the tool. A total of 7,142 MWE candidates (types) were obtained for a window length of 2, of which 4,915 were accepted as true MWEs, giving a precision of 68.82%. On the other hand, a total of 8,123 MWE candidates (types) were obtained for a window length of 3, of which 4,968 were accepted as true MWEs, giving a precision of 61.16%. This result agrees with our hypothesis that a shorter search window tends to produce higher precision.</Paragraph> <Paragraph position="1"> Next, we estimated the recall based on the manually analysed data. When we compared the accepted MWEs from the automatic output against the manually collected ones, we found that the experts tend to mark longer MWEs, which often contain the items identified by the automatic tool. For example, the manually marked MWE Wang Qiu Yun Dong Fa Zhan Ji Hua (development plan for the tennis sport) contains the shorter MWEs Wang Qiu Yun Dong (tennis sport) and Fa Zhan Ji Hua (development plan), which were identified by the tool separately. We therefore decided to take such partial matches into account when estimating the recall. A total of 14,045 MWEs were manually identified; with the search window length set to two and to three, 1,988 and 2,044 of them respectively matched the automatic output, producing recalls of 14.15% and 14.55%. It should be noted that many of the manually accepted MWEs from the automatic output were not found in the manual MWE collection. This discrepancy most likely arose because the manual analysis was carried out independently of the automatic tool, and it results in a lower recall than expected. Table 3 lists the precisions and recalls.</Paragraph> <Paragraph position="2"> Furthermore, we evaluated the performance of the MWE tool from two further aspects: frequency and MWE pattern.</Paragraph> <Paragraph position="3"> Generally speaking, a statistical algorithm works better on items of higher frequency, as it depends on the collocational information. However, our tool does not select MWEs directly from the collocates. Rather, it uses the collocational information as a statistical dictionary and searches for word sequences whose constituent words have significantly strong collocational bonds between them. As a result, it is capable of identifying many low-frequency MWEs. Table 4 lists the breakdown of the precision for five frequency bands (window length = 2).</Paragraph> <Paragraph position="5"> As shown in Table 4, the highest precisions were obtained for the frequency range between 3 and 99. However, 2,082 of the accepted MWEs have frequencies of one or two, accounting for 42.36% of the total accepted MWEs. Such a result demonstrates again that our tool is capable of identifying low-frequency items. An interesting result concerns the top frequency band (greater than 100). Against our general assumption that higher frequency brings higher precision, this band shows the lowest precision in the table. Our manual examination reveals that this was caused by high-frequency numbers, such as &quot;one&quot; or &quot;two&quot; in the expressions &quot;[?] Ge&quot; (a/one) and &quot;[?] Chong&quot; (a kind of). This type of expression was classified as uninteresting in the manual checking, resulting in higher error rates for the high-frequency band.</Paragraph> <Paragraph position="6"> When we carried out a parallel evaluation for a search window length of 3, we saw a similar distribution of precision across the frequency bands, except that the lowest frequency band has the lowest precision, as shown in Table 5. When we compare this table against Table 4, we can see that, for all of the frequency bands except the top one, the precision drops as the search window increases. This further supports our earlier assumption that a wider search window tends to reduce the precision.</Paragraph> <Paragraph position="8"> In fact, not only in the top frequency band but across the whole output, much of the error was found to be caused by numbers that frequently occur in the test data, e.g. [?] _U Ge _S (one), Liang _U Ge _S (two), etc. When a POS filter was used to remove them, we obtained, for window length 2, a total of 5,660 candidates, of which 4,386 were accepted as true MWEs, producing a precision of 77.49%. Similarly, for window length 3, a total of 6,526 candidates were extracted in this way and 4,685 of them were accepted as true MWEs, yielding a precision of 71.79%.</Paragraph>
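A sketch of the kind of post-filter that removes these numeral expressions is shown below. The tag meanings (u for numerals, s for classifiers) are inferred from the _U and _S tags in the examples above and are therefore assumptions, as are the function names and the candidate format.

```python
# Assumed tag symbols: "u" = numeral, "s" = classifier / measure word.
NUMERAL_TAG = "u"
CLASSIFIER_TAG = "s"

def is_numeral_expression(candidate):
    """candidate: list of (word, tag) pairs for one MWE candidate.
    True for two-word candidates of the form numeral + classifier,
    e.g. Liang Ge (two)."""
    tags = [tag.lower() for _, tag in candidate]
    return len(tags) == 2 and tags == [NUMERAL_TAG, CLASSIFIER_TAG]

def filter_numeral_expressions(candidates):
    return [c for c in candidates if not is_numeral_expression(c)]

# Liang Ge (two) is removed, Wang Qiu Yun Dong (tennis sport) is kept.
kept = filter_numeral_expressions([
    [("Liang", "U"), ("Ge", "S")],
    [("Wang Qiu", "N"), ("Yun Dong", "N")],
])
print(len(kept))  # -> 1
```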
<Paragraph position="9"> Another factor affecting the performance of the tool is the type of MWE. In order to examine the potential impact of MWE types on the performance of the tool, we used filters to select MWEs of the following three patterns: 1) AN: adjective + noun; 2) NN: noun + noun; 3) FV: adverb + verb.</Paragraph> <Paragraph position="10"> As shown in the table, the MWE tool achieved high precisions, above 91%, when we used a search window of two words. Even when the search window expands to three words, the tool still obtains precisions of around 90%. In particular, the tool is efficient for the verb-phrase type. Such a result demonstrates that, when we constrain the search algorithm to specific types of MWEs, we can obtain higher precision. While one may argue that a rule-based parser could do the same work, it must be noted that we are not interested in all grammatical phrases, but only in those which reflect the features of the given domain.</Paragraph> <Paragraph position="11"> This is achieved by combining statistical word collocation measures, a searching strategy and simple POS pattern filters.</Paragraph> <Paragraph position="12"> Another interesting finding in our experiment is that our tool extracted whole clauses, such as Xiang He Xie Shi Yao (What would you like to drink?) and Xian He Dian Shi Yao (Would you like a drink first?). These clauses occur only once or twice in the entire test data, but were recognised by the tool because of the strong collocational bonds between their constituent words. The significance of this is that such clauses are typical expressions frequently used in real-life conversation, in contexts such as the canteen or tourism. This capability of the tool may have practical use in automatically collecting longer typical expressions for given domains.</Paragraph> </Section> </Section> <Section position="5" start_page="21" end_page="22" type="metho"> <SectionTitle> 4 Discussion </SectionTitle> <Paragraph position="0"> As our experiment demonstrates, our tool provides a practical means of identifying and extracting domain-specific MWEs with a minimum amount of linguistic knowledge. This becomes important in multilingual tasks, for which it can be costly and time-consuming to build comprehensive rules for several languages. In particular, the tool is capable of detecting MWEs of various lengths, sometimes whole clauses, which are often typical of the given domains of the corpus data. For example, in our experiment, the tool successfully identified several long expressions of daily use in the domains of food and tourism. MT systems often suffer when translating conversation. An efficient MWE tool can potentially alleviate this problem by extracting typical clauses used in daily life and mapping them to adequate translations in the target language.</Paragraph> <Paragraph position="1"> Despite the flexibility of the statistical tool, however, there is a limit to its performance in terms of precision. While it is quite efficient in providing MWE candidates, its output has to be either verified by humans or refined using linguistic rules. In our particular case, we improved the precision of the tool by employing simple POS pattern filters. Another limitation of the tool is that it can currently only recognise continuous MWEs. A more flexible searching algorithm is needed to identify discontinuous MWEs, which are important for NLP tasks.</Paragraph> <Paragraph position="2"> Besides these technical problems, a major unresolved issue we face is what constitutes an MWE.
Despite agreement on the core MWE types, such as idioms and highly idiosyncratic expressions like Cheng Yu in Chinese, it is difficult to reach agreement on less fixed expressions.</Paragraph> <Paragraph position="3"> We contend that MWEs may have different definitions for different research purposes. For example, for dictionary compilation, lexicographers tend to constrain MWEs to highly non-compositional expressions (Moon, 1998: 18).</Paragraph> <Paragraph position="4"> This is because monolingual dictionary users can easily understand compositional MWEs, so there is no need to include them in a dictionary for native speakers. For lexicon compilation aimed at practical NLP tasks, however, a looser definition of MWEs may be applied. For example, in the Lancaster semantic lexicon (Rayson et al., 2004), compositional word groups such as &quot;youth club&quot; are treated as MWEs alongside non-compositional expressions such as &quot;food for thought&quot;, as both depict single semantic units or concepts. Furthermore, for the MT research community, whose primary concern is cross-language interpretation, any multiword unit that has stable translation equivalent(s) in a target language can be of interest.</Paragraph> <Paragraph position="5"> As we discussed earlier, a highly idiomatic expression in one language can be translated into a highly compositional expression in another language, and vice versa. In such situations, it can be more practically useful to identify and map translation equivalents between the source and target languages regardless of their level of compositionality.</Paragraph> <Paragraph position="6"> Finally, the long Chinese clauses identified by the tool can potentially be useful for improving MT systems. Most of them are colloquial expressions from daily conversation, and many such Chinese expressions are difficult to parse syntactically. It may be more feasible to identify such expressions and map them as a whole to equivalent English expressions. The same may apply to technical terms, jargon and slang. In our experiment, our tool demonstrated its capability of detecting such expressions, and it should prove useful in this regard.</Paragraph> </Section> </Paper>