<?xml version="1.0" standalone="yes"?> <Paper uid="I05-3007"> <Title>Chinese Sketch Engine and the Extraction of Grammatical Collocations</Title> <Section position="5" start_page="51" end_page="53" type="evalu"> <SectionTitle> 4. Evaluation and Future Developments </SectionTitle> <Paragraph position="0"> An important feature of the prototype of the Chinese Sketch Engine is that, in order to test the robustness of the Sketch Engine design, the original regular expression patterns were adopted with minimal modification for Chinese. Even though English and Chinese are both SVO languages with similar surface word order, they differ substantially in the assignment of grammatical functions. In addition, the Sinica tagset differs from the BNC tagset and carries much richer functional information. These are the two main directions that we will pursue in modifying and improving the Chinese Sketch Engine.</Paragraph> <Section position="1" start_page="52" end_page="52" type="sub_section"> <SectionTitle> 4.1 Word Boundary Representation </SectionTitle> <Paragraph position="0"> Word breaks are not conventionalized in Chinese texts. This poses a challenge in Chinese language processing. The Chinese Sketch Engine inserts spaces after segmentation, which helps to visualize words. In the future, it will be trivial to allow the conventional alternative of no word boundary markup. However, it will not be trivial to implement a fuzzy-matching function that allows searches for non-canonical lemmas (i.e. lemmas that are segmented differently from the standard corpus).</Paragraph> </Section> <Section position="2" start_page="52" end_page="52" type="sub_section"> <SectionTitle> 4.2 Sub-Corpora Comparison </SectionTitle> <Paragraph position="0"> The Chinese Gigaword corpus is marked with two different genres, story and non-story. A still more salient sub-corpus demarcation is the one between the Mainland China corpus and the Taiwan corpus. 
Sketch Difference between lemmas from two sub-corpora is being planned. This would allow future comparative studies and would have wide applications in the localization and adaptation of language-related applications.</Paragraph> </Section> <Section position="3" start_page="52" end_page="52" type="sub_section"> <SectionTitle> 4.3 Collating Frequency Information with POS </SectionTitle> <Paragraph position="0"> One of the convenient features of the Sketch Engine is that a frequency-ranked word list is linked to all major components. This allows easy and informative reference. Since cross-categorical derivation with zero morphology is dominant in Chinese, it would help processing greatly if POS information were added to the word list. Adding such information would also open the possibility of accessing POS-ranked frequency information.</Paragraph> </Section> <Section position="4" start_page="52" end_page="53" type="sub_section"> <SectionTitle> 4.5 Fine-tuning Collocation Patterns </SectionTitle> <Paragraph position="0"> The Sketch Engine relies on collocation patterns, such as (2) above, to extract collocations. The regular expression format allows fast processing of large-scale corpora with good results. However, these patterns can be fine-tuned for better results. We give VN collocates with object function as an example here. In (6), verbs are underlined with a single line, and the collocated nouns identified by the English Word Sketch are underlined with double lines.</Paragraph> <Paragraph position="1"> Other nominal objects that the Sketch Engine misses are marked with a dotted line.</Paragraph> <Paragraph position="2"> 6.a. In addition to encouraging kids to ask, think and do, parents need to be tolerant and appreciative to avoid killing a child's creative sense.</Paragraph> <Paragraph position="3"> b. Children are taught to love their parents, classmates, animals, nature . . . . 
in fact they are taught to love just about everything except to love China, their mother country.</Paragraph> <Paragraph position="4"> c. For example, the government deliberately chose not to teach Chinese history and culture, nor civics, in the schools.</Paragraph> <Paragraph position="5"> d. At the game there will be a lottery drawing for a motorcycle! And perhaps you'll catch a foul ball or a home run.</Paragraph> <Paragraph position="6"> The sentences in (6) show that the current Sketch Engine tends to identify only the first object when there are multiple objects. The distributional information thus obtained will be valid given a sufficiently large corpus. However, if the collocation patterns are fine-tuned to allow treatment of coordination, richer and more precise information can be extracted.</Paragraph> <Paragraph position="7"> A regular expression collocation pattern also runs the risk of mis-classification. For instance, speech act verbs often allow the subject to occur in post-verbal position, and intransitive verbs can often take temporal nouns in post-verbal position too.</Paragraph> <Paragraph position="8"> 7. a. ...you can say goodbye to your competitive career.</Paragraph> <Paragraph position="9"> b. `No,' said Scarlet, `but then I don't notice much.' 8. a. Where did you sleep last night? b. ...it arrived Thursday morning.</Paragraph> <Paragraph position="10"> c. From Arty's room came the sound of an accordion.</Paragraph> <Paragraph position="11"> 9. `I'll look forward to that.' `So will I.' Such non-canonical word orders are even more prevalent in Chinese. Chinese objects often occur in pre-verbal positions in various pre-posing constructions, such as topicalization.</Paragraph> <Paragraph position="12"> 10. quan.gu mian.bao, chi le hen jian.kang whole-grain bread, eat LE very healthy 'Eating whole-grain bread is very healthy.' 11a. 
you ren chang.shi yao jiang zhe he.hua fen.lei, que yue fen yue lei someone try to JIANG the lotus classify, but more classify more tired 'People have tried to decide what category the lotus belongs in, but have found the effort taxing.' b. wo yi.ding yao ba lao.da chu.diao I must want BA the oldest (son) get rid of 'I really want to get rid of the older son.' When objects are pre-posed, they tend to stay closer to the verb than the subject does. Adding object marking information, such as ba, jiang, and lian, would help correctly identify collocating pre-posed objects. However, for unmarked pre-posed structures, closeness to the verb may not provide sufficient information. Several rules will need to be implemented jointly.</Paragraph> <Paragraph position="13"> The above examples underline a critical issue: whether relative position alone is enough to identify positional information. The Sketch Engine is in essence a powerful tool for extracting generalizations from annotated corpus data. We have shown that it can extract useful grammatical information with POS tags alone. If the corpus is tagged with richer annotation, the Sketch Engine should be able to extract even richer information.</Paragraph> <Paragraph position="14"> The Sinica Corpus tagset adapts to the fact that Chinese has a freer word order than English by incorporating semantic information with the grammatical category. For instance, locational and temporal nouns, proper nouns, and common nouns are each assigned a different tag. Verbs are sub-categorized according to activity and transitivity. Such information is not available in the BNC tagset and hence not used in the original Sketch Engine design. We will enrich the collocation patterns with the annotated linguistic information from the Sinica Corpus tagset. In particular, we are converting ICG lexical subcategorization frames (Chen and Huang 1990) to Sketch Engine collocation patterns. 
These ICG frames, called Basic Patterns and Adjunct Patterns, have already been fully annotated lexically and tested on the Sinica Corpus. We expect their incorporation to improve Chinese Sketch Engine results markedly.</Paragraph> </Section> </Section> </Paper>
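<!--
Illustrative note on Section 4.5. The fine-tuning discussed there (treating coordinated objects, and using object markers such as ba and jiang to find pre-posed objects) can be sketched in a few lines of Python. This is only a minimal sketch under stated assumptions, not the Sketch Engine's actual pattern grammar: the word/TAG input format, the simplified tag names (V, N, DET, ADJ, CONJ, PUNC, P, and so on), the hand-tagged sentences, and the names FIRST_ONLY, COORD, PREPOSED, verb_objects and preposed_objects are all hypothetical and chosen for illustration.

```python
import re

# Hypothetical simplified tags, used only for this sketch: V, N, DET, ADJ, CONJ, PUNC, P.
# They are neither the BNC tagset nor the Sinica tagset.
NP = r"(?:\S+/DET )?(?:\S+/ADJ )*\S+/N\b"   # optional determiner, adjectives, head noun
SEP = r"(?: ,/PUNC| \S+/CONJ)+"             # comma and/or conjunction between conjuncts

# Verb followed by one object noun phrase: only the first object is captured.
FIRST_ONLY = re.compile(rf"(?<!\S)(\S+)/V ({NP})")
# Same pattern, extended so that the object span continues through coordination.
COORD = re.compile(rf"(?<!\S)(\S+)/V ({NP}(?:{SEP} {NP})*)")
# Pre-posed object introduced by an object marker (ba or jiang), followed by the verb.
PREPOSED = re.compile(rf"(?<!\S)(?:ba|jiang)/P ({NP}) (\S+)/V\b")
# Head nouns inside an already matched object span.
HEAD = re.compile(r"(?<!\S)(\S+)/N\b")


def verb_objects(tagged, pattern):
    """Return (verb, noun) pairs for every head noun in each matched object span."""
    return [(m.group(1), noun)
            for m in pattern.finditer(tagged)
            for noun in HEAD.findall(m.group(2))]


def preposed_objects(tagged):
    """Return (verb, noun) pairs for objects that precede the verb after ba or jiang."""
    return [(m.group(2), noun)
            for m in PREPOSED.finditer(tagged)
            for noun in HEAD.findall(m.group(1))]


# Hand-tagged fragments of examples (6b) and (11b); the tagging is an assumption.
EN_6B = ("Children/N are/AUX taught/V to/TO love/V their/DET parents/N "
         ",/PUNC classmates/N ,/PUNC animals/N ,/PUNC nature/N")
ZH_11B = "wo/N yi.ding/ADV yao/V ba/P lao.da/N chu.diao/V"

print(verb_objects(EN_6B, FIRST_ONLY))   # first object only
print(verb_objects(EN_6B, COORD))        # all coordinated objects
print(verb_objects(ZH_11B, FIRST_ONLY))  # the post-verbal pattern finds no object here
print(preposed_objects(ZH_11B))          # the marker-based pattern recovers the object
```

The design choice in this sketch is to match the whole object span first and extract head nouns in a second pass, because a repeated capturing group in a regular expression keeps only its last match. Unmarked pre-posed objects, as in example (10), are not handled and would need additional rules.
-->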