File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/94/c94-2153_intro.xml
Size: 6,118 bytes
Last Modified: 2025-10-06 14:05:40
<?xml version="1.0" standalone="yes"?> <Paper uid="C94-2153"> <Title>AN EFFICIENT SYNTACTIC TAGGING TOOL FOR CORPORA @</Title> <Section position="3" start_page="0" end_page="949" type="intro"> <SectionTitle> 1. INTRODUCTION </SectionTitle> <Paragraph position="0"> In recent years, corpora, whether monolingual or bilingual, have played an important role in MT and linguistics research (Komatsu, Jin & Yasuhara, 1993; Sato, 1993; Isabelle & Dymetman, 1993). This is because a corpus containing a large amount of running text is considered an ideal source of linguistic knowledge. However, to acquire knowledge from a corpus (Watanabe, 1993; Mitamura, Nyberg & Carbonell, 1993), or to use its sentences effectively as examples, as in the example-based approach (Nagao, 1984; O. Furuse & H. Iida, 1992), the corpus has to be annotated with certain information, which may be morphological, syntactic or semantic.</Paragraph> <Paragraph position="1"> Take Chinese monolingual corpora, for instance. A raw corpus, i.e. text which has not been segmented into word strings, can only be used for statistics on Chinese characters. If you want to work out the frequency of words, the corpus has to be segmented into word strings, i.e., it has to be annotated with word boundary information. Furthermore, if you want to obtain the co-occurrence frequency of each pair of adjacent parts of speech, which is helpful to the study of part-of-speech (POS) tagging, you must annotate the corpus with POS information. 
And if you want to extract syntactic knowledge from a corpus, the corpus must be annotated with syntactic information such as dependency relations and phrase structure. Such a corpus is called a tree bank, and it is used as a resource for knowledge acquisition and as a source of examples in EBMT research.</Paragraph> <Paragraph position="2"> There are usually five levels of annotation for a corpus, in order of increasing depth: word boundary tagging, POS tagging, sense tagging, syntactic relation tagging and semantic relation tagging. To improve tagging automation and keep good consistency, a mechanism is required at each level of tagging to acquire knowledge from human intervention and from the annotated corpus. The knowledge acquired should be fed back to the tagging model to improve tagging automation and correctness.</Paragraph> <Paragraph position="3"> Our group has been doing research on Chinese corpora for many years, and has done successful experiments on word boundary tagging, POS tagging (Bai & Xia, 1992) and sense tagging (Tong, Huang & Guo, 1993). Syntactic relation tagging, however, has not been resolved well, for two reasons. First, there is no clear answer as to which grammar formalism, such as phrase structure grammar, dependency grammar or any other grammar, is suitable for syntactic tagging of large-scale running text. Second, how can human labor in tagging be saved while good consistency is kept? (Footnote (1): supported by National Foundation of Natural Science of China.) For the first question, some researchers adopt phrase structure grammar (PSG) as the tagging formalism (Leech & Garside, 1991), and some adopt dependency grammar (DG) ( 1993; Komatsu, Jin & Yasuhara, 1993). In comparison with PSG, the authors think, DG has some advantages. 
First, it is economical and convenient to use DG for syntactic relation tagging of a corpus because there are no non-terminal nodes in a DG parse tree. Second, DG stresses relations among individual words, so the acquisition of collocation knowledge and of syntactic relations among words is straightforward. Third, there is a relatively direct mapping between a dependency tree and a case representation.</Paragraph> <Paragraph position="4"> Based on the above discussion, the authors chose dependency grammar as the syntactic formalism for corpora, and defined 44 kinds of dependency relation for Chinese (Zhou & Huang, 1993).</Paragraph> <Paragraph position="5"> For the second question, we must develop an efficient tagging tool, for which we need to take account of two factors: (1) the power of acquiring tagging knowledge from human intervention, in order to improve the automation level; (2) the ability to keep good consistency.</Paragraph> <Paragraph position="6"> Simmons & Yu (1992) introduced context-dependent grammar for English parsing, in which the context-dependent rules can be acquired through an interactive mechanism. Phrase structure analysis and case analysis were conducted through a stack-based shift/reduce parser, with a success ratio as high as 99%. Inspired by their work, we designed a dependency relation tagging tool for Chinese corpora, called CSTT.</Paragraph> <Paragraph position="7"> CSTT uses context-dependent grammar as well. It can learn the human's tagging knowledge. In the initial stage, the tagging is mainly done by a human, and the system records the human's operations and forms tagging rules. When a sufficient number of rules have accumulated, the system can help the human to tag, for example by offering annotation operations which the human performed before in the same context, or even by doing some annotation itself in some cases. The level of annotation automation gets higher and higher, and good consistency is thus guaranteed. 
It should be mentioned that, since PSG non-terminal symbols are used in the shift/reduce tagging process, CSTT can produce a PSG version of the syntactically tagged sentences as well. In addition, the two versions of the tree can be mapped into each other given a set of transfer rules.</Paragraph> <Paragraph position="8"> A small corpus of 1300 daily-life sentences, with an average length of 20 Chinese characters per sentence, is used for the experiment. For the first 300 sentences, 1455 rules were obtained, and for the whole corpus, a total of 6521 rules were obtained. Tagging automation improved continually as the number of rules increased, and the automatic tagging ratio is above 50% after 1200 sentences were tagged.</Paragraph> </Section></Paper>
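<!--
The introduction describes CSTT as recording a human annotator's operations in context and later offering the same operation when that context recurs. The following is a minimal sketch of that idea only, not the CSTT implementation. The class name RuleStore, the three-tag context tuple, the operation strings and the function tag_session are all assumptions made for illustration; the paper's introduction does not specify how contexts or rules are represented.

```python
from collections import Counter, defaultdict

# Assumed context shape (hypothetical): POS tags of the top two stack
# items plus the POS tag of the next input word.


class RuleStore:
    """Remembers which tagging operation a human chose in each context."""

    def __init__(self):
        # Maps a context tuple to a Counter of the operations chosen there.
        self._rules = defaultdict(Counter)

    def record(self, context, operation):
        """Store one human decision as a context-dependent rule."""
        self._rules[context][operation] += 1

    def suggest(self, context):
        """Return the operation most often chosen in this context, or None."""
        seen = self._rules.get(context)
        if not seen:
            return None
        return seen.most_common(1)[0][0]

    def rule_count(self):
        """Number of distinct (context, operation) pairs learned so far."""
        return sum(len(ops) for ops in self._rules.values())


def tag_session(store, decisions):
    """Replay (context, human_operation) pairs in order.

    Before each decision is learned, check whether the store would have
    suggested it. Returns the fraction of decisions predicted this way,
    a rough stand-in for an automatic tagging ratio.
    """
    automatic = 0
    for context, operation in decisions:
        if store.suggest(context) == operation:
            automatic += 1
        store.record(context, operation)
    return automatic / len(decisions) if decisions else 0.0
```

In this sketch the ratio returned by tag_session rises as contexts repeat, which mirrors the qualitative claim that automation improves as rules accumulate. Using the most frequent past operation for a context is one simple way to favour consistency; the actual tool may resolve conflicting rules differently.
-->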