<?xml version="1.0" standalone="yes"?> <Paper uid="C02-1003"> <Title>Learning Chinese Bracketing Knowledge Based on a Bilingual Language Model</Title> <Section position="1" start_page="0" end_page="2" type="abstr"> <SectionTitle> Abstract </SectionTitle> <Paragraph position="0"> This paper proposes a new method for the automatic acquisition of Chinese bracketing knowledge from English-Chinese sentence-aligned bilingual corpora. Bilingual sentence pairs are first aligned in syntactic structure by combining English parse trees with a statistical bilingual language model. Chinese bracketing knowledge is then extracted automatically. Preliminary experiments show that the automatically learned knowledge accords well with manually annotated brackets. The proposed method is particularly useful for acquiring bracketing knowledge for a less-studied language that lacks the tools and resources available for a better-studied second language. Although this paper discusses experiments with Chinese and English, the method is also applicable to other language pairs.</Paragraph> <Paragraph position="1"> Introduction The past few years have seen great success in the automatic acquisition of monolingual parsing knowledge and grammars. The availability of large tagged and syntactically bracketed corpora, such as the Penn Treebank, makes it possible to extract syntactic structure and grammar rules automatically (Marcus 1993). Substantial improvements have been made in parsing Western languages such as English, and many powerful models have been proposed (Brill 1993, Collins 1997). However, very limited progress has been achieved for Chinese.</Paragraph> <Paragraph position="2"> Knowledge acquisition is a bottleneck for real applications of Chinese parsing. While some methods have been proposed to learn syntactic knowledge from annotated Chinese corpora, most of them depend on annotated or partially annotated data (Zhou 1997, Streiter 2000).
Due to the limited availability of annotated Chinese corpora, tests of these methods are still small in scale. Although some institutions and universities are currently engaged in building Chinese treebanks, no large-scale annotated corpus has been published so far because of the complexity of Chinese syntactic structure and the difficulty of corpus annotation (Chen 1996).</Paragraph> <Paragraph position="3"> This paper proposes a novel method to facilitate Chinese treebank construction. Based on English-Chinese bilingual corpora and better-developed English parsing, this method obtains Chinese bracketing information automatically via a bilingual model and word alignment results.</Paragraph> <Paragraph position="4"> The main idea of the method is that we may acquire knowledge for a language lacking a rich collection of resources and tools from a second language that is rich in them.</Paragraph> <Paragraph position="5"> The rest of this paper is organized as follows: In the next section, a bilingual language model is introduced. Then, a bilingual parsing method supervised by English parsing is proposed in section 2. Based on the bilingual parsing, Chinese bracketing knowledge is extracted in section 3. The evaluation and discussion are given in section 4. We conclude with a discussion of future work.</Paragraph> <Paragraph position="6"> 1 A bilingual language model - ITG Wu (1997) has proposed a bilingual language model called Inversion Transduction Grammar (ITG), which can be used to parse bilingual sentence pairs simultaneously. We give a brief description here. For details, please refer to (Wu 1995, Wu 1997).</Paragraph> <Paragraph position="7"> The Inversion Transduction Grammar is a bilingual context-free grammar that generates two matched output languages (referred to as L1 and L2). It also differs from standard context-free grammars in that the ITG allows the right-hand side of a production to be read in two directions: straight or inverted.
The following examples are two ITG productions: C -> [A B] and C -> <A B>.</Paragraph> <Paragraph position="9"> Each nonterminal symbol stands for a pair of matched strings. For example, the nonterminal A stands for the string-pair (A1, A2), and (B1, B2) denotes the string-pair generated by B. The operator [ ] performs the usual concatenation, so that C -> [A B] yields C1 = A1 B1 and C2 = A2 B2. On the other hand, the operator <> performs the straight concatenation for language 1 but the reversing concatenation for language 2, so that C -> <A B> yields C1 = A1 B1 but C2 = B2 A2.</Paragraph> <Paragraph position="10"> The inverted concatenation operator permits the extra flexibility needed to accommodate many kinds of word-order variation between source and target languages (Wu 1995).</Paragraph> <Paragraph position="11"> There are also lexical productions of the following form in ITG: A -> x/y This means that a symbol x in language L1 is translated by the symbol y in language L2. Either x or y may be the null symbol ε, which means there may be no counterpart string on the other side of the bitext.</Paragraph> <Paragraph position="12"> ITG-based parsing matches constituents for an input sentence-pair. For example, Figure 1 shows an ITG parse tree for an English-Chinese sentence-pair. An inverted production is indicated by a horizontal line in the parse tree. The English text is read in the usual depth-first, left-to-right order, but for the Chinese text, a horizontal line means the right sub-tree is traversed before the left. The generated parsing results are: We can also represent the common structure of the two sentences more clearly and compactly with the aid of <> notation: where the horizontal line from Figure 1 corresponds to the <> level of bracketing. Any ITG can be converted to a normal form, where all productions are either lexical productions or binary-fanout nonterminal productions (Wu 1997). If a probability is associated with each production, the ITG is called a Stochastic Inversion Transduction Grammar (SITG).</Paragraph> </Section></Paper>