File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/95/w95-0106_intro.xml
Size: 2,962 bytes
Last Modified: 2025-10-06 14:05:59
<?xml version="1.0" standalone="yes"?> <Paper uid="W95-0106"> <Title>Trainable Coarse Bilingual Grammars for Parallel Text Bracketing</Title> <Section position="3" start_page="0" end_page="69" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> A number of empirical studies have found bracketing to be a useful type of corpus annotation (e.g., Pereira & Schabes 1992; Black et al. 1993). Bracketed corpora have been available for some time in English, and to some extent other European languages, the best-known example being perhaps the Penn Treebank (Marcus 1991). However, at present bracketed corpora for Chinese are unknown, as is the case for many other languages. Moreover, even for better-studied languages, parallel bracketed texts are scarce.</Paragraph> <Paragraph position="1"> The problem of bracketing such corpora is the focus of two new strategies described in this paper. The strategies build upon stochastic inversion transduction grammars (SITGs), a formalism that we have been developing for bilingual language modeling. Numerous experiments have shown parallel bilingual corpora to provide a rich source of constraints for statistical analysis (e.g., Brown et al. 1990; Gale & Church 1991; Gale et al. 1992; Church 1993; Brown et al. 1993; Dagan et al. 1993; Fung & Church 1994; Wu & Xia 1994; Fung & McKeown 1994). SITGs are a generalization of context-free grammars that have several desirable properties for parallel corpus analysis; a brief summary of these properties is given in Section 2.</Paragraph> <Paragraph position="2"> Our first strategy is to expropriate a very simple, coarse monolingual grammar of English as the backbone for a bilingual English-Chinese SITG, which is then used for bracketing parallel text. The effect of this is to transfer knowledge of English syntactic constraints (or more precisely, probabilistic preferences) to the bilingual task.
This is discussed in Section 3.</Paragraph> <Paragraph position="3"> Our second strategy is to apply an unsupervised training algorithm to tune the probabilistic parameters of the SITG. For this purpose we have devised an EM-based algorithm, a bilingual generalization of the inside-outside method, that iteratively improves the likelihood of the training corpus. This is discussed in Section 4.</Paragraph> <Paragraph position="4"> It is important to stress at the outset that a parallel bracketed corpus is different from a bracketed parallel corpus. The latter is simply a parallel corpus in which both halves have been independently bracketed. In contrast, in a parallel bracketed corpus, the bracketed sub-constituents are themselves parallel in the sense that explicit matching relationships are designated between sub-constituents of each half. This is a much more interesting kind of annotation if it can be accomplished, especially for machine translation applications.</Paragraph> </Section> </Paper>