<?xml version="1.0" standalone="yes"?>
<Paper uid="W95-0106">
  <Title>Trainable Coarse Bilingual Grammars for Parallel Text Bracketing</Title>
  <Section position="4" start_page="69" end_page="70" type="metho">
    <SectionTitle>
2 Stochastic Inversion Transduction Grammars
</SectionTitle>
    <Paragraph position="0"> In Wu (1995b) we define an inversion tranduction grammar (ITG) formalism for bilingual language modeling, i.e., modeling of two languages (referred to as L1 and L2) simultaneously. The description here is necessarily brief; for further details the reader is referred to Wu (1995a, 1995b).</Paragraph>
    <Paragraph position="1"> An ITG is a context-free grammar that generates output on two separate streams, together with a matching that associates the corresponding tokens and constituents of each stream. The formalism also differs from standard context-free grammars in that the concatenation operation, which is implicit in any production rule's right-hand side, is replaced with two kinds of concatenation with either straight or inverted orientation. Thus, the following are two distinct productions in an ITG:</Paragraph>
    <Paragraph position="3"> Consider each nonterminal symbol to stand for a pair of matched strings, so that for example (A1, A2) denotes the string-pair generated by A. The operator \[ \] performs the &amp;quot;usual&amp;quot; pairwise concatenation so that \[AB\] yields the string-pair (C1, C2) where C1 = AiB~ and C~ = A2B2. But the operator () concatenates constituents on output stream 1 while reversing them on stream 2, so that C1 = A~B~ but C2 = B2A2.</Paragraph>
    <Paragraph position="4"> The inverted concatenation operator permits the extra flexibility needed to accommodate many kinds of word-order variation between source and target languages. Since inversion is permitted at any level of rule expansion, a derivation may intermix productions of either orientation within the parse tree. More on the ordering flexibility will be said later.</Paragraph>
    <Paragraph position="5"> There are also lexical productions of the form: A --~ x/y where x and y am symbols of languages L1 and L2, respectively. Either or both x and y may take the special value E denoting an empty string, allowing a symbol of either language to have no counterpart in the other language by being matched to an empty string. We call x/~ an Ll-singleton and c/y an L2-singleton. Parsing, in the context of ITGs, means to take as input a sentence-pair rather than a sentence, and to output a parse tree that imposes a shared hierarchical structuring on both sentences. For example, Figure 1 shows a parse tree for an English-Chinese sentence translation. The English is mad in the usual depth-first left-to-right order, but for the Chinese, a horizontal line means the right subtree is traversed before the left, so that the following sentence pair is generated:  (1) a. \[\[\[The Authority\]Np \[will \[\[be accountable\]vv \[to \[the \[\[Financial Secretary\]NN \]NNN \]iP \]PP \]VP  \]vv \]sv ./o \]s b. \[\[\[~/~\]NP \[N~ \[\[~1 \[\[\[~ NNN \]NNN \]Ne \]eP \[t~\]VV \]VP \]VP \]SP ./o IS Alternatively, we can show the common structure of the two sentences more compactly using bracket notation with the aid of the () operator:</Paragraph>
    <Paragraph position="6"> (2) [[[The/ε Authority/~ ]NP [will/~ ⟨[be/ε accountable/~]VV [to/~ [the/ε [[Financial/~ Secretary/~]NN ]NNN ]NP ]PP ⟩VP ]VP ]SP ./o ]S</Paragraph>
    <Paragraph position="0"> where the horizontal line from Figure 1 corresponds to the () level of bracketing.</Paragraph>
    <Paragraph position="1"> A stochastic inversion transduction grammar is an ITG where a probability is associated with each production, subject to the constraint that</Paragraph>
    <Paragraph position="3"> where ai_..~\] = P(i ~ \[jk\]li), bi(x, y) = P(i ~ x/yli), W1 and W2 are the vocabulary sizes of the two languages, and N is the number of nonterrninal categories.</Paragraph>
    <Paragraph position="4"> Under the stochastic formulation, the objective of parsing is to find the maximum-likelihood parse for a given sentence pair. A general algorithm for this is given in Wu (1995b).</Paragraph>
    <Paragraph position="5"> The following convenient theorem is proved in Wu (1995b), which indicates that any ITG can be converted to a normal form, where all productions are either lexical productions or binary-fanout productions:  Theorem 1 For any inversion transduction grammar G, there exists an equivalent inversion transduction grammar G' in which every production takes one of the following forms: S ---~ e/e A ~ x/e A ~ \[BC\] A ~ x/y A ~ e/y A ~ (BC)  for all i, j English-Chinese lexical translations for all i English vocabulary for all j Chinese vocabulary  The algorithms in this paper assume that ITGs are in this normal form, with one slight relaxation. Lexical productions of the form A ~ x/y may generate multiple-word sequences, i.e., x and g may each be more than one word. This does not affect the generative power, but allows probabilities to be placed on collocation translations. The form is called lexical normal form.</Paragraph>
    <Paragraph position="6"> ITGs impose two desirable classes of constraints on the space of possible matchings between sentences. Crossing constraints prohibit arrangements where the matchings between subtrees cross each another, unless the subtrees' immediate parent constituents are also matched to each other. Aside from linguistic motivations stemming from the compositionality principle, this constraint is important for computational reasons, to avoid exponential bilingual matching times. Fanout constraints limit the number of direct sub-constituents of any single constituent, i.e., the number of subtrees whose matchings may cross at any level. We have shown that ITGs inherently permit nearly free matchings for fanouts up to four, with strong constraints thereafter creating a rapid falloff in the proportion of matchings permitted (Wu 1995a). This characteristic gives ITGs just the right degree of flexibility needed to map syntactic structures interlingually.</Paragraph>
  </Section>
  <Section position="6" start_page="71" end_page="76" type="metho">
    <SectionTitle>
3 Coarse Bilingual Grammars
</SectionTitle>
    <Paragraph position="0"> Because the expressiveness of ITGs naturally constrains the space of possible matchings in a highly appropriate fashion, the possibility arises that the information supplied by a word-translation lexicon alone may be adequately discriminating to match constituents, without language-specific monolingual grammars for the source and target languages, simply by bringing the ITG constraints to bear in tandem with lexical matching.</Paragraph>
    <Paragraph position="1"> That is, the bilingual SITG parsing algorithm can perform constituent identification and matching using only a generic, language-independent bracketing grammar.</Paragraph>
    <Paragraph position="2"> Several earlier experiments (Wu 1995a) tested out variants of this hypothesis, using generic SITGs similar to the one shown in Figure 2, which employs only one nonterminal category. The first two productions are sufficient to generate all possible matchings of ITG expressiveness (this follows from the normal form theorem). The remaining productions are all lexical. Productions of the A --+ ui/v~ form list all word translations found in the translation lexicon, and the others list all potential singletons without corresponding translations. Thus, a parser with this grammar can build a bilingual parse tree for any possible ITG matching on a pair of input sentences.</Paragraph>
    <Paragraph position="3"> Probabilities on the grammar are placed as follows. The bii distribution encodes the English-Chinese translation lexicon with degrees of probability on each potential word translation. A small e-constant can be chosen for the probabilities bl, and b,i, so that the optimal matching resorts to these productions only when it is otherwise impossible to match the singletons. The result is that the maximum-likelihood parser selects the parse tree that best meets the combined lexical translation preferences, as expressed by the bij probabilities. Performance, as reported in Wu (1995a), was encouraging, with precision on automatically-filtered  sentence pairs in the 80% range with the aid of supporting heuristics. However, there are of course inherent limitations of any approach that relies entirely on crossing- and fanout-constrained lexical matching. In particular, if the sub-constituents of any constituent appear in the same order in both languages, lexical matchings do not provide the discriminative leverage to identify the sub-constituent boundaries. This applies to both straight and inverted orientations; an example with inverted orientation is shown in Figure 3. In such cases, specific grammatical information about one or both of the languages is needed.</Paragraph>
    <Paragraph position="4"> Grammatical information is far less easily available for Chinese than for English, however, with respect to part-of-speech lexicons as well as grammars. The SITG formalism offers another possibility: the generic bracketing grammar can be replaced with a context-free backbone designed for English.</Paragraph>
    <Paragraph position="5"> It is critical under this approach that the English grammar be reasonably robust. It should also avoid being too specific, since to be effective at bracketing, its structure must accomodate Chinese to a reasonably broad extent. For these reasons it is best to employ a simple, coarse grammar, with fallback productions that simulate the generic bracketing grammar when the English productions are too inflexible.</Paragraph>
    <Paragraph position="6"> As before, the lexical productions will constitute the bulk of the rules set. However, we can now distinguish between different part-of-speech nonterminals. Different part-of-speech nonterminals may generate the same words. We can accomodate the fact that no Chinese part-of-speech lexicon is available with noninformative distributions as follows:  1. The conditional distribution over L ~ ui/e productions is estimated from the frequencies for each English part-of-speech L.</Paragraph>
    <Paragraph position="7"> 2. The conditional distribution over L ~ ui/v# productions is estimated from the frequencies for the English part-of-speech L uniformly distributed over the set of matching Chinese words.</Paragraph>
    <Paragraph position="8"> 3. The conditional distribution over L ~ ~/v~ productions is uniformly distributed over the Chinese vocabulary.</Paragraph>
    <Paragraph position="10"> Because the grammar is coarse while the lexicon is fine, the approach retains the previous approach's high sensitivity to lexical matching constraints.</Paragraph>
    <Paragraph position="11"> It is interesting to constrast this method with the &amp;quot;parse-parse-match&amp;quot; approaches that have been reported recently for producing parallel bracketed corpora (Sadler &amp; Vendelmans 1990; Kaji et al. 1992; Matsumoto et al. 1993; Cranias et al. 1994; Gfishman 1994). &amp;quot;Parse-parse-match&amp;quot; methods first bracket a parallel corpus by parsing each half individually using a monolingual grammar. 1 Heuristic procedures are subsequently used to select a matching between the bracketed constituents across sentence-pairs. These approaches can encounter difficulties with incompatibilities between the monolingual grammars used to parse the texts. The grammars will usually be of unrelated origins, not designed to make interlingual matching easy.</Paragraph>
    <Paragraph position="12"> Furthermore, how to deal with ambiguities presents another serious problem. Most sentences in the corpus will have multiple possible parses. In a pure &amp;quot;parse-parse-match&amp;quot; approach, however, the monolingual parsers must arbitrarily select one bracketing with which to annotate the corpus. The resulting parse may be incompatible with the parse chosen for the other half of the sentence-pair, causing a matching error even though some alternative parse might in fact been compatible.</Paragraph>
    <Paragraph position="13"> The coarse bilingual grammar approach proposed here solves these problems by choosing the parse a Of course, this assumes that adequate grammars are available for both languages, contrary to our present assumptions.  structure for both sentences simultaneously with the interlingual constituent matching criteria. The weighting of the bracketing constraints and matching constraints is probabilistic. Even if a sentence pair's translations truly contain structural mismatches that are beyond syntactic accounts, the soft constraint optimization permits graceful degradation in the bilingual parse. The parser will attempt to match those constituents for which a partial decomposition and matching can be found, parsing the rest largely according to the English grammar backbone.</Paragraph>
    <Paragraph position="14"> More sophisticated &amp;quot;parse-parse-match&amp;quot; procedures postpone ambiguity resolution until the matching stage (Kaji et al. 1992; Matsumoto et al. 1993; Grishman 1994). This tactic bears closer resemblance to our approach, but still requires ad hoc heuristics to determine exactly how the matching task influences the monolingual parses that are chosen. On the other hand, the present framework incorporates all these aspects within a single probabilistic optimization.</Paragraph>
    <Paragraph position="15"> Another alternative approach discussed in Wu (1995b) is to first use a monolingual grammar to bracket only the English half of the text, followed by a SITG parallel bracketing procedure constrained by the English brackets. However, this hybrid approach is subject to the same incompatibility and ambiguity problems that arise for pure &amp;quot;parse-parse-match&amp;quot; procedures; thus the proposed coarse bilingual grammar approach is superior for the same reasons given above.</Paragraph>
    <Paragraph position="16"> For our experiments, we employed the grammar shown in Figures 4 and 5, with only 50 syntactic productions and 13 nonterminal categories, including 6 part-of-speech categories. Each syntactic production occurs in both straight and inverted orientations, to model ignorance of the ordering tendencies of the corresponding Chinese constituents. The part-of-speech categories were designed by conflating categories in the Brown corpus tagset, under the following general principle: categories should be as broad as possible, while still maintaining reasonable discriminativeness for bracketing structure. Thus, notice that adjectives and nouns are conflated, since complex nominal phrases have largely similar parse structures regardless of  the difference between adjective and noun labels. Similarly, all verbs including auxiliaries are grouped to allow simple tail-recursive compounding. The S category (not to be confused with the start symbol SO) is a placeholder for miscellaneous items including punctuation and adverbs, and functions as a fallback category similar to the A nonterminal in the generic bracketing grammars.</Paragraph>
    <Paragraph position="17"> Probabilities were placed on the syntactic productions uniformly, but all inverted productions were They are right to do so  assigned a slightly smaller probability in order to break ties in favor of straight matchings. Probabilities were placed on the lexical productions as discussed above, with the following additional provisions. The translation lexicon was automatically learned from the HKUST English-Chinese Parallel Bilingual Corpus via statistical sentence alignment (Wu 1994) and statistical Chinese word and collocation extraction (Fung &amp; Wu 1994; Wu &amp; Fung 1994), followed by an EM word-translation learning procedure (Wu &amp; Xia 1994). The latter stage gives us the lexical translation probabilities. The translation lexicon contained approximately 6,500 English words and 5,500 Chinese words, and was not manually corrected for this experiment, having about 86% translation accuracy. The English part-of-speech lexicon with relative frequencies was derived from the English portion of our corpus as tagged by Brill's (1993) tagger.</Paragraph>
    <Paragraph position="18"> Our preliminary experiments show improved parsing behavior in general, compared to generic bracketing grammars. Examples of the output are shown in Figure 6. The latter example shows problematic behavior on the example given earlier in Figure 3 of sentence pairs without sufficient ordering discrimination. Although an attempt is made in this case to fit the English constraints, the main difficulty is that the translation &amp;quot;so/~ ~&amp;quot; was missing from the automatically-learned lexicon; also, the simple grammar lacks infinitival clauses.</Paragraph>
  </Section>
  <Section position="7" start_page="76" end_page="78" type="metho">
    <SectionTitle>
4 An EM Algorithm for Training SITGs
</SectionTitle>
    <Paragraph position="0"> An unavoidable consequence of using more structured, complex grammars--coarse though they may be--is that the bilingual matching process becomes more sensitive to the syntactic production probabilities than under the earlier generic bracketing grammar approaches. Performance therefore suffers if the probabilities are not appropriate, a serious problem given that the syntactic production probabilities above are manually, and arbitrarily, set to be uniform.</Paragraph>
    <Paragraph position="1"> It therefore becomes desirable to find means to tune the syntactic production probabilities automatically, so as to be optimal with respect to some training data set. Note that we do not expect the parallel training corpus to be parsed or otherwise syntactically annotated beforehand. To this end we present an EM (expectation-maximization) algorithm for iteratively improving the syntactic production parameters of a SITG, according to a likelihood criterion. The method is a generalization of the inside-outside algorithm for SCFG estimation (Baker 1979; Lari &amp; Young 1990).</Paragraph>
    <Paragraph position="2"> A few notational preliminaries: we will denote the sentence pairs by (E, C) where the English sentence E -- el,..., eT and the corresponding Chinese sentence C = cl,... , ev are vectors of observed symbols (that is, lexemes or words). As an abbreviation we write e~..t for the sequence of words e~+~, e~+2,..., et, and similarly for ~..~. It will be convenient to use a 4-tuple of the form q = (s, t, u, v) to identify each node of the parse tree, where the substrings es..t and c~..~ both derive from the node q. Denote the nonterminal label on q = (s, t, u, v) by gq or PSstu~, with the convention that PS~t~ = 0 means that e,..t and e~..~ are not derived from a single common nonterminal.</Paragraph>
    <Paragraph position="3"> The inside probabilities, defined as:</Paragraph>
    <Paragraph position="5"> O&lt;S&lt;s t&lt;S_&lt;T v&lt;U&lt;V 0&lt;U&lt;u (,-s)(u-~)~o (s-t)(,~-u)#o The estimation procedure for adjusting the model parameter set ff is defined in terms of the inside and outside probabilities. We begin by considering for each nonterminal the probability of its use in a derivation of the observed sentence-pair:</Paragraph>
    <Paragraph position="7"> The behavior of a typical training run is shown in Figure 7. The relative movement of the log likelihood is what is important here. The absolute magnitudes are not meaningful since they are largely determined by the fixed lexical translation probabilities. What is significant is that due to the relatively small number of parameters being trained, convergence is achieved within two or three iterations. (The rise in perplexity afterwards is caused by numerical error on overtrained parameters; we terminate training as soon as this</Paragraph>
  </Section>
class="xml-element"></Paper>