<?xml version="1.0" standalone="yes"?>
<Paper uid="A00-1019">
<Title>Unit Completion for a Computer-aided Translation System</Title>
<Section position="3" start_page="0" end_page="136" type="metho">
<SectionTitle> 2 The Core Engine </SectionTitle>
<Paragraph position="0"> The core of TRANSTYPE is a completion engine which comprises two main parts: an evaluator, which assigns probabilistic scores to completion hypotheses, and a generator, which uses the evaluation function to select the best candidate for completion. </Paragraph>
<Section position="1" start_page="0" end_page="135" type="sub_section">
<SectionTitle> 2.1 The Evaluator </SectionTitle>
<Paragraph position="0"> The evaluator is a function p(t | t', s) which assigns to each target-text unit t an estimate of its probability given a source text s and the tokens t' which precede t in the current translation of s. [Footnote 1: We assume the existence of a deterministic procedure for tokenizing the target text.] Our approach to modeling this distribution is based to a large extent on that of the IBM group (Brown et al., 1993), but it differs in one significant aspect: whereas the IBM model involves a "noisy channel" decomposition, we use a linear combination of separate predictions from a language model p(t | t') and a translation model p(t | s). Although the noisy channel technique is powerful, it has the disadvantage that p(s | t', t) is more expensive to compute than p(t | s) when using IBM-style translation models. Since speed is crucial for our application, we chose to forego the noisy channel approach in the work described here. Our linear combination model is described as follows: </Paragraph>
<Paragraph position="1"> p(t | t', s) = α(t', s) p(t | t') + (1 − α(t', s)) p(t | s)   (1) </Paragraph>
<Paragraph position="2"> where α(t', s) ∈ [0, 1] are context-dependent interpolation coefficients. For example, the translation model could have a higher weight at the start of a sentence, while the contribution of the language model might become more important in the middle or at the end of the sentence. A study of the weightings for these two models is described elsewhere. In the work described here we did not use the contribution of the language model (that is, α(t', s) = 0 for all t', s). </Paragraph>
<Paragraph position="3"> [Table 1 (column layout not recovered), illustrating the word-completion and unit-completion tasks on the example sentence "This bill is examined in the house of commons". The first column gives the target words the user is expected to produce. The next two columns indicate respectively the prefixes typed by the user and the completions proposed by the system in a word-completion task. The last two columns provide the same information for the unit-completion task. The total number of keystrokes for both tasks is reported in the last line. + indicates the acceptance key typed by the user. A completion is denoted by α/β, where α is the typed prefix and β the completed part. Completions for different prefixes are separated by *.] </Paragraph>
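To make the interpolation in equation (1) concrete, here is a minimal Python sketch. The functions lm_prob and tm_prob and the constant alpha schedule are illustrative stand-ins, not TRANSTYPE's implementation; setting alpha to 0 reproduces the translation-model-only configuration used in this work.

```python
# Minimal sketch of the linear-combination evaluator (equation 1).
# `lm_prob`, `tm_prob` and `alpha` are illustrative stand-ins: the paper does
# not publish its implementation, and here alpha(t', s) is simply a constant
# (0.0 reproduces the "no language model" setting described above).

def alpha(prefix, source):
    """Context-dependent interpolation weight given to the language model."""
    return 0.0  # assumption: no language-model contribution, as in the paper

def evaluator(token, prefix, source, lm_prob, tm_prob):
    """p(t | t', s) = alpha * p(t | t') + (1 - alpha) * p(t | s)."""
    a = alpha(prefix, source)
    return a * lm_prob(token, prefix) + (1.0 - a) * tm_prob(token, source)

if __name__ == "__main__":
    # Toy usage with dummy constant models.
    lm = lambda t, prefix: 0.01   # dummy p(t | t')
    tm = lambda t, source: 0.05   # dummy p(t | s)
    print(evaluator("chambre", ["la"], ["house"], lm, tm))
```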
<Paragraph position="4"> Techniques for weakening the independence assumptions made by the IBM models 1 and 2 have been proposed in recent work (Brown et al., 1993; Berger et al., 1996; Och and Weber, 1998; Wang and Waibel, 1998; Wu and Wong, 1998). These studies report improvements on specific tasks (task-oriented, limited vocabulary) which are by nature very different from the task TRANSTYPE is devoted to. Furthermore, the underlying decoding strategies are too time-consuming for our application. We therefore use a translation model based on the simple linear interpolation given in equation (2), which combines the predictions of two translation models, Ms and Mu, both based on an IBM-like model 2 (Brown et al., 1993). Ms was trained on single words; Mu, described in section 3, was trained on both words and units. </Paragraph>
<Paragraph position="5"> p(t | s) = β ps(t | s) + (1 − β) pu(t | G(s))   (2) </Paragraph>
<Paragraph position="6"> where β ∈ [0, 1] is an interpolation coefficient, ps and pu stand for the probabilities given respectively by Ms and Mu, and G(s) represents the new sequence of tokens obtained after grouping the tokens of s into units. The grouping operator G is illustrated in table 2 and described in section 3. </Paragraph>
</Section>
<Section position="2" start_page="135" end_page="136" type="sub_section">
<SectionTitle> 2.2 The Generator </SectionTitle>
<Paragraph position="0"> The task of the generator is to identify units that match the current prefix typed by the user and to pick the best candidate according to the evaluator. Due to time considerations, the generator divides the target vocabulary into two parts: a small active component whose contents are always searched for a match to the current prefix, and a much larger passive part (over 380,000 word forms) which comes into play only when no candidate is found in the active vocabulary. The active part is computed dynamically when a new sentence is selected by the translator. It is composed of a few entities (tokens and units) that are likely to appear in the translation: the union of the best candidates provided by each model, Ms and Mu, over the set of all target tokens (resp. units) that have a non-null probability of being the translation of any of the current source tokens (resp. units). Table 2 shows the 10 most likely tokens and units in the active vocabulary for an example source sentence. </Paragraph>
<Paragraph position="1"> [Table 2 (contents partially recovered). It gives a pair of sentences (t is the translation of s in our corpus), the sequence G(s) of source tokens recast by the grouping operator G, and As, the 10 best tokens according to the word model, and Au, the 10 best units according to the unit model. Recoverable content: s = "that is what the prime minister said and i have outlined what has happened since then ."; t = "c'est ce que le premier ministre a dit , et j'ai résumé ce qui s'est produit depuis ."; G(s) = "that is what * the prime minister said * , and i * have outlined * what has happened * since then ." (groups separated by *); the As and Au lists were not recovered.] </Paragraph>
</Section>
</Section>
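The active-vocabulary construction of section 2.2 can be sketched as follows. The dictionary-based translation tables, the active_vocabulary name and the 10-best cutoff are assumptions made for illustration; the actual system queries IBM-2-style models Ms and Mu rather than precomputed tables.

```python
# Sketch of the generator's active vocabulary (section 2.2): the union of the
# best target candidates proposed by the word model Ms and the unit model Mu
# for the tokens/units of the current source sentence. Translation tables are
# plain dicts here; the real models are IBM-2-style translation models.

from typing import Dict, List, Tuple

TransTable = Dict[str, List[Tuple[str, float]]]  # source item -> [(target, prob), ...]

def active_vocabulary(source_tokens: List[str],
                      source_units: List[str],
                      ms_table: TransTable,
                      mu_table: TransTable,
                      n_best: int = 10) -> set:
    """Union of the n-best targets with non-null probability for any source item."""
    active = set()
    for items, table in ((source_tokens, ms_table), (source_units, mu_table)):
        candidates: Dict[str, float] = {}
        for s in items:
            for target, prob in table.get(s, []):
                if prob > 0.0:
                    candidates[target] = max(candidates.get(target, 0.0), prob)
        best = sorted(candidates.items(), key=lambda kv: kv[1], reverse=True)[:n_best]
        active.update(t for t, _ in best)
    return active
```

When the prefix typed by the user matches nothing in this active set, the generator falls back on the much larger passive vocabulary.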
<Section position="4" start_page="136" end_page="138" type="metho">
<SectionTitle> 3 Modeling Unit Associations </SectionTitle>
<Paragraph position="0"> Automatically identifying which source words or groups of words will give rise to which target words or groups of words is a fundamental problem which remains open. In this work, we decided to proceed in two steps: a) monolingually identifying groups of words that would be better handled as units in a given context, and b) mapping the resulting source and target units. To train our unit models, we used a segment of the Hansard corpus consisting of 15,377 pairs of sentences, totaling 278,127 English tokens (13,543 forms) and 292,865 French tokens (16,399 forms). </Paragraph>
<Section position="1" start_page="136" end_page="136" type="sub_section">
<SectionTitle> 3.1 Finding Monolingual Units </SectionTitle>
<Paragraph position="0"> Finding relevant units in a text has been explored in many areas of natural language processing. Our approach relies on distributional and frequency statistics computed for each sequence of words found in a training corpus. For the sake of efficiency, we used the suffix-array technique to obtain a compact representation of our training corpus; this method allows the efficient retrieval of arbitrary-length n-grams (Nagao and Mori, 1994; Haruno et al., 1996; Ikehara et al., 1996; Shimohata et al., 1997; Russell, 1998). </Paragraph>
<Paragraph position="1"> The literature abounds in measures that can help decide whether words that co-occur are linguistically significant or not. In this work, the strength of association of a sequence of words w1 ... wn is computed by two measures: a likelihood-based one ρ(w1 ... wn) (where ℓ is the likelihood ratio given in (Dunning, 1993)) and an entropy-based one e(w1 ... wn) (Shimohata et al., 1997), both defined over the training text T and the tokens m it contains. </Paragraph>
<Paragraph position="2"> Intuitively, the first measure accounts for the fact that the parts of a sequence of words that should be considered as a whole should not appear often by themselves. The second one reflects the fact that a salient unit should appear in various contexts (i.e., it should have a high entropy score). </Paragraph>
<Paragraph position="3"> We implemented a cascade filtering strategy based on the likelihood score ρ, the frequency f, the length l and the entropy value e of the sequences. A first filter, F1(lmin, fmin, ρmin, emin), removes any sequence s for which l(s) < lmin, ρ(s) < ρmin, e(s) < emin or f(s) < fmin. A second filter, F2, removes sequences that are included in preferred ones. In terms of sequence reduction, applying F1(2, 2, 5.0, 0.2) to the 81,974 English sequences of at least two tokens seen at least twice in our training corpus left fewer than 50% of them (39,093): 17,063 (21%) were removed because of their low entropy value and 25,818 (31%) because of their low likelihood value. </Paragraph>
</Section>
<Section position="2" start_page="136" end_page="137" type="sub_section">
<SectionTitle> 3.2 Mapping </SectionTitle>
<Paragraph position="0"> Mapping the identified units (tokens or sequences) to their equivalents in the other language was achieved by training a new translation model (IBM 2) with the EM algorithm, as described in (Brown et al., 1993). This required grouping the tokens of our training corpus into sequences on the basis of the unit lexicons identified in the previous step (we refer to the result of this grouping as the sequence-based corpus). To deal with overlapping possibilities, we used a dynamic programming scheme which optimizes a criterion C (given by equation (4)) over the set S of all units collected for a given language plus all single words; G(w1 ... wn) is obtained by returning the path that maximizes B(n). We investigated several criteria C and found Cl, a length-based measure, to be the most satisfactory. </Paragraph>
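Below is a hedged sketch of the grouping operator G as a dynamic program over token positions. The unit lexicon and the squared-length reward are assumptions standing in for the length-based criterion Cl (equation (4), not recovered in this extraction); only the recurrence-plus-backtrace structure follows the description above.

```python
# Sketch of the grouping operator G (section 3.2): a dynamic program that
# segments a token sequence into lexicon units plus single words. The scoring
# below is an assumed "length-based" criterion: the squared length rewards
# grouping tokens into longer known units (a plain sum of lengths would be the
# same for every segmentation and prefer nothing).

from typing import List, Set

def group(tokens: List[str], units: Set[str], max_len: int = 6) -> List[str]:
    n = len(tokens)
    best = [float("-inf")] * (n + 1)   # B(i): best score for tokens[:i]
    back = [0] * (n + 1)               # backpointer to the start of the last group
    best[0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(0, i - max_len), i):
            span = " ".join(tokens[j:i])
            if i - j == 1 or span in units:
                score = best[j] + (i - j) ** 2   # assumed length-based reward
                if score > best[i]:
                    best[i], back[i] = score, j
    # Recover the path that maximizes B(n) by following the backpointers.
    groups, i = [], n
    while i > 0:
        groups.append(" ".join(tokens[back[i]:i]))
        i = back[i]
    return list(reversed(groups))

# Toy usage:
# group("the prime minister said".split(), {"the prime minister"})
# -> ["the prime minister", "said"]
```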
<Paragraph position="1"> Table 2 shows an output of the grouping function. </Paragraph>
<Paragraph position="2"> [Table 3 (the source-unit column was not recovered). For each of several source units of the training corpus, the table reports its 3-best ranked target associations (a, p), a being a token or a unit and p the translation probability; the second half of the table reports the NP-associations obtained after the filter described in the text. Recovered target associations: [nous devons, 0.61] [il faut, 0.19] [nous, 0.14]; [ce projet de loi, 0.35] [projet de loi ., 0.21] [projet de loi, 0.18]; [les canadiens, 0.26] [des canadiens, 0.21] [la population, 0.07]; [m. le président :, 0.80] [a, 0.07] [à la, 0.06]; [ce qui se passe, 0.21] [ce qui se, 0.16] [et, 0.15]; [évidemment, 0.26] [naturellement, 0.08] [bien sûr, 0.08]; [plaît-il à la chambre d'adopter, 0.49] [la motion ?, 0.42] [motion ...].] </Paragraph>
<Paragraph position="3"> We investigated three ways of estimating the parameters of the unit model. In the first one, E1, the translation parameters are estimated by applying the EM algorithm in a straightforward fashion over all entities (tokens and units) present at least twice in the sequence-based corpus. [Footnote 2: The entities seen only once are mapped to a special "unknown" word.] The next two methods filter the probabilities obtained with the E1 method. In E2, all probabilities p(t | s) are set to 0 whenever s is a token (not a unit), thus forcing the model to contain only associations between source units and target entities (tokens or units). In E3, any parameter of the model that involves a token is removed (that is, p(t | s) = 0 if t or s is a token); the resulting model thus contains only unit associations. In both cases, the final probabilities are renormalized. Table 3 shows a few entries from a unit model (Mu) obtained after 15 iterations of the EM algorithm on a sequence corpus resulting from the application of the length-based grouping criterion (Cl) over a lexicon of units whose likelihood score is above 5.0. The probabilities were obtained by applying method E2. </Paragraph>
<Paragraph position="4"> We found many partially correct associations (over the years/au fil des, we have/nous, etc.) that illustrate the weakness of decoupling the unit identification from the mapping problem. In most cases, however, these associations have a lower probability than the good ones. We also found a few erratic associations (the first time/c'était, some hon. members/t, etc.) due to distributional artifacts. It is also interesting to note that the good associations we found are not necessarily compositional in nature (we must/il faut, people of canada/les canadiens, of course/évidemment, etc.). </Paragraph>
</Section>
<Section position="3" start_page="137" end_page="138" type="sub_section">
<SectionTitle> 3.3 Filtering </SectionTitle>
<Paragraph position="0"> One way to increase the precision of the mapping process is to impose linguistic constraints on the sequences, such as simple noun-phrase constraints (Gaussier, 1995; Kupiec, 1993; Chen and Chen, 1994; Fung, 1995; Evans and Zhai, 1996). It is also possible to focus on non-compositional compounds, a key point in bilingual applications (Su et al., 1994; Melamed, 1997; Lin, 1999). Another interesting approach is to restrict sequences to those that do not cross constituent-boundary patterns (Wu, 1995; Furuse and Iida, 1996). In this study, we filtered for sequences that are likely to be noun phrases, using simple regular expressions over the associated part-of-speech tags. </Paragraph>
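As an illustration of this noun-phrase filter, here is a minimal sketch. The Penn-Treebank-style tag set and the particular regular expression are assumptions for illustration; the expressions actually used in the study are not given in this section.

```python
# Sketch of the NP filter (section 3.3): keep only candidate sequences whose
# part-of-speech tags match a simple noun-phrase pattern. The tag set
# (DT, JJ, NN, NNS, NNP) and the pattern itself are illustrative assumptions.

import re
from typing import List, Tuple

# Optional determiner, any number of adjectives, then one or more nouns,
# matched against a space-separated string of tags.
NP_PATTERN = re.compile(r"^(DT\s)?(JJ\s)*(NN|NNS|NNP)(\s(NN|NNS|NNP))*$")

def is_np(tagged_sequence: List[Tuple[str, str]]) -> bool:
    """tagged_sequence: [(word, pos_tag), ...] for one candidate unit."""
    tags = " ".join(tag for _, tag in tagged_sequence)
    return NP_PATTERN.match(tags) is not None

def filter_np(candidates: List[List[Tuple[str, str]]]) -> List[List[Tuple[str, str]]]:
    """Keep only the candidate sequences that look like noun phrases."""
    return [seq for seq in candidates if is_np(seq)]

# Toy usage: keeps the first sequence, rejects the second.
# filter_np([[("prime", "JJ"), ("minister", "NN")],
#            [("of", "IN"), ("course", "NN")]])
```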
<Paragraph position="1"> An excerpt of the association probabilities of a unit model trained on the NP-sequences only is given in table 3. Applying this filter (referred to as FNP in the following) to the 39,093 English sequences still surviving after the previous filters F1 and F2 removes 35,939 of them (92%). More than half of the 3,154 remaining NP-sequences contain only two words. </Paragraph>
<Paragraph position="2"> [Results table (data rows not recovered). For each model, the columns report: spared, the keystrokes saved; ok, the number of target units accepted by the user; good, the number of target units that matched the expected ones, whether they were proposed or not; nu, the number of sentences for which no target unit was found by the translation model; and u, the number of sentences for which at least one helpful unit was found by the model, though not necessarily proposed.] </Paragraph>
</Section>
</Section>
</Paper>