<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-2402">
  <Title>Grouping Multi-word Expressions According to Part-Of-Speech in Statistical Machine Translation</Title>
  <Section position="3" start_page="9" end_page="9" type="metho">
    <SectionTitle>
2 Baseline Translation System
</SectionTitle>
    <Paragraph position="0"> This section describes the SMT approach that was used in this work. A more detailed description of the presented translation system is available in Mari~no et al. (2005). This approach implements a translation model which is based on bilingual n-grams, and was developed by de Gispert and Mari~no (2002).</Paragraph>
    <Paragraph position="1"> The bilingual n-gram translation model actually constitutes a language model of bilingual units which are referred to as tuples. This model approximates the joint probability between source and target languages by using 3-grams as it is described in the following equation:</Paragraph>
    <Paragraph position="3"> where t refers to target, s to source and (t,s)n to the nth tuple of a given bilingual sentence pair.</Paragraph>
    <Paragraph position="4"> Tuples are extracted from a word-to-word aligned corpus. More specifically, word-to-word alignments are performed in both directions, source-to-target and target-to-source, by using GIZA++ (Och and Ney, 2003), and tuples are extracted from the union set of alignments according to the following constraints (de Gispert and Mari~no, 2004): * a monotonous segmentation of each bilingual sentence pairs is produced, * no word inside the tuple is aligned to words outside the tuple, and * no smaller tuples can be extracted without violating the previous constraints.</Paragraph>
    <Paragraph position="5"> As a consequence of these constraints, only one segmentation is possible for a given sentence pair. Figure 1 presents a simple example illustrating the tuple extraction process.</Paragraph>
    <Paragraph position="6">  aligned bilingual sentence pair.</Paragraph>
    <Paragraph position="7"> A tuple set is extracted for each translation direction, Spanish-to-English and English-to-Spanish. Then the tuple 3-gram models are trained by using the SRI Language Modelling toolkit (Stolcke, 2002).</Paragraph>
    <Paragraph position="8"> The search engine for this translation system was developed by Crego et al. (2005). It implements a beam-search strategy based on dynamic programming. The decoder's monotonic search modality was used.</Paragraph>
    <Paragraph position="9"> This decoder was designed to take into account various different models simultaneously, so translation hypotheses are evaluated by considering a log-linear combination of feature functions. These feature functions are the translation model, a target language model, a word bonus model, a lexical model and an inverse lexical model.</Paragraph>
  </Section>
  <Section position="4" start_page="9" end_page="11" type="metho">
    <SectionTitle>
3 Experimental Procedure
</SectionTitle>
    <Paragraph position="0"> In this section we describe the technique used to see the effect of multi-words information on the translation model described in section 2.</Paragraph>
    <Section position="1" start_page="9" end_page="10" type="sub_section">
      <SectionTitle>
3.1 Bilingual Multi-words Extraction
</SectionTitle>
      <Paragraph position="0"> First, BMWEs were automatically extracted from the parallel training corpus and the most relevant ones were stored in a dictionary.</Paragraph>
      <Paragraph position="1">  For BMWE extraction, the method proposed by Lambert and Castell (2004) was used. This  verdad . . . . . . + es . . . . . + .</Paragraph>
      <Paragraph position="2"> esto . . . . + . .</Paragraph>
      <Paragraph position="3"> ; . . . + . . .</Paragraph>
      <Paragraph position="4"> siento . [?]+ . . .</Paragraph>
      <Paragraph position="5"> lo [?] [?] . . . .</Paragraph>
      <Paragraph position="6"> I 'm sorry , this is true  source links are represented respectively by horizontal and vertical dashes. method is based on word-to-word alignments which are different in the source-target and target-source directions, such as the alignments trained to extract tuples (section 2). Multi-words like idiomatic expressions or collocations can typically not be aligned word-to-word, and cause a (sourcetarget and target-source) asymmetry in the alignment matrix. An asymmetry in the alignment matrix is a sub-matrix where source-target and target-source links are different. If a word is part of an asymmetry, all words linked to it are also part of this asymmetry. An example is depicted in figure 2.</Paragraph>
      <Paragraph position="7"> In this method, asymmetries in the training corpus are detected and stored as possible BMWEs. Accurate statistics are needed to score each BMWE entry. In the identification phase (section 3.3), these scores permit to prioritise for the selection of some entries with respect to others. Previous experiments (Lambert and Banchs, 2005) have shown than the large set of bilingual phrases described in the following section provides better statistics than the set of asymmetry-based BMWEs.</Paragraph>
      <Paragraph position="8">  Here we refer to Bilingual Phrase (BP) as the bilingual phrases used by Och and Ney (2004). The BP are pairs of word groups which are supposed to be the translation of each other. The set of BP is consistent with the alignment and consists of all phrase pairs in which all words within the target language are only aligned to the words of the source language and vice versa. At least one word of the target language phrase has to be aligned with at least one word of the source language phrase. Finally, the algorithm takes into account possibly unaligned words at the boundaries of the target or source language phrases.</Paragraph>
      <Paragraph position="9"> We extracted all BP of length up to four words, with the algorithm described by Och and Ney. Then we estimated the phrase translation probability distribution by relative frequency:</Paragraph>
      <Paragraph position="11"> In equation 2, s and t stand for the source and target side of the BP, respectively. N(t,s) is the number of times the phrase s is translated by t, and N(s) is the number of times s occurs in the corpus. We took the minimum of both direct and inverse relative frequencies as probability of a BP.</Paragraph>
      <Paragraph position="12"> If this minimum was below some threshold, the BP was pruned. Otherwise, this probability was multiplied by the number of occurrences N(t,s) of this phrase pair in the whole corpus. A weight l was introduced to balance the respective importance of relative frequency and number of occurrences, as shown in equation 3:</Paragraph>
      <Paragraph position="14"> We performed the intersection between the entire BP set and the entire asymmetry based multi-words set, keeping BP scores. Notice that the entire set of BP is not adequate for our dictionary because BP are extracted from all parts of the alignment (and not in asymmetries only), so most BP are not BMWEs but word sequences that can be decomposed and translated word to word.</Paragraph>
    </Section>
    <Section position="2" start_page="10" end_page="11" type="sub_section">
      <SectionTitle>
3.2 Lexical and Morpho-syntactic Filters
</SectionTitle>
      <Paragraph position="0"> In English and Spanish, a list of stop words3 (respectively 19 and 26) was established. The BMWE dictionary was also processed by a Part-Of-Speech (POS) tagger and eight rules were written to filter out noisy entries. These rules depend on the tag set used. Examples of criteria to reject a BMWE include:  * Its source or target side begins with a coordination conjunction (except &amp;quot;nor&amp;quot;, in English) * Its source or target side ends with an indefinite determiner English data have been POS-tagged using the TnT tagger (Brants, 2000), after the lemmas have been extracted with wnmorph, included in the Wordnet package (Miller et al., 1991). POS-tagging for Spanish has been performed using the FreeLing analysis tool (Carreras et al., 2004).</Paragraph>
      <Paragraph position="1"> Finally, the BMWE set has been divided in three subsets, according to the following criteria, applied in this order:  * If source AND target sides of a BMWE contain at least a verb, it is assigned to the &amp;quot;verb&amp;quot; class.</Paragraph>
      <Paragraph position="2"> * If source AND target sides of a BMWE contain at least a noun, it is assigned to the &amp;quot;noun&amp;quot; class.</Paragraph>
      <Paragraph position="3"> * Otherwise, it is assigned to the &amp;quot;misc&amp;quot; class (miscellaneous). Note that this class is mainly composed of adverbial phrases.</Paragraph>
    </Section>
    <Section position="3" start_page="11" end_page="11" type="sub_section">
      <SectionTitle>
3.3 Multi-Words Identification
</SectionTitle>
      <Paragraph position="0"> Identification consists, first, of the detection of all possible BMWE(s) in the corpus, and second, of the selection of the relevant candidates.</Paragraph>
      <Paragraph position="1"> The detection part simply means matching the entries of the dictionaries described in the previous subsections. In the example of figure 2, the following BMWEs would have been detected (the number on the right is the score): Then, selection in a sentence pair runs as follows. First, the BMWE with highest score among the possible candidates is considered and its corresponding positions are set as covered. If this BMWE satisfies the selection criterion, the corresponding words in the source and target sentences are grouped as a unique token. This process is repeated until all word positions are covered in the sentence pair, or until no BMWE matches the positions remaining to cover.</Paragraph>
      <Paragraph position="2"> The selection criterion rejects candidates whose words are linked to exactly one word. Thus in the example, &amp;quot;esto - this is&amp;quot; would not be selected. This is correct, because the subject &amp;quot;esto&amp;quot; (this) of the verb &amp;quot;es&amp;quot; (is) in Spanish is not omitted, so that &amp;quot;this is - es&amp;quot; does not act as BMWE (&amp;quot;esto&amp;quot; should be translated to &amp;quot;this&amp;quot; and &amp;quot;is&amp;quot; to &amp;quot;es&amp;quot;). At the end of the identification process the sentence pair of figure 2 would be the following: &amp;quot;lo siento ; esto es verdad - I 'm sorry , this is true&amp;quot;.</Paragraph>
      <Paragraph position="3"> In order to increase the recall, BMWE detection was insensitive to the case of the first letter of each multi-word. The detection engine also allows a search based on lemmas. Two strategies are possible. In the first one, search is first carried out with full forms, so that lemmas are resorted to only if no match is found with full forms. In the second strategy, only lemmas are considered.</Paragraph>
    </Section>
    <Section position="4" start_page="11" end_page="11" type="sub_section">
      <SectionTitle>
3.4 Re-alignment
</SectionTitle>
      <Paragraph position="0"> The modified training corpus, with identified BMWEs grouped in a unique &amp;quot;super-token&amp;quot; was aligned again in the same way as explained in section 2. By grouping multi-words, we increased the size of the vocabulary and thus the sparseness of data. However, we expect that if the meaning of the multi-words expressions we grouped is effectively different from the meaning of the words they contain, the individual word probabilities should be improved.</Paragraph>
      <Paragraph position="1"> After re-aligning, we unjoined the super-tokens that had been grouped in the previous stage, correcting the alignment set accordingly. More precisely, if two super-tokens A and B were linked together, after ungrouping them into various tokens, every word of A was linked to every word of B. Translation units were extracted from this corrected alignment, with the unjoined sentence pairs (i.e.the same as in the baseline). So the only difference with respect to the baseline lied in the alignment, and thus in the distribution of translation units and in lexical model probabilities.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="11" end_page="14" type="metho">
    <SectionTitle>
4 Experimental Results
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="11" end_page="12" type="sub_section">
      <SectionTitle>
4.1 Training and Test Data
</SectionTitle>
      <Paragraph position="0"> Our task was word alignment and translation of parliamentary session transcriptions of the European Parliament (EPPS). These data are currently available at the Parliament's website.4 They  included session transcriptions from April 1996 until September 2004, and from November 15th until November 18th, 2004, respectively. Translation test data include two reference sets. Alignment test data was a subset of the training data (Lambert et al., 2006).</Paragraph>
      <Paragraph position="1"> Table 1 presents some statistics of the various data sets for each considered language: English (eng) and Spanish (spa). More specifically, the statistics presented in Table 1 are, the total number of sentences, the total number of words, the vocabulary size (or total number of distinct words) and the average number of words per sentence.</Paragraph>
    </Section>
    <Section position="2" start_page="12" end_page="12" type="sub_section">
      <SectionTitle>
4.2 Evaluation measures
</SectionTitle>
      <Paragraph position="0"> Details about alignment evaluation can be found in Lambert et al. (2006). The alignment test data contain unambiguous links (called S or Sure) and ambiguous links (called P or Possible). If there is a P link between two words in the reference, a computed link (i.e. to be evaluated) between these words is acceptable, but not compulsory. On the contrary, if there would be an S link between these words in the reference, a computed link would be compulsory. In this paper, precision refers to the proportion of computed links that are present in the reference. Recall refers to the proportion of reference Sure links that were computed. The alignment error rate (AER) is given by the following formula:</Paragraph>
      <Paragraph position="2"> where A is the set of computed links, GS is the set of Sure reference links and G is the entire set of reference links.</Paragraph>
      <Paragraph position="3"> As for translation evaluation, we used the following measures: WER (word error rate) or mWER (multireference word error rate) The WER is the minimum number of substitution, insertion and deletion operations that must be performed to convert the generated sentence into the reference target sentence. For the mWER, a whole set of reference translations is used. In this case, for each translation hypothesis, the edit distance to the most similar sentence is calculated.</Paragraph>
      <Paragraph position="4"> BLEU score This score measures the precision of unigrams, bigrams, trigrams, and fourgrams with respect to a whole set of reference translations, and with a penalty for too short sentences (Papineni et al., 2001). BLEU measures accuracy, thus larger scores are better.</Paragraph>
    </Section>
    <Section position="3" start_page="12" end_page="13" type="sub_section">
      <SectionTitle>
4.3 Multi-words in Training Data
</SectionTitle>
      <Paragraph position="0"> In this section we describe the results of the BMWE extraction and detection techniques applied to the training data.</Paragraph>
      <Paragraph position="1"> 4.3.1 Description of the BMWE dictionaries Parameters of the extraction process have been optimised with the alignment development corpus available with the alignment test corpus. With these parameters, a dictionary of 60k entries was extracted. After applying the lexical and morpho-syntactic filters, 45k entries were left. The best 30k entries (hereinafter referred to as all) have been selected for the experiments and divided in the three groups mentioned in section 3.2. verb, noun and misc (miscellaneous) dictionaries contained respectively 11797, 9709 and 8494 entries. Table 2 shows recall and precision for the BMWEs identified with each dictionary. The first line is the evaluation of the MWEs obtained with the best 30k entries of the dictionary before filtering. Alignments evaluated in table 2 contained only links corresponding to the identified BMWEs. For an identified BMWE, a link was introduced between each word of the source side and each word of the target side. Nevertheless, the test data contained the whole set of links.</Paragraph>
      <Paragraph position="2"> From table 2 we see the dramatic effect of the filters. The precision for nouns is lower than for  the various dictionaries.</Paragraph>
      <Paragraph position="3"> the other categories because many word groups which were identified, like &amp;quot;European Parliament Parlamento europeo&amp;quot;, are not aligned as a group in the alignment reference. Notice also that the data in table 2 reflects the precision of bilingual MWE, which is a lower bound of the precision of &amp;quot;supertokens&amp;quot; formed in each sentence, the quantity that matters in our experiment.</Paragraph>
      <Paragraph position="4"> Identification of BMWE based on lemmas has also been experimented. However, with lemmas, the selection phase is more delicate. With our basic selection criterion (see section 3.3), the quality of MWEs identified was worse so we based identification on full forms.</Paragraph>
      <Paragraph position="5"> Figure 3 shows the first 10 entries in the misc dictionary, along with their renormalised score.</Paragraph>
      <Paragraph position="6"> Notice that &amp;quot;the EU - la UE&amp;quot;, &amp;quot;young people j'ovenes&amp;quot; and &amp;quot;the WTO - la OMC&amp;quot; have been incorrectly classified due to POS-tagging errors.</Paragraph>
      <Paragraph position="7">  Table 3 shows, for each language, the MWE vocabulary size after the identification process, and how many times a MWE has been grouped as a unique token (instances). The different number of instances between Spanish and English correspond to one-to-many BMWEs. In general more MWEs are grouped in the Spanish side, because English is a denser language. However, the omission of the subject in Spanish causes the inverse situation for verbs.</Paragraph>
      <Paragraph position="8">  the various dictionaries. ALL refers to the 30k best entries with filters.</Paragraph>
    </Section>
    <Section position="4" start_page="13" end_page="14" type="sub_section">
      <SectionTitle>
4.4 Alignment and Translation Results
</SectionTitle>
      <Paragraph position="0"> Tables 4 and 5 show the effect of aligning the corpus when the various categories of multi-words have been previously grouped.</Paragraph>
      <Paragraph position="1">  alignment model in the baseline and after grouping MWE with all dictionary entries. The multi-word tokens &amp;quot;in other words&amp;quot; and &amp;quot;es decir&amp;quot; do not exist in the baseline.</Paragraph>
      <Paragraph position="2"> In table 4 we see how word-to-word lexical probabilities of the alignment model can be favourably modified. In the baseline, due to presence of the fixed expression &amp;quot;in other words es decir&amp;quot;, the probability of &amp;quot;words&amp;quot; given &amp;quot;decir&amp;quot; (&amp;quot;say&amp;quot; in English) is high. With this expression grouped, probabilities p(words|decir) and p(other|decir) vanish, while p(say|decir) is reinforced. These observations allowed to expect that with many individual probabilities improved, a global improvement of the alignment would occur.</Paragraph>
      <Paragraph position="3"> However, table 5 shows that alignment is not better when trained with BMWEs grouped as a unique token.</Paragraph>
      <Paragraph position="4"> A closer insight into alignments confirms that they have not been improved globally. Changes with respect to the baseline are very localised and correspond directly to the grouping of the BMWEs present in each sentence pair.</Paragraph>
      <Paragraph position="5"> Table 6 presents the automatic translation eval- null uation results. In the Spanish to English direction, BMWEs seem to have a negative influence. In the English to Spanish direction, no significant improvement or worsening is observed.</Paragraph>
      <Paragraph position="6">  In order to understand these results better, we performed a manual error analysis for the first 50 sentences of the test corpus. We analysed, for the experiment with all dictionary entries (&amp;quot;All&amp;quot; line of table 6), the changes in translation with respect to the baseline. We counted how many changes had a neutral, positive or negative effect on translation quality. Results are shown in table 7. Notice that approximatively half of these changes were directly related to the presence some BMWE.</Paragraph>
      <Paragraph position="7"> This study permitted to see interesting qualitative features. First, BMWEs have a clear influence on translation, sometimes positive and sometimes negative, with a balance which appears to be null in this experiment. In many examples BMWEs allowed a group translation instead of an incorrect word to word literal translation. For instance, &amp;quot;Red Crescent&amp;quot; was translated by &amp;quot;Media Luna Roja&amp;quot; instead of &amp;quot;Cruz Luna&amp;quot; (cross moon). Two main types of error were observed. The first ones are related to the quality of BMWEs. Determiners, or particles like &amp;quot;of&amp;quot;, which are present in BMWEs are mistakenly inserted in the translations. Some errors are caused by inadequate BMWEs. For example &amp;quot;looking at - si analizamos&amp;quot; (&amp;quot;if we analyse&amp;quot;) cannot be used in the sense of looking with the eyes. The second type of error is related to the rigidity and data sparseness introduced in the bilingual n-gram model. For example, when inflected forms are encapsulated in a BMWE, the model looses flexibility to translate the correct inflection. Another typical error is caused by the use of back-off (n-1)-grams in the bilingual language model, when the n-gram is not any more available because of increased data sparseness.</Paragraph>
      <Paragraph position="8"> The error analysis did not give explanation for why the effect of BMWEs is so different for different translation directions. A possible hypothesis would be that BMWEs help in translating from a denser language. However, in this case, verbs would be expected to help relatively more in the Spanish to English direction, since there are more verb group instances in the English side.</Paragraph>
      <Paragraph position="9">  translations between the baseline and the BMWE experiment with &amp;quot;ALL&amp;quot; dictionary. S and E stand for Spanish and English, respectively.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>