<?xml version="1.0" standalone="yes"?>
<Paper uid="P06-2037">
  <Title>Low-cost Enrichment of Spanish WordNet with Automatically Translated Glosses: Combining General and Specialized Models</Title>
  <Section position="4" start_page="288" end_page="288" type="metho">
    <SectionTitle>
3 Data Sets and Evaluation Metrics
</SectionTitle>
    <Paragraph position="0"> As a general source of English-Spanish parallel text, we used a collection of 730,740 parallel sentences extracted from the Europarl corpus. These correspond exactly to the training data from the Shared Task 2: Exploiting Parallel Texts for Statistical Machine Translation from the ACL-2005 Workshop on Building and Using Parallel Texts: Data-Driven Machine Translation and Beyond5.</Paragraph>
    <Paragraph position="1"> To be used as specialized source, we extracted, from the MCR , the set of 6,519 English-Spanish parallel glosses corresponding to the already defined synsets in Spanish WordNet. These definitions corresponded to 5,698 nouns, 87 verbs, and 734 adjectives. Examples and parenthesized texts were removed. Parallel glosses were tokenized and case lowered. We discarded some of these parallel glosses based on the difference in length between the source and the target. The gloss average length for the resulting 5,843 glosses was 8.25 words for English and 8.13 for Spanish. Finally, gloss pairs were randomly split into training (4,843), development (500) and test (500) sets.</Paragraph>
    <Paragraph position="2"> Additionally, we counted on two large mono-lingual Spanish electronic dictionaries, consisting of 142,892 definitions (2,112,592 tokens) ('D1') (Mart'i, 1996) and 168,779 definitons (1,553,674 tokens) ('D2') (Vox, 1990), respectively.</Paragraph>
    <Paragraph position="3"> Regarding evaluation, we used up to four different metrics with the aim of showing whether the improvements attained are consistent or not.</Paragraph>
    <Paragraph position="4"> We have computed the BLEU score (accumulated up to 4-grams) (Papineni et al., 2001), the NIST score (accumulated up to 5-grams) (Doddington, 2002), the General Text Matching (GTM) F-measure (e = 1,2) (Melamed et al., 2003), and the METEOR measure (Banerjee and Lavie, 2005). These metrics work at the lexical level by rewarding n-gram matches between the candidate translation and a set of human references. Additionally, METEOR considers stemming, and allows for WordNet synonymy lookup.</Paragraph>
    <Paragraph position="5"> The discussion of the significance of the results will be based on the BLEU score, for which we computed a bootstrap resampling test of significance (Koehn, 2004b).</Paragraph>
  </Section>
  <Section position="5" start_page="288" end_page="291" type="metho">
    <SectionTitle>
4 Experimental Evaluation
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="288" end_page="288" type="sub_section">
      <SectionTitle>
4.1 Baseline Systems
</SectionTitle>
      <Paragraph position="0"> As explained in the introduction we built two individual baseline systems. The first baseline ('EU') system is entirely based on the training data from the Europarl corpus. The second baseline system ('WNG') is entirely based on the training set from of the in-domain corpus of parallel glosses. In the second case phrase pairs occurring only once in the training corpus are not discarded due to the extremely small size of the corpus.</Paragraph>
      <Paragraph position="1"> Table 1 shows results of the two baseline systems, both for the development and test sets. We compare the performance of the 'EU' baseline on these data sets with respect to the (in-domain) Europarl test set provided by the organizers of the ACL-2005 MT workshop. As expected, there is a very significant decrease in performance (e.g., from 0.24 to 0.08 according to BLEU) when the 'EU' baseline system is applied to the new domain. Some of this decrement is also due to a certain degree of free translation exhibited by the set of available 'quasi-parallel' glosses. We further discuss this issue in Section 5.</Paragraph>
      <Paragraph position="2"> The results obtained by 'WNG' are also very low, though slightly better than those of 'EU'. This is a very interesting fact. Although the amount of data utilized to construct the 'WNG' baseline is 150 times smaller than the amount utilized to construct the 'EU' baseline, its performance is higher consistently according to all metrics. We interpret this result as an indicator that models estimated from in-domain data provide higher precision.</Paragraph>
      <Paragraph position="3"> We also compare the results to those of a commercial system such as the on-line version 5.0 of SYSTRAN6, a general-purpose MT system based on manually-defined lexical and syntactic transfer rules. The performance of the baseline systems is significantly worse than SYSTRAN's on both development and test sets. This means that a rule-based system like SYSTRAN is more robust than the SMT-based systems. The difference against the specialized 'WNG' also suggests that the amount of data used to train the 'WNG' base-line is clearly insufficient.</Paragraph>
    </Section>
    <Section position="2" start_page="288" end_page="289" type="sub_section">
      <SectionTitle>
4.2 Combining Sources: Language Models
</SectionTitle>
      <Paragraph position="0"> In order to improve results, in first place we turned our eyes to language modeling. In addition to  baseline system on the ACL-2005 SMT workshop test set extracted from the Europarl Corpus. BLEU.n4 shows the accumulated BLEU score for 4-grams. NIST.n5 shows the accumulated NIST score for 5-grams. GTM.e1 and GTM.e2 show the GTM F1measure for different values of the e parameter (e = 1, e = 2, respectively). METEOR reflects the METEOR score. the language model built from the Europarl corpus ('EU') and the specialized language model based on the small training set of parallel glosses ('WNG'), two specialized language models, based on the two large monolingual Spanish electronic dictionaries ('D1' and 'D2') were used. We tried several configurations. In all cases, language models are combined with equal probability. See results, for the development set, in Table 2.</Paragraph>
      <Paragraph position="1"> As expected, the closer the language model is to the target domain, the better results. Observe how results using language models 'D1' and 'D2' outperform results using 'EU'. Note also that best results are in all cases consistently attained by using the 'WNG' language model. This means that language models estimated from small sets of in-domain data are helpful. A second conclusion is that a significant gain is obtained by incrementally adding (in-domain) specialized language models to the baselines, according to all metrics but BLEU for which no combination seems to significantly outperform the 'WNG' baseline alone. Observe that best results are obtained, except in the case of BLEU, by the system using 'EU' as translation model and 'WNG' as language model. We interpret this result as an indicator that translation models estimated from out-of-domain data are helpful because they provide recall. A third interesting point is that adding an out-of-domain language model ('EU') does not seem to help, at least combined with equal probability than in-domain models. Same conclusions hold for the test set, too.</Paragraph>
    </Section>
    <Section position="3" start_page="289" end_page="289" type="sub_section">
      <SectionTitle>
4.3 Tuning the System
</SectionTitle>
      <Paragraph position="0"> Adjusting the Pharaoh parameters that control the importance of the different probabilities that govern the search may yield significant improvements. In our case, it is specially important to properly adjust the contribution of the language models. We adjusted parameters by means of a software based on the Downhill Simplex Method in Multidimensions (William H. Press and Flannery, 2002). The tuning was based on the improvement attained in BLEU score over the development set. We tuned 6 parameters: 4 language models (llmEU, llmD1, llmD2, llmWNG), the translation model (lph), and the word penalty (lw)7.</Paragraph>
      <Paragraph position="1"> Results improve substantially. See Table 3. Best results are still attained using the 'EU' translation model. Interestingly, as suggested by Table 2, the weight of language models is concentrated on the 'WNG' language model (llmWNG = 0.95).</Paragraph>
    </Section>
    <Section position="4" start_page="289" end_page="291" type="sub_section">
      <SectionTitle>
4.4 Combining Sources: Translation Models
</SectionTitle>
      <Paragraph position="0"> In this section we study the possibility of combining out-of-domain and in-domain translation models aiming at achieving a good balance between precision and recall that yields better MT results.</Paragraph>
      <Paragraph position="1"> Two different strategies have been tried. In a first stragegy we simply concatenate the out-of-domain corpus ('EU') and the in-domain corpus ('WNG'). Then, we construct the translatation model ('EUWNG') as detailed in Section 2.1. A second manner to proceed is to linearly combine the two different translation models into a single translation model ('EU+WNG'). In this case, we can assign different weights (o) to the contribution of the different models to the search. We can also determine a certain threshold th which allows us  llmEU = 0.22, llmD1 = 0, llmD2 = 0.01, llmWNG = 0.95, lph = 1, and lw = [?]2.97, while when using the 'WNG' translation model final values are llmEU = 0.17, llmD1 = 0.07, llmD2 = 0.13, llmWNG = 1, lph = 0.95, and lw = [?]2.64.</Paragraph>
      <Paragraph position="3"> the models estimated from the Europarl corpus and the training set of parallel WordNet glosses, respectively. 'D1', and 'D2' denote the specialized language models estimated from the two dictionaries.</Paragraph>
      <Paragraph position="4"> Translation Model Language Model BLEU.n4 NIST.n5 GTM.e1 GTM.e2 METEOR</Paragraph>
      <Paragraph position="6"> for the two translation models, 'EU' and 'WNG'.</Paragraph>
      <Paragraph position="7"> to discard phrase pairs under a certain probability.</Paragraph>
      <Paragraph position="8"> These weights and thresholds were adjusted8 as detailed in Subsection 4.3. Interestingly, at combination time the importance of the 'WNG' translation model (otmWNG = 0.9) is much higher than that of the 'EU' translation model (otmEU = 0.1).</Paragraph>
      <Paragraph position="9"> Table 4 shows results for the two strategies.</Paragraph>
      <Paragraph position="10"> As expected, the 'EU+WNG' strategy consistently obtains the best results according to all metrics both on the development and test sets, since it allows to better adjust the relative importance of each translation model. However, both techniques achieve a very competitive performance. Results improve, according to BLEU, from 0.13 to 0.16, and from 0.11 to 0.14, for the development and test sets, respectively.</Paragraph>
      <Paragraph position="11"> We measured the statistical signficance of the overall improvement in BLEU.n4 attained with respect to the baseline results by applying the bootstrap resampling technique described by Koehn (2004b). The 95% confidence intervals extracted from the test set after</Paragraph>
      <Paragraph position="13"> tervals are not ovelapping, we can conclude that the performance of the best combined method is statistically higher than the ones of the two base-line systems.</Paragraph>
      <Paragraph position="14"> 4.5 How much in-domain data is needed? In principle, the more in-domain data we have the better, but these may be difficult or expensive to collect. Thus, a very interesting issue in the context of our work is how much in-domain data is needed in order to improve results attained using out-of-domain data alone. To answer this question we focus on the 'EU+WNG' strategy and analyze the impact on performance (BLEU.n4) of specialized models extracted from an incrementally bigger number of example glosses. The results are presented in the plot of Figure 1. We compute three variants separately, by considering the use of the in-domain data: only for the translation model (TM), only for the language model (LM), and simultaneously in both models (TM+LM). In order  MT system performance for the test set.</Paragraph>
      <Paragraph position="15"> to avoid the possible effect of over-fitting we focus on the behavior on the test set. Note that the optimization of parameters is performed at each point in the x-axis using only the development set.</Paragraph>
      <Paragraph position="16"> A significant initial gain of around 0.3 BLEU points is observed when adding as few as 100 glosses. In all cases, it is not until around 1,000 glosses are added that the 'EU+WNG' system stabilizes. After that, results continue improving as more in-domain data are added. We observe a very significant increase by just adding around 3,000 glosses. Another interesting observation is the boosting effect of the combination of TM and LM specialized models. While individual curves for TM and LM tend to be more stable with more than 4,000 added examples, the TM+LM curve still shows a steep increase in this last part.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="291" end_page="292" type="metho">
    <SectionTitle>
5 Error Analysis
</SectionTitle>
    <Paragraph position="0"> We inspected results at the sentence level based on the GTM F-measure (e = 1) for the best configuration of the 'EU+WNG' system. 196 sentences out from the 500 obtain an F-measure equal to or higher than 0.5 on the development set (181 sentences in the case of test set), whereas only 54 sentences obtain a score lower than 0.1. These numbers give a first idea of the relative usefulness of our system. Table 5 shows some translation cases selected for discussion. For instance, Case 1 is a clear example of unfair low score. The problem is that source and reference are not parallel but 'quasi-parallel'. Both glosses define the same concept but in a different way. Thus, metrics based on rewarding lexical similarities are not well suited for these cases. Cases 2, 3, 4 are examples of proper cooperation between 'EU' and 'WNG' models. 'EU' models provides recall, for instance by suggesting translation candidates for 'bombs' or 'price below'. 'WNG' models provide precision, for instance by choosing the right translation for 'an attack' or 'the act of'.</Paragraph>
    <Paragraph position="1"> We also compared the 'EU+WNG' system to SYSTRAN. In the case of SYSTRAN 167 sentences obtain a score equal to or higher than 0.5 whereas 79 sentences obtain a score lower than 0.1. These numbers are slightly under the performance of the 'EU+WNG' system. Table 6 shows some translation cases selected for discussion. Case 1 is again an example of both systems obtaining very low scores because of 'quasiparallelism'. Cases 2 and 3 are examples of SYSTRAN outperforming our system. In case 2 SYSTRAN exhibits higher precision in the translation of 'accompanying' and 'illustration', whereas in case 3 it shows higher recall by suggesting appropriate translation candidates for 'fibers', 'silkworm', 'cocoon', 'threads', and 'knitting'. Cases</Paragraph>
  </Section>
  <Section position="7" start_page="292" end_page="292" type="metho">
    <SectionTitle>
FE FW FEW Source OutE OutW OutEW Reference
</SectionTitle>
    <Paragraph position="0"> 0.0000 0.1333 0.1111 of the younger de acuerdo con de la younger de acuerdo con que tiene of two boys el m'as joven de dos boys el m'as joven de menos edad with the same de dos boys tiene el mismo dos muchachos family name con la misma nombre familia tiene el mismo familia fama nombre familia 0.2857 0.2500 0.5000 an attack atacar por ataque ataque ataque con by dropping cayendo realizado por realizado por bombas bombs bombas dropping bombs cayendo bombas 0.1250 0.7059 0.5882 the act of acto de la acci'on y efecto acci'on y efecto acci'on y efecto informing by informaci'on de informing de informaba de informar verbal report por verbales por verbal por verbales con una expliponencia explicaci'on explicaci'on caci'on verbal 0.5000 0.0000 0.5000 a price below un precio por una price un precio por precio que est'a the standard debajo de la below n'umbero debajo de la por debajo de price norma precio est'andar price est'andar precio lo normal  F-measure attained by the 'EU', 'WNG' and 'EU+WNG' systems, respectively. 'Source', OutE, OutW and OutEW refer to the input and the output of the systems. 'Reference' corresponds to the expected output. 4 and 5 are examples where our system outperforms SYSTRAN. In case 4, our system provides higher recall by suggesting an adequate translation for 'top of something'. In case 5, our system shows higher precision by selecting a better translation for 'rate'. However, we observed that SYSTRAN tends in most cases to construct sentences exhibiting a higher degree of grammaticality.</Paragraph>
  </Section>
  <Section position="8" start_page="292" end_page="292" type="metho">
    <SectionTitle>
6 Conclusions
</SectionTitle>
    <Paragraph position="0"> In this work, we have enriched every synset in Spanish WordNet with a preliminary gloss, which can be later updated in a lighter process of manual revision. Though imperfect, this material constitutes a very valuable resource. For instance, Word-Net glosses have been used in the past to generate sense tagged corpora (Mihalcea and Moldovan, 1999), or as external knowledge for Question Answering systems (Hovy et al., 2001).</Paragraph>
    <Paragraph position="1"> We have also shown the importance of using a small set of in-domain parallel sentences in order to adapt a phrase-based general SMT system to a new domain. In particular, we have worked on specialized language and translation models and on their combination with general models in order to achieve a proper balance between precision (specialized in-domain models) and recall (general out-of-domain models). A substantial increase is consistently obtained according to standard MT evaluation metrics, which has been shown to be statistically significant in the case of BLEU. Broadly speaking, we have shown that around 3,000 glosses (very short sentence fragments) suffice in this domain to obtain a significant improvement. Besides, all the methods used are language independent, assumed the availability of the required in-domain additional resources. In the future we plan to work on domain independent translation models built from WordNet itself. We may use the WordNet topology to provide translation candidates weighted according to the given domain. Moreover, we are experimenting the applicability of current Word Sense Disambiguation (WSD) technology to MT. We could favor those translation candidates showing a closer semantic relation to the source. We believe that coarse-grained is sufficient for the purpose of MT.</Paragraph>
  </Section>
  <Section position="9" start_page="292" end_page="292" type="metho">
    <SectionTitle>
Acknowledgements
</SectionTitle>
    <Paragraph position="0"> This research has been funded by the Spanish Ministry of Science and Technology (ALIADO TIC2002-04447-C02) and the Spanish Ministry of Education and Science (TRANGRAM, TIN200407925-C03-02). Our research group, TALP Research Center, is recognized as a Quality Research Group (2001 SGR 00254) by DURSI, the Research Department of the Catalan Government.</Paragraph>
    <Paragraph position="1"> Authors are grateful to Patrik Lambert for providing us with the implementation of the Simplex Method, and specially to German Rigau for motivating in its origin all this work.</Paragraph>
  </Section>
class="xml-element"></Paper>