<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-3113">
  <Title>How Many Bits Are Needed To Store Probabilities for Phrase-Based Translation?</Title>
  <Section position="6" start_page="96" end_page="98" type="evalu">
    <SectionTitle>
5 Experiments
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="96" end_page="97" type="sub_section">
      <SectionTitle>
Data and Experimental Framework
</SectionTitle>
      <Paragraph position="0"> We performed experiments on two large vocabulary translation tasks: the translation of European Parliamentary Plenary Sessions (EPPS) (Vilar et al., 2005) from Spanish to English, and the translation of documents from Chinese to English as proposed by the NIST MT Evaluation Workshops3.</Paragraph>
      <Paragraph position="1"> Translation of EPPS is performed on the so-called nal text editions, which are prepared by the translation of ce of the European Parliament. Both the training and testing data were collected by the TC-STAR4 project and were made freely available to participants in the 2006 TC-STAR Evaluation Campaign. In order to perform experiments under different data sparseness conditions, four subsamples of the training data with different sizes were generated, too.</Paragraph>
      <Paragraph position="2"> Training and test data used for the NIST task are  available through the Linguistic Data Consortium5. Employed training data meet the requirements set for the Chinese-English large-data track of the 2005 NIST MT Evaluation Workshop. For testing we used instead the NIST 2003 test set.</Paragraph>
      <Paragraph position="3"> Table 1 reports statistics about the training data of each task and the models estimated on them. That is, the number of running words of source and target languages, the number of n-grams in the language model and the number phrase-pairs in the translation model. Table 2 reports instead statistics about the test sets, namely, the number of source sentences and running words in the source part and in the gold reference translations.</Paragraph>
      <Paragraph position="4"> Translation performance was measured in terms of BLEU score, NIST score, word-error rate (WER), and position independent error rate (PER). Score computation relied on two and four reference translations per sentence, respectively, for the EPPS and NIST tasks. Scores were computed in case-insensitive modality with punctuation. In general, none of the above measures is alone suf ciently informative about translation quality, however, in the community there seems to be a preference toward reporting results with BLEU. Here, to be on the safe side and to better support our ndings we will report results with all measures, but will limit discussion on performance to the BLEU score.</Paragraph>
      <Paragraph position="5"> In order to just focus on the effect of quantiza- null tion, all reported experiments were performed with a plain con guration of the ITC-irst SMT system.</Paragraph>
      <Paragraph position="6"> That is, we used a single decoding step, no phrase re-ordering, and task-dependent weights of the log-linear model.</Paragraph>
      <Paragraph position="7"> Henceforth, LMs and TM quantized with h bits are denoted with LM-h and TM-h, respectively.</Paragraph>
      <Paragraph position="8"> Non quantized models are indicated with LM-32 and TM-32.</Paragraph>
      <Paragraph position="9"> Impact of Quantization on LM and TM A rst set of experiments was performed on the EPPS task by applying probability quantization either on the LM or on the TMs. Figures 1 and 2 compare the two proposed quantization algorithms (LLOYD and BINNING) against different levels of quantization, namely 2, 3, 4, 5, 6, and 8 bits.</Paragraph>
      <Paragraph position="10"> The scores achieved by the non quantized models (LM-32 and TM-32) are reported as reference.</Paragraph>
      <Paragraph position="11"> The following considerations can be drawn from these results. The Binning method works slightly, but not signi cantly, better than the Lloyd's algorithm, especially with the highest compression ratios. null In general, the LM seems less affected by data compression than the TM. By comparing quantization with the binning method against no quantization, the BLEU score with LM-4 is only 0.42% relative worse (54.78 vs 54.55). Degradation of BLEU score by TM-4 is 0.77% (54.78 vs 54.36). For all the models, encoding with 8 bits does not affect translation quality at all.</Paragraph>
      <Paragraph position="12"> In following experiments, binning quantization was applied to both LM and TM. Figure 3 plots all scores against different levels of quantization. As references, the curves corresponding to only  different quantization levels of the LM and TM.</Paragraph>
      <Paragraph position="13"> LM quantization (LM-h) and only TM quantization (TM-h) are shown. Independent levels of quantization of the LM and TM were also considered. BLEU scores related to several combinations are reported in Table 3.</Paragraph>
      <Paragraph position="14"> Results show that the joint impact of LM and TM quantization is almost additive. Degradation with 4 bits quantization is only about 1% relative (from 54.78 to 54.23). Quantization with 2 bits is surprisingly robust: the BLEU score just decreases by 4.33% relative (from 54.78 to 52.41).</Paragraph>
    </Section>
    <Section position="2" start_page="97" end_page="98" type="sub_section">
      <SectionTitle>
Quantization vs. Data Sparseness
</SectionTitle>
      <Paragraph position="0"> Quantization of LM and TM was evaluated with respect to data-sparseness. Quantized and not quantized models were trained on four subset of the EPPS corpus with decreasing size. Statistics about these sub-corpora are reported in Table 1. Quantization was performed with the binning method using 2, 4, and 8 bit encodings. Results in terms of BLEU score are plotted in Figure 4. It is evident that the gap in BLEU score between the quantized and not quantized models is almost constant under different training conditions. This result suggests that performance of quantized models is not affected by data sparseness.</Paragraph>
      <Paragraph position="1">  A subset of quantization settings tested with the EPPS tasks was also evaluated on the NIST task.</Paragraph>
      <Paragraph position="2"> Results are reported in Table 4.</Paragraph>
      <Paragraph position="3"> Quantization with 8 bits does not affect performance, and gives even slightly better scores. Also quantization with 4 bits produces scores very close to those of non quantized models, with a loss in BLEU score of only 1.60% relative. However, pushing quantization to 2 bits signi cantly deteriorates performance, with a drop in BLEU score of 9.96% relative.</Paragraph>
      <Paragraph position="4"> In comparison to the EPPS task, performance degradation due to quantization seems to be twice as large. In conclusion, consistent behavior is observed among different degrees of compression. Absolute loss in performance, though quite different from the EPPS task, remains nevertheless very reasonable.</Paragraph>
      <Paragraph position="5"> Performance vs. Compression From the results of single versus combined compression, we can reasonably assume that performance degradation due to quantization of LM and TM probabilities is additive. Hence, as memory savings on the two models are also independent we can look at the optimal trade-off between performance and compression separately. Experiments on the NIST and EPPS tasks seem to show that encoding of LM and TM probabilities with 4 bits provides the best trade-off, that is a compression ratio of 8 with a relative loss in BLEU score of 1% and 1.6%. It can be seen that score degradation below 4 bits grows generally faster than the corresponding memory savings. null</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>