<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-3113">
  <Title>How Many Bits Are Needed To Store Probabilities for Phrase-Based Translation?</Title>
  <Section position="3" start_page="94" end_page="94" type="metho">
    <SectionTitle>
2 Previous work
</SectionTitle>
    <Paragraph position="0"> Most related work can be found in the area of speech recognition, where n-gram language models have been used for a while.</Paragraph>
    <Paragraph position="1"> Efforts targeting ef ciency have been mainly focused on pruning techniques (Seymore and Rosenfeld, 1996; Gao and Zhang, 2002), which permit to signi cantly reduce the amount of n-grams to be stored at a negligible cost in performance. Moreover, very compact data-structures for storing back-off n-gram models have been recently proposed by Raj and Whittaker (2003).</Paragraph>
    <Paragraph position="2"> Whittaker and Raj (2001) discuss probability encoding as a means to reduce memory requirements of an n-gram language model. Quantization of a 3-gram back-off model was performed by applying the k-means Lloyd-Max algorithm at each n-gram level. Experiments were performed on several large-vocabulary speech recognition tasks by considering different levels of compression. By encoded probabilities in 4 bits, the increase in word-error-rate was only around 2% relative with respect to a baseline using 32-bit oating point probabilities.</Paragraph>
    <Paragraph position="3"> Similar work was carried out in the eld of information retrieval, where memory ef ciency is instead related to the indexing data structure, which contains information about frequencies of terms in all the individual documents. Franz and McCarley (2002) investigated quantization of term frequencies by applying a binning method. The impact on retrieval performance was analyzed against different quantization levels. Results showed that 2 bits are suf cient to encode term frequencies at the cost of a negligible loss in performance.</Paragraph>
    <Paragraph position="4"> In our work, we investigate both data compression methods, namely the Lloyd's algorithm and the binning method, in a SMT framework.</Paragraph>
  </Section>
  <Section position="4" start_page="94" end_page="95" type="metho">
    <SectionTitle>
3 Quantization
</SectionTitle>
    <Paragraph position="0"> Quantization provides an effective way of reducing the number of bits needed to store oating point variables. The quantization process consists in partitioning the real space into a nite set of k quantization levels and identifying a center ci for each level, i = 1,... ,k. A function q(x) maps any real-valued point x onto its unique center ci. Cost of quantization is the approximation error between x and ci.</Paragraph>
    <Paragraph position="1"> If k = 2h, h bits are enough to represent a oating point variable; as a oating point is usually encoded in 32 bits (4 byte), the compression ratio is equal to 32/h1 . Hence, the compression ratio also gives an upper bound for the relative reduction of memory use, because it assumes an optimal implementation of data structures without any memory waste. Notice that memory consumption for storing the kentry codebook is negligible (k [?] 32 bits).</Paragraph>
    <Paragraph position="2"> As we will apply quantization on probabilistic distribution, we can restrict the range of real values between 0 and 1. Most quantization algorithms require a xed (although huge) amount of points in order to de ne the quantization levels and their centers. Probabilistic models used in SMT satisfy this requirement because the set of parameters larger than 0 is always limited.</Paragraph>
    <Paragraph position="3"> Quantization algorithms differ in the way partition of data points is computed and centers are identi ed. In this paper we investigate two different quantization algorithms.</Paragraph>
    <Section position="1" start_page="94" end_page="95" type="sub_section">
      <SectionTitle>
Lloyd's Algorithm
</SectionTitle>
      <Paragraph position="0"> Quantization of a nite set of real-valued data points can be seen as a clustering problem. A large family of clustering algorithms, called k-means algorithms (Kanungo et al., 2002), look for optimal centers ci which minimize the mean squared distance from each data point to its nearest center. The map between points and centers is trivially derived.</Paragraph>
      <Paragraph position="1"> 1In the computation of the compression ratio we take into account only the memory needed to store the probabilities of the observations, and not the memory needed to store the observations themselves which depends on the adopted data structures.  As no ef cient exact solution to this problem is known, either polynomial-time approximation or heuristic algorithms have been proposed to tackle the problem. In particular, Lloyd's algorithm starts from a feasible set of centers and iteratively moves them until some convergence criterion is satis ed.</Paragraph>
      <Paragraph position="2"> Finally, the algorithm nds a local optimal solution.</Paragraph>
      <Paragraph position="3"> In this work we applied the version of the algorithm available in the K-MEANS package2.</Paragraph>
    </Section>
    <Section position="2" start_page="95" end_page="95" type="sub_section">
      <SectionTitle>
Binning Method
</SectionTitle>
      <Paragraph position="0"> The binning method partitions data points into uniformly populated intervals or bins. The center of each bin corresponds to the mean value of all points falling into it. If Ni is the number of points of the i-th bin, and xi the smallest point in the i-th bin, a partition [xi,xi+1] results such that Ni is constant for each i = 0,... ,k [?] 1, where xk = 1 by default.</Paragraph>
      <Paragraph position="1"> The following map is thus de ned:</Paragraph>
      <Paragraph position="3"> Our implementation uses the following greedy strategy: bins are build by uniformly partition all different points of the data set.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="95" end_page="96" type="metho">
    <SectionTitle>
4 Phrase-based Translation System
</SectionTitle>
    <Paragraph position="0"> Given a string f in the source language, our SMT system (Federico and Bertoldi, 2005; Cettolo et al., 2005), looks for the target string e maximizing the posterior probability Pr(e,a  |f) over all possible word alignments a. The conditional distribution is computed with the log-linear model:</Paragraph>
    <Paragraph position="2"> where hr(e,f,a),r = 1 ... R are real valued feature functions.</Paragraph>
    <Paragraph position="3"> The log-linear model is used to score translation hypotheses (e,a) built in terms of strings of phrases, which are simple sequences of words. The translation process works as follows. At each step, a target phrase is added to the translation whose corresponding source phrase within f is identi ed through three random quantities: the fertility which establishes its length; the permutation which sets its rst position; 2www.cs.umd.edu/[?]mount/Projects/KMeans.</Paragraph>
    <Paragraph position="4"> the tablet which tells its word string. Notice that target phrases might have fertility equal to zero, hence they do not translate any source word. Moreover, untranslated words in f are also modeled through some random variables.</Paragraph>
    <Paragraph position="5"> The choice of permutation and tablets can be constrained in order to limit the search space until performing a monotone phrase-based translation.</Paragraph>
    <Paragraph position="6"> In any case, local word reordering is permitted by phrases.</Paragraph>
    <Paragraph position="7"> The above process is performed by a beam-search decoder and is modeled with twelve feature functions (Cettolo et al., 2005) which are either estimated from data, e.g. the target n-gram language models and the phrase-based translation model, or empirically xed, e.g. the permutation models.</Paragraph>
    <Paragraph position="8"> While feature functions exploit statistics extracted from monolingual or word-aligned texts from the training data, the scaling factors l of the log-linear model are empirically estimated on development data.</Paragraph>
    <Paragraph position="9"> The two most memory consuming feature functions are the phrase-based Translation Model (TM) and the n-gram Language Model (LM).</Paragraph>
    <Paragraph position="10"> Translation Model The TM contains phrase-pairs statistics computed on a parallel corpus provided with word-alignments in both directions. Phrase-pairs up to length 8 are extracted and singleton observations are pruned off. For each extracted phrase-pair ( ~f, ~e), four translation probabilities are estimated: a smoothed frequency of ~f given ~e a smoothed frequency of ~e given ~f an IBM model 1 based probability of ~e given ~f an IBM model 1 based probability of ~f given ~e Hence, the number of parameters of the translation models corresponds to 4 times the number of extracted phrase-pairs. From the point of view of quantization, the four types of probabilities are considered separately and a speci c codebook is generated for each type.</Paragraph>
    <Section position="1" start_page="95" end_page="96" type="sub_section">
      <SectionTitle>
Language Model
</SectionTitle>
      <Paragraph position="0"> The LM is a 4-gram back-off model estimated with the modi ed Kneser-Ney smoothing method (Chen and Goodman, 1998). Singleton pruning is applied on 3-gram and 4-gram statistics. In terms of num- null task parallel resources mono resources LM TM src trg words 1-gram 2-gram 3-gram 4-gram phrase pairs  ber of parameters, each n-gram, with n &lt; 4, has two probabilities associated with: the probability of the n-gram itself, and the back-off probability of the corresponding n + 1-gram extensions. Finally, 4grams have only one probability associated with. For the sake of quantization, two separate codebooks are generated for each of the rst three levels, and one codebook is generated for the last level. Hence, a total of 7 codebooks are generated. In all discussed quantized LMs, unigram probabilities are always encoded with 8 bits. The reason is that uni-gram probabilities have indeed the largest variability and do not contribute signi cantly to the total number of parameters.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>