<?xml version="1.0" standalone="yes"?> <Paper uid="W05-0815"> <Title>Experiments Using MAR for Aligning Corpora</Title>
<Section position="3" start_page="95" end_page="96" type="metho"> <SectionTitle> 2 The MAR </SectionTitle>
<Paragraph position="0"> We provide here a brief description of the model; a more detailed presentation can be found in (Vilar and Vidal, 2005). The idea is that the translation of a sentence $\bar{x}$ into a sentence $\bar{y}$ can be performed in the following steps:1 (a) If $\bar{x}$ is small enough, IBM's model 1 (Brown et al., 1993) is employed for the translation.</Paragraph>
<Paragraph position="1"> (b) If not, a cut point is selected in $\bar{x}$, yielding two parts that are translated independently by applying the same procedure recursively.</Paragraph>
<Paragraph position="2"> (c) The two translations are concatenated either in the same order in which they were produced or in reverse order (second part first).</Paragraph>
<Section position="1" start_page="95" end_page="95" type="sub_section"> <SectionTitle> 2.1 Model parameters </SectionTitle>
<Paragraph position="0"> Apart from the parameters of model 1 (a stochastic dictionary and a discrete distribution of lengths), each of the steps above defines a set of parameters.</Paragraph>
<Paragraph position="1"> We now consider each set in turn.</Paragraph>
<Paragraph position="2"> Deciding the submodel The first decision is whether to use IBM's model 1 or to apply the MAR recursively. This decision is based on the length of $\bar{x}$: a table $M$ is used so that $M(|\bar{x}|)$ gives the probability of translating $\bar{x}$ directly with model 1.</Paragraph>
<Paragraph position="4"> Deciding the cut point It is assumed that the probability of cutting the input sentence at a given position $b$ is most influenced by the words around it: $x_b$ and $x_{b+1}$. We use a table $B$ such that $p(b \mid \bar{x}) = B(x_b, x_{b+1}) / \sum_{b'=1}^{|\bar{x}|-1} B(x_{b'}, x_{b'+1})$.</Paragraph>
<Paragraph position="6"> That is, a weight is assigned to each pair of words and the weights are normalized in order to obtain a proper probability distribution.</Paragraph>
<Paragraph position="7"> 1We use the following notational conventions. A string or sequence of words is indicated by a bar, as in $\bar{x}$; individual words from the sequence carry a subindex and no bar, as in $x_i$; substrings are indicated with the first and last position, as in $\bar{x}_i^j$. Finally, when the final position of the substring is also the last of the string, a dot is used, as in $\bar{x}_i^{\cdot}$.</Paragraph>
<Paragraph position="9"> Deciding the concatenation direction The direction of the concatenation is also decided as a function of the two words adjacent to the cut point: a table gives the probabilities $p(D \mid x_b, x_{b+1})$ and $p(I \mid x_b, x_{b+1})$, where D stands for direct concatenation (i.e., the translation of $\bar{x}_1^b$ will precede the translation of $\bar{x}_{b+1}^{\cdot}$) and I stands for inverse. Clearly, $p(D \mid x_b, x_{b+1}) + p(I \mid x_b, x_{b+1}) = 1$ for every pair $(x_b, x_{b+1})$.</Paragraph>
</Section>
<Section position="2" start_page="95" end_page="95" type="sub_section"> <SectionTitle> 2.2 Final form of the model </SectionTitle>
<Paragraph position="0"> With these parameters, the final model combines, according to $M(|\bar{x}|)$, the model 1 probability $p_I(\bar{y} \mid \bar{x})$ with the sum, over cut points and concatenation directions, of the products of the recursively computed probabilities of the two parts,</Paragraph>
<Paragraph position="2"> where $p_I$ represents the probability assigned by model 1 to a pair of sentences.</Paragraph>
</Section>
<Section position="3" start_page="95" end_page="96" type="sub_section"> <SectionTitle> 2.3 Model training </SectionTitle>
<Paragraph position="0"> The training of the model parameters is done by maximizing the likelihood of the training sample. For each training pair $(\bar{x}, \bar{y})$ and each parameter $P$ relevant to it, the counts of $P$ in that pair are computed.</Paragraph>
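To make the combination of these parameters more concrete, the following is a minimal Python sketch of how a MAR-style probability $p(\bar{y} \mid \bar{x})$ could be computed recursively from the tables $M$, $B$ and $D$ and a model 1 scorer. The function and table names, the default values for unseen events, and the exact summation over output cut points are illustrative assumptions rather than the paper's definitive formulation; an efficient implementation would also memoize the recursive calls.

# Minimal sketch of a MAR-style probability, assuming:
#   model1_prob(y, x) -- probability assigned by IBM model 1 (including its length model)
#   M[m]              -- probability of translating an input of m words directly with model 1
#   B[(u, v)]         -- unnormalized cut weight for the word pair (u, v)
#   D[(u, v)]         -- probability of direct concatenation for the pair (u, v)
# All names, defaults and the way the output sentence is split are assumptions.

def mar_prob(y, x, M, B, D, model1_prob):
    """p(y | x) for word tuples y and x under the recursive decomposition."""
    m = len(x)
    p_ibm1 = M.get(m, 1.0)                    # short (or unseen-length) inputs go to model 1
    total = p_ibm1 * model1_prob(y, x)
    if m < 2 or p_ibm1 >= 1.0:
        return total

    # Cut-point distribution: weights normalized over the m - 1 possible cuts.
    weights = [B.get((x[b - 1], x[b]), 1e-9) for b in range(1, m)]
    z = sum(weights)

    recursive = 0.0
    for b in range(1, m):                     # cut the input after its b-th word
        p_cut = weights[b - 1] / z
        p_direct = D.get((x[b - 1], x[b]), 0.5)
        for c in range(len(y) + 1):           # every way of splitting the output
            y_left, y_right = y[:c], y[c:]
            direct = (mar_prob(y_left, x[:b], M, B, D, model1_prob) *
                      mar_prob(y_right, x[b:], M, B, D, model1_prob))
            inverse = (mar_prob(y_left, x[b:], M, B, D, model1_prob) *
                       mar_prob(y_right, x[:b], M, B, D, model1_prob))
            recursive += p_cut * (p_direct * direct + (1.0 - p_direct) * inverse)
    return total + (1.0 - p_ibm1) * recursive

The counts mentioned above can be accumulated while evaluating this recursion, one count per parameter actually used for the sentence pair.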
<Paragraph position="2"> As the model is polynomial in all its parameters except for the cuts (the $B$'s), Baum-Eagon's inequality (Baum and Eagon, 1967) guarantees that normalization of the counts increases the likelihood of the sample. For the cuts, Gopalakrishnan's inequality (Gopalakrishnan et al., 1991) is used.</Paragraph>
<Paragraph position="3"> Initial values for the dictionary are obtained using model 1 training, and then a series of iterations is made updating the values of every parameter. Some additional considerations are taken into account for efficiency reasons; see (Vilar and Vidal, 2005) for details.</Paragraph>
<Paragraph position="4"> A potential problem here is the large number of parameters associated with cuts and directions: two for each possible pair of words. But, as we are interested only in aligning the corpus, no provision is made for the data sparseness problem.</Paragraph>
</Section> </Section>
<Section position="4" start_page="96" end_page="96" type="metho"> <SectionTitle> 3 The task </SectionTitle>
<Paragraph position="0"> The aim of the task was to align a set of 200 translation pairs between Romanian and English. As training material, the text of 1984, the Romanian Constitution, and a collection of texts from the Web were provided. Some details about this corpus can be seen in Table 1.</Paragraph>
</Section>
<Section position="5" start_page="96" end_page="96" type="metho"> <SectionTitle> 4 Splitting the corpus </SectionTitle>
<Paragraph position="0"> To reduce the high computational cost of training the parameters of the MAR, a heuristic was employed to split long sentences into smaller parts of fewer than $l$ words.</Paragraph>
<Paragraph position="1"> Suppose we are to split sentences $\bar{x}$ and $\bar{y}$. We begin by aligning each word in $\bar{y}$ to a word in $\bar{x}$.</Paragraph>
<Paragraph position="2"> Then, a score and a translation are assigned to each substring $\bar{x}_i^j$ with a length below $l$. The translation is produced by looking for the substring of $\bar{y}$ which has a length below $l$ and which has the largest number of words aligned to positions between $i$ and $j$. The pair so obtained is given a score equal to the sum of: (a) the square of the length of $\bar{x}_i^j$; (b) the square of the number of words in the output aligned to the input; and (c) minus ten times the sum of the square of the number of words aligned to a nonempty position outside $\bar{x}_i^j$ and the number of words outside the chosen segment that are aligned to $\bar{x}_i^j$.</Paragraph>
<Paragraph position="3"> These scores are chosen with the aim of reducing the number of segments and making them as "complete" as possible, i.e., the words they cover are aligned to as many words as possible.</Paragraph>
<Paragraph position="4"> After the segments of $\bar{x}$ are so scored, the partition of $\bar{x}$ that maximizes the sum of scores is computed by dynamic programming.</Paragraph>
<Paragraph position="5"> The training material was split into parts of up to ten words in length. For this, an alignment was obtained by training an IBM model 4 using GIZA++ (Och and Ney, 2003). The test pairs were split into parts of up to twenty words. After the split, there were 141,945 training pairs and 337 test pairs.</Paragraph>
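A minimal sketch of the dynamic-programming step just described, assuming the scores of the admissible segments have already been computed by the heuristic above; the function name, the score(i, j) interface and the representation of a partition as a list of (start, end) position pairs are illustrative choices, not the paper's implementation.

# Dynamic-programming partition of an input of n words into segments of at most
# max_len words, maximizing the sum of precomputed segment scores.
# score(i, j) returns the score of the substring spanning positions i..j
# (1-based, inclusive); this interface is an assumption made for the sketch.

def best_partition(n, max_len, score):
    best = [float("-inf")] * (n + 1)   # best[j]: best total score for positions 1..j
    back = [0] * (n + 1)               # back[j]: start of the last segment in that solution
    best[0] = 0.0
    for j in range(1, n + 1):
        for i in range(max(1, j - max_len + 1), j + 1):
            candidate = best[i - 1] + score(i, j)
            if candidate > best[j]:
                best[j], back[j] = candidate, i
    # Recover the segmentation by following the back-pointers from the end.
    segments, j = [], n
    while j > 0:
        segments.append((back[j], j))
        j = back[j] - 1
    segments.reverse()
    return segments

Each returned (i, j) span would then be paired with the output segment selected for it by the scoring step.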
<Paragraph position="6"> Information about the partition was stored in order to be able to recover the correct alignments later.</Paragraph>
</Section>
<Section position="6" start_page="96" end_page="97" type="metho"> <SectionTitle> 5 Aligning the corpus </SectionTitle>
<Paragraph position="0"> The parameters of the MAR were trained as explained above: first, ten IBM model 1 iterations were used to give initial values to the dictionary probabilities, and then ten more iterations retrained the dictionary together with the rest of the parameters.</Paragraph>
<Paragraph position="1"> The alignment of a sentence pair has the form of a tree similar to those in Figure 1. Each interior node has two children, corresponding to the translations of the two parts into which the input sentence is divided.</Paragraph>
<Paragraph position="2"> The leaves of the tree correspond to those segments that were translated by model 1.</Paragraph>
<Paragraph position="3"> As the reference alignments do not have this kind of structure, it is necessary to "flatten" these trees. The procedure we have employed is very simple: if we are in a leaf, every output word is aligned to every input word; if we are in an interior node, the "flat" alignments for the children are built and then combined. Note that the way leaves are labeled tends to favor recall over precision.</Paragraph>
<Paragraph position="4"> The flat alignments corresponding to the trees of Figure 1 are those of the sentence pairs "economia si finantele publice" / "economy and public finance" and "Winston se intoarse brusc ." / "Winston turned round abruptly .".</Paragraph>
</Section> </Paper>
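As a small illustration of the flattening procedure described in Section 5, the following sketch turns an alignment tree into a set of word links. The node representation (explicit sets of covered input and output positions at the leaves) is an assumption made for the example, not the paper's data structure.

# Flattening a MAR alignment tree into word links, following the rule in Section 5:
# leaves align every output word to every input word; interior nodes take the union
# of their children's flat alignments.

from dataclasses import dataclass, field
from typing import Optional, Set, Tuple

@dataclass
class AlignmentNode:
    # Positions of the input and output words covered by this node (used at the leaves).
    input_positions: Set[int] = field(default_factory=set)
    output_positions: Set[int] = field(default_factory=set)
    left: Optional["AlignmentNode"] = None
    right: Optional["AlignmentNode"] = None

def flatten(node: AlignmentNode) -> Set[Tuple[int, int]]:
    """Word links (input position, output position) obtained by flattening the tree."""
    if node.left is None and node.right is None:
        # Leaf translated by model 1: align every output word to every input word,
        # which is why the flattening tends to favor recall over precision.
        return {(i, j) for i in node.input_positions for j in node.output_positions}
    # Interior node: build the flat alignments of the two children and combine them.
    return flatten(node.left) | flatten(node.right)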