<?xml version="1.0" standalone="yes"?> <Paper uid="P06-2117"> <Title>Boosting Statistical Word Alignment Using Labeled and Unlabeled Data</Title> <Section position="4" start_page="913" end_page="913" type="metho"> <SectionTitle> 2 Statistical Word Alignment Model </SectionTitle> <Paragraph position="0"> According to the IBM models (Brown et al., 1993), the statistical word alignment model can be generally represented as in equation (1).</Paragraph> <Paragraph position="1"> Where and f represent the source sentence and the target sentence, respectively.</Paragraph> <Paragraph position="2"> e In this paper, we use a simplified IBM model 4 (Al-Onaizan et al., 1999), which is shown in equation (2). This simplified version does not take into account word classes as described in Brown et al. (1993).</Paragraph> <Paragraph position="4"> (2) ml, are the lengths of the source sentence and the target sentence respectively. j is the position index of the target word. j a is the position of the source word aligned to the target word.</Paragraph> <Paragraph position="6"> ph is the number of target words that is aligned to.</Paragraph> <Paragraph position="8"> p , are the fertility probabilities for , and is the distortion probability for the non-head words of cept i.</Paragraph> <Paragraph position="10"> c is the center of cept i.</Paragraph> </Section> <Section position="5" start_page="913" end_page="917" type="metho"> <SectionTitle> 3 Parameter Estimation with Labeled </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="913" end_page="914" type="sub_section"> <SectionTitle> Data </SectionTitle> <Paragraph position="0"> With the labeled data, instead of using EM algorithm, we directly estimate the three main parameters in model 4: translation probability, fertility probability, and distortion probability.</Paragraph> <Paragraph position="1"> A cept is defined as the set of target words connected to a source word (Brown et al., 1993).</Paragraph> </Section> <Section position="2" start_page="914" end_page="914" type="sub_section"> <SectionTitle> 3.1 Translation Probability Where 1),( =yxd if yx = . Otherwise, 0),( =yxd . </SectionTitle> <Paragraph position="0"> The translation probability is estimated from the labeled data as described in (3).</Paragraph> <Paragraph position="1"> In this section, we first propose a semi-supervised AdaBoost algorithm for word alignment, which uses both the labeled data and the unlabeled data. Based on the semi-supervised algorithm, we describe two boosting methods for word alignment. And then we develop a method to combine the results of the two boosting methods. null Where is the occurring frequency of aligned to in the labeled data.</Paragraph> <Paragraph position="3"/> </Section> <Section position="3" start_page="914" end_page="916" type="sub_section"> <SectionTitle> 3.2 Fertility Probability </SectionTitle> <Paragraph position="0"> The fertility probability )|( ii en ph describes the distribution of the numbers of words that is aligned to. It is estimated as described in (4).</Paragraph> <Paragraph position="2"> Figure 1 shows the semi-supervised AdaBoost algorithm for word alignment by using labeled and unlabeled data. 
<Section position="4" start_page="914" end_page="916" type="sub_section"> <SectionTitle> 4.1 Semi-Supervised AdaBoost Algorithm for Word Alignment </SectionTitle>
<Paragraph position="0"> In this section, we first propose a semi-supervised AdaBoost algorithm for word alignment, which uses both the labeled data and the unlabeled data. Based on the semi-supervised algorithm, we describe two boosting methods for word alignment. Then we develop a method to combine the results of the two boosting methods.</Paragraph>
<Paragraph position="1"> Figure 1 shows the semi-supervised AdaBoost algorithm for word alignment using labeled and unlabeled data. Compared with the supervised AdaBoost algorithm, this semi-supervised AdaBoost algorithm mainly has five differences.</Paragraph>
<Paragraph position="2"> The first is the word alignment model, which is taken as the learner in the boosting algorithm. The word alignment model is built using both the labeled data and the unlabeled data. With the labeled data, we train a supervised model by directly estimating the parameters of the IBM model as described in section 3. With the unlabeled data, we train an unsupervised model using the same EM algorithm as in Brown et al. (1993). Then we build an interpolated model by linearly interpolating these two word alignment models, as shown in (8):

$$\Pr(f, a \mid e) = \lambda \Pr_s(f, a \mid e) + (1 - \lambda) \Pr_u(f, a \mid e) \qquad (8)$$

where $\Pr_s$ and $\Pr_u$ are the trained supervised model and unsupervised model, respectively, and $\lambda$ is an interpolation weight. We train the weight $\lambda$ in equation (8) in the same way as described in Wu et al. (2005). This interpolated model is used as the alignment model in figure 1.</Paragraph>
<Paragraph position="3"> The second is the reference set for the unlabeled data. For the unlabeled data, we automatically build a pseudo reference set. In order to build a reliable pseudo reference set, we perform bi-directional word alignment on the training data using the interpolated model trained on the first round. Bi-directional word alignment includes alignment in two directions (source to target and target to source) as described in Och and Ney (2000). Thus, we get two sets of alignment results on the unlabeled data, one for each direction. Based on these two sets, we use a modified "refined" method (Och and Ney, 2000) to construct the pseudo reference set.</Paragraph>
<Paragraph position="4"> The third is the calculation of the error of the individual word aligner on each round. For word alignment, a sentence pair is taken as a sample. Thus, we calculate the error rate of each sentence pair as described in (9), which is the same as in Wu and Wang (2005):

$$err_i = 1 - \frac{2\,|A_i \cap R_i|}{|A_i| + |R_i|} \qquad (9)$$

where $A_i$ represents the set of alignment links of sentence pair $i$ identified by the individual interpolated model on each round, and $R_i$ is the reference alignment set for the sentence pair. With the error rate of each sentence pair, we calculate the error of the word aligner on each round. Although we build a pseudo reference set for the unlabeled data, it contains alignment errors. Thus, the weighted sum of the error rates of the sentence pairs in the labeled data, instead of that in the entire training data, is used as the error of the word aligner.</Paragraph>
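The per-sentence-pair error rate of equation (9) and the aligner error restricted to the labeled pairs can be sketched in Python as follows; reading (9) as one minus the F-measure of a sentence pair's hypothesised links against its reference, and the aligner error as a weighted sum over the labeled pairs only, follows the description above but is a reconstruction, not the paper's code:

def sentence_error_rate(hyp_links, ref_links):
    """Error rate of one sentence pair: 1 minus the F-measure of the
    hypothesised alignment links against the reference links (cf. equation (9))."""
    if not hyp_links and not ref_links:
        return 0.0
    overlap = len(set(hyp_links) & set(ref_links))
    return 1.0 - 2.0 * overlap / (len(hyp_links) + len(ref_links))

def aligner_error(labeled_pairs, weights, aligner):
    """Weighted sum of sentence-pair error rates over the labeled data only,
    since the pseudo reference set for the unlabeled data contains errors.
    labeled_pairs: list of (src, tgt, ref_links); weights: matching list of
    sample weights (assumed normalised over the labeled pairs); aligner:
    callable returning a set of links for (src, tgt)."""
    err = 0.0
    for (src, tgt, ref), w in zip(labeled_pairs, weights):
        err += w * sentence_error_rate(aligner(src, tgt), ref)
    return err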
<Paragraph position="5"> Weights Update for Sentence Pairs</Paragraph>
<Paragraph position="6"> The fourth is the weight update for sentence pairs according to the error and the reference set. In a sentence pair, there are usually several word alignment links. Some are correct, and others may be incorrect. Thus, we update the weights according to the numbers of correct and incorrect alignment links as compared with the reference set, as shown in step (9) in figure 1.</Paragraph>
<Paragraph position="7"> Weights for Word Alignment Links</Paragraph>
<Paragraph position="8"> The fifth is the weights used when we construct the final ensemble. Besides the weight $\beta$, which is the confidence measure of the word aligner, we also use a weight to measure the confidence of each alignment link produced by the model on each round. This weight is calculated as shown in (10), where the count of an alignment link is its occurring frequency in the word alignment results of the training data produced by that model. Wu and Wang (2005) proved that adding this weight improves the word alignment results.</Paragraph>
</Section>
<Section position="5" start_page="916" end_page="916" type="sub_section"> <SectionTitle> 4.2 Method 1 </SectionTitle>
<Paragraph position="0"> This method only uses the labeled data as the training data in the algorithm in figure 1; thus, we only change the distribution of the labeled data. However, we also build an unsupervised model using the unlabeled data. On each round, we keep this unsupervised model unchanged, and we rebuild the supervised model by estimating the parameters as described in section 3 with the weighted training data. Then we interpolate the supervised model and the unsupervised model to obtain an interpolated model as described in section 4.1. The interpolated model is used as the alignment model in figure 1; thus, this interpolated model uses both the labeled and the unlabeled data. On each round, we rebuild the interpolated model using the rebuilt supervised model and the unchanged unsupervised model, and this interpolated model is used to align the training data.</Paragraph>
<Paragraph position="1"> According to the reference set of the labeled data, we calculate the error of the word aligner on each round. According to the error and the reference set, we update the weight of each sample in the labeled data.</Paragraph>
</Section>
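One possible realisation of Method 1's boosting loop is sketched below. The helper callables (train_supervised, train_unsupervised_em, interpolate, align, error_rate) stand in for the steps of figure 1, which is not reproduced in this extract, and the multiplicative weight update and the aligner weight derived from beta follow the standard AdaBoost recipe rather than the exact per-link update of step (9):

import math

def boost_method1(labeled, unlabeled, rounds,
                  train_supervised, train_unsupervised_em,
                  interpolate, align, error_rate):
    """Method 1 (sketch): only the labeled data is re-weighted.
    labeled: list of (src, tgt, ref_links); unlabeled: list of (src, tgt).
    The helper callables are placeholders for the steps described in the text."""
    weights = [1.0 / len(labeled)] * len(labeled)
    unsup = train_unsupervised_em(unlabeled)          # trained once, kept fixed
    ensemble = []                                     # (aligner weight, model) pairs

    for _ in range(rounds):
        sup = train_supervised(labeled, weights)      # section 3 estimates on weighted data
        model = interpolate(sup, unsup)               # equation (8)

        # error of the aligner, computed on the labeled data only (weighted sum of (9))
        errs = [error_rate(align(model, s, t), ref) for s, t, ref in labeled]
        eps = sum(w * e for w, e in zip(weights, errs))
        eps = min(max(eps, 1e-10), 0.5 - 1e-10)       # keep beta well defined
        beta = eps / (1.0 - eps)
        ensemble.append((math.log(1.0 / beta), model))

        # multiplicative update: down-weight sentence pairs that are already aligned well
        weights = [w * (beta ** (1.0 - e)) for w, e in zip(weights, errs)]
        z = sum(weights)
        weights = [w / z for w in weights]

    return ensemble

Method 2 would differ mainly in keeping the supervised model fixed, re-estimating the unsupervised model on the weighted labeled and unlabeled data, and updating the unlabeled pairs against the pseudo reference set, as described in the next subsection.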
<Section position="6" start_page="916" end_page="916" type="sub_section"> <SectionTitle> 4.3 Method 2 </SectionTitle>
<Paragraph position="0"> This method uses both the labeled data and the unlabeled data as the training data in the algorithm in figure 1. With the weighted samples in the training data, we rebuild the unsupervised model with the EM algorithm on each round, while the supervised model trained on the labeled data is kept unchanged. In fact, we could also rebuild the supervised model according to the weighted labeled data; in that case, however, the error of the supervised model increases, so we keep the supervised model unchanged in this method. Based on these two models, we build an interpolated model as described in section 4.1. The interpolated model is used as the alignment model in figure 1. On each round, we rebuild the interpolated model using the unchanged supervised model and the rebuilt unsupervised model. Then the interpolated model is used to align the training data.</Paragraph>
<Paragraph position="1"> Since the training data includes both labeled and unlabeled data, we need to build a pseudo reference set for the unlabeled data using the method described in section 4.1. According to the reference set of the labeled data, we calculate the error of the word aligner on each round. Then, according to the pseudo reference set and the reference set, we update the weight of each sentence pair in the unlabeled data and in the labeled data, respectively.</Paragraph>
<Paragraph position="2"> Compared with Method 1, Method 2 mainly differs in the following aspects:</Paragraph>
<Paragraph position="3"> (1) On each round, Method 2 changes the distribution of both the labeled data and the unlabeled data, while Method 1 only changes the distribution of the labeled data.</Paragraph>
<Paragraph position="4"> (2) Method 2 rebuilds the unsupervised model, while Method 1 rebuilds the supervised model.</Paragraph>
<Paragraph position="5"> (3) Method 2 uses the labeled data instead of the entire training data to estimate the error of the word aligner on each round.</Paragraph>
<Paragraph position="6"> (4) Method 2 uses an automatically built pseudo reference set to update the weights of the sentence pairs in the unlabeled data.</Paragraph>
</Section>
<Section position="7" start_page="916" end_page="917" type="sub_section"> <SectionTitle> 4.4 Combination </SectionTitle>
<Paragraph position="0"> In the above two sections, we described two semi-supervised boosting methods for word alignment. Although we use interpolated models for word alignment in both Method 1 and Method 2, the interpolated models are trained with differently weighted data. Thus, they perform differently on word alignment. In order to further improve the word alignment results, we combine the results of the above two methods as described in (11).</Paragraph>
<Paragraph position="1"> The methods used to calculate the precision, recall, f-measure, and alignment error rate (AER) of the word alignment results are shown in equations (12), (13), (14), and (15). It can be seen that the higher the f-measure is, the lower the alignment error rate is.</Paragraph>
</Section> </Section> </Paper>
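Since equations (12) through (15) are not reproduced in this extract, the sketch below uses the standard link-based definitions of precision, recall, and f-measure, and takes AER as one minus the f-measure, which is consistent with the remark that a higher f-measure implies a lower AER; an evaluation that distinguishes sure and possible links would use slightly different formulas:

def alignment_metrics(hyp_links, ref_links):
    """Precision, recall, f-measure and AER over sets of alignment links.
    AER is taken here as 1 - f-measure (an assumption consistent with the text)."""
    hyp, ref = set(hyp_links), set(ref_links)
    correct = len(hyp & ref)
    precision = correct / len(hyp) if hyp else 0.0
    recall = correct / len(ref) if ref else 0.0
    f = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f, 1.0 - f

# toy usage: two of three hypothesised links are in the reference
print(alignment_metrics({(1, 1), (2, 2), (3, 4)}, {(1, 1), (2, 2), (3, 3)}))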