<?xml version="1.0" standalone="yes"?> <Paper uid="H05-1009"> <Title>Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing (HLT/EMNLP), pages 65-72, Vancouver, October 2005. (c) 2005 Association for Computational Linguistics. NeurAlign: Combining Word Alignments Using Neural Networks</Title> <Section position="4" start_page="65" end_page="65" type="metho"> <SectionTitle> 3 Neural Networks </SectionTitle> <Paragraph position="0"> A multi-layer perceptron (MLP) is a feed-forward neural network that consists of several units (neurons) connected to each other by weighted links. As illustrated in Figure 1, an MLP consists of one input layer, one or more hidden layers, and one output layer. The external input is presented to the input layer, propagated forward through the hidden layers, and produces the output vector in the output layer. Each unit $i$ in the network computes its output with respect to its net input $net_i = \sum_j w_{ij} a_j$, where $j$ ranges over all units in the previous layer that are connected to unit $i$. The output of unit $i$ is computed by passing the net input through a non-linear activation function $f$, i.e., $a_i = f(net_i)$.</Paragraph> <Paragraph position="1"> The most commonly used non-linear activation functions are the log sigmoid function $f(x) = \frac{1}{1 + e^{-x}}$ and the hyperbolic tangent sigmoid function $f(x) = \frac{1 - e^{-2x}}{1 + e^{-2x}}$. The latter has been shown to be more suitable for binary classification problems.</Paragraph> <Paragraph position="2"> The critical question is the computation of the weights associated with the links connecting the neurons. In this paper, we use the resilient backpropagation (RPROP) algorithm (Riedmiller and Braun, 1993), which is based on the gradient descent method but converges faster and generalizes better.</Paragraph> </Section> <Section position="5" start_page="65" end_page="68" type="metho"> <SectionTitle> 4 NeurAlign Approach </SectionTitle> <Paragraph position="0"> We propose a new approach, NeurAlign, that learns how to combine individual word alignment systems. We treat each alignment system as a classifier and transform the combination problem into a classifier ensemble problem. Before describing the NeurAlign approach, we first introduce some terminology used in the description below.</Paragraph> <Paragraph position="1"> Let $E = e_1, \ldots, e_t$ and $F = f_1, \ldots, f_s$ be two sentences in two different languages. An alignment link $(i,j)$ corresponds to a translational equivalence between the words $e_i$ and $f_j$. Let $A_k$ be an alignment between the sentences $E$ and $F$, where each element $a \in A_k$ is an alignment link $(i,j)$. Let $A = \{A_1, \ldots, A_l\}$ be a set of alignments between $E$ and $F$. We refer to the true alignment as $T$, where each $a \in T$ is of the form $(i,j)$. A neighborhood of an alignment link $(i,j)$, denoted by $N(i,j)$, consists of the 8 possible alignment links in a 3x3 window with $(i,j)$ in the center of the window. Each element of $N(i,j)$ is called a neighboring link of $(i,j)$.</Paragraph> <Paragraph position="2"> Our goal is to combine the information in $A_1, \ldots, A_l$ such that the resulting alignment is closer to $T$. A straightforward solution is to take the intersection or union of the individual alignments, or to perform majority voting for each possible alignment link $(i,j)$. Here, we use an additional model to learn how to combine the outputs of $A_1, \ldots, A_l$.</Paragraph>
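For reference, the following is a minimal sketch of these simple baselines, assuming each alignment is represented as a set of (i,j) tuples; the function names are illustrative, not from the paper:

```python
# Simple combination baselines: intersection, union, majority vote.
from collections import Counter
from typing import List, Set, Tuple

Link = Tuple[int, int]

def intersect(alignments: List[Set[Link]]) -> Set[Link]:
    """Keep only links proposed by every aligner."""
    result = set(alignments[0])
    for a in alignments[1:]:
        result &= a
    return result

def union(alignments: List[Set[Link]]) -> Set[Link]:
    """Keep links proposed by any aligner."""
    result: Set[Link] = set()
    for a in alignments:
        result |= a
    return result

def majority_vote(alignments: List[Set[Link]]) -> Set[Link]:
    """Keep links proposed by more than half of the aligners."""
    counts = Counter(link for a in alignments for link in a)
    return {link for link, c in counts.items() if c > len(alignments) / 2}
```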
<Paragraph position="3"> We decompose the task of combining word alignments into two steps: (1) extract features; and (2) learn a classifier from the transformed data. We describe each of these two steps in turn.</Paragraph> <Section position="1" start_page="66" end_page="66" type="sub_section"> <SectionTitle> 4.1 Extracting Features </SectionTitle> <Paragraph position="0"> Given the sentences $E$ and $F$, we create a (potential) alignment instance $(i,j)$ for all possible word combinations. A crucial component of building a classifier is the selection of features to represent the data. The simplest approach is to treat each alignment system output as a separate feature upon which we build a classifier. However, when only a few alignment systems are combined, this feature space is not sufficient to distinguish between instances. One of the strategies in the classification literature is to supply the input data as part of the feature set as well.</Paragraph> <Paragraph position="1"> When combining word alignments, we use two types of features to describe each instance $(i,j)$: (1) linguistic features and (2) alignment features.</Paragraph> <Paragraph position="2"> Linguistic features include the POS tags of both words ($e_i$ and $f_j$) and a dependency relation for one of the words ($e_i$). We generate POS tags using the MXPOST tagger (Ratnaparkhi, 1996) for English and Chinese, and Connexor for Spanish. Dependency relations are produced using a version of the Collins parser (Collins, 1997) that has been adapted for building dependencies.</Paragraph> <Paragraph position="3"> Alignment features consist of features that are extracted from the outputs of the individual alignment systems. For each alignment $A_k \in A$, the following are some of the alignment features that can be used to describe an instance $(i,j)$: 1. Whether $(i,j)$ is an element of $A_k$ or not. 2. The translation probability $p(f_j|e_i)$ computed over $A_k$. 3. The fertility of $e_i$ in $A_k$ (i.e., the number of words in $F$ that are aligned to $e_i$). 4. The fertility of $f_j$ in $A_k$ (i.e., the number of words in $E$ that are aligned to $f_j$). 5. For each neighbor $(x,y) \in N(i,j)$, whether $(x,y) \in A_k$ or not (8 features in total). 6. For each neighbor $(x,y) \in N(i,j)$, the translation probability $p(f_y|e_x)$ computed over $A_k$ (8 features in total). It is also possible to use variants, or combinations, of these features to reduce the feature space.</Paragraph> <Paragraph position="4"> Figure 2 shows an example of how we transform the outputs of 2 alignment systems, $A_1$ and $A_2$, for an alignment link $(i,j)$ into data with some of the features above. We use -1 and 1 to represent the absence and existence of a link, respectively. The neighboring links are presented in row-by-row order.</Paragraph> </Section> <Section position="2" start_page="66" end_page="66" type="sub_section"> <SectionTitle> Transforming Alignments into Classification Data </SectionTitle> <Paragraph position="0"> For each sentence pair $E = e_1, \ldots, e_t$ and $F = f_1, \ldots, f_s$, we generate $s \times t$ instances to represent the sentence pair in the classification data. Supervised learning requires the correct output, which here is the true alignment $T$. If an alignment link $(i,j)$ is an element of $T$, then we set the correct output to 1, and to -1 otherwise.</Paragraph> </Section>
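A minimal sketch of this transformation, covering the labels and alignment features 1, 3, 4, and 5 from Section 4.1; the set-of-links representation and all names are illustrative assumptions:

```python
# Turn aligner outputs into s*t labeled training instances.
from typing import List, Set, Tuple

Link = Tuple[int, int]

def neighborhood(i: int, j: int) -> List[Link]:
    """The 8 possible links in the 3x3 window centered at (i, j)."""
    return [(i + di, j + dj)
            for di in (-1, 0, 1) for dj in (-1, 0, 1)
            if (di, dj) != (0, 0)]

def make_instances(t: int, s: int,
                   alignments: List[Set[Link]],
                   true_alignment: Set[Link]):
    """Yield (features, label) for each of the s*t word pairs (i, j)."""
    for i in range(t):
        for j in range(s):
            feats: List[int] = []
            for a in alignments:
                feats.append(1 if (i, j) in a else -1)          # feature 1: link in A_k
                feats.append(sum(1 for (x, _) in a if x == i))  # feature 3: fertility of e_i
                feats.append(sum(1 for (_, y) in a if y == j))  # feature 4: fertility of f_j
                feats.extend(1 if n in a else -1                # feature 5: neighborhood links
                             for n in neighborhood(i, j))
            label = 1 if (i, j) in true_alignment else -1
            yield feats, label
```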
<Section position="3" start_page="66" end_page="67" type="sub_section"> <SectionTitle> 4.2 Learning A Classifier </SectionTitle> <Paragraph position="0"> Once we transform the alignments into a set of instances with several features, the remaining task is to learn a classifier from this data. In the case of word alignment combination, there are important issues to consider when choosing an appropriate classifier. First, there is a very limited amount of manually annotated data. This may give rise to poor generalization because it is very likely that unseen data include many cases that are not observed in the training data.</Paragraph> <Paragraph position="1"> Second, the distribution of the data according to the classes is skewed. In a preliminary study on an English-Spanish data set, we found that only 4% of all word pairs (out of a possible 158K word pairs) are aligned to each other by humans. Moreover, only 60% of those aligned word pairs were also aligned by the individual alignment systems that were tested.</Paragraph> [Figure 3: Using All Data At Once] <Paragraph position="2"> Finally, given the distribution of the data, it is difficult to find the right features to distinguish between instances. Thus, it is prudent to use as many features as possible and let the learning algorithm filter out the redundant ones.</Paragraph> <Paragraph position="3"> Below, we describe how neural nets are used at different levels to build a good classifier.</Paragraph> <Paragraph position="4"> Figure 3 illustrates how we combine alignments using all of the training data at the same time (NeurAlign1). First, the outputs of the individual alignment systems and the original corpus (enriched with additional linguistic features) are passed to the feature extraction module. This module transforms the alignment problem into a classification problem by generating a training instance for every pair of words between the sentences in the original corpus. Each instance is represented by a set of features (described in Section 4.1). The new training data is passed to a neural net learner, which outputs whether an alignment link exists for each training instance.</Paragraph> <Paragraph position="5"> The use of multiple neural networks (NeurAlign2) enables the decomposition of a complex problem into smaller problems. Local experts are learned for each smaller problem and these are then merged. Following Tumer and Ghosh (1996), we apply spatial partitioning of the training instances, using the proximity of patterns in the input space, to reduce the complexity of the tasks assigned to the individual classifiers. We conducted a preliminary analysis on 100 randomly selected English-Spanish sentence pairs from a mixed corpus (UN + Bible + FBIS) to observe the distribution of errors according to POS tags in both languages. We examined the cases in which the individual alignment and the manual annotation differed: a total of 3,348 instances, of which 1,320 are misclassified by GIZA++ (E-to-S). We use a standard measure of error, i.e., the percentage of misclassified instances out of the total number of instances.</Paragraph>
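A minimal sketch of this per-POS error analysis; the data layout is an assumption made for illustration:

```python
# For each (English POS, Spanish POS) pair, compute the percentage of
# instances where the aligner's decision disagrees with the manual
# annotation, as in the analysis behind Table 1.
from collections import defaultdict

def error_rates_by_pos(instances):
    """instances: iterable of (pos_e, pos_f, predicted, gold) tuples,
    where predicted and gold are +1/-1 link indicators."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for pos_e, pos_f, predicted, gold in instances:
        totals[(pos_e, pos_f)] += 1
        if predicted != gold:
            errors[(pos_e, pos_f)] += 1
    return {pair: 100.0 * errors[pair] / totals[pair] for pair in totals}
```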
<Paragraph position="3"> Table 1 shows the error rates (as percentages) according to POS tags for GIZA++ (E-to-S). The error rate is relatively low in cases where both words have the same POS tag.</Paragraph> <Paragraph position="4"> Except for verbs, the lowest error rate is obtained when both words have the same POS tag (the error rates on the diagonal). On the other hand, the error rates are high in several other cases, as much as 100%, e.g., when the Spanish word is a determiner or a preposition. This suggests that dividing the training data according to POS tags, and training neural networks on each subset separately, might be better than training on the entire data at once.</Paragraph> <Paragraph position="5"> Figure 4 illustrates the combination approach with neural nets after partitioning the data into disjoint subsets (NeurAlign2). Similar to NeurAlign1, the outputs of the individual alignment systems, as well as the original corpus, are passed to the feature extraction module. Then the training data is split into disjoint subsets, using a subset of the available features for partitioning. We learn a different neural net for each partition and then merge the outputs of the individual nets. The advantage of this is that it results in different generalizations for each partition and uses a different subset of the feature space for each net.</Paragraph> </Section> </Section> <Section position="6" start_page="68" end_page="69" type="metho"> <SectionTitle> 5 Experiments and Results </SectionTitle> <Paragraph position="0"> This section describes our experimental design, including evaluation metrics, data, and settings.</Paragraph> <Section position="1" start_page="68" end_page="68" type="sub_section"> <SectionTitle> 5.1 Evaluation Metrics </SectionTitle> <Paragraph position="0"> Let $A$ be the set of alignment links for a set of sentences. We take $S$ to be the set of sure alignment links and $P$ to be the set of probable alignment links (in the gold standard) for the same set of sentences. Precision ($Pr$), recall ($Rc$), and alignment error rate ($AER$) are then defined as follows: $Pr = \frac{|A \cap P|}{|A|}$, $Rc = \frac{|A \cap S|}{|S|}$, and $AER = 1 - \frac{|A \cap S| + |A \cap P|}{|A| + |S|}$.</Paragraph> <Paragraph position="2"> A manually aligned corpus is used as our gold standard. For the English-Spanish data, the manual annotation was done by a bilingual English-Spanish speaker. Every link in the English-Spanish gold standard is considered a sure alignment link (i.e., $P = S$).</Paragraph> <Paragraph position="3"> For English-Chinese, we used the 2002 NIST MT evaluation test set. Each sentence pair was aligned by two native Chinese speakers who are fluent in English. Each alignment link appearing in both annotations was considered a sure link, and links appearing in only one annotation were judged as probable. The annotators were not aware of the specifics of our approach.</Paragraph> </Section> <Section position="2" start_page="68" end_page="68" type="sub_section"> <SectionTitle> 5.2 Evaluation Data and Settings </SectionTitle> <Paragraph position="0"> We evaluated NeurAlign1 and NeurAlign2 using 5-fold cross validation on two data sets: 1. A set of 199 English-Spanish sentence pairs (nearly 5K words on each side) from a mixed corpus (UN + Bible + FBIS).</Paragraph> <Paragraph position="1"> 2. A set of 491 English-Chinese sentence pairs (nearly 13K words on each side) from the 2002 NIST MT evaluation test set.</Paragraph>
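For concreteness, a minimal sketch of the Section 5.1 metrics, assuming each alignment is represented as a set of (i,j) links:

```python
# Precision, recall, and alignment error rate over link sets:
# A is the system alignment, S the sure links, P the probable links
# (with S a subset of P). The set representation is an assumption.
def precision(A: set, P: set) -> float:
    return len(A & P) / len(A)

def recall(A: set, S: set) -> float:
    return len(A & S) / len(S)

def aer(A: set, S: set, P: set) -> float:
    """Alignment error rate (Och and Ney, 2000)."""
    return 1.0 - (len(A & S) + len(A & P)) / (len(A) + len(S))
```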
<Paragraph position="2"> We computed precision, recall, and error rate on the entire set of sentence pairs for each data set. (The number of alignment links varies over each fold; therefore, we chose to evaluate on all the data at once instead of evaluating on each fold and then averaging.) To evaluate NeurAlign, we used GIZA++ in both directions (E-to-F and F-to-E, where F is either Chinese (C) or Spanish (S)) as input, and, for comparison, a refined alignment approach (Och and Ney, 2000) that uses a heuristic combination method called grow-diag-final (Koehn et al., 2003). (We henceforth refer to the refined-alignment approach as &quot;RA.&quot;) For the English-Spanish experiments, GIZA++ was trained on 48K sentence pairs from a mixed corpus (UN + Bible + FBIS), with nearly 1.2M words on each side, using 10 iterations of Model 1, 5 iterations of HMM, and 5 iterations of Model 4. For the English-Chinese experiments, we used 107K sentence pairs from the FBIS corpus (nearly 4.1M English and 3.3M Chinese words) to train GIZA++, using 5 iterations of Model 1, 5 iterations of HMM, 3 iterations of Model 3, and 3 iterations of Model 4.</Paragraph> </Section> <Section position="3" start_page="68" end_page="69" type="sub_section"> <SectionTitle> 5.3 Neural Network Settings </SectionTitle> <Paragraph position="0"> In our experiments, we used a multi-layer perceptron (MLP) consisting of 1 input layer, 1 hidden layer, and 1 output layer. The hidden layer consists of 10 units, and the output layer consists of 1 unit.</Paragraph> <Paragraph position="1"> All units in the hidden layer are fully connected to the units in the input layer, and the output unit is fully connected to all the units in the hidden layer.</Paragraph> <Paragraph position="2"> We used the hyperbolic tangent sigmoid function as the activation function for both layers.</Paragraph> <Paragraph position="3"> One potential pitfall is overfitting as the number of iterations increases. To address this, we used early stopping with a validation set: we held out a randomly selected 1/4 of the training set as the validation set.</Paragraph> <Paragraph position="4"> Neural nets are sensitive to the initial weights. To overcome this, we performed 5 runs of learning for each training set. The final output for each training set is obtained by majority voting over the 5 runs.</Paragraph> </Section> </Section>
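A minimal sketch of this training setup using scikit-learn as a stand-in; this is an assumption, since the paper trains with RPROP, which scikit-learn does not provide, so the library's default solver is substituted:

```python
# 10-unit tanh MLP with early stopping on a held-out quarter of the
# training data, plus a majority vote over 5 differently seeded runs.
import numpy as np
from sklearn.neural_network import MLPClassifier

def train_and_vote(X_train, y_train, X_test, n_runs=5):
    """y_train labels are +1/-1; returns the majority-voted labels."""
    votes = np.zeros(len(X_test))
    for seed in range(n_runs):
        net = MLPClassifier(hidden_layer_sizes=(10,),
                            activation='tanh',
                            early_stopping=True,
                            validation_fraction=0.25,
                            random_state=seed)
        net.fit(X_train, y_train)
        votes += net.predict(X_test)   # each run casts a +1/-1 vote
    return np.sign(votes)              # majority over the odd number of runs
```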
<Section position="7" start_page="69" end_page="69" type="metho"> <SectionTitle> 5.4 Results </SectionTitle> <Paragraph position="0"> This section describes the experiments on English-Spanish and English-Chinese data for testing the effects of feature selection and of training on the entire data (NeurAlign1) versus the partitioned data (NeurAlign2), using two input alignments: GIZA++ (E-to-F) and GIZA++ (F-to-E). We used the following additional features, as well as the outputs of the individual aligners, for an instance $(i,j)$ (features 2-7 below are generated separately for each input alignment $A_k$): 1. $posE_i$, $posF_j$, $relE_i$: POS tags and the dependency relation for $e_i$ and $f_j$.</Paragraph> <Paragraph position="1"> 2. $neigh(i,j)$: 8 features indicating whether a neighboring link exists in $A_k$.</Paragraph> <Paragraph position="2"> 3. $fertE_i$, $fertF_j$: 2 features indicating the fertility of $e_i$ and $f_j$ in $A_k$.</Paragraph> <Paragraph position="3"> 4. $NC(i,j)$: the total number of existing links in $N(i,j)$ in $A_k$.</Paragraph> <Paragraph position="4"> 5. $TP(i,j)$: the translation probability $p(f_j|e_i)$ in $A_k$.</Paragraph> <Paragraph position="5"> 6. $NghTP(i,j)$: 8 features indicating the translation probability $p(f_y|e_x)$ for each $(x,y) \in N(i,j)$ in $A_k$.</Paragraph> <Paragraph position="6"> 7. $AvTP(i,j)$: the average translation probability of the neighbors of $(i,j)$ in $A_k$.</Paragraph> <Paragraph position="7"> We performed statistical significance tests using two-tailed paired t-tests. Unless otherwise indicated, the differences between NeurAlign and the other alignment systems, as well as the differences among the NeurAlign variations themselves, were statistically significant within the 95% confidence interval.</Paragraph> </Section> <Section position="8" start_page="69" end_page="70" type="metho"> <SectionTitle> 5.4.1 Results for English-Spanish </SectionTitle> <Paragraph position="0"> Table 2 summarizes the precision, recall, and alignment error rate values for each of our two alignment system inputs plus the three alternative alignment-combination approaches. Note that the best performing aligner among these is the RA method, with an AER of 21.2%. (We include this in subsequent tables for ease of comparison.)</Paragraph> [Table 2: Simple Combinations] <Paragraph position="1"> Feature Selection for Training on All Data At Once: NeurAlign1. Table 3 presents the results of training neural nets on the entire data (NeurAlign1) with different subsets of the feature space. When we used POS tags and the dependency relation as features, NeurAlign1 performed worse than RA. Using the neighboring links as the feature set gave slightly (but not significantly) better results than RA. Using POS tags, dependency relations, and neighboring links also resulted in better performance than RA, but the difference was not statistically significant.</Paragraph> <Paragraph position="2"> When we used fertilities along with the POS tags and dependency relations, the AER was 20.0%, a significant relative error reduction of 5.7% over RA. Adding the neighboring links to the previous feature set resulted in an AER of 17.6%, a significant relative error reduction of 17% over RA.</Paragraph> <Paragraph position="3"> Interestingly, when we removed POS tags and dependency relations from this feature set, there was no significant change in the AER, which indicates that the improvement is mainly due to the neighboring links. This supports our initial claim about the clustering of alignment links: when there is an alignment link, there is usually another link in its neighborhood. Finally, we tested the effects of using translation probabilities as part of the feature set, and found that using translation probabilities did no better than the case where they were not used. We believe this happens because the translation probability $p(f_j|e_i)$ has a unique value for each pair of $e_i$ and $f_j$; therefore, it is not useful for distinguishing between alignment links involving the same words.</Paragraph> <Paragraph position="4"> Feature Selection for Training on Partitioned Data: NeurAlign2. In order to train on partitioned data (NeurAlign2), we needed to establish appropriate features for partitioning the training data, as sketched below.</Paragraph>
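A minimal sketch of this partitioning scheme; the data layout, the names, and the fallback for unseen partitions are assumptions:

```python
# NeurAlign2-style training: split instances by the (English POS,
# Spanish POS) pair and train one local expert per partition.
from collections import defaultdict
from sklearn.neural_network import MLPClassifier

def train_partitioned(instances):
    """instances: iterable of (pos_pair, feature_vector, label),
    where pos_pair = (English POS, Spanish POS) and label is +1/-1."""
    buckets = defaultdict(lambda: ([], []))
    for pos_pair, feats, label in instances:
        buckets[pos_pair][0].append(feats)
        buckets[pos_pair][1].append(label)
    experts = {}
    for pos_pair, (X, y) in buckets.items():
        if len(set(y)) < 2:            # degenerate partition: store its
            experts[pos_pair] = y[0]   # constant decision instead
            continue
        net = MLPClassifier(hidden_layer_sizes=(10,), activation='tanh')
        net.fit(X, y)
        experts[pos_pair] = net

    def predict(pos_pair, feats):
        expert = experts.get(pos_pair, -1)   # unseen POS pair: no link
        if isinstance(expert, MLPClassifier):
            return expert.predict([feats])[0]
        return expert                        # constant decision or -1
    return predict
```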
<Paragraph position="5"> Table 4 presents the evaluation results for NeurAlign1 (i.e., no partitioning) and for NeurAlign2 with different features for partitioning (the English POS tag, the Spanish POS tag, and the POS tags on both sides). For training on each partition, the feature space included POS tags (e.g., the Spanish POS tag in the case where partitioning is based on the English POS tag only), dependency relations, neighborhood features, and fertilities. We observed that partitioning based on the POS tags on one side reduced the AER to 17.4% and 17.1%, respectively. Using the POS tags on both sides reduced the error rate to 16.9%, a significant relative error reduction of 5.6% over no partitioning. All four methods yielded statistically significant error reductions over RA; we examine the fourth method in more detail below.</Paragraph> <Paragraph position="6"> Once we determined that partitioning by POS tags on both sides brought about the biggest gain, we ran NeurAlign2 using this partitioning but with different feature sets. Table 5 shows the results of this experiment. Using dependency relations, word fertilities, and translation probabilities (both for the link in question and for the neighboring links) yielded a significantly lower AER (18.6%), a relative error reduction of 12.3% over RA. When the feature set consisted of dependency relations, word fertilities, and neighborhood links, the AER was reduced to 16.9%, a 20.3% relative error reduction over RA.</Paragraph> <Paragraph position="7"> We also tested the effects of adding translation probabilities to this feature set, but as in the case of NeurAlign1, this did not improve the alignments.</Paragraph> <Paragraph position="8"> In the best case, NeurAlign2 achieved substantial and significant reductions in AER over the input alignment systems: a 28.4% relative error reduction over S-to-E and a 30.5% relative error reduction over E-to-S. Compared to RA, NeurAlign2 achieved relative improvements of 9.3% in precision, 2.2% in recall, and 20.3% in AER.</Paragraph> </Section> </Paper>