<?xml version="1.0" standalone="yes"?> <Paper uid="N03-1024"> <Title>Syntax-based Alignment of Multiple Translations: Extracting Paraphrases and Generating New Sentences</Title> <Section position="5" start_page="12" end_page="12" type="evalu"> <SectionTitle> 4 Evaluation </SectionTitle> <Paragraph position="0"> The evaluation for our finite state representations and algorithm requires careful examination. Obviously, what counts as a good result largely depends on the application one has in mind. If we are extracting paraphrases for question-reformulation, it doesn't really matter if we output a few syntactically incorrect paraphrases, as long as we produce a large number of semantically correct ones.</Paragraph> <Paragraph position="1"> If we want to use the FSA for MT evaluation (for example, comparing a sentence to be evaluated with the possible paths in FSA), we would want all paths to be relatively good (which we will focus on in this paper), while in some other applications, we may only care about the quality of the best path (not addressed in this paper). Section 4.1 concentrates on evaluating the paraphrase pairs that can be extracted from the FSAs built by our system, while Section 4.2 is dedicated to evaluating the FSAs directly. null</Paragraph> <Section position="1" start_page="12" end_page="12" type="sub_section"> <SectionTitle> 4.1 Evaluating paraphrase pairs 4.1.1 Human-based evaluation of paraphrases </SectionTitle> <Paragraph position="0"> By construction, different paths between any two nodes in the FSA representations that we derive are paraphrases (in the context in which the nodes occur). To evaluate our algorithm, we extract paraphrases from our FSAs and ask human judges to evaluate their correctness.</Paragraph> <Paragraph position="1"> We compare the paraphrases we collect with paraphrases that are derivable from the same corpus using a co-training-based paraphrase extraction algorithm (Barzilay and McKeown, 2001). To the best of our knowledge, this is the most relevant work to compare against since it aims at extracting paraphrase pairs from parallel corpus. Unlike our syntax-based algorithm which treats a sentence as a tree structure and uses this hierarchical structural information to guide the merging process, their algorithm treats a sentence as a sequence of phrases with surrounding contexts (no hierarchical structure involved) and cotrains classifiers to detect paraphrases and contexts for paraphrases. It would be interesting to compare the results from two algorithms so different from each other.</Paragraph> <Paragraph position="2"> For the purpose of this experiment, we randomly selected 300 paraphrase pairs (Ssyn) from the FSAs produced by our system. Since the co-training-based algorithm of Barzilay and McKeown (2001) takes parallel corpus as input, we created out of the MTC corpus 55 x 993 sentence pairs (Each equivalent translation set of cardinality 11 was mapped into parenleftbig112 parenrightbig equivalent translation pairs.). Regina Barzilay kindly provided us the list of paraphrases extracted by their algorithm from this parallel corpus, from which we randomly selected another phrases produced by the syntax-based alignment (Ssyn) and co-training-based (Scotr) algorithms.</Paragraph> <Paragraph position="3"> The resulting 600 paraphrase pairs were mixed and presented in random order to four human judges. 
<Paragraph position="3"> The resulting 600 paraphrase pairs were mixed and presented in random order to four human judges. Each judge was asked to assess the correctness of 150 paraphrase pairs (75 pairs from each system) based on the context, i.e., the sentence group, from which the paraphrase pair was extracted. Judges were given three choices: &quot;Correct&quot;, for perfect paraphrases; &quot;Partially correct&quot;, for paraphrases in which there is only a partial overlap between the meanings of the two paraphrases (e.g., while {saving set, aid package} is a correct paraphrase pair in the given context, {set, aid package} is considered only partially correct); and &quot;Incorrect&quot;. The results of the evaluation are presented in Table 1.</Paragraph> <Paragraph position="5"> Although the four evaluators were judging four different sets, each of them clearly rated a higher percentage of the outputs produced by the syntax-based alignment algorithm as &quot;Correct&quot;. We should note that there are parameters specific to the co-training algorithm that we did not tune for this particular corpus. In addition, the co-training algorithm recovered more paraphrase pairs: the syntax-based algorithm extracted 8666 pairs in total, with 1051 of them extracted at least twice (i.e., more or less reliable), while the corresponding numbers for the co-training algorithm are 2934 out of a total of 16993 pairs. This means we are not comparing accuracy at the same recall level.</Paragraph> <Paragraph position="6"> Aside from evaluating the correctness of the paraphrases, we are also interested in the degree of overlap between the paraphrase pairs discovered by these two very different algorithms. We find that, out of the 1051 paraphrase pairs that were extracted from more than one sentence group by the syntax-based algorithm, 62.3% were also extracted by the co-training algorithm; and out of the 2934 paraphrase pairs from the results of the co-training algorithm, 33.4% were also extracted by the syntax-based algorithm. This shows that, in spite of the very different cues the two algorithms rely on, they do discover many pairs in common.</Paragraph> <Paragraph position="7"> In order to (roughly) estimate the recall (of lexical synonyms) of our algorithm, we use the synonymy relation in WordNet to extract all the synonym pairs present in our corpus. This extraction process yields the list of all WordNet-consistent synonym pairs that are present in our data. (Note that some of the pairs identified as synonyms by WordNet, like &quot;follow/be&quot;, are not really synonyms in the contexts defined by our data set, which may lead to an artificial deflation of our recall estimate.) Once we have the list of WordNet-consistent paraphrases, we can check how many of them are recovered by our method. Table 2 gives the percentage of pairs recovered for each range of average sentence length (ASL) in the group.</Paragraph> <Paragraph position="8"> Not surprisingly, we get higher recall with shorter sentences, since long sentences tend to differ in their syntactic structures fairly high up in the parse trees, which leads to fewer mergings at the lexical level. The recall on the task of extracting lexical synonyms, as defined by WordNet, is not high; but this is not the task our algorithm was designed for. It is worth noting that the syntax-based algorithm also picks up many paraphrases that are not identified as synonyms in WordNet. Out of the 3217 lexical paraphrases learned by our system, only 493 (15.3%) are WordNet synonyms, which suggests that paraphrasing is a much richer and looser relation than synonymy. However, the WordNet-based recall figures suggest that WordNet can be used as an additional source of information to be exploited by our algorithm.</Paragraph>
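As an illustration of this recall estimate, the sketch below shows one way the WordNet-consistent synonym pairs present in a sentence group could be listed and compared against the pairs an algorithm extracts. It assumes NLTK's WordNet interface and a simple whitespace tokenization; neither is specified in the paper, and the helper names are ours.

```python
from itertools import combinations
from nltk.corpus import wordnet as wn   # requires the WordNet data (nltk.download("wordnet"))

def wordnet_synonym_pairs(group_sentences):
    """Word pairs occurring in one sentence group that share at least one WordNet synset."""
    vocab = sorted({w.lower() for s in group_sentences for w in s.split()})
    pairs = set()
    for w1, w2 in combinations(vocab, 2):
        if set(wn.synsets(w1)) & set(wn.synsets(w2)):   # shared synset: WordNet-consistent pair
            pairs.add((w1, w2))
    return pairs

def synonym_recall(extracted_pairs, wordnet_pairs):
    """Fraction of WordNet-consistent pairs that the extraction algorithm recovered."""
    if not wordnet_pairs:
        return 0.0
    extracted = {tuple(sorted(p)) for p in extracted_pairs}
    return sum(1 for p in wordnet_pairs if tuple(sorted(p)) in extracted) / len(wordnet_pairs)
```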
</Section> <Section position="2" start_page="12" end_page="12" type="sub_section"> <SectionTitle> 4.2 Evaluating the FSA directly </SectionTitle> <Paragraph position="0"> We noted before that, apart from being a natural representation of paraphrases, the FSAs that we build have their own merit and deserve to be evaluated directly. Since our FSAs contain large numbers of paths, we design automatic evaluation metrics to assess their quality.</Paragraph> <Paragraph position="1"> If we take our claims seriously, each path in our FSAs that connects the start and end nodes should correspond to a well-formed sentence. We are interested in both quantity (how many sentences our automata are able to produce) and quality (how good these sentences are). To answer the first question, we simply count the number of paths produced by our FSAs.</Paragraph> <Paragraph position="2"> Table 3 reports the number of paths N produced by our FSAs (maximum and average N, and maximum and average logN), broken down by the average length of sentences in the input sentence groups. For example, the sentence groups that have between 10 and 20 words produce, on average, automata that can yield 4468 alternative, semantically equivalent formulations.</Paragraph> <Paragraph position="3"> Note that if we always obtained the same degree of merging per word across all sentence groups, the number of paths would tend to increase with sentence length. This is not the case here: apparently, we obtain less merging with longer sentences. Still, given 11 sentences, we are capable of generating hundreds, thousands, and in some cases even millions of sentences.</Paragraph>
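Since the FSAs are acyclic, the number of start-to-end paths can be counted without enumerating them. The following is a small illustrative sketch (our own; the paper does not specify how its FSAs are stored) of that count via dynamic programming over a topological order, using an assumed edge-list representation.

```python
from collections import defaultdict, deque

def count_paths(edges, start, end):
    """edges: iterable of (src, dst) transitions of an acyclic FSA; returns the
    number of distinct paths from start to end."""
    succ = defaultdict(list)
    indeg = defaultdict(int)
    nodes = set()
    for u, v in edges:
        succ[u].append(v)
        indeg[v] += 1
        nodes.update((u, v))
    # Kahn's algorithm gives a topological order of the acyclic automaton.
    queue = deque(n for n in nodes if indeg[n] == 0)
    paths = defaultdict(int)
    paths[start] = 1
    while queue:
        u = queue.popleft()
        for v in succ[u]:
            paths[v] += paths[u]          # every path reaching u extends to v
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    return paths[end]

# Toy usage: an FSA with two alternative middles has 2 start-to-end paths.
print(count_paths([("s", "a"), ("s", "b"), ("a", "e"), ("b", "e")], "s", "e"))  # 2
```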
<Paragraph position="4"> Obviously, we should not be too pleased with our ability to multiply the number of equivalent formulations if those formulations are incorrect. To assess the quality of the FSAs generated by our algorithm, we use a language-model-based metric.</Paragraph> <Paragraph position="5"> We train a 4-gram model over one year of the Wall Street Journal using the CMU-Cambridge Statistical Language Modeling toolkit (v2). For each sentence group SG, we use this language model to estimate the average entropy of the 11 original sentences in that group (ent(SG)). We also compute the average entropy of all the sentences in the corresponding FSA built by our syntax-based algorithm (ent(FSA)). As the statistics in Table 4 show, there is little difference between the average entropy of the original sentences and the average entropy of the paraphrase sentences we produce. To better calibrate this result, we compare it with the average entropy of 6 corresponding machine translation outputs (ent(MTS)), which were also made available by LDC in conjunction with the same corpus. As one can see, the difference between the average entropy of the machine-produced outputs and the average entropy of the original 11 sentences is much higher than the difference between the average entropy of the FSA-produced outputs and the average entropy of the original 11 sentences. Obviously, this does not mean that our FSAs produce only well-formed sentences. But it does mean that, according to a language model, our FSAs produce sentences that look more like human-produced sentences than like machine-produced ones.</Paragraph> <Paragraph position="6"> Not surprisingly, the language model we used in Section 4.2.1 is far from being a perfect judge of sentence quality. Recall the example of a &quot;bad&quot; path we gave in Section 1: the battle of last week's fighting took at least 12 people lost their people died in the fighting last week's fighting. Our 4-gram-based language model will not find any fault with this sentence. Notice, however, that some words (such as &quot;fighting&quot; and &quot;people&quot;) appear at least twice in this path, although they are not repeated in any of the source sentences. These erroneous repetitions indicate mis-alignment. By measuring the frequency of words that are mistakenly repeated, we can examine quantitatively whether a direct application of the MSA algorithm suffers from different constituent orderings, as we expected.</Paragraph> <Paragraph position="7"> For each sentence group, we compile the list of words that never appear more than once in any sentence of the group. Given a word from this list and the FSA built from this group, we count the total number of paths that contain this word (C) and the number of paths in which this word appears at least twice (Cr, i.e., the number of erroneous repetitions). We define the repetition ratio to be Cr/C, which is the proportion of &quot;bad&quot; paths in this FSA with respect to this word. If we compute this ratio for all the words in the lists of the first 499 groups and the corresponding FSAs produced by an instantiation of the MSA algorithm, the average repetition ratio is 0.0304992 (14.76% of the words have a non-zero repetition ratio, and the average ratio for these words is 0.206671). In comparison, the average repetition ratio for our algorithm is 0.0035074 (2.16% of the words have a non-zero repetition ratio, and the average ratio for these words is 0.162309). (In principle, our algorithm should not yield any non-zero repetition ratio; however, if a mis-alignment is not prevented by keyword checking in an FSA, the FSA may contain paths with erroneous repetitions of words after squeezing.) The presence of different constituent orderings does pose a more serious problem for the MSA algorithm.</Paragraph> <Paragraph position="8"> Recently, Papineni et al. (2002) have proposed an automatic MT evaluation technique (the BLEU score). Given an MT system output and a set of reference translations, one can estimate the &quot;goodness&quot; of the MT output by measuring the n-gram overlap between the output and the reference set. The higher the overlap, i.e., the closer an output string is to the set of reference translations, the better a translation it is.</Paragraph> <Paragraph position="10"> We hypothesize that our FSAs provide a better representation against which the outputs of MT systems can be evaluated, because they encode not just a few but thousands of equivalent formulations of the desired meaning. Ideally, if the FSAs we build accepted all and only the correct renderings of a given meaning, we could simply give a test sentence to the reference FSA and see whether it is accepted. Since this is not a realistic expectation, we instead measure the edit distance between a string and an FSA: the smaller this distance, the closer the string is to the meaning represented by the FSA.</Paragraph>
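As an illustration of how such a string-to-FSA edit distance can be computed, here is a minimal sketch (our own; the paper does not give its implementation) of the word-level edit distance between a test sentence and an acyclic, word-labelled FSA, i.e., the minimum Levenshtein distance to any start-to-end path. Representing the FSA as a topologically ordered state list plus labelled edges is an assumption.

```python
def edit_distance_to_fsa(sentence, states, edges, start, end):
    """sentence: list of words; states: all states in topological order;
    edges: list of (src, dst, word); returns the minimum word-level edit
    distance between the sentence and any start-to-end path."""
    INF = float("inf")
    n = len(sentence)
    dist = {q: [INF] * (n + 1) for q in states}   # dist[q][j]: best alignment of some
    dist[start][0] = 0                            # path reaching q with sentence[:j]
    out = {q: [] for q in states}
    for u, v, w in edges:
        out[u].append((v, w))
    for q in states:                              # topological order: dist[q] is final here
        row = dist[q]
        for j in range(n):                        # insert sentence[j] without moving in the FSA
            row[j + 1] = min(row[j + 1], row[j] + 1)
        for v, w in out[q]:
            for j in range(n + 1):
                if row[j] == INF:
                    continue
                dist[v][j] = min(dist[v][j], row[j] + 1)           # skip the edge word
                if j < n:                                          # match or substitute
                    cost = 0 if w == sentence[j] else 1
                    dist[v][j + 1] = min(dist[v][j + 1], row[j] + cost)
    return dist[end][n]

# Toy usage: the FSA encodes "the {aid|rescue} package"; the test sentence matches one path.
states = ["s", "a", "b", "e"]
edges = [("s", "a", "the"), ("a", "b", "aid"), ("a", "b", "rescue"), ("b", "e", "package")]
print(edit_distance_to_fsa(["the", "rescue", "package"], states, edges, "s", "e"))  # 0
```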
<Paragraph position="11"> To assess whether our FSAs are more appropriate representations for evaluating the output of MT systems, we perform the following experiment. For each sentence group, we hold out one sentence as the test sentence and try to evaluate how much of it can be predicted from the other 10 sentences. We compare two ways of estimating this predictive power: (a) we compute the edit distance between the test sentence and each of the other 10 sentences in the set, and take the minimum of these distances, ed(input); (b) we use dynamic programming to efficiently compute the minimum distance, ed(FSA), between the test sentence and all the paths in the FSA built from the other 10 sentences. The smaller the edit distance, the better we are predicting the test sentence. The difference between these two measures, ed(input) - ed(FSA), therefore characterizes how much predictive power is gained by building the FSA.</Paragraph> <Paragraph position="12"> We carry out the experiment described above in a &quot;leave-one-out&quot; fashion (i.e., each sentence serves as the test sentence once). Let edgain be the average of ed(input) - ed(FSA) over the 11 runs for a given group. We compute this for all 899 groups and find the mean of edgain to be 0.91 (std. dev. = 0.78). Table 5 (statistics for edgain) gives the count of groups whose edgain falls into each range: 0-1: 546, 1-2: 256, 2-3: 80, 3-4: 15, 4-5: 2. We can see that the majority of edgain values fall under 2.</Paragraph> <Paragraph position="13"> We are also interested in the relation between the predictive power of the FSAs and the number of reference translations they are derived from. For a given group, we randomly order the sentences in it, set the last one as the test sentence, and try to predict it with the first 1, 2, 3, ..., 10 sentences, investigating whether more reference sentences yield an increase in predictive power.</Paragraph> <Paragraph position="15"> Let ed(FSA_n) be the edit distance from the test sentence to the FSA built on the first n sentences; similarly, let ed(input_n) be the minimum edit distance from the test sentence to an input set consisting of only the first n sentences. Table 6 reports the effect of using different numbers of reference translations. The first column shows that each translation contributes to the predictive power of our FSA: even when we add the tenth translation to the FSA, we still improve its predictive power. The second column shows that the more sentences we add to the FSA, the larger the difference between its predictive power and that of a simple set of reference sentences. The results in Table 6 suggest that our FSAs may be used to refine the BLEU metric (Papineni et al., 2002).</Paragraph> </Section> </Section> </Paper>