<?xml version="1.0" standalone="yes"?> <Paper uid="I05-5003"> <Title>Using Machine Translation Evaluation Techniques to Determine Sentence-level Semantic Equivalence</Title> <Section position="3" start_page="17" end_page="19" type="metho"> <SectionTitle> 2 MT Evaluation Methods </SectionTitle> <Paragraph position="0"> MT evaluation schemes score a set of MT system output segments (sentences in our case) S = {s1,s2,...,sI} with respect to a set of references R corresponding to correct translations for their respective segments. Since we classify sentence pairs, we only consider the case of using a single reference for evaluation. Thus the set of references is given by: R = {r1,r2,...,rI}.</Paragraph> <Section position="1" start_page="17" end_page="17" type="sub_section"> <SectionTitle> 2.1 WER </SectionTitle> <Paragraph position="0"> Word error rate (WER) (Su et al., 1992) is a measure of the number of edit operations required to transform one sentence into another, defined as:</Paragraph> <Paragraph position="2"> where I(si,ri), D(si,ri) and S(si,ri) are the number of insertions, deletions and substitutions respectively.</Paragraph> </Section> <Section position="2" start_page="17" end_page="17" type="sub_section"> <SectionTitle> 2.2 PER </SectionTitle> <Paragraph position="0"> Position-independent word error rate (PER) (Tillmann et al., 1997) is similar to WER except that word order is not taken into account, both sentences are treated as bags of words:</Paragraph> <Paragraph position="2"> where diff(si,ri) is the number of words observed only in si.</Paragraph> </Section> <Section position="3" start_page="17" end_page="17" type="sub_section"> <SectionTitle> 2.3 BLEU </SectionTitle> <Paragraph position="0"> The BLEU score (Papineni et al., 2001) is based on the geometric mean of n-gram precision. The score is given by:</Paragraph> <Paragraph position="2"> where N is the maximum n-gram size.</Paragraph> <Paragraph position="3"> The n-gram precision pn is given by:</Paragraph> <Paragraph position="5"> where count(ngram) is the count of ngram found in both si and ri and countsys(ngram) is the count of ngram in si.</Paragraph> <Paragraph position="6"> The brevity penalty BP penalizes MT output for being shorter than the corresponding references and is given by: bracketrightBiggbracketrightBigg where Lsys is the number of words in the MT output sentences and Lref is the number of words in the corresponding references.</Paragraph> <Paragraph position="7"> The BLEU brevity penalty is a single value computed over the whole corpus rather than an average of sentence level penalties which would have made its effect too severe. For this reason, in our experiments we omit the brevity penalty from the BLEU score. Its effect is small since the reference sentences and system outputs are drawn fromthesamesampleandhaveapproximatelythe same average length.</Paragraph> <Paragraph position="8"> We ran experiments for N = 1...4, these are referred to as BLEU1 to BLEU4 respectively.</Paragraph> </Section> <Section position="4" start_page="17" end_page="18" type="sub_section"> <SectionTitle> 2.4 NIST </SectionTitle> <Paragraph position="0"> The NIST score (Doddington, 2002) also uses n-gram precision, differing in that an arithmetic mean is used, weights are used to emphasize informative word sequences and a different brevity penalty is used: Sentence pair 1 (semantically equivalent): 1. Amrozi accused his brother, whom he called &quot;the witness&quot;, of deliberately distorting his evidence. 2. 
<Section position="4" start_page="17" end_page="18" type="sub_section"> <SectionTitle> 2.4 NIST </SectionTitle> <Paragraph position="0"> The NIST score (Doddington, 2002) also uses n-gram precision, differing in that an arithmetic mean is used, weights are used to emphasize informative word sequences, and a different brevity penalty is used:</Paragraph> <Paragraph position="1"> $$\mathrm{NIST} = \mathrm{BP} \cdot \sum_{n=1}^{N} \frac{\sum_{\text{co-occurring } ngram} \mathrm{info}(ngram)}{\sum_{ngram \in s_i} 1}$$ where $\mathrm{info}(w_1 \ldots w_n) = \log_2\left(\frac{\mathrm{count}(w_1 \ldots w_{n-1})}{\mathrm{count}(w_1 \ldots w_n)}\right)$ weights each n-gram by how informative it is.</Paragraph> <Paragraph position="2"> For NIST the brevity penalty is computed on a segment-by-segment basis and is given by: $$\mathrm{BP} = \exp\left(\beta \log^2\left[\min\left(\frac{L_{\mathrm{sys}}}{L_{\mathrm{ref}}}, 1\right)\right]\right)$$ where L_sys is the length of the MT system output, L_ref is the average number of words in a reference translation, and β is chosen to make BP = 0.5 when L_sys is 2/3 of the average reference length.</Paragraph> <Paragraph position="4"> We ran experiments for N = 1, ..., 5; these are referred to as NIST1 to NIST5 respectively. We include the brevity penalty in the scores used for our experiments.</Paragraph> <Paragraph position="5"> Figure 1. Example sentence pairs from the MSRP corpus. Sentence pair 1 (semantically equivalent): 1. Amrozi accused his brother, whom he called &quot;the witness&quot;, of deliberately distorting his evidence. 2. Referring to him as only &quot;the witness&quot;, Amrozi accused his brother of deliberately distorting his evidence. Sentence pair 2 (not semantically equivalent): 1. Yucaipa owned Dominick's before selling the chain to Safeway in 1998 for $2.5 billion. 2. Yucaipa bought Dominick's in 1995 for $693 million and sold it to Safeway for $1.8 billion in 1998. Sentence pair 3 (semantically equivalent): 1. The stock rose $2.11, or about 11 percent, to close Friday at $21.51 on the New York Stock Exchange. 2. PG&E Corp. shares jumped $1.63 or 8 percent to $21.03 on the New York Stock Exchange on Friday.</Paragraph> </Section>
<Section position="5" start_page="18" end_page="19" type="sub_section"> <SectionTitle> 2.5 Introducing Part of Speech Information </SectionTitle> <Paragraph position="0"> Early experiments based on the PER score revealed that removing certain classes of function words from the edit distance calculation had a positive impact on classification performance. Instead of simply removing these words, we created a mechanism that would allow the classifier to learn for itself the usefulness of various classes of word. For example, one would expect edits involving nouns or verbs to cost more than edits involving interjections or punctuation. We used a POS tagger for the UPENN tag set (Marcus et al., 1994) to label all the data. We then divided the total edit distance into components, one for each POS tag, which hold the amount of edit distance that words bearing this POS tag contributed to the total edit distance. The feature vector therefore has one element for each UPENN POS tag.</Paragraph> <Paragraph position="1"> Let W^- be the bag of words from s_i that have no matches in r_i and let W^+ be the bag of words from s_i that have matches in r_i. The value of the element of the feature vector f^- corresponding to the contribution to the PER from POS tag t is given by:</Paragraph> <Paragraph position="2"> $$f^{-}_{t} = \frac{\sum_{w \in W^{-}} \mathrm{count}^{-}_{t}(w)}{|r_i|}$$ </Paragraph> <Paragraph position="3"> where count^-_t(w) is the number of times word w occurs in W^- with tag t.</Paragraph> <Paragraph position="4"> The feature vector defined above characterizes the nature of the words in the sentences that do not match. However, it might also be important to include information on the words in the sentence that match. To investigate this, we augment the feature vector f^- with an analogous set of features f^+ (again one for each UPENN POS tag) that represent the distribution over the tag set of word unigram precision, given by:</Paragraph> <Paragraph position="5"> $$f^{+}_{t} = \frac{\sum_{w \in W^{+}} \mathrm{count}^{+}_{t}(w)}{|s_i|}$$ </Paragraph> <Paragraph position="6"> where count^+_t(w) is the number of times word w occurs in W^+ with tag t.</Paragraph> <Paragraph position="7"> This technique is analogous to the NIST score in that it allows the classifier to weight the importance of matches, but differs in that this weight is learned rather than defined, and is with respect to the word's grammatical/semantic role rather than a function of rarity. When both f^+ and f^- are used in combination, the method differs again by utilizing information about the nature of both the matching words and the non-matching words.</Paragraph> <Paragraph position="8"> We will refer to the system based only on the feature vector f^- as POS-, that based only on f^+ as POS+, and that based on both as POS.</Paragraph> </Section>
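As a concrete illustration of these POS-decomposed features, here is a small Python sketch (an assumption for illustration, not the authors' code). It takes sentences that have already been POS-tagged with Penn Treebank tags, and normalizes by sentence length so that the non-match features sum to the PER-style rate and the match features sum to word unigram precision, as described above.

```python
from collections import Counter

def pos_per_features(sys_tagged, ref_tagged, tagset):
    """Decompose bag-of-words matches and non-matches of sys_tagged
    against ref_tagged by Penn Treebank POS tag.

    sys_tagged, ref_tagged: lists of (word, tag) pairs.
    Returns (f_minus, f_plus) dicts keyed by tag: the f_minus values sum
    to the non-match rate, the f_plus values to unigram precision.
    """
    available = Counter(w for w, _ in ref_tagged)  # reference words left to match
    f_minus = dict.fromkeys(tagset, 0.0)
    f_plus = dict.fromkeys(tagset, 0.0)
    for word, tag in sys_tagged:
        if available[word] > 0:                    # word has a match in the reference
            available[word] -= 1
            f_plus[tag] = f_plus.get(tag, 0.0) + 1.0 / len(sys_tagged)
        else:                                      # word occurs only in sys_tagged
            f_minus[tag] = f_minus.get(tag, 0.0) + 1.0 / len(ref_tagged)
    return f_minus, f_plus

tags = ["DT", "NN", "VBD", "CD", "IN", "."]
a = [("the", "DT"), ("stock", "NN"), ("rose", "VBD"), ("11", "CD"), ("percent", "NN")]
b = [("the", "DT"), ("shares", "NN"), ("jumped", "VBD"), ("8", "CD"), ("percent", "NN")]
f_minus, f_plus = pos_per_features(a, b, tags)
print(f_minus["NN"], f_plus["NN"])  # per-tag non-match and match contributions
```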
<Section position="6" start_page="19" end_page="19" type="sub_section"> <SectionTitle> 2.6 Dealing with Synonyms </SectionTitle> <Paragraph position="0"> Often in paraphrases the semantic information carried by a word in one sentence is conveyed by a synonymous word in its paraphrase. To cover these cases we investigated the effect of allowing words to match with synonyms in the edit distance calculations. Another pilot experiment was run with a modified edit distance that allowed words in the sentences to match if their semantic distance was less than a specific threshold (chosen by visual inspection of the output of the system). The semantic distance measure we used was that of (Jiang and Conrath, 1997), defined using the relationships between words in the WordNet database (Fellbaum, 1998). A performance improvement of approximately 0.6% was achieved on the semantic equivalence task using this strategy.</Paragraph> </Section> </Section>
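The paper does not specify the matching code or the threshold, which was chosen by inspection. The sketch below shows one way such a synonym match could be implemented with NLTK's WordNet interface (it requires the `wordnet` and `wordnet_ic` data packages). Note that it thresholds the Jiang-Conrath similarity rather than the distance, and the threshold value, function name, and restriction to same-POS synsets are illustrative assumptions.

```python
from nltk.corpus import wordnet as wn, wordnet_ic
from nltk.corpus.reader.wordnet import WordNetError

brown_ic = wordnet_ic.ic('ic-brown.dat')  # information-content counts from the Brown corpus

def synonym_match(word1, word2, sim_threshold=0.2):
    """Treat two words as matching if they are identical, or if any pair of
    their WordNet synsets has Jiang-Conrath similarity above a threshold.
    The threshold here is illustrative, not the value used in the paper."""
    if word1 == word2:
        return True
    for s1 in wn.synsets(word1):
        for s2 in wn.synsets(word2):
            if s1.pos() != s2.pos():
                continue  # Jiang-Conrath is only defined within a part of speech
            try:
                if s1.jcn_similarity(s2, brown_ic) >= sim_threshold:
                    return True
            except WordNetError:
                continue  # no information content available for this part of speech
    return False

print(synonym_match("purchased", "bought"))
```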
<Section position="4" start_page="19" end_page="19" type="metho"> <SectionTitle> 3 Experimental Data </SectionTitle> <Paragraph position="0"> Two corpora were used for the experiments in this paper: the Microsoft Research Paraphrase Corpus (MSRP) and the PASCAL Challenge's entailment recognition corpus (PASCAL). Corpus statistics for these corpora (after pre-processing) are presented in Table 1.</Paragraph> <Paragraph position="1"> The MSRP corpus consists of 5801 sentence pairs drawn from a corpus of news articles from the internet. The sentences were annotated by human annotators with labels indicating whether or not the two sentences are close enough in meaning to be close paraphrases. Multiple annotators were used to annotate each sentence: two annotators labeled the data and a third resolved the cases where they disagreed. The average inter-annotator agreement on this task was 83%, indicating the difficulty of defining the task and the ambiguity of the labeling. Approximately 67% of the sentence pairs were judged to be paraphrases. The data was divided randomly into 4076 training pairs and 1725 test pairs. For full details of how the corpus was collected we refer the reader to the corpus documentation. To give an idea of the nature of the data and the difficulty of the task, three sentence pairs from the corpus are shown in Figure 1. The examples show the ambiguity inherent in this task. The first sentence pair is clearly a pair of paraphrases. The second pair of sentences share semantic information, but were judged to be not semantically equivalent. The third pair are not paraphrases; they clearly describe the movements of totally different stocks, but the sentences share sufficient semantic content to be labeled equivalent.</Paragraph> <Paragraph position="2"> For the MSRP corpus we present results using the provided training and test sets to allow comparison with our results. To obtain more accurate figures and to get an estimate of the confidence intervals, we also conducted experiments by 10-fold jackknifing over all the data. The results from each fold were then averaged and 95% confidence intervals were estimated for the means.</Paragraph> <Paragraph position="3"> The PASCAL data consists of 567 development pairs and 800 test pairs drawn from 7 domains: comparable documents (CD), information extraction (IE), machine translation (MT), question answering (QA), reading comprehension (RC), paraphrase acquisition (PP) and information retrieval (IR). The PASCAL corpus differs from the MSRP corpus in that it is annotated for entailment rather than semantic equivalence. This explains the asymmetry in the sentence lengths, which is apparent even in the PP component of the corpus. We do not present results for 10-fold jackknifing on the PASCAL data since the data were too small in number for this type of analysis.</Paragraph> <Paragraph position="4"> In Table 1, &quot;Sentence 1&quot; refers to the first sentence of a sentence pair in the corpus, and &quot;Sentence 2&quot; to the second; &quot;edit distance&quot; is the average Levenshtein distance between the sentences of the pairs. The length distance ratio (LDR) is given by:</Paragraph> <Paragraph position="5"> $$\mathrm{LDR} = \frac{1}{I} \sum_{i=1}^{I} \frac{\big|\, |s_i| - |r_i| \,\big|}{\max(|s_i|, |r_i|)}$$ </Paragraph> <Paragraph position="6"> This measures the similarity of the lengths of the sentences in the pairs; it has the property of being 0 when all sentence pairs have sentences of the same length and 1 when all sentence pairs differ maximally in length. For the PASCAL corpus the LDR is around 0.5 for the corpus as a whole, corresponding to a large difference in the sentence lengths. The CD component of the corpus is considerably more consistent in terms of sentence length. The differences among the tasks in terms of edit distance are less clear-cut, with the PP task having the lowest average edit distance despite its higher LDR. The MSRP corpus has an LDR of only 0.14. Its sentence pairs are more similar in terms of their length and edit distance than those in the PASCAL corpus. We will argue later that this length similarity has a significant effect on the performance and applicability of these techniques.</Paragraph> </Section>
<Section position="5" start_page="19" end_page="21" type="metho"> <SectionTitle> 4 Experimental Methodology </SectionTitle> <Section position="1" start_page="19" end_page="19" type="sub_section"> <SectionTitle> 4.1 Tokenization </SectionTitle> <Paragraph position="0"> In order that the sentences could be tagged with UPENN tags (Marcus et al., 1994), they were pre-processed by a tokenizer. After tokenization the average MSRP sentence length was 21 words.</Paragraph> </Section> <Section position="2" start_page="19" end_page="19" type="sub_section"> <SectionTitle> 4.2 Stemming </SectionTitle> <Paragraph position="0"> Stemming conflates morphologically related words to the same root and has been shown to have a beneficial effect on IR tasks (Krovetz, 1993). A pilot experiment showed that the performance of a PER-based system degraded if the stemmed form of the word was used in place of the surface form. However, if the stemmer was applied only to words labeled by a POS tagger as verbs and nouns, a performance improvement of around 0.8% was observed on the semantic equivalence task. Therefore, for the purposes of the experiments, the nouns and verbs in the sentences were all pre-processed by a stemmer.</Paragraph> </Section> <Section position="3" start_page="19" end_page="21" type="sub_section"> <SectionTitle> 4.3 Classification </SectionTitle> <Paragraph position="0"> We used a support vector machine (SVM) classifier (Vapnik, 1995) with radial basis function kernels to classify the data. The training sets for the respective corpora were used for training, except in the jackknifing experiments. Feature vectors (an example is given in Figure 2) were constructed directly from the output of the MT evaluation systems, when used. The vector has two parts, one due to matches and one due to non-matches.</Paragraph> <Paragraph position="1"> The sum of the elements corresponding to non-matches is equal to the PER. We calculated the vectors treating each sentence in the pair in turn as reference and as system output, and averaged the two to obtain the vector for the pair.</Paragraph>
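To make the classification setup concrete, here is a sketch (an assumption, not the authors' implementation) of how a symmetric pair vector could be built from the POS features above and fed to an RBF-kernel SVM using scikit-learn. The `pos_per_features` function is the hypothetical helper from the Section 2.5 sketch, and the SVM hyperparameters are library defaults rather than values reported in the paper.

```python
import numpy as np
from sklearn.svm import SVC

# pos_per_features() is the hypothetical helper sketched in Section 2.5.

def pair_vector(tagged_a, tagged_b, tagset):
    """Symmetric feature vector for a sentence pair: compute the f-/f+
    features in both directions (each sentence once as system output,
    once as reference) and average them."""
    def one_direction(sys_tagged, ref_tagged):
        f_minus, f_plus = pos_per_features(sys_tagged, ref_tagged, tagset)
        return np.array([f_minus[t] for t in tagset] +
                        [f_plus[t] for t in tagset])
    return (one_direction(tagged_a, tagged_b) +
            one_direction(tagged_b, tagged_a)) / 2.0

def train_classifier(train_pairs, tagset):
    """train_pairs: list of (tagged_a, tagged_b, label) triples, where
    label is 1 for semantically equivalent pairs and 0 otherwise."""
    X = np.array([pair_vector(a, b, tagset) for a, b, _ in train_pairs])
    y = np.array([label for _, _, label in train_pairs])
    clf = SVC(kernel="rbf")  # radial basis function kernel; C and gamma left at defaults
    clf.fit(X, y)
    return clf
```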
</Section> </Section> </Paper>