Monolingual Machine Translation for Paraphrase Generation

3 Data collection

Our training corpus, like those of Shinyama et al. and Barzilay & Lee, consists of different news stories reporting the same event. While previous work with comparable news corpora has been limited to just two news sources, we set out to harness the ongoing explosion in internet news coverage. Thousands of news sources worldwide are competing to cover the same stories, in real time. Despite different authorship, these stories cover the same events and therefore have significant content overlap, especially in reports of the basic facts. In other cases, news agencies introduce minor edits into a single original AP or Reuters story. We believe that our work constitutes the first attempt to exploit these massively multiple data sources for paraphrase learning and generation.

3.1 Gathering aligned sentence pairs

We began by identifying sets of pre-clustered URLs that point to news articles on the Web, gathered from publicly available sites such as http://news.yahoo.com/, http://news.google.com and http://uk.newsbot.msn.com. Their clustering algorithms appear to consider the full text of each news article, in addition to temporal cues, to produce sets of topically/temporally related articles. Story content is captured by downloading the HTML and isolating the textual content. A supervised HMM was trained to distinguish story content from surrounding advertisements, etc. (We hand-tagged 1,150 articles to indicate which portions of the text were story content and which were advertisements, image captions, or other unwanted material. We evaluated several classifiers on a 70/30 train/test split and found that an HMM trained on a handful of features was most effective in identifying content lines, at 95% F-measure.)

Over the course of about 8 months, we collected 11,162 clusters, comprising 177,095 articles and averaging 15.8 articles per cluster. The quality of these clusters is generally good. Impressionistically, discrete events like sudden disasters, business announcements, and deaths tend to yield tightly focused clusters, while ongoing stories like the SARS crisis tend to produce very large and unfocused clusters.

To extract likely paraphrase sentence pairs from these clusters, we used edit distance (Levenshtein 1966) over words, comparing all sentences pairwise within a cluster to find the minimal number of word insertions and deletions transforming the first sentence into the second. Each sentence was normalized to lower case, and the pairs were filtered to reject:

* Sentence pairs where the sentences were identical or differed only in punctuation;
* Duplicate sentence pairs;
* Sentence pairs with significantly different lengths (the shorter is less than two-thirds the length of the longer);
* Sentence pairs where the Levenshtein distance was greater than 12.0.

A total of 139K non-identical sentence pairs were obtained. Mean Levenshtein distance was 5.17; mean sentence length was 18.6 words.
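To make the extraction step concrete, the sketch below (ours, not the original implementation) applies these filters using a word-level edit distance restricted to insertions and deletions, as described above; the two-thirds length ratio and the distance cutoff of 12 mirror the thresholds listed.

```python
from itertools import combinations

def word_edit_distance(a, b):
    """Minimal number of word insertions and deletions turning a into b
    (a substitution counts as one deletion plus one insertion)."""
    m, n = len(a), len(b)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                curr[j] = prev[j - 1]          # words match: no edit
            else:
                curr[j] = min(prev[j], curr[j - 1]) + 1  # delete or insert
        prev = curr
    return prev[n]

PUNCT = {",", ".", ";", ":", "!", "?", "'", '"'}

def likely_paraphrase_pairs(cluster, max_dist=12):
    """Yield filtered sentence pairs from one cluster of tokenized sentences."""
    seen = set()
    for s1, s2 in combinations(cluster, 2):
        a = [w.lower() for w in s1]
        b = [w.lower() for w in s2]
        # reject pairs that are identical or differ only in punctuation
        if [w for w in a if w not in PUNCT] == [w for w in b if w not in PUNCT]:
            continue
        key = (tuple(a), tuple(b))
        if key in seen:                        # reject duplicate pairs
            continue
        seen.add(key)
        short, long_ = sorted((len(a), len(b)))
        if short < (2 / 3) * long_:            # significantly different lengths
            continue
        if word_edit_distance(a, b) > max_dist:
            continue
        yield a, b
```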
3.2 Word alignment

To this corpus we applied the word alignment algorithms available in Giza++ (Och & Ney, 2000), a freely available implementation of IBM Models 1-5 (Brown et al., 1993) and the HMM alignment (Vogel et al., 1996), along with various improvements and modifications motivated by experimentation by Och & Ney (2000). In order to capture the many-to-many alignments that identify correspondences between idioms and other phrasal chunks, we align in the forward direction and again in the backward direction, heuristically recombining each unidirectional word alignment into a single bidirectional alignment (Och & Ney 2000).

Figure 1 shows an example of a monolingual alignment produced by Giza++. Each line represents a unidirectional link; directionality is indicated by a tick mark on the target side of the link.

We held out a set of news clusters from our training data and extracted a set of 250 sentence pairs for blind evaluation. Randomly extracted on the basis of an edit distance of $5 \le n \le 20$ (to allow a range of reasonably divergent candidate pairs while eliminating the most trivial substitutions), the gold-standard sentence pairs were checked by an independent human evaluator to ensure that they contained paraphrases before they were hand word-aligned. (The edit distance range was chosen on the basis of ablation experiments and optimal AER, discussed in 3.2.)

To evaluate the alignments, we adhered to the standards established in Melamed (2001) and Och & Ney (2000, 2003). Following Och & Ney's methodology, two annotators each created an initial annotation for each dataset, subcategorizing alignments as either SURE (necessary) or POSSIBLE (allowed, but not required). Differences were highlighted and the annotators were asked to review their choices on these differences. Finally we combined the two annotations into a single gold standard: if both annotators agreed that an alignment should be SURE, then the alignment was marked as SURE in the gold standard; otherwise the alignment was marked as POSSIBLE. (Final interrater agreement between the two annotators on the 250 sentences was 93.1%.)

To compute Precision, Recall, and Alignment Error Rate (AER) for the twin datasets, we used exactly the formulae listed in Och & Ney (2003). Let A be the set of alignments in the comparison, S be the set of SURE alignments in the gold standard, and P be the union of the SURE and POSSIBLE alignments in the gold standard. Then we have:

$$\mathrm{Precision} = \frac{|A \cap P|}{|A|}, \qquad \mathrm{Recall} = \frac{|A \cap S|}{|S|}, \qquad \mathrm{AER} = 1 - \frac{|A \cap S| + |A \cap P|}{|A| + |S|}$$

(The formula for AER given here and in Och & Ney (2003) is intended to compare an automatic alignment against a gold standard alignment. However, when comparing one human against another, both comparison and reference distinguish between SURE and POSSIBLE links. Because the AER is asymmetric, though each direction differs by less than 5%, we have presented the average of the directional AERs.)
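As a concrete reference, the following sketch (our code, not the paper's; alignments represented as sets of index pairs) computes these three quantities exactly as defined above.

```python
def alignment_scores(A, S, P):
    """Precision, Recall, and AER per Och & Ney (2003).
    A: proposed links; S: SURE gold links; P: SURE plus POSSIBLE gold links.
    Each argument is a set of (source_index, target_index) pairs."""
    a_p = len(A & P)
    a_s = len(A & S)
    precision = a_p / len(A)
    recall = a_s / len(S)
    aer = 1.0 - (a_s + a_p) / (len(A) + len(S))
    return precision, recall, aer

# Toy example: three proposed links against two SURE and one POSSIBLE link.
S = {(0, 0), (1, 2)}
P = S | {(2, 1)}
A = {(0, 0), (1, 2), (2, 2)}
print(alignment_scores(A, S, P))  # approximately (0.667, 1.0, 0.2)
```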
Table 1 shows the results of evaluating alignment after training the Giza++ model. Although the overall AER of 11.58% is higher than that of the best bilingual MT systems (Och & Ney, 2003), the training data is inherently noisy, having more in common with analogous corpora than with conventional MT parallel corpora in that the paraphrases are not constrained by the source text structure. The identical-word AER of 10.57% is unsurprising given that the domain is unrestricted and the alignment algorithm does not employ direct string matching to leverage word identity. (However, following the SMT practice of augmenting data with a bilingual lexicon, we did append an identity lexicon to the training data.) The non-identical-word AER of 20.88% may appear problematic in a system that aims to generate paraphrases; as we shall see, however, this turns out not to be the case. Ablation experiments, not described here, indicate that additional data will improve AER.

3.3 Identifying phrasal replacements

Recent work in SMT has shown that simple phrase-based MT systems can outperform more sophisticated word-based systems (e.g. Koehn et al. 2003). Therefore, we adopt a phrasal decoder patterned closely after that of Vogel et al. (2003). We view the source and target sentences S and T as word sequences $s_1 \ldots s_m$ and $t_1 \ldots t_n$. A word alignment A of S and T can be expressed as a function from each of the source and target tokens to a unique cept (Brown et al. 1993); isomorphically, a cept represents an aligned subset of the source and target tokens. Then, for a given sentence pair and word alignment, we define a phrase pair as a subset of the cepts in which both the source and target tokens are contiguous. (While this does preclude the use of "gapped" phrase pairs such as or - either ... or, we found such mappings to be both unwieldy in practice and very often indicative of a poor word alignment.) We gathered all phrase pairs (limited to those containing no more than five cepts, for reasons of computational efficiency) occurring in at least one aligned sentence somewhere in our training corpus into a single replacement database. This database of lexicalized phrase pairs, termed phrasal replacements, serves as the backbone of our channel model.

As in (Vogel et al. 2003), we assigned probabilities to these phrasal replacements via IBM Model 1. In more detail, we first gathered lexical translation probabilities of the form P(s|t) by running five iterations of Model 1 on the training corpus. This allows for computing the probability of a sequence of source words S given a sequence of target words T as the sum over all possible alignments of the Model 1 probabilities, which collapses into a product of sums:

$$P(S \mid T) = \sum_{A} P(S, A \mid T) \propto \prod_{j=1}^{m} \sum_{i=1}^{n} P(s_j \mid t_i)$$

(Brown et al. (1993) provides a more detailed derivation of this identity.)
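A sketch of that computation under our own naming, with `lex_prob` standing in for the learned Model 1 lexical table P(s|t); the probability floor for unseen pairs is our addition, not part of the original.

```python
import math

def model1_log_prob(source_words, target_words, lex_prob, floor=1e-12):
    """log P(S|T) via the Model 1 identity: the sum over all alignments
    collapses into, for each source word, a sum of lexical probabilities
    over the target words (length-dependent constants omitted)."""
    total = 0.0
    for s in source_words:
        inner = sum(lex_prob.get((s, t), 0.0) for t in target_words)
        total += math.log(max(inner, floor))  # floor guards unseen words
    return total

# Illustrative lexical probabilities (made-up values for the example):
lex = {("wounded", "injured"): 0.4, ("were", "were"): 0.9}
print(model1_log_prob(["were", "wounded"], ["were", "injured"], lex))
```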
Although simple, this approach has proven effective in SMT for several reasons. First and foremost, phrasal scoring by Model 1 avoids the sparsity problems associated with estimating each phrasal replacement probability with MLE (Vogel et al. 2003). Secondly, it appears to boost translation quality in more sophisticated translation systems by inducing lexical triggering (Och et al. 2004). Collocations and other non-compositional phrases receive a higher probability as a whole than they would as independent single-word replacements.

One further simplification was made. Given that our domain is restricted to the generation of monolingual paraphrase, interesting output can be produced without tackling the difficult problem of inter-phrase reordering. Therefore, along the lines of Tillmann et al. (1997), we rely on only monotone phrasal alignments, although we do allow intra-phrasal reordering. (Even in the realm of MT, such an assumption can produce competitive results (Vogel et al. 2003). In addition, we were hesitant to incur the exponential increase in running time associated with the movement models in the tradition of Brown et al. (1993), especially since these offset models fail to capture important linguistic generalizations, e.g., phrasal coherence and headedness.) While this means certain common structural alternations (e.g., active/passive) cannot be generated, we are still able to express a broad range of phenomena:

* Synonymy: injured - wounded
* Phrasal replacements: Bush administration - White House
* Intra-phrasal reorderings: margin of error - error margin

Our channel model, then, is determined solely by the phrasal replacements involved. We first assume a monotone decomposition of the sentence pair into phrase pairs (considering all phrasal decompositions equally likely); the probability P(S | T) is then simply the product of each phrasal replacement probability:

$$P(S \mid T) = \prod_{k} P(\tilde{s}_k \mid \tilde{t}_k)$$

where $\tilde{s}_k$ and $\tilde{t}_k$ are the source and target sides of the k-th phrase pair in the decomposition.

The target language model was a trigram model using interpolated Kneser-Ney smoothing (Kneser & Ney 1995), trained over all 1.4 million sentences (24 million words) in our news corpus.

3.4 Generating paraphrases

To generate paraphrases of a given input, a standard SMT decoding approach was used; this is described in more detail below. Prior to decoding, however, the input sentence underwent preprocessing: text was lowercased, tokenized, and a few classes of named entities were identified using regular expressions.

To begin the decoding process, we first constructed a lattice of all possible paraphrases of the source sentence based on our phrasal translation database. Figure 2 presents an example. The lattice was realized as a set of |S| + 1 vertices $v_0, \ldots, v_{|S|}$; each phrasal replacement spanning source words $s_i \ldots s_j$ with probability p was added as an edge from $v_{i-1}$ to $v_j$, labeled with its target words and p. Our replacement database was stored as a trie with words as edges; hence populating the lattice takes worst case O(n^2) time. Finally, since source and target languages are identical, we added an identity mapping for each source word $s_i$ with a uniform probability u. This allows for handling unseen words. A high u value permits more conservative paraphrases.

We found the optimal path through the lattice as scored by the product of the replacement model and the trigram language model. This algorithm reduces easily to the Viterbi algorithm; such a dynamic programming approach guarantees an efficient optimal search (worst case O(kn), where n is the maximal target length and k is the maximal number of replacements for any word). In addition, fast algorithms exist for computing n-best lists over a lattice (Soong & Huang 1991).
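The sketch below shows one way to realize this lattice-plus-search procedure; it is ours, not the paper's code, and it substitutes a bigram language model for the trigram used in the paper so that the dynamic-programming state stays small (a trigram version would carry the previous two words instead of one).

```python
import math
from collections import defaultdict

def build_lattice(source, replacements, log_u):
    """edges[i] -> list of (end_vertex, target_words, channel_log_prob).
    `replacements` maps source-phrase tuples to [(target_words, log_prob)];
    identity edges with uniform probability u handle unseen words."""
    edges = defaultdict(list)
    for i in range(len(source)):
        edges[i].append((i + 1, (source[i],), log_u))   # identity edge
        for j in range(i + 1, len(source) + 1):
            for tgt, lp in replacements.get(tuple(source[i:j]), []):
                edges[i].append((j, tuple(tgt), lp))
    return edges

def decode(source, replacements, bigram_lp, log_u):
    """Monotone Viterbi search over the lattice, scoring each path by the
    product (sum of logs) of replacement model and language model."""
    edges = build_lattice(source, replacements, log_u)
    # state: (vertex, previous word) -> (best log score, words so far)
    best = {(0, "<s>"): (0.0, [])}
    for i in range(len(source)):
        frontier = [(st, v) for st, v in best.items() if st[0] == i]
        for (vertex, prev), (score, words) in frontier:
            for j, tgt, lp in edges[i]:
                s, p = score + lp, prev
                for w in tgt:
                    s += bigram_lp(p, w)   # language model contribution
                    p = w
                state = (j, p)
                if state not in best or s > best[state][0]:
                    best[state] = (s, words + list(tgt))
    goal = [(st, v) for st, v in best.items() if st[0] == len(source)]
    return max(goal, key=lambda kv: kv[1][0])[1][1]

# Toy usage with a hand-built replacement table and a stand-in LM:
reps = {("injured",): [(("wounded",), math.log(0.4))]}
lm = lambda prev, w: math.log(0.5)   # constant bigram log-probability
print(decode(["were", "injured"], reps, lm, math.log(0.3)))
```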
Finally, the resultant paraphrases were cleaned up in a post-processing phase to ensure that output was not trivially distinguishable from other systems during human evaluation. All generic named entity tokens were re-instantiated with their source values, and case was restored using a model like that of Lita et al. (2003).

3.5 Alternate approaches

Barzilay & Lee (2003) have released a common dataset that provides a basis for comparing different paraphrase generation systems. It consists of 59 sentences regarding acts of violence in the Middle East. These are accompanied by paraphrases generated by their Multi-Sequence Alignment (MSA) system and a baseline employing WordNet (Fellbaum 1998), along with human judgments for each output by 2-3 raters.

The MSA WordNet baseline was created by selecting a subset of the words in each test sentence, proportional to the number of words replaced by MSA in the same sentence, and replacing each with an arbitrary word from its most frequent WordNet synset.

Since our SMT approach depends quite heavily on a target language model, we presented an alternate WordNet baseline using a target language model. In combination with the language model described in section 3.4, we used a very simple replacement model: each appropriately inflected member of the most frequent synset was proposed as a possible replacement with uniform probability. This was intended to isolate the contribution of the language model from that of the replacement model. (In contrast, Barzilay and Lee (2003) avoided using a language model for essentially the same reason: their MSA approach did not take advantage of such a resource.)

Given that our alignments, while aggregated into phrases, are fundamentally word-aligned, one question that arises is whether the information we learn is different in character than that learned from much simpler techniques. To explore this hypothesis, we introduced an additional baseline that used statistical clustering to produce an automated, unsupervised synonym list, again with a trigram language model. We used standard bigram clustering techniques (Goodman 2002) to produce 4,096 clusters of our 65,225 vocabulary items.
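A sketch of the uniform-synset replacement model used in the WN+LM baseline above, assuming NLTK's WordNet interface (our choice of library; the paper does not name an implementation), with morphological re-inflection elided:

```python
# Requires: pip install nltk; then nltk.download("wordnet")
from nltk.corpus import wordnet as wn

def wn_replacements(word):
    """Propose each member of the word's most frequent synset with uniform
    probability; the language model alone must then arbitrate among them."""
    synsets = wn.synsets(word)
    if not synsets:
        return {word: 1.0}               # no synset: keep the word
    candidates = {l.name().replace("_", " ") for l in synsets[0].lemmas()}
    p = 1.0 / len(candidates)
    return {c: p for c in candidates}    # uniform replacement model

print(wn_replacements("wounded"))
```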
4 Evaluation

We have experimented with several methods for extracting a parallel sentence-aligned corpus from news clusters, using word alignment error rate, or AER (Och & Ney 2003), as an evaluation metric. A brief summary of these experiments is provided in Table 1.

To evaluate the quality of generation, we followed the lead of Barzilay & Lee (2003). We started with the 59 sentences and corresponding paraphrases from MSA and WordNet (designated as WN below). Since the size of this data set made it difficult to obtain statistically significant results, we also included 141 randomly selected sentences from held-out clusters. We then produced paraphrases with each of the following systems and compared them with MSA and WN:

* WN+LM: WordNet with a trigram LM
* CL: Statistical clusters with a trigram LM
* PR: The top 5 sentence rewrites produced by Phrasal Replacement

For the sake of consistency, we did not use the judgments provided by Barzilay and Lee; instead we had two raters judge whether the output from each system was a paraphrase of the input sentence. The raters were presented with an input sentence and an output paraphrase from each system in random order to prevent bias toward any particular judgment. Since, on our first pass, we found inter-rater agreement to be somewhat low (84%), we asked the raters to make a second pass of judgments on those items where they disagreed; this significantly improved agreement (96.9%). The results of this final evaluation are summarized in Table 2.
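For reference, the agreement figures above reduce to simple percent agreement over parallel yes/no judgments; a minimal sketch with illustrative data (ours, not the actual judgments):

```python
def percent_agreement(rater1, rater2):
    """Fraction of items on which two raters give the same yes/no judgment."""
    assert len(rater1) == len(rater2)
    return sum(a == b for a, b in zip(rater1, rater2)) / len(rater1)

# Illustrative only: 21 agreements out of 25 paired judgments -> 0.84.
r1 = [True] * 20 + [False] * 5
r2 = [True] * 18 + [False] * 2 + [True] * 2 + [False] * 3
print(percent_agreement(r1, r2))
```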
5 Analysis

Table 2 shows that PR can produce rewordings that are evaluated as plausible paraphrases more frequently than those generated by either of the baseline techniques or by MSA. The WordNet baseline performs quite poorly, even in combination with a trigram language model: the language model does not contribute significantly to resolving lexical selection. The performance of CL is likewise abysmal; again, a language model does nothing to help. The poor performance of these synonym-based techniques indicates that they have little value except as a baseline.

The PR model generates plausible paraphrases for the overwhelming majority of test sentences, indicating that even the relatively high AER for non-identical words is not an obstacle to successful generation. Moreover, PR was able to generate a paraphrase for all 200 sentences (including the 59 MSA examples). The correlation between acceptability and PR sentence rank validates both the ranking algorithm and the evaluation methodology.

In Table 2, the PR model scores significantly better than MSA in terms of the percentage of paraphrase candidates accepted by raters. Moreover, PR generates at least five (and often hundreds more) distinct paraphrases for each test sentence. Such perfect coverage on this dataset is perhaps fortuitous, but is nonetheless indicative of scalability. By contrast, Barzilay & Lee (2003) report being able to generate paraphrases for only 59 out of 484 sentences in their training (test?) set, a total of 12%.

One potential concern is that PR paraphrases usually involve simple substitutions of words and short phrases (a mean edit distance of 2.9 on the top-ranked sentences), whereas MSA outputs more complex paraphrases (reflected in a mean edit distance of 25.8). This is reflected in Table 3, which provides a breakdown of four dimensions of interest, as provided by one of our independent evaluators. Some 47% of MSA paraphrases involve significant reordering, such as an active-passive alternation, whereas the monotone PR decoder precludes anything other than minor transpositions within phrasal replacements.

Should these facts be interpreted to mean that MSA, with its more dramatic rewrites, is ultimately more ambitious than PR? We believe that the opposite is true. A close look at MSA suggests that it is similar in spirit to example-based machine translation techniques that rely on pairing entire sentences in source and target languages, with the translation step limited to local adjustments of the target sentence (e.g. Sumita 2001). When an input sentence closely matches a template, results can be stunning. However, MSA achieves its richness of substitution at the cost of generality. Inspection reveals that 15 of the 59 MSA paraphrases, or 25.4%, are based on a single high-frequency, domain-specific template (essentially a running tally of deaths in the Israeli-Palestinian conflict). Unless one is prepared to assume that similar templates can be found for most sentence types, scalability and domain extensibility appear beyond the reach of MSA.

In addition, since MSA templates pair entire sentences, the technique can produce semantically different output when there is a mismatch in information content among template training sentences. Consider the third and fourth rows of Table 3, which indicate the extent of embellishment and lossiness found in MSA paraphrases and the top-ranked PR paraphrases. Particularly noteworthy is the lossiness of MSA seen in row 4. Figure 3 illustrates a case where the MSA paraphrase yields a significant reduction in information, while PR is more conservative in its replacements.

While the substitutions obtained by the PR model remain for the present relatively modest, they are not trivial. Changing a single content word is a legitimate form of paraphrase, and the ability to paraphrase across an arbitrarily large sentence set and arbitrary domains is a desideratum of paraphrase research. We have demonstrated that the SMT-motivated PR method is capable of generating acceptable paraphrases for the overwhelming majority of sentences in a broad domain.

6 Future work

Much work obviously remains to be done. Our results remain constrained by data sparsity, despite the large initial training sets. One major agenda item therefore will be acquisition of larger (and more diverse) data sets. In addition to obtaining greater absolute quantities of data in the form of clustered articles, we also seek to extract aligned sentence pairs that instantiate a richer set of phenomena. Relying on edit distance to identify likely paraphrases has the unfortunate result of excluding interesting sentence pairs that are similar in meaning though different in form. For example:

The Cassini spacecraft, which is en route to Saturn, is about to make a close pass of the ringed planet's mysterious moon Phoebe

On its way to an extended mission at Saturn, the Cassini probe on Friday makes its closest rendezvous with Saturn's dark moon Phoebe.

We are currently experimenting with data extracted from the first two sentences in each article, which by journalistic convention tend to summarize content (Dolan et al. 2004).
While noisier than the edit-distance data, initial results suggest that these can be a rich source of information about larger phrasal substitutions and syntactic reordering.

Although we have not attempted to address the issue of paraphrase identification here, we are currently exploring machine learning techniques, based in part on features of document structure and other linguistic features, that should allow us to bootstrap initial alignments to develop more data. This, we hope, will eventually allow us to address such issues as paraphrase identification for IR.

To exploit richer data sets, we will also seek to address the monotone limitation of our decoder, which further limits the complexity of our paraphrase output. We will be experimenting with more sophisticated decoder models designed to handle reordering and mappings to discontinuous elements. We also plan to pursue better (automated) metrics for paraphrase evaluation.