<?xml version="1.0" standalone="yes"?> <Paper uid="P06-2096"> <Title>Adding Syntax to Dynamic Programming for Aligning Comparable Texts for the Generation of Paraphrases</Title> <Section position="5" start_page="748" end_page="750" type="metho"> <SectionTitle> 3 Alignment Algorithms </SectionTitle> <Paragraph position="0"> Our alignment algorithm modifies the Levenshtein Edit Distance by assigning different scores to lexically matched words according to their syntactic similarity. The decision of whether to align a pair of words is based on these syntax scores.</Paragraph> <Section position="1" start_page="748" end_page="749" type="sub_section"> <SectionTitle> 3.1 Modified Levenshtein Edit Distance </SectionTitle> <Paragraph position="0"> The Levenshtein Edit Distance (LED) is a measure of similarity between two strings. It is named after the Russian scientist Vladimir Levenshtein, who devised the algorithm in 1965. It is the minimum number of substitutions, deletions or insertions (hence &quot;edits&quot;) needed to transform one string into the other. We extend LED to the sentence level by counting the substitutions, deletions and insertions of words necessary to transform one sentence into the other. We abbreviate this sentence-level edit distance as MLED. As with LED, computing MLED produces an (M+1) by (N+1) distance matrix, D, given two input sentences of lengths M and N respectively. This matrix is constructed through dynamic programming, as shown in Figure 3.</Paragraph> <Paragraph position="2"> [Figure 3: MLED of two sentences of length M and N.]</Paragraph> <Paragraph position="3"> Here, &quot;match&quot; is 2 if the ith word in Sentence 1 and the jth word in Sentence 2 syntactically match, and -1 otherwise. &quot;gap&quot; is the score for inserting a gap rather than aligning, and is set to -1. The matching conditions for two words are considerably more complicated than lexical equality.
Rather, we judge whether two lexically equal words match based on a predefined set of syntactic features.</Paragraph> <Paragraph position="4"> The output matrix is used to guide the alignment. Starting from the bottom-right entry of the matrix, we move to the entry from which the value of the current cell was derived in the dynamic programming recursion. Call the current entry D[i][j]. If it gets its value from D[i-1][j-1], the ith word in Sentence 1 and the jth word in Sentence 2 are either aligned to each other or each aligned to a gap, depending on whether they syntactically match; if the value of D[i][j] is derived from D[i-1][j] + &quot;gap&quot;, the ith word in Sentence 1 is aligned to a gap inserted into Sentence 2 (the jth word in Sentence 2 is not consumed); otherwise, the jth word in Sentence 2 is aligned to a gap inserted into Sentence 1.</Paragraph> <Paragraph position="5"> Given a way to align two sentences, we align a cluster of sentences progressively. We start with the overall most similar pair and then follow the initial ordering of the cluster, aligning the remaining sentences sequentially. Each sentence is aligned against its best match in the pool of already-aligned ones.
This approach is a hybrid of the Feng-Doolittle algorithm (Feng and Doolittle, 1987) and a variant described by Fitch and Margoliash (1967).</Paragraph> </Section> <Section position="2" start_page="749" end_page="750" type="sub_section"> <SectionTitle> 3.2 Syntax-based Alignment </SectionTitle> <Paragraph position="0"> As remarked earlier, our alignment scheme judges whether two words match according to their syntactic similarity on top of lexical equality.</Paragraph> <Paragraph position="1"> The syntactic features are obtained by running Chunklink (Buchholz, 2000) on the Charniak parses of the clustered sentences.</Paragraph> <Paragraph position="2"> Of the information Chunklink provides, we use the part-of-speech (POS) tags, the Chunk tags, and the syntactic dependence traces. The Chunk tag shows the constituent a word belongs to and the word's relative position in that constituent. It takes one of three forms: &quot;O&quot;, meaning that the word is outside of any chunk; &quot;I-XP&quot;, meaning that the word is inside an XP chunk, where X = N, V, P, ADV, ...; and &quot;B-XP&quot;, meaning that the word is at the beginning of an XP chunk.</Paragraph> <Paragraph position="3"> From now on, we refer to the Chunk tag of a word as its IOB value (IOB was named by Tjong Kim Sang and Jorn Veenstra (Tjong Kim Sang and Veenstra, 1999) after Ratnaparkhi (Ratnaparkhi, 1998)). For example, in the sentence &quot;I visited Milan Theater&quot;, the IOB value for &quot;I&quot; is B-NP since it marks the beginning of a noun phrase (NP). On the other hand, &quot;Theater&quot; has an IOB value of I-NP because it is inside a noun phrase (Milan Theater) and is not at the beginning of that constituent. Finally, the syntactic dependence trace of a word is the path of IOB values from the root of the tree to the word itself.
The last element in the trace is hence the IOB value of the word itself.</Paragraph> <Paragraph position="4"> Lexically matched words with different POS tags are considered syntactically unmatched (e.g., race VB vs. race NN). Hence, our focus is on pairs of lexically matched words with the same POS. We first compare their IOB values. Two IOB values are exactly matched only if they are identical (same constituent and same position); they are partially matched if they share a common constituent but have different positions (e.g., B-PP vs. I-PP); and they are unmatched otherwise. For a pair of words with exactly matched IOB values, we assign an IOB-score of 1; for those with partially matched IOB values, 0; and for those with unmatched IOB values, -1. These numeric values were chosen empirically.</Paragraph> <Paragraph position="5"> The next step is to compare the syntactic dependence traces of the two words. We start with the second-to-last element in the traces and go backward, because the last element is already taken care of by the previous step. We also discard the front element of both traces since it is &quot;I-S&quot; for all words. The corresponding elements in the two traces are compared using the IOB comparison described above, and the scores are accumulated. The process terminates as soon as one of the two traces is exhausted. Finally, we reduce the cumulative score by the length difference between the two traces. The resulting score is called the trace-score of the two words.</Paragraph> <Paragraph position="6"> We declare the two words &quot;unmatched&quot; if the sum of the IOB-score and the trace-score falls below 0. Otherwise, we perform one last check: the relative position of the two words in their respective sentences. The relative position is defined as the word's absolute position divided by the length of the sentence it appears in (e.g.,
the 4th word of a 20-word sentence has a relative position of 0.2).</Paragraph> <Paragraph position="7"> If the difference between the two relative positions is larger than 0.4 (a threshold chosen empirically before running the experiments), we consider the two words &quot;unmatched&quot;. Otherwise, they are syntactically matched.</Paragraph> <Paragraph position="8"> The pseudo-code for checking a syntactic match is shown in Figure 4.</Paragraph> <Paragraph position="10"/> </Section> </Section> </Paper>
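<!--
The matching test of Section 3.2 and the matrix construction and traceback of Section 3.1 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation, and it does not reproduce Figures 3 or 4. Assumptions: the matrix is filled by maximization, with boundary cells set to multiples of the gap score; rows index Sentence 1 and columns index Sentence 2, so a step from D[i-1][j] consumes only the ith word of Sentence 1; lexical equality is exact string equality; the length adjustment subtracts the absolute difference in trace length. The names Word, sentence, iob_score, trace_score, syntactic_match, mled_matrix and align are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

MATCH, MISMATCH, GAP = 2, -1, -1  # scores stated in Section 3.1
MAX_POSITION_DIFF = 0.4           # threshold stated in Section 3.2

@dataclass
class Word:
    """Hypothetical record of the Chunklink features used for one word."""
    text: str
    pos: str
    trace: List[str]  # IOB values from the root down; trace[-1] is the word's own
    index: int        # 1-based position in the sentence
    sent_len: int

def sentence(tokens):
    """Build Word records from (text, pos, trace) triples."""
    return [Word(t, p, tr, k + 1, len(tokens)) for k, (t, p, tr) in enumerate(tokens)]

def iob_score(a: str, b: str) -> int:
    """1 if identical, 0 if same constituent but different position, else -1."""
    if a == b:
        return 1
    if "-" in a and "-" in b and a.split("-", 1)[1] == b.split("-", 1)[1]:
        return 0
    return -1

def trace_score(ta: List[str], tb: List[str]) -> int:
    # Drop the front element and the word's own IOB value, then compare from
    # the second-to-last element backward; zip stops when either trace runs out.
    ra, rb = ta[1:-1][::-1], tb[1:-1][::-1]
    score = sum(iob_score(x, y) for x, y in zip(ra, rb))
    return score - abs(len(ta) - len(tb))  # assumed form of the length penalty

def syntactic_match(a: Word, b: Word) -> bool:
    if a.text != b.text or a.pos != b.pos:  # lexical equality and same POS
        return False
    if iob_score(a.trace[-1], b.trace[-1]) + trace_score(a.trace, b.trace) < 0:
        return False
    diff = abs(a.index / a.sent_len - b.index / b.sent_len)
    return diff <= MAX_POSITION_DIFF

def mled_matrix(s1: List[Word], s2: List[Word]) -> List[List[int]]:
    m, n = len(s1), len(s2)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i * GAP  # assumed boundary: leading words face gaps
    for j in range(1, n + 1):
        d[0][j] = j * GAP
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = MATCH if syntactic_match(s1[i - 1], s2[j - 1]) else MISMATCH
            d[i][j] = max(d[i - 1][j - 1] + sub, d[i - 1][j] + GAP, d[i][j - 1] + GAP)
    return d

def align(s1: List[Word], s2: List[Word]) -> List[Tuple[Optional[str], Optional[str]]]:
    """Trace back from the bottom-right cell; None stands for a gap."""
    d, pairs = mled_matrix(s1, s2), []
    i, j = len(s1), len(s2)
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            matched = syntactic_match(s1[i - 1], s2[j - 1])
            if d[i][j] == d[i - 1][j - 1] + (MATCH if matched else MISMATCH):
                if matched:
                    pairs.append((s1[i - 1].text, s2[j - 1].text))
                else:  # diagonal step without a match: each word faces a gap
                    pairs += [(None, s2[j - 1].text), (s1[i - 1].text, None)]
                i, j = i - 1, j - 1
                continue
        if i > 0 and d[i][j] == d[i - 1][j] + GAP:
            pairs.append((s1[i - 1].text, None))  # gap inserted into Sentence 2
            i -= 1
        else:
            pairs.append((None, s2[j - 1].text))  # gap inserted into Sentence 1
            j -= 1
    return pairs[::-1]
```

As a usage example, build two sentences with sentence(), giving each word a (text, POS, trace) triple with hand-written, hypothetical traces such as ["I-S", "I-VP", "I-NP"] for "Theater" in "I visited Milan Theater", and pass them to align(). Against "I visited Theater", the sketch pairs "I", "visited" and "Theater" and leaves "Milan" facing a gap. Ties between equally scored paths are broken in favor of the diagonal step, which is a design choice of this sketch rather than something the text specifies.
-->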