<?xml version="1.0" standalone="yes"?> <Paper uid="E06-1021"> <Title>Towards Robust Context-Sensitive Sentence Alignment for Monolingual Corpora</Title> <Section position="4" start_page="162" end_page="162" type="metho"> <SectionTitle> 3 Data </SectionTitle> <Paragraph position="0"> The Britannica corpus, collected and annotated by Barzilay and Elhadad (2003), consists of 103 pairs of comprehensive and elementary encyclopedia entries describing major world cities. Twenty of these document pairs were annotated by human judges, who were asked to mark sentence pairs that contain at least one clause expressing the same information; these pairs were further split into a training and a testing set.</Paragraph>
<Paragraph position="1"> As a rough indication of the diversity of the dataset and the difference of the task from bilingual alignment, we define the alignment diversity measure (ADM) for two texts, T1 and T2, as</Paragraph>
<Paragraph position="3"> ADM(T1, T2) = 2 * matches / (|T1| + |T2|), where matches is the number of matching sentence pairs. Intuitively, for closely aligned document pairs, as prevalent in bilingual alignment, one would expect an ADM value close to 1. The average ADM value for the training document pairs of the Britannica corpus is 0.26.</Paragraph>
<Paragraph position="4"> For the gospels, we use the King James version, available electronically from the Sacred Text Archive. The gospels' lengths span from 678 verses (Mark) to 1151 verses (Luke), where we treat verses as sentences. For training and evaluation purposes, we use the list of parallels given by Aland (1985), also available at bible-researcher.com/parallels.html.</Paragraph>
<Paragraph position="5"> We use the pair Matthew-Mark for training and the two pairs Matthew-Luke and Mark-Luke for testing. Whereas for the Britannica corpus parallels were marked at the resolution of sentences, Aland's annotation presents parallels as matched sequences of verses, known as pericopes. For instance, Matthew:4.1-11 matches Mark:1.12-13. 
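As an illustration of the measure, the ADM computation can be sketched in a few lines of Python (the function and argument names are ours; the normalization is read directly off the definition above):

```python
def adm(matches: int, len_t1: int, len_t2: int) -> float:
    """Alignment diversity measure: the fraction of sentences in the
    two texts that participate in matched sentence pairs."""
    return 2 * matches / (len_t1 + len_t2)

# A perfectly 1-to-1 aligned pair of 10-sentence texts gives ADM = 1.0:
print(adm(10, 10, 10))  # 1.0
```

A loosely aligned pair such as the Britannica training documents, by contrast, comes out around 0.26.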
We write v ∈ p to indicate that verse v belongs to pericope p.</Paragraph> </Section> <Section position="5" start_page="162" end_page="165" type="metho"> <SectionTitle> 4 Algorithm </SectionTitle> <Paragraph position="0"> We now describe the algorithm, starting with the TF*IDF similarity score, followed by our use of logistic regression, and the global alignment.</Paragraph> <Section position="1" start_page="162" end_page="163" type="sub_section"> <SectionTitle> 4.1 From word overlap to TF*IDF </SectionTitle> <Paragraph position="0"> Barzilay and Elhadad (2003) use a cosine measure of word-overlap as a baseline for the task.</Paragraph>
<Paragraph position="1"> As can be expected, word overlap is a relatively effective indicator of sentence similarity and relatedness (Marcu, 1999). Unfortunately, plain word-overlap assigns all words equal importance, not even distinguishing between function and content words. Thus, once the overlap threshold is decreased to improve recall, precision degrades rapidly. For instance, if a pair of sentences has one or two words in common, this is inconclusive evidence of their similarity or difference.</Paragraph>
<Paragraph position="2"> One way to address this problem is to differentially weight words using the TF*IDF scoring scheme, which has become standard in Information Retrieval (Salton and Buckley, 1988). IDF was also used for the similar task of directional entailment by Monz and de Rijke (2001). To apply this scheme to the task at hand we diverge from the standard IDF definition by viewing each sentence as a document, and the pair of documents as a combined collection of N single-sentence documents. For a term t in sentence s, we define TF_s(t) as a binary indicator of whether t occurs in s, and IDF(t) = log(N / N_t), where N_t is the number of sentences containing t; using a binary TF rather than the raw number of occurrences yielded better accuracy on the Britannica training set. 
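The sentence-as-document weighting just described can be sketched as follows (an illustrative sketch, not the authors' code: binary TF and a log-ratio IDF are assumed, and the helper names are ours):

```python
import math
from collections import Counter


def tfidf_vectors(sentences):
    """Treat each sentence (a list of tokens) as a 'document': N is the
    total number of sentences in the combined pair of texts, and
    IDF(t) = log(N / df(t)).  TF is binary, per the note above."""
    n = len(sentences)
    df = Counter()
    for sent in sentences:
        df.update(set(sent))  # document frequency over sentences
    return [{t: math.log(n / df[t]) for t in set(sent)} for sent in sentences]


def cosine(u, v):
    """Standard cosine similarity over sparse term-weight dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0
```

Note that a term occurring in every sentence gets weight log(1) = 0, so ubiquitous function words drop out of the comparison automatically.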
This is probably due to the &quot;documents&quot; being only of sentence length.</Paragraph>
<Paragraph position="3"> We use these scores as the basis of a standard cosine similarity measure,</Paragraph>
<Paragraph position="4"> sim(s1, s2) = sum_t w_s1(t) w_s2(t) / (||w_s1|| ||w_s2||), where w_s(t) = TF_s(t) * IDF(t).</Paragraph>
<Paragraph position="5"> We normalize terms by using Porter stemming (Porter, 1980). For the Britannica corpus, we also normalized British/American spelling differences using a small manually-constructed lexicon.</Paragraph> </Section> <Section position="2" start_page="163" end_page="164" type="sub_section"> <SectionTitle> 4.2 Logistic regression </SectionTitle> <Paragraph position="0"> TF*IDF scores provide a numeric measure of sentence similarity. To use them for choosing sentence pairs, we proceeded to learn the probability of two sentences being matched given their TF*IDF similarity score, pr(match = 1 | sim). We expect this probability to follow a sigmoid-shaped curve. While it is always monotonically increasing, the rate of ascent changes; for very low or very high values it is not as steep as for middle values. This reflects the intuition that while we always prefer a higher scoring pair over a lower scoring pair, this preference is more pronounced in the middle range than in the extremities.</Paragraph>
<Paragraph position="1"> Indeed, Figure 1 shows a graph of this distribution on the training part of the Britannica corpus, where point</Paragraph>
<Paragraph position="3"> (x, y) represents the fraction y of correctly matched sentences of similarity x. Overlayed on top of the points is a logistic regression model of this distribution, defined as the function</Paragraph>
<Paragraph position="4"> f(sim) = 1 / (1 + exp(-(a + b * sim))),</Paragraph>
<Paragraph position="5"> where a and b are parameters. We used Weka (Witten and Frank, 1999) to automatically learn the parameters of the distribution on the training data. These are set to a = -7.89 and b = 27.56 for the Britannica corpus.</Paragraph>
<Paragraph position="6"> [Figure 2 caption: arrows indicate the best hit for each verse; the pairs considered correct are the reciprocal best hits.]</Paragraph>
<Paragraph position="8"> Logistic regression scales the similarity scores monotonically but non-linearly. In particular, it changes the density of points at different score levels. In addition, we can use this distribution to choose a threshold, th, for when a similarity score is indicative of a match. Optimizing the F-measure on the training data using Weka, we choose a threshold value of th = 0.276. Note that since the logistic regression transformation is monotonic, the existence of a threshold on probabilities implies the existence of a threshold on the original sim scores. Moreover, such a threshold might be obtained by means other than logistic regression. The scaling, however, will become crucial once we do additional calculations with these probabilities in Section 4.4.</Paragraph>
<Paragraph position="9"> Applying logistic regression to the gospels is complicated by the fact that we only have a correct alignment at the resolution of pericopes, and not individual verses. Verse pairs that do not belong to a matched pericope pair can be safely considered unaligned, but for a matched pericope pair,</Paragraph>
<Paragraph position="11"> we do not know which verse is matched with which. We solve this by searching for the reciprocal best hit, a method often used to find orthologous genes in related species (Mushegian and Koonin, 1996). For each verse in each pericope, we find the top matching verse in the other pericope. We take as correct all and only pairs of verses (x, y) such that x is y's best match and y is x's best match. An example is shown in Figure 2. Taking these pairs as matched yields an ADM value of 0.34 for the training pair of documents. We used the reciprocally best-matched pairs of the training portion of the gospels to find logistic regression parameters for this corpus,</Paragraph>
<Paragraph position="13"> obtaining a threshold of th = 0.250. 
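The reciprocal-best-hit criterion can be sketched in a few lines of Python (an illustrative sketch: `sim` stands for any verse-similarity function, e.g. the TF*IDF cosine of Section 4.1, and the function name is ours):

```python
def reciprocal_best_hits(verses1, verses2, sim):
    """Given two matched pericopes (lists of verse ids) and a similarity
    function, keep exactly those pairs (x, y) where y is x's best match
    and x is y's best match."""
    best1 = {x: max(verses2, key=lambda y: sim(x, y)) for x in verses1}
    best2 = {y: max(verses1, key=lambda x: sim(x, y)) for y in verses2}
    return [(x, y) for x, y in best1.items() if best2[y] == x]
```

A verse whose best match does not point back at it simply produces no pair, which is what makes the criterion conservative enough to use as training data.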
Note that we rely on this matching only for training, but not for evaluation (see Section 5.2).</Paragraph> </Section> <Section position="3" start_page="164" end_page="164" type="sub_section"> <SectionTitle> 4.3 Method 1: TF*IDF </SectionTitle> <Paragraph position="0"> As a simple method for choosing sentence pairs, we just select all sentence pairs with pr(match | sim)</Paragraph>
<Paragraph position="2"> above th. We use the following additional heuristics: • We unconditionally match the first sentence of one document with the first sentence of the other document. As noted by Quirk et al.</Paragraph>
<Paragraph position="3"> (2004), these are very likely to be matched, as verified on our training set as well.</Paragraph>
<Paragraph position="4"> • We allow many-to-one matching of sentences, but limit it to at most 2-to-1 in both directions (by allowing only the top two matches per sentence to be chosen), since such multiple matchings often arise from splitting a sentence into two, or conversely, merging two sentences into one.</Paragraph> </Section> <Section position="4" start_page="164" end_page="165" type="sub_section"> <SectionTitle> 4.4 Method 2: TF*IDF + Global alignment </SectionTitle> <Paragraph position="0"> Matching sentence pairs according to TF*IDF ignores sentence ordering completely. For bilingual texts, Gale and Church (1991) demonstrated the extraordinary effectiveness of a global alignment dynamic programming algorithm, where the basic similarity score was based on the difference in sentence lengths, measured in characters. Such methods fail to work in the monolingual case. 
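The selection procedure of Section 4.3 can be sketched as follows (an illustration, not the authors' code: `prob` is a hypothetical dict of pr(match | sim) values, and the top-2 tie-breaking details are ours):

```python
def select_pairs(prob, n1, n2, th=0.276):
    """Method-1 sketch: keep pairs whose match probability exceeds th,
    allow each sentence at most its two best matches (the 2-to-1 cap),
    and unconditionally match the two first sentences."""
    def top2(candidates):
        # the two highest-probability candidates for one sentence
        return set(sorted(candidates, key=candidates.get, reverse=True)[:2])

    selected = set()
    for i in range(n1):
        row = {j: prob.get((i, j), 0.0) for j in range(n2) if prob.get((i, j), 0.0) > th}
        for j in top2(row):
            col = {k: prob.get((k, j), 0.0) for k in range(n1) if prob.get((k, j), 0.0) > th}
            if i in top2(col):  # enforce the cap in both directions
                selected.add((i, j))
    selected.add((0, 0))  # first sentences are unconditionally matched
    return selected
```

A pair can thus be dropped even when its probability clears th, if one of its sentences already has two stronger matches.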
Gale and Church's algorithm (using the implementation of Danielsson and Ridings (1997)) yields 2% precision at 2.85% recall on the Britannica corpus.</Paragraph>
<Paragraph position="1"> Moore's algorithm (2002), which augments sentence-length alignment with IBM Model 1 alignment, returns zero matching sentence pairs (regardless of threshold).</Paragraph>
<Paragraph position="2"> Nevertheless, we expect sentence ordering to provide important clues for monolingual alignment, bearing in mind two main differences from the bilingual case. First, as can be expected from the ADM value, there are many gaps in the alignment.</Paragraph>
<Paragraph position="3"> Second, there can be large segments that diverge from the linear order predicted by a global alignment, as illustrated by the oval in Figure 3 (Figure 2 of Barzilay and Elhadad (2003)).</Paragraph>
<Paragraph position="4"> To model these features of the data, we use a variant of Needleman-Wunsch alignment (1970).</Paragraph>
<Paragraph position="5"> We compute the optimal alignment between sentences 1..i of the comprehensive text and sentences 1..j of the elementary text using the recurrence</Paragraph>
<Paragraph position="6"> s(i, j) = max(s(i-1, j), s(i, j-1), s(i-1, j-1)) + pr(match | sim(i, j)).</Paragraph>
<Paragraph position="7"> Note that the dynamic programming sums match probabilities, rather than the original sim scores, making crucial use of the calibration induced by the logistic regression. Starting from the first pair of sentences, we find the best path through the matrix indexed by i and j, using dynamic programming. Unlike the standard algorithm, we assign no penalty to off-diagonal matches, allowing many-to-one matches as illustrated schematically in Figure 4. This is because for the loose alignment exhibited by the data, being off-diagonal is not indicative of a bad match. Instead, we prune the complete path generated by the dynamic programming using two methods. First, as in Section 4.3, we limit many-to-one matches to 2-to-1, by allowing just the two best matches per sentence to be included. 
Second, we eliminate sentence pairs with very low match probabilities</Paragraph>
<Paragraph position="9"> (below 0.005), a value learned on the training data. Finally, to deal with the divergences from the linear order, we add the top n pairs with very high match probability, above a higher threshold, th'.</Paragraph>
<Paragraph position="10"> Optimizing on the training data, we set n = 5 and th' = 0.65 for both corpora.</Paragraph>
<Paragraph position="11"> Note that although Barzilay and Elhadad also used an alignment algorithm, they restricted it to sentences judged to belong to topically related paragraphs. As noted above, this restriction relies on a special feature of the corpus, the fact that encyclopedia entries follow a relatively regular structure of paragraphs. By not relying on such corpus-specific features, our approach gains in robustness.</Paragraph> </Section> </Section> </Paper>
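The global alignment of Section 4.4 can be sketched as follows (a minimal illustration: the recurrence sums match probabilities with no off-diagonal penalty; the `prob` dict and the traceback's diagonal-first tie-breaking are our assumptions, and the subsequent pruning and re-adding steps are omitted):

```python
def align(prob, n1, n2):
    """Needleman-Wunsch-style variant: sum pr(match | sim) along a
    monotone path through the (n1 + 1) x (n2 + 1) score matrix, then
    trace back the complete path of 0-based sentence pairs."""
    score = [[0.0] * (n2 + 1) for _ in range(n1 + 1)]
    for i in range(1, n1 + 1):
        for j in range(1, n2 + 1):
            best = max(score[i - 1][j], score[i][j - 1], score[i - 1][j - 1])
            score[i][j] = best + prob.get((i - 1, j - 1), 0.0)
    # Trace back, preferring the diagonal move on ties; pruning
    # (2-to-1 cap, low-probability elimination) would follow this step.
    path, i, j = [], n1, n2
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        moves = {(i - 1, j - 1): score[i - 1][j - 1],
                 (i - 1, j): score[i - 1][j],
                 (i, j - 1): score[i][j - 1]}
        i, j = max(moves, key=moves.get)
    return list(reversed(path))
```

Because the cell scores are calibrated probabilities rather than raw cosine values, summing them along the path is meaningful, which is the point made above about the logistic scaling.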