<?xml version="1.0" standalone="yes"?> <Paper uid="W03-1004"> <Title>Sentence Alignment for Monolingual Comparable Corpora</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Algorithm </SectionTitle> <Paragraph position="0"> Given a comparable corpus consisting of two collections and a training set of manually aligned text pairs from the corpus, the algorithm follows four main steps. Steps 1 and 2 take place at training time. Steps 3 and 4 are carried out when a new text pair (Text1, Text2) is to be aligned.</Paragraph> <Paragraph position="1"> 1. Topical structure induction: by analyzing multiple instances of paragraphs within the texts of each collection, the topics characteristic of the collections are identified through clustering. Each paragraph in the training set gets assigned the topic it verbalizes (Section 3.1.1.) 2. Learning of structural mapping rules: using the training set, rules for mapping paragraphs are learned in a supervised fashion (Section 3.1.2).</Paragraph> <Paragraph position="2"> 3. Macro alignment: given a new unseen pair (Text1, Text2), each paragraph is automatically assigned its topic. Paragraphs are mapped following the learned rules (Section 3.2).</Paragraph> <Paragraph position="3"> 4. Micro alignment: for each mapped paragraph pair, a local alignment is computed at the sentence level. The final alignment for the text pair is the union of all the aligned sentence pairs (Section 3.3).</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Off-Line Processing </SectionTitle> <Paragraph position="0"> Given two sentences with moderate lexical similarity, we may not have enough evidence to decide accurately whether they should be aligned. Looking at the broader context they appear in can provide additional insight: if the types of information expressed in the contexts are similar, then the specific information expressed in the sentences is more likely to be the same. On the other hand, if the types of information in the two contexts are unrelated, chances are that the sentences should not be aligned. In our implementation, context is represented by the paragraphs to which the sentences belong.2 Our goal in this phase is to learn rules for determining whether two paragraphs are likely to contain sentences that should be aligned, or whether, on the contrary, two paragraphs are unrelated and, therefore, should not be considered for further processing.</Paragraph> <Paragraph position="1"> A potentially fruitful way to do so is to take advantage of the topical structure of texts. In a given domain and genre, while the texts relate different subjects, they all use a limited set of topics to convey information; these topics are also known as the Domain Communication Knowledge (Kittredge et al., 1991). For instance, most texts describing diseases will have topics such as &quot;symptoms&quot; or &quot;treatment.&quot;3 If the task is to align a disease description written for physicians and a text describing the same disease for lay people, it is most likely that sentences within the topic &quot;symptoms&quot; in the expert version will map to sentences describing the symptoms in the lay version rather than those describing treatment options. 
If we can automatically identify the topic each paragraph conveys, we can decide more accurately whether two paragraphs are related and should be mapped for further processing.</Paragraph> <Paragraph position="2"> 3This notion of topic differs from the one used in the topic detection task - there, a &quot;topic&quot; would designate which disease is described.</Paragraph> <Paragraph position="3"> In the field of text generation, methods for representing the semantic structure of texts have been investigated through text schemata (McKeown, 1985) or rhetorical structures (Mann and Thompson, 1987). In our framework, we want to identify the different topics of the text, but we are not concerned with the relations holding between them or the order in which they typically appear. We propose to identify the topics typical of each collection in the comparable corpus by using clustering, such that each cluster represents a topic in the collection.</Paragraph> <Paragraph position="4"> The process of learning paragraph mapping rules is accomplished in two stages: first, we identify the topics of each collection, Corpus1 and Corpus2, and label each paragraph with its specific topic. Second, using a training set of manually aligned text pairs, we learn rules for mapping paragraphs from Corpus1 to Corpus2. Two paragraphs are considered mapped if they are likely to contain sentences that should be aligned.</Paragraph> <Paragraph position="5"> We perform a clustering at the paragraph level for each collection. We call this stage Vertical Clustering because all the paragraphs of all the documents in Corpus1 get clustered, independently of Corpus2; the same goes for the paragraphs in Corpus2. At this stage, we are only interested in identifying the topics of the texts in each collection, each cluster representing a topic.</Paragraph> <Paragraph position="6"> We apply hierarchical complete-link clustering.</Paragraph> <Paragraph position="7"> Similarity is a simple cosine measure based on the word overlap of the paragraphs, ignoring function words. Since we want to group together paragraphs that convey the same type of information across the documents in the same collection, we replace all the text-specific attributes, such as proper names, dates and numbers, by generic tags.4 This way, we ensure that two paragraphs are clustered not because they relate the same specific information, but rather, because they convey the same type of information (an example of two automatically clustered paragraphs is shown in Figure 3). The number of clusters for each collection is a parameter tuned on our training set (see Section 4).</Paragraph> <Paragraph position="8"> 4We crudely consider any word with a capital letter to be a proper name, except for the first word of each sentence.</Paragraph> <Paragraph position="9"> Figure 3: An example of two automatically clustered paragraphs. (1) Lisbon has a mild and equable climate, with a mean annual temperature of 63 °F (17 °C). The proximity of the Atlantic and the frequency of sea fogs keep the atmosphere humid, and summers can be somewhat oppressive, although the city has been esteemed as a winter health resort since the 18th century. Average annual rainfall is 26.6 inches (666 millimetres). (2) Jakarta is a tropical, humid city, with annual temperatures ranging between the extremes of 75 and 93 °F (24 and 34 °C) and a relative humidity between 75 and 85 percent. The average mean temperatures are 79 °F (26 °C) in January and 82 °F (28 °C) in October. The annual rainfall is more than 67 inches (1,700 mm). Temperatures are often modified by sea winds. Jakarta, like any other large city, also has its share of air and noise pollution.</Paragraph>
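To make the Vertical Clustering step above concrete, the following is a minimal Python sketch, not the authors' implementation: the stop-word list, the tagging regexes, and the generic tag names are illustrative assumptions. It replaces text-specific attributes with generic tags, computes word-overlap cosine similarity, and performs agglomerative complete-link clustering over the paragraphs of one collection.

```python
import math
import re
from collections import Counter
from itertools import combinations

# Illustrative stop-word list (the paper only says function words are ignored).
STOP_WORDS = {"the", "a", "an", "of", "and", "or", "is", "are", "in", "with", "to", "its"}

def normalize(paragraph):
    """Replace text-specific attributes (proper names, dates, numbers) by generic
    tags and return a bag-of-words vector with function words removed."""
    text = re.sub(r"\b\d{3,4}\b", " DATE ", paragraph)   # crude year/date tag
    text = re.sub(r"\b\d+(\.\d+)?\b", " NUM ", text)     # remaining numbers
    tokens = []
    for i, tok in enumerate(text.split()):
        word = tok.strip(".,;:()\"'")
        if not word:
            continue
        # The paper's heuristic: a capitalized word is a proper name unless it is
        # sentence-initial; here, as a simplification, only the paragraph-initial
        # word is exempt.
        if word[0].isupper() and word not in ("DATE", "NUM") and i > 0:
            tokens.append("NAME")
        elif word.lower() not in STOP_WORDS:
            tokens.append(word.lower())
    return Counter(tokens)

def cosine(v1, v2):
    dot = sum(c * v2[w] for w, c in v1.items())
    n1 = math.sqrt(sum(c * c for c in v1.values()))
    n2 = math.sqrt(sum(c * c for c in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def complete_link_clusters(paragraphs, num_clusters):
    """Agglomerative complete-link clustering: repeatedly merge the two clusters
    whose *least* similar paragraph pair is most similar, until num_clusters remain."""
    vectors = [normalize(p) for p in paragraphs]
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > num_clusters:
        best_pair, best_sim = None, -1.0
        for a, b in combinations(range(len(clusters)), 2):
            sim = min(cosine(vectors[i], vectors[j])
                      for i in clusters[a] for j in clusters[b])
            if sim > best_sim:
                best_pair, best_sim = (a, b), sim
        a, b = best_pair
        clusters[a].extend(clusters.pop(b))
    return clusters  # each cluster (a list of paragraph indices) is one "topic"
```

In the paper, the number of clusters per collection is not fixed in advance but tuned on the training set (Section 4 reports 20 clusters for both collections).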
Figure 4: An arrow between two paragraphs indicates that they contain at least one aligned sentence pair. <Paragraph position="10"> Once the different topics, or clusters, are identified inside each collection, we can use this information to learn rules for paragraph mapping (Horizontal Mapping between texts from Corpus1 and texts from Corpus2). Using a training set of text pairs, manually aligned at the sentence level, we consider two paragraphs to map each other if they contain at least one aligned sentence pair (see Figure 4).</Paragraph> <Paragraph position="11"> Our problem can be framed as a classification task: given training instances of paragraph pairs (P, Q) from a text pair, classify them as mapping or not.</Paragraph> <Paragraph position="12"> The features for the classification are the lexical similarity of P and Q, the cluster number of P, and the cluster number of Q. Here, similarity is again a simple cosine measure based on the word overlap of the two paragraphs.5 These features are weak indicators by themselves. Consequently, we use the publicly available classification tool BoosTexter (Singer and Schapire, 1998) to combine them accurately.6 (Footnote 6: we also add a feature which encodes the combination of the two cluster numbers.) An illustrative sketch of this feature set appears below.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 Macro Alignment: Find Candidate Paragraph(s) </SectionTitle> <Paragraph position="0"> At this stage, the clustering and training are completed. Given a new unseen text pair (Text1, Text2), the goal is to find a sentence alignment between them. Two sentences with very high lexical similarity are likely to be aligned. We allow such pairs in the alignment independently of their context.</Paragraph> <Paragraph position="1"> This step allows us to catch the &quot;easy&quot; paraphrases. We focus next on how our algorithm identifies the less obvious matching sentence pairs.</Paragraph> <Paragraph position="2"> For each paragraph in each text, we identify the cluster in its collection that it is closest to. Similarity between the paragraph and each cluster is computed in the same way as in the Vertical Clustering step. We then apply the learned mapping classifier to find the mapping paragraphs in the text pair (see Figure 5).</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.3 Micro Alignment: Find Sentence Pair(s) </SectionTitle> <Paragraph position="0"> Once the paragraph pairs are identified in (Text1, Text2), we want to find, for each paragraph pair, the (possibly empty) subset of sentence pairs that constitutes a good alignment. Context is used in the following way: given two sentences with moderate similarity, their proximity to sentence pairs with high similarity can help us decide whether to align them or not.</Paragraph> <Paragraph position="1"> To combine the lexical similarity (again using a cosine measure) and the proximity feature, we compute local alignments on each paragraph pair, using dynamic programming. The local alignment we construct fits the characteristics of the data we are considering. In particular, we adapt it to our framework to allow many-to-many alignments and some flips of order among aligned sentences.</Paragraph>
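The paragraph-mapping classifier of Section 3.1.2 uses BoosTexter; the sketch below only illustrates the kind of feature vectors it would be trained on (lexical cosine similarity plus the two cluster numbers and their combination), with scikit-learn's AdaBoostClassifier standing in for BoosTexter. The training data and the 1000-based encoding of the combined-cluster feature are illustrative assumptions, not the paper's settings.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

def mapping_features(lexical_sim, cluster_p, cluster_q):
    """Features from Section 3.1.2: cosine similarity of paragraphs P and Q,
    the cluster (topic) number of P, the cluster number of Q, and a crude
    encoding of the combination of the two cluster numbers (footnote 6)."""
    return [lexical_sim, cluster_p, cluster_q, cluster_p * 1000 + cluster_q]

# Illustrative training instances: (similarity, cluster of P, cluster of Q, mapped?)
training_pairs = [
    (0.42, 3, 7, 1),   # moderately similar paragraphs from compatible topics
    (0.05, 3, 12, 0),  # dissimilar paragraphs from unrelated topics
    (0.31, 8, 7, 1),
    (0.03, 15, 2, 0),
]
X = np.array([mapping_features(s, p, q) for s, p, q, _ in training_pairs], dtype=float)
y = np.array([label for *_, label in training_pairs])

# AdaBoost stands in for BoosTexter; 200 rounds mirrors the setting reported in Section 4.
classifier = AdaBoostClassifier(n_estimators=200).fit(X, y)

# Macro alignment (Section 3.2): classify every paragraph pair of a new text pair.
print(classifier.predict(np.array([mapping_features(0.35, 3, 7)], dtype=float)))
```

For a new text pair, each paragraph's cluster number would first be obtained by assigning the paragraph to the closest of the clusters produced by Vertical Clustering, as described in Section 3.2.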
Given sentences i and j, their local similarity sim(i, j) is: sim(i, j) = cos(i, j) - mismatch_penalty. The weight s(i, j) of the optimal alignment between the two sentences is computed by the following recurrence: s(i, j) = max( s(i, j-1) - skip_penalty, s(i-1, j) - skip_penalty, s(i-1, j-1) + sim(i, j), s(i-1, j-2) + sim(i, j) + sim(i, j-1), s(i-2, j-1) + sim(i, j) + sim(i-1, j), s(i-2, j-2) + sim(i, j-1) + sim(i-1, j) ). (A schematic implementation of this recurrence is sketched below.)</Paragraph> <Paragraph position="4"> The mismatch penalty penalizes sentence pairs with a very low similarity measure, while the skip penalty prevents the alignment from being restricted to only those sentence pairs with high similarity.</Paragraph> </Section> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Evaluation Setup </SectionTitle> <Paragraph position="0"> The Data. We compiled two collections from the Encyclopedia Britannica and Britannica Elementary.</Paragraph> <Paragraph position="1"> In contrast to the long (up to 15-page) detailed articles of the Encyclopedia Britannica, Britannica Elementary contains one- to two-page entries targeted towards children. The elementary version generally contains a subset of the information presented in the comprehensive version, but there are numerous cases in which the elementary entry contains additional or more up-to-date pieces of information.7 The two collections together exhibit many instances of complex rewriting.</Paragraph> <Paragraph position="2"> We collected 103 pairs of comprehensive/elementary city descriptions. We set aside a testing set of 11 text pairs. The rest (92 pairs) were used for the Vertical Clustering. Nine text pairs were used for training (see Table 1 for statistics).</Paragraph> <Paragraph position="3"> Human Annotation. Each text pair in the training and testing sets was annotated by two annotators, who were asked to mark each pair of sentences that expresses the same information. We allowed many-to-many alignments. On average, each annotator spent 50 minutes per text pair. While the annotators agreed for most of the sentence pairs they identified, there were some cases of disagreement.</Paragraph> <Paragraph position="5"> Alignment is a tedious task, and sentence pairs can easily be missed even by a careful human annotator.</Paragraph> <Paragraph position="6"> For each text pair, a third annotator went through the contested sentence pairs, deciding on a case-by-case basis whether to include each of them in the alignment. Overall, 320 sentence pairs were aligned in the training set and 281 in the testing set. The other sentence pairs, which were not aligned, served as negative examples, yielding a total of 4192 training instances and 3884 testing instances.9 Confirming that order is not preserved in comparable corpora, there were up to nine order shifts in each of the annotated text pairs. Table 2 (distribution of manually aligned sentence pairs among different similarity ranges) shows that a large fraction of manually aligned sentence pairs have low lexical similarity. Similarity is measured here by the number of words in common, normalized by the number of types in the shorter sentence.</Paragraph> <Paragraph position="7"> Parameter Tuning. We tuned all the parameters on our training set, obtaining the following values: the skip penalty is 0.001, and the cosine threshold for selecting pairs with high lexical similarity is 0.5. BoosTexter was trained for 200 iterations. To find the optimal number of clusters for each collection, Vertical Clustering was performed with different numbers of clusters, ranging from 10 to 40; we selected the alternatives with the best performance on the training set: 20 clusters for both collections.</Paragraph> </Section> </Paper>
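Returning to the micro-alignment step of Section 3.3, the recurrence above translates directly into a small dynamic program. The Python sketch below is one reading of that recurrence under illustrative settings: the sentence-similarity function and the mismatch penalty value are placeholders (Section 4 only reports the 0.001 skip penalty), and backtracking to recover the aligned sentence pairs is omitted.

```python
def local_alignment_score(sents1, sents2, cos_sim,
                          mismatch_penalty=0.1, skip_penalty=0.001):
    """Dynamic-programming weight s(i, j) of the best alignment between the
    sentences of two mapped paragraphs, allowing 1-1, 1-2, 2-1 and flipped pairs."""
    n, m = len(sents1), len(sents2)

    def sim(i, j):
        # sim(i, j) = cos(i, j) - mismatch_penalty (1-based sentence indices)
        return cos_sim(sents1[i - 1], sents2[j - 1]) - mismatch_penalty

    s = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            candidates = [
                s[i][j - 1] - skip_penalty,       # skip sentence j
                s[i - 1][j] - skip_penalty,       # skip sentence i
                s[i - 1][j - 1] + sim(i, j),      # align i with j
            ]
            if j >= 2:
                candidates.append(s[i - 1][j - 2] + sim(i, j) + sim(i, j - 1))      # i with j and j-1
            if i >= 2:
                candidates.append(s[i - 2][j - 1] + sim(i, j) + sim(i - 1, j))      # i and i-1 with j
            if i >= 2 and j >= 2:
                candidates.append(s[i - 2][j - 2] + sim(i, j - 1) + sim(i - 1, j))  # flipped order
            s[i][j] = max(candidates)
    return s[n][m]


# Toy usage with a placeholder similarity (word overlap over the shorter sentence's types):
def toy_cos(a, b):
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(1, min(len(wa), len(wb)))

print(local_alignment_score(["Lisbon has a mild climate .", "Rain is frequent ."],
                            ["The climate of Lisbon is mild .", "Winters are rainy ."],
                            toy_cos))
```

The final alignment for a text pair (step 4 of the algorithm in Section 3) is then the union of the sentence pairs recovered, per mapped paragraph pair, by backtracking through the corresponding table.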