<?xml version="1.0" standalone="yes"?> <Paper uid="W04-1121"> <Title>Aligning Bilingual Corpora Using Sentences Location Information*</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Basic Concepts </SectionTitle> <Paragraph position="0"> 1) Alignment anchors: Brown (1991) first introduced the concept of alignment anchors when aligning the Hansard corpus. Anchors are aligned sentence pairs that divide the whole text into small fragments.</Paragraph> <Paragraph position="1"> 2) Sentence bead: Brown (1991) also called each aligned sentence pair a sentence bead. Sentence beads come in different patterns, such as (0:1), (1:0), (1:1), (1:2), (1:more), (2:1), (2:2), (2:more), (more:1), (more:2), (more:more). 3) Sentence pair: any two sentences in the bilingual text can form a sentence pair.</Paragraph> <Paragraph position="2"> 4) Candidate anchors: candidate anchors are sentence pairs that may serve as alignment anchors. In this paper, all (1:1) sentence beads are treated as candidate anchors.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Mathematical Model of Alignment </SectionTitle> <Paragraph position="0"> The alignment process has two steps. The first step is to merge all the original paragraphs into one large paragraph, which eliminates problems caused by vague paragraph boundaries.</Paragraph> <Paragraph position="1"> The second step is the alignment itself. After alignment, the bilingual text becomes a sequence of translated fragments, where a fragment may consist of one, two, or several sentences.</Paragraph> <Paragraph position="2"> A traditional alignment method can then be applied to multi-sentence fragments to refine the alignment granularity. 
In this paper, a formal description of the alignment task is given by extending the concepts of bipartite graph and matching from graph theory.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Bipartite graph </SectionTitle> <Paragraph position="0"> Bipartite graph: let G be an undirected graph, G=<V, E>. The vertex set V has two finite subsets V1 and V2 such that V1 ∪ V2 = V and V1 ∩ V2 = ∅. Let E be a collection of pairs: if e ∈ E, then e = {vi, vj}, where vi ∈ V1 and vj ∈ V2. The triple G = <V1, E, V2> is called a bipartite graph. If each vertex of V1 is joined to each vertex of V2 (in our setting, an edge represents a sentence pair) and E is the set of all such edges, then the triple G = <V1, E, V2> is called a complete bipartite graph. Let |V1| = m and |V2| = n, where m and n are the numbers of elements of V1 and V2, respectively. The complete bipartite graph is usually abbreviated as Km,n (as shown in Figure 1).</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 Matching </SectionTitle> <Paragraph position="0"> Matching: let G = <V1, E, V2> be a bipartite graph. 
A matching of G is a subset M of E with the property that no two edges of M share a common vertex.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.3 Best Alignment Matching </SectionTitle> <Paragraph position="0"> The procedure of alignment using sentence location information can be seen as a special matching.</Paragraph> <Paragraph position="1"> We define this problem as &quot;Best Alignment Matching&quot; (BAM).</Paragraph> <Paragraph position="2"> BAM: if M=<S, EM, T> is a best alignment matching of G=<S, E, T>, then M must meet the following conditions: 1) all the vertexes in the complete bipartite graph are ordered; 2) the weight d(si, tj) of every edge in EM satisfies d(si, tj) < D (where D is the alignment threshold); at the same time, there is no edge {sk, tr} with k<i and r>j, or with k>i and r<j; 3) if |S|=m and |T|=n, then the edge {sm, tn} belongs to EM. (Figure 1: the complete bipartite graph K3,3.) The best alignment matching is obtained by repeatedly selecting the edge with the smallest weight in E, until the weight d(si, tj) of every remaining edge is equal to or greater than the alignment threshold D.</Paragraph> <Paragraph position="3"> Generally, the alignment threshold D is determined empirically, because different texts have different styles.</Paragraph> <Paragraph position="4"> If each sentence in the text S (or T) corresponds to a vertex in V1 (or V2), the text S or T can be denoted by S(s1, s2, s3, ..., si, ..., sj, ..., sm) or T(t1, t2, t3, ..., ti, ..., tj, ..., tn). Formally, combining each element of S with every element of T creates a complete bipartite graph. Thus the alignment task can be seen as the process of searching for the BAM in the complete bipartite graph. As shown in Figure 2, the edge e = {si, tj} belongs to M; this means that the i-th sentence in text S and the j-th sentence in text T can form an alignment anchor. 
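As an illustration, the three BAM conditions can be checked mechanically. The sketch below is not from the paper: the names `edges`, `weights`, and `threshold` are illustrative, with sentence pairs given as 1-based index tuples and weights d(si, tj) precomputed in a dictionary.

```python
def is_best_alignment_matching(edges, m, n, weights, threshold):
    """Check the BAM conditions for a set of edges {(i, j)} over K_{m,n}.

    edges     -- set of (i, j) sentence-pair tuples, 1-based
    weights   -- dict mapping (i, j) to the alignment value d(si, tj)
    threshold -- the alignment threshold D
    """
    # Condition 3: the last sentence pair (s_m, t_n) must be aligned.
    if (m, n) not in edges:
        return False
    for (i, j) in edges:
        # Condition 2a: every selected edge lies below the threshold D.
        if weights[(i, j)] >= threshold:
            return False
        # Condition 2b: no crossing edge (k < i with r > j, or k > i with r < j).
        for (k, r) in edges:
            if (k < i and r > j) or (k > i and r < j):
                return False
    return True
```

Under this reading, a monotone chain of pairs such as {(1,1), (3,3)} passes, while a crossing set such as {(1,2), (2,1), (3,3)} is rejected.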
Each edge corresponds to an alignment value. In order to ensure that the bilingual texts are divided into the same number of fragments, we assume by default that the last sentences of the bilingual texts are aligned; that is, {sm, tn} ∈ M holds, where |S|=m and |T|=n in the BAM mathematical model.</Paragraph> <Paragraph position="5"> We stipulate that the smaller the alignment value, the more likely the sentence pair is a candidate anchor. The sentence pair with the smallest value is found in the complete bipartite graph; this selected pair is the most probable aligned (1:1) sentence bead. The alignment process repeats until the alignment anchors become saturated under the alignment threshold.</Paragraph> <Paragraph position="6"> The extracted sentence pairs serve as alignment anchors. These anchors divide the whole texts into short aligned fragments. The definition of BAM ensures that the selected sentence pairs cannot produce cross-alignment errors, and (1:more) or (more:1) alignment fragments can be obtained from the fragment pairs between two adjacent alignment anchors.</Paragraph> </Section> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Anchors Selection during Alignment </SectionTitle> <Paragraph position="0"> All (1:1) sentence beads were extracted from bilingual texts of different styles. Their distributions are all similar, as presented in Figure 3. The horizontal axis denotes the sentence number in the Chinese text, and the vertical axis denotes the sentence number in the English text.</Paragraph> <Paragraph position="1"> Statistical results show that more than 85% of the sentence beads in bilingual texts are (1:1) sentence beads, and their distributions follow a clear pattern. Wu (1994) likewise reported that (1:1) sentence beads account for 89% in English-Chinese text. 
If we select such sentence beads as candidate anchors, the alignment method generalizes to other language pairs. The main points of our alignment method using sentence location information are: locating by the whole text, collocating by sentence length, and checking with a bilingual dictionary. The location information of every sentence pair is used fully. Three lengths are used: the sentence length, the upper context length above the sentence pair, and the nether context length below it. All this information is considered when calculating the alignment weight of each sentence pair. Finally, a sentence pair with a high weight is checked against an English-Chinese bilingual dictionary.</Paragraph> <Paragraph position="2"> In order to study the relationship within every sentence pair {si, tj}, four parameters are defined: whole text length ratio P0 = Ls / Lt; upper context length ratio Pu[i, j] = Usi / Utj; nether context length ratio Pd[i, j] = Dsi / Dtj; sentence length ratio Pl[i, j] = Lsi / Ltj. (Figure 2: sketch map of Km,n under BAM.)</Paragraph> <Paragraph position="4"> (Figure 3: distribution of (1:1) sentence beads in bilingual texts.) Here si is the i-th sentence of S; tj is the j-th sentence of T; Ls is the length of the source language text S; Lt is the length of the target language text T; Lsi is the length of si; Ltj is the length of tj; Usi is the upper context length above sentence si; Utj is the upper context length above sentence tj; Dsi is the nether context length below sentence si; and Dtj is the nether context length below sentence tj. Figure 4 clearly illustrates the relationships among these variables.</Paragraph> <Paragraph position="5"> If si and tj can form a (1:1) alignment anchor, P[i, j] must be less than the alignment threshold, where P[i, j] denotes the integrated alignment value between si and tj. We assume that the weight coefficient of Pl[i, j] is 1. By symmetry, Pu[i, j] and Pd[i, j] must have the same weight coefficient, which we denote a. 
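The four ratios above can be computed directly from per-sentence lengths. The following sketch uses illustrative names not taken from the paper, 0-based indices, and whatever length unit (characters or bytes) the implementer chooses; the fallback to P0 for an empty context is an assumption, since the paper does not say how boundary sentences are handled.

```python
def length_ratios(src_lens, tgt_lens, i, j):
    """Compute (P0, Pu, Pd, Pl) for the sentence pair (s_i, t_j).

    src_lens, tgt_lens -- per-sentence lengths of texts S and T
    i, j               -- 0-based sentence indices into S and T
    """
    Ls, Lt = sum(src_lens), sum(tgt_lens)          # whole-text lengths
    Us, Ut = sum(src_lens[:i]), sum(tgt_lens[:j])  # upper context lengths
    Ds = Ls - Us - src_lens[i]                     # nether context below s_i
    Dt = Lt - Ut - tgt_lens[j]                     # nether context below t_j
    P0 = Ls / Lt                      # whole text length ratio
    Pu = Us / Ut if Ut else P0        # upper context ratio (guard empty context)
    Pd = Ds / Dt if Dt else P0        # nether context ratio
    Pl = src_lens[i] / tgt_lens[j]    # sentence length ratio
    return P0, Pu, Pd, Pl
```

For a perfectly proportional pair of texts, all four ratios coincide, which is exactly the situation the alignment function rewards.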
We construct a formal alignment function on every sentence pair:</Paragraph> <Paragraph position="7"> where the parameter a is the weight coefficient; it adjusts the balance between the weight of the sentence-pair length and the weights of the context lengths. The longer the text, the less sensitive the effect of the context length becomes, so the value of a should increase to keep the whole proportion balanced; for short texts, the opposite holds. In this paper we define:</Paragraph> <Paragraph position="9"> According to the definition of BAM, the smaller the alignment function value P[i, j], the higher the probability that the sentence pair {si, tj} is a (1:1) sentence bead. In this paper, we adopt a greedy algorithm that selects alignment anchors according to all the alignment function values P[i, j] that are less than the alignment threshold. This procedure can be implemented with a time complexity of O(m*n).</Paragraph> <Paragraph position="10"> Further improvement in alignment accuracy requires calculating the similarity of the sentence pairs. An English-Chinese bilingual dictionary is adopted to calculate the semantic similarity between the two sentences of a sentence pair. The similarity formula based on the bilingual dictionary is as follows: here |L| denotes the byte count of its argument, Match(T) is the set of English words that (according to the English-Chinese dictionary) have a Chinese translation occurring in the Chinese sentence, and Match(S) is the set of matched Chinese fragments.</Paragraph> <Paragraph position="11"> With the above dictionary check, alignment precision improves greatly. A statistical analysis of the remaining errors shows that most are partial alignment errors. Partial alignment means that the alignment location is correct, but one half of the aligned pair is incomplete. Such errors are very difficult to avoid when only sentence location and length information is taken into account. 
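The paper's exact similarity formula is not reproduced above, so the following is only a guessed reading of the Match(S)/Match(T) description: a coverage ratio over the combined lengths of both sentences. The function name, the `en2zh` dictionary shape, and the symmetric ratio itself are all assumptions.

```python
def dict_similarity(en_words, zh_sent, en2zh):
    """Sketch of a dictionary-based similarity check for a sentence pair.

    en_words -- tokenized English sentence (list of words)
    zh_sent  -- Chinese sentence as a string
    en2zh    -- dict mapping an English word to candidate Chinese translations
    """
    matched_en = []  # Match(T): English words whose translation occurs in zh_sent
    matched_zh = []  # Match(S): the matched Chinese fragments
    for w in en_words:
        for trans in en2zh.get(w.lower(), []):
            if trans in zh_sent:
                matched_en.append(w)
                matched_zh.append(trans)
                break
    covered = sum(len(w) for w in matched_en) + sum(len(f) for f in matched_zh)
    total = sum(len(w) for w in en_words) + len(zh_sent)
    return covered / total if total else 0.0
```

A pair whose English words are all translatable into fragments of the Chinese sentence scores 1.0; a pair with no dictionary overlap scores 0.0.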
Thus, in order to reduce this kind of error, we also check the semantic similarity of the context-adjacent sentence pairs. Because these pairs may follow other alignment patterns, such as (1:2) or (2:1), their similarity formulas differ somewhat from the (1:1) sentence pair formula.</Paragraph> <Paragraph position="12"> Here, a simple judgement is performed. It is shown as:</Paragraph> <Paragraph position="14"> Here, those alignment anchors whose dictionary-based similarities exceed the similarity threshold become the final alignment anchors. These final anchors divide the whole bilingual texts into aligned fragments.</Paragraph> </Section> <Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 5 Algorithm Implementation </SectionTitle> <Paragraph position="0"> According to the definition of BAM, the first selected anchor divides the whole bilingual texts into two parts. We stipulate that sentences in the upper part of the source text cannot match any sentence in the nether part of the target text. As shown in Figure 5, after the first alignment anchor is selected, the second candidate anchor must be selected in the first quadrant or the third quadrant, excluding the boundary. It is obvious that if a candidate anchor lies in the second or fourth quadrant, cross alignment occurs. For example, if (i, j) is the first selected alignment anchor and (i-1, j+1) is the second, a cross alignment appears. 
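This restricted greedy selection can be sketched as follows. The sketch assumes alignment values P[i, j] are precomputed in a dictionary keyed by 1-based sentence indices; the function name and data layout are illustrative, not from the paper, and the dictionary-similarity check is omitted for brevity.

```python
def select_anchors(weights, m, n, threshold):
    """Greedy BAM anchor selection with search-field restriction.

    weights   -- dict mapping (i, j), 1-based, to the alignment value P[i, j]
    m, n      -- sentence counts of the source and target texts
    threshold -- the alignment threshold D
    """
    anchors = [(m, n)]  # the last sentence pair is aligned by default
    # Only pairs below the threshold and strictly inside the grid are candidates.
    candidates = {p: w for p, w in weights.items()
                  if w < threshold and p[0] < m and p[1] < n}
    while candidates:
        (i, j) = min(candidates, key=candidates.get)  # smallest alignment value
        anchors.append((i, j))
        # Keep only candidates in the first or third quadrant relative to
        # (i, j), excluding the boundary row i and column j, so no
        # cross-alignment (second/fourth quadrant) can ever be selected.
        candidates = {(k, r): w for (k, r), w in candidates.items()
                      if (k < i and r < j) or (k > i and r > j)}
    return sorted(anchors)
```

Each selected anchor shrinks the search field of the next one, which is precisely the quadrant restriction described above.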
We can limit the anchor selection field to prevent cross-alignment errors.</Paragraph> <Paragraph position="1"> In addition, to handle the case where the first sentence pair is not a (1:1) sentence bead, we use a virtual sentence length as the origin alignment sentence bead when initializing the alignment process.</Paragraph> <Paragraph position="2"> The alignment algorithm is implemented as follows: 1) load the bilingual text and the English-Chinese dictionary; 2) identify the English and Chinese sentence boundaries and number each sentence; 3) take the last sentence pair as aligned by default and calculate every sentence pair's alignment value; 4) search for the sentence pair corresponding to the smallest alignment function value; 5) if the smallest alignment function value is less than the alignment threshold, go to step 6); if it is equal to or greater than the threshold, go to step 7); 6) if the similarity of the sentence pair exceeds a certain threshold, the sentence pair becomes an alignment anchor and divides the bilingual text into two parts; then limit the search field of the next candidate anchors and go to step 4); 7) output the aligned texts and terminate.</Paragraph> </Section> </Paper>