<?xml version="1.0" standalone="yes"?> <Paper uid="W03-0314"> <Title>Learning Sequence-to-Sequence Correspondences from Parallel Corpora via Sequential Pattern Mining</Title> <Section position="3" start_page="0" end_page="32" type="metho"> <SectionTitle> 2 Our Basic Idea </SectionTitle> <Paragraph position="0"> Our approach is illustrated in Figure 1. We concatenate corresponding parallel sentences into bilingual sequences to which sequential pattern mining is applied. By doing so, we obtain the following effects: + It generates translation candidates, both rigid and gapped, while avoiding combinatorial explosion.</Paragraph> <Paragraph position="1"> + It achieves an efficient calculation of a contingency table in a single run of sequential pattern mining. In what follows, we describe sequential pattern mining and each module in Figure 1.</Paragraph> <Section position="1" start_page="0" end_page="32" type="sub_section"> <SectionTitle> 2.1 Sequential Pattern Mining </SectionTitle> <Paragraph position="0"> Sequential pattern mining discovers frequent subsequences as patterns in a sequence database (Agrawal and Srikant, 1995). Here, a subsequence is an order-preserving item sequence in which some gaps between items are allowed. In this paper, we write the support of a subsequence s in a sequence database S as support_S(s), meaning the occurrence frequency of s in S. The problem is defined as follows: given a set of sequences S, where each sequence consists of items, and a user-specified minimum support ξ, sequential pattern mining finds all of the subsequences whose occurrence frequency in the set S is no less than ξ.</Paragraph> <Paragraph position="1"> A sequential pattern differs from an N-gram pattern in that the former includes patterns both with and without gaps and imposes no limit on pattern length. These characteristics of sequential pattern mining lead us to the idea of concatenating corresponding parallel sentences into a bilingual sequence database from which bilingual sequential patterns are mined efficiently.</Paragraph> <Paragraph position="3"/> </Section> <Section position="2" start_page="32" end_page="32" type="sub_section"> <SectionTitle> 2.2 Bilingual Lexicon Extraction 2.2.1 Bilingual Sequence Database </SectionTitle> <Paragraph position="0"> For each parallel sentence, we apply language-dependent preprocessing, such as word segmentation and part-of-speech tagging. Then we concatenate the monolingual sequences into a single bilingual sequence, and the collection of bilingual sequences becomes a sequence database S.</Paragraph> <Paragraph position="1"> A single run of sequential pattern mining takes care of identifying and counting translation candidate patterns (rigid and gapped, some of which overlap) in the bilingual sequence database. All English subsequences satisfying the minimum support ξ will be generated (e.g., "e1", "e1e2", "e1e3", ..., indicated by Ei). Similarly, all Japanese and bilingual subsequences with support ≥ ξ will be generated (indicated by Jj and EiJj, respectively). It is important to point out that for any bilingual pattern EiJj, the corresponding English pattern Ei and Japanese pattern Jj that form the constituents of the bilingual pattern are always recognized and counted.</Paragraph>
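<Paragraph> To make the construction of the bilingual sequence database concrete, the following C++ sketch (our own illustration, not the paper's code) concatenates a tokenized parallel sentence pair into one bilingual sequence. The "en:"/"ja:" item tags are an assumed encoding for keeping the two languages distinguishable, since the paper does not specify how items are represented. </Paragraph>

    #include <iostream>
    #include <string>
    #include <vector>

    // One bilingual sequence: the English token sequence followed by the
    // Japanese token sequence of the same parallel sentence pair.
    using Sequence = std::vector<std::string>;

    // Tag each item with its language so that mined subsequences can be
    // split back into English (Ei), Japanese (Jj), and bilingual (EiJj)
    // patterns after mining.
    Sequence concatenate(const Sequence& english, const Sequence& japanese) {
        Sequence bilingual;
        for (const auto& w : english) bilingual.push_back("en:" + w);
        for (const auto& w : japanese) bilingual.push_back("ja:" + w);
        return bilingual;
    }

    int main() {
        // Toy two-pair parallel corpus; each call yields one database entry.
        std::vector<Sequence> database = {
            concatenate({"economic", "growth"}, {"keizai", "seicho"}),
            concatenate({"economic", "policy"}, {"keizai", "seisaku"}),
        };
        for (const auto& seq : database) {
            for (const auto& item : seq) std::cout << item << ' ';
            std::cout << '\n';
        }
    }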
<Paragraph position="2"> PrefixSpan In order to realize sequential pattern mining, we use the PrefixSpan algorithm (Pei et al., 2001). The general idea is to divide the sequence database by frequent prefixes and to grow the prefix-spanning patterns in depth-first-search fashion.</Paragraph> <Paragraph position="3"> We introduce some concepts. Let α be a sequential pattern in the sequence database S. Then, we refer to the α-projected database, S|α, as the collection of postfixes of sequences in S with respect to the prefix α.</Paragraph> <Paragraph position="4"> A running example of PrefixSpan with the minimum support ξ = 2 (i.e., mining of sequential patterns with frequency ≥ 2) is shown in Figure 2. Each item in the sequence database is indicated by e_ij, where e is an item, i is a sequence id, and j is the offset for the postfix of sequence id i. First, frequent sequential patterns of length 1 are selected. This gives A, B, and C. The support of D is less than the minimum support 2, so the D-projected database will not be created. Following the projections drawn with bold lines in Figure 2, we proceed with the frequent prefix A. Since A satisfies the minimum support 2, it creates an A-projected database S|A derived from the sequence database S. From S|A, the frequent items B and C are identified, subsequently forming the prefix patterns AB and AC and the corresponding projected databases of postfixes, S|AB and S|AC. We continue the projection recursively to mine all sequential patterns satisfying the minimum support count 2.</Paragraph> <Paragraph position="5"> PrefixSpan is described in Figure 3. The predicate projectable is designed to encode whether a projection is feasible in an application domain. The original PrefixSpan uses a predicate that always returns true.</Paragraph> <Paragraph position="6"> There are a number of possibilities for projectable to reflect linguistic constraints. The default projectable predicate covers both rigid and gapped sequences satisfying the minimum support. If we care about word adjacency, projectable should return true only when the last item of the mined pattern and the first item of a postfix sequence in the projected database are contiguous. Another possibility is to prevent a certain class of words from becoming an item of a sequence. For example, we may wish to find sequences consisting only of content words; in such a case, we should disallow projections involving functional-word items.</Paragraph> <Paragraph position="7"> The effect of sequential pattern mining from a bilingual sequence database is best seen in the contingency table shown in Table 1. The frequencies of a bilingual pattern EiJj, an English pattern Ei, and a Japanese pattern Jj correspond to a, a + b, and a + c, respectively. Since we know the total number of bilingual sequences N = a + b + c + d, the values of b, c, and d can be calculated immediately.</Paragraph> <Paragraph position="8"> The contingency table is used for calculating a similarity (or association) score between Ei and Jj. For the present work, we use Dunning's log-likelihood ratio statistic (Dunning, 1993), defined as follows:</Paragraph> <Paragraph position="10"> For each bilingual pattern EiJj, we compute its similarity score and qualify the pair as a bilingual sequence-to-sequence correspondence if no equally strong or stronger association is found for either monolingual constituent. This step is conservative and the same as step 5 in Moore (2001) or step 6(b) in Kitamura and Matsumoto (1996).</Paragraph>
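<Paragraph> The log-likelihood ratio formula announced above did not survive this extraction. As a reference, the statistic of Dunning (1993) over the contingency-table counts a, b, c, d of Table 1, with N = a + b + c + d, is commonly written as below; this is the standard textbook form, offered as a plausible reconstruction rather than the paper's exact typesetting. </Paragraph>

    % Log-likelihood ratio (G^2) over a 2x2 contingency table;
    % a, b, c, d and N = a + b + c + d as defined in Table 1.
    \[
      \mathrm{LLR}(E_i, J_j) = 2 \left(
          a \log \frac{aN}{(a+b)(a+c)}
        + b \log \frac{bN}{(a+b)(b+d)}
        + c \log \frac{cN}{(c+d)(a+c)}
        + d \log \frac{dN}{(c+d)(b+d)}
      \right)
    \]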
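<Paragraph> To make the mining procedure and the projectable predicate concrete, here is a compact C++ sketch of PrefixSpan (our own simplification, not the paper's double-array implementation). A boolean rigid flag stands in for projectable: when true, a pattern may only grow with the item immediately adjacent to its last matched item. </Paragraph>

    #include <iostream>
    #include <map>
    #include <set>
    #include <string>
    #include <vector>

    using Sequence = std::vector<std::string>;
    using Database = std::vector<Sequence>;
    // Projected database: one (sequence id, postfix start) entry per
    // supporting sequence.
    using Projection = std::vector<std::pair<int, int>>;

    void prefixSpan(const Database& db, const Projection& proj,
                    Sequence& pattern, int minsup, int maxpat, bool rigid) {
        if ((int)pattern.size() >= maxpat) return;
        std::map<std::string, Projection> grown;
        for (auto [s, start] : proj) {
            std::set<std::string> seen;  // count each item once per sequence
            for (int j = start; j < (int)db[s].size(); ++j) {
                // Adjacency constraint: a non-empty pattern may only grow
                // with the item directly following its last matched item.
                if (rigid && !pattern.empty() && j != start) break;
                if (seen.insert(db[s][j]).second)
                    grown[db[s][j]].push_back({s, j + 1});
            }
        }
        for (const auto& [item, nextProj] : grown) {
            if ((int)nextProj.size() < minsup) continue;  // prune infrequent prefix
            pattern.push_back(item);
            for (const auto& w : pattern) std::cout << w << ' ';
            std::cout << "(support " << nextProj.size() << ")\n";
            prefixSpan(db, nextProj, pattern, minsup, maxpat, rigid);
            pattern.pop_back();
        }
    }

    int main() {
        Database db = {{"A", "B", "C", "D"}, {"A", "C", "B"}, {"A", "B", "D"}};
        Projection all;
        for (int i = 0; i < (int)db.size(); ++i) all.push_back({i, 0});
        Sequence pattern;
        prefixSpan(db, all, pattern, /*minsup=*/2, /*maxpat=*/3, /*rigid=*/false);
    }

<Paragraph> Note one simplification in this sketch: keeping only the first occurrence of an item per postfix counts each gapped pattern once per sequence, but it can miss rigid extensions that start at a later occurrence; a full implementation would track all occurrences. </Paragraph>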
<Paragraph> Our implementation uses a digital trie structure called a double array for efficient storage and retrieval of sequential patterns (Aoe, 1989).</Paragraph> <Paragraph position="11"> For a non-segmented language, the word unit depends on the results of morphological analysis. In the case of Japanese morphological analysis, ChaSen (Matsumoto et al., 2000) tends to over-segment words, while JUMAN (Kurohashi et al., 1994) tends to under-segment them. It is difficult to define units of correspondence by consulting only the Japanese half of the parallel corpora. A parallel sentence pair may resolve some Japanese word segmentation ambiguity; however, we have no way to rank word units with the same degree of segmentation ambiguity. Instead, we assume that sequence-to-sequence pairs that co-occur frequently across the entire parallel corpora are translation pairs. Using the global frequencies of monolingual and bilingual sequences in the entire parallel corpora, we have a better chance of ranking the ties, thereby resolving ambiguity in the monolingual half. Following this intuition, we generate overlapped translation candidates where ambiguity exists and extract the ones with high association scores.</Paragraph> <Paragraph position="12"> Sequential pattern mining takes care of translation candidate generation as well as efficient counting of the generated candidates. This characteristic is well suited to our purpose of generating overlapped translation candidates whose frequencies are efficiently counted.</Paragraph> </Section> </Section> <Section position="4" start_page="32" end_page="32" type="metho"> <SectionTitle> 3 Experimental Results </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="32" end_page="32" type="sub_section"> <SectionTitle> 3.1 Data </SectionTitle> <Paragraph position="0"> We use English-Japanese parallel corpora that are automatically aligned from comparable corpora of newswire text (Utiyama and Isahara, 2002). There are 150,000 parallel sentences that satisfy their proposed sentence similarity. We use TnT (Brants, 2000) for English POS tagging and ChaSen (Matsumoto et al., 2000) for Japanese morphological analysis, and label each token as either content or functional depending on its part of speech.</Paragraph> </Section> <Section position="2" start_page="32" end_page="32" type="sub_section"> <SectionTitle> 3.2 Evaluation Criteria </SectionTitle> <Paragraph position="0"> We evaluate our sequence-to-sequence correspondences by accuracy and coverage, which we believe are criteria similar to those of Moore (2001) and Melamed (2001)2. Let Cseq be the set of bilingual sequences judged correct by a human judge, Sseq be the set of bilingual sequences identified by our system, Ctoken be the multiset of items covered by Cseq, Ttoken be the multiset of items in the bilingual sequence database, Ctype be the set of items covered by Cseq, and Ttype be the set of items in the bilingual sequence database. Then, our evaluation metrics are given by:</Paragraph> <Paragraph position="1"> accuracy = |Cseq| / |Sseq|, token coverage = |Ctoken| / |Ttoken|, type coverage = |Ctype| / |Ttype| </Paragraph> <Paragraph position="2"> 2 We would like to examine how many distinct translation pairs are correctly identified (accuracy) and how well the identified subsequences can be used for partial sequence alignment in the original parallel corpora (coverage). Since not all of the correct translation pairs in our parallel corpora are annotated, the sum of true positives and false negatives remains unknown. For this reason, we avoid the evaluation terms precision and recall to emphasize the difference. There are many variations of evaluation criteria used in the literature. At first, we tried to use Moore's criteria to present a direct comparison. Unfortunately, we are unclear about how the frequency of multi-words in the parallel corpora is counted, which seems to be required for the denominator of his coverage formula. Further, we did not split the corpus into training and test portions for cross-validation. Our method is unsupervised, and learning does not involve tuning the parameters of a probabilistic model for unseen events, so we believe that results on the entire parallel corpora provide indicative material for evaluation.</Paragraph>
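<Paragraph> As a toy reading of the metrics above (using the ratio forms just given, which we reconstructed from the set definitions since the original formula line was lost), the following sketch computes the three figures; all values are placeholders for illustration only. </Paragraph>

    #include <iostream>
    #include <set>
    #include <string>
    #include <vector>

    int main() {
        // Sseq: bilingual sequences proposed by the system;
        // Cseq: the subset judged correct by a human judge.
        std::vector<std::string> Sseq = {"e1 e2 / j1", "e3 / j2", "e4 / j3"};
        std::vector<std::string> Cseq = {"e1 e2 / j1", "e3 / j2"};

        // Token counts use multisets (every occurrence counts);
        // type counts use sets (each distinct item counts once).
        std::multiset<std::string> Ttoken = {"e1", "e2", "e1", "e3", "e4",
                                             "j1", "j2", "j3"};
        std::multiset<std::string> Ctoken = {"e1", "e2", "e1", "e3", "j1", "j2"};
        std::set<std::string> Ttype(Ttoken.begin(), Ttoken.end());
        std::set<std::string> Ctype(Ctoken.begin(), Ctoken.end());

        std::cout << "accuracy       = " << double(Cseq.size()) / Sseq.size() << '\n';
        std::cout << "token coverage = " << double(Ctoken.size()) / Ttoken.size() << '\n';
        std::cout << "type coverage  = " << double(Ctype.size()) / Ttype.size() << '\n';
    }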
<Paragraph position="3"> In order to calculate accuracy, each translation pair is compared against the EDR dictionary (EDR, 1995). All entries that appeared in the dictionary were assumed to be correct. The remaining list was checked by hand. A human judge was asked to decide "correct", "nearmiss", or "incorrect" for each proposed translation pair without any reference to the surrounding context. The distinction between "nearmiss" and "incorrect" is that the former includes translation pairs that are partially correct3.</Paragraph> <Paragraph position="4"> In Tables 3, 4, and 5, accuracy is given as a range from a combination of "correct" and "nearmiss" to a combination of "nearmiss" and "incorrect". Having calculated the total accuracy, accuracies for single-word translation pairs only and for multi-word translation pairs only are calculated accordingly.</Paragraph> </Section> <Section position="3" start_page="32" end_page="32" type="sub_section"> <SectionTitle> 3.3 Results </SectionTitle> <Paragraph position="0"> Our method is implemented in C++ and executed on a 2.20 GHz Pentium IV processor with 2 GB of memory. For each experiment, we set the minimum support (minsup) and the maximum length (maxpat) of patterns. All experiments target bilingual sequences of content words only, since we feel that functional-word correspondences are better dealt with by consulting the surrounding contexts in the parallel corpora4. An execution on the bilingual sequence database compiled from the 150,000 sentences takes less than 5 minutes with minsup = 3 and maxpat = 3, inferring 14,312 translation pairs.</Paragraph> <Paragraph position="1"> Given a different language pair, a different genre of text, and different evaluation criteria, we find it difficult to compare our results directly with previous high-accuracy approaches such as Moore (2001). Below, we give an approximate comparison of our empirical results.</Paragraph> <Paragraph position="2"> Table 3 shows a detailed result for rigid sequences with minsup = 3 and maxpat = 3. In total, we obtain 14,312 translation pairs, of which 6,567 are single-word translation pairs and 7,745 are multi-word translation pairs.</Paragraph> <Paragraph position="3"> 3 We include "not sure" ones for single-word translations. Those are entries which are correct in some context but debatable to include in a dictionary by themselves. As for multi-word translations, we include pairs that can become "correct" in at most two rewriting steps.</Paragraph> <Paragraph position="4"> 4 The inclusion of functional-word items in bilingual sequences is debatable. We have conducted a preliminary experiment with approximately 10,000 sentences taken from an English-Japanese dictionary.
Since the sentences are shorter and more instructive, we obtain grammatical collocations such as "impressed with / ni kanmei" and "apologize for / koto owabi", and phrasal expressions such as "for your information / go sanko" and "on behalf of / wo daihyo shi". However, we felt that it was not practical to include functional words in this work, since the parallel corpora are large-scale and the interesting translation pairs in newspaper text are named entities comprised mostly of content words.</Paragraph> <Paragraph> [Table 3 (caption and header fragment): accuracy is given as a range from a combination of "correct" and "nearmiss" to a combination of "nearmiss" and "incorrect"; the left side of the slash gives a tighter evaluation and the right side a looser evaluation. Column headers: minsup, maxpat, extracted sequences, correct sequences, total accuracy, single-word accuracy, multi-word accuracy, token coverage, type coverage.] </Paragraph> <Paragraph position="5"> In this paper, we evaluate only the top 9,000 pairs sorted by the similarity score.</Paragraph> <Paragraph position="6"> For single-word translations, we obtain 93-99% accuracy at 19% token coverage and 11% type coverage. This implies that about one fifth of the content-word tokens in the parallel corpora can find their correspondences with high accuracy. We cannot compare our word alignment result with Moore (2001), since the actual rate of tokens that can be aligned by single-word translation pairs is not explicitly mentioned there. Although our main focus is sequence-to-sequence correspondences, the critical question remains as to what level of accuracy can be obtained when extending the coverage rate, for example to 36%, 46%, or 90%. Our result appears much inferior to those of Moore (2001) and Melamed (2001) in this respect and may not reach 36% type coverage. A possible explanation for the poor performance is that our algorithm has no mechanism for checking mutually exclusive constraints between translation candidates derived from the same parallel sentence pair.</Paragraph> <Paragraph position="7"> For general multi-word translations, our method seems more comparable to Moore (2001). Our method achieves 56-84% accuracy at 11% type coverage. This seems better than his "compound accuracy" for hypothesizing multi-word occurrences, which is 45-54% at 12% type coverage; however, it is less favorable than the "multiword accuracy" provided by Microsoft parsers, which is 73-76% at 12% type coverage (Moore, 2001).</Paragraph> <Paragraph position="8"> The better performance could be attributed to our redundant generation of overlapped translation candidates to account for ambiguity. Although redundancy introduces noisier indirect associations than a one-to-one mapping, our empirical result suggests that there is still a good chance of direct associations being selected.</Paragraph> <Paragraph position="9"> Table 4 shows the results for rigid sequences with a higher minimum support and a longer maximum length. Compared with Table 3, setting a higher minimum support produces slightly more cost-effective results. For example, with minsup = 10 and maxpat = 3, 4,467 pairs are extracted with 89.3-97.1% accuracy, while the top 4,000 pairs with minsup = 3 and maxpat = 3 are extracted with 89.1-97.1% accuracy.
Table 4 also reveals a drop in multi-word accuracy when extending maxpat, indicating that care should be given to the length of a pattern as well as to a cutoff threshold.</Paragraph> <Paragraph position="10"> Our analysis suggests that an iterative method controlling minsup and maxpat appropriately seems better than a single execution cycle of finding correspondences. It can take mutually exclusive constraints into account more easily, which will improve the overall performance. Another interesting extension is to incorporate more linguistically motivated constraints into the generation of sequences. Yamamoto et al. (2001) report that N-gram translation candidates that do not go beyond chunk boundaries boost performance. Had we performed language-dependent chunking in the preparation of bilingual sequences, such a chunk boundary constraint could be represented simply in the projectable predicate. These issues are left for future research.</Paragraph> <Paragraph position="11"> One of the advantages of our method is the uniform generation of both rigid and gapped sequences simultaneously. Gapped sequences are generated and extracted without recording offsets and without distinguishing compositional compounds from non-compositional compounds. Although non-compositional compounds are rare and more difficult to extract, compositional compounds are still useful as collocational entries in a bilingual dictionary.</Paragraph> <Paragraph position="12"> There are positive and negative effects in our gapped sequences mined by sequential pattern mining. Suppose we have the English sequences "My best friend wishes your father to visit ..." and "... best wishes for success". Then we obtain a pattern "best wishes" whose occurrences should be counted separately. However, if we have the sequences "staying at Hilton hotel" and "staying at Kyoto Miyako hotel", then we obtain a kind of phrasal template, "staying at hotel", where the individual hotel name, Hilton or Kyoto Miyako, is abstracted away (a toy matching sketch is given below). The usefulness of such gapped sequences is still open, but we empirically evaluate the result of gapped sequences with minsup = 10 and maxpat = 3, shown in Table 5.</Paragraph> <Paragraph position="13"> Comparing Tables 4 and 5, we lose multi-word accuracy substantially. Table 6 is a breakdown of rigid and gapped sequences with minsup = 10 and maxpat = 3.</Paragraph> <Paragraph position="14"> The "Both" row lists the number of pairs found, under the category described in the column head, in both rigid and gapped sequences. The "Rigid only" row counts those found only in rigid sequences, while the "Gapped only" row counts those found only in gapped sequences. We learn that the decrease in multi-word accuracy is due to an increase in the proportion of wrong pairs: 57% (937 / 1,649) in gapped sequences versus 40% (112 / 283) in rigid sequences.</Paragraph> <Paragraph position="15"> However, gapped sequences have contributed to an increase in the absolute number of correct multi-word translation pairs (+539 correct pairs). In order to gain better insight, we summarize the length combinations between English patterns and Japanese patterns in Tables 7 and 8. They reveal that the word adjacency constraint in rigid sequences is too stringent. By relaxing the constraint, 436 (546 - 110) correct 2-2 translation pairs are found, though 200 (229 - 29) wrong 2-2 pairs are introduced at the same time. At this particular setting of minsup = 10 and maxpat = 3, considering gapped sequences of length 3 seems to introduce more noise.</Paragraph>
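<Paragraph> The toy sketch referred to above follows: it checks whether a pattern occurs as an order-preserving subsequence with arbitrary gaps, reproducing both the spurious "best wishes" hit and the useful "staying at hotel" template. It is our own illustration, not part of the paper's system. </Paragraph>

    #include <iostream>
    #include <sstream>
    #include <string>
    #include <vector>

    // True if `pattern` occurs in `sentence` as an order-preserving
    // subsequence, with arbitrary gaps allowed between matched items.
    bool gappedMatch(const std::vector<std::string>& pattern,
                     const std::vector<std::string>& sentence) {
        size_t p = 0;
        for (const auto& w : sentence)
            if (p < pattern.size() && w == pattern[p]) ++p;
        return p == pattern.size();
    }

    std::vector<std::string> tokens(const std::string& s) {
        std::istringstream in(s);
        std::vector<std::string> out;
        for (std::string w; in >> w; ) out.push_back(w);
        return out;
    }

    int main() {
        // Spurious hit: the gap conflates two unrelated usages.
        std::cout << gappedMatch(tokens("best wishes"),
                       tokens("My best friend wishes your father to visit"))
                  << '\n';  // prints 1, although this occurrence should not count
        // Useful template: the hotel name is abstracted away by the gap.
        std::cout << gappedMatch(tokens("staying at hotel"),
                       tokens("staying at Kyoto Miyako hotel"))
                  << '\n';  // prints 1
    }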
<Paragraph position="16"> Admittedly, we still require further analysis in searching for a break-even point between rigid and gapped sequences.</Paragraph> <Paragraph position="17"> Our preliminary finding supports the work on collocations by Smadja et al. (1996) in that gapped sequences are also an important class of multi-word translations.</Paragraph> </Section> </Section> </Paper>