File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/94/c94-1026_metho.xml
Size: 12,417 bytes
Last Modified: 2025-10-06 14:13:37
<?xml version="1.0" standalone="yes"?> <Paper uid="C94-1026"> <Title>A Part-of-Speech-Based Alignment Algorithm</Title> <Section position="3" start_page="0" end_page="167" type="metho"> <SectionTitle> 2. Alignment Problem </SectionTitle> <Paragraph position="0"> Alignment has three levels: 1) paragraph; 2) sentence; and 3) word. The paragraph level is sometimes called the discourse level. Much effort has gone into the sentence level, and fewer researchers touch on the word level (Gale and Church, 1991b). To do sentence alignment, we should first define what a sentence is. In English, the sentence terminators are the full stop, question mark, and exclamation mark. However, the usage of punctuation marks is unrestricted in Chinese and the types of punctuation marks are numerous (Yang, 1981). Nevertheless, in order to treat the languages in parallel, we define the sentence markers to be the full stop, question mark, and exclamation mark in all languages. Therefore, an alignment of two texts is to find a best sequence of sentence groups, each of which ends with one of the sentence terminators.</Paragraph> <Paragraph position="1"> Following Brown et al. (1991), we use the term bead. A bead contains some sentences of the source and target texts. Thus, alignment can be defined as (1).</Paragraph> <Paragraph position="2"> (1) An alignment is to find a bead sequence under some criteria.</Paragraph> <Paragraph position="3"> If the applied criteria are significant, the performance will be good. Finding significant criteria is the core of this research.</Paragraph> <Paragraph position="4"> 3. Criteria of Alignment Any alignment algorithm has its own criteria. For example, many alignment algorithms are based on sentence length and word correspondence. Here, we propose a POS-based criterion.</Paragraph> <Paragraph position="5"> (2) Alignment Criterion: The numbers of critical parts of speech (POSes) of a language pair in an aligned bead are close. 
Now, the problem is what forms the critical POSes. Following many grammar formalisms (Sells, 1985), the content words will be good indicators.</Paragraph> <Paragraph position="6"> Therefore, we regard nouns, verbs, and adjectives as the critical POSes. In addition, we include numbers and quotation marks in the critical POSes on intuitive grounds. The English tagging system used in this work follows that of the LOB Corpus (Johansson, 1986). The Chinese tagging system follows that of the BDC Corpus (BDC, 1992) but with some modifications.</Paragraph> <Paragraph position="7"> The BDC Corpus does not assign tags to punctuation marks. We adopt the same philosophy as the LOB Corpus and assign the punctuation marks themselves as their tags. These critical POSes in English and in Chinese are listed in Table 1. N- represents all tags beginning with N, i.e., - is a wildcard.</Paragraph> <Paragraph position="8"> Our bilingual corpus is investigated to check the effectiveness of postulation (2). Ten aligned Chinese-to-English texts, CE 01 to CE 10, are considered as the objects of experiments. These texts are selected from Sinorama Magazine, published monthly in Chinese and English by the Government Information Office of the R.O.C. The Appendix lists the sources of these ten texts. We compute the average of differences (AD), variance of differences (VD), and standard deviation of differences (SD) of the critical POSes. Table 2 itemizes the values.</Paragraph> <Paragraph position="9"> For oriental languages like Chinese, the correspondence with the aligned counterpart in occidental languages is not as manifest as in the alignment of two languages within the same family. On the one hand, the POSes may change within an alignment; on the other hand, a sequence of English words may correspond to a single Chinese word. These phenomena make the alignment much harder. However, Table 2 shows that these POSes are good indicators for alignment.</Paragraph> <Paragraph position="10"> 4. 
How to Evaluate the Performance Before proceeding with the experiment, another important issue is how to evaluate the correct rate of an alignment. No literature touches on this issue. The literature reports what the performance is rather than how to evaluate it. Note that alignment has an order constraint. On the one hand, when an error occurs, the performance should drop quickly. On the other hand, the error will not propagate to the next paragraph. That is to say, the error will be limited to a range. Our criterion for evaluating performance takes care of these two factors.</Paragraph> <Paragraph position="11"> For a given text, we could manually find the real alignment. This alignment consists of a sequence of beads, as mentioned previously. We call the sequence of beads the Real Bead Sequence (RBS). In contrast, we may apply any alignment algorithm to find an alignment. We call this alignment the Computed Bead Sequence (CBS).</Paragraph> <Paragraph position="12"> In order to evaluate the performance of the alignment algorithm, we further define the Incremental Bead Sequence (IBS).</Paragraph> <Paragraph position="13"> (3) Incremental Bead Sequence: The IBS of a given bead sequence BS is a bead sequence such that bead $IB_i$ in IBS is the summation of $B_j$ ($0 \le j \le i-1$) in BS.</Paragraph> <Paragraph position="14"> Therefore, two possible IBSes, IRBS and ICBS, are generated under this consideration. We define performance for an alignment as (4) $\text{Performance} = \frac{\text{number of common beads in IRBS and ICBS}}{\text{number of beads in IRBS}}$. Table 3 demonstrates how to calculate the performance. Two beads, (3,3) and (10,10), are shared by IRBS and ICBS. The total number of beads in IRBS is 10. Therefore, the performance is 20%. In the following experiment, we will use this method to evaluate performance.</Paragraph> <Paragraph position="15"> Table 3. 
Examples for Calculating Performance</Paragraph> <Paragraph position="17"/> </Section> <Section position="4" start_page="167" end_page="169" type="metho"> <SectionTitle> 5. Alignment Algorithm </SectionTitle> <Paragraph position="0"> The alignment algorithms proposed in the past literature try to find an optimal alignment which has the largest alignment probability. Due to the very large search space, they all consider only five types of beads: (0,1), (1,1), (1,2), (2,1), and (1,0). After examining our corpus, we find other types of beads such as (1,3) and (1,4). Furthermore, bead type (2,4) is also found. Table 4 lists the distribution of bead types in the testing texts. Eight bead types appear in the bilingual texts. Bead type (1,1) is the majority (63.9%). Bead types (1,3), (1,4), and (2,4), which are not treated in other papers, account for 8.2%. If the alignment algorithm did not deal with these bead types, the correct rate would be bounded by 91.8%.</Paragraph> <Paragraph position="1"> This shows the difficulty of the alignment task. If we allow various types of beads and adopt the optimal search, the processing cost is too high to bear. A good algorithm should satisfy the following two conditions: It is a general local search algorithm.</Paragraph> <Paragraph position="2"> It allows unlimited bead types in the aligning process.</Paragraph> <Paragraph position="3"> Under this consideration, the simulated annealing approach (Aarts and Korst, 1989) is used to align texts. The idea of annealing comes from condensed matter physics. It involves two steps: 1) increasing the temperature of the matter; 2) decreasing the temperature gradually until the matter is in the ground configuration. Simulated annealing simulates the annealing process. Therefore, a simulated annealing mechanism is composed of four parts: configuration, transition function, energy function, and annealing schedule. 
If we take an alignment as a configuration, the possible alignments constitute the configuration space. In addition, every configuration is associated with an energy. The optimal configuration is the one which has the lowest energy. Simulated annealing finds the optimal configuration from an initial configuration by generating a sequence of configurations under a control parameter.</Paragraph> <Paragraph position="4"> For our application, we introduce another component, the Transition Vector. The five components are defined as follows.</Paragraph> <Paragraph position="5"> (5) Configuration (C): An alignment is naturally a configuration. For example, a possible bead sequence, {(1,2), (1,1), (1,1), (1,2), (1,1)}, is a configuration.</Paragraph> <Paragraph position="6"> (6) Transition Function (T): Given a configuration, this function is responsible for generating its next configuration. A transition vector is generated at random, and then the transition function moves one configuration to another configuration according to the transition vector.</Paragraph> <Paragraph position="7"> (7) Transition Vector (TV): A transition vector consists of 4 components (B, N, W, D).</Paragraph> <Paragraph position="8"> B denotes the identification (counted from 0) of a selected bead.</Paragraph> <Paragraph position="9"> N specifies whether to generate a new bead or not. If N equals 0, no new bead is generated. If N equals 1, a new bead is generated.</Paragraph> <Paragraph position="10"> W represents which language in the selected bead should be moved out. If W equals 0, one of the marginal sentences of the first language should be moved out. Otherwise, one of the marginal sentences of the second language should be moved out.</Paragraph> <Paragraph position="11"> D represents the moving direction. 
0 denotes that the left marginal sentence of the selected bead is moved left, and 1 denotes that the right marginal sentence of the selected bead is moved right.</Paragraph> <Paragraph position="12"> For example, the transition function will move a configuration {(1,2), (1,1), (1,1), (1,2), (1,1)} to {(1,2), (1,1), (0,1), (1,0), (1,2), (1,1)} according to the transition vector TV = (2, 1, 0, 1).</Paragraph> <Paragraph position="13"> (8) Energy Function (E): Assume each sentence has a weight, which is measured by the number of critical POSes. The weight difference of a bead is the difference between the weights of the respective sentences in the bead. The energy of a configuration is the sum of the weight differences of all beads in the configuration.</Paragraph> <Paragraph position="14"> (9) Annealing Schedule (AS): When a new configuration C' is generated, two alternatives are considered: move to the new configuration C' or retain the current configuration C. The criterion is: if $E(C') \le E(C)$, the new configuration is adopted.</Paragraph> <Paragraph position="16"> If $E(C')$ exceeds $E(C)$, then with probability $\exp((E(C) - E(C'))/cp_k)$ we will also move to the new configuration.</Paragraph> <Paragraph position="17"> Otherwise, the current configuration is retained. This is the well-known Metropolis Criterion. Here $cp_k$ is the control parameter, which is reduced gradually in the annealing process.</Paragraph> <Paragraph position="18"> Now, we apply simulated annealing to aligning the texts CE 01 to CE 10. The initial control parameter $cp_k$ is 1.0 and the initial run length $L_k$ is 1000. We reduce the control parameter by 0.5% after each run. The initial configuration is randomly generated. We conduct two experiments: 1) without using paragraph markers; 2) using paragraph markers. The results are shown in Table 5 and Table 6, respectively.</Paragraph> <Paragraph position="19"> The correct rates without and with paragraph markers are 78.9% and 94.4%, respectively. 
The latter result (94.4%) is better than the bound on the correct rate (91.8%) mentioned before. This shows that those difficult bead types are resolved in our approach. Comparing Tables 5 and 6, we conclude that when the paragraph markers are used, the performance increases significantly. Fig. 1 shows the significance of paragraph markers. In other words, if an alignment algorithm can use any reliable anchor points in the texts, the performance will increase sharply.</Paragraph> <Paragraph position="20"> In fact, the performance of alignment depends on the nature of the texts. When aligning noisy texts without reliable anchor points, we will definitely do a bad job. However, the simulated annealing approach could reduce the risk, and the performance stays above 78% in our experiment.</Paragraph> </Section> </Paper>
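The evaluation measure (4), the energy function (8), and the acceptance test (9) described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the sample bead sequences, and the sentence weights are hypothetical, the bead weight difference is assumed to be an absolute difference, and the acceptance test assumes the standard form of the Metropolis criterion.

```python
import math
import random
from itertools import accumulate


def incremental(beads):
    """Definition (3): running sums of a bead sequence of (source, target) counts."""
    return list(accumulate(beads, lambda a, b: (a[0] + b[0], a[1] + b[1])))


def performance(real, computed):
    """Definition (4): share of IRBS beads that also occur in ICBS."""
    irbs = incremental(real)
    icbs = set(incremental(computed))
    return sum(1 for bead in irbs if bead in icbs) / len(irbs)


def energy(beads, src_weights, tgt_weights):
    """Definition (8): sum over beads of the weight difference.

    A sentence weight is its number of critical POSes. Taking the
    absolute difference is an assumption made for this sketch.
    """
    total, i, j = 0, 0, 0
    for s, t in beads:
        total += abs(sum(src_weights[i:i + s]) - sum(tgt_weights[j:j + t]))
        i, j = i + s, j + t
    return total


def accept(e_new, e_cur, cp, rng=random.random):
    """Metropolis criterion (standard form, assumed): always accept a
    configuration that is not worse; otherwise accept with probability
    exp((e_cur - e_new) / cp)."""
    if e_new <= e_cur:
        return True
    return math.exp((e_cur - e_new) / cp) > rng()


# Hypothetical example: one misplaced boundary in the computed alignment.
real = [(1, 1), (1, 1), (1, 1), (1, 1)]
computed = [(1, 2), (1, 0), (1, 1), (1, 1)]
print(performance(real, computed))
```

Because the measure compares cumulative sentence counts, an early boundary error costs only the beads up to the point where the two alignments resynchronize, which matches the requirement that an error stays limited to a range.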