
<?xml version="1.0" standalone="yes"?>
<Paper uid="C96-1037">
  <Title>j schang@cs.nthu.edu.tw</Title>
  <Section position="3" start_page="210" end_page="211" type="metho">
    <SectionTitle>
2. The Word Alignment Algorithm
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="210" end_page="210" type="sub_section">
      <SectionTitle>
2.1 Preliminary details. SenseAlign is a class-based
</SectionTitle>
      <Paragraph position="0"> word alignment system that utilizes both existing and acquired lexical knowledge. The system contains the following components and distinctive features.</Paragraph>
      <Paragraph position="1"> A. A greedy algorithm for aligning words. The algorithm is a greedy decision procedure for selecting preferred connections. The evaluation is based on composite scores of various factors: applicability, specificity, fan-out, relative distortion probabilities, and evidence from bilingual dictionaries.</Paragraph>
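The greedy evaluation described above can be sketched as follows. This is an illustrative assumption, not the paper's implementation: the factor names, the multiplicative combination of per-factor probabilities, and all numeric values are hypothetical.

```python
# Hypothetical sketch of composite-score candidate evaluation.
# Factor names, probabilities, and the multiplicative combination are
# illustrative assumptions, not SenseAlign's actual parameters.

def composite_score(candidate, factor_probs):
    """Combine per-factor confidence estimates for one connection candidate.

    candidate    -- dict mapping factor name -> observed factor value
    factor_probs -- dict mapping factor name -> {value: P(correct | value)}
    """
    score = 1.0
    for factor, value in candidate.items():
        # Small floor for unseen factor values, so one gap does not zero out the score.
        score *= factor_probs[factor].get(value, 1e-6)
    return score

def best_connection(candidates, factor_probs):
    """Greedy step: pick the highest-scoring candidate."""
    return max(candidates, key=lambda c: composite_score(c, factor_probs))

candidates = [
    {"applicability": "high", "specificity": "low"},
    {"applicability": "high", "specificity": "high"},
]
factor_probs = {
    "applicability": {"high": 0.8, "low": 0.3},
    "specificity": {"high": 0.7, "low": 0.4},
}
print(best_connection(candidates, factor_probs))
```

The multiplicative form treats the factors as independent evidence; the paper's empirically estimated probability functions would replace the toy tables here.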
      <Paragraph position="2"> B. Lexical preprocessing. Morphological analysis, part-of-speech tagging, and idiom identification are performed for the two languages involved. In addition, certain morpho-syntactic analyses are performed to handle structures that are specific to only one of the two languages involved. By doing so, the sentences are brought closer to each other in the number of words.</Paragraph>
      <Paragraph position="3"> C. Two thesauri for classifying words (McArthur 1992; Mei et al. 1993). Classification allows a word to align with a target word using the collective translation tendency of words in the same class. Class-based rules have far fewer parameters, are easier to acquire, and can be applied more broadly.</Paragraph>
      <Paragraph position="4"> D. Two different ways of learning class-based rules. The class-based rules can be acquired either from bilingual materials, such as example sentences and their translations, or from definition sentences for senses in a machine-readable dictionary.</Paragraph>
      <Paragraph position="5"> E. Similarity between connection target and dictionary translations. In 40% of the correct connections, the target of the connection and the dictionary translation have at least one Chinese character in common. To exploit this thesaurus effect (footnote 1) in translation, we include similarity between target and dictionary translation as one of the factors.</Paragraph>
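A minimal sketch of the character-overlap idea above: a connection target and a dictionary translation count as similar if they share at least one Chinese character. The Dice-style score and the example strings are assumptions for illustration; the paper does not give its exact similarity formula here.

```python
# Sketch of character-overlap similarity between a connection target and a
# dictionary translation. The Dice coefficient over character sets is an
# assumed measure; the paper only requires "at least one character in common".

def char_similarity(target, dict_translation):
    shared = set(target) & set(dict_translation)
    if not shared:
        return 0.0
    return 2.0 * len(shared) / (len(set(target)) + len(set(dict_translation)))

# Invented strings sharing the character 行 (overlap detected):
print(char_similarity("银行", "行库") > 0)
# No character in common -> similarity 0:
print(char_similarity("银行", "鱼"))
```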
      <Paragraph position="6"> F. Relative distortion. The translation process tends to preserve contiguous syntactic structures. The target position in a connection depends highly on those of adjacent connections. Therefore, parameters in a model of distortion based on absolute position are highly redundant. Replacing probabilities of the form d(i|j, l, m) with relative distortion is a feasible alternative. By relative distortion rd for the connection (s, t), we mean (j-j')-(i-i'), where the i'th word s', in the same syntactic structure as s, is connected to the j'th word t' in TT.</Paragraph>
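The relative-distortion definition above is direct to compute. A small sketch, using word indices for positions:

```python
# rd = (j - j') - (i - i'), where (i', j') is the anchor connection for the
# word s' in the same syntactic structure as s, per the definition above.

def relative_distortion(i, j, anchor_i, anchor_j):
    """Relative distortion of connection (s at i, t at j) given anchor (i', j')."""
    return (j - anchor_j) - (i - anchor_i)

# A monotone translation keeps rd near 0:
print(relative_distortion(3, 3, 2, 2))   # 0
# Crossing or long-distance movement yields larger |rd|:
print(relative_distortion(3, 7, 2, 2))   # 4
```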
    </Section>
    <Section position="2" start_page="210" end_page="210" type="sub_section">
      <SectionTitle>
2.2. Acquisition of alignment rules. Class-based
</SectionTitle>
      <Paragraph position="0"> alignment rules can be acquired from a bilingual corpus. Table 1 presents the ten rules with the highest applicability acquired from the example sentences and their translations in LecDOCE.</Paragraph>
      <Paragraph position="1"> Alternatively, we can acquire rules from the bilingual definition text for senses in a bilingual dictionary.</Paragraph>
      <Paragraph position="2"> The definition sentences are disambiguated using a sense division based on thesauri for the two languages involved. Each sense is assigned codes from the two thesauri according to its definition in both languages.</Paragraph>
      <Paragraph position="3"> See Table 2 for examples of sense definitions and acquired rules.</Paragraph>
    </Section>
    <Section position="3" start_page="210" end_page="211" type="sub_section">
      <SectionTitle>
2.3 Evaluation of connection candidates.
</SectionTitle>
      <Paragraph position="0"> Connection candidates can be evaluated using various factors of confidence. The probabilities of having a correct connection as functions of these factors are estimated empirically to reflect their relative contribution to the total confidence of a connection candidate. (Footnote 1: From one aspect, words sharing common characters can be considered as synonyms that would appear in a thesaurus. Fujii and Croft (1993) pointed out that this thesaurus effect of Kanji in Japanese helps broaden the query favorably for character-based information retrieval of Japanese documents.)</Paragraph>
      <Paragraph position="1"> Table 3 lists the empirical probabilities of various factors.</Paragraph>
      <Paragraph position="2"> 2.4. Alignment algorithm. Our algorithm for word alignment is a decision procedure for selecting the preferred connection from a list of candidates. The initial list of selected connections contains two dummy connections. These establish the initial anchor points for calculating relative distortion. The highest-scored candidate is selected and added to the solution list. The newly added connection serves as an additional anchor for a more accurate estimation of relative distortion. The connection candidates that are inconsistent with the selected connection are removed from the list. Subsequently, the rest of the candidates are re-evaluated. Figure 1 presents the SenseAlign algorithm.</Paragraph>
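The iterative procedure above can be sketched as a loop. This is an illustrative re-implementation, not the paper's Figure 1: the placeholder scoring function (nearest-anchor relative distortion only) and the consistency test (no reuse of a source or target position) are assumptions.

```python
# Illustrative sketch of the greedy alignment loop described above:
# start from dummy anchor connections, repeatedly take the best-scoring
# candidate, add it to the solution, drop candidates inconsistent with it,
# and re-score the remainder against the enlarged set of anchors.

def align(candidates, score, src_len, tgt_len):
    # Dummy connections anchor the sentence boundaries for relative distortion.
    solution = [(-1, -1), (src_len, tgt_len)]
    pool = list(candidates)                   # each candidate: (i, j) positions
    while pool:
        best = max(pool, key=lambda c: score(c, solution))
        solution.append(best)
        # Consistency assumption: a source or target word connects only once.
        pool = [c for c in pool if c[0] != best[0] and c[1] != best[1]]
    return sorted(solution[2:])               # drop the dummy anchors

# Placeholder score: prefer small relative distortion to the nearest anchor.
def score(c, solution):
    i, j = c
    ai, aj = min(solution, key=lambda a: abs(a[0] - i))
    return -abs((j - aj) - (i - ai))

print(align([(0, 1), (1, 0), (2, 2)], score, 3, 3))
```

Re-scoring inside the loop is what lets each newly accepted connection sharpen the distortion estimate for the remaining candidates, as the paragraph above describes.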
      <Paragraph position="3"> 3. Example of running SenseAlign.</Paragraph>
      <Paragraph position="4"> To illustrate how SenseAlign works, consider the pair of sentences (1e, 1c).</Paragraph>
      <Paragraph position="6"> (1c) Zuotian wo budao yitiao yu.</Paragraph>
      <Paragraph position="7"> yesterday I catch one fish.</Paragraph>
      <Paragraph position="8"> Table 4 shows the connections that are considered in each iteration of the SenseAlign algorithm. Various factors used to evaluate connections are also given. Table 5 lists the connections in the final solution of the alignment.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="211" end_page="211" type="metho">
    <SectionTitle>
4. Experiments with SenseAlign
</SectionTitle>
    <Paragraph position="0"> In this section, we present the experimental results of an implementation of SenseAlign and related algorithms. Approximately 25,000 bilingual example sentences from LecDOCE are used here as the training data. Here, the training data were used primarily to acquire rules by a greedy learner and to determine empirically the probability functions of various factors. The algorithm's performance was then tested on two sets of inside and outside data. The inside test consists of fifty sentence pairs from LecDOCE as input. The outside test uses 416 sentence pairs from a book on English sentence patterns containing a comprehensive fifty-five sets of typical sentence patterns. However, the words in this outside test are somewhat more common and, thereby, easier to align.</Paragraph>
    <Paragraph position="1"> &amp;quot;fhis is evident from the slightly higher hit rate based on simple dictionary lookup.</Paragraph>
    <Paragraph position="2"> The first experiment is designed to demonstrate the effectiveness of a naive algorithm (DictAlign) based on a bilingual dictionary. According to our results, although DictAlign produces high-precision alignment, the coverage for both test sets is below 20%.</Paragraph>
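A hedged sketch of what a dictionary-only baseline like DictAlign might look like: connect a source word to a target word only when the bilingual dictionary lists that target as a translation. The toy dictionary and sentence pair (modeled on example (1e, 1c) above) are assumptions; the point is that coverage is capped by dictionary hits.

```python
# Naive dictionary-based alignment in the spirit of the DictAlign baseline:
# a connection is made only on an exact dictionary hit, so precision is high
# but coverage is limited to words the dictionary happens to cover.

def dict_align(src_words, tgt_words, bilingual_dict):
    connections = []
    for i, s in enumerate(src_words):
        translations = bilingual_dict.get(s, set())
        for j, t in enumerate(tgt_words):
            if t in translations:
                connections.append((i, j))
                break                      # first match only (assumed policy)
    return connections

# Toy dictionary, loosely based on the paper's fish example; entries assumed.
bilingual_dict = {"fish": {"yu"}, "yesterday": {"zuotian"}}
src = ["yesterday", "i", "caught", "a", "fish"]
tgt = ["zuotian", "wo", "budao", "yitiao", "yu"]
print(dict_align(src, tgt, bilingual_dict))   # only 2 of 5 source words covered
```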
    <Paragraph position="3"> However, if the thesaurus effect is exploited, the coverage can be increased nearly threefold to about 40%, at the expense of a decrease of around 10% in precision.</Paragraph>
    <Paragraph position="4"> [Extraction residue: garbled LecDOCE sense-definition entries for the word "bank" (sandbank; banking of a car or aircraft; row of oars or keys; place where money is kept and paid out; store of organic products for medical use; supply of money in a game of chance; to put or keep money in a bank); the original table content is not recoverable.]</Paragraph>
    <Paragraph position="6"> In our second experiment, we use SenseAlign as described above for word alignment, except that no bilingual dictionary is used. In our third experiment, we use the full SenseAlign to align the testing data.</Paragraph>
    <Paragraph position="7"> Table 6 indicates that acquired lexical information and existing lexical information such as a bilingual dictionary can supplement each other to produce optimum alignment results. The generality of the approach is evident from the fact that the coverage and precision for the outside test are comparable with those of the inside test.</Paragraph>
  </Section>
  <Section position="5" start_page="211" end_page="213" type="metho">
    <SectionTitle>
5. Discussions
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="211" end_page="211" type="sub_section">
      <SectionTitle>
5.1 Machine-readable lexical resources vs. corpora
</SectionTitle>
      <Paragraph position="0"> We believe the proposed algorithm addresses the problem of the knowledge engineering bottleneck by using both corpora and machine-readable lexical resources such as dictionaries and thesauri. The corpora provide us with training and testing materials, so that empirical knowledge can be derived and evaluated objectively. The thesauri provide a classification that can be utilized to generalize the empirical knowledge gleaned from corpora. SenseAlign achieves a degree of generality, since a word pair can be accurately aligned even when it occurs rarely or only once in the corpus. This kind of generality is unattainable by statistically trained word-based models. Class-based models obviously offer the advantages of smaller storage requirements and higher system efficiency. Such advantages do have their costs, for class-based models may be over-generalized and miss word-specific rules. However, work on class-based systems has indicated that the advantages outweigh the disadvantages.</Paragraph>
    </Section>
    <Section position="2" start_page="211" end_page="213" type="sub_section">
      <SectionTitle>
5.2 Mutual information and frequency. Gale and
</SectionTitle>
      <Paragraph position="0"> Church (1990) show a near-miss example where a χ2-like statistic works better than mutual information for selecting strongly associated word pairs to use in word alignment. In their study, they contend that the χ2-like statistic works better because it uses co-nonoccurrence and the number of sentences where one word occurs while the other does not, which are often larger, more stable, and more indicative than the co-occurrence counts used in mutual information.</Paragraph>
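The χ2-like statistic discussed above is conventionally computed from the 2x2 sentence-level contingency table for a word pair. The sketch below uses the standard phi-squared formulation as an assumed stand-in; the exact variant in the cited work may differ.

```python
# phi-squared over the 2x2 contingency table for a word pair (s, t):
#   a = sentence pairs where both s and t occur
#   b = s occurs, t does not;  c = t occurs, s does not
#   d = neither occurs (co-nonoccurrence, as emphasized above)
# This standard formulation is an assumption, not the cited work's exact variant.

def phi_squared(a, b, c, d):
    """phi^2 = (ad - bc)^2 / ((a+b)(c+d)(a+c)(b+d)); ranges over [0, 1]."""
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0
    return (a * d - b * c) ** 2 / denom

# Strong association: s and t almost always occur together or not at all.
print(phi_squared(a=20, b=1, c=1, d=78))
# No association: the table is uniform, so phi^2 is 0.
print(phi_squared(a=5, b=5, c=5, d=5))
```

Note how the large `d` cell (co-nonoccurrence) contributes directly to the score, which is the stability argument made in the paragraph above.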
      <Paragraph position="1"> The above-cited work's discussions of the χ2-like statistic and the fan-in factor provide a valuable reference for this work. In our attempt to improve on the low coverage of word-based approaches, we use simple filtering according to fan-out in the acquisition of class-based rules, in order to maximize both coverage and precision. The rules that provide the most instances of plausible connections are selected.</Paragraph>
      <Paragraph position="2"> This contrasts with approaches based on word-specific statistics, where the strongly associated word pairs selected may not have a strong presence in the data.</Paragraph>
      <Paragraph position="3"> This generally corresponds to the results from recent work on a variety of tasks, such as terminology extraction and structural disambiguation. Daille, Gaussier and Lange (1994) demonstrated that simple criteria related to frequency, coupled with a linguistic filter, work better than mutual information for terminology extraction. Recent work involving structural disambiguation (Brill and Resnik 1994) also indicated that statistics related to frequency outperform mutual information and the φ2 statistic.</Paragraph>
    </Section>
  </Section>
</Paper>