<?xml version="1.0" standalone="yes"?>
<Paper uid="W02-1610">
  <Title>Learning Domain-Specific Transfer Rules: An Experiment with Korean to English Translation</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Overall Runtime System Design
</SectionTitle>
    <Paragraph position="0"> Our Korean to English MT runtime system relies on the following off-the-shelf software components: Korean parser For parsing, we used the wide coverage syntactic dependency parser for Korean developed by (Yoon et al., 1997). The parser was not trained on our corpus.</Paragraph>
    <Paragraph position="1"> Transfer component For transfer of the Korean parses to English structures, we used the same lexico-structural transfer framework as (Lavoie et al., 2000).</Paragraph>
    <Paragraph position="2"> Realizer For surface realization of the transferred English syntactic structures, we used the RealPro English realizer (Lavoie and Rambow, 1997).</Paragraph>
    <Paragraph position="3"> The training of the system is described in the next two sections.</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="1" type="metho">
    <SectionTitle>
3 Data Preparation
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="1" type="sub_section">
      <SectionTitle>
3.1 Parses for the Bitexts
</SectionTitle>
      <Paragraph position="0"> In our experiments, we used a parallel corpus derived from bilingual training manuals provided by the U.S. Defense Language Institute. The corpus consists of a Korean dialog of 4,183 sentences about battle scenario message traffic and their English human translations.</Paragraph>
      <Paragraph position="1"> The parses for the Korean sentences were obtained using Yoon's parser, as in the runtime system. The parses for the English human transla-</Paragraph>
      <Paragraph position="3"> tions were derived from an English Tree Bank developed in (Han et al., 2000). To enable the surface realization of the English parses via RealPro, we automatically converted the phrase structures of the English Tree Bank into deep-syntactic dependency structures (DSyntSs) of the Meaning-Text Theory (MTT) (Mel'Vcuk, 1988) using Xia's converter (Xia and Palmer, 2001) and our own conversion grammars. The realization results of the resulting DSyntSs for our training corpus yielded a unigram and bigram accuracy (f-score) of approximately 95% and 90%, respectively.</Paragraph>
      <Paragraph position="4"> A DSyntS is an unordered tree where all nodes are meaning-bearing and lexicalized. Since the output of the Yoon parser is quite similar, we have used its output as is. The syntactic dependency representations for two corresponding Korean  and English sentences are shown in Figure 1.</Paragraph>
    </Section>
    <Section position="2" start_page="1" end_page="1" type="sub_section">
      <SectionTitle>
3.2 Training and Test Sets of Parse Pairs
</SectionTitle>
      <Paragraph position="0"> The average sentence lengths (in words) and parse sizes (in nodes) for the 4,183 Korean and English sentences in our corpus are given in Table 1.</Paragraph>
      <Paragraph position="1"> In examining the Korean parses, we found that many of the larger parses, especially those containing intra-sentential punctuation, had incorrect dependency assignments, incomplete lemmatization or were incomplete parses. In examining the English converted parses, we found that many of  the parses containing intra-sentential punctuation marks other than commas had incorrect dependency assignments, due to the limitations of our conversion grammars. Consequently, in our experiments we have primarily focused on a higher quality sub-set of 1,483 sentence pairs, automatically selected by eliminating from the corpus all parse pairs where one of the parses contained more than 11 content nodes or involved problematic intra-sentential punctuation. null We divided this higher quality subset into training and test sets. For the test set, we randomly selected 50 parse pairs containing at least 5 nodes each. For the training set, we used the remaining 1,433 parse pairs. The average sentence lengths and parse sizes for the training and test sets are represented in Tables 2 and 3.</Paragraph>
    </Section>
    <Section position="3" start_page="1" end_page="1" type="sub_section">
      <SectionTitle>
3.3 Creating the Baseline Transfer Dictionary
</SectionTitle>
      <Paragraph position="0"> In our system, transfer dictionaries contain Korean to English lexico-structural transfer rules defined using the formalism described in (Nasr et.</Paragraph>
      <Paragraph position="1"> al., 1997), extended to include log likelihood ratios (Manning and Schutze, 1999: 172-175). Sample transfer rules are illustrated in Section 4. The simplest transfer rules consist of direct lexical mappings, while the most complex may contain source and target syntactic patterns composed of multiple nodes defined with lexical and/or syntactic features.</Paragraph>
      <Paragraph position="2"> Each transfer rule is assigned a log likelihood ratio calculated using the training parse set.</Paragraph>
      <Paragraph position="3"> To create the baseline transfer dictionary for our experiments, we had three bilingual dictionary resources at our disposal: A corpus-based handcrafted dictionary: This dictionary was manually assembled by (Han et al., 2000) for the same corpus used here. Note,  by baseline dictionary however, that it was developed for different parse representations, and with an emphasis primarily on the lexical coverage of the source parses, rather than the source and target parse pairs.</Paragraph>
      <Paragraph position="4"> A corpus-based extracted dictionary: This dictionary was automatically created from our corpus by the RALI group from the University of Montreal. Since the extraction heuristics did not handle the rich morphological suffixes of Korean, the extraction results contained inflected words rather than lexemes.</Paragraph>
      <Paragraph position="5"> A wide coverage dictionary: This dictionary of 70,300 entries was created by Systran, without regard to our corpus.</Paragraph>
      <Paragraph position="6"> We processed and combined these resources as follows: AF First, we replaced the inflected words with uninflected lexemes using Yoon's morphological analyzer and a wide coverage English morphological database (Karp and Schabes, 1992). AF Second, we merged all morphologically analyzed entries after removing all non-lexical features, since these features generally did not match those found in the parses.</Paragraph>
      <Paragraph position="7"> AF Third, we matched the resulting transfer dictionary entries with the training parse set, in order to determine for each entry all possible part-of-speech instantiations and dependency relationships. For each distinct instantiation, we calculated a log likelihood ratio.</Paragraph>
      <Paragraph position="8"> AF Finally, we created a baseline dictionary using the instantiated rules whose source patterns had the best log likelihood ratios.</Paragraph>
      <Paragraph position="9"> Table 4 illustrates the concurrent lexical coverage of the training set using the resulting baseline dictionary, i.e. the percentage of nodes covered by rules whose source and target patterns both match. Note that since the baseline dictionary contained some noise, we allowed induced rules to override ones in the baseline dictionary where applicable.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="1" end_page="1" type="metho">
    <SectionTitle>
4 Transfer Rule Induction
</SectionTitle>
    <Paragraph position="0"> The transfer rule induction process has the following steps described below (additional details can also be found in (Lavoie et. al., 2001)): AF Nodes of the corresponding source and target parses are aligned using the baseline transfer dictionary and some heuristics based on the similarity of part-of-speech and syntactic context. null AF Transfer rule candidates are generated based on the sub-patterns that contain the corresponding aligned nodes in the source and target parses.</Paragraph>
    <Paragraph position="1"> AF The transfer rule candidates are ordered based on their likelihood ratios.</Paragraph>
    <Paragraph position="2"> AF The transfer rule candidates are filtered, one at a time, in the order of the likelihood ratios, by removing those rule candidates that do not produce an overall improvement in the accuracy of the transferred parses.</Paragraph>
    <Paragraph position="3"> Figures 2 and 3 show two sample induced rules. The rule formalism uses notation similar to the syntactic dependency notation shown in Figure 1, augmented with variable arguments prefixed with $ characters. These two lexico-structural rules can be used to transfer a Korean syntactic representation for ci-to-reul po-ra to an English syntactic representation for look at the map. The first rule lexicalizes the English predicate and inserts the corresponding preposition while the second rule inserts the English imperative attribute.</Paragraph>
    <Section position="1" start_page="1" end_page="1" type="sub_section">
      <SectionTitle>
4.1 Aligning the Parse Nodes
</SectionTitle>
      <Paragraph position="0"> To align the nodes in the source and target parse trees, we devised a new dynamic programming  alignment algorithm that performs a top-down, bidirectional beam search for the least cost mapping between these nodes. The algorithm is parameterized by the costs of (1) aligning two nodes whose lexemes are not found in the baseline transfer dictionary; (2) aligning two nodes with differing parts of speech; (3) deleting or inserting a node in the source or target tree; and (4) aligning two nodes whose relative locations differ.</Paragraph>
      <Paragraph position="1"> To determine an appropriate part of speech cost measure, we first extracted a small set of parse pairs that could be reliably aligned using lexical matching alone, and then based the cost measure on the co-occurrence counts of the observed parts of speech pairings. The remaining costs were set by hand.</Paragraph>
      <Paragraph position="2"> As a result of the alignment process, alignment id attributes (aid) are added to the nodes of the parse pairs. Some nodes may be in alignment with no other node, such as English prepositions not found in the Korean DSyntS.</Paragraph>
    </Section>
    <Section position="2" start_page="1" end_page="1" type="sub_section">
      <SectionTitle>
4.2 Generating Rule Candidates
</SectionTitle>
      <Paragraph position="0"> Candidate transfer rules are generated by extracting source and target tree sub-patterns from the aligned parse pairs using the two set of constraints described below.</Paragraph>
      <Paragraph position="1">  Figure 4 shows an example alignment constraint. This constraint, which matches the structural patterns of the transfer rule illustrated in Figure 2, uses the aid alignment attribute to indicate that in a Korean and English parse pair, any source and target sub-trees matching this alignment constraint (where $X1 and $Y1 are aligned, i.e. have the same aid attribute values, and where $X2 and $Y3 are aligned) can be used as a point of departure for generating transfer rule candidates. We suggest that alignment constraints such as this one can be used to define most of the possible syntactic divergences between languages (Dorr, 1994), and that only a handful of them are necessary for two given languages (we have identified 11 general alignment constraints @KOREAN:</Paragraph>
      <Paragraph position="3"> Each node of a candidate transfer rule must have its relation attribute (relationship with its governor) specified if it is an internal node, otherwise this relation must not be specified: e.g.</Paragraph>
      <Paragraph position="5"> necessary for Korean to English transfer so far).</Paragraph>
      <Paragraph position="6">  Attribute constraints are used to limit the space of possible transfer rule candidates that can be generated from the sub-trees satisfying the alignment constraints. Candidate transfer rules must satisfy all of the attribute constraints. Attribute constraints can be divided into two types: AF independent attribute constraints, whose scope covers only one part of a candidate transfer rule and which are the same for the source and target parts; AF concurrent attribute constraints, whose scope extends to both the source and target parts of a candidate transfer rule.</Paragraph>
      <Paragraph position="7"> Figures 5 and 6 give examples of an independent attribute constraint and of a concurrent attribute constraint. As with the alignment constraints, we suggest that a relatively small number of attribute constraints is necessary to generate most of the desired rules for a given language pair.</Paragraph>
    </Section>
    <Section position="3" start_page="1" end_page="1" type="sub_section">
      <SectionTitle>
4.3 Ordering Rule Candidates
</SectionTitle>
      <Paragraph position="0"> In the next step, transfer rule candidates are ordered as follows. First, they are sorted by their decreasing log likelihood ratios. Second, if two or more candidate transfer rules have the same log likelihood ratio, ties are broken by a specificity heuristic, with the result that more general rules are ordered ahead In a candidate transfer rule, inclusion of the lexemes of two aligned nodes must be done concurrently: e.g.</Paragraph>
      <Paragraph position="2"> defined to be the following sum: the number of attributes found in the source and target patterns, plus 1 for each for each lexeme attribute and for each dependency relationship. In our initial experiments, this simple heuristic has been satisfactory.</Paragraph>
    </Section>
    <Section position="4" start_page="1" end_page="1" type="sub_section">
      <SectionTitle>
4.4 Filtering Rule Candidates
</SectionTitle>
      <Paragraph position="0"> Once the candidate transfer rules have been ordered, error-driven filtering is used to select those that yield improvements over the baseline transfer dictionary.</Paragraph>
      <Paragraph position="1"> The algorithm works as follows. First, in the initialization step, the set of accepted transfer rules is set to just those appearing in the baseline transfer dictionary. The current error rate is also established, by applying these transfer rules to all the source structures and calculating the overall difference between the resulting transferred structures and the target parses, using a tree accuracy recall and precision measure (determined by comparing the features and dependency relationships in the transferred parses and corresponding target parses). Then, in a single pass through the ordered list of candidates, each transfer rule candidate is tested to see if it reduces the error rate. During each iteration, the candidate transfer rule is provisionally added to the current set of accepted rules and the updated set is applied to all the source structures. If the overall difference between the transferred structures and the target parses is lower than the current error rate, then the candidate is accepted and the current error rate is updated; otherwise, the candidate is rejected and removed from the current set.</Paragraph>
    </Section>
    <Section position="5" start_page="1" end_page="1" type="sub_section">
      <SectionTitle>
4.5 Discussion of Induced Rules
</SectionTitle>
      <Paragraph position="0"> In our experiments, the alignment constraints yielded 22,881 source and target sub-tree pairs from the training set of 1,433 parse pairs. Using the at@KOREAN: null</Paragraph>
      <Paragraph position="2"> lexicalization and preposition insertion tribute constraints, an initial list of 801,674 transfer rule candidates was then generated from these sub-tree pairs. The initial list was subsequently reduced to 32,877 unique transfer rule candidates by removing duplicates and by eliminating candidates that had the same source pattern as another candidate with a better log likelihood ratio. After filtering, 2,133 of these transfer rule candidates were accepted. We expect that the number of accepted rules per parse pair will decrease with larger training sets, though this remains to be verified.</Paragraph>
      <Paragraph position="3"> The rule illustrated in Figure 3 was accepted as the 65th best transfer rule with a log likelihood ratio of 33.37, and the rule illustrated in Figure 2 was accepted as the 189th best transfer rule candidate with a log likelihood ratio of 12.77. An example of a candidate transfer rule that was not accepted is the one that combines the features of the two rules mentioned above, illustrated in Figure 7. This transfer rule candidate had a lower log likelihood ratio of 11.40; consequently, it is only considered after the two rules mentioned above, and since it provides no further improvement upon these two rules, it is filtered out.</Paragraph>
      <Paragraph position="4"> In an informal inspection of the top 100 accepted transfer rules, we found that most of them appear to be fairly general rules that would normally be found in a general syntactic-based transfer dictionary. In looking at the remaining rules, we found that the rules tended to become increasingly corpusspecific. null The induction results were obtained using a Java implementation of the induction component. Microsoft SQL Server was used to count and deduplicate the rule candidates. The data preparation and induction processes took about 12 hours on a 300 MHz PC with 256 MB RAM.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>