<?xml version="1.0" standalone="yes"?> <Paper uid="C00-2152"> <Title>An Integrated Architecture for Example-Based Machine Translation</Title> <Section position="4" start_page="0" end_page="1031" type="metho"> <SectionTitle> 2 The Travel Domain </SectionTitle> <Paragraph position="0"> Our prototype implementation of the HARMONY architecture was designed to cover the &quot;travel domain&quot;. This is composed of words, phrases, expressions, and sentences related to international travel, similar to what is covered by typical travel phrase books.</Paragraph> <Paragraph position="1"> Two principles guided our detailed definition of the translation domain. First, the translation domain should not be limited to a narrow sub-domain, such as appointment scheduling or hotel reservations. Second, the expressions considered in the domain should reflect the fact that people quickly adapt to limitations in human-machine or machine-mediated communication by simplifying the input. For example, (Sugaya et al. 1999) found that the average length of actual human utterances in a hotel reservation task using speech translation was only 6.1 words, much shorter than some of the data that has been used in previous work on speech translation.</Paragraph> <Paragraph position="2"> The current vocabulary of 7,500 words is divided into a group of general words, a number of extensible word groups (such as names of food items or diseases), and a number of area-specific word groups (such as names of cities or tourist destinations). The travel domain is divided into eight &quot;situations&quot;: a general situation (including everyday conversation); transportation; accommodation; sightseeing; shopping; wining, dining, and nightlife; banking and postal; and doctor and pharmacy.</Paragraph> <Paragraph position="3"> We created a corpus for this domain, and divided it into a development set of 7,000 expressions, and a separate, unseen test set of 5,000 expressions. 
The development set is used for creation and refinement of the translation knowledge sources, and the test set is only used for evaluations. (Each evaluation uses a new, random 500-word sample from the 5,000-word test set.) The corpus was balanced to illustrate the widest possible variety of types of words, phrases, syntactic structures, semantic patterns, and pragmatic functions. The average length of the expressions in the corpus is 6.5 words. Some examples from the development corpus are shown below. Even though this domain might seem rather limited, it still contains many challenges for machine translation.</Paragraph> <Paragraph position="4"> * Can I have your last name, please ? * Is this the bus for Shinagawa station ? * I would like to make a reservation for two people for eight nights.</Paragraph> <Paragraph position="5"> * Can you tell us where we can see some Buddhist temples ? * Most supermarkets sell liquor.</Paragraph> <Paragraph position="6"> * Can you recommend a good Chinese restaurant in this area ? * I'd like to change 500 Dollars in traveller's checks into Yen.</Paragraph> <Paragraph position="7"> * Are there any English-speaking doctors at the hospital?</Paragraph> </Section> <Section position="5" start_page="1031" end_page="1032" type="metho"> <SectionTitle> 3 NLP Infrastructure </SectionTitle> <Paragraph position="0"> The prototype implementation is constructed out of components that are based on a powerful infrastructure for natural language processing and language engineering. The three main aspects of this infrastructure are the Grammar Programming Language (GPL), the GPL compiler, and the GPL runtime environment.</Paragraph> <Section position="1" start_page="1031" end_page="1031" type="sub_section"> <SectionTitle> 3.1 The Grammar Programming Language </SectionTitle> <Paragraph position="0"> The Grammar Programming Language (GPL) is an imperative programming language for feature-structure-based rewrite grammars. 
GPL is a formalism that allows the direct expression of linguistic algorithms for parsing, transfer, and generation. Some ideas in GPL can be traced back to Tomita's pseudo-unification formalism (Tomita 1988), and to Lexical-Functional Grammar (Dalrymple et al. 1995). GPL includes variables, simple and complex tests, and various manipulation operators. GPL also includes control flow statements including if-then-else, switch, iteration over sub-feature-structures, and other features. An example of a simplified GPL rule for English generation is shown in Figure 1.</Paragraph> <Paragraph position="1"> WH_SENT --&gt; NP YN_SENT { !exist[$m VP SUBJ WH]; local-variable WH_VP = [$m VP]; local-variable WH_PHRASE; $WH_PHRASE = find-subfstruct in $WH_VP where (?exist[$x WH]); $1 = [$WH_PHRASE SLOT-VALUE]; [$WH_PHRASE SLOT-VALUE TRACE] = '+';</Paragraph> <Paragraph position="3"/> </Section> <Section position="2" start_page="1031" end_page="1031" type="sub_section"> <SectionTitle> 3.2 The GPL Compiler </SectionTitle> <Paragraph position="0"> GPL grammars are compiled into C code by the GPL compiler. The GPL compiler was created using the Unix tools lex and yacc (Levine et al. 1990).</Paragraph> <Paragraph position="1"> For each rewrite rule, the GPL compiler creates a main action function, which carries out most of the tests and manipulations specified by the GPL statements.</Paragraph> <Paragraph position="2"> The GPL compiler handles disjunctive feature structures in an efficient manner by keeping track of sub-feature-structure references within each GPL rule, and by generating an expansion function that is called once before the action function. 
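To make the behavior of a rule like the one in Figure 1 concrete, here is a hedged Python sketch of what its compiled action function might do. Feature structures are modeled as plain nested dicts, and all names (`get_path`, `find_subfstruct`, `wh_sent_rule`) are illustrative stand-ins, not the actual C code the GPL compiler emits:

```python
# Sketch only: feature structures as nested dicts; the real compiler
# emits C code over an efficient feature-structure library.

def get_path(fs, path):
    """Follow a slot path such as ('VP', 'SUBJ', 'WH') through nested dicts."""
    for slot in path:
        if not isinstance(fs, dict) or slot not in fs:
            return None
        fs = fs[slot]
    return fs

def find_subfstruct(fs, pred):
    """Depth-first search for the first sub-feature-structure satisfying
    pred, mimicking GPL's 'find-subfstruct ... where (...)' construct."""
    if isinstance(fs, dict):
        if pred(fs):
            return fs
        for value in fs.values():
            found = find_subfstruct(value, pred)
            if found is not None:
                return found
    return None

def wh_sent_rule(m):
    """Rough analogue of the action function for the Figure 1 rule.

    Returns None when a test fails (the rule does not apply); otherwise
    lifts the WH phrase out of the sentential structure and leaves a
    trace mark behind.
    """
    # !exist[$m VP SUBJ WH]
    if get_path(m, ('VP', 'SUBJ', 'WH')) is not None:
        return None
    wh_vp = m.get('VP', {})
    # $WH_PHRASE = find-subfstruct in $WH_VP where (?exist[$x WH])
    wh_phrase = find_subfstruct(wh_vp, lambda f: 'WH' in f)
    if wh_phrase is None or 'SLOT-VALUE' not in wh_phrase:
        return None
    np = dict(wh_phrase['SLOT-VALUE'])      # $1 = [$WH_PHRASE SLOT-VALUE]
    wh_phrase['SLOT-VALUE']['TRACE'] = '+'  # [$WH_PHRASE SLOT-VALUE TRACE] = '+'
    return np, m                            # the NP and YN_SENT daughters
```

The rule either fails (so the engine can try another rule) or rewrites the WH question into a fronted NP plus a yes/no-question remainder with a trace in the extraction site, which is the usual treatment of WH movement in generation.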
The compiler also tracks variable references, and generates and tracks separate test functions for nested test expressions.</Paragraph> </Section> <Section position="3" start_page="1031" end_page="1032" type="sub_section"> <SectionTitle> 3.3 The GPL Run-time Environment </SectionTitle> <Paragraph position="0"> The result of compiling a GPL grammar is an encapsulated object that can be accessed via a public interface function. This interface function serves as the link between the compiled GPL grammars, and the various language-independent and domain-independent software engines for parsing, transfer, generation, and others. This is illustrated in Figure 2.</Paragraph> <Paragraph position="1"> The compiled GPL grammars use the feature structure library, which provides services for efficiently representing, testing, manipulating, and managing memory for feature structures. A special-purpose memory manager maintains separate stacks of memory pages for each object size. This scheme allows garbage collection that is so fast that it can be performed after every attempted GPL rule execution. In our experiments with Japanese and English parsing, we found that per-rule garbage collection reduced the overall read/write memory requirements by as much as a factor of four to six.</Paragraph> </Section> </Section> <Section position="6" start_page="1032" end_page="1033" type="metho"> <SectionTitle> 4 Source Language Analysis </SectionTitle> <Paragraph position="0"> Translation is divided into the steps of analysis, transfer, and generation. Source-language analysis is illustrated in Figure 3.</Paragraph> <Paragraph position="1"> English analysis begins with tokenization and morphological analysis, which creates a lattice that contains lexical feature structures. 
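A word lattice of this kind can be sketched minimally as a set of arcs between token-boundary nodes, each arc carrying one lexical reading. This is an illustrative data structure only (the class and field names are our own, not the system's):

```python
from dataclasses import dataclass, field

@dataclass
class Arc:
    start: int    # node index where the arc begins
    end: int      # node index where the arc ends
    surface: str  # surface string covered by the arc
    fs: dict      # lexical feature structure for this reading

@dataclass
class WordLattice:
    n_nodes: int
    arcs: list = field(default_factory=list)

    def add_arc(self, start, end, surface, fs):
        self.arcs.append(Arc(start, end, surface, fs))

    def arcs_from(self, node):
        """All readings that start at a given boundary node."""
        return [a for a in self.arcs if a.start == node]

# Single-word arcs produced by morphological analysis, one per reading.
lat = WordLattice(n_nodes=3)
lat.add_arc(0, 1, "take", {"cat": "V", "root": "take"})
lat.add_arc(1, 2, "on", {"cat": "P", "root": "on"})
# Multi-word matching adds a parallel arc spanning both tokens.
lat.add_arc(0, 2, "take on", {"cat": "V", "root": "take on"})
```

Because the multi-word reading is just another arc, downstream modules (tagging over the lattice, lattice GLR parsing) can weigh the single-word and multi-word analyses against each other uniformly.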
During multi-word matching, expressions from the multi-word lexicon (such as White House or take on) are detected in the word lattice, and new arcs with the appropriate lexical feature structures are added. Lexical ambiguity reduction reduces the number of arcs in the word lattice. This module carries out part-of-speech tagging over the lattice, and reduces the lattice to those lexical feature structures that are part of the number of best paths that represents the best speed/accuracy trade-off (currently two). This calculation is based on the usual lexical and contextual bigram probabilities that were estimated from a training corpus, but it also takes into account manual costs that can be added to lexicon entries, or to individual part-of-speech bigrams.</Paragraph> <Paragraph position="2"> The resulting reduced lattice with lexical single-word and multi-word feature structures is parsed using the GLR parsing algorithm extended to lattice input (Tomita 1986). The English parsing grammar consists of 540 GPL rules. The output is a sentential feature structure that represents the input to the transfer component.</Paragraph> </Section> <Section position="7" start_page="1033" end_page="1033" type="metho"> <SectionTitle> 5 Transfer </SectionTitle> <Paragraph position="0"> Transfer from the source-language sentential feature structure to the target-language sentential feature structure is accomplished with a hybrid rule-based and example-based method. This is illustrated in Figure 4.</Paragraph> <Paragraph position="1"> The input feature structure is passed to the linguistic transfer procedure. This consists of a rule-rewriting software engine that executes the compiled English-to-Japanese transfer grammar. The transfer grammar consists of 140 GPL rules, and its job is to specify linguistic constraints on examples, combine multiple examples, transfer information that is beyond the scope of the example database, and perform various other transformations. 
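The hybrid rule-based/example-based control flow can be sketched as a recursion over the feature structure: substitute the target side of the best matching example where one exists, and fall back to slot-by-slot transfer for material the example does not cover. Everything here (the exact-`PRED` matcher standing in for the thesaurus-based similarity search, the function names) is a hypothetical simplification, not the actual HARMONY procedure:

```python
def transfer(fs, example_db):
    """Sketch of recursive example-based transfer over a nested-dict
    feature structure. example_db is a list of (source, target) pairs."""
    best = match_example(fs, example_db)
    if best is None:
        # No example applies: transfer each slot independently
        # (a stand-in for the rule-based fallback).
        return {slot: transfer(sub, example_db) if isinstance(sub, dict) else sub
                for slot, sub in fs.items()}
    source, target = best
    result = dict(target)
    for slot, sub in fs.items():
        if slot not in source and isinstance(sub, dict):
            # Slot beyond the example's scope: transfer it recursively.
            result[slot] = transfer(sub, example_db)
    return result

def match_example(fs, example_db):
    """Toy matcher: exact match on the PRED slot. The real system uses
    a pre-compiled fast-match index plus thesaurus-based similarity."""
    for source, target in example_db:
        if source.get('PRED') == fs.get('PRED'):
            return source, target
    return None
```

For instance, translating a "reserve" predicate with a "room" object would combine a sentence-level example for the verb with a word-level example for the object, which is the kind of example combination the transfer grammar licenses.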
The overall effect is to broaden the linguistic coverage, and to raise the grammatical accuracy far beyond the level of a traditional example-based transfer procedure.</Paragraph> <Paragraph position="2"> The linguistic transfer procedure operates on the input feature structure in a recursive manner, and it invokes the example matching procedure to find the best translation example for various parts of the input. The example matching procedure retrieves the best translation examples from the example database, which contains 14,000 example pairs ranging from individual words to entire sentences. In an off-line step, the example pairs are parsed, disambiguated, and indexed for corresponding constituents using a Treebanking tool.</Paragraph> <Paragraph position="3"> At each invocation of the example matching procedure, linguistic constraints from the transfer grammar are used to limit the search space to appropriate examples. In an off-line step, these constraints are pre-compiled into a complex index that allows a preliminary fast match. Examples that survive the fast match are matched and aligned with the input feature structure (or sub-feature-structure, during recursive invocations) using the thesaurus to calculate word similarity, and using various other constraints and costs for inserting, deleting, or altering slots and features. Rather than rely on the exact distance in the thesaurus to calculate lexical similarity, we use a scheme that is based on the information content of thesaurus nodes, similar to (Resnik 1995).</Paragraph> </Section> <Section position="8" start_page="1033" end_page="1033" type="metho"> <SectionTitle> 6 Target-language Generation </SectionTitle> <Paragraph position="0"> The Japanese target-language feature structure forms the input to the generation module, which is summarized in Figure 5 below. 
This module also consists of a rule-rewriting software engine, executing the compiled GPL Japanese generation grammar, which consists of 200 GPL rules. The generator uses the Japanese lexicon to create the Japanese target-language expression.</Paragraph> </Section> </Paper>