<?xml version="1.0" standalone="yes"?> <Paper uid="C86-1029"> <Title>SOLUTIONS FOR PROBLEMS OF MT PARSER METHODS USED IN MU-MACHINE TRANSLATION PROJECT -</Title> <Section position="1" start_page="0" end_page="0" type="metho"> <SectionTitle> SOLUTIONS FOR PROBLEMS OF MT PARSER METHODS USED IN MU-MACHINE TRANSLATION PROJECT - </SectionTitle> <Paragraph position="0"/> </Section> <Section position="2" start_page="0" end_page="0" type="metho"> <SectionTitle> 1. Introduction </SectionTitle> <Paragraph position="0"> A parser is a key component of a machine translation (MT) system. If it fails to parse an input sentence, the MT system cannot output a complete translation. A parser for a practical MT system must solve many problems caused by the varied characteristics of natural languages. Some problems are caused by the incompleteness of grammatical rules and dictionary information, and some by the ambiguity of natural languages. Others are caused by various types of sentence constructions, such as itemization, insertion by parentheses, and other typographical conventions that cannot be naturally captured by ordinary linguistic rules.</Paragraph> <Paragraph position="1"> The authors of this paper have been developing MT systems between Japanese and English (in both directions) under the Mu-machine translation project [NAGAO 85]. During the systems' development, several methods have been implemented in the grammar writing language GRADE [NAKAMURA 84] to solve the problems of the MT parser. In this paper, the characteristics of GRADE and the Mu-MT parser are first briefly described. Then, methods to solve the MT parsing problems that are caused by the varieties of sentence constructions and the ambiguities of natural languages are discussed from the viewpoint of efficiency and maintainability.</Paragraph> </Section> <Section position="3" start_page="0" end_page="133" type="metho"> <SectionTitle> 2. 
Characteristics of GRADE and Mu-MT parser </SectionTitle> <Paragraph position="0"> The Mu-project has been developing two MT systems based on the transfer approach: a Japanese-to-English MT system [TSUJII 84] [NAGAO 84, 85, 86], and an English-to-Japanese MT system [TSUJII 85]. All grammars in the project, including the Japanese and English analysis grammars, are written in the same grammar writing language, GRADE [NAKAMURA 84], which has the following features: 1) Each grammatical rule is expressed by a Tree-to-Tree transformation rewriting rule based on a flexible pattern matching algorithm.</Paragraph> <Paragraph position="1"> 2) A 'subgrammar' method allows for the division of the whole grammar into several parts, and for the detailed control of the translation process.</Paragraph> <Paragraph position="2"> A subgrammar is a set of rewriting rules and the whole grammar is expressed as a network of such subgrammars (subgrammar-network).</Paragraph> <Paragraph position="3"> 3) The parallel execution of a subgrammar's rewriting rules makes it possible to collect all possible interpretations of a specified part of an input sentence, and to compare them in order to select the most feasible one.</Paragraph> <Paragraph position="4"> Making use of these features, the English analysis grammar of the Mu-project is roughly divided into the following three subgrammars (SG) [TSUJII 85]: 1) pre-analysis SG, which analyzes constructions such as itemized forms and insertion by parentheses which cannot be treated by ordinary grammatical rules, and divides the input sentence into several fragments.</Paragraph> <Paragraph position="5"> 2) main-analysis SG, which performs syntactic and semantic analysis in the usual sense.</Paragraph> <Paragraph position="6"> 3) post-analysis SG, which combines the fragments of sentences divided by the pre-analysis SG.</Paragraph> <Paragraph position="7"> Solutions to the efficiency and maintainability problems of an MT parser are described in the 
following sections with examples from the English analysis grammar.</Paragraph> <Paragraph position="8"> 3. Problems caused by Constructions of Sentences Constructions of sentences, such as itemized forms, cause serious problems for MT parsers. There are many such exceptional constructions in written texts, especially in the abstracts of scientific and technological papers which the Mu-project aims to translate. (The current corpus of the project consists of abstracts extracted from the INSPEC database without any pre-editing.) The following is a typical example: Four major factors affect the cost of ownership: 1) purchase price, 2) investment tax credits, 3) cost of maintenance and repairs and 4) efficiency costs. --- (1) This type of construction can be handled by the rules of the context free grammar (CFG) shown in figure 1. However, if rewriting rules to handle such sentence constructions are added to the analysis grammar, the following problems arise: a) Deterioration of analysis efficiency: The number of rewriting rules which need not be referred to in a structural analysis increases. Since CFG parsers cannot distinguish the rewriting rules for the structural analysis from the rewriting rules for the typographical sentence construction analysis, the increase in the total number of rules reduces the efficiency of the analysis.</Paragraph> <Paragraph position="9"> b) The loss of useful heuristics for correct analysis: For example, the fact that each item in an itemized form can be analyzed independently is a useful heuristic for the analysis grammar. To utilize such heuristics, the recognition of global sentence structures should precede the detailed structural analysis. 
Such heuristics, however, cannot be utilized effectively in an analysis based on CFG.</Paragraph> <Paragraph position="10"> c) Deterioration of the maintainability of the analysis grammar: It becomes difficult to distinguish rules concerned with particular text types (e.g., abstracts of scientific papers, newspaper articles, etc.) from rules concerned with more general linguistic phenomena. In contrast to such an approach, Tree-to-Tree rewriting rules and the use of subgrammar-networks to control the parsing, both features of GRADE, allow the MT parser to analyze such sentential forms without the deterioration of efficiency and maintainability. The analysis procedure for an itemized form, for example, is the following: 1) First, the fragments of an input sentence are separated from each other by a tree structure pattern such as that shown in figure 2. In this example, one fragment is the core part of the sentence and the others are itemized parts.</Paragraph> <Paragraph position="11"> 2) Each fragment is analyzed independently by the main-analysis subgrammar.</Paragraph> <Paragraph position="12"> 3) Finally, fragmental results are integrated into the whole analysis result by the post-analysis subgrammar.</Paragraph> <Paragraph position="13"> [Figure 2, panel (c): Correspondences between the GRADE pattern and fragments of the sample sentence (1); pattern to extract adequate fragments from the itemized form] In this way, the grammatical rules for analyzing sentence constructions are placed in the pre-analysis subgrammar and are separated from the main-analysis subgrammar, where the syntactic and semantic analysis of the input sentences is performed. This separation avoids the degradation of both analysis efficiency and the grammar's maintainability, both serious problems in parsers based on CFG.</Paragraph> </Section> <Section position="4" start_page="133" end_page="184" type="metho"> <SectionTitle> 4. 
Ambiguity Problems </SectionTitle> <Paragraph position="0"> The ambiguity of natural language is one of the biggest problems for an MT parser. Because of ambiguity, the MT parser often outputs a wrong analysis result, or exhausts computer time and memory. Knowledge bases combined with context analysis and an inference mechanism might enable disambiguation in the parsing of natural language. Nevertheless, current knowledge engineering technology has not yet produced large-scale, high-quality knowledge bases usable by a practical MT parser.</Paragraph> <Paragraph position="1"> Furthermore, disambiguation based on such knowledge bases is itself still a research problem for which we do not yet have any concrete solutions. Since the MT parser has to analyze fairly long sentences containing various sorts of constructions, such as long conjoined phrases, appositions, ellipses, etc., it has to use heuristic methods for disambiguation to determine the priorities of the possible syntactic and semantic interpretations produced for an input sentence.</Paragraph> <Paragraph position="2"> Since the analysis methods based on CFG usually handle each interpretation independently, the following two methods are typically useful for determining the priorities of such interpretations: a) First obtaining all possible interpretations of the input sentence and then comparing them.</Paragraph> <Paragraph position="3"> Note that the rules for such a comparison have to compare different interpretations (tree structures) and cannot be CFG rules.</Paragraph> <Paragraph position="4"> b) Heuristically adjusting the order of application of CFG rewriting rules, for example, to obtain just one interpretation from the input sentence and to ignore the others [NAGAO 82].</Paragraph> <Paragraph position="5"> Unfortunately, the first method lowers parsing efficiency because of the inherent 'combinatorial explosion'. It often exhausts the limited computer time and memory and still gives no interpretation. 
The second method makes it difficult to maintain an adequate priority ordering of rewriting rules.</Paragraph> <Paragraph position="6"> This difficulty increases with any increase in the number of rules in the grammar as a whole.</Paragraph> <Paragraph position="7"> However, many kinds of ambiguity can be resolved by adopting the control mechanism provided by GRADE's subgrammars and the 'procedural analysis' method [TSUJII 84]. In this section, we will focus on how to disambiguate the interpretation of function words like 'after', 'as', and 'for', which can be used both as prepositions and as (sentential) conjunctions. These function words bring about the following two problems: 1) Processing efficiency: there are certain kinds of ambiguities which may be resolved automatically when we use a CFG-based parser augmented by simple semantic checking. For example, 'as' and 'after' in the following sentences are used as prepositions and not as conjunctions: Remarkably, the printed board can be executed as a one-sided or a double-sided unit. ---(2) The solderability of reflowed tin and 40 percent lead coated copper has been examined after thermal aging designed to include extensive copper-tin intermetallic compound growth. ---(3) However, a lot of computer time and memory is necessary, because the number of possible partial interpretations increases combinatorially. For the correct interpretation of (3), we need a semantic constraint such as &quot;the agent of 'to design' should be a human&quot;. However, such a semantic constraint is more a preference than a real constraint.</Paragraph> <Paragraph position="8"> 2) Ambiguous interpretation: for correct disambiguation, a complete semantic and contextual analysis is necessary. 
The word 'for' in the following sentence is ambiguous.</Paragraph> <Paragraph position="9"> Many opportunities occur for contractors to obtain electrical maintenance work in factories.</Paragraph> <Paragraph position="10"> ---(4) This sentence has the following two possible syntactic structures: Many opportunities occur [S for contractors [ to obtain [NP electrical maintenance work in factories.]]] ---(4.1) Many opportunities occur for [S [NP contractors to obtain electrical maintenance] [VP work in factories.]] ---(4.2) The dominant reading is (4.1), but we cannot reject (4.2).</Paragraph> <Paragraph position="11"> Both of these problems must be solved by an MT parser. The efficiency problem can be solved by adopting the '3 stage procedural analysis' as follows [YAMAMOTO 86]: Step 1. Disambiguation by simple but reliable cues.</Paragraph> <Section position="1" start_page="184" end_page="184" type="sub_section"> <SectionTitle> Rules for disambiguating parts-of-speech are applied. </SectionTitle> <Paragraph position="0"> For example, it is tentatively determined that 'as' in (2) is a preposition by simple but rather reliable cues such as: &quot;If there are no verbs after the ambiguous function word, its part-of-speech is a preposition.&quot; Such a rule is easily expressed by using the flexible pattern matching functions of GRADE.</Paragraph> <Paragraph position="1"> Step 2. Disambiguation based on intermediate analysis result.</Paragraph> <Paragraph position="2"> 'After' in (3) cannot be disambiguated in step 1. The ambiguity is resolved in this step by the following rule: &quot;If the word sequence after the ambiguous function word is to be analyzed as a sentence, the word is a conjunction; if the phrase is a noun phrase, then the word is a preposition.&quot; For (3), first, the word sequence 'thermal ...</Paragraph> <Paragraph position="3"> growth' after 'after' is extracted and analyzed completely. 
The grammar for complete analysis, written as a subgrammar network, is called from this rule. For this sentence, the word 'after' is determined to be a preposition, because the word sequence following 'after' is analyzed as a noun phrase (Figure 3).</Paragraph> <Paragraph position="4"> [Figure 3: partial analysis of '... examined after thermal ... growth.']</Paragraph> <Paragraph position="5"> This kind of top-down processing is easily realized by the pattern matching functions and the invocation of subgrammar-networks.</Paragraph> <Paragraph position="6"> Step 3. Complete Analysis.</Paragraph> <Paragraph position="7"> After step 2, each word (or phrase) is no longer (syntactically) ambiguous. The complete analysis of the sentence therefore becomes straightforward. However, the determinations of parts-of-speech in step 1 and step 2 are tentative, and it sometimes happens that complete analyses cannot be obtained because wrong decisions have been made. In such cases, these tentative decisions may be changed in step 3.</Paragraph> <Paragraph position="8"> During the construction of tentative interpretations in step 1 and step 2, various sorts of heuristic rules are applied; these are ordered according to their relative 'strengths'. The part-of-speech interpretation of 'for' in (4), for example, shows that a heuristic rule based on surface syntactic cues such as &quot;If there is a word sequence '... for NP to verb ...', the 'for' is a preposition which marks the subject of the infinitive clause.&quot; is useful. 
This kind of heuristic rule is useful for choosing the most feasible interpretation, even if there are several syntactically and semantically possible interpretations.</Paragraph> <Paragraph position="9"> The '3 stage procedural analysis', in which bottom-up processes (step 1) and top-down processes (step 2) are carefully combined, can be implemented straightforwardly by utilizing the rich control schemes provided in GRADE.</Paragraph> </Section> </Section></Paper>