<?xml version="1.0" standalone="yes"?> <Paper uid="C80-1066"> <Title>RUSSIAN-FRENCH AT GETA : OUTLINE OF THE METHOD AND DETAILED EXAMPLE</Title> <Section position="1" start_page="0" end_page="0" type="metho"> <SectionTitle> RUSSIAN-FRENCH AT GETA : OUTLINE OF THE METHOD AND DETAILED EXAMPLE </SectionTitle> <Paragraph position="0"> Ch. BOITET and N. NEDOBEJKINE</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="metho"> <SectionTitle> GETA, UNIVERSITY OF GRENOBLE </SectionTitle> <Paragraph position="0"> F-38041 GRENOBLE-CEDEX 53 Introduction The original version of this paper is very detailed. Space limitations in the COLING proceedings have forced us to reduce it by a factor of five. The more detailed version has been submitted for publication in &quot;Linguistics&quot;. This paper presents the computer models and linguistic strategies used in the current version of the Russian-French translation system developed at GETA, within the framework of several other applications developed in parallel on the same computer system. This computer system, called ARIANE-78, offers linguists not trained in programming an interactive environment, together with specialized metalanguages in which they write the linguistic data and procedures (essentially, dictionaries and grammars) used to build translation systems. In ARIANE-78, translation of a text occurs in six steps: morphological analysis, multilevel analysis, lexical transfer, structural transfer, syntactic generation, morphological generation. To each step corresponds a computer model (nondeterministic finite-state string-to-tree transducer, tree-to-tree transducer, ...), a metalanguage, a compiler and execution programs. 
The units of translation are not sentences, but rather one or several paragraphs, so that the context usable, for instance to resolve anaphora, is larger than in other second-generation systems.</Paragraph> <Paragraph position="1"> As ARIANE-78 is independent of any particular application, we begin by presenting its main features in Part I. Some of them are standard in second-generation systems, while others are original. Among the latter, we emphasize the multilingual aspect of the system, which is unique, the very powerful control structures embodied in the supported computer models (non-determinism, parallelism, heuristic programming), and its interactive data-base aspect.</Paragraph> <Paragraph position="2"> In the second and larger part, we successively describe each step of this Russian-French application. We first present the underlying computer model (there are 4 of them, as the second, third and fourth step use the same one), then the organization of the linguistic data. A short text is used throughout as a standard example. Examples of translations of larger texts appear at the end.</Paragraph> <Paragraph position="3"> I - Current GETA translation system The computer system ARIANE-78, together with appropriate linguistic data, constitutes a multilingual automated translation system.</Paragraph> <Paragraph position="4"> The system is a rather sophisticated second-generation system. It relies on classical as well as more original principles.</Paragraph> <Paragraph position="5"> 1. Classical second-generation principles Intermediate structures The process of translating a text from a &quot;source&quot; language into a &quot;target&quot; language is split up into three main logical steps, as illustrated below: analysis, transfer and generation. The output of the analysis is a &quot;structural descriptor&quot; of the input text, which is transformed into an equivalent structural descriptor in the target language by the transfer phase. 
This target structural descriptor is then transformed into the output text by the generation phase. Essential to our conception is the fact that analysis is performed independently of the target language(s). The &quot;deeper&quot; the analysis, the shorter the distance between the two structural descriptors. Ideally, one could imagine a &quot;pivotal&quot; level, at which they would be the same.</Paragraph> <Paragraph position="6"> In the past, Prof. Vauquois' team tried a slightly less ambitious possibility [Vauquois, 1975], namely to use a &quot;hybrid&quot; (Shaumjan) pivot language, where the lexical units are taken from a natural language, so that the transfer phase is reduced to a lexical transfer, without any structural change. As it is not always possible, or even desirable, to reach this very abstract level, one may choose not to go all the way up the mountain and to stop somewhere in the middle. This is why we call our structural descriptors &quot;intermediate structures&quot;. Note that ARIANE-78 imposes nothing of that kind: both extremes are still possible, and in fact the linguistic teams have agreed on &quot;multilevel&quot; intermediate structures which contain very deep as well as low-level types of information, ranging from logical relations to traces (see details below).</Paragraph> <Paragraph position="7"> Separation of programs and linguistic data The second classical principle is to offer metalanguages, in order to keep the particular linguistic data (grammars, dictionaries) separated from the programs.</Paragraph> <Paragraph position="8"> For instance, dictionary look-up is a standard function, which should not be modified in any way when a new language is introduced in the system. This separation also corresponds to a division of work and enhances transparency: dictionary look-up may be optimized by the programmers without the linguistic users ever being aware of it. 
The same goes for more complex functions, like pattern-matching in tree-manipulating systems. In these metalanguages, linguists work directly with familiar concepts, like grammatical variables, classes, dictionaries and grammars. The grammar rules are rules of some formal model (context-free, context-sensitive, transduction rules). That is, one may also consider such metalanguages as very high-level algorithmic languages offering complex data types and associated operators. Although this principle of separation has been criticized as imposing too much &quot;rigidity&quot; on the users, critics have failed to understand that this is only the case when the metalanguages are not adequate.</Paragraph> <Paragraph position="9"> A good comparison may be found in classical programming, where, for example, the compiler and run-time package of PL/I are separated from programs written in PL/I in exactly the same sense.</Paragraph> <Paragraph position="10"> Semantics by features The third classical principle touches semantics. In second-generation MT systems, semantics may only be expressed by the use of features (concrete, abstract, countable, ...), which are exactly like grammatical features. The theoretical framework is that of a formal language, with a syntax describing the combination rules of the language units. There is no direct way, for instance, to relate two lexical units. For this to be possible, there should be a (formalized) domain, possibly represented as a thesaurus, and rules of interpretation. However, this limitation may be partially overcome in ARIANE-78's lexical transfer step. Note also that semantic features may be extremely refined for some limited universe, and give surprisingly good results [TAUM-METEO, 1975].</Paragraph> <Paragraph position="11"> 2. 
Principles proper to GETA's system We relate them to the three main principles set out above.</Paragraph> <Paragraph position="12"> Intermediate structures In ARIANE-78, we split up each of the three main phases into two steps. This is essentially for algorithmic as well as for linguistic reasons. Morphological analysis, lexical transfer and morphological generation are undoubtedly very much simpler than the other steps, and it has seemed reasonable and linguistically motivated to keep them separate and to use simpler algorithmic models to realize them.</Paragraph> <Paragraph position="13"> However, this might not be the case in other environments, for example if the input were very noisy (oral input).</Paragraph> <Paragraph position="14"> ARIANE-78 uses a single kind of data structure to represent the unit of translation from morphological analysis to morphological generation, namely a complex labeled tree structure: each node of such a tree bears a value for each of the &quot;grammatical variables&quot; used in the current step.</Paragraph> <Paragraph position="15"> GETA's system is multilingual by design: an analysis cannot explicitly use information from the target language, and generation is likewise independent of the source language.</Paragraph> <Paragraph position="16"> Moreover, in a given user space, ARIANE-78 ensures the coherence of the linguistic data written to construct a multilingual application. Computer environment The principle of separation of programs and linguistic data is strictly observed in our system. An additional feature is that it offers several algorithmic models designed to be of maximal adequacy and generality as well as of minimal computational complexity.</Paragraph> <Paragraph position="17"> Functions of an integrated MT system include preparation of the linguistic data, management of the corpora and execution of the linguistic data over texts. 
ARIANE-78 provides a conversational environment for these functions, hiding implementation chores from the user. It also includes a specialized data-base management system for the texts and the linguistic files.</Paragraph> <Paragraph position="18"> Semantics Semantic features may be declared as normal grammatical features in each step. At lexical transfer, the linguist may relate several source and target lexical units, these relations being elaborated in the succeeding structural transfer phase. This is however certainly not sufficient to call the system &quot;third generation&quot;. 3. Organization of the translation process Overall schema The schema below shows the different steps of the translation process. The components of ARIANE-78 implementing the 4 different algorithmic models appear within circles; they are linked by double lines to rectangles corresponding to the linguistic data written in the associated metalanguage for the indicated step. Simple arrows indicate the flow of control.</Paragraph> <Paragraph position="19"> Organization of a step In each step, the linguistic data may be of four kinds: grammatical variables (like gender, number, semantic type), classes, describing useful combinations of values of variables, dictionaries, and grammars, containing the rules and the strategy to use them.</Paragraph> <Paragraph position="22"> They are expressed in a metalanguage. Their syntax and coherence are first checked by the corresponding compiler, which generates a compact intermediate code. At run-time, this code is interpreted by standard &quot;execution programs&quot;. This approach separates the linguistic and algorithmic problems, and makes debugging and maintenance much easier.</Paragraph> <Paragraph position="23"> The complete system is operational on IBM-compatible machines under VM/CMS. 
ARIANE is the name of the interactive monitor interfacing with the user.</Paragraph> <Paragraph position="24"> For more explanations about our terminology and our &quot;intermediate structures&quot;, see [15, 22, 23].</Paragraph> <Paragraph position="25"> II - An application to Russian-French translation We will use a short text as our standard example. Note that the usual translation units are not sentences, but rather paragraphs. We use an unambiguous Latin transcription.</Paragraph> <Paragraph position="26"> Input text</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> * SFORMULIROVAN PRINCIP, S POMOTHQYU KOTOROGO OPREDELYAETSYA KRITERIJ, PRIGODNYIJ DLYA NELINEJNOJ TERMODINAMIKHESKOJ SISTEMYI. </SectionTitle> <Paragraph position="0"> (A principle has been defined, with whose help one defines a criterion useful for the non-linear thermodynamic system).</Paragraph> <Paragraph position="1"> 1. Morphological analysis The grammar, classes and dictionaries are written in the ATEF formalism [1, 8, 10, 19]. The strategy of the analyzer has been described in [16]. Its output is a &quot;flat tree&quot; with standard structure and with leaves labelled by the masks of variables computed by the analyzer.</Paragraph> <Paragraph position="3"/> </Section> <Section position="4" start_page="0" end_page="442" type="metho"> <SectionTitle> 2. Multilevel analysis </SectionTitle> <Paragraph position="0"> This part is the most difficult. It is written in ROBRA [5, 6, 7, 8, 12], a general tree-transducer system. In order to build a whole transformational system, the linguist writes transformational rules (TR) and groups them in transformational grammars (TG). When a TG is applied to an object tree, all compatible occurrences of its TR are executed in parallel.</Paragraph> <Paragraph position="1"> The overall flow of control is described in the control graph. 
Using a built-in backtracking algorithm, ROBRA finds the first possible traversal of the control graph leading to an exit (&amp;NUL symbol), thereby applying each traversed TG to the object tree.</Paragraph> <Paragraph position="2"> Rules are grouped in grammars when they correspond to related linguistic phenomena, or when they express transformations used for a certain logical step of the linguistic process (here, multilevel analysis) or, more strategically, when they share the same execution modes (e.g., iterative rules will appear in &quot;exhaustive&quot; grammars, others in &quot;unitary&quot; grammars). This architecture makes it possible to limit the interaction between rules and avoid many combinatorial problems, to develop strategies and heuristics, and to test and modify TGs separately (different trace parameters may be associated with each TG).</Paragraph> <Paragraph position="3"> Let us now give the control graph used in multilevel analysis of Russian, with some [...] Its aim is to homogenize and to simplify the input tree.</Paragraph> <Paragraph position="4"> DGR is used only when there is an analytic expression of degree, to represent it synthetically (NG variable). ENON-ENON1-ENON2: these 3 grammars break down the sentences into textually marked &quot;utterances&quot;. Commas, unambiguous conjunctions and relative pronouns ... 
are used.</Paragraph> <Paragraph position="5"> GN1 builds simple nominal groups like Adj + N or Prep + N or Num + N.</Paragraph> <Paragraph position="6"> GN2 looks for further elements in the nominal groups, and resolves certain ambiguities.</Paragraph> <Paragraph position="7"> RLT looks for the nominal antecedents of relative and participial clauses constructed by ENON2.</Paragraph> <Paragraph position="8"> SN searches for a personal verb as main element of the utterance, and for verbal modifiers, like negative and conditional particles or auxiliaries.</Paragraph> <Paragraph position="9"> SN2 tries to resolve the ambiguity between adverb and short-form adjective, and builds embedded nominal groups.</Paragraph> <Paragraph position="10"> MARQ builds all types of subordinate verbal and infinitive clauses. It further tries to resolve the previous ambiguity.</Paragraph> <Paragraph position="11"> AMB searches for the most important terms of the clause (subject, object, near dative), thereby resolving ambiguities between subject and object, adjectives and adverbs, etc.</Paragraph> <Paragraph position="12"> NALF treats non-alphabetical forms as appositions or verbal complements.</Paragraph> <Paragraph position="13"> CASC handles all genitive imbrications, by (provisionally) attaching dominated groups to non-ambiguous groups.</Paragraph> <Paragraph position="14"> PHR marks all strongly governed groups subordinated to the utterance with logical relations such as agent, patient, attribute... If possible, this is also done on dependent groups.</Paragraph> <Paragraph position="15"> CIRC and GEm4 distribute prepositional and genitive nominal groups between their noun heads, according to several syntactic and semantic criteria.</Paragraph> <Paragraph position="16"> ELID searches for antecedents of pronominal expressions and isolated adjectives, and builds noun groups by copying the lexical unit of the antecedent. 
If the elliptic element is not a personal pronoun, it becomes qualifier or determiner according to its syntactic class. The syntactic and logical functions of the new group are computed.</Paragraph> <Paragraph position="17"> SUBCORD is purely tactical (modification of the hierarchy of certain subordinate clauses).</Paragraph> <Paragraph position="18"> FTR copies certain information from non-terminals onto terminal &quot;head&quot; nodes, to prepare for lexical transfer.</Paragraph> <Paragraph position="19"> We now give the result of the multilevel analysis of our standard example. Note that node 5 (noun group with head node 6 &quot;PRINCIP&quot;) has correctly been given syntactic function subject and logical relation patient (A2). Syntactic functions of non-terminals appear as auxiliary lexical units (UL).</Paragraph> <Paragraph position="20"> [...] essentially includes a bilingual multichoice dictionary of &quot;transfer rules&quot; accessed by the UL. Each rule is a sequence of triplets (condition, image subtree, assignments), the last condition being empty (true).</Paragraph> <Paragraph position="21"> The automaton traverses the input in preorder, creating the object tree as follows. The UL of the current node is used to access the dictionary. The first triplet of the item whose condition is satisfied is chosen. The image subtree (generally consisting of only one node) is added to the output, with values of variables computed by the assignment part.</Paragraph> <Paragraph position="22"> Hence, the output tree is very similar to the input tree. The possibility of transforming one input node into an output subtree may be used to create compound words or to create auxiliary nodes used in the following step (structural transfer) to treat idioms.</Paragraph> <Paragraph position="23"> As this model is algorithmically very simple, it is the only one where no trace is provided. 
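
The traversal described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the names Node, TransferRule and lexical_transfer, the toy dictionary and its French glosses are hypothetical and do not reproduce the ARIANE-78 metalanguage, and the image subtree is reduced to a single node.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Node:
    ul: str                                   # lexical unit (UL) carried by the node
    variables: Dict[str, str] = field(default_factory=dict)
    children: List["Node"] = field(default_factory=list)


@dataclass
class TransferRule:
    condition: Callable[[Node], bool]         # test on the source node
    image_ul: str                             # one-node image subtree (simplification)
    assign: Callable[[Node], Dict[str, str]]  # computes variables of the target node


def always(_node: Node) -> bool:
    """Empty condition: always true, so the last triplet of an item always applies."""
    return True


def copy_vars(node: Node) -> Dict[str, str]:
    """Assignment part used here: copy all variables of the source node."""
    return dict(node.variables)


def lexical_transfer(node: Node, dictionary: Dict[str, List[TransferRule]]) -> Node:
    """Preorder traversal: translate the current node, then its children in order."""
    # Hypothetical fallback: ULs absent from the dictionary are copied unchanged.
    default = [TransferRule(always, node.ul, copy_vars)]
    rules = dictionary.get(node.ul, default)
    # Choose the first triplet whose condition holds on the source node.
    chosen = next(rule for rule in rules if rule.condition(node))
    target = Node(chosen.image_ul, chosen.assign(node))
    target.children = [lexical_transfer(child, dictionary) for child in node.children]
    return target


# Toy dictionary (hypothetical entries, for illustration only).
TOY_DICTIONARY = {
    "PRINCIP": [TransferRule(always, "PRINCIPE", copy_vars)],
    "SISTEMA": [
        TransferRule(lambda n: n.variables.get("SEM") == "ABSTRACT", "REGIME", copy_vars),
        TransferRule(always, "SYSTEME", copy_vars),
    ],
}

source = Node("ENONCE", {}, [Node("PRINCIP", {"NUM": "SG"}), Node("SISTEMA", {"NUM": "SG"})])
result = lexical_transfer(source, TOY_DICTIONARY)
print([child.ul for child in result.children])
```

Because each item ends with a triplet whose condition is empty, the selection never fails, and the output tree has the same shape as the input tree whenever every image is a single node.
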
The example below gives an idea of the metalanguage of the dictionary.</Paragraph> <Paragraph position="25"> &quot;0(1,2)&quot; describes the image subtree for &quot;NAPRIMER&quot;. The other ones are reduced to one node (default). &quot;+VBF\]&quot; says that the non-null values of variables in format VBF\] will be copied into the target node. RFPF is an assignment procedure. &quot;~PP&quot; says that all variables of format PP (except the UL) will be copied onto node 1.</Paragraph> <Paragraph position="26"> The following structure is the result of this step on our standard example.</Paragraph> <Paragraph position="27"> [Tree figure; legible node labels: 1. &quot;TEXTE&quot;, 3. &quot;ENONCE&quot;, 4. FORMULER, 5. &quot;SUJET&quot;, 6. PRINCIPE, 7. &quot;ENONCE&quot;, 11. DEFINIR, 12. &quot;SUJET&quot;]</Paragraph> </Section> <Section position="5" start_page="442" end_page="442" type="metho"> <SectionTitle> 4. Structural transfer </SectionTitle> <Paragraph position="0"> The algorithmic component used in this step is again ROBRA, which has been very briefly presented in 2. The aim of this step is to realize all transformations of contrastive nature, so as to produce the desired intermediate target structure as output.</Paragraph> <Paragraph position="1"> The following gives the control graph of the TS written for this step in the current version of our translation system.</Paragraph> <Paragraph position="2"> PRL handles idioms, predicted in lexical transfer by generating auxiliary subtrees. 
It checks whether predicted idioms are present and takes appropriate action.</Paragraph> <Paragraph position="3"> RECOP copies certain information (required mode, type of adjective, postponed preposition, inversion of arguments) from terminal &quot;head&quot; nodes onto their fathers.</Paragraph> <Paragraph position="4"> RCTF handles non-standard government, particular uses of &quot;DE&quot;, erases some prepositions, takes care of passive-active transformations, etc.</Paragraph> <Paragraph position="5"> EFFAC erases remaining auxiliary nodes generated in TL (idioms, non-standard prepositions).</Paragraph> <Paragraph position="6"> ACTL handles particular idiom translations, like &quot;ESLI + Inf&quot; → &quot;SI ON + Present&quot;, etc. QUALD handles actualization and qualifiers (modes, tenses, determination...), and generates the correct order in nominal groups.</Paragraph> <Paragraph position="7"> ART uses the remaining designators to compute the determination of nominal groups.</Paragraph> <Paragraph position="8"> DERV handles derivations (-ANT, -EUR, -ITE, etc.), negation (NON, PEU, IN...), prefixes and others. DTM makes the final computation of determination of noun groups.</Paragraph> <Paragraph position="9"> As we see, structural transfer is relatively simple in this version. However, many improvements are planned in our future version. The result of this step is given below.</Paragraph> <Paragraph position="10"> Note the modification of order in the last nominal group, as well as the generation of the impersonal &quot;ON&quot;.</Paragraph> <Paragraph position="12"> [Tree figure; legible node labels: 1. &quot;TEXTE&quot;, 2. &quot;ULFRA&quot;, 3. &quot;ENONCE&quot;, 4. ON, 7. PRINCIPE, 8. &quot;ENONCE&quot;] [...] algorithmic component. The aim of this step is to produce a tree structure where the terminal nodes contain all the information necessary for generating the output text, and to give the final surface order of the words. 
This is a constraint imposed by the nature of the algorithmic component SYGMOR, used in the last step.</Paragraph> <Paragraph position="13"> RC copies variables from head nodes onto their fathers, and checks for number and gender correctness. AC1 handles noun coordination, place of subject, and generation of preposition before infinitive, or of periphrases. ADJ handles agreement in gender and number between nouns, adjectives and articles. RELATIF chooses the relative pronoun (DONT, QUI, LEQUEL). AC2 handles homographs and noun ellipses. ART generates the correct article (UN, LE), and ART2 reflexive pronouns, auxiliary verbs, negations (NE...PAS, NON, IN-) and special punctuation marks to present alternate translations in case of doubt. ULZERO is strategic.</Paragraph> </Section> <Section position="6" start_page="442" end_page="442" type="metho"> <SectionTitle> 1. &quot;TEXTE&quot; 2. &quot;ULFRA&quot; 3. &quot;ENONCE&quot; </SectionTitle> <Paragraph position="0"/> </Section> <Section position="7" start_page="442" end_page="442" type="metho"> <SectionTitle> 6. Morphological generation </SectionTitle> <Paragraph position="0"> This is the last step of the translation process. Words of the output text are generated.</Paragraph> <Paragraph position="1"> Some facilities must be provided by the algorithmic component, SYGMOR, to handle elisions and contractions.</Paragraph> <Paragraph position="2"> SYGMOR realizes the composition of two transducers: the first, &quot;tree-to-string&quot;, produces the frontier of the object tree; the second transforms this string (of masks of variables) into a string of characters, under the control of the linguistic data. 
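
The composition of the two transducers can be sketched in Python as follows. The tree encoding, the toy table of surface forms, the "*" error markers and the function names are assumptions made for illustration; the real second transducer is driven by dictionaries and a grammar, not by a lookup table.

```python
from typing import Dict, List, Tuple

# A tree is (mask, children); a mask is a dict of variable values (hypothetical encoding).
Mask = Dict[str, str]
Tree = Tuple[Mask, list]


def frontier(tree: Tree) -> List[Mask]:
    """First transducer (tree-to-string): the left-to-right sequence of leaf masks."""
    mask, children = tree
    if not children:
        return [mask]
    leaves: List[Mask] = []
    for child in children:
        leaves.extend(frontier(child))
    return leaves


# Toy linguistic data: surface forms indexed by (UL, number). Hypothetical entries.
FORMS = {
    ("LE", "SG"): "LE",
    ("PRINCIPE", "SG"): "PRINCIPE",
    ("PRINCIPE", "PL"): "PRINCIPES",
}


def realize(masks: List[Mask]) -> str:
    """Second transducer: string of masks to string of characters."""
    words = []
    for mask in masks:
        key = (mask["UL"], mask.get("NUM", "SG"))
        # Hypothetical error convention: unknown units are printed between markers.
        words.append(FORMS.get(key, "*" + mask["UL"] + "*"))
    return " ".join(words)


def generate(tree: Tree) -> str:
    """Composition of the two transducers."""
    return realize(frontier(tree))


tree = ({"UL": "GN"}, [({"UL": "LE", "NUM": "SG"}, []), ({"UL": "PRINCIPE", "NUM": "SG"}, [])])
print(generate(tree))
```

Only the leaves contribute to the output string, which is why the preceding step must place all the information needed for generation on the terminal nodes and fix their surface order.
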
The linguistic data consist of declarations of variables, formats and condition procedures, dictionaries (with direct addressing by the values of certain declared variables, whereby the first dictionary must be referenced by the UL), and a grammar.</Paragraph> <Paragraph position="3"> Each item in a dictionary gives a list of &lt;condition / assignment / string&gt; triplets, the last one having an empty (true) condition. TPiA, PSSPT, TP3A are names of condition procedures, VID, V3H, V3A are names of formats. The apostrophes ('AI) are used in the grammar to make contractions.</Paragraph> <Paragraph position="4"> It should be noted that, unlike ATEF, SYGMOR realizes a finite-state deterministic automaton, thus reflecting the lesser complexity of the synthesis process. To process a mask, SYGMOR looks for the first applicable rule (at least one must have an empty condition), applies it and follows the transitions indicated, unless it finds an inapplicable obligatory rule. In this case, the system executes the special rule ERREUR or a default action if this rule has not been declared. It is thus possible to generate an arbitrary error string at that point. For instance, untranslated source lexical units will be printed between special markers.</Paragraph> <Paragraph position="5"> The output of SYGMOR on our standard example is the following text, which is then transformed by ARIANE into a script file and formatted, thereby adding documentary information. Output text</Paragraph> </Section> <Section position="8" start_page="442" end_page="442" type="metho"> <SectionTitle> ON A FORMULE LE PRINCIPE A L'AIDE DUQUEL ON DEFINIT LE CRITERE UTILE POUR LE SYSTEME THERMODYNAMIQUE NON LINEAIRE. RUSSE RAPPORT LANGUES DE TRAITEMENT: RUS-FRA TEXTE D'ENTREE: </SectionTitle> <Paragraph position="0"/> </Section> <Section position="9" start_page="442" end_page="442" type="metho"> <SectionTitle> LA STRUCTURE DU NOYAU ATOMIQUE. 
DANS LE MOT D'ENTREE ON SOULIGNE LE ROLE IMPORTANT QUE LE SYMPOSIUM A JOUE DANS LE DEVELOPPEMENT DE LA PHYSIQUE NUCLEAIRE DES FAIBLES ENERGIES EN UNION SOVIETIQUE. PENDANT LE SYMPOSIUM ON A EXAMINE LA SERIE DES ETUDES IMPORTANTES REALISEES PAR LES SAVANTS </SectionTitle> <Paragraph position="0"> SOVIETIQUES. EN PARTICULIER. ON A ETUDIE LA NON-</Paragraph> </Section> <Section position="10" start_page="442" end_page="442" type="metho"> <SectionTitle> CONSERVATION DE LA PARITE DANS LES PROCESSUS? PROCEDES? NUCLEAIRES, DIVISION SPONTANEE DES ISOTOPES DES ELEMENTS SUPERLOURDS ET DECOUVERTE DE L'EFFET DES OMBRES PENDANT LA DISPERSION DES PARTICULES. ON A REUNI LES DONNEES STATISTIQUES CONVAINCANTE QUI REFLETENT LA CROISSANCE DU NOMBRE DES RAPPORTS PROPOSES. ON REMARQUE LA PRESENCE PARMI LES PARTICIPANTS DES SPECIALISTES DES PAYS ETRANGERS. </SectionTitle> <Paragraph position="0"/> </Section> </Paper>