<?xml version="1.0" standalone="yes"?>
<Paper uid="C92-4208">
  <Title>TYPOLOGY STUDY OF FRENCH TECHNICAL TEXTS, WITH A VIEW TO DEVELOPING A MACHINE TRANSLATION SYSTEM</Title>
  <Section position="1" start_page="0" end_page="0" type="metho">
    <SectionTitle>
TYPOLOGY STUDY OF FRENCH TECHNICAL TEXTS,
WITH A VIEW TO DEVELOPING A MACHINE
TRANSLATION SYSTEM
B.ROUDAUD
B'VITAL (SITE group)
</SectionTitle>
    <Paragraph position="0"/>
  </Section>
  <Section position="2" start_page="0" end_page="0" type="metho">
    <SectionTitle>
ABSTRACT
</SectionTitle>
    <Paragraph position="0"> Within the industrial context of the information society, technical translation represents a considerable commercial stake. In the light of this, machine translation is considered as being an application of paramount importance. It is for this reason that the activities of B'VITAL have always centered around the processing of technical texts.</Paragraph>
    <Paragraph position="1"> The following article gives an account of the various tasks carried out over the last few years on corpus analysis. We have drawn conclusions as to the validity of the notion of text typologies, applied in particular to technical matter, with a view to developing a machine translation system. The study was conducted on a fair number of French documents and has led us to observe, in particular, that the same typology may be identified in texts originating from varying fields.</Paragraph>
  </Section>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
INTRODUCTION
</SectionTitle>
    <Paragraph position="0"> The technical literature of a nuclear plant represents about 150 000 pages (figure quoted by EDF, France, in 1990). About 20 000 pages make up the maintenance literature of an aircraft, of which 5 000 are subject to revision every three months on average.</Paragraph>
    <Paragraph position="1"> In France, the cost per page of a translation from French into one of the other European languages varies from 250 to 400 francs (according to Bossard Consultants), which amounts to 6 million francs for the maintenance literature of an aircraft.</Paragraph>
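    <Paragraph> As a rough check on these figures, the following sketch multiplies the quoted page count by the quoted price range. The function name and the flat per-page rate are illustrative assumptions, not part of the original study.

```python
def translation_cost_range(pages, low_rate, high_rate):
    """Return (lowest, highest) total cost of translating `pages` pages once.

    Assumes a flat price per page, which is a simplification.
    """
    return pages * low_rate, pages * high_rate


# Figures quoted in the text: 20 000 pages at 250 to 400 francs per page.
low, high = translation_cost_range(20_000, 250, 400)
print(low, high)  # 5000000 8000000
```

The 6 million francs quoted above lies inside this range and corresponds to a rate of 300 francs per page.</Paragraph>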
    <Paragraph position="2"> If a machine translation system is able to reduce considerably the time spent per page by translators, even if the result is not perfect, it will signify a gain with respect to delivery times and costs.</Paragraph>
    <Paragraph position="3"> To avoid problems concerning non-standardised terminology, we have endeavoured to stick to the terms familiar to traditional grammar, even though they may not always be appropriate.</Paragraph>
    <Paragraph position="4"> I wish to thank my colleagues, D. Bachut, O. Gamrat and M.C. Puerta, for their help.</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
1 - DEFINITION OF THE
CORPUS STUDY
</SectionTitle>
    <Paragraph position="0"> According to our method, based on the work of GETA (Grenoble), if the MT system comes across an unhandled phenomenon, the sentence is not rejected and translation is carried through using those parts of the sentence which have been successfully analysed (Cf. [1] &amp; [2]). Within the framework of an industrial application, in which the ratio of development costs to gain in quality is the overriding concern, rare phenomena may thus be handled in a very simplified way, or even not handled at all.</Paragraph>
    <Paragraph position="1"> Midway between controlled syntax systems and systems 'which translate everything', our approach to MT favours the development of systems adapted, a posteriori, to the texts to be translated. This approach rests on the assumption that it is possible to define and then recognise the typology of texts, specifying in particular their form, the linguistic phenomena present or absent and the general vocabulary used (excluding terminology). The study, which was strongly influenced by the MT application, began in 1984, during the French Projet National de Traduction Automatique (PNTAO), with an intensive research phase. It is still in progress, although only sporadically, with the study of new texts. The first study aimed at proving that it was possible to define a typology for the given corpus. Its result was the definition of an initial typology (less restrictive than the METEO typology for instance, see [6]), which was further refined (though not radically changed) during the years which followed. The definition of the typology retained consisted of a list of the phenomena handled, in other words, a list of the static grammar charts (Cf. [2] &amp; [3]) which are part of the linguistic specifications of the system. We will not give a full formal explanation of the defined typology; instead, we will view it from a more informal point of view.</Paragraph>
    <Paragraph position="2"> The whole corpus was made up of documents in French provided by different firms (Sonovision, SITE, Aérospatiale, EDF, Rhône Poulenc, Syseca...). It consists mainly of aeronautical documents (maintenance documents, job cards...), data processing texts (extracts from reference documents or from user guides, or software error messages), minutes of meetings, extracts from work schedules or from technico-commercial documents. [ACTES DE COLING-92, NANTES, 23-28 AOÛT 1992, 1284, PROC. OF COLING-92, NANTES, AUG. 23-28, 1992]</Paragraph>
    <Paragraph position="3"> The aeronautical documents which make up the greater part of the corpus (about 50%), maintenance manuals and job cards, were initially chosen as the basic corpus for the PNTAO for several reasons : they were available in machine-readable form at Sonovision (which later became the SITE group), they corresponded to a considerable commercial need, they concerned a sector of strong export activity, and they were representative of a large number of technologies (mechanics, data processing, fluid mechanics, strength of materials, etc.).</Paragraph>
    <Paragraph position="4"> It is for these reasons that they constitute, even today, an important source of enrichment for the corpus.</Paragraph>
    <Paragraph position="5"> The initial corpus was made up of about 400 pages taken from maintenance and service guides of Marcel Dassault aircraft. Part of the texts, provided with the English translation, were made up of job cards. Each job card indicates how to go about a specific servicing operation. They contain general recommendations and enumerated instructions. The other texts were service notes describing an apparatus or a mechanism, how it functions and the servicing procedures to be followed. The following is an illustration of this second type of text :</Paragraph>
    <Paragraph position="7"> Le raccord auto-obturable d'un équipement mécanique a pour but d'assurer le raccordement rapide d'une génération hydraulique à un banc de test et permet la mise en fonction des différents organes de cette génération hydraulique sans perte de liquide et sans entrée d'air.</Paragraph>
    <Paragraph position="8"> Le raccord auto-obturable sert également au remplissage et au dégazage des circuits. The study enabled us to pinpoint the characteristics of these two types of texts. It turned out that their typology was identical. The main difference is at form level : the first type of text contains enumerated instructions, generally described using short phrases without full stops, which do not exist in the second type of text.</Paragraph>
    <Paragraph position="9"> The purpose of extending the study was to investigate the possible differences amongst various types of documents in various fields. It turned out that most of the literature intended for technicians corresponds to the original typology.</Paragraph>
    <Paragraph position="10"> Amongst the other corpora studied (about 800 pages from various fields, e.g. electrical installation manuals, computer manuals and advertising, software manuals and advertising, aeronautical texts, description of composite materials, maintenance documentation for mechanical installations, etc.), we have classified the texts into two groups : texts pertinent to the initial corpus, in other words those texts whose typology was identical to or closely resembled that of the first corpus, and non-pertinent texts containing new phenomena.</Paragraph>
    <Paragraph position="11"> This second type of text will probably require a sub-categorisation into other typologies (a task not as yet carried out).</Paragraph>
    <Paragraph position="12"> Typology is field independent. We have received extracts from aeronautical documents, provided by Aérospatiale. Some of these texts fall into the non-pertinent category, while others fall into the pertinent category. Furthermore, it was found that texts originating from different fields (data processing or electrical engineering, for example) fell into the typology.</Paragraph>
    <Paragraph position="13"> Amongst the non-pertinent texts we found mainly texts which require rewriting rather than translation, for instance user manuals (at least those intended for non-specialists), scientific articles, advertising and legal texts. Such texts are obviously less suited to MT. We also found minutes of meetings and client service reports, whose syntax is often very fanciful.</Paragraph>
    <Paragraph position="14"> As far as quantity goes, the volume of technical matter is much greater than that of advertising and legal matter (Cf. Van Dijk consultancy figures). Maintenance and service guides, intended for specialists, and reference guides represent a much greater volume than that of user guides, intended for non-specialists (which are less homogeneous, typologically speaking).</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 - RESULT OF THE CORPUS
STUDY
</SectionTitle>
    <Paragraph position="0"> The study laid the emphasis on the problems directly related to translation, in particular with respect to accuracy. The sub-sections which follow illustrate the defined typology and instance the linguistic phenomena not encountered in the so-called pertinent texts studied. Several points must be underlined : 1- any phenomenon classed as being absent may nevertheless be marginally present in a text ; 2- the corpus study showed that few of the phenomena absent from the initial study were later added.</Paragraph>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2.1. FREQUENT PHENOMENA
</SectionTitle>
    <Paragraph position="0"> It is not possible to list fully all the phenomena encountered. We will however endeavour to give the main features of the typology of the texts studied, giving the most unusual or significant examples (from the pertinent corpus). The following hence plays more of an illustrative role than a descriptive role.</Paragraph>
  </Section>
  <Section position="7" start_page="0" end_page="0" type="metho">
    <SectionTitle>
1- Commas
</SectionTitle>
    <Paragraph position="0"> Generally speaking, punctuation is used somewhat irregularly. There are few commas, and they do not always appear where they would normally be expected. Consequently, it appears pointless to use them for linguistic purposes, for example, in determining the limit of a coordination.</Paragraph>
  </Section>
  <Section position="8" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 - Full stops
</SectionTitle>
    <Paragraph position="0"> As previously mentioned, the job cards come in the form of enumerated operations. Each operation is described by one or more small paragraphs containing sentences. As a rule, there are no full stops indicating the end of a paragraph. Infinitive clauses They are very frequent in the texts, especially as operations to be carried out are described using the infinitive form (and not using the imperative form : faites ceci, or personal form : vous faites ceci ). They appear as the main clause of a sentence, as an object or as an adverbial : Après avoir soigneusement asséché les parties externes du raccord, soumettre celui-ci à une pression interne de 0,6 bar pendant 5 minutes...</Paragraph>
    <Paragraph position="1"> ...il est préférable de mettre l'avion sous abri.</Paragraph>
    <Paragraph position="2"> Ne pas tendre exagérément les cordes.</Paragraph>
    <Paragraph position="3"> Agir sur la valve pour la faire reculer.</Paragraph>
    <Paragraph position="4"> Infinitive clauses introduced by the preposition à, with a passive adjectival value, are frequently encountered. They bear a certain modality : fixer un obturateur sur la buse d'entrée du circuit d'air à refroidir (to be cooled) ... l'effet à obtenir (to be obtained) Les coupures de frettes ne sont pas à prendre en considération</Paragraph>
    <Paragraph position="5"> Finally, the various infinitive clauses can be inter-coordinated or coordinated with different types of clauses (object or adverbial clauses). Coordinations (apart from enumerations) are generally limited to two or three clauses.</Paragraph>
    <Paragraph position="6"> Conjugated verbal clauses We class a conjugated verbal clause as being any clause governed by a conjugated verb. The clauses encountered may appear either as a main (or independent) clause or a subordinate clause (object or adverbial). All the common types of clauses can be found (personal, impersonal, active, passive...). Coordinations (enumerations aside) are generally limited to two or three clauses. Examples of characteristic sentences : Pour les vents très violents, il est préférable de mettre l'avion sous abri.</Paragraph>
    <Paragraph position="7"> Si le liquide polluant s'infiltre dans la zone située entre la jante de la roue et le talon, il faut impérativement démonter le pneu pour le nettoyer.</Paragraph>
    <Paragraph position="8"> Amongst the remarkable clauses, we encountered several examples of inverted subjects (which could pass as a stylistic turn unlikely to be found in a technical text). For example : Au cadre 19 se trouve un boîtier étanche...</Paragraph>
    <Paragraph position="9"> Relative clauses Very few are encountered in the texts (on average 1 relative clause every 2 pages) ; most of them are introduced by the relative pronoun qui (for more than 95% of the texts studied). Here are a few examples : Il est basé sur la conductibilité thermique... et sur le mode de circulation adopté qui permettent un échange... entre les deux circuits d'air qui la traverse.</Paragraph>
    <Paragraph position="10"> ... elles possèdent chacune un anneau à travers lequel passe la sangle de rappel...</Paragraph>
    <Paragraph position="11"> Noun phrases We class noun phrases as being all groups with a nominal value whether or not they are introduced by a preposition. It is likely that all the possible types of noun phrases exist in the corpus. We observed with interest that in some cases, the groups were quite often barely grammatically acceptable. Juxtaposing nouns is, in principle, a rare phenomenon in French. All the same, juxtaposition is quite a generalised practice in the texts studied (this is probably the case for a lot of technical texts). The following juxtapositions are ... The following complex examples illustrate the degree of complexity noun phrases may attain (the part of the sentence which is not the noun phrase is bracketed) : (Il procède) de la technique des refroidisseurs à surfaces secondaires, appelés refroidisseurs à lames et intercalaires ou refroidisseurs compacts qui convient particulièrement bien au gaz ayant un mauvais coefficient d'échange thermique.</Paragraph>
    <Paragraph position="12"> Verb nominalisation (verb to action noun derivation) is also frequently used : ... effectuer une vérification du tarage...</Paragraph>
    <Paragraph position="13"> (rather than vérifier le tarage) ... procéder à la dépose des panneaux...</Paragraph>
    <Paragraph position="14"> Faire une vérification du réglage...</Paragraph>
    <Paragraph position="15"> These structures are built using a small number of French verbs (faire, effectuer...), and they are used with all nominalisations of verbs of action.</Paragraph>
    <Paragraph position="16"> Idiomatic expressions We class idiomatic expressions as being variable or non-variable expressions, which require identification within a text in order to ensure correct translation. Idiomatic expressions exist in all the syntactic categories. Nominal idiomatic expressions are probably the most frequent, particularly in specialised terminology (more than 80 % of the terms are nominal idiomatic expressions, in aeronautical terminology). Without going into further detail, it can be said that they are all present in the corpus. However, only a relatively small number of verbal expressions can be found (in comparison with the potential wealth of the French language) and there has been no example of a transformation of these expressions (for example les mesures qui doivent être prises) in the corpus. They are generally made up using a limited number of verbs (prendre, tenir, mettre...) : Maintenir en place l'embase...</Paragraph>
    <Paragraph position="17"> Cette pression de remplissage tient compte d'une perte de charge...</Paragraph>
    <Paragraph position="18"> ... mettre les commandes du régulateur oxygène sur ON... We must bear in mind that expressions of the effectuer-plus-a-nominalised-verb type do not belong to the specific idiomatic expression class, and are considered as being a commonly accepted transformation.</Paragraph>
  </Section>
  <Section position="9" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2.2. RARE OR NONEXISTENT
PHENOMENA
</SectionTitle>
    <Paragraph position="0"> The following is a list of significant examples of linguistic phenomena rarely or never encountered ; it is not an exhaustive list.</Paragraph>
    <Paragraph position="1"> Interrogative clauses Neither direct nor indirect interrogatives were encountered in any of the texts.</Paragraph>
    <Paragraph position="2"> Imperative clauses No imperatives (imperative mood) were encountered in the pertinent corpus ; they are replaced by infinitive clauses (such is the case for most service or maintenance notes in French) : procéder au campement de l'avion ; ne pas tendre exagérément les cordes. Amongst the non-pertinent texts, messages for users of a data processing system, user guides or training guides may contain imperatives or (sometimes) interrogatives.</Paragraph>
    <Paragraph position="3"> Direct speech No examples of direct speech were encountered. Comparative phrases with a comparative reference Comparative phrases containing a comparative reference were extremely rare in the corpus studied. Only one example was encountered : La bâche à eau est plus volumineuse que la cuve d'évaporation, composée d'éléments en alliage léger, soudés.</Paragraph>
  </Section>
  <Section position="10" start_page="0" end_page="0" type="metho">
    <SectionTitle>
Personal pronouns
</SectionTitle>
    <Paragraph position="0"> Although not totally absent, personal pronouns are rare in technical texts (one per page on average).</Paragraph>
    <Paragraph position="1"> None of the personal pronouns encountered in the pertinent corpus referred to a human (in fact, humans are rarely referred to in the texts).</Paragraph>
    <Paragraph position="2"> This information meant that for the French-English system, personal pronouns could always be translated as "it" (or "they"), as it is not necessary to search for the referent.</Paragraph>
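    <Paragraph> The simplification described for personal pronouns can be pictured as a plain table lookup. The sketch below is only an illustration under our own assumptions: the table, the function name and the restriction to third-person subject pronouns are hypothetical and are not taken from the system built on ARIANE.

```python
# Hypothetical lookup table (not taken from the source system):
# third-person French subject pronouns mapped to a non-human English pronoun.
PRONOUN_MAP = {
    "il": "it",
    "elle": "it",
    "ils": "they",
    "elles": "they",
}


def translate_subject_pronoun(french_pronoun):
    """Translate a pronoun by number only, without looking for its antecedent."""
    return PRONOUN_MAP[french_pronoun.lower()]


print(translate_subject_pronoun("Elle"))  # it
print(translate_subject_pronoun("ils"))   # they
```

Because gender plays no part in the choice of the English form, no antecedent search is needed. Such a rule would only be safe for texts in which pronouns never refer to humans.</Paragraph>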
    <Paragraph position="3"> Rhetorical figures Rhetorical figures are rare and correspond to established usages.</Paragraph>
    <Paragraph position="4"> Only one metaphor was encountered in the texts : ... les commandes cheminent sur le côté droit sous le plancher passager...</Paragraph>
    <Paragraph position="5"> The problem is solved at dictionary level, given that a cable can cheminer (make its way).</Paragraph>
    <Paragraph position="6"> The metonyms encountered appear to be accepted metonyms which are transposable into the target language. Thus, in the above example, the term commandes which represents the cable or cables which propagate the controls, is literally translated into English, without shocking the technicians of the field.</Paragraph>
    <Paragraph position="7"> Anaphora are rare and, as with personal pronouns, could be handled in a simplified way if encountered. Finally, with regard to aspectual phenomena, there is very little aspectual indication in French, syntactically speaking. English, which was our target language, very often uses the progressive form. We therefore paid particular attention to the reconstitution of the progressive form. After having carried out a comparative study of the texts, we were able to draw the conclusion that the progressive form was virtually nonexistent in the translated texts.</Paragraph>
  </Section>
  <Section position="11" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 - CONCLUSION
</SectionTitle>
    <Paragraph position="0"> The corpus study has enabled us to draw several conclusions. Firstly, it has enabled the definition of a text typology which is pertinent to a considerable volume of texts of different origins and different technical fields (although this excludes in particular computer user guides and documents for the general public).</Paragraph>
    <Paragraph position="1"> The definition of this typology has enabled us to draw up accurate linguistic specifications, simplifying and sometimes ignoring the handling of certain linguistic phenomena. The specifications were then used to develop an MT system (based on GETA's system, ARIANE), whose grammars are less complex, and hence easier to maintain. The quality of translation, tested on a large number of aeronautical texts (maintenance guides) and on several specimens of pertinent texts, has proved the validity of this approach : revision times varied from 20 to 30 minutes per page (the average translation time at SITE is about 1 page per hour).</Paragraph>
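    <Paragraph> To put the revision times quoted above in perspective, the following sketch compares them with the manual rate of about one page per hour. It assumes, as a simplification of ours, that revision is the only human time spent on a machine-translated page; the function name is illustrative.

```python
def time_saved_fraction(revision_minutes, manual_minutes=60.0):
    """Fraction of human time saved per page when revising instead of translating.

    Assumes the only human cost of the MT route is revision time.
    """
    return (manual_minutes - revision_minutes) / manual_minutes


for revision in (20, 30):
    print(revision, round(time_saved_fraction(revision), 2))
# 20 0.67
# 30 0.5
```

Under that assumption, the human time per page falls by between one half and two thirds.</Paragraph>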
    <Paragraph position="2"> The study of texts of different typologies has enabled us to pinpoint the limits of a minimal language, used in all types of texts, and hence the indispensable kernel of any new typology.</Paragraph>
    <Paragraph position="3"> From an economic point of view, the texts targeted by the typology described here represent a considerable amount of the documents to be translated.</Paragraph>
  </Section>
</Paper>