<?xml version="1.0" standalone="yes"?>
<Paper uid="C96-2188">
  <Title>Corpus-based annotated test set for Machine Translation evaluation by an Industrial User</Title>
  <Section position="2" start_page="0" end_page="0" type="metho">
    <SectionTitle>
Eva DAUPHIN and Véronika LUX
AEROSPATIALE - CCR
</SectionTitle>
    <Paragraph position="0"/>
  </Section>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
Abstract
</SectionTitle>
    <Paragraph position="0"> This article describes the construction of a test data set designed to help an industrial user evaluate machine translation systems. It emphasises the value of an approach based on the pragmatic characteristics of a bilingual corpus. Studying one chapter of the maintenance manual of the Super Puma helicopter made it possible to identify the pragmatic characteristics that determine the choice of the morpho-syntactic structures and translation processes actually used. The textual test set consists of an SGML file in which the source text sequences are aligned with the reference translation sequences and annotated with their pragmatic, formal and translational characteristics (labels and formal descriptions).</Paragraph>
    <Paragraph position="1"> Introduction Corpus studies appear to be one of the most appropriate techniques for identifying the linguistic constraints and needs that serve as evaluation measurements and criteria when judging how well a machine translation system fits an industrial user's environment. In this article, the linguistic constraints are the linguistic characteristics of the corpora to be processed by the machine translation system. These constraints are illustrated in the source language corpora to be submitted to machine translation.</Paragraph>
    <Paragraph position="2"> The linguistic needs correspond to the minimal level of quality required of the translation produced. These needs are illustrated in an attested human translation of the chosen source corpora. Identifying and formalising the linguistic constraints and needs illustrated in corpora is a major step in the evaluation of machine translation applications by an industrial user. Corpus study is not a new concept in the NLP domain, but the methods used can differ considerably depending on the expected results and applications. In this article, we describe how we built a reusable annotated test set through the study of a bilingual corpus.</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="1061" type="metho">
    <SectionTitle>
1 A highly structured documentation
</SectionTitle>
    <Paragraph position="0"> The corpus we chose to study is the maintenance manual for the Super Puma helicopter, written in French, and its attested human translation into English. An important characteristic of this corpus is that it is written and used in compliance with the so-called ATA 100 specification.</Paragraph>
    <Section position="1" start_page="0" end_page="1061" type="sub_section">
      <SectionTitle>
1.1 ATA 100: a short presentation
</SectionTitle>
      <Paragraph position="0"> The role of the ATA 100 specification is to provide a set of rules for writing and using aircraft after-sale documents. Both document writers and users are supposed to be familiar with this specification. As such, the ATA 100 specification defines the environment in which the documents are produced and used. In particular, the specification imposes a very strict way of structuring the text in terms of topical and discursive organisation, and also in terms of practical document production, by providing a dedicated SGML DTD for the maintenance manuals (ATA 100 DTD). The importance of ATA 100 in writing and using the document encouraged us to consider the corpus from a point of view that is quite new in the NLP domain: the pragmatic approach. The ATA 100 specification can indeed be considered as a sociolectal system or standard (i.e. a system that governs communicative usage within a restricted community of people).</Paragraph>
      <Paragraph position="1"> From a communicative point of view the ATA specification defines the types of discursive genres (illocutionary force of the utterances) the writer has to adopt according to a pre-defined document structure. For example, all the maintenance documents must be divided into tasks and subtasks. Each maintenance task description should be preceded by a definition of this task. Any task to be performed should be described as a succession of subtasks which are explained in the form of a succession of orders.</Paragraph>
      <Paragraph position="2"> Each task and subtask has a precise name that appears as a title in the document. Because the maintenance manuals are updated annually, the document also contains a large amount of factual information such as dates, version numbers, aircraft type references, page numbers, etc. Since the production and use environment has a strong impact on the way the documents are written, it was quite natural to first characterise the corpus we intended to study from a pragmatic point of view.</Paragraph>
    </Section>
    <Section position="2" start_page="1061" end_page="1061" type="sub_section">
      <SectionTitle>
1.2 Pragmatic labelling of sequences
</SectionTitle>
      <Paragraph position="0"> For us, pragmatic labelling was a first step in classifying utterances according to their communicative value. In practice, we decided to assign to each utterance of the text a label indicating its textual and discursive status according to the ATA 100 indications. The pragmatic study of the corpus resulted in the definition of four types of labels: the meta-textual indicators, the topical meta-utterances, the discursive meta-utterances and the illocutionary typed utterances (orders, definitions, etc.).</Paragraph>
      <Paragraph position="1"> For example, the METNORM and METXNORM labels correspond to the Task and Subtask titles in the text: Stockage des instruments; Déstockage des atterrisseurs; Remise en service de l'appareil. The illocutionary aim of the METNORM or METXNORM utterances is to help the user understand the topical organisation of the document. They actually illustrate the operational organisation of the maintenance work to be done by the user.</Paragraph>
      <Paragraph position="2"> 2 Underlying syntactic behaviours The second step of the corpus study consisted in observing the formal structures of the utterances previously labelled from a pragmatic point of view. As the above examples show, the pragmatic value of the utterances has a very clear effect on their morpho-syntactic structure.</Paragraph>
    </Section>
    <Section position="3" start_page="1061" end_page="1061" type="sub_section">
      <SectionTitle>
2.1 Some observations
</SectionTitle>
      <Paragraph position="0"> In our corpus, we observed that the Meta-Textual Indicators usually have a phrasal structure (they are not complete sentences) and include a large number of brachygraphical signs (acronyms, codes, alpha-numerical references, etc.). The meta-utterances are all nominal phrases resulting from the nominalisation of verbal groups: Stockage des instruments; Déstockage des atterrisseurs; or from the topicalisation of an object: Éléments stockés en containers pressurisés. Nominalisation and topicalisation can thus be considered as processes used by the writer to "textualise" knowledge for the reader.</Paragraph>
      <Paragraph position="1"> As far as the Directive - Operation Utterances (EDOPER) are concerned, we also observed very strong regularities, all based on the same basic morpho-syntactic scheme, VERB + OBJECT: the verb is always in the infinitive, the real subject (the reader) is never mentioned, and adverbial complements may be inserted at specific places in the sentence depending on their semantic value (time, manner, place, means, etc.). 2.2 Typical morpho-syntactic schemes This type of morpho-syntactic observation was carried out for all the utterances of the text and resulted in the definition of twelve basic morpho-syntactic schemes capturing the characteristics of the linguistic structures used by the writer. Our morpho-syntactic schemes are based on the concept of syntagmatic components, which are further specified using a set of features. We used two kinds of features: morphological features (tense, mood, voice, derivation, etc.) and functional features (manner adverbial complement, direct object, subject, agent, etc.). For example, most of the Topical Meta-Utterances (MET) correspond to the morpho-syntactic scheme SNDEV, which is the following: N + (AJ) + (SN1) + (SP1|SN2) + (SAV|SP2) with the following features:
 * N: deverb = +, indicating that the noun (N) is the result of the nominalisation of a verb (deverb)
 * AJ: fonction = épithète, indicating that the adjective (AJ) has the function of modifier
 * SN1: fonction = COMPADV and type = temps, indicating that the nominal phrase has the function of adverbial complement (COMPADV) with a "time" semantics (type = temps)
 * SP1: fonction = dev-OBJ and prep = de, indicating that the prepositional phrase has the function of object of a nominalisation (dev-OBJ) introduced by the preposition de (prep = de)</Paragraph>
      <Paragraph position="2">
 * SN2: fonction = dev-OBJ, indicating that the nominal phrase has the function of direct object (there is no prepositional introducer) of a nominalisation (dev-OBJ)
 * SAV and SP2: fonction = COMPADV and type = manière, indicating that the SAV and/or the SP2 have the function of adverbial complement (COMPADV) with a "manner" semantics (type = manière).
 The scheme presentation reflects the results of a textual study, and in order to formalise some particular phenomena we had to introduce some specific features, such as "deverb" for the nouns resulting from the nominalisation of verbs. These schemes are actually generic representations that allow us to characterise all the textual sequences of our corpus using only 12 scheme labels.</Paragraph>
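      <Paragraph> The SNDEV scheme above can be read as an ordered list of slots, each offering one or more alternative components with their features. The following is a minimal Python sketch of that reading, not the authors' own formalism: the names Slot and matches are hypothetical, and the sketch assumes that each slot is filled by at most one component, which simplifies the "SAV and/or SP2" case.

```python
from dataclasses import dataclass


@dataclass
class Slot:
    """One position of a scheme: alternative components with their features."""
    alternatives: dict      # component label -> feature dict
    optional: bool = True


# SNDEV: N + (AJ) + (SN1) + (SP1|SN2) + (SAV|SP2), features as listed in the text.
SNDEV = [
    Slot({"N": {"deverb": "+"}}, optional=False),
    Slot({"AJ": {"fonction": "épithète"}}),
    Slot({"SN1": {"fonction": "COMPADV", "type": "temps"}}),
    Slot({"SP1": {"fonction": "dev-OBJ", "prep": "de"},
          "SN2": {"fonction": "dev-OBJ"}}),
    Slot({"SAV": {"fonction": "COMPADV", "type": "manière"},
          "SP2": {"fonction": "COMPADV", "type": "manière"}}),
]


def matches(scheme, components):
    """Return True if the component labels fill the slots in order."""
    i = 0
    for slot in scheme:
        if i != len(components) and components[i] in slot.alternatives:
            i += 1                      # this slot is filled by the next component
        elif not slot.optional:
            return False                # a mandatory slot is missing
    return i == len(components)         # no unexplained component may remain


# "Stockage des instruments" read as deverbal noun + de-phrase object (assumed analysis).
print(matches(SNDEV, ["N", "SP1"]))     # True
print(matches(SNDEV, ["SP1", "N"]))     # False
```

 Because every component label belongs to exactly one slot, a single left-to-right pass is enough here; a scheme with overlapping alternatives would need backtracking.</Paragraph>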
    </Section>
    <Section position="4" start_page="1061" end_page="1061" type="sub_section">
      <SectionTitle>
2.3 Co-description
</SectionTitle>
      <Paragraph position="0"> Studying the possible co-description of an utterance by a pragmatic label on the one hand and by a morpho-syntactic scheme on the other hand made it possible to assess compatibilities and incompatibilities between the pragmatic value of an utterance and its linguistic structure. The following table shows that for each pragmatic value, we can find a typical underlying morpho-syntactic structure.</Paragraph>
      <Paragraph position="2"> At this stage of the corpus study, we described each textual sequence of our text with two labels: a pragmatic one (indicating the textual and illocutionary status of the sequence) and a morpho-syntactic one (describing its formal behaviour).</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="1061" end_page="1063" type="metho">
    <SectionTitle>
3 Underlying translational behaviours
</SectionTitle>
    <Paragraph position="0"> The third step of the corpus study was intended to show that it was possible to add translational information to the pragmatic and morpho-syntactic information already obtained. To obtain this translational information, we carried out a contrastive study of our French text and its attested human translation.</Paragraph>
    <Section position="1" start_page="1061" end_page="1061" type="sub_section">
      <SectionTitle>
3.1 Some observations
</SectionTitle>
      <Paragraph position="0"> The pragmatic characteristics of the text apparently implied some translation choices. Indeed, for us, only the directive illocutionary value of the utterances could explain the choice of translating the French infinitive verbs by English imperative verbs. The French infinitive has no intrinsic imperative value; not all French infinitive verbs are translated by an imperative verb in English.</Paragraph>
      <Paragraph position="1"> Also, the pragmatic phenomena of "terminologisation", which are not the same in French and in English, explain the possible structural non-correspondence between some nominal phrases: Appareil entreposé non stocké --&gt; Aircraft Stored-No Preservation Measures or between some sentences. Preserving the pragmatic value from French to English can also lead to restoring in English some elements that are missing in French: rotation des roues --&gt; rotate the wheels.</Paragraph>
    </Section>
    <Section position="2" start_page="1061" end_page="1061" type="sub_section">
      <SectionTitle>
3.2 Translational annotations
</SectionTitle>
      <Paragraph position="0"> The contrastive study led us to identify some translational consequences, recurrent in our corpus, of preserving the pragmatic value from French to English. This part of the study also identified some phenomena that have to be strictly formalised if we want a machine translation system to handle them correctly. This is the case for the terminological elements, which may have quite unpredictable translations (or equivalents): GTM --&gt; Engines; Circuits anémo-barométriques --&gt; Air Data Systems. Concerning the problem of term translation, the best solution we found for annotating the test set consists in tagging the terms in the French sequence and in the corresponding English sequence using SGML tags (&lt;T&gt; and &lt;/T&gt;):  observations, we chose to express them in the form of an "oriented rule" attached to the French-English pair of corresponding sequences. If we take the example of determination in the Topical Meta-Utterances (topical titles), we observe that the nominal phrases which are the object of a nominalisation are nearly always determined in French: Déstockage de la structure; Nettoyage des parties métalliques; whereas they are systematically undetermined in English: Depreservation of airframe; Cleaning of metal parts. The annotation concerning the omission of the determiner in English is the following: V + SP --&gt; V + SP(DT-) and is attached to the relevant pairs of French-English sequences.</Paragraph>
      <Paragraph position="1"> 4 Building an annotated test set The corpus study gave us a large amount of information about the French text on the one hand and the English text on the other hand. We also have information on the corpus-specific translational processes used from French to English. In order to build the test set, it appeared necessary to structure this information so that it could be exploited by the industrial evaluator.</Paragraph>
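      <Paragraph> To illustrate how an oriented rule such as V + SP --> V + SP(DT-) could be used during evaluation, here is a deliberately naive Python sketch. It is an assumption-laden illustration, not the authors' tooling: the function name, the determiner list and the idea of testing the word that follows "of" are all hypothetical simplifications.

```python
# Oriented rule attached to a French-English pair, quoted from the text.
RULE_DETERMINER_OMISSION = "V + SP --> V + SP(DT-)"

# Hypothetical, non-exhaustive list of English determiners.
ENGLISH_DETERMINERS = {"the", "a", "an", "this", "that", "these", "those"}


def object_is_undetermined(english_title):
    """Check that the object introduced by 'of' carries no determiner."""
    words = english_title.lower().split()
    if "of" not in words:
        return False                    # the rule does not apply to this title
    after_of = words[words.index("of") + 1:]
    return bool(after_of) and after_of[0] not in ENGLISH_DETERMINERS


# Reference translations from the corpus satisfy the rule.
print(object_is_undetermined("Depreservation of airframe"))   # True
print(object_is_undetermined("Cleaning of metal parts"))      # True
# A literal machine translation that keeps the determiner does not.
print(object_is_undetermined("Cleaning of the metal parts"))  # False
```

 A real check would rely on the morpho-syntactic description of the English sequence rather than on word lists.</Paragraph>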
    </Section>
    <Section position="3" start_page="1061" end_page="1063" type="sub_section">
      <SectionTitle>
4.1 Defining an annotation scheme
</SectionTitle>
      <Paragraph position="0"> To structure our annotated test set, we defined a so-called "annotation scheme" and adopted the descriptive language SGML. The test set is based on the notion of pairs of equivalent textual sequences directly extracted from the aligned French-English corpus studied. Each pair of aligned sequences composes what we call a test unit. The information concerning the French sequence is directly attached to it (morpho-syntactic scheme, complete morpho-syntactic description and tagged terms); the information concerning the English sequence is directly attached to it (complete morpho-syntactic description and tagged terms); and the information concerning the sequence pair (pragmatic label, factual data) is attached to the test unit itself.</Paragraph>
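      <Paragraph> The three levels of attachment described above can be pictured with a small data structure. This is only a Python sketch under stated assumptions: the class and field names are hypothetical (the element names of the actual DTD are not given here), and the morpho-syntactic descriptions in the example are placeholders rather than the authors' annotations.

```python
from dataclasses import dataclass, field


@dataclass
class Sequence:
    """Annotations attached to one side of the aligned pair."""
    text: str
    description: str                          # complete morpho-syntactic description
    terms: list = field(default_factory=list) # tagged terms
    scheme: str = ""                          # scheme label, French side only


@dataclass
class TestUnit:
    """Annotations attached to the pair as a whole."""
    pragmatic_label: str
    french: Sequence
    english: Sequence
    factual_data: dict = field(default_factory=dict)
    rules: list = field(default_factory=list) # oriented rules, if any


unit = TestUnit(
    pragmatic_label="MET",
    french=Sequence("Nettoyage des parties métalliques", "N + SP1", scheme="SNDEV"),
    english=Sequence("Cleaning of metal parts", "N + SP"),
    rules=["V + SP --> V + SP(DT-)"],
)
print(unit.french.scheme, unit.pragmatic_label)
```
      </Paragraph>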
    </Section>
    <Section position="4" start_page="1063" end_page="1063" type="sub_section">
      <SectionTitle>
4.2 The use of SGML
</SectionTitle>
      <Paragraph position="0"> To obtain a properly structured file of annotated test data, we chose to build it in compliance with the SGML ISO standard. We thus wrote an SGML DTD in order to formalise the conceptual annotation scheme. The result obtained for the following sequence pair:  This format, though a bit complex for the human eye, has the advantage of clearly separating the annotations from the original textual data. Moreover, this format makes the data easy to exploit, provided the evaluator uses SGML tools, with which selecting and extracting subsets of data becomes really easy (each tagged item is a potential selection criterion).</Paragraph>
      <Paragraph position="1"> Conclusion Building this kind of annotated test set from corpora has several benefits. First, it gives the evaluator a whole set of potential test data which are clearly representative of his real industrial needs. Being enriched by pragmatic and morpho-syntactic annotations, it considerably helps the evaluator to identify the phenomena that a machine translation system handles well or badly. Indeed, using the annotations, the evaluator can easily link the mistakes of a machine translation system with the linguistic units concerned. The presence of an annotated reference translation is also of great help in keeping the evaluator as impartial as possible (even if a reference human translation may not always be the best one). This is particularly true when dealing with terminology. Finally, using SGML to build the test set file allows one to perform targeted evaluations by giving extraction criteria such as the morpho-syntactic schemes or even the pragmatic labels: an evaluator can decide to select all the test units bearing the pragmatic METNORM label in order to carry out a specific evaluation of the MT system's performance when translating the task and subtask titles of the Super Puma helicopter maintenance manual.</Paragraph>
      <Paragraph position="2"> As stated above, our initial aim in this corpus study was to build a corpus-based annotated test set to be used for the evaluation of MT systems. The results nevertheless seem interesting for some other purposes.</Paragraph>
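      <Paragraph> The idea that each tagged item is a potential selection criterion can be sketched without any SGML tooling. The Python fragment below is a hypothetical illustration: the select function is not part of any SGML tool, test units are reduced to plain dictionaries, and the assignment of METNORM or METXNORM to each title is assumed for the sake of the example.

```python
def select(units, **criteria):
    """Keep the units whose annotations equal every given criterion."""
    return [u for u in units
            if all(u.get(key) == value for key, value in criteria.items())]


# Illustrative units; the label of each title is an assumption.
units = [
    {"fr": "Stockage des instruments", "label": "METNORM", "scheme": "SNDEV"},
    {"fr": "Remise en service de l'appareil", "label": "METXNORM", "scheme": "SNDEV"},
    {"fr": "Déstockage des atterrisseurs", "label": "METNORM", "scheme": "SNDEV"},
]

titles = select(units, label="METNORM")
print([u["fr"] for u in titles])
```

 Combining criteria, for example select(units, label="METNORM", scheme="SNDEV"), narrows the subset in the same way.</Paragraph>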
      <Paragraph position="3"> Firstly, these results are potential contributions to the specification of dedicated NLP systems. In particular they allow one to suggest heuristics for the processing of linguistic phenomena which, in general, are known to be complex problems for NLP. One example, in the case of MT, is the translation of French determiners into English.</Paragraph>
      <Paragraph position="4"> Another example, in the case of automatic analysis, is the resolution of anaphora: more than 80% of the pronouns in our corpus refer to the object complement of the last sentence.</Paragraph>
      <Paragraph position="5"> In the long run, future MT systems could take advantage of the pragmatic information contained in the SGML tags of the source text to drive both the analysis and the transfer phases. For example, a verbal form in the infinitive, when occurring in a procedural part of a French text (identified as such via SGML tags), would be analysed as a sequence with injunctive value and translated into English by a verbal form in the imperative. When occurring in a title, a similar infinitive verbal form could be translated by an "ing" verbal form.</Paragraph>
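      <Paragraph> The heuristic suggested above amounts to a lookup from pragmatic context to target verb form. A minimal Python sketch follows; the function name, the set of title labels and the "undecided" fallback are assumptions made for illustration, not part of any existing MT system.

```python
# Labels treated as titles here (an assumption based on the labels named in the text).
TITLE_LABELS = {"MET", "METNORM", "METXNORM"}


def english_form_for_infinitive(pragmatic_label):
    """Suggest an English verb form for a French infinitive, given its context."""
    if pragmatic_label == "EDOPER":         # directive utterance in a procedure
        return "imperative"
    if pragmatic_label in TITLE_LABELS:     # task, subtask or topical title
        return "ing-form"
    return "undecided"                      # no heuristic for other contexts


print(english_form_for_infinitive("EDOPER"))
print(english_form_for_infinitive("METNORM"))
```
      </Paragraph>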
    </Section>
  </Section>
</Paper>