<?xml version="1.0" standalone="yes"?>
<Paper uid="C90-3006">
  <Title>Towards Personal MT: general design, dialogue structure, potential role of speech</Title>
  <Section position="1" start_page="0" end_page="0" type="metho">
    <SectionTitle>
Towards Personal MT:
</SectionTitle>
    <Paragraph position="0"> general design, dialogue structure, potential role of speech</Paragraph>
  </Section>
  <Section position="2" start_page="0" end_page="0" type="metho">
    <SectionTitle>
Christian BOITET
GETA, IMAG Institute
(UJF &amp; CNRS)
</SectionTitle>
    <Paragraph position="0"/>
  </Section>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
Abstract
</SectionTitle>
    <Paragraph position="0"> Personal MT (PMT) is a new concept in dialogue-based MT (DBMT), which we are currently studying and prototyping in the LIDIA project. Ideally, a PMT system should run on PCs and be usable by everybody.</Paragraph>
    <Paragraph position="1"> To get his/her text translated into one or several languages, the writer would agree to cooperate with the system in order to standardize and clarify his/her document. There are many interesting aspects in the design of such a system. The paper briefly presents some of them (HyperText, distributed architecture, guided language, hybrid transfer/interlingua), then goes on to study in more detail the structure of the dialogue with the writer and the place of speech synthesis [1].</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
Keywords
</SectionTitle>
    <Paragraph position="0"/>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
Introduction
</SectionTitle>
    <Paragraph position="0"> A first classification of MAT (Machine Aided Translation) systems is by user. "Classical" MAT systems are for the watcher, for the revisor (post-editor), or for the translator. A new concept is that of "personal MT", or MAT for the writer.</Paragraph>
    <Paragraph position="1"> MT for the watcher appeared in the sixties. Its purpose is to provide informative rough translations of large amounts of unrestricted texts for the end user.</Paragraph>
    <Paragraph position="2"> MT for the revisor appeared in the seventies. It aims at producing raw translations good enough to be revised by professionals in a cost-effective way. This implies that the system needs to be specialized for a certain sublanguage. For a system to be cost-effective, it is generally agreed that at least 20,000 pages must be handled (e.g. 10,000 pages/year for at least 2 years).</Paragraph>
    <Paragraph position="3"> Leaving "heavy MT", not adapted to small volumes of heterogeneous texts, several firms have developed MAT systems for translators, in the form of tools (e.g.</Paragraph>
    <Paragraph position="4"> Mercury-Termex™), or of integrated environments (e.g.</Paragraph>
    <Paragraph position="5"> Alps TSS™).</Paragraph>
    <Paragraph position="6"> The concept of MT for the author (writer/speaker) has recently crystallized, building on previous studies on interactive MT, text critiquing and dialog structures [5, 6, 7, 9, 12]. Its aim is to provide high quality translation/interpretation services to end users with no knowledge of the target languages or linguistics.</Paragraph>
    <Paragraph position="7"> A second classification of MAT systems is by the types of knowledge felt to be central to their functioning. Linguistic-Based MT uses: core knowledge about the language; specific knowledge about the corpus (domain, typology); intrinsic semantics (a term coined by J.P. Desclés to cover all information formally marked in a natural language, but which refers to its interpretation, such as semantic features or relations: concreteness, location, cause, instrument...); but not: extrinsic semantics (static knowledge describing the domain(s) of the text, e.g. in terms of facts and rules); situational semantics (describing the dynamic situations and their actors); pragmatics (overt or covert intentions in the communicative context).</Paragraph>
    <Paragraph position="8"> Knowledge-Based MT uses extralinguistic knowledge on top of linguistic knowledge. Finally, Dialogue-Based MT insists on extracting knowledge from a human (the author or a specialist). These options are not exclusive, however. In KBMT-89 [7], for example, ambiguities persisting after using linguistic and extralinguistic knowledge are solved through a dialogue with the writer initiated by the "augmentor". In ATR's Machine Interpretation project, the dialogues center around a well-defined task (organization of international conferences), but may also concern extraneous matters (cultural events, health problems...). This feature, added to the enormous ambiguity inherent in speech input, will likely force such systems to be dialogue-based as well as knowledge-based [5]. In Personal MT, we may rely on some core extralinguistic knowledge base, but not on any detailed expertise, because the domains and types of text should be unrestricted. Hence, Personal MT must be primarily dialogue-based.</Paragraph>
    <Paragraph position="9"> A third classification of MAT systems is by their internal organization (direct/transfer/interlingua, use of classical or specialized languages, procedurality/declarativeness...) through which so-called "generations" have been distinguished. This level of detail is not very relevant to this paper.</Paragraph>
    <Paragraph position="10"> I. A project in Personal MT. 1. Goals. LIDIA (Large Internationalization of the Documents by Interacting with their Authors) aims at studying the theoretical and methodological issues of the PMT approach, to be tested by first building a small prototype, and more generally at promoting this concept within the MT community.</Paragraph>
    <Paragraph position="11"> We are trying to develop an architecture which would be suitable for very large applications, to be upscaled later with industrial partners if results are promising enough. For example, we don't intend to incorporate more than a few hundred or thousand words in the prototype's (LIDIA-1) dictionaries, although we try to develop robust indexing schemes and to implement the lexical data base in a way which would allow supporting on the order of 1 to 10 Mwords in 10 languages. The same goes for the grammars.</Paragraph>
    <Paragraph position="12"> Even in a prototype, however, the structure of the dialogue with the author must be studied with care, and offers interesting possibilities. Clearly, the writer should be allowed to write freely, and to decide for himself when and on which part of his document to start any kind of interaction. But changes in the text should be controlled so that not all changes would force the system to start the interaction anew.</Paragraph>
    <Paragraph position="13"> From a linguistic point of view, it is extremely exciting to see, at last, a chance to experiment with Zemb's theme/rheme/pheme "statutory" articulation of propositions [13], and/or Prague's topic/focus opposition, which are claimed to be of utmost importance for translation: both are almost impossible to compute automatically, because the tests are very often expressed in terms of possible transformations in a given discourse context. But, in PMT, we may ask the author.</Paragraph>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2. Outline
</SectionTitle>
    <Paragraph position="0"> The prototype system for LIDIA-1 is constrained as follows.</Paragraph>
    <Paragraph position="1"> Translation from French into Russian, German and English (inverting previous systems), with other target languages being studied in cooperative frameworks; small corpus from the Ariane-G5 user interface. The choice of HyperCard reflects the fact that Hypertexts are becoming the favorite supports for technical documentation. It also relies on the assumption that writers will more readily agree to participate in a dialogue if the tool they are using is very interactive than if they use a more classical text processor. Finally, there are some linguistic advantages. First, the textual parts are clearly isolated in fields, and not cluttered with images, formulas, tabs, markups, etc. Scripts should not be translated: if they generate messages, these must be taken from normal fields, and not directly generated (linguistic requirements may lead to better programming practices!).</Paragraph>
    <Paragraph position="2"> Second, the textual parts may be typed, thus greatly facilitating analysis. For example, a given field may contain only titles, another only menu items, another only sentences without the initial subject (which is often contained in another field), etc. A distinct possibility is to define microlanguages as types of very short textual fragments (less than 2 or 3 lines, to be concrete), and to define sublanguages as structured collections of microlanguages for longer textual fragments.</Paragraph>
    <Paragraph position="3"> Distributed architecture. The idea of using a distributed architecture has both a practical and a theoretical basis. First, we want to use the Ariane-G5 system, a comprehensive generator of MT systems developed over many years [11]. Although some micros can support this system (PC-AT/370, PS2/7437), their user-friendliness and availability are no match for those of the Mac.</Paragraph>
    <Paragraph position="4"> Second, looking at some other experiences (Alps, Weidner), we have concluded that some parts of sophisticated natural language processing cannot be performed in real time on small and cheap machines without oversimplifying the linguistic parts and degrading quality down to near uselessness. Rather, it should be possible to perform the "heavy" parts in an asynchronous but still user-friendly way, as IBM researchers have done for the Critique system [9]. Of course, this idea could be implemented on a single machine running under a multitasking operating system, if such a system were available on the most popular micros, and provided the heavy linguistic computations don't take hours.</Paragraph>
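    <Paragraph> As an illustration of the asynchronous organization described above, here is a minimal Python sketch. It is not the LIDIA implementation: the names AnalysisQueue and heavy_analysis are hypothetical, and a background thread merely stands in for the machine that would run the heavy linguistic processing.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, Dict


def heavy_analysis(text: str) -> str:
    """Hypothetical stand-in for the slow linguistic processing."""
    return "analysed: " + text


class AnalysisQueue:
    """Runs heavy processing in the background while the author keeps writing."""

    def __init__(self, analyse: Callable[[str], str] = heavy_analysis) -> None:
        self._analyse = analyse
        self._pool = ThreadPoolExecutor(max_workers=1)
        self._pending: Dict[str, Future] = {}

    def submit(self, unit_id: str, text: str) -> None:
        """Queue one unit of translation; returns immediately."""
        self._pending[unit_id] = self._pool.submit(self._analyse, text)

    def collect_ready(self) -> Dict[str, str]:
        """Return the results finished so far, without blocking."""
        ready = {uid: fut.result()
                 for uid, fut in self._pending.items() if fut.done()}
        for uid in ready:
            del self._pending[uid]
        return ready

    def close(self) -> Dict[str, str]:
        """Wait for all remaining work and return its results."""
        self._pool.shutdown(wait=True)
        return self.collect_ready()


queue = AnalysisQueue()
queue.submit("field-1", "Ouvrir le fichier.")
queue.submit("field-2", "Fermer la pile.")
results = queue.close()
```

The editor would call collect_ready from time to time and open a dialogue only for the units whose analysis has come back, so the writer is never blocked by the heavy computation.</Paragraph>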
    <Paragraph position="5"> Guided Language. The "guided language approach" is a middle road between free and controlled text. The key to quality in MT, as in other areas of AI, is to restrict the domain in an acceptable way.</Paragraph>
    <Paragraph position="6"> By "controlled language", we understand a subset of natural language restricted in such a way that ambiguities disappear. That is the approach of the TITUS system: no text is accepted unless it completely conforms to one predefined sublanguage. While this technique works very well in a very restricted domain, with professionals producing the texts (technical abstracts on textiles, in this case), it seems impossible to generalize it to open-ended uses involving the general public.</Paragraph>
    <Paragraph position="7"> What seems possible is to define a collection of microlanguages or sublanguages, to associate one with each unit of translation, and to induce the writer/speaker to conform to it, or else to choose another one.</Paragraph>
    <Paragraph position="8"> Hybrid Transfer/Interlingua. By "hybrid Transfer/Interlingua", we mean that the interface structures produced by analysis are multilevel structures of the source language, in the sense of Vauquois [4, 11, see also 2, 3], where some parts are universal (logico-semantic relations, semantic features, abstract time, discourse type...), while others are language-specific (morphosyntactic class, gender, number, lexical elements, syntactic functions...). In PMT, because of the necessity of lexical clarification, we should go one step further toward interlingua by relating the "word senses" of the vocabularies of all the languages considered in the system and making them independent objects in the lexical data base.</Paragraph>
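    <Paragraph> To make the idea of word senses as independent objects concrete, the following Python sketch shows one possible layout. It is an assumption for illustration only: the classes WordSense and LexicalDB, the sense identifiers and the glosses are hypothetical and do not describe the actual LIDIA lexical data base.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass
class WordSense:
    """A language-independent word sense, stored as its own object."""
    sense_id: str
    gloss: str
    # language code -> lemmas expressing this sense in that language
    lemmas: Dict[str, Set[str]] = field(default_factory=dict)


class LexicalDB:
    """Toy lexical data base linking lemmas of several languages via senses."""

    def __init__(self) -> None:
        self.senses: Dict[str, WordSense] = {}
        self._by_lemma: Dict[Tuple[str, str], List[str]] = {}

    def add_sense(self, sense_id: str, gloss: str) -> None:
        self.senses[sense_id] = WordSense(sense_id, gloss)

    def link(self, sense_id: str, lang: str, lemma: str) -> None:
        """Declare that `lemma` of language `lang` can express the sense."""
        self.senses[sense_id].lemmas.setdefault(lang, set()).add(lemma)
        self._by_lemma.setdefault((lang, lemma), []).append(sense_id)

    def senses_of(self, lang: str, lemma: str) -> List[WordSense]:
        """All senses of a lemma; more than one calls for clarification."""
        return [self.senses[s] for s in self._by_lemma.get((lang, lemma), [])]

    def translations(self, sense_id: str, target_lang: str) -> Set[str]:
        """Lemmas of the target language attached to the chosen sense."""
        return set(self.senses[sense_id].lemmas.get(target_lang, set()))


db = LexicalDB()
db.add_sense("avocat#1", "person who practises law")
db.add_sense("avocat#2", "fruit of the avocado tree")
db.link("avocat#1", "fr", "avocat")
db.link("avocat#1", "en", "lawyer")
db.link("avocat#2", "fr", "avocat")
db.link("avocat#2", "en", "avocado")

candidates = db.senses_of("fr", "avocat")
```

Here the French lemma has two candidate senses, so the system would ask the author which one is meant; once a sense is chosen, translations returns the target-language lemmas without any language-pair dictionary.</Paragraph>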
    <Paragraph position="9"> II. Structure of the dialogue with the writer. Interactions concerning typology, terminology and style. Hence, the first interaction planned in LIDIA concerns typology: given a stack, the system will first construct a "shadow" file. For each textual field, it will ask its typology (microlanguage for very small texts, sublanguages for others), and attach it to the corresponding shadow record. In the case of "incomplete" texts, where for example the subject of the first sentence is to be taken from another field (as in tables containing command names and their explanations), it will ask how to construct a complete text for translation, and attach the corresponding rule to the shadow record.</Paragraph>
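    <Paragraph> The shadow file can be pictured with a short Python sketch. Everything in it is an assumption made for illustration: the record layout, the function names, and in particular the representation of a completion rule as a template whose placeholders name other fields of the stack.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class ShadowRecord:
    """One record of the shadow file, mirroring one textual field."""
    field_id: str
    typology: str                          # a microlanguage or sublanguage
    completion_rule: Optional[str] = None  # template for "incomplete" texts


def build_shadow_file(
    fields: Dict[str, str],
    ask_typology: Callable[[str, str], str],
    ask_rule: Callable[[str, str], Optional[str]],
) -> Dict[str, ShadowRecord]:
    """Ask the author about each field and keep the answers in records."""
    shadow = {}
    for field_id, text in fields.items():
        typology = ask_typology(field_id, text)
        rule = ask_rule(field_id, text)
        shadow[field_id] = ShadowRecord(field_id, typology, rule)
    return shadow


def text_for_translation(
    field_id: str,
    fields: Dict[str, str],
    shadow: Dict[str, ShadowRecord],
) -> str:
    """Apply the completion rule, if any, to obtain a complete text."""
    rule = shadow[field_id].completion_rule
    if rule is None:
        return fields[field_id]
    # Placeholders in the template are replaced by the named fields.
    return rule.format(**fields)


# Canned answers stand in for the dialogue with the author.
fields = {"cmd": "Save", "expl": "writes the stack to disk."}
shadow = build_shadow_file(
    fields,
    ask_typology=lambda fid, text: "command name" if fid == "cmd" else "explanation",
    ask_rule=lambda fid, text: "{cmd} {expl}" if fid == "expl" else None,
)
complete = text_for_translation("expl", fields, shadow)
```

In this example the explanation field lacks a subject, and the rule attached to its shadow record rebuilds the full sentence from the command name before translation.</Paragraph>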
    <Paragraph position="10"> The second level of interaction concerns spelling.</Paragraph>
    <Paragraph position="11"> Any spellchecker will do. However, it would be best to use a lemmatizer relying on the lexical database of the system, as the user must be allowed to enter new words and will expect a coherent behavior of the entire system. Level three concerns terminology. The lexical database should contain thesaurus relations, indicating among other things the preferred term among a cluster of (quasi-)synonyms (e.g. plane/aircraft/ship/plane). Which term is preferred often depends on local decisions: it should be easy to change it for a particular stack, without of course duplicating the thesaurus. Note that the lexical database should contain a great variety of terms, even incorrect or dubious ones, whereas terminological databases are usually restricted to normalized or recommended terms. In PMT, we only want to guide the author: if s/he prefers to use a non-standard term, that should be allowed.</Paragraph>
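    <Paragraph> A minimal Python sketch of a thesaurus with stack-local preferences follows. The class Thesaurus and its methods are hypothetical, and representing the local decisions as a small mapping from the default preferred term to the locally preferred one is an assumption, chosen so that the shared thesaurus never has to be copied.

```python
from typing import Dict, FrozenSet, Iterable, Optional


class Thesaurus:
    """Clusters of (quasi-)synonyms with a default preferred term each."""

    def __init__(self) -> None:
        self._cluster_of: Dict[str, FrozenSet[str]] = {}
        self._preferred: Dict[FrozenSet[str], str] = {}

    def add_cluster(self, terms: Iterable[str], preferred: str) -> None:
        cluster = frozenset(terms)
        if preferred not in cluster:
            raise ValueError("preferred term must belong to the cluster")
        for term in cluster:
            self._cluster_of[term] = cluster
        self._preferred[cluster] = preferred

    def suggest(self, term: str,
                overrides: Optional[Dict[str, str]] = None) -> Optional[str]:
        """Return the term to suggest instead of `term`, or None.

        `overrides` maps a default preferred term to the one preferred
        in a particular stack; the shared thesaurus is left untouched.
        """
        cluster = self._cluster_of.get(term)
        if cluster is None:
            return None  # unknown term: nothing to suggest
        default = self._preferred[cluster]
        preferred = (overrides or {}).get(default, default)
        return None if term == preferred else preferred


thesaurus = Thesaurus()
thesaurus.add_cluster(["plane", "aircraft", "airplane"], preferred="aircraft")

general = thesaurus.suggest("plane")
local = thesaurus.suggest("plane", overrides={"aircraft": "plane"})
```

The result is only a suggestion: the author remains free to keep the term s/he typed, in line with the guiding (rather than controlling) role of the system.</Paragraph>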
    <Paragraph position="12"> Level four concerns style, understood in a simply quantitative way (average length of sentences, frequency of complex conjuncts/disjuncts, rare verbal forms, specific words like dont in French, relative frequency of nouns/articles, etc.). From the experience of CRITIQUE [9], it seems that such methods, which work in real time, may be very useful as a first step to guide towards the predetermined text types (micro- or sub-languages).</Paragraph>
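    <Paragraph> Two of the quantitative indicators mentioned above can be sketched in a few lines of Python. This is an illustrative assumption, not the method of CRITIQUE or LIDIA: the function style_profile is hypothetical, and sentences are segmented naively at full stops, exclamation marks and question marks.

```python
import re
from typing import Dict, Iterable


def style_profile(text: str,
                  watch_words: Iterable[str] = ("dont",)) -> Dict[str, float]:
    """Compute simple quantitative style indicators for a text fragment."""
    # Naive segmentation (an assumption): a sentence ends at ., ! or ?
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"\w+", text.lower())
    profile = {
        "sentences": float(len(sentences)),
        "avg_sentence_length": len(words) / len(sentences) if sentences else 0.0,
    }
    for word in watch_words:
        # Relative frequency of each watched word among all words.
        profile["freq_" + word] = words.count(word) / len(words) if words else 0.0
    return profile


profile = style_profile("Le livre dont je parle est ici. Il est rouge.")
```

Such figures are cheap enough to compute while the author types, and could be compared with the limits attached to a given microlanguage or sublanguage.</Paragraph>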
  </Section>
</Paper>