<?xml version="1.0" standalone="yes"?>
<Paper uid="A83-1026">
  <Title>INVESTIGATING THE POSSIBILITY OF A MICROPROCESSOR-BASED MACHINE TRANSLATION SYSTEM</Title>
  <Section position="1" start_page="0" end_page="0" type="metho">
    <SectionTitle>
INVESTIGATING THE POSSIBILITY OF A MICROPROCESSOR-BASED
MACHINE TRANSLATION SYSTEM
</SectionTitle>
    <Paragraph position="0"/>
  </Section>
  <Section position="2" start_page="0" end_page="169" type="metho">
    <SectionTitle>
ABSTRACT
</SectionTitle>
    <Paragraph position="0"> This paper describes an on-going research project being carried out by staff and students at the Centre for Computational Linguistics to examine the feasibility of Machine Translation (MT) in a microprocessor environment. The system incorporates as far as possible features of large-scale MT systems that have proved desirable or effective: it is multilingual, algorithms and data are strictly separated, and the system is highly modular. Problems of terminological polysemy and syntactic complexity are reduced via the notions of controlled vocabulary and restricted syntax. Given these constraints, it seems feasible to achieve translation via an 'interlingua', avoiding any language-pair oriented 'transfer' stage. The paper concentrates on a description of the separate modules in the translation process as they are currently envisaged, and details some of the problems specific to the microprocessor-based approach to MT that have so far come to light.</Paragraph>
    <Paragraph position="1"> I. BACKGROUND AND OVERVIEW This paper describes preliminary research in the design of Bede, a limited-syntax controlled-vocabulary Machine Translation system to run on a microprocessor, translating between English, French, German and Dutch. Our experimental corpus is a car-radio manual. Bede (named after the 7th Century English linguist) is essentially a research project: we are not immediately concerned with commercial applications, though such are clearly possible if the research proves fruitful.</Paragraph>
    <Paragraph position="2"> &amp;quot;:ork on Bede ac this stage thouRh is primarily experimentnl. The aim at the moment \[s co investigate the extent to which a microprocessor-based ~ system of advanced desi2n is Possible, and the limitations that have to be imposed in order co achieve .~ ~or;&lt;in~ system. This paper 'Je~crihes the overall system design snecif~c~Cion t) .~nPScn we are currently working.</Paragraph>
    <Paragraph position="3"> In the basic design of the system we attempt to incorporate as far as possible features of large-scale MT systems that have proved to be desirable or effective. Thus, Bede is multilingual by design, algorithms and linguistic data are strictly separated, and the system is designed in more or less independent modules.</Paragraph>
    <Paragraph position="4"> The microprocessor environment means that severe restrictions of size are imposed: data structures  both dynamic (created by and manipulated during the translation process) and static (dictionaries and linguistic rule packages) are constrained to be as economical in terms of storage space and access procedures as possible. Limitations on in-core and peripheral storage are important considerations in the system design.</Paragraph>
    <Paragraph position="5"> In large general purpose MT systems, it is necessary to assume that failure to translate the given input correctly is generally not due to incorrectly formed input, but to insufficiently elaborated translation algorithms. This is particularly due to two problems: the lexical problem of choice of appropriate translation equivalents, and the strategic problem of effective analysis of the wide range of syntactic patterns found in natural language. The reduction of these problems via the notions of controlled vocabulary and restricted syntax seems particularly appropriate in the microprocessor environment, since the alternative of making a system infinitely extendable is probably not feasible. Given these constraints, it seems feasible to achieve translation via an Interlingua, in which the canonical structures from the source language are mapped directly onto those of the target language(s), avoiding any language-pair oriented 'transfer' stage. Translation thus takes place in two phases: analysis of source text and synthesis of target text.</Paragraph>
    <Paragraph position="6"> A. Incorporation of recent design principles Modern MT system design can be characterised by three principles that have proved to be desirable and effective (Lehmann et al, 1980:1-3): each of these is adhered to in the design of Bede.</Paragraph>
    <Paragraph position="7"> Bede is multilingual by design: early MT systems were designed with specific language-pairs in mind, and translation algorithms were elaborated on this basis. The main consequence of this was that source language analysis was effected within the perspective of the given target language, and was therefore often of little or no use on the addition into the system of a further language (cf. King, 1981:12; King &amp; Perschke, 1982:28).</Paragraph>
    <Paragraph position="8"> In Bede, there is a strict separation of algorithms and linguistic data: early MT systems were quite simply 'translation programs', and any underlying linguistic theory which might have been present was inextricably bound up with the program itself. This clearly entailed the disadvantage that any modification of the system had to be done by a skilled programmer (cf. Johnson, 1980:140).</Paragraph>
    <Paragraph position="9"> Furthermore, the side-effects of apparently quite innocent modifications were often quite far-reaching and difficult to trace (see for example Bostad, 1982:130). Although this has only recently become an issue in MT (e.g. Vauquois, 1979:1.3; 1981:10), it has of course for a long time been standard practice in other areas of knowledge-based programming (Newell, 1973; Davis &amp; King, 1977).</Paragraph>
    <Paragraph position="10"> The third principle now current in MT and to be incorporated in Bede is that the translation process should be modular. This approach was a feature of the earliest 'second generation' systems (cf. Vauquois, 1975:33), and is characterised by the general notion that any complicated computational task is best tackled by dividing it up into smaller more or less independent sub-tasks which communicate only by means of a strictly defined interface protocol (Aho et al, 1974). This is typically achieved in the MT environment by a gross division of the translation process into analysis of source language and synthesis of target language, possibly with an intermediate transfer stage (see I.D below), with these phases in turn sub-divided, for example into morphological, lexical and syntactico-semantic modules. This modularity may be reflected both in the linguistic organisation of the translation process and in the provision of software devices specifically tailored to the relevant sub-task (Vauquois, 1975:33). This is the case in Bede, where for each sub-task a grammar interpreter is provided which has the property of being no more powerful than necessary for the task in question. This contrasts with the approach taken in TAUM-METEO (TAUM, 1973), where a single general-purpose device (Colmerauer's (1970) 'Q-Systems') is provided, with the associated disadvantage that for some 'simple' tasks the superfluous power of the device means that processes are seriously uneconomical. Bede incorporates five such 'grammar types' with associated individual formalisms and processors: these are described in detail in the second half of this paper.</Paragraph>
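The modular organisation described above, with independent sub-tasks communicating only through a strictly defined interface, can be pictured with a toy sketch. This is our own illustration, not Bede's code: the stage names and the data shapes passed between stages are invented for the example.

```python
# Toy sketch of the modular design: each sub-task is an independent stage
# with an interpreter "just powerful enough" for it, and stages talk to
# each other only through the data they hand on. Illustrative, not Bede's.

class Stage:
    def __init__(self, name, interpreter):
        self.name = name                 # e.g. "string segmentation"
        self.interpreter = interpreter   # the stage's own minimal processor

    def run(self, data):
        return self.interpreter(data)

def translate(text, stages):
    data = text
    for stage in stages:
        data = stage.run(data)           # defined interface: data in, data out
    return data

stages = [
    Stage("segmentation", lambda t: t.split()),
    Stage("analysis", lambda words: {"words": words}),
    Stage("synthesis", lambda rep: " ".join(rep["words"])),
]
print(translate("the radio plays", stages))  # the radio plays
```

Because each stage sees only the interface data, a stage can be replaced (say, a better analyser) without touching its neighbours, which is the point of the design principle.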
    <Paragraph position="11"> B. The microprocessor environment It is in the microprocessor basis that the principal interest in this system lies, and, as mentioned above, the main concern is the effects of the restrictions that the environment imposes. Development of the Bede prototype is presently taking place on Z80-based machines which provide 64k bytes of in-core memory and 720k bytes of peripheral store on two 5-1/4" double-sided double-density floppy disks. The intention is that any commercial version of Bede would run on more powerful processors with larger address space, since we feel that such machines will soon rival the popularity of the less powerful Z80's as the standard desk-top hardware. Programming so far has been in Pascal-M (Sorcim, 1979), a Pascal dialect closely resembling UCSD Pascal, but we are conscious of the fact that both C (Kernighan &amp; Ritchie, 1978) and BCPL (Richards &amp; Whitby-Strevens, 1979) may be more suitable for some of the software elements, and do not rule out completing the prototype in a number of languages. This adds the burden of designing compatible data-structures and interfaces, and we are currently investigating the relative merits of these languages. Portability and efficiency seem to be in conflict here.</Paragraph>
    <Paragraph position="12"> Microprocessor-based MT contrasts sharply with the mainframe-based activity, where the significance of problems of economy of storage and efficiency of programs has decreased in recent years. The possibility of introducing an element of human interaction with the system (cf. Kay, 1980; Melby, 1981) is also highlighted in this environment. Contrast systems like SYSTRAN (Toma, 1977) and GETA (Vauquois, 1975, 1979; Boitet &amp; Nedobejkine, 1980) which work on the principle of large-scale processing in batch mode.</Paragraph>
    <Paragraph position="13"> Our experience so far is that economy and efficiency in data-structure design and in the elaboration of interactions between programs and data and between different modules is of paramount importance. While it is relatively evident that large-scale MT can be simulated in the microprocessor environment, the cost in real time is tremendous: entirely new design and implementation strategies seem to be called for.</Paragraph>
    <Paragraph position="14"> The ancient skills of the programmer that have become eroded by the generosity afforded by modern mainframe configurations become highly valued in this microprocessor application.</Paragraph>
    <Paragraph position="15"> C. Controlled vocabulary and restricted syntax The state of the art of language processing is such that the analysis of a significant range of syntactic patterns has been shown to be possible, and by means of a number of different approaches.</Paragraph>
    <Paragraph position="16"> Research in this area nowadays is concentrated on the treatment of more problematic constructions (e.g. Marcus, 1980). This observation has led us to believe that a degree of success in a small-scale MT project can be achieved via the notion of restricting the complexity of acceptable input, so that only constructions that are sure to be correctly analysed are permitted. This notion of restricted syntax has been tried with some success in larger systems (cf. Elliston, 1979; Lawson, 1979:81f; Somers &amp; McNaught, 1980:40), resulting both in more accurate translation, and in increased legibility from the human point of view. As Elliston points out, the development of strict guidelines for writers leads not only to the use of simpler constructions, but also to the avoidance of potentially ambiguous text. In either case, the benefits for MT are obvious.</Paragraph>
    <Paragraph position="17"> Less obvious however is the acceptability of such constraints; yet 'restricted syntax' need not imply 'baby talk', and a reasonably extensive range of constructions can be included.</Paragraph>
    <Paragraph position="18"> Just as problems of syntactic analysis can be</Paragraph>
    <Paragraph position="19"> alleviated by imposing some degree of control over  the syntactic complexity of the input, so the corresponding problem of lexical disambiguation that large-scale MT systems are faced with can be eased by the notion of controlled vocabulary. A major problem for MT is the choice of appropriate translation equivalents at the lexical level, a choice often determined by a variety of factors at all linguistic levels (syntax, semantics, pragmatics). In the field of multilingual terminology, this problem has been tackled via the concept of terminological equivalence (Wüster, 1971): for a given concept in one language, a translation in another language is established, these being considered by definition to be in one-to-one correspondence. In the case of Bede, where the subject-matter of the texts to be translated is fixed, such an approach for the 'technical terms' in the corpus is clearly feasible; the notion is extended as far as possible to general vocabulary as well. For each concept a single term only is permitted, and although the resulting style may appear less mature (since the use of near synonyms for the sake of variety is not permitted), the problems described above are somewhat alleviated. Polysemy is not entirely avoidable, but if reduced to a bare minimum, and permitted only in specific and acknowledged circumstances, the problem becomes more easily manageable.</Paragraph>
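The one-to-one terminological correspondence just described amounts to a table keyed by concept, with exactly one term per language. The sketch below is ours, not Bede's dictionary format, and the concept identifier and target-language terms are illustrative examples from the car-radio domain.

```python
# Illustrative controlled-vocabulary term bank: one concept, exactly one
# term per language, so choosing a translation equivalent reduces to a
# table look-up. Concept id and terms are our own examples.

term_bank = {
    "volume-control": {
        "en": "volume control",
        "fr": "réglage du volume",
        "de": "Lautstärkeregler",
        "nl": "volumeregelaar",
    },
}

def term_for(concept, language):
    # one-to-one by construction: no choice of equivalent ever arises
    return term_bank[concept][language]

print(term_for("volume-control", "de"))  # Lautstärkeregler
```

The restriction that bites is stylistic, not technical: since each concept has a single term, near-synonyms used for variety simply cannot be expressed.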
    <Paragraph position="20"> D. Interlingua A significant dichotomy in MT is between the 'transfer' and 'interlingua' approaches. The former can be characterised by the use of bilingual transfer modules which convert the results of the analysis of the source language into a representation appropriate for a specific target language. This contrasts with the interlingua approach in which the result of analysis is passed directly to the appropriate synthesis module.</Paragraph>
    <Paragraph position="21"> It is beyond the scope of the present paper to discuss in detail the relative merits of the two approaches (see Vauquois, 1975:142ff; Hutchins, 1978). We should however consider some of the major obstacles inherent in the interlingua approach.</Paragraph>
    <Paragraph position="22"> The development of an Interlingua for various purposes (not only translation) has been the subject of philosophical debate for some years, and proposals for MT have included the use of formalized natural language (e.g. Mel'čuk, 1974; Andreev, 1967), artificial languages (like Esperanto), or various symbolic representations, whether linear (e.g. Bulcins, 1961) or otherwise (e.g. Wilks, 1973). Most of these approaches are problematic however (for a thorough discussion of the interlingua approach to MT, see Otten &amp; Pacak (1971) and Barnes (1983)). Nevertheless, some interlingua-based MT systems have been developed to a considerable degree: for example, the Grenoble team's first attempts at MT took this approach (Veillon, 1968), while the TITUS system still in use at the Institut Textile de France (Ducrot, 1972; Zingel, 1978) is claimed to be interlingua-based.</Paragraph>
    <Paragraph position="23">  It seems that it can be assumed a priori that an entirely language-independent theoretical representation of a given text is for all practical purposes impossible. A more realistic target seems to be a representation in which significant syntactic differences between the languages in question are neutralized, so that the best one can aim for is a languages-specific (sic) representation. This approach implies the definition of an Interlingua which takes advantage of anything the languages in the system have in common, while accommodating their idiosyncrasies.</Paragraph>
    <Paragraph position="24"> This means that for a system which involves several fairly closely related languages the interlingua approach is at least feasible, on the understanding that the introduction of a significantly different type of language may involve the complete redefinition of the Interlingua (Barnes, 1983). From the point of view of Bede, then, the common base of the languages involved can be used to great advantage. The notion of restricted syntax described above can be employed to filter out constructions that cause particular problems for the chosen Interlingua representation.</Paragraph>
    <Paragraph position="25"> There remains however the problem of the representation of lexical items in the Interlingua. Theoretical approaches to this problem (e.g. Andreev, 1967) seem quite unsatisfactory. But the notion of controlled vocabulary seems to offer a solution. If a one-to-one equivalence of 'technical' terms can be achieved, this leaves only a relatively small area of vocabulary for which an interlingual representation must be devised. It seems reasonable, on a small scale, to treat general vocabulary in an analogous way to technical vocabulary, in particular treating lexical items in one language that are ambiguous with respect to any of the other languages as 'homographs'. Their 'disambiguation' must take place in Analysis, as there is no bilingual 'Transfer' phase, and Synthesis is purely deterministic. While this approach would be quite unsuitable for a large-scale general purpose MT system, in the present context - where the problem can be minimised - it seems to be a reasonable approach.</Paragraph>
    <Paragraph position="26"> Our own model for the Bede Interlingua has not yet been finalised. We believe this to be an area for research and experimentation once the system software has been more fully developed. Our current hypothesis is that the Interlingua will take the form of a canonical representation of the text in which valency-boundness and (deep) case will play a significant role. Sentential features such as tense and aspect will be captured by a 'universal' system of values for the languages involved. This conception of an Interlingua clearly falls short of the language-independent pivot representation typically envisaged (cf.</Paragraph>
    <Paragraph position="27"> Boitet &amp; Nedobejkine, 1980:2), but we hope to demonstrate that it is sufficient for the languages in our system, and that it could be adapted without significant difficulties to cater for the introduction of other (related) Western European languages. We feel that research in this area will, when the time comes, be a significant and valuable by-product of the project as a whole. II. DESCRIPTION OF THE SYSTEM DESIGN In this second half of the paper we present a description of the translation process in Bede, as it is currently envisaged. The process is divided broadly into two parts, analysis and synthesis, the interface between the two being provided by the Interlingua. The analysis module uses a Chart-like structure (cf. Kaplan, 1973) and a series of grammars to produce from the source text the Interlingua tree structure which serves as input to synthesis, where it is rearranged into a valid surface structure for the target language.</Paragraph>
    <Paragraph position="28"> The 'translation unit' (TU) is taken to be the sentence, or equivalent (e.g. section heading, title, figure caption). Full details of the rule formalisms are given in Somers (1981).</Paragraph>
    <Paragraph position="29"> A. String segmentation The TU is first subjected to a two-stage string-segmentation and 'lemmatisation' analysis. In the first stage it is compared word by word with a 'stop-list' of frequently occurring words (mostly function words); words not found in the stop-list undergo string-segmentation analysis, again on a word by word basis. String-segmentation rules form a finite-state grammar of affix-stripping rules ('A-rules') which handle mostly inflectional morphology. The output is a Chart with labelled arcs indicating lexical unit (LU) and possible interpretation of the stripped affixes, this 'hypothesis' to be confirmed by dictionary look-up. By way of example, consider (1), a possible French rule, which takes any word ending in -issons (e.g. finissons or hérissons) and constructs an arc on the Chart recording the hypothesis that the word is an inflected form of an '-ir' verb (i.e. finir or *hérir).</Paragraph>
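Rule (1) itself does not survive in this copy of the paper, but its described effect can be sketched directly. The encoding below and the field names of the hypothesis record are our assumptions; only the -issons/-ir behaviour comes from the text.

```python
import re

# Sketch of the effect of A-rule (1): a word ending in -issons yields the
# hypothesis that it is an inflected form of an '-ir' verb. Dictionary
# look-up later confirms the hypothesis (finir) or rejects it (*hérir).

def a_rule_issons(word):
    m = re.fullmatch(r"(..+)issons", word)
    if m:
        return {"lu": m.group(1) + "ir", "affix": "-issons"}
    return None    # rule does not fire; other A-rules may

print(a_rule_issons("finissons"))   # hypothesis: LU 'finir'
print(a_rule_issons("hérissons"))   # hypothesis: LU '*hérir', rejected later
print(a_rule_issons("radio"))       # no match: None
```

Note that the rule is deliberately dumb: hérissons (the noun) triggers the same hypothesis as finissons, and it is dictionary look-up, not the A-rule, that sorts the two cases out.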
    <Paragraph position="31"> At the end of dictionary look-up, a temporary 'sentence dictionary' is created, consisting of copies of the dictionary entries for (only) those LUs found in the current TU. This is purely an efficiency measure. The sentence dictionary may of course include entries for homographs which will later be rejected.</Paragraph>
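The temporary sentence dictionary can be sketched as a filtered copy of the main dictionary. The data layout and the sample entries below are our own illustration of the mechanism, not Bede's dictionary format.

```python
# Sketch of the 'sentence dictionary': entries are copied from the main
# dictionary only for those LUs hypothesised in the current TU. Entries
# whose look-up fails (rejected hypotheses) simply never get copied.

main_dictionary = {
    "finir": {"class": "V", "conj": "2nd-group"},
    "radio": {"class": "N", "gender": "f"},
}

def build_sentence_dictionary(hypothesised_lus, dictionary):
    """Copy entries for (only) the LUs found in the current TU."""
    return {lu: dictionary[lu]
            for lu in hypothesised_lus if lu in dictionary}

# 'hérir' was a spurious A-rule hypothesis: it is absent from the
# dictionary, so it drops out here.
sent_dict = build_sentence_dictionary(["finir", "hérir", "radio"],
                                      main_dictionary)
print(sorted(sent_dict))  # ['finir', 'radio']
```

The gain is purely one of access cost: later stages consult the small in-core sentence dictionary instead of re-reading the large dictionary on peripheral store.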
    <Paragraph position="32"> B. Structural analysis 1. 'P-rules' The Chart then undergoes a two-stage structural analysis. In the first stage, context-sensitive augmented phrase-structure rules ('P-rules') work towards creating a single arc spanning the entire TU. Arcs are labelled with appropriate syntactic class and syntactico-semantic feature information and a trace of the lower arcs which have been subsumed, from which the parse tree can be simply extracted. The trivial P-rule (2) is provided as an example.</Paragraph>
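Since P-rule (2) is lost from this copy, here is a minimal sketch of the mechanism: a rule combines a contiguous sequence of arcs into one new arc and records the subsumed arcs as its trace. The arc representation and rule encoding are our own assumptions, far simpler than the context and assignment stipulations of the real formalism.

```python
# Minimal chart-arc sketch: apply_p_rule looks for a contiguous sequence
# of arcs matching the rule's right-hand side and, if found, builds one
# spanning arc whose trace records the subsumed arcs (the parse tree).

class Arc:
    def __init__(self, start, end, label, trace=()):
        self.start, self.end = start, end
        self.label, self.trace = label, trace

def apply_p_rule(lhs_label, rhs_labels, arcs):
    for a in arcs:
        seq = [a]
        while len(seq) != len(rhs_labels):
            nxt = next((b for b in arcs if b.start == seq[-1].end), None)
            if nxt is None:
                break
            seq.append(nxt)
        if [x.label for x in seq] == rhs_labels:
            return Arc(seq[0].start, seq[-1].end, lhs_label, tuple(seq))
    return None

arcs = [Arc(0, 1, "DET"), Arc(1, 2, "N")]
np = apply_p_rule("NP", ["DET", "N"], arcs)   # a trivial NP rule, like (2)
print(np.label, np.start, np.end)             # NP 0 2
```

Iterating rules like this until one arc spans the whole TU is the goal of the first structural-analysis stage; the parse tree is then read off the traces.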
    <Paragraph position="33">  'geometry', and 'assignment stipulations'. The nodes of the Chart are by default identified by the value of the associated variable CLASS, though it is also possible to refer to a node by a local variable name and test for or assign the value of CLASS in the stipulations. Our rule formalisms are quite deliberately designed to reflect the formalisms of traditional linguistics.</Paragraph>
    <Paragraph position="34"> This formalism allows experimentation with a large number of different context-free parsing algorithms. We are in fact still experimenting in this area. For a similar investigation, though on a machine with significantly different time and space constraints, see Slocum (1981).</Paragraph>
    <Paragraph position="35"> 2. 'T-rules' In the second stage of structural analysis, the tree structure implied by the labels and traces on these arcs is disjoined from the Chart and undergoes general tree-to-tree transductions as described by 'T-rules', resulting in a single tree structure representing the canonical form of the TU.</Paragraph>
    <Paragraph position="36"> The formalism for the T-rules is similar to that for the P-rules, except in the geometry part, where tree structures rather than arc sequences are defined. Consider the necessarily more complex (though still simplified) example (3), which regularises a simple English passive.</Paragraph>
    <Paragraph position="38"> Notice the necessity to 'disambiguate' the two NPs via curly-bracketted disambiguators; the possibility of defining a partial geometry via the 'dummy' symbol ($); and how the AUX and PREP are eliminated in the resulting tree structure.</Paragraph>
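Example (3) itself is lost from this copy, but the transduction it describes can be approximated as follows. The tuple-based tree shape is our own simplification; the real formalism's disambiguators, dummy symbol, and default label copying are not modelled.

```python
# Sketch of a passive-regularising T-rule: match the geometry of a simple
# English passive, swap the two NPs into canonical order, and drop the
# AUX and PREP from the output tree. Trees are (label, children) tuples.

def t_rule_passive(tree):
    label, children = tree
    if label == "S" and [c[0] for c in children] == \
            ["NP", "AUX", "V", "PREP", "NP"]:
        np1, _aux, v, _prep, np2 = children
        return ("S", [np2, v, np1])   # canonical, active-like order
    return tree                       # geometry does not match: no change

passive = ("S", [("NP", "the aerial"), ("AUX", "is"), ("V", "fitted"),
                 ("PREP", "by"), ("NP", "the dealer")])
print(t_rule_passive(passive))
```

The point of such regularisation is that synthesis never sees a passive as such: both "the dealer fits the aerial" and its passive map to the same canonical tree, which is what makes the Interlingua mapping tractable.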
    <Paragraph position="39"> Labellings for nodes are copied over by default unless specifically suppressed.</Paragraph>
    <Paragraph position="40"> With source-language LUs replaced by unique multilingual-dictionary addresses, this canonical representation is the Interlingua which is passed for synthesis into the target language(s).</Paragraph>
    <Paragraph position="41"> C. Synthesis Assuming the analysis has been correctly performed, synthesis is a relatively straightforward deterministic process. Synthesis commences with the application of further T-rules which assign new order and structure to the Interlingua as appropriate. The synthesis T-rules for a given language can be viewed as analogues of the T-rules that are used for analysis of that language, though it is unlikely that for synthesis the analysis rules could be simply reversed. Once the desired structure has been arrived at, the trees undergo a series of context-sensitive rules used to assign mainly syntactic features to the leaves ('L-rules'), for example for the purpose of assigning number and gender concord (etc.). The formalism for the L-rules is again similar to that for the P-rules and T-rules, the geometry part this time defining a single tree structure with no structural modification implied. A simple example for German is provided here (4).</Paragraph>
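Example (4) is lost from this copy, so here is an illustrative concord rule in the spirit of the L-rules: features are copied onto leaves within a fixed geometry, with no structural change. The leaf representation and feature names are ours, not Bede's actual rule (4).

```python
# Illustrative L-rule for German-style concord: copy number and gender
# from the noun onto its sister determiner and adjective leaves. The
# geometry (one NP's leaves) is fixed; only leaf features change.

def l_rule_concord(np_leaves):
    noun = next(leaf for leaf in np_leaves if leaf["class"] == "N")
    for leaf in np_leaves:
        if leaf["class"] in ("DET", "ADJ"):
            leaf["num"] = noun["num"]
            leaf["gen"] = noun["gen"]
    return np_leaves

leaves = [
    {"class": "DET", "lu": "d-"},
    {"class": "ADJ", "lu": "klein"},
    {"class": "N", "lu": "Radio", "num": "SING", "gen": "NEUT"},
]
for leaf in l_rule_concord(leaves):
    print(leaf)
```

After rules like this have run, every leaf carries the features morphological synthesis needs, which is why the tree above the leaves can then be discarded.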
    <Paragraph position="42">  The list of labelled leaves resulting from the application of L-rules is passed to morphological synthesis (the superior branches are no longer needed), where a finite-state grammar of morphographemic and affixation rules ('M-rules') is applied to produce the target string. The formalism for M-rules is much less complex than the A-rule formalism, the grammar being again straightforwardly deterministic. The only taxing requirement of the M-rule formalism (which, at the time of writing, has not been finalised) is that it must permit a wide variety of string manipulations to be described, and that it must define a transparent interface with the dictionary. A typical rule for French for example might consist of stipulations concerning information found both on the leaf in question and in the dictionary, as in (5).</Paragraph>
    <Paragraph position="43"> (5) leaf info.: CLASS=V; TENSE=PRES; NUM=SING; PERS=3; MOOD=INDIC dict. info.: CONJ(V)=IRREG assign: affix "-T" to STEM1(V) D. General comments on system design The general modularity of the system will have been quite evident. A key factor, as mentioned above, is that each of these grammars is just powerful enough for the task required of it: thus no computing power is 'wasted' at any of the intermediate stages.</Paragraph>
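Rule (5) can be read procedurally: if the leaf's features and the verb's dictionary entry satisfy the stipulations, attach the affix to the verb's first stem. The sketch below uses our own data layout and an illustrative entry for French faire (STEM1 "fai", so present 3rd singular comes out as "fait"); only the stipulations themselves come from (5).

```python
# Procedural reading of M-rule (5): stipulations on the leaf and on the
# dictionary entry must both hold before the "-t" affix is attached to
# the verb's first stem. Data layout and the 'faire' entry are ours.

def m_rule_pres_3sg(leaf, dict_entry):
    wanted = {"class": "V", "tense": "PRES", "num": "SING",
              "pers": 3, "mood": "INDIC"}
    if all(leaf.get(k) == v for k, v in wanted.items()) \
            and dict_entry.get("conj") == "IRREG":
        return dict_entry["stem1"] + "t"
    return None   # stipulations not met; some other M-rule applies

leaf = {"class": "V", "tense": "PRES", "num": "SING",
        "pers": 3, "mood": "INDIC"}
entry = {"conj": "IRREG", "stem1": "fai"}   # illustrative entry for 'faire'
print(m_rule_pres_3sg(leaf, entry))         # fait
```

The deterministic character of the M-grammar shows up here: for a given leaf exactly one rule's stipulations should hold, so no search or backtracking is needed.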
    <Paragraph position="44"> At each interface between grammars only a small part of the data structures used by the donating module is required by the receiving module. The 'unwanted' data structures are written to peripheral store to enable recovery of partial structures in the case of failure or mistranslation, though automatic backtracking to previous modules by the system as such is not envisaged as a major component.</Paragraph>
    <Paragraph position="45"> The 'static' data used by the system consist of the different sets of linguistic rule packages, plus the dictionary. The system essentially has one large multilingual dictionary from which numerous software packages generate various sub-dictionaries as required either in the translation process itself, or for lexicographers  working on the system. Alphabetical or other structured language-specific listings can be produced, while of course dictionary updating and editing packages are also provided.</Paragraph>
    <Paragraph position="46"> The system as a whole can be viewed as a collection of Production Systems (PSs) (Newell, 1973; Davis &amp; King, 1977; see also Ashman (1982) on the use of PSs in MT) in the way that the rule packages (which, incidentally, as an efficiency measure, undergo separate syntax verification and 'compilation' into interpretable 'code') operate on the data structure. The system differs from the classical PS setup in distributing its static data over two databases: the rule packages and the dictionary. The combination of the rule packages and the dictionary, the software interfacing these, and the rule interpreter can however be considered as analogous to the rule interpreter of a classical PS.</Paragraph>
    <Paragraph position="47"> III. CONCLUSION As an experimental research project, Bede provides us with an extremely varied range of computational linguistics problems, ranging from the principally linguistic task of rule-writing, to the essentially computational work of software implementation, with lexicography and terminology playing their part along the way.</Paragraph>
    <Paragraph position="48"> But we hope too that Bede is more than an academic exercise, and that we are making a significant contribution to applied computational linguistics research.</Paragraph>
  </Section>
  <Section position="3" start_page="169" end_page="169" type="metho">
    <SectionTitle>
IV. ACKNOWLEDGMENTS
</SectionTitle>
    <Paragraph position="0"> I present this paper only as spokesman for a large group of people who have worked, are working, or will work on Bede. Therefore I would like to thank colleagues and students at C.C.L., past, present, and future for their work on the project, and in particular Rod Johnson, Jock McNaught, Pete Whitelock, Kieran Wilby, Tony Barnes, Paul Bennett and Beverley Ashman for help with this write-up. I of course accept responsibility for any errors that slipped through that tight net.</Paragraph>
  </Section>
</Paper>