<?xml version="1.0" standalone="yes"?>
<Paper uid="A83-1028">
  <Title>A STATUS REPORT ON THE LRC MACHINE</Title>
  <Section position="2" start_page="0" end_page="0" type="metho">
    <SectionTitle>
I INTRODUCTION
</SectionTitle>
    <Paragraph position="0"> The LRC MT system is one of very few large-scale applications of modern computational linguistics techniques [Lehmann, 1981]. Although the LRC MT system is nearing the status of a production system (a version should be delivered to the project sponsor about the time this conference takes place), it is not at all static; rather, it is an evolving collection of techniques which are continually tested through application to moderately large technical manuals ranging from 50 to 200 pages in length. Thus, our &quot;applied&quot; system remains a research vehicle that serves as an excellent testbed for proposed new procedures.</Paragraph>
    <Paragraph position="1"> In general, the criteria for our choice of linguistic and computational techniques are three: effectiveness, convenience of use, and efficiency.</Paragraph>
    <Paragraph position="2"> These criteria are applied in a context where the production of an MT system to be operational in the near-term future is of critical concern.</Paragraph>
    <Paragraph position="3"> Candidate techniques which do not admit near-term, large-scale application thus suffer an overwhelming disadvantage. The questions confronting us are, then, twofold: (1) which techniques admit such application; and (2) which of these best satisfy our three general criteria? The first question is usually answered through an evaluation of the likely difficulties and requirements for implementation; the second, through empirical results in the course of experiments.</Paragraph>
    <Paragraph position="4"> Our evaluation of the LRC MT system's current status will be based on three points: (a) the system's provision of all the tools necessary for users to effect the complete translation process (including text processing, editing, terminology maintenance and look-up, etc.); (b) quantitative performance (i.e., throughput on a particular ...); and (c) qualitative performance. ... known about overall ... effectiveness (i.e., the number ... to be supported by a single ..., their [expected] throughput, ... any other personnel necessary ... day-to-day operation of the system ...) ... overall costs of translation ... the norm experienced in human ... &quot;final numbers&quot; will not be available ... of the conference.</Paragraph>
    <Paragraph position="5"> ... but the preliminary experiments by our sponsor ... thus some reasonable projections ...</Paragraph>
  </Section>
  <Section position="3" start_page="0" end_page="167" type="metho">
    <SectionTitle>
II LINGUISTIC TECHNIQUES EMPLOYED
</SectionTitle>
    <Paragraph position="0"> Our ... between &quot;linguistic techniques&quot; and &quot;computational techniques&quot; (discussed ... section) is somewhat artificial ... validity in a broad sense, as ... from an overview of the points ... In this section we present the reasons ... of the following ... A. Phrase-Structure ... In ... we employ a phrase-structure ... sufficient lexical controls ... lexical-functional grammar ... all our linguistic decisions ... most controversial, and cons... the most attention. Generally ... are two competing claims: ... rules per se are inadequate (e.g., [Cullingford, 1978]); ... other forms of grammar (ATNs [...], transformational [Petrick, 1973], ... [..., 1972], word-experts [Small, ...]) ... superior. We will deal with th...</Paragraph>
    <Paragraph position="1"> There are schools of thought that claim that syntax ... inappropriate models of language ...; according to this notion, ... be treated [almost] entirely on the basis of semantics, guided by a strong underlying model of the current situational context, and the expectations that may be derived therefrom. We cannot argue against the claim that semantics is of critical concern in Natural Language Processing. However, as yet no strong case has been advanced for the abandonment of syntax. Moreover, no system has been developed by any of the adherents of the &quot;semantics only&quot; school of thought that has more-or-less successfully dealt with ALL of a wide range -- or at least large volume -- of material. A more damaging argument against this school is that every NLP system to date that HAS been applied to large volumes of text (in the attempt to process ALL of it in some significant sense) has been based on a strong syntactic model of language (see, e.g., [Boitet et al., 1980b], [Damerau, 1981], [Hendrix et al., 1978], [Lehmann et al., 1981], [Martin et al., 1981], [Robinson, 1982], and [Sager, 1981]). There are other schools of thought that hold phrase-structure (PS) rules in disrespect, while admitting the utility (necessity) of syntax. It is claimed that the phrase-structure formalism is inadequate, and that other forms of grammar are necessary. (This has been a long-standing position in the linguistic community, being upheld there before most computational linguists jumped on the bandwagon; ironically, this position is now being challenged by some within the linguistic community itself, who are once again supporting PS rules as a model of natural language use [Gazdar, 1981].)
The anti-PS positions in the NLP community are all, of necessity, based on practical considerations, since the models advanced to replace PS rules are formally equivalent in generative power (assuming the PS rules to be augmented, which is always the case in modern NLP systems employing them). But cascaded ATNs [Woods, 1980], for example, are only marginally different from PS rule systems. It is curious to note that only one of the remaining contenders (a transformational grammar [Damerau, 1981]) has been demonstrated in large-scale application -- and even this system employs PS rules in the initial stages of parsing. Other formal systems (e.g., procedural grammars [Winograd, 1972]) have been applied to semantically deep (but linguistically impoverished) domains -- or to excessively limited domains (e.g., Small's [1980] &quot;word expert&quot; parser seems to have encompassed a vocabulary of less than 20 items).</Paragraph>
    <Paragraph position="2"> For practical application, it is necessary that a system be able to accumulate grammar rules, and especially lexical items, at a prodigious rate by current NLP standards. The formalisms competing with PS rules and dictionary entries of modest size seem to be universally characterizable as requiring enormous human resources for their implementation in even a moderately large environment. This should not be surprising: it is precisely the claim of these competing methodologies (those that are other than slight variations on PS rules) that language is an exceedingly complex phenomenon, requiring correspondingly complex techniques to model. For &quot;deep understanding&quot; applications, we do not contest this claim. But we do maintain that there are some applications that do not seem to require this level of effort for adequate results in a practical setting. Our particular application -- automated translation of technical texts -- seems to fall in this category.</Paragraph>
    <Paragraph position="3"> The LRC MT system is currently equipped with something over 400 PS rules describing the Source Language (German), and nearly 10,000 lexical entries in each of two languages (German and the Target Language -- English). The current state of our coverage of the SL is that the system is able to parse and acceptably translate the majority of sentences in previously-unseen texts, within the subject areas bounded by our dictionary (specific figures will be related below). By the time this conference convenes, we will have begun the process of adding to the system an analysis grammar of the current TL (English), so that the direction of translation may be reversed; we anticipate bringing the English grammar up to the level of the German grammar in about a year's time. Our expectations for eventual coverage are that around 1,000 PS rules will be adequate to account for almost all sentence forms actually encountered in technical texts, whatever the language. We do not feel constrained to account for every possible sentence form in such texts -- nor for sentence forms not found in such texts (as in the case of poetry) -- since the required effort would not be cost-effective whether measured in financial or human terms, even if it were possible using current techniques (which we doubt).</Paragraph>
    <Paragraph position="4"> B. Syntactic Features Our use of syntactic features is relatively noncontroversial, given our choice of the PS rule formalism. We employ syntactic features for two purposes. One is the usual practice of using such features to restrict the application of PS rules (e.g., by enforcing subject-verb number agreement). The other use is perhaps peculiar to our type of application: once an analysis is achieved, certain syntactic features are employed to control the course (and outcome) of translation -- i.e., generation of the TL sentence. The &quot;augmentations&quot; to our PS rules include procedures written in a formal language (so that our linguists do not have to learn LISP) that manipulate features by restricting their presence, their values if present, etc., and by moving them from node to node in the &quot;parse tree&quot; during the course of the analysis. As is the case with other researchers employing such techniques, we have found this to be an extremely powerful (and of course necessary) means of restricting the activities of the parser.</Paragraph>
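    <Paragraph> The feature-based restriction of a PS rule described above can be sketched roughly as follows. This is only an illustration in Python, not the formal language used by the LRC linguists; the dictionary-based node layout and the names agree and apply_s_rule are assumptions made for the example.

```python
# Hypothetical illustration: node layout and function names are assumptions,
# not the LRC rule formalism.

def agree(subject, verb, feature="number"):
    """Return True when both nodes carry the feature with equal values."""
    return feature in subject and feature in verb and subject[feature] == verb[feature]


def apply_s_rule(np, vp):
    """S -> NP VP, blocked unless subject and verb agree in number."""
    if not agree(np, vp):
        return None  # the rule is restricted: no S node is built
    # Copy the agreed feature up to the new node, as a rule augmentation might.
    return {"cat": "S", "number": np["number"], "children": [np, vp]}


singular_np = {"cat": "NP", "number": "sg"}
plural_vp = {"cat": "VP", "number": "pl"}
singular_vp = {"cat": "VP", "number": "sg"}

print(apply_s_rule(singular_np, plural_vp))           # None
print(apply_s_rule(singular_np, singular_vp)["cat"])  # S
```

The same feature that blocks the rule is also copied to the parent node, which mirrors the idea of moving features from node to node during analysis.</Paragraph>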
  </Section>
  <Section position="4" start_page="167" end_page="169" type="metho">
    <SectionTitle>
C. Semantic Features
</SectionTitle>
    <Paragraph position="0"> We employ simple semantic features, as opposed to complex models of the domain. Our reasons are primarily practical. First, they seem sufficient for at least the initial stage of our application. Second, the thought of writing complex models of even one complete technical domain is staggering: the operation and maintenance manuals we are currently working with (describing a digital telephone switching system) are part of a document collection that is expected to comprise some 100,000 pages of text when complete. A research group the size of ours would not even be able to read that volume of material, much less write the &quot;necessary&quot; semantic models subsumed by it, in any reasonable amount of time.</Paragraph>
    <Paragraph position="1"> (The group would also have to become electronics engineers, in all likelihood.) If such models are indeed required for our application, we will never succeed.</Paragraph>
    <Paragraph position="2"> As it turns out, we are doing surprisingly well without such models. In fact, our semantic feature system is not yet being employed to restrict the analysis effort at all; instead, it is used at &quot;transfer time&quot; (described later) to improve the quality of the translations, primarily of prepositions. We look forward to extending the use of semantic features to other parts of speech, and to substantive activity during analysis; but even we were pleased at the results we achieved using only syntactic features.</Paragraph>
    <Paragraph position="3"> D. Scored Interpretations It is a well-known fact that NLP systems tend to produce many readings of their input sentences (unless, of course, constrained to produce the first reading only -- which can result in the &quot;right&quot; interpretation being overlooked). The LRC MT system produces all interpretations of the input &quot;sentence&quot; and assigns each of them a score, or plausibility factor [Robinson, 1982]. This technique can be used, in theory, to select a &quot;best&quot; interpretation from the possible readings of an ambiguous sentence. We base our scores on both lexical and grammatical phenomena -- plus the types of any spelling/typographical errors, which can sometimes be &quot;corrected&quot; in more than one way. Our experiences relating to the reliability and stability of heuristics based on this technique are decidedly positive: we employ only the (or a) highest-scoring reading for translation (the others being discarded), and our informal experiments indicate that it is very rarely true that a better translation results from a lower-scoring analysis. (Surprisingly often, a number of the higher-scoring interpretations will be translated identically. But poorer translations are frequently seen from the lower-scoring interpretations, demonstrating that the technique is indeed effective.) E. Indexed Transformations We employ a transformational component, during both the analysis phase and the translation phase. The transformations, however, are indexed to specific syntax rules rather than loosely keyed to syntactic constructs. (Actually, both styles are available, but our linguists have never seen the need or practicality of employing the open-ended variety.) It is clearly more efficient to index transformations to specific rules when possible; the import of our findings is that it seems to be unnecessary to have open-ended transformations -- even during analysis, when one might intuitively expect them to be useful.</Paragraph>
    <Paragraph position="4"> F. Transfer Component It is frequently argued that translation should be a process of analyzing the Source Language (SL) into a &quot;deep representation&quot; of some sort, then directly synthesizing the Target Language (TL) (e.g., [Carbonell, 1978]). We and others [King, 1981] contest this claim -- especially with regard to &quot;similar languages&quot; (e.g., those in the Indo-European family). One objection is based on large-scale, long-term trials of the &quot;deep representation&quot; (in MT, called the &quot;pivot language&quot;) technique by the MT group at Grenoble [Boitet, 1980a]. After an enormous investment in time and energy, including experiments with massive amounts of text, it was decided that the development of a suitable pivot language (for use in Russian-French translation) was probably impossible. Another objection is based on practical considerations: since it is not likely that any NLP system will in the foreseeable future become capable of handling unrestricted input -- even in the technical area(s) for which it might be designed -- it is clear that a &quot;fail-soft&quot; technique is necessary. It is not obvious that such is possible in a system based solely on a pivot language; a hybrid system capable of dealing with shallower levels of understanding is necessary in a practical setting.</Paragraph>
    <Paragraph position="5"> This being the case, it seems better in near-term applications to start off with a system employing a &quot;shallow&quot; but usable level of analysis, and deepen the level of analysis as experience dictates, and resources permit.</Paragraph>
    <Paragraph position="6"> Our alternative is to have a &quot;transfer&quot; component which maps &quot;shallow analyses of sentences&quot; in the SL into &quot;shallow analyses of equivalent sentences&quot; in the TL, from which synthesis then takes place. While we and the rest of the NLP community continue to debate the nature of an adequate pivot language (i.e., the nature of deep semantic models and the processing they entail), we can hopefully proceed to construct a usable system capable of progressive enhancement as linguistic theory becomes able to support deeper models.</Paragraph>
    <Paragraph position="7"> G. Attached Translation Procedures Our Transfer procedures (which effect the actual translation of SL into TL) are tightly bound to nodes in the analysis (parse tree) structure [Paxton, 1977]. They are, in effect, suspended procedures -- the same procedures that constructed the corresponding parse tree nodes to begin with. This is to be preferred over a more general, loose association based on syntactic constructs because, aside from its advantage in sheer computational efficiency, it eliminates the possibility that the &quot;wrong&quot; procedure can be applied to a construct. The only real argument against this technique, as we see it, is based on space considerations: to the extent that different constructs share the same transfer operations, replication of the procedures that implement said operations (and editing effort to modify them) is possible. We have not noticed this to be a problem. For a while, our system load-up procedure searched for duplicates of this nature and eliminated them; however, the gains turned out to be minimal -- different constructs typically do require different operations.</Paragraph>
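    <Paragraph> The idea of a transfer procedure suspended on the very node it will later translate can be illustrated with closures. The following Python sketch is an assumption-laden toy: the three-word lexicon, the node layout, and the functions leaf and build_np are hypothetical and are not taken from the LRC system, which is written in LISP.

```python
# Hypothetical sketch: the lexicon, rule and node layout are illustrative
# assumptions, not the LRC implementation.

LEXICON = {"der": "the", "neue": "new", "Schalter": "switch"}  # toy German-English entries


def leaf(word):
    """Build a lexical node whose transfer procedure looks the word up."""
    return {"cat": "WORD", "transfer": lambda: LEXICON.get(word, word)}


def build_np(det, adj, noun):
    """Analysis step for NP -> DET ADJ N; the transfer step stays suspended on the node."""
    def transfer():
        # Runs only at transfer time, and only for this particular NP node.
        return " ".join(child["transfer"]() for child in (det, adj, noun))
    return {"cat": "NP", "transfer": transfer}


np_node = build_np(leaf("der"), leaf("neue"), leaf("Schalter"))
print(np_node["transfer"]())  # the new switch
```

Because each node carries its own procedure, no lookup by syntactic construct is needed at transfer time, so a procedure written for one construct cannot be applied to another by mistake.</Paragraph>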
    <Paragraph position="8"> III COMPUTATIONAL TECHNIQUES EMPLOYED Again, our separation of &quot;linguistic&quot; from &quot;computational&quot; techniques is somewhat artificial, but nevertheless useful. In this section we present the reasons for our use of the following computational techniques: (a) an all-paths, bottom-up parser; (b) associated rule-body procedures; (c) spelling correction; (d) another fail-soft analysis technique; and (e) recursive parsing of parenthetical expressions.</Paragraph>
    <Paragraph position="9"> A. All-paths, Bottom-up Parser Among all our choices of computational techniques, the use of an all-paths, bottom-up parser is probably the most controversial. It also received our greatest experimental scrutiny.</Paragraph>
    <Paragraph position="10"> We have collected a substantial body of empirical evidence relating to parsing techniques. Since the evidence and conclusions require lengthy discussion, and are presented elsewhere [Slocum, 1981], we will only briefly summarize the results.</Paragraph>
    <Paragraph position="11"> The evidence indicates that our use of an all-paths bottom-up parser is justified, given the current state of the art in Computational Linguistics. Our reasons are the following: first, the dreaded &quot;exponential explosion&quot; of processing time has not appeared (and our grammar and test texts are among the largest in the world), but instead, processing time appears to be linear with sentence length -- even though our system produces all possible interpretations; second, top-down parsing methods suffer inherent disadvantages in efficiency, and bottom-up parsers can be and have been augmented with &quot;top-down filtering&quot; to restrict the syntax rules applied to those that an all-paths top-down parser would apply; third, it is difficult to persuade a top-down parser to continue the analysis effort to the end of the sentence, when it blocks somewhere in the middle -- which makes the implementation of &quot;fail-soft&quot; techniques that much more difficult; and lastly, the lack of any strong notion of how to construct a &quot;best-path&quot; parser, coupled with the raw speed of well-implemented parsers, implies that an all-paths parser which scores interpretations and can continue the analysis to the end of the sentence is best in a practical application such as ours.</Paragraph>
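    <Paragraph> To make the notion of an all-paths, bottom-up parser concrete, here is a minimal chart-based sketch in Python. It is a textbook CYK-style procedure over binary rules, chosen for brevity; it is not the LRC parser (it has no top-down filtering, scoring, or rule-body procedures), and the toy grammar, lexicon, and the name all_paths_parse are assumptions made for the example.

```python
from collections import defaultdict

# Toy grammar in binary form; rules and lexicon are illustrative assumptions.
RULES = [("S", "NP", "VP"), ("NP", "NP", "PP"), ("VP", "V", "NP"),
         ("VP", "VP", "PP"), ("PP", "P", "NP")]
LEXICON = {"I": ["NP"], "saw": ["V"], "her": ["NP"], "with": ["P"], "glasses": ["NP"]}


def all_paths_parse(words):
    """Return chart[(i, j)][cat]: every tree of that category over words[i:j]."""
    n = len(words)
    chart = defaultdict(lambda: defaultdict(list))
    for i, w in enumerate(words):
        for cat in LEXICON[w]:
            chart[(i, i + 1)][cat].append((cat, w))
    for width in range(2, n + 1):          # build longer spans from shorter ones
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):      # every split point
                for parent, left, right in RULES:
                    for lt in chart[(i, k)][left]:
                        for rt in chart[(k, j)][right]:
                            chart[(i, j)][parent].append((parent, lt, rt))
    return chart


chart = all_paths_parse("I saw her with glasses".split())
print(len(chart[(0, 5)]["S"]))  # 2
```

Both attachments of the prepositional phrase are kept in the chart, and every smaller phrase remains available even when no sentence-level analysis exists, which is the property the fail-soft technique discussed later relies on.</Paragraph>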
    <Paragraph position="12">  B. Associated Rule-body Procedures We associate a procedure directly with each individual syntax rule, and evaluate it as soon as the parser determines the rule to be (seemingly) applicable [Pratt, 1973; Hendrix, 1978] -- hence the term &quot;rule-body procedure&quot;. This practice is equivalent to what is done in ATN systems. From the linguist's point of view, the contents of our rule-body procedures appear to constitute a formal language dealing with syntactic and semantic features/values of nodes in the tree -- i.e., no knowledge of LISP is necessary to code effective procedures. Since these procedures are compiled into LISP, all the power of LISP is available as necessary. The chief linguist on our project, who has a vague knowledge of LISP, has employed OR and AND operators to a significant extent (we didn't bother to include them in the specifications of the formal language, though we obviously could have), and on rare occasions has resorted to using COND. No other calls to true LISP functions (as opposed to our formal operators, which are few and typically quite primitive) have seemed necessary,</Paragraph>
    <Paragraph position="13"> nor has this capability been requested, to date.</Paragraph>
    <Paragraph position="14"> The power of our rule-body procedures seems to lie in the choice of features/values that decorate the nodes, rather than the processing capabilities of the procedures themselves.</Paragraph>
    <Paragraph position="15"> C. Spelling Correction There are limitations and dangers to spelling correction in general, but we have found it to be an indispensable component of an applied system.</Paragraph>
    <Paragraph position="16"> People do make spelling and typographical errors, as is well known; even in &quot;polished&quot; documents they appear with surprising frequency (about every other page, in our experience). Arguments by LISP programmers (re: INTERLISP's DWIM) aside, users of applied NLP systems distinctly dislike being confronted with requests for clarification -- or, worse, unnecessary failure -- in lieu of automated spelling correction. Spelling correction,</Paragraph>
    <Paragraph position="17"> therefore, is necessary.</Paragraph>
    <Paragraph position="18"> Luckily, almost all such errors are treatable with simple techniques: single-letter additions, omissions, and mistakes, plus two- or three-letter transpositions account for almost all mistakes.</Paragraph>
    <Paragraph position="19"> Unfortunately, it is not infrequently the case that there is more than one way to &quot;correct&quot; a mistake (i.e., resulting in different corrected versions). Even a human cannot always determine the correct form in isolation, and for NLP systems it is even more difficult. There is yet another problem with automatic spelling correction: how much to correct. Given unlimited rein, any word can be &quot;corrected&quot; to any other. Clearly there must be limits, but what are they? Our informal findings concerning how much one may safely &quot;correct&quot; in an application such as ours are these: the few errors that simple techniques have not handled are almost always bizarre (e.g., repeated syllables or larger portions of words) or highly unusual (e.g., blanks inserted within words); correction of more than one error in a word is dangerous (it is better to treat the word as unknown, hence a noun); and &quot;correction&quot; of errors which have converted one word into another (valid in isolation) should not be tried.</Paragraph>
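    <Paragraph> The limits just described (at most one error per word, no correction of words that are already valid, and possibly several candidate corrections) can be sketched as follows. This Python fragment is a simplified illustration under stated assumptions: the five-word lexicon and the names one_edit_variants and correct are hypothetical, and only adjacent two-letter transpositions are covered, not the three-letter ones mentioned earlier.

```python
import string

LEXICON = {"switch", "witch", "stitch", "system", "manual"}  # toy lexicon, an assumption


def one_edit_variants(word, alphabet=string.ascii_lowercase):
    """All strings exactly one single-letter change or adjacent swap away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {a + b[1:] for a, b in splits if b}
    inserts = {a + c + b for a, b in splits for c in alphabet}
    replaces = {a + c + b[1:] for a, b in splits if b for c in alphabet}
    swaps = {a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1}
    return (deletes | inserts | replaces | swaps) - {word}


def correct(word):
    """Return candidate corrections; a word valid in isolation is never changed."""
    if word in LEXICON:
        return [word]
    # An empty list means: treat the word as unknown rather than correct further.
    return sorted(one_edit_variants(word).intersection(LEXICON))


print(correct("swich"))  # ['switch']
print(correct("sitch"))  # ['stitch', 'switch', 'witch']
print(correct("xyzzy"))  # []
```

The second call shows why scoring matters: a single misspelling can have several one-error corrections, and nothing in the word itself says which one was intended.</Paragraph>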
    <Paragraph position="20"> D. Fail-soft Grammatical Analysis In the event of failure to achieve a comprehensive analysis of the sentence, a system such as ours -- which is to be applied to hundreds of thousands of pages of text -- cannot indulge in the luxury of simply replying with an error message stating that the sentence cannot be interpreted. Such behavior is a significant problem, one which the NLP community has failed to come to grips with in any coherent fashion. There have, at least, been some forays. Weischedel and Black [1980] discuss techniques for interacting with the linguist/developer to identify insufficiencies in the grammar. This is fine for development purposes. But, of course, in an applied system the user will be neither the developer nor a linguist, so this approach has no value in the field. Hayes and Mouradian [1981] discuss ways of allowing the parser to cope with ungrammatical utterances; this work is in its infancy, but it is stimulating nonetheless. We look forward to experimenting with similar techniques in our system.</Paragraph>
    <Paragraph position="21"> What we require now, however, is a means of dealing with &quot;ungrammatical&quot; input (whether through the human's error or the shortcomings of our own rules) that is highly efficient, sufficiently general to account for a large, unknown range of such errors on its first outing, and which can be implemented in a short period of time. We found just such a technique three years ago: a special procedure (invoked when the analysis effort has been carried through to the end of the sentence) searches through the parser's chart to find the shortest path from one end to the other; this path represents the fewest, longest-spanning phrases which were constructed during the analysis. Ties are broken by use of the standard scoring mechanism that provides each phrase in the analysis with a score, or plausibility measure (discussed earlier). We call this procedure &quot;phrasal analysis&quot;.</Paragraph>
    <Paragraph position="22"> Our phrasal analysis technique has proven to be useful for both the developers and the end-user, in our application: the system translates each phrase individually, when a comprehensive sentence analysis is not available.</Paragraph>
    <Paragraph position="23"> The linguists use the results to pin-point missing (or faulty) rules. The users (who are professional translators, editing the MT system's output) have available the best translation possible under the circumstances, rather than no usable output of any kind. To our knowledge, no other NLP system relies on such a general technique for searching the parser's chart when an analysis effort has failed. We think that phrasal analysis -- which is simple and independent of both language and grammar -- could be useful in other applications of NLP technology, such as natural language interfaces to databases.</Paragraph>
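    <Paragraph> The chart search behind phrasal analysis (fewest phrases spanning the input, ties broken by score) can be sketched as a shortest-path computation. The Python below is a minimal illustration, not the LRC procedure: the edge layout (start, end, label, score), the additive combination of scores, and the name phrasal_analysis are all assumptions made for the example.

```python
def phrasal_analysis(edges, n):
    """Cover positions 0..n with the fewest phrases; break ties by total score.

    edges: (start, end, label, score) tuples from the chart, with end > start.
    Returns the chosen list of edges, or None when no path spans the input.
    """
    by_start = {}
    for edge in edges:
        by_start.setdefault(edge[0], []).append(edge)

    # best[p] = (phrase_count, negated_total_score, path) for the best route to p
    best = {0: (0, 0.0, [])}
    for pos in range(n):
        if pos not in best:
            continue  # no phrase sequence reaches this position
        count, neg_score, path = best[pos]
        for edge in by_start.get(pos, []):
            end, score = edge[1], edge[3]
            candidate = (count + 1, neg_score - score, path + [edge])
            # Fewer phrases wins; with equal counts the higher total score wins.
            if end not in best or best[end][:2] > candidate[:2]:
                best[end] = candidate
    return best[n][2] if n in best else None


chart_edges = [(0, 1, "DET", 0.4), (1, 2, "N", 0.6), (0, 2, "NP", 0.9),
               (2, 3, "V", 0.5), (3, 5, "NP", 0.8),
               (2, 5, "VP", 0.3), (2, 5, "VP", 0.7)]
print(phrasal_analysis(chart_edges, 5))
# [(0, 2, 'NP', 0.9), (2, 5, 'VP', 0.7)]
```

Since every edge runs left to right, one pass over the positions in increasing order suffices; each selected phrase could then be translated individually.</Paragraph>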
    <Paragraph position="24"> E. Recursive Parsing of Parenthetical Expressions Few NLP systems have dealt with parenthetical expressions; but MT researchers know well that these constructs appear in abundance in technical texts. We deal with this phenomenon in the following way: rather than treating parentheses as lexical items, we make use of LISP's natural treatment of them as list delimiters, and treat the resulting sublists as individual &quot;words&quot; in the sentence; these &quot;words&quot; are &quot;lexically analyzed&quot; via recursive calls to the parser.</Paragraph>
    <Paragraph position="25"> Aside from the elegance of the treatment, this has the advantage that &quot;ungrammatical&quot; parenthetical expressions may undergo phrasal analysis and thus become single-phrase entities as far as the analysis of the encompassing sentence is concerned; thus, ungrammatical parenthetical expressions need not result in ungrammatical (hence poorly handled) sentences.</Paragraph>
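    <Paragraph> The treatment of a parenthetical expression as a single &quot;word&quot; that is analyzed by a recursive call can be sketched as follows. In LISP the reader supplies the nested lists directly; this Python illustration builds them by hand. The function names read_nested and analyze are hypothetical, analyze is only a stand-in for the real parser, and balanced parentheses are assumed.

```python
def read_nested(tokens):
    """Turn a flat token list containing '(' and ')' into nested lists.

    Assumes the parentheses are balanced.
    """
    stack = [[]]
    for tok in tokens:
        if tok == "(":
            stack.append([])           # open a new sublist
        elif tok == ")":
            inner = stack.pop()        # close it and attach it to its parent
            stack[-1].append(inner)
        else:
            stack[-1].append(tok)
    return stack[0]


def analyze(words):
    """Stand-in for the parser: each sublist is handled by a recursive call."""
    units = []
    for w in words:
        if isinstance(w, list):
            units.append(("PAREN", analyze(w)))  # one unit for the outer sentence
        else:
            units.append(("WORD", w))
    return units


nested = read_nested("press the key ( see figure 3 ) twice".split())
print(nested)                # ['press', 'the', 'key', ['see', 'figure', '3'], 'twice']
print(len(analyze(nested)))  # 5
```

The outer sentence sees five units, one of which is the already-analyzed parenthetical, so a failure inside the parentheses stays local to that unit.</Paragraph>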
  </Section>
  <Section position="5" start_page="169" end_page="172" type="metho">
    <SectionTitle>
IV CURRENT STATUS
A. Adequate Support Tools
</SectionTitle>
    <Paragraph position="0"> No NLP system is likely to be successful in isolation: an environment of support tools is necessary for ultimate acceptance on the part of prospective users. The following support tools, we think, constitute a minimum workable environment for both development and use: a DBMS for handling lexical entries; validation programs that verify the admissibility of all linguistic rules (grammar, lexicons, transformations, etc.) according to a set of formal specifications; dictionary programs that search through large numbers of proposed new lexical entries (words, in all relevant languages) to determine which entries are actually new, and which appear to replicate existing entries; defaulting programs that &quot;code&quot; new lexical entries in the NLP system's chosen formalism automatically, given only the root forms of the words and their categories, using empirically determined best guesses based on the available dictionary database entries plus whatever orthographic information is available in the root forms; and benchmark programs to test the integrity of the NLP system after significant modifications [Slocum, 1982]. A DBMS for handling grammar rules is also a good idea.</Paragraph>
    <Paragraph position="1"> For Machine Translation applications, one must add: a collection of text-processing programs that [semi-]automatically mark and extract translatable segments of text from large documents, and which automatically insert translations produced by the MT system back into the original document, preserving all formatting conventions such as tables of contents, section headings, paragraphs, multi-column tables, flowcharts, figure labels, and the like; a powerful on-line editing program with special capabilities (such as single-keystroke commands to look up words in on-line dictionaries) in addition to the normal editing commands (almost all of which should be invokable with a single keystroke); and also, perhaps, (access to) a &quot;term databank,&quot; i.e., an on-line database of technical terms used in the subject area(s) to be covered by the MT system.</Paragraph>
    <Paragraph position="2"> The LRC MT system already provides all of the tools mentioned above, with the exception of the text editor and terminology database (both of which our sponsor will provide). All of this comes in a single integrated working environment, so that our linguists and lexicographers can implement changes and test them immediately for their effects on translation quality, and modify or delete their additions with ease, if desired.</Paragraph>
    <Paragraph position="3"> B. Quantitative Performance The average performance of the LRC MT system when translating technical manuals from German into English, running in compiled INTERLISP on a DEC 2060 with over a million words of physical memory, has been measured at slightly under 2 seconds of CPU time per input word; this includes storage management (the garbage collector alone consumes 45% of all CPU time on this limited-address-space machine), paging, swapping, and I/O -- that is, all forms of overhead. Our experience on the 2060 involved the translation of some 330 pages of text, in three segments, over a two year period.</Paragraph>
    <Paragraph position="4"> On our Symbolics LM-2 Lisp Machine, with 256K words of physical memory, preliminary measurements indicate an average performance of 6-10 seconds (real time) per input word, likewise including all forms of overhead. Our LM-2 experience to date has involved the translation of about 200 pages of text in a single run. The paging rate indicates that, with added memory (512K words is &quot;standard&quot; on these machines), we could expect a significant reduction in this performance figure. With a faster, second-generation Lisp Machine, we would expect a more substantial reduction of real-time processing requirements. We hope to have had the opportunity to conduct an experiment on at least one such machine, by the time this conference convenes.</Paragraph>
    <Paragraph position="5"> C. Qualitative Performance Measuring MT system throughput is one thin S.</Paragraph>
    <Paragraph position="6"> Measuring &amp;quot;machine translation quality&amp;quot; is quite another, since the standards for measurement (and for interpretin S the measurements) are little understood, and vary widely. Thus, &amp;quot;quality&amp;quot; measurements are of little validity, However, because there is usually a considerable amount of lay interest in such n~--bers, we shall endeavor to indicate why they are basically meaningless, and then report our findings for the benefit of those who feel a need to know.</Paragraph>
    <Paragraph position="7"> Certainly it is the case that &amp;quot;correctness&amp;quot; numbers can theoretically give some indication of the quality of translation. If an ~ system were said to translate, say, IOZ of its input correctly, no one would be likely to consider it unable. The trouble is, quoted figures almost universally hover at the opposite extreme of the speetrtun -- around 90X -- for }iT systems that vary r~arkably v.r.t, the subjective quality of their output. (Since, to the lay person, &amp;quot;90Z correct&amp;quot; seom8 to constitute minimal acceptable quality.</Paragraph>
    <Paragraph position="8"> the consistent use of the 90% figure should not be surprising.) The trouble arises from at least the following human variables: who performs the measurement? what, exactly, is measured? and by what standards? Since almost all measurements are performed by the vendor of the system in question, there is obvious room for bias. Second, if one measures &amp;quot;words translated correctly,&amp;quot; whatever that means, that is a very different thing from measuring, e.g., &amp;quot;sentences translated correctly,&amp;quot; whatever that means. Finally, there is the matter of defining the operative word, &amp;quot;correct&amp;quot;. Since no two translators are likely to agree on what constitutes a &amp;quot;correct&amp;quot; translation -- to say nothing of establishing a rigorous, objective standard -- the notion of &amp;quot;correctness&amp;quot; will naturally vary depending on who determines it. It will also vary depending on the amount of time available to perform the measurement: it is widely recognized that an editor will change more in a given translation, the more time he has to work on it. Finally, &amp;quot;correctness&amp;quot; will vary depending on the use to which the translation is intended to be put, the classical first division being information acquisition vs. dissemination.</Paragraph>
    <Paragraph position="9"> There are a few subsidiary qualifications that must be applied to statements of measured quality: what kinds of text were involved? who chose them? did the vendor have access to them before the test? if so, in what form? and for how long? These are critically important questions relating to the interpretation of the results. It stands to reason that, to get the most trustworthy figures: the system should be applied to such varieties of text as it is intended to handle (in the near term, at least); the texts should be chosen by the user, and not divulged to the vendor beforehand except perhaps in the form of a list of words or technical terms (in root form) which appear therein -- and that,</Paragraph>
    <Paragraph position="10"> for not too long a period of time before the test. With the reader bearing all of the above in mind, we report the following quality measurements: during the last two years, LRC personnel have measured the quality of translations produced by the LRC MT system in terms of the percentage of sentences (actually,</Paragraph>
    <Paragraph position="11"> &amp;quot;translation units&amp;quot;, since isolated words and phrases appear frequently) which were translated from German into acceptable English; if any change to the translated unit was necessary, however slight, the translation was considered incorrect; the test runs were made once or twice for each text -- once, before the text was ever seen by the LRC staff (a &amp;quot;blind&amp;quot; run), and once more, after a few months of system enhancement based in part on the previous results (a &amp;quot;follow-up&amp;quot; run); the project sponsor always provided the LRC with a list of the words and technical terms said to be employed in the text (the list was sometimes incomplete, as one would expect of human compilations of the vocabulary in a large document). The first run, on a 50-page text, was performed only after the text had been studied for some time; the second and third runs, on an 80-page text, were performed both ways (&amp;quot;blind&amp;quot; and &amp;quot;follow-up&amp;quot;); the fourth test was a blind run on a 200-page text. The figures so measured varied from 55% to 85% depending on the text, and on whether the test was a blind or follow-up run.</Paragraph>
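The strict scoring rule described above (a translation unit counts as correct only if the reviser changes nothing) can be sketched as follows. This is a minimal illustration, not the LRC's actual tooling: the TranslationUnit record, the function names, and the sample sentences are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TranslationUnit:
    """One sentence, phrase, or isolated word, plus the reviser's final text."""
    source: str
    machine_output: str
    revised_output: str


def is_acceptable(unit: TranslationUnit) -> bool:
    # Strict criterion from the text: any change, however slight,
    # makes the unit count as incorrect.
    return unit.machine_output == unit.revised_output


def acceptance_rate(units: list[TranslationUnit]) -> float:
    """Percentage of units the reviser left completely unchanged."""
    if not units:
        raise ValueError("no translation units to score")
    accepted = sum(1 for unit in units if is_acceptable(unit))
    return 100.0 * accepted / len(units)


# Hypothetical sample data (not taken from the LRC test texts).
sample = [
    TranslationUnit("Netzstecker ziehen.", "Pull the power plug.", "Pull the power plug."),
    TranslationUnit("Wartung", "Maintenance", "Maintenance"),
    TranslationUnit("Deckel abnehmen.", "Remove cover.", "Remove the cover."),
    TranslationUnit("Schalter auf AUS stellen.", "Set switch to OFF.", "Set switch to OFF."),
]

print(acceptance_rate(sample))
```

Because a single inserted article is enough to fail a unit, this measure is deliberately conservative; a looser criterion would yield higher figures on the same output.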
    <Paragraph position="12"> A fifth test -- a follow-up run on the text used in the fourth test -- has already been performed, but the qualitative results are not available at this writing. The results of this run and two more blind runs on two very different texts totalling 160 pages should be available when the conference convenes; these qualitative results are all to be measured by professional technical translators employed by the project sponsor.</Paragraph>
    <Paragraph position="13"> D. Interpretation of the Results Any positive conclusions we might draw based on such data will be subject to certain objections. It has been argued that, unless an MT system constitutes an almost perfect translator, it will be useless in any practical setting \[Kay, 1980\]. As we interpret it, the argument proceeds something like this: (1) there are classical problems in Computational Linguistics that remain unsolved to this day (e.g., anaphora, quantifiers, conjunctions); (2) these problems will, in any practical setting, compound on one another so as to result in a very low probability that any given sentence will be correctly translated; (3) it is not in principle possible for a system suffering from malady (1) above to reliably identify and mark its probable errors; (4) if the human post-editor has to check every sentence to determine if it has been correctly translated, then the translation is useless.</Paragraph>
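The compounding effect asserted in step (2) can be illustrated with a deliberately simple model. The symbols p and k and the independence assumption are introduced here purely for illustration and are not taken from \[Kay, 1980\]: suppose a sentence contains k constructions of the unsolved kinds, and each is handled correctly with probability p, independently of the others.

```latex
P(\text{sentence correct}) = p^{k},
\qquad \text{e.g. } p = 0.95,\ k = 5 \;\Rightarrow\; 0.95^{5} \approx 0.774
```

The numbers are invented. Actual values of p and k, and whether anything like independence holds in real text, can only be settled by measurement.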
    <Paragraph position="14"> We accept claims (1) and (3) without question. We consider claim (2) to be a matter for empirical validation -- surely not a very controversial contention. As it happens, the substantial body of empirical evidence gathered by the LRC to date argues against this claim. By the time the conference convenes, we will have more definitive data to present, derived by the project sponsor.</Paragraph>
    <Paragraph position="15"> Regarding (4), we embrace the assumption that a human post-editor will have to check the entire translation, sentence-by-sentence; but we argue that Kay's conclusion (&amp;quot;then the translation is useless&amp;quot;) is again properly a matter for empirical validation. Meanwhile, we are operating under the assumption that this conclusion is patently false -- after all, where translation is taken seriously, human translations are routinely edited via exhaustive review, but no one claims that they are useless! E. Overall Performance In this section we advance a meaningful,</Paragraph>
    <Paragraph position="16"> more-or-less objective metric by which any MT system can and should be judged: overall (man/machine) translation performance. The idea is simple. The MT system must achieve two simultaneous goals: first, the system's output must be acceptable to the translator/editor for the purpose of revision; second, the cost of the total effort (including amortization and maintenance of the hardware and software) must be less than the current alternative for like material -- human translation followed by post-editing.</Paragraph>
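The two simultaneous goals can be written down as a simple check. This is a sketch under stated assumptions: the per-page cost breakdown, the function names, and all figures are hypothetical, since the text gives no actual cost data.

```python
def mt_cost_per_page(machine_time_cost: float,
                     revision_cost: float,
                     amortized_overhead: float) -> float:
    """Total man/machine cost for one page.

    amortized_overhead stands for the per-page share of hardware and
    software amortization and maintenance (a hypothetical breakdown).
    """
    return machine_time_cost + revision_cost + amortized_overhead


def mt_is_viable(output_acceptable_for_revision: bool,
                 mt_cost: float,
                 human_cost: float) -> bool:
    """Both goals must hold: usable output AND lower total cost.

    human_cost is the current alternative for like material:
    human translation followed by post-editing.
    """
    return output_acceptable_for_revision and human_cost > mt_cost


# Invented per-page figures in arbitrary currency units.
mt_cost = mt_cost_per_page(machine_time_cost=2.0,
                           revision_cost=9.0,
                           amortized_overhead=3.0)
human_cost = 20.0

print(mt_cost)
print(mt_is_viable(True, mt_cost, human_cost))
```

Note that a cost advantage alone is not sufficient in this formulation: if revisers reject the output as a basis for revision, the check fails regardless of price.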
    <Paragraph position="17"> There may be a significant problem with the reliability of human revisors' judgements (which are nevertheless the best available): the writer has been told by professional technical editors/ translators (potential users of the LRC MT system) that they look forward to editing our machine translations &amp;quot;because the machine doesn't care&amp;quot; \[private communication\]. (That is, they would change more in a machine translation than in a supposedly equivalent human translation because they would not have to worry about insulting the original translator with what s/he might consider &amp;quot;petty&amp;quot; changes.) Thus, the &amp;quot;correctness&amp;quot; standards to be applied to MT will very likely differ from those applied to human translation, simply due to the translation source. Since the errors committed by an MT system seldom resemble errors made by human translators, the possibility of a &amp;quot;Turing test&amp;quot; for an MT system does not exist at the current time.</Paragraph>
    <Paragraph position="18"> When the conference convenes, we will present such data as we have, bearing on the issue of overall performance using our system. Preliminary data from at least one outside assessment should be available. This information will tend to indicate the readiness of our system for use in a production translation environment.</Paragraph>
    <Paragraph position="19"> V DISCUSSION We have commented on the relative merits in large-scale application of several linguistic techniques: (a) a phrase-structure grammar; (b) syntactic features; (c) semantic features; (d) scored interpretations; (e) transformations indexed to specific rules; (f) a transfer component; and (g) attached procedures to effect translation. We also have presented our findings concerning the practical merits of several computational techniques: (a) a bottom-up, all-paths parser; (b) associated rule-body procedures; (c) spelling correction; (d) chart searching in case of analysis failures; and (e) recursive parsing of parenthetical expressions. We believe these findings constitute useful information about the state of the art in Computational Linguistics.  We will not have any firm empirical evidence concerning overall performance until later in 1983, when the LRC MT system will have been used in-house by our sponsor, for very-large-scale translation experiments. However, we will have some preliminary data from our sponsor source that can be adduced as a basis for extrapolation. (Our sponsor will indeed be using the data for just such a purpose.) This should constitute useful information about the state of the art in Machine Translation at the University of Texas. To the extent that such findings are positive, they will lend credence to our claims regarding the practical utility of the methods we employed.</Paragraph>
  </Section>
</Paper>