<?xml version="1.0" standalone="yes"?>
<Paper uid="A94-1015">
  <Title>A Successful Case of Computer Aided Translation</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 The Problem
</SectionTitle>
    <Paragraph position="0"> The work I am about to describe originated in the question &quot;Which Machine Translation system should I use to translate my book from Portuguese into English?&quot;, to which the only fair answer from anyone acquainted with the current state of the art of Machine Translation ought to be a (maybe qualified) &quot;None!&quot;. The book in question is Semigrupos Finitos e Álgebra Universal, a textbook on finite semigroup theory.</Paragraph>
    <Paragraph position="1"> The problem was then reformulated as &quot;What may help me avoid typing in all 400 pages of the book, given that it is a book on Mathematics and was prepared in LaTeX?&quot;. A book on Mathematics meant that the language used was somewhat formal and that all mathematical formulas could be preserved during the translation. That the book had been prepared in LaTeX, a text processor widely used by mathematicians and computer scientists (Lamport, 1986), meant that it would be possible to use the LaTeX commands in the text, for instance, to detect the boundaries of formulas (the same would be true of texts encoded in a mark-up language such as SGML). This far less ambitious goal of building some tools that would help the translation seemed quite attainable, even if at the time the final result was not expected to be as good as it turned out to be.</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="91" type="metho">
    <SectionTitle>
3 The Method
</SectionTitle>
    <Paragraph position="0"> The basic method employed consists in having a dictionary of rewrite rules, each one being a sequence of words in Portuguese with its counterpart in English, and in applying these rules to the source text. The dictionary is searched from the beginning, so that the first rule in it whose left-hand side (lhs) matches a prefix of the source text is the one selected, irrespective of any other rules that could be applied. This means that a dictionary rule whose lhs is a prefix of the lhs of another rule must appear after that other rule; otherwise the longer rule would never be selected. If no rule can be applied, the first word in the source text is left unchanged. In either case the same method is then applied to the rest of the text.</Paragraph>
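    <Paragraph>The first-match strategy described above can be sketched as follows. This is a minimal Python illustration, not the Prolog tool itself; the function name translate, the list-of-words representation and the toy dictionary RULES are assumptions made for the example.

```python
def translate(words, rules):
    """Apply the first matching rule at each position; copy unmatched words."""
    out = []
    i = 0
    while i != len(words):
        for lhs, rhs in rules:  # dictionary order decides which rule wins
            if words[i:i + len(lhs)] == lhs:
                out.extend(rhs)
                i += len(lhs)  # every lhs is assumed to be non-empty
                break
        else:
            out.append(words[i])  # no rule applies: leave the word unchanged
            i += 1
    return out


# Hypothetical toy dictionary: the longer rule precedes the rule that is its prefix.
RULES = [
    (["a", "operação"], ["the", "operation"]),
    (["a"], ["to"]),
]

print(" ".join(translate("a operação a X".split(), RULES)))  # the operation to X
```

If the two rules were listed in the opposite order, the one-word rule would always win and the longer rule would never be applied, which is why rule order matters.</Paragraph>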
    <Paragraph position="1"> A finer analysis of the source text was added to this basic method in order to cope with LaTeX commands, so that
* mathematical formulas, which must be left unchanged, can be detected,
* LaTeX denotations of diacritics are taken as belonging to the words they occur in,
* some commands (which were called transparent), such as those for section names or footnotes, have their arguments translated, while all the others are left unchanged.</Paragraph>
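    <Paragraph>A rough sketch of this finer analysis, in Python, is given below. It is only an approximation under stated assumptions: formulas are taken to be inline $...$ spans, only a few accent commands are recognised, and commands are merely identified (the distinction between transparent and other commands is not modelled). The names FORMULA, ACCENT, WORD, COMMAND and tokenize are hypothetical.

```python
import re

# Assumption: only inline math of the form $...$ is treated as a formula.
FORMULA = r"\$[^$]*\$"
# Assumption: accents are written as \'e, \`a, \^o, \~a or \c{c}.
ACCENT = r"\\(?:['`^~]\{?[A-Za-z]\}?|c\{[A-Za-z]\})"
# A word is a run of letters and accent denotations, so accents stay inside words.
WORD = r"(?:" + ACCENT + r"|[A-Za-z])+"
COMMAND = r"\\[A-Za-z]+"
TOKEN = re.compile("(" + FORMULA + ")|(" + WORD + ")|(" + COMMAND + ")")
KINDS = ("formula", "word", "command")


def tokenize(text):
    """Return (kind, string) pairs; anything else (spaces, braces) is skipped."""
    tokens = []
    for match in TOKEN.finditer(text):
        kind = KINDS[match.lastindex - 1]  # which of the three groups matched
        tokens.append((kind, match.group(0)))
    return tokens


for token in tokenize(r"a opera\c{c}\~ao $x \cdot y$ \emph{em} $S$"):
    print(token)
```

In this sketch the command inside the formula is never seen as a separate token, because the whole formula is consumed first.</Paragraph>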
    <Paragraph position="2"> Another refinement was the possibility of having rewrite rules with parameters that stand for formulas, as in de $1 sobre $2 → of $1 on $2. Finally, in order to deal with proper nouns, a capital letter at the beginning of the text is given special treatment. The dictionary is searched for a rule that matches the text as is. Then the capital letter is converted into lower case and the dictionary is searched again for an appropriate rule. If both searches succeed, the rule that is nearest to the beginning of the dictionary is selected.</Paragraph>
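    <Paragraph>The treatment of an initial capital letter can be illustrated with the following Python sketch. It assumes rules are (lhs, rhs) pairs of word lists held in dictionary order and that the text is non-empty; first_match and select_rule are hypothetical names, and parameters are left out for brevity.

```python
def first_match(words, rules):
    """Return the position of the first rule whose lhs is a prefix of words, or None."""
    for index, (lhs, _rhs) in enumerate(rules):
        if words[:len(lhs)] == lhs:
            return index
    return None


def select_rule(words, rules):
    """Search with the text as is and with its first letter lowered; prefer the earlier rule."""
    lowered = [words[0][0].lower() + words[0][1:]] + words[1:]
    as_is = first_match(words, rules)
    as_lower = first_match(lowered, rules)
    candidates = [i for i in (as_is, as_lower) if i is not None]
    return min(candidates) if candidates else None


# Hypothetical toy dictionary with an identity rule for a proper noun.
RULES = [
    (["Estado", "de", "São", "Paulo"], ["Estado", "de", "São", "Paulo"]),
    (["seja"], ["let"]),
    (["de"], ["of"]),
]

print(select_rule(["Estado", "de", "São", "Paulo"], RULES))  # 0
print(select_rule(["Seja", "S"], RULES))  # 1
```
</Paragraph>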
  </Section>
  <Section position="6" start_page="91" end_page="91" type="metho">
    <SectionTitle>
4 The Tools
</SectionTitle>
    <Paragraph position="0"> Three small programs (amounting to a total of 9 pages) were written in Prolog to cope with different aspects of the problem at hand.</Paragraph>
    <Paragraph position="1"> One of them scans the source text and prints the words in it, skipping formulas and irrelevant LaTeX commands. The list of words thus obtained, after being sorted and after deletion of repeated words (which can be done by using the sort utility in a Unix system), is very useful in preparing the dictionary.</Paragraph>
    <Paragraph position="2"> The format adopted for the dictionary, as written by the user, is simply that of a text with a rewrite rule in each line: the left-hand side, followed by a tab character (that will be shown as ~ in the sequel), and the right-hand side. Parameters standing for formulas in a rule are written as $N, where N is a positive integer. Each such parameter must occur once and only once in each side of the rule, and not at the beginning of its lhs. Examples of rules in this format are &quot;de $1 sobre $2 ~ of $1 on $2&quot; and &quot;sejam $1, $2 e $3 ~ let $1, $2 and $3 be&quot;. The second program transforms the dictionary as typed by the user into a set of Prolog clauses for use by the translation tool, our third program.</Paragraph>
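    <Paragraph>As a rough Python counterpart of this first step, the sketch below extracts the distinct words of a text and sorts them. It assumes inline $...$ formulas, commands of the form backslash plus letters, and accented letters already present as ordinary characters; word_list is a hypothetical name, and every command is skipped here, which is a simplification.

```python
import re


def word_list(text):
    """Return the sorted list of distinct words, ignoring formulas and commands."""
    without_formulas = re.sub(r"\$[^$]*\$", " ", text)
    without_commands = re.sub(r"\\[A-Za-z]+", " ", without_formulas)
    words = re.findall(r"[^\W\d_]+", without_commands)  # runs of letters only
    return sorted(set(words))  # sorting plus removal of repeated words


sample = r"Seja $S$ um semigrupo e seja $T$ um \emph{semigrupo} finito"
print(word_list(sample))
```

The order produced by sorted follows character codes, so capitalised words come first; a Unix sort may order them differently depending on the locale.</Paragraph>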
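    <Paragraph>The dictionary format and its constraints on parameters can be checked with a few lines of code. The following Python sketch is only an illustration of the format described above, not the second program (which produces Prolog clauses); parse_rule and PARAM are hypothetical names, and punctuation is left attached to the neighbouring token.

```python
import re

PARAM = re.compile(r"\$[1-9]\d*")  # $N with N a positive integer


def parse_rule(line):
    """Split one dictionary line at the tab and validate its parameters."""
    lhs_text, rhs_text = line.rstrip("\n").split("\t")
    lhs, rhs = lhs_text.split(), rhs_text.split()
    lhs_params = PARAM.findall(lhs_text)
    rhs_params = PARAM.findall(rhs_text)
    if len(set(lhs_params)) != len(lhs_params):
        raise ValueError("parameter repeated in lhs: " + line)
    if sorted(lhs_params) != sorted(rhs_params):
        raise ValueError("parameters must occur once on each side: " + line)
    if lhs and PARAM.match(lhs[0]):
        raise ValueError("lhs must not begin with a parameter: " + line)
    return lhs, rhs


print(parse_rule("de $1 sobre $2\tof $1 on $2"))
print(parse_rule("sejam $1, $2 e $3\tlet $1, $2 and $3 be"))
```
</Paragraph>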
    <Paragraph position="3"> These clauses can be seen as the usual translation of Definite-Clause Grammar (DCG) rules into Prolog, except for the order of the arguments. For the sake of efficiency in searching the dictionary for rules that can be applied at each step, we take the &quot;string&quot; arguments of the translation of a DCG non-terminal to be the first arguments of the corresponding Prolog predicate. As these arguments are lists, and the indexing mechanism of the Prolog compiler we use, YAP (Damas et al., 1988), looks at the first element of lists, this results in a speed-up by a factor of 2 or 3. Other points are
* the rewrite rules are numbered, so that the translation tool is able to decide which rule to apply when dealing with capital letters (as seen at the end of the last section),
* each parameter is replaced by a predicate call that processes a formula,
* when a lhs does not finish with a parameter, a predicate is called that checks the occurrence of a separator.</Paragraph>
    <Paragraph position="4"> The third program, the translation tool, implements the method described in the previous section. It was written as a DCG in Prolog with some fragments of Prolog to deal with special situations, such as the fact that the dictionary rules are implemented as described above.</Paragraph>
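    <Paragraph>The effect of indexing on the first element of the list can be imitated in Python by grouping rules under the first word of their lhs. This is only an analogy, assuming rules are (lhs, rhs) pairs of non-empty word lists; it does not reproduce how YAP indexes clauses, and build_index and lookup are hypothetical names.

```python
from collections import defaultdict


def build_index(rules):
    """Group numbered rules by the first word of their lhs, keeping dictionary order."""
    index = defaultdict(list)
    for number, (lhs, rhs) in enumerate(rules):
        index[lhs[0]].append((number, lhs, rhs))
    return index


def lookup(words, index):
    """Return (number, lhs, rhs) of the first applicable rule, or None."""
    for number, lhs, rhs in index.get(words[0], []):
        if words[:len(lhs)] == lhs:
            return number, lhs, rhs
    return None


# Hypothetical toy dictionary.
RULES = [
    (["a", "operação"], ["the", "operation"]),
    (["a"], ["to"]),
    (["de"], ["of"]),
]
INDEX = build_index(RULES)
print(lookup(["a", "X"], INDEX))
```

Only the rules that start with the current word are inspected, and the rule number kept with each entry is what allows two candidate rules to be compared by their position in the dictionary.</Paragraph>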
  </Section>
  <Section position="7" start_page="91" end_page="92" type="metho">
    <SectionTitle>
5 The Translation
</SectionTitle>
    <Paragraph position="0"> What I am about to describe was done by Jorge Almeida, the author and translator of the book.</Paragraph>
    <Paragraph position="1"> The strategy adopted for doing the translation was to build and tune a dictionary by focusing on the translation of a single, representative chapter of the book, then to enhance the dictionary further on the basis of the translation of the other chapters, and finally to use this dictionary to translate the whole book with our translation tool. The text obtained in this way was then revised (by using a text editor) to produce the final English version of the book.</Paragraph>
    <Paragraph position="2"> The first step was to use our first program to obtain a list of the words in the whole book. This list was sorted (and repeated entries deleted) by using the UNIX sort utility. A preliminary version of the dictionary was built on the basis of the resulting list. Inspection of the output of the translation tool (using this dictionary on the selected chapter) suggested the addition of new rules to the dictionary. After some iterations of this process an acceptable translation of the selected chapter was obtained. Some further refinements to the dictionary were made by applying this same technique to other chapters.</Paragraph>
    <Paragraph position="3"> Some rules used in the actual translation were
a operação ~ the operation
a ~ to
Estado de São Paulo ~ Estado de São Paulo
variáveis distintas ~ distinct variables
The first two rules are related to a lexical ambiguity problem: a in Portuguese is both an article and a preposition. In the absence of a syntactic analysis, the latter alternative is taken as the default (2nd rule above); this means that a as an article will be either manually corrected in the translated text, or translated correctly if another translation rule in which it occurs is applied. This latter case is exemplified by the first rule above. The third rule is an example of an identity rule that, along with the treatment of capital letters, is useful in coping with proper nouns; this particular rule blocks the translation of the preposition de by some generic rule. The last rule shows how inversions in word order can be dealt with.</Paragraph>
    <Paragraph position="4"> Most of the effort in translating the book was spent in building and tuning the dictionary. The amount of text typed during this phase is estimated at about 20 pages. About another 20 pages were typed during the revisions made to the output of the translation program (this includes the introduction of some small updates and corrections to the original text).</Paragraph>
    <Paragraph position="5"> I now give some statistics on the work done on building and tuning the dictionary and translating the book. The execution times below are CPU execution times of Prolog programs using the YAP compiler (Damas et al., 1988) running under EP/IX (a Control Data Corporation version of UNIX) on a CDC 4680 machine (with two MIPS 6000 RISC processors).
Selected chapter: no. of pages (final) 77; no. of characters 205 KB; no. of words 8571; no. of different words 1168; exec. time to extract words 10 sec; exec. time to translate chapter 25 sec.
Final dictionary: no. of rules ca. 6000; average total no. of words/rule 3.4; exec. time to process rules 52 sec.
Book after translation: no. of characters 1040 KB; total no. of typeset pages 436; estimated revision effort 80 hours.</Paragraph>
  </Section>
</Paper>