<?xml version="1.0" standalone="yes"?> <Paper uid="C80-1080"> <Title>TRANSLATING INTERACTIVE COMPUTER DIALOGUES FROM IDEOGRAPHIC TO ALPHABETIC LANGUAGES</Title> <Section position="2" start_page="528" end_page="528" type="intro"> <SectionTitle> BASIC </SectionTitle> <Paragraph position="0"> The baroque syntax of the BASIC language gives rise to many more problems than LISP does.</Paragraph> <Paragraph position="1"> There are four syntactic categories in BASIC that may be presented as ideographs: keywords, identifiers, character strings, and comments (REMarks). Other elements of the language, namely numbers, arithmetic operators, and special punctuation symbols such as commas and quotation marks, are used in the Chinese language in the same way as in English. Furthermore, the Chinese use ordinary Western mathematical language, so we do not envisage translating the names of mathematical library functions like sin and cos.</Paragraph> <Paragraph position="2"> Keywords are stored in a Chinese-English translation table in the preprocessing computer. Single ideographs are used for keywords, and although this imposes a degree of unnaturalness on the Chinese representation, the resulting economy of keystrokes in entering programs was judged to outweigh any artificiality. In fact, multi-ideograph keywords could be accepted equally well if so desired.</Paragraph> <Paragraph position="3"> Identifiers in BASIC comprise an alphabetic letter which may be followed by a decimal digit. In Chinese, identifiers must comprise a single ideograph. Whenever an ideographic identifier is entered in a BASIC program, it is checked against the translation table. If it does not appear, it is added to the table with a 2-character translation. Thus the first ideograph which is not a keyword will translate to &quot;A0&quot;, the second to &quot;A1&quot;, and so on. Numerals, operators, and punctuation pass through the processor without translation. 
So also do English letters: this makes the filter transparent to English BASIC.</Paragraph> <Paragraph position="4"> If English and ideographics are mixed in a BASIC program, confusion may occur. The user cannot tell what English pseudonyms have been assigned to his ideographs, and so cannot guarantee to avoid variable name clashes. The ambiguity could be removed by translating English identifiers to a name selected by the preprocessor, in the same way that ideographic ones are. An easier possibility is simply to forbid mixed-language programming.</Paragraph> <Paragraph position="5"> Some English letters appear in Chinese programs; we have already mentioned mathematical functions. It is important to ensure that no parts of legal English strings can masquerade as translated identifiers; this is indeed the case for the 2-character identifiers A0, A1, ..., Z9.</Paragraph> <Paragraph position="6"> Character strings are the most difficult items to translate, because BASIC contains string-processing functions such as LEN() (length of a string), LEFT$(), RIGHT$(), MID$() (substrings), and INSTR$() (searches one string for the first occurrence of another). It is not feasible to encode a sequence of ideographs as a single unit, for this would prevent decomposition. Instead we translate the ideographs individually into fixed-length English strings. The 3-character encoding outlined above is quite suitable, and has the advantage that the English representation is as short as possible. This is important because otherwise string overflows will occur often within the BASIC interpreter.</Paragraph> <Paragraph position="7"> The BASIC string-processing operations specify offsets in a string as character counts. Since strings now contain a fixed number of English characters per ideograph, all of these figures must be adjusted to account for the new unit of measurement. 
Thus, since ideographs are converted into three English characters each, @LEN (the ideograph for LEN) is translated into (1/3)*LEN, @LEFT$(..., &lt;expression&gt;) into LEFT$(..., 3*(&lt;expression&gt;)), and so on.*</Paragraph> <Paragraph position="8"> Identifying &lt;expression&gt;s when translating LEFT$, RIGHT$, and MID$ is the closest the preprocessor gets to the syntax of the BASIC language.</Paragraph> <Paragraph position="9"> Note that this scheme will not work if English and ideographics are mixed within strings. In a simple system, one might choose to outlaw this. However, since symbols such as punctuation and digits count as English, this requirement may be too stringent. The only alternative, if string decomposition is to work properly, is to pad each English character that appears within a string to the length of the ideograph translations. If the pad character is chosen as a control character which would not otherwise appear in strings, it will be easy to remove when translating character strings received from the host on output; however, since one use of strings in BASIC is as filenames, and the host operating system will probably not welcome control characters in these, it is better to pad with a printing character instead. String operations which involve ASCII character codes, for example CHR$() which returns the character corresponding to a given ASCII code, and CHANGE() which transfers a string to an array of ASCII codes, are not implemented. The most sensible interpretation would be to return the 4-digit telecode mentioned earlier. This would involve communicating with the ideographic preprocessor, and so would need a non-standard implementation of BASIC on the host.</Paragraph> <Paragraph position="10"> As for program comments, any ideograph in a BASIC REM statement is converted to an English pseudonym using the same translation as for strings. 
There is no need to pad English characters, but it may be best to do so for the sake of uniformity.</Paragraph> <Paragraph position="11"> *The &quot;@&quot; prefix indicates that an ideograph is typed; thus &quot;@LEN&quot; should be read as the ideograph whose meaning is LEN.</Paragraph> <Paragraph position="12"> Lastly, it should not be forgotten that error messages originating from the host computer will have to be translated before being presented on the terminal. Some BASIC implementations simply return an error number, like &quot;?16&quot;, which does not need altering. (Note that our aim is not to change the programming language, but to preserve it, warts and all, wherever possible.) If error messages are used, they should appear in full in the translation table; a word-for-word translation would probably be too confusing in most cases.</Paragraph> <Paragraph position="13"> Fortunately, BASIC implementations do not include user-defined variable names or parts of program statements in error messages.</Paragraph> <Paragraph position="14"> Implementations. With the above considerations in mind, we sketch the working of both a simple preprocessor and a more sophisticated one. The first maintains a single translation table, which is initialized to hold the keywords of BASIC. Every ideograph input is translated via this table; if the ideograph is absent, the table is augmented with it, using the next unused member of the sequence A0, A1, ..., A9, B0, ... as its English translation. Anything received from the host computer is inverse-translated using the table. Digits, operators, and punctuation pass transparently through the preprocessing filter in both directions.</Paragraph> <Paragraph position="15"> English characters do too, unless they appear as translations of ideographs in the table, in which case they are transformed back to ideographs on output. 
The only syntactic checking of the BASIC program by the preprocessor is in detecting, bracketing, and halving expressions which form the second argument of a LEFT$(), MID$(), or RIGHT$() function; the expressions can be detected easily by stacking parentheses.</Paragraph> <Paragraph position="16"> This simple system will work correctly for all-Chinese BASIC, provided punctuation is avoided in strings which are decomposed. The maximum number of different ideographs which can be used in any one interactive session is 260.</Paragraph> <Paragraph position="17"> It will work for all-English BASIC and Chinese BASIC with English identifiers and strings, provided the string decomposition functions are typed in English. It will work for mixed English and Chinese if English is avoided in decomposable strings, unless name clashes occur. These can be avoided by using single-letter variables and ensuring that numbers are not adjacent to letters in strings.</Paragraph> <Paragraph position="18"> The more sophisticated preprocessor uses BASIC syntax to distinguish strings and comments from identifiers. A table is maintained for identifiers as described above. English identifiers are translated, as well as Chinese ones, to avoid name clashes in mixed programs.</Paragraph> <Paragraph position="19"> Ideographs in strings and comments are translated to a fixed-length character representation, like the one developed above, which cannot clash with keywords, identifiers, or numbers. English characters, punctuation, and digits occurring in strings are padded to the same length. Note that an ideograph may occur both in a string or comment and as an identifier. This causes no special difficulty.</Paragraph> <Paragraph position="20"> What inadequacies still appear in this second preprocessor? Juxtaposition of characters on output cannot masquerade as ideograph translations because characters in strings are always padded. (The padding is, of course, removed before final output.) 
If the user types English, punctuation, or Chinese as input to his program, it will all be translated before going to the BASIC program on the host.</Paragraph> <Paragraph position="21"> There is only one difficulty. If numbers are typed as input, there is no way that the preprocessor can tell whether they are destined for string input: INPUT A$, when they must be padded so that they can participate sensibly in string comparisons; or for numeric input:</Paragraph> </Section> </Paper>
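The translation table of the simple preprocessor described under "Implementations" can be illustrated with a minimal sketch. It is written in Python purely for illustration and is not the authors' implementation. The class name `IdeographTable`, its method names, and the sample keyword ideographs are hypothetical. The sketch assumes that any non-ASCII character is an ideograph and that the 260 pseudonyms A0 to Z9 are reserved for non-keyword ideographs.

```python
import itertools
import re
import string


class IdeographTable:
    """Hypothetical translation table for the simple preprocessor."""

    def __init__(self, keywords):
        # keywords: dict mapping one ideograph to its English BASIC keyword.
        self.to_english = dict(keywords)
        self.to_ideograph = {eng: ideo for ideo, eng in keywords.items()}
        # Pseudonyms A0, A1, ..., A9, B0, ..., Z9 (26 * 10 = 260 names).
        self._unused = (
            letter + digit
            for letter, digit in itertools.product(
                string.ascii_uppercase, string.digits
            )
        )

    def translate_char(self, ch):
        # Assumption: ASCII characters (English letters, digits, operators,
        # punctuation) pass through the filter untranslated.
        if ch.isascii():
            return ch
        if ch not in self.to_english:
            try:
                name = next(self._unused)
            except StopIteration:
                raise RuntimeError("all 260 pseudonyms are in use") from None
            self.to_english[ch] = name
            self.to_ideograph[name] = ch
        return self.to_english[ch]

    def to_host(self, text):
        # Translate terminal input character by character.
        return "".join(self.translate_char(ch) for ch in text)

    def from_host(self, text):
        # Inverse-translate host output; longer names are tried first so
        # that a keyword is never split by a shorter pseudonym.
        if not self.to_ideograph:
            return text
        names = sorted(self.to_ideograph, key=len, reverse=True)
        pattern = re.compile("|".join(re.escape(n) for n in names))
        return pattern.sub(lambda m: self.to_ideograph[m.group(0)], text)


# Sample keyword ideographs chosen only for this illustration.
table = IdeographTable({"印": "PRINT", "令": "LET"})
print(table.to_host("令 甲 = 3"))          # LET A0 = 3
print(table.to_host("印 甲 + 乙"))         # PRINT A0 + A1
print(table.from_host("PRINT A0 + A1"))    # 印 甲 + 乙
```

The sketch deliberately omits the string handling discussed in the text: it neither pads English characters inside strings nor rescales the arguments of LEFT$(), MID$(), and RIGHT$(). It also inherits the limitation noted for the simple system, namely that English text from the host which happens to match a pseudonym is turned back into an ideograph.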