<?xml version="1.0" standalone="yes"?>
<Paper uid="P98-2238">
  <Title>Dialect MT: A Case Study between Cantonese and Mandarin</Title>
  <Section position="3" start_page="1460" end_page="1461" type="metho">
    <SectionTitle>
2 Linguistic Consideration of Dialect MT
</SectionTitle>
    <Paragraph position="0"> Most differences among the dialects of a language lie in their sound inventories and phonological systems. Words with the same written form are often pronounced differently in different dialects. For example, the Chinese word &quot;香港&quot; (Hong Kong) is pronounced xiang1gang3 in Mandarin but hoeng1gong2 in Cantonese. (Footnote 2: Mandarin is presented in the Hanyu Pinyin Scheme (LICASS, 1996), and Cantonese in the Yueyu Pinyin Scheme (LSHK, 1997). Numbers denote the tones of syllables. Yueyu Pinyin is based on Hanyu Pinyin, so across the two schemes, words with different pinyin symbols are normally pronounced differently.) There are also lexical differences, although dialects share most of their words: different dialects may use different words for the same thing. For example, &quot;umbrella&quot; is 雨傘 (yu3san3) in Mandarin and 遮 (ze1) in Cantonese. Differences in syntactic structure are less common, but they are linguistically more complicated and computationally more challenging. For example, the positions of some adverbs may vary from dialect to dialect. To express &quot;You go first&quot;, we have
Mandarin: 你先走, ni3 xian1 zou3 (you first go) (1)
Cantonese: 你行先, nei5 hang4 sin1 (you go first) (2)
Comparative sentences are another case where syntactic differences are likely. For example, the English sentence &quot;A is taller than B&quot; is expressed as
Mandarin: A 比 B 高, A bi3 B gao1 (A than B tall) (3)
Cantonese: A 高過 B, A gou1 gwo3 B (A tall more B) (4)
Sentences with double objects often follow different word orders, too. In a Mandarin sentence with two objects, the one referring to a person or persons must precede the other. Yet many dialects allow the order to be reversed. [example sentences and transformational rules not preserved in this extract] ... forms can be represented in a bi-dialect dictionary.
For example, for Cantonese-Mandarin MT, we can use entries like
word(pron, [你, ni3], [你, nei5]) %you
word(vi, [走, zou3], [行, hang4]) %go
word(n, [行, hang2], [行, hang4]) %row
word(adv, [先, xian1], [先, sin1]) %first
word(n, [雨傘, yu3san3], [遮, ze1]) %umbrella
where the word entry flag &quot;word&quot; is followed by three arguments: the part of speech, and the corresponding words (in Chinese characters and pinyin) in Mandarin and in Cantonese. English comments are marked with &quot;%&quot;.</Paragraph>
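    <Paragraph> The dictionary entries above can be read as a small lookup table. The following Python sketch is only an illustration, not the authors' Prolog representation: the names Entry, ENTRIES and mandarin_candidates are hypothetical, and the Chinese characters are inferred from the pinyin and English glosses. It indexes the five example entries by Cantonese pinyin and shows that one Cantonese form can map to several Mandarin words.

```python
from collections import defaultdict
from typing import NamedTuple


class Entry(NamedTuple):
    """One bi-dialect entry: part of speech plus (characters, pinyin) pairs."""
    pos: str
    mandarin: tuple   # (characters, Hanyu Pinyin)
    cantonese: tuple  # (characters, Yueyu Pinyin)
    gloss: str


# The five example entries from the text (characters are inferred, see lead-in).
ENTRIES = [
    Entry("pron", ("你", "ni3"), ("你", "nei5"), "you"),
    Entry("vi", ("走", "zou3"), ("行", "hang4"), "go"),
    Entry("n", ("行", "hang2"), ("行", "hang4"), "row"),
    Entry("adv", ("先", "xian1"), ("先", "sin1"), "first"),
    Entry("n", ("雨傘", "yu3san3"), ("遮", "ze1"), "umbrella"),
]


def index_by_cantonese_pinyin(entries):
    """Group entries by their Cantonese pinyin so lookups are direct."""
    index = defaultdict(list)
    for entry in entries:
        index[entry.cantonese[1]].append(entry)
    return index


CANTONESE_INDEX = index_by_cantonese_pinyin(ENTRIES)


def mandarin_candidates(cantonese_pinyin):
    """Return (part of speech, Mandarin pinyin) for every matching entry."""
    matches = CANTONESE_INDEX.get(cantonese_pinyin, [])
    return [(entry.pos, entry.mandarin[1]) for entry in matches]
```

Looking up hang4 returns two candidates (the verb zou3 and the noun hang2), which is the kind of word ambiguity Section 3 addresses.</Paragraph>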
    <Paragraph position="1"> Morphologically, there are some useful rules for word formation. For example, in Mandarin the prefixes &quot;公&quot; (gong1) and &quot;雄&quot; (xiong2) mark male animals, and &quot;母&quot; (mu3) and &quot;雌&quot; (ci2) mark female animals. In most southern Chinese dialects, however, suffixes are often used instead. [suffix characters and examples illegible in this extract] ... exists in word orders, the key task for MT is to decide which part(s) of the source sentence should be moved, and to where. Words are unlikely to be moved over long distances, because dialects are normally used in short, spoken sentences.</Paragraph>
    <Paragraph position="2"> Another question is whether dialect MT should be direct or indirect, i.e., whether there should be an intermediate language or dialect. Indirect MT with the lingua franca as the intermediate representation seems promising. The advantage is twofold: (a) it suits multi-dialect MT; (b) it is more useful and practical, since a lingua franca is a common dialect, the most influential one in the family, and perhaps the only one with a complete written system.</Paragraph>
    <Paragraph position="3"> Still another question is the form of the source and target dialects for the MT program. Most MT systems today translate between written languages, while others attempt speech-to-speech translation. For dialect MT, translation between written sentences is of limited value, because the dialects of a language virtually share a common written system. Speech-to-speech translation, on the other hand, involves speech recognition and speech generation, which is a challenging research area in itself. It is worthwhile to take a middle way: translation at the level of phonetic symbols. There are at least three major reasons: (a) the largest differences among dialects lie in their sound systems; (b) phonetic symbol translation is a prerequisite for speech translation; (c) some dialect words can only be represented in sound. In our case, pinyin has been selected to represent both input and output sentences, because in China pinyin is the most popular tool for learning dialects and for entering Chinese characters into computers. The Chinese pinyin schemes for Mandarin and for ordinary dialects are romanized, i.e., they use virtually only English letters, which is convenient for computer processing. Of course, pinyin-to-pinyin translation is more difficult than translation between words written in Chinese block characters, because it involves linguistic analysis of all three aspects (sound systems, grammar rules and vocabulary) instead of two.</Paragraph>
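    <Paragraph> Since both input and output are romanized pinyin with tone numbers, a first processing step is to split a pinyin word into syllables. The Python sketch below is one possible way to do this and is not taken from the paper: the function name split_syllables is hypothetical, and it assumes lowercase letters followed by an optional tone digit from 1 to 6. A toneless syllable written directly before another syllable would be merged with it, so real input would need a syllable inventory.

```python
import re

# One syllable: a run of lowercase letters, optionally followed by a tone digit.
SYLLABLE = re.compile(r"([a-z]+)([1-6]?)")


def split_syllables(word):
    """Split a tone-numbered pinyin word into (letters, tone) pairs.

    The tone is an int, or None when the syllable carries no tone digit.
    """
    syllables = []
    for match in SYLLABLE.finditer(word):
        letters, tone = match.group(1), match.group(2)
        syllables.append((letters, int(tone) if tone else None))
    return syllables
```

For instance, the Mandarin and Cantonese forms of the word for Hong Kong each split into two syllables with their tones.</Paragraph>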
  </Section>
  <Section position="4" start_page="1461" end_page="1462" type="metho">
    <SectionTitle>
3 The Problem of Ambiguities
</SectionTitle>
    <Paragraph position="0"> Ambiguity is always the most crucial and the most challenging problem for MT. Since inter-dialect differences mostly exist in words, both in pronunciation and in characters, our discussion concentrates on word disambiguation for Cantonese-Mandarin MT. The Cantonese vocabulary contains about seven to eight thousand dialect words (including idioms and fixed phrases), i.e., words whose character forms differ from those of any Mandarin word, or whose meanings differ from those of Mandarin words with similar forms. These dialect words account for about one third of the total Cantonese vocabulary. In spoken Cantonese, the frequency of use of Cantonese dialect words is close to 50 percent (Li et al., 1995, p. 236).</Paragraph>
    <Paragraph position="1"> For historical reasons, Hong Kong Cantonese is linguistically more distant from Mandarin than the Cantonese spoken in regions of Mainland China.</Paragraph>
    <Paragraph position="2"> One can easily find Cantonese dialect articles in Hong Kong newspapers that are totally unintelligible to Mandarin speakers, while Mandarin articles are easily understood by Cantonese speakers. To translate a Cantonese article into Mandarin, the primary task is to deal with the Cantonese dialect words, especially those that have no semantically equivalent counterpart in the target dialect. For example, the Mandarin 橘 (ju2, orange) has a much larger coverage than the Cantonese 橘 (gwat1). In addition to the Cantonese 橘, the Mandarin word also includes the fruits Cantonese refers to as 柑 (gam1) and 橙 (caang2). On the other hand, the Cantonese 行 semantically covers the Mandarin 走 (go, walk) and 行 (row).</Paragraph>
    <Paragraph position="3"> Translation at the sound or pinyin level has to deal with another kind of ambiguity: the homophones of a word in the source dialect may not have counterpart synonyms in the target dialect that are also pronounced as homophones.</Paragraph>
    <Paragraph position="4"> For example, the words 香蕉 (banana) and 相交 (intersection) are both pronounced xiang1jiao1 in Mandarin, but in Cantonese they are pronounced hoeng1ziu1 and soeng1gaau1 respectively, though their written characters remain unchanged.</Paragraph>
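    <Paragraph> The homophone problem can be made concrete with the two words just mentioned. In the Python sketch below, which is illustrative only, the table READINGS and both function names are hypothetical. It lists the Cantonese readings that a single Mandarin pinyin form can map to and flags the form as needing disambiguation when there is more than one.

```python
# Meaning -> (Mandarin pinyin, Cantonese pinyin), for the two words in the text.
READINGS = {
    "banana": ("xiang1jiao1", "hoeng1ziu1"),
    "intersection": ("xiang1jiao1", "soeng1gaau1"),
}


def cantonese_targets(mandarin_pinyin):
    """Return every Cantonese reading whose Mandarin source sounds like the input."""
    return sorted(
        cantonese
        for mandarin, cantonese in READINGS.values()
        if mandarin == mandarin_pinyin
    )


def needs_disambiguation(mandarin_pinyin):
    """A pinyin form is ambiguous when it has more than one Cantonese target."""
    return len(cantonese_targets(mandarin_pinyin)) > 1
```

With only pinyin as input, xiang1jiao1 cannot be translated without context, whereas at the character level the two words are already distinct.</Paragraph>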
    <Paragraph position="5"> To tackle these ambiguities, we employ the techniques of hierarchical phrase analysis (Zhang and Lu, 1997) and word collocation processing (Sinclair, 1991), both rule-based and corpus-based. Briefly, the hierarchical phrase analysis method first tries to resolve a word ambiguity in the context of the smallest phrase containing the ambiguous word(s); if that fails, the next layer of embedding phrase is used, and so on. As a result, the problem is solved within the minimally sufficient context. To further facilitate the work, a large number of commonly used phrases and phrase schemes are being collected into the dictionary. Furthermore, interaction between the users and the MT system should be allowed for difficult disambiguation (Martin, 1997a).</Paragraph>
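    <Paragraph> The layered search described above can be sketched as a loop over enclosing phrases, from the smallest outward. This Python sketch is an assumption-laden illustration and not the method of Zhang and Lu (1997): the collocation table, its second entry (jat1 is an invented neighbour), and the function name disambiguate are all hypothetical, and phrases are given as plain word lists instead of parse trees.

```python
# Hypothetical collocation list:
# (ambiguous Cantonese word, neighbouring word) -> Mandarin choice.
COLLOCATIONS = {
    ("hang4", "sin1"): "zou3",   # next to the adverb "first": the verb "go"
    ("hang4", "jat1"): "hang2",  # invented example: the noun "row"
}


def disambiguate(word, phrases, collocations=COLLOCATIONS):
    """Resolve `word` using the smallest enclosing phrase that suffices.

    `phrases` lists the phrases containing the word, ordered from the
    smallest to the largest; each phrase is a list of words.
    Returns (Mandarin choice, index of the phrase layer used), or
    (None, None) when no layer resolves the ambiguity.
    """
    for depth, phrase in enumerate(phrases):
        for neighbour in phrase:
            choice = collocations.get((word, neighbour))
            if choice is not None:
                return choice, depth
    return None, None
```

When every layer fails, the (None, None) result is the point at which a system could fall back to asking the user, as the paragraph above suggests.</Paragraph>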
  </Section>
  <Section position="5" start_page="1462" end_page="1463" type="metho">
    <SectionTitle>
4 System Design and Implementation
</SectionTitle>
    <Paragraph position="0"> A rudimentary design of a Cantonese-Mandarin dialect MT system has been made, as shown in the system flowchart [figure not preserved in this extract]. The system takes Cantonese pinyin sentences as input and generates Mandarin sentences in Hanyu Pinyin and in Chinese characters. The translation is done in roughly three steps: syntax conversion, word disambiguation, and source-target word substitution. The knowledge bases include linguistic rules, a word collocation list and a bi-dialect MT dictionary.</Paragraph>
    <Paragraph position="1"> A simplified example will make the basic ideas clearer. Suppose the example word entries and transformational rules in Section 2 are included in the MT system's knowledge base, and the Cantonese sentence
nei5 hang4 sin1
you go first
is given as input for the system to translate into Mandarin. Because the input sentence contains the time adverb &quot;sin1&quot; (first), the grammar rules mark it as syntactically different from its Mandarin counterpart. Following the flowchart, the Cantonese pinyin sentence is first converted into a Mandarin structure: Rule 1 in the knowledge base is applied, producing
nei5 sin1 hang4
you first go
Then the dictionary is accessed. The Cantonese word 行 (hang4) corresponds to two Mandarin words, 走 (vi., go, walk) and 行 (n., row). According to Rule 1, the Mandarin verb is selected. The individual Cantonese words in the sentence are then substituted with their Mandarin counterparts, and the target Mandarin sentence
ni3 xian1 zou3
you first go
which matches sentence (1), is correctly produced. Similarly, with transformational rules 1-3, a more complicated Cantonese sentence like
gou1gwo3 wo3 ge3 yan4 bei2 cin4 keoi5 sin1
tall more me PART person give money him first
can be correctly translated into Mandarin:
bi3 wo3 gao1 de ren2 xian1 gei3 ta1 qian2
than me tall PART persons first give him money
&quot;Those who are taller than me will give him some money first.&quot;</Paragraph>
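    <Paragraph> The three steps of the worked example (syntax conversion, word disambiguation, word substitution) can be sketched in a few lines of Python. This is a toy under stated assumptions and not the CPC implementation: the wording of Rule 1 is not available here, so convert_syntax assumes it moves a sentence-final sin1 in front of the word before it, and choose_mandarin assumes the word right after that adverb is a verb. All function names are hypothetical.

```python
# Cantonese pinyin -> list of (part of speech, Mandarin pinyin).
DICTIONARY = {
    "nei5": [("pron", "ni3")],
    "hang4": [("vi", "zou3"), ("n", "hang2")],
    "sin1": [("adv", "xian1")],
}


def convert_syntax(words):
    """Step 1 (assumed form of Rule 1): move a final 'sin1' before the preceding word."""
    if len(words) >= 2 and words[-1] == "sin1":
        return words[:-2] + ["sin1", words[-2]]
    return list(words)


def choose_mandarin(words, i):
    """Steps 2 and 3: pick one Mandarin candidate for the word at position i."""
    candidates = DICTIONARY[words[i]]  # raises KeyError for unknown words
    if len(candidates) == 1:
        return candidates[0][1]
    # Assumption: a word directly after the adverb 'sin1' is a verb.
    wanted = "vi" if i > 0 and words[i - 1] == "sin1" else "n"
    for pos, pinyin in candidates:
        if pos == wanted:
            return pinyin
    return candidates[0][1]


def translate(sentence):
    """Translate a space-separated Cantonese pinyin sentence into Mandarin pinyin."""
    words = convert_syntax(sentence.split())
    return " ".join(choose_mandarin(words, i) for i in range(len(words)))
```

Calling translate on the Cantonese input for "you go first" reorders the adverb, selects the verb reading of hang4, and yields the Mandarin pinyin of the worked example.</Paragraph>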
    <Paragraph position="2"> We are in the process of implementing an inter-dialect MT prototype, called CPC, for translation between Cantonese and Putonghua (i.e., Mandarin) in both directions, Cantonese-to-Putonghua and Putonghua-to-Cantonese. Input and output sentences are in pinyin or Chinese characters. The programming languages used are Prolog and Java. We are doing Cantonese-to-Putonghua first, based on the design. At the current stage, we have built a handful of rules and a Cantonese-Mandarin bi-dialect dictionary of about 3000 words and phrases, based on some well-established books (e.g., Zeng, 1984; Mai and Tang, 1997); when completed, the dictionary will contain around 10,000 word entries. A Cantonese-Mandarin dialect corpus is also being built. The program can process sentences of a number of typical patterns. The funded project has two immediate purposes: to facilitate language communication and to help Hong Kong students write standard Mandarin Chinese.</Paragraph>
    <Paragraph position="3"> Conclusion: Compared with inter-language MT, inter-dialect MT is much more manageable, both linguistically and technically. Though generally ignored, the development of inter-dialect MT systems is both rewarding and more feasible. The present paper discusses the design and implementation of dialect MT systems at the pinyin and character levels, with special attention to Mandarin Chinese and Cantonese. When supported by modern technology for multimedia communication over the Internet and the WWW, dialect MT systems will produce even greater benefits (Zhang and Lau, 1996). Nonetheless, the research reported in this paper can only be regarded as an initial exploratory step into an exciting new research area. There is much room for further research and discussion, especially in word disambiguation and syntax analysis. We should also note that the grammars of ordinary dialects are normally less well described than those of lingua francas.</Paragraph>
  </Section>
</Paper>