<?xml version="1.0" standalone="yes"?> <Paper uid="C00-2145"> <Title>A Model of Competence for Corpus-Based Machine Translation</Title> <Section position="3" start_page="998" end_page="999" type="metho"> <SectionTitle> 3 Molecular vs. Holistic CBMT </SectionTitle> <Paragraph position="0"> As discussed in the previous section, all CBMT systems make use of some text dimensions in order to map a source language text into the target language.</Paragraph> <Paragraph position="1"> TMs, for instance, rely on the set of graphemical symbols, i.e. the ASCII set. Richer systems use lexical, morphological, syntactic and/or semantic descriptions. The degree to which the set of descriptions is independent from the reference translations determines the molecularity of the theory. The more the descriptions are learned from, and thus depend on, the reference translations, the more the system becomes holistic. Learning descriptions from reference translations makes the system more robust and easier to adjust to a new text domain.</Paragraph> <Paragraph position="2"> SBMT approaches, e.g. (Brown et al., 1990), have a purely holistic view of languages. Every sentence of one language is considered to be a possible translation of any sentence in the other language. No account is given for the equivalence of the source language meaning and the target language meaning other than by means of global considerations concerning frequencies of occurrence in the reference text. In order to compute the most probable translations, each pair of items of the source language and the target language is associated with a certain probability. This prior probability is derived from the reference text. In the translation phase, several target language sequences are considered and the one with the highest posterior probability is then taken to be the translation of the source language string.</Paragraph> <Paragraph position="3"> Similarly, neural network based CBMT systems (McLean, 1992) are holistic approaches.
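The decision rule described above for SBMT can be sketched as follows: pick the target sequence that maximizes the product of the prior probability (from the reference text) and the translation likelihood. This is a minimal illustration, not Brown et al.'s actual model; all probability values below are invented for the example.

```python
# Toy sketch of the SBMT decision rule: choose the target sentence e
# maximizing P(e) * P(f | e). The probability tables are hypothetical,
# not estimated from any real reference text.

# Hypothetical prior probabilities P(e), derived from the reference text.
prior = {
    "the house": 0.04,
    "house the": 0.0001,
}

# Hypothetical translation probabilities P(f | e).
likelihood = {
    ("das haus", "the house"): 0.3,
    ("das haus", "house the"): 0.3,
}

def translate(f, candidates):
    """Return the candidate e with the highest posterior score P(e) * P(f|e)."""
    return max(candidates, key=lambda e: prior[e] * likelihood[(f, e)])

print(translate("das haus", ["the house", "house the"]))  # -> the house
```

Here the translation model alone cannot decide between the two word orders; the prior (language-model) term breaks the tie, which is the holistic, global-frequency reasoning the paragraph describes.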
The training of the weights and the minimization of the classification error rely on the reference text as a whole. Attempts to extract rules from trained neural networks seek to isolate and make explicit aspects of how the net successfully classifies new sequences.</Paragraph> <Paragraph position="4"> The training process, however, remains holistic.</Paragraph> <Paragraph position="5"> TMs implement the molecular CBMT approach as they rely on a static distance metric which is independent from the size and content of the case base.</Paragraph> <Paragraph position="6"> TMs are molecular because they rely on a fixed and limited set of graphic symbols. Adding further example translations to the database neither increases the set of graphic symbols nor modifies the distance metric. Learning capacities in TMs are trivial, as their only way to learn is through extension of the example base.</Paragraph> <Paragraph position="7"> The translation templates generated by Güvenir and Cicekli (1998), for instance, differ according to the similarities and dissimilarities found in the reference text. Translation templates in this system thus reflect holistic aspects of the example translations. The way in which morphological analysis is performed is, however, independent from the translation examples and is thus a molecular aspect of the system.</Paragraph> <Paragraph position="8"> Similarly, the ReVerb EBMT system (Collins, 1998) makes use of holistic components. The reference text is part-of-speech tagged. The length of translation segments as well as their most likely initial and final words are calculated based on probabilities.</Paragraph> </Section> <Section position="4" start_page="999" end_page="999" type="metho"> <SectionTitle> 4 Coarse vs.
Fine Graining CBMT </SectionTitle> <Paragraph position="0"> One task that all MT systems perform is to segment the text to be translated into translation units which -- to a certain extent -- can be translated independently. The ways in which segmentation takes place and how the translated segments are joined together in the target language are different in each MT system. In (Collins, 1998) segmentation takes place on a phrasal level. Due to the lack of a rich morphological representation, agreement cannot always be granted in the target language when translating single words from English to German. Reliable translation cannot be guaranteed when phrases in the target language - or parts of them - are moved from one position (e.g. the object position) into another one (e.g. a subject position).</Paragraph> <Paragraph position="1"> In (Güvenir and Cicekli, 1998), this situation is even more problematic because there are no restrictions on possible fillers of translation template slots. Thus, a slot which has originally been filled with an object can, in the translation process, even accommodate an adverb or the subject.</Paragraph> <Paragraph position="2"> SBMT approaches perform fine-grained segmentation. Brown et al. (1990) segment the input sentences into words, where for each source-target language word pair translation probabilities, fertility probabilities, alignment probabilities etc. are computed. Coarse-grained segmentation is unrealistic because sequences of n words (so-called n-grams) occur very rarely for n > 3, even in huge learning corpora 1. Statistical (and probabilistic) systems rely on word frequencies found in texts and usually cannot extrapolate from a very small number of word occurrences. A statistical language model assigns to each n-gram a probability which enables the system to generate the most likely target language strings.</Paragraph> <Paragraph position="3"> 1 Brown et al. (1990) use the Hansard French-English text containing several million words.</Paragraph> </Section> <Section position="5" start_page="999" end_page="1000" type="metho"> <SectionTitle> 5 A Competence Model for CBMT </SectionTitle> <Paragraph position="0"> A competence model is presented as two independent parameters, i.e. Coverage and Quality (see Figure 2).</Paragraph> <Paragraph position="1"> * Coverage of the system refers to the extent to which a variety of source language texts can be translated. A system has a high coverage if a great variety of texts can be translated. A low-coverage system can translate only restricted texts of a certain domain with limited terminology and linguistic structures.</Paragraph> <Paragraph position="2"> * Quality refers to the degree to which an MT system produces successful translations. A system has a low quality if the produced translations are not even informative in the sense that a user cannot understand what the source text is about. A high-quality MT system produces user-oriented and correct translations with respect to text type, terminological preferences, personal style, etc.</Paragraph> <Paragraph position="3"> An MT system with low coverage and low quality is completely uninteresting. Such a system comes close to a random number generator as it translates few texts in an unpredictable way.</Paragraph> <Paragraph position="4"> An MT system with high coverage and &quot;not-too-bad&quot; quality can be useful in a Web application where a great variety of texts are to be translated for occasional users who want to grasp the basic ideas of a foreign text.
On the other hand, a system with high quality and restricted coverage might be useful for in-house MT applications or a controlled language.</Paragraph> <Paragraph position="5"> An MT system with high coverage and high quality would translate any type of text to everyone's satisfaction. However, as one can expect, such a system seems not to be feasible.</Paragraph> <Paragraph position="6"> Boitet (1999) proposes &quot;the (tentative) formula: Coverage * Quality = K&quot;, where K depends on the MT technology and the amount of work encoded in the system. The question, then, is when the maximum K is possible and how much work we want to invest for what purpose. Moreover, a given K can mean high coverage and low quality, or it can mean the reverse.</Paragraph> <Paragraph position="7"> The expected quality of a CBMT system increases when the input text is segmented more coarsely. Consequently, a low coverage must be expected due to the combinatorial explosion of the number of longer chunks. In order for a fine-graining system to generate at least informative translations, further knowledge resources need to be considered. These knowledge resources may either be pre-defined and molecular, or they can be derived from reference translations and holistic.</Paragraph> <Paragraph position="8"> TMs focus on the quality of translations. Only large clusters of meaning entities are translated into the target language, in the hope that such clusters will not interfere with the context from which they are taken. Broader coverage can be achieved through finer-grained segmentation of the input into phrases or single terms.
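The TM behaviour discussed here and in Section 3 -- retrieval of large chunks via a static distance metric over a case base -- can be sketched minimally as follows. The case base and the choice of `difflib.SequenceMatcher` as the fixed similarity metric are illustrative assumptions, not the metric of any particular commercial TM.

```python
# Minimal sketch of translation-memory retrieval: a fixed, static
# similarity metric over a case base of example translations.
# Adding examples extends coverage but never changes the metric
# itself -- the "molecular" property noted in Section 3.
from difflib import SequenceMatcher

# Hypothetical case base of (source, target) example translations.
case_base = [
    ("press the start button", "druecken Sie die Starttaste"),
    ("close the file", "schliessen Sie die Datei"),
]

def retrieve(query):
    """Return the target side of the stored example most similar to query."""
    best_src, best_tgt = max(
        case_base,
        key=lambda ex: SequenceMatcher(None, query, ex[0]).ratio(),
    )
    return best_tgt

print(retrieve("press the stop button"))  # closest match: the Starttaste example
```

Because the whole chunk is reused, agreement inside the retrieved segment is preserved; the price is coverage, since a query far from every stored example still returns only the nearest (possibly unusable) chunk.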
Systems which finely segment texts use rich representation languages in order to adapt the translation units to the target language context or, as in the case of SBMT systems, use holistically derived constraints.</Paragraph> <Paragraph position="9"> What can be learned and what should be learned from the reference text, how to represent the inferred knowledge, how to combine it with pre-defined knowledge, and the impact of different settings on the constant K in the formula of Boitet (1999) are all still open questions for CBMT design.</Paragraph> </Section> </Paper>