<?xml version="1.0" standalone="yes"?>
<Paper uid="J96-1001">
  <Title>Translating Collocations for Bilingual Lexicons: A Statistical Approach</Title>
  <Section position="4" start_page="3" end_page="5" type="intro">
    <SectionTitle>
3. Collocations and Machine Translation
</SectionTitle>
    <Paragraph position="0"> Collocations, commonly occurring word pairs and phrases, are a notorious source of difficulty for non-native speakers of a language (Leed and Nakhimovsky 1979; Benson 1985; Benson, Benson, and Ilson 1986). This is because they cannot be translated on a word-by-word basis. Instead, a speaker must be aware of the meaning of the phrase as a whole in the source language and know the common phrase typically used in the target language. While collocations are not predictable on the basis of syntactic or semantic rules, they can be observed in language and thus must be learned through repeated usage. For example, in American English one says set the table while in British English the phrase lay the table is used. These are expressions that have evolved over time. It is not the meaning of the words lay and set that determines the use of one or the other in the full phrase. Here, the verb functions as a support verb; it derives its meaning in good part from the object in this context and not from its own semantic features. In addition, such collocations are flexible. The constraint is between the verb and its object and any number of words may occur between these two elements (e.g., You will be setting a gorgeously decorated and lavishly appointed table designed for a king).</Paragraph>
    <Paragraph position="1"> Collocations also include rigid groups of words that do not change from one context to another, such as compounds, as in Canadian Charter of Rights and Freedoms.</Paragraph>
    <Paragraph position="2"> To understand the difficulties that collocations pose for translation, consider sentences (1e) and (1f) in Figure 1. Although these sentences are relatively simple, automatically translating (1e) as (1f) involves several problems. Inability to translate on a word-by-word basis is due in part to the presence of collocations. For example, the English collocation to demonstrate support is translated as prouver son adhésion. This translation uses words that do not correspond to individual words in the source; the English translation of prouver is prove and son adhésion translates as one's adhesion. As a phrase, however, prouver son adhésion carries the same meaning as the source phrase.</Paragraph>
    <Paragraph position="3"> Other groups of words in (1e) cause similar problems, including to take steps to, provisions of the Charter, and to enforce provisions. [Footnote 2: These corpora had little noise. Most sentences neatly corresponded to translations in the paired corpus, with few extraneous sentences.]</Paragraph>
    <!-- Running header: Computational Linguistics Volume 22, Number 1 -->
    <Paragraph position="4"> (1e) &quot;Mr. Speaker, our Government has demonstrated its support for these important principles by taking steps to enforce the provisions of the Charter more vigorously.&quot; (1f) &quot;Monsieur le Président, notre gouvernement a prouvé son adhésion à ces importants principes en prenant des mesures pour appliquer plus systématiquement les préceptes de la Charte.&quot; Figure 1: Example pair of matched sentences from the Hansards corpus.</Paragraph>
    <Paragraph position="5"> These groups are identified as collocations for a variety of reasons. For example, to take steps is a collocation because to take is used here as a support verb for the noun steps. The agent our government doesn't actually physically take anything; rather, it has begun the process of enforcement through small, concrete actions. While the French translation en prenant des mesures does use the French for take, the object is the translation of a word that does not appear in the source, measures. These are flexible collocations exhibiting variations in word order. On the other hand, the compound provisions of the Charter is very commonly used as a whole in a much more rigid way.</Paragraph>
    <Paragraph position="6"> This example also illustrates that collocations are domain dependent, often forming part of a sublanguage. For example, Mr. Speaker is the proper way to refer to the Speaker of the House in the Canadian Parliament when speaking English. The French equivalent, Monsieur le Président, is not the literal translation but instead uses the translation of the term President. While this is an appropriate translation for the Canadian Parliament, in different contexts another translation would be better. Note that these problems are quite similar to the difficulties in translating technical terminology, which also is usually part of a particular technical sublanguage (Dagan and Church 1994). The ability to automatically acquire collocation translations is thus a definite advantage for sublanguage translation. When moving to a new domain and sublanguage, translations that are appropriate can be acquired by running Champollion on a new corpus from that domain.</Paragraph>
    <Paragraph position="7"> Since in some instances parts of a sentence can be translated on a word-by-word basis, a translator must know when a full phrase or pair of words must be considered for translation and when a word-by-word technique will suffice.</Paragraph>
    <Paragraph position="8"> Two tasks must therefore be considered:</Paragraph>
    <Paragraph position="9"> 1. Identify collocations, or phrases which cannot be translated on a word-by-word basis, in the source language.</Paragraph>
    <Paragraph position="10"> 2. Provide adequate translation for these collocations.</Paragraph>
    <Paragraph position="11"> For both tasks, general knowledge of the two languages is not sufficient. It is also necessary to know the expressions used in the sublanguage, since we have seen that idiomatic phrases often have different translations in a restricted sublanguage than in general usage. In order to produce a fluent translation of a full sentence, it is necessary to know the specific translation for each of the source collocations.</Paragraph>
    <!-- Running header: Smadja, McKeown, and Hatzivassiloglou, Translating Collocations for Bilingual Lexicons -->
    <Paragraph position="12"> We use XTRACT (Smadja and McKeown 1990; Smadja 1991a; Smadja 1993), a tool we developed previously, to identify collocations in the source language (task 1). XTRACT works in three stages. In the first stage, word pairs that co-occur with significant frequency are identified. These words can be separated by up to four intervening words and thus constitute flexible collocations. In the second stage, XTRACT identifies combinations of word pairs from stage one with other words and phrases, producing compounds and idiomatic templates (i.e., phrases with one or more holes to be filled by specific syntactic types). In the final stage, XTRACT filters any pairs that do not consistently occur in the same syntactic relation, using a parsed version of the corpus. This tool has been used in several projects at Columbia University and has been distributed to a number of research and commercial sites worldwide.</Paragraph>
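    <Paragraph> The first XTRACT stage described above can be illustrated with a minimal sketch, offered only as an aid to the reader and not as XTRACT's implementation. It counts ordered word pairs separated by at most four intervening words in a toy corpus. The function name cooccurring_pairs, the parameters max_gap and min_count, and the toy sentences are all assumptions made for illustration, and a raw count cutoff stands in for the statistical significance test, which this section does not specify. <![CDATA[
```python
from collections import Counter


def cooccurring_pairs(sentences, max_gap=4, min_count=2):
    """Count ordered word pairs with at most max_gap intervening words.

    Illustrative only: min_count is a hypothetical raw-frequency cutoff
    standing in for a real significance test.
    """
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, left in enumerate(tokens):
            # A gap of g intervening words puts the partner at i + g + 1,
            # so the last admissible partner index is i + max_gap + 1.
            last = min(i + max_gap + 2, len(tokens))
            for j in range(i + 1, last):
                counts[(left, tokens[j])] += 1
    # Keep only the pairs that reach the (assumed) frequency cutoff.
    return {pair: n for pair, n in counts.items() if n >= min_count}


# Toy corpus invented for this example; it is not drawn from the Hansards.
toy_corpus = [
    "the government is taking steps to enforce the charter",
    "we must take concrete steps now",
    "they will take several small steps forward",
    "please take the necessary legal steps soon",
]

pairs = cooccurring_pairs(toy_corpus)
print(pairs.get(("take", "steps")))  # prints 3
```
]]> On the toy corpus, take and steps are paired three times, with one, two, and three intervening words, which is the kind of flexible collocation that stage one is meant to surface.</Paragraph>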
    <Paragraph position="13"> XTRACT has been developed and tested on English-only input. For optimal performance, XTRACT itself relies on other tools, such as a part-of-speech tagger and a robust parser. Although such tools are becoming more widely available in many languages, they are still hard to find. We have thus assumed in Champollion that these tools were only available in one of the two languages; namely, English, termed the source language throughout the paper.</Paragraph>
  </Section>
</Paper>