<?xml version="1.0" standalone="yes"?>
<Paper uid="J05-4003">
  <Title>Improving Machine Translation Performance</Title>
  <Section position="8" start_page="500" end_page="501" type="concl">
    <SectionTitle>
9. Discussion
</SectionTitle>
    <Paragraph position="0"> The most important feature of our parallel sentence selection approach is its robustness. Comparable corpora are inherently noisy environments, where even similar content may be expressed in very different ways. Moreover, out-of-domain corpora introduce additional difficulties related to limited dictionary coverage. Therefore, the ability to reliably judge sentence pairs in isolation is crucial.</Paragraph>
    <Paragraph position="1"> Comparable corpora of interest are usually large; thus, processing them requires efficient algorithms. The computational processes involved in our system are quite modest. All the operations necessary for the classification of a sentence pair (filtering, word alignment computation, and feature extraction) can be implemented efficiently and scaled up to very large amounts of data. The task can easily be parallelized for increased speed. For example, extracting data from 600k English documents and 500k Chinese documents (Section 4.2) required only about 7 days of processing time on 10 processors.</Paragraph>
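The per-pair independence described above is what makes the extraction embarrassingly parallel. The following is a minimal sketch of that structure, not the paper's implementation: `passes_filter`, `score_pair`, and the token-overlap "feature" are hypothetical toy stand-ins for the paper's length filter, word-alignment computation, and trained classifier.

```python
from concurrent.futures import ThreadPoolExecutor

def passes_filter(pair):
    """Toy length-ratio filter: discard pairs whose token counts differ too much."""
    src, tgt = pair
    ls, lt = len(src.split()), len(tgt.split())
    if ls == 0 or lt == 0:
        return False
    return min(ls, lt) / max(ls, lt) >= 0.5

def score_pair(pair):
    """Score one candidate pair in isolation. The monolingual token-overlap
    score below is a placeholder for alignment-based features fed to a
    trained classifier, as in the paper."""
    if not passes_filter(pair):
        return 0.0
    src, tgt = pair
    s, t = set(src.lower().split()), set(tgt.lower().split())
    return len(s & t) / len(s | t)

def mine_parallel(pairs, threshold=0.3, workers=4):
    """Because each pair is judged independently, the scoring map
    parallelizes trivially across workers (or, at scale, across machines)."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        scores = list(pool.map(score_pair, pairs))
    return [p for p, sc in zip(pairs, scores) if sc >= threshold]
```

The key design property is that no shared state is needed during scoring, so throughput grows linearly with the number of workers.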
    <Paragraph position="2"> The data that we extract is useful. Its impact on MT performance is comparable to that of human-translated data of similar size and domain. Thus, although we have focused our experiments on the particular scenario where there is little in-domain training data available, we believe that our method can be useful for increasing the amount of training data, regardless of the domain of interest.</Paragraph>
    <Paragraph position="3"> As we have shown, this could be particularly effective for language pairs for which only very small amounts of parallel data are available. By acquiring a large comparable corpus and performing a few bootstrapping iterations, we can obtain a training corpus that yields a competitive MT system.</Paragraph>
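The bootstrapping procedure mentioned above can be sketched as the loop below. This is an illustrative outline under stated assumptions, not the paper's code: `train` and `extract` are hypothetical callables standing in for re-estimating extraction resources (e.g., a lexicon) from the current corpus and for mining new sentence pairs with those resources.

```python
def bootstrap(seed_corpus, comparable_corpus, train, extract, iterations=3):
    """Iteratively grow a parallel corpus: train extraction resources on
    the current corpus, mine new sentence pairs from the comparable
    corpus, fold them back in, and repeat until nothing new is found."""
    corpus = list(seed_corpus)
    for _ in range(iterations):
        resources = train(corpus)                        # e.g., re-estimate a lexicon
        new_pairs = extract(comparable_corpus, resources)
        added = [p for p in new_pairs if p not in corpus]
        if not added:                                    # converged: no new pairs
            break
        corpus.extend(added)
    return corpus
```

Each iteration can only add pairs, so a small seed corpus for a low-resource language pair can be grown into a usable training set.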
    <Paragraph position="4"> We suspect our approach can be used on comparable corpora coming from any domain. The only domain-dependent element of the system is the date window parameter of the article selection stage (Figure 7); for other domains, this can be replaced with a more appropriate indication of where the parallel sentences are likely to be found. For example, if the domain were that of technical manuals, one would cluster printer manuals and aircraft manuals separately. It is important to note that our work assumes that the comparable corpus does contain parallel sentences (which is the case for our data). Whether this is true for comparable corpora from other domains is an empirical question outside the scope of this article; however, both our results and those of Resnik and Smith (2003) strongly indicate that good data is available on the Web.</Paragraph>
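The date-window selection described above amounts to a simple pairing predicate. A minimal sketch, assuming articles are represented as records with a publication date (the field names here are hypothetical):

```python
from datetime import date, timedelta

def candidate_article_pairs(src_articles, tgt_articles, window_days=5):
    """Pair source- and target-language articles published within a fixed
    date window. For a non-news domain, this predicate would be replaced
    by a more suitable grouping, e.g., clustering technical manuals by
    product type."""
    window = timedelta(days=window_days)
    return [
        (s, t)
        for s in src_articles
        for t in tgt_articles
        if abs(s["date"] - t["date"]) <= window
    ]
```

The point of the predicate is only to cut the quadratic space of document pairs down to those likely to share content; the sentence-pair classifier then does the real filtering.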
    <Paragraph position="5"> Lack of parallel corpora is a major bottleneck in the development of SMT systems for most language pairs. The method presented in this paper is a step towards the important goal of automatic acquisition of such corpora. Comparable texts are available on the Web in large quantities for many language pairs and domains. In this article, we have shown how they can be efficiently mined for parallel sentences.</Paragraph>
  </Section>
</Paper>