<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-0310"> <Title>Bootstrapping Parallel Corpora</Title>
<Section position="3" start_page="100000" end_page="100000" type="metho">
<SectionTitle> 2 Co-training for Statistical Machine Translation </SectionTitle>
<Paragraph position="0"> Most statistical natural language processing tasks use supervised machine learning, meaning that they require training data containing examples that have been annotated with some sort of labels. Two conflicting factors make this reliance on annotated training data a problem:
* The accuracy of machine learning improves as more data is available (as we have shown for statistical machine translation in Figure 1).</Paragraph>
<Paragraph position="1"> * Annotated training data usually has some cost associated with its creation. This cost can often be substantial, as with the Penn Treebank (Marcus et al., 1993).</Paragraph>
<Paragraph position="2"> There has recently been considerable interest in weakly supervised learning within the statistical NLP community. The goal of weakly supervised learning is to reduce the cost of creating new annotated corpora by (semi-)automating the process.</Paragraph>
<Paragraph position="3"> Co-training is a weakly supervised learning technique which uses an initially small amount of human-labeled data to automatically bootstrap larger sets of machine-labeled training data. In co-training implementations, multiple learners are used to label new examples and are then re-trained on some of each other's labeled examples. The use of multiple learners increases the chance that useful information will be added: an example which is easily labeled by one learner may be difficult for the other, so adding the confidently labeled example will provide new information in the next round of training.</Paragraph>
<Paragraph position="4"> Self-training is a weakly supervised method in which a single learner retrains on the labels that it applies to unlabeled data itself. We describe its application to machine translation in order to clarify how co-training would work. In self-training, a translation model would be trained for a language pair, say German⇒English, from a German-English parallel corpus. It would then produce English translations for a set of German sentences. The machine-translated German-English sentences would be added to the initial bilingual corpus, and the translation model would be retrained.</Paragraph>
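As a rough illustration, the sketch below shows the self-training loop just described, together with the multi-model co-training variant introduced in the next paragraph, in which several translation models translate the same sentence alignment and the best candidate is kept. It is a minimal sketch only: the names train_translation_model, translate, and score_translation are hypothetical placeholders for whatever underlying translation system and selection criterion are used, and are not part of the original work.

    def self_train(bilingual_corpus, monolingual_sentences):
        # Train an initial model, e.g. German => English, from a parallel corpus
        # represented as a list of (source_sentence, english_sentence) pairs.
        model = train_translation_model(bilingual_corpus)
        # Translate the monolingual (German) sentences into English.
        machine_translated = [(src, model.translate(src))
                              for src in monolingual_sentences]
        # Add the machine-translated pairs to the corpus and retrain.
        return train_translation_model(bilingual_corpus + machine_translated)

    def co_train(bilingual_corpora, multilingual_corpus):
        # One model per source language (e.g. German, French, Spanish => English),
        # each trained from its own bilingual corpus.
        models = {lang: train_translation_model(corpus)
                  for lang, corpus in bilingual_corpora.items()}
        for alignment in multilingual_corpus:  # {language: source_sentence}
            # Each model proposes a candidate English translation of its own
            # side of the sentence alignment.
            candidates = {lang: models[lang].translate(sentence)
                          for lang, sentence in alignment.items()}
            # Keep the best candidate (the selection criterion is assumed here).
            best = max(candidates.values(), key=score_translation)
            # Pair the chosen English translation with every source sentence
            # and add it to the corresponding bilingual corpus.
            for lang, sentence in alignment.items():
                bilingual_corpora[lang].append((sentence, best))
        # Retrain each model on its enlarged corpus.
        return {lang: train_translation_model(corpus)
                for lang, corpus in bilingual_corpora.items()}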
<Paragraph position="5"> Co-training for machine translation is slightly more complicated. Rather than using a single translation model to translate a monolingual corpus, it uses multiple translation models to translate a bi- or multi-lingual corpus. For example, translation models could be trained for German⇒English, French⇒English and Spanish⇒English from appropriate bilingual corpora, and then used to translate a German-French-Spanish parallel corpus into English. Since there are three candidate English translations for each sentence alignment, the best translation of the three can be selected and used to retrain the models. The process is illustrated in Figure 2. Co-training thus automatically increases the size of parallel corpora. There are a number of reasons why machine-translated items added during co-training can be useful in the next round of training:
* vocabulary acquisition - One problem that arises from having a small training corpus is incomplete word coverage. Without a word occurring in its training corpus, it is unlikely that a translation model will produce a reasonable translation of it. Because the initial training corpora can come from different sources, a collection of translation models will be more likely to have encountered a word before. This leads to vocabulary acquisition during co-training.</Paragraph>
<Paragraph position="6"> * coping with morphology - The problem mentioned above is further exacerbated by the fact that most current statistical translation formulations have an incomplete treatment of morphology. This would be a problem if, for example, the training data for a Spanish translation model contained the masculine form of an adjective but not the feminine. Because languages vary in how they use morphology (some languages have grammatical gender whereas others do not), one language's translation model might contain the translation of a particular word form whereas another's would not. Thus co-training can increase the inventory of word forms and reduce the problem that morphology poses for simple statistical translation models.
* improved word order - A significant source of errors in statistical machine translation is the word re-ordering problem (Och et al., 1999). Word order between related languages is often similar, while word order between distant languages may differ significantly. By including more examples through co-training with related languages, the translation models for distant languages will better learn word order mappings to the target language.</Paragraph>
<Paragraph position="7"> In all these cases, the diversity afforded by multiple translation models increases the chances that the machine-translated sentences added to the initial bilingual corpora will be accurate. Our co-training algorithm allows many source languages to be used.</Paragraph>
</Section>
</Paper>