<?xml version="1.0" standalone="yes"?>
<Paper uid="I05-2039">
  <Title>The influence of data homogeneity on NLP system performance</Title>
  <Section position="2" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Homogeneity of large corpora is still an unclear notion. In this study we link the notions of similarity and homogeneity: a large corpus consists of sets of documents to which a similarity score, defined by cross-entropic measures, may be assigned (similarity is implicitly expressed in the data). The distribution of the similarity scores of such subcorpora may then be interpreted as a representation of the homogeneity of the main corpus, which can in turn be used to perform corpus adaptation, tuning a corpus-based NLP system to a particular domain.</Paragraph>
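The cross-entropic similarity score between subcorpora can be sketched as follows. This is a minimal illustration, assuming unigram language models with add-one smoothing and a symmetrised score; the paper does not specify the exact model, so this is not the authors' implementation.

```python
import math
from collections import Counter

def cross_entropy(train_tokens, test_tokens):
    """Cross-entropy (bits/token) of a unigram model estimated on
    train_tokens, evaluated on test_tokens, with add-one smoothing.
    Perplexity is 2 ** cross_entropy."""
    counts = Counter(train_tokens)
    vocab = set(train_tokens) | set(test_tokens)
    total = sum(counts.values()) + len(vocab)  # add-one smoothing mass
    return -sum(math.log2((counts[t] + 1) / total)
                for t in test_tokens) / len(test_tokens)

def similarity(corpus_a, corpus_b):
    """Symmetrised cross-entropy between two tokenised subcorpora:
    a lower score means the subcorpora are more similar."""
    return 0.5 * (cross_entropy(corpus_a, corpus_b)
                  + cross_entropy(corpus_b, corpus_a))
```

Scoring every pair of subcorpora this way yields the distribution of similarity scores whose shape is read as a representation of the main corpus's homogeneity.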
    <Paragraph position="1"> (Cavaglià 2002) makes the assumption that a corpus-based NLP system generally yields better results with homogeneous training data than with heterogeneous data, and experiments on a text classifier (Rainbow1), with mixed conclusions. We reassess this assumption by experimenting on language model perplexity and on an EBMT system translating from Japanese to English.</Paragraph>
  </Section>
</Paper>