<?xml version="1.0" standalone="yes"?>
<Paper uid="I05-2039">
<Title>The influence of data homogeneity on NLP system performance</Title>
<Section position="2" start_page="0" end_page="0" type="intro">
<SectionTitle>
1 Introduction
</SectionTitle>
<Paragraph position="0"> The homogeneity of large corpora remains an ill-defined notion. In this study we link the notions of similarity and homogeneity: a large corpus is composed of sets of documents to which a similarity score, defined by cross-entropic measures, may be assigned (similarity is implicitly expressed in the data). The distribution of the similarity scores of such subcorpora may then be interpreted as a representation of the homogeneity of the main corpus, which can in turn be used to perform corpus adaptation, tuning a corpus-based NLP system to a particular domain.</Paragraph>
<Paragraph position="1"> (Cavaglià 2002) makes the assumption that a corpus-based NLP system generally yields better results with homogeneous training data than with heterogeneous data, and experiments on a text classification system (Rainbow), with mixed conclusions. We reassess this assumption by experimenting on language model perplexity and on an EBMT system translating from Japanese to English.</Paragraph>
</Section>
</Paper>