<?xml version="1.0" standalone="yes"?>
<Paper uid="I05-2039">
<Title>The influence of data homogeneity on NLP system performance</Title>
<Section position="5" start_page="229" end_page="230" type="concl">
<SectionTitle>5 Discussion and future work</SectionTitle>
<Paragraph position="0"> The contribution of this work is twofold: we describe a method of representing homogeneity according to a cross-entropic measure of similarity to reference sublanguages, which can be used to profile language resources. A corpus is represented by the distribution of the similarity coefficients of the smaller subsets it contains, and atypical, and therefore heterogeneous, data can be identified by the low frequency of their coefficient values in that distribution.</Paragraph>
<Paragraph position="1"> We further observe that excluding such atypical data in order to restrict the domain on which a corpus-based NLP system operates does not yield better performance, either in terms of perplexity when the system is based on stochastic language models, or in terms of objective translation quality when the system is a grammar-based EBMT system.</Paragraph>
<Paragraph position="2"> An objective for future work is therefore to study corpus adaptation with Out-of-Domain data. While (Cavaglià 2002) also acknowledged that, for minimal sizes of training data, the best NLP system performance is reached with homogeneous resources, we would like to know more precisely why and to what extent mixing In-Domain and Out-of-Domain data yields better accuracy. Concerning the representation of homogeneity, further experiments are needed to tackle the multidimensionality of sublanguage varieties more explicitly. We would like to consider multiple sublanguage references in order to untangle the dimensions of register variation in spoken and written language.</Paragraph>
</Section>
</Paper>
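The profiling method summarized in the first paragraph can be illustrated with a minimal sketch. It is not the authors' implementation: whitespace tokenization, a unigram reference model with add-one smoothing, a fixed chunk size, and the function names (homogeneity_profile, cross_entropy) are all assumptions made for illustration. Each chunk's cross-entropy against the reference sublanguage serves as its similarity coefficient, and the corpus is profiled by the distribution of these values; chunks falling in the sparse tail of that distribution correspond to the "atypical" data discussed above.

# Minimal sketch (assumed setup, not the paper's actual system): profile a corpus
# by the cross-entropy of its chunks against a unigram model of a reference sublanguage.
import math
from collections import Counter

def unigram_model(ref_tokens):
    """Token counts and total size of the reference sublanguage."""
    counts = Counter(ref_tokens)
    return counts, sum(counts.values())

def cross_entropy(chunk_tokens, ref_counts, ref_total, vocab_size):
    """Average negative log-probability (bits/token) of a chunk under the reference model,
    with add-one smoothing for unseen tokens."""
    h = 0.0
    for tok in chunk_tokens:
        p = (ref_counts.get(tok, 0) + 1) / (ref_total + vocab_size)
        h -= math.log2(p)
    return h / max(len(chunk_tokens), 1)

def homogeneity_profile(corpus_tokens, ref_tokens, chunk_size=500):
    """Distribution of similarity coefficients (cross-entropies) over fixed-size chunks."""
    ref_counts, ref_total = unigram_model(ref_tokens)
    vocab_size = len(set(ref_tokens) | set(corpus_tokens))
    chunks = [corpus_tokens[i:i + chunk_size]
              for i in range(0, len(corpus_tokens), chunk_size)]
    return [cross_entropy(c, ref_counts, ref_total, vocab_size) for c in chunks]

# Chunks whose cross-entropy lies in the low-frequency tail of the returned
# distribution are the atypical (heterogeneous) portions of the corpus.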