<?xml version="1.0" standalone="yes"?> <Paper uid="I05-2039"> <Title>The influence of data homogeneity on NLP system performance</Title> <Section position="3" start_page="0" end_page="227" type="metho"> <SectionTitle> 2 A framework for corpus homogeneity 2.1 Previous work on corpus similarity and homogeneity </SectionTitle>

<Paragraph position="0"> A range of measures of corpus similarity has been put forward in the literature: (Kilgarriff and Rose 98; Kilgarriff 2001) investigated the similarity of corpora and compared &quot;Known Similarity Corpora&quot; (KSC) using perplexity and cross-entropy on words, word frequency measures, and a χ2-test, which they found to be the most robust. However, as acknowledged in (Kilgarriff and Rose 98), using KSC requires that the two corpora chosen for comparison be sufficiently similar that the most frequent lexemes in them overlap almost perfectly, and (Liebscher 2003) showed, by comparing frequency counts over different large Google Groups corpora, that this is not usually the case.</Paragraph>

<Paragraph position="1"> Measuring homogeneity by counting word or lexeme frequencies introduces additional difficulties, as it assumes that the word is an obvious, well-defined unit. This is not the case in Chinese (Sproat and Emerson 2003) or Japanese (Matsumoto et al., 2002), for instance, where word segmentation is not trivial. (Denoual 2004) showed that similarity between corpora can be quantified with a coefficient based on the cross-entropies of probabilistic models built upon reference data. The approach needs no explicit feature selection and is language independent, as it relies on character-based models (as opposed to word-based models), thus bypassing the word segmentation issue and making it applicable to any electronic data.</Paragraph>

<Paragraph position="2"> The cross-entropy H_{T}(A) of an N-gram model p constructed on a training corpus T, estimated on a test corpus A = {s_1,...,s_Q} of Q sentences, with s_i = {c_{i1}...c_{i|s_i|}} a sentence of |s_i| characters, is:</Paragraph>

<Paragraph position="3"> $$H_T(A) = -\frac{1}{\sum_{i=1}^{Q}|s_i|}\sum_{i=1}^{Q}\sum_{j=1}^{|s_i|}\log_2 p_{ij} \qquad (1)$$ </Paragraph>

<Paragraph position="4"> where $p_{ij} = p(c_{ij} \mid c_{i,j-N+1} \ldots c_{i,j-1})$.</Paragraph>

<Paragraph position="5"> We therefore define a scale of similarity between two corpora on which to rank any third given one. Two reference corpora T1 and T2 are selected by the user and used as training sets to compute N-gram character models. The cross-entropies of these two reference models are estimated on a third test set T3, and named H_{T1}(T3) and H_{T2}(T3) respectively, following the notation of Eq. 1. The cross-entropies of each model are also estimated on both references, i.e., H_{T1}(T2) and H_{T1}(T1), H_{T2}(T1) and H_{T2}(T2), so as to obtain the weights W1 and W2 of references T1 and T2:</Paragraph>

<Paragraph position="6"> $$W_1 = \frac{H_{T_1}(T_2) - H_{T_1}(T_3)}{H_{T_1}(T_2) - H_{T_1}(T_1)}, \qquad W_2 = \frac{H_{T_2}(T_1) - H_{T_2}(T_3)}{H_{T_2}(T_1) - H_{T_2}(T_2)} \qquad (2)$$ </Paragraph>

<Paragraph position="7"> After which W1 and W2 are taken to be the weights of the barycentre between the user-chosen references, T1 placed at 0 and T2 at 1. Thus</Paragraph>

<Paragraph position="8"> $$I(T_3) = \frac{W_2}{W_1 + W_2} \qquad (3)$$ </Paragraph>

<Paragraph position="9"> is defined to be the similarity coefficient between reference sets 1 and 2, which are respectively corpus T1 and corpus T2. Given the previous assumptions, I(T1) = 0 and I(T2) = 1; furthermore, any given corpus T3 yields a score between these two extrema. This framework may be applied to quantify the similarity of large corpora, by projecting them onto a scale defined implicitly by the choice of reference data.
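For illustration, here is a minimal Python sketch of the coefficient computation. It assumes add-one-smoothed character trigram models (an arbitrary choice, not the paper's setting), the function names are illustrative rather than taken from any published toolkit, and the weight normalisation follows the reconstruction of Eqs. 2 and 3 above.

```python
import math
from collections import Counter

N = 3  # character trigrams; an illustrative choice, not the paper's setting

def ngram_model(text, n=N):
    """Counts of character n-grams and their (n-1)-character contexts."""
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    contexts = Counter(text[i:i + n - 1] for i in range(len(text) - n + 1))
    return grams, contexts, len(set(text))

def cross_entropy(model, test, n=N):
    """Per-character cross-entropy H_T(A) of Eq. 1, add-one smoothed."""
    grams, contexts, vocab = model
    logp, count = 0.0, 0
    for i in range(n - 1, len(test)):
        p = (grams[test[i - n + 1:i + 1]] + 1) / (contexts[test[i - n + 1:i]] + vocab)
        logp += math.log2(p)  # log2 of p_ij
        count += 1
    return -logp / count

def similarity_coefficient(t1, t2, t3, n=N):
    """I(T3) on the scale bounded by references T1 (-> 0) and T2 (-> 1)."""
    m1, m2 = ngram_model(t1, n), ngram_model(t2, n)
    h1 = {x: cross_entropy(m1, x, n) for x in (t1, t2, t3)}
    h2 = {x: cross_entropy(m2, x, n) for x in (t1, t2, t3)}
    w1 = (h1[t2] - h1[t3]) / (h1[t2] - h1[t1])  # weight of T1, cf. Eq. 2
    w2 = (h2[t1] - h2[t3]) / (h2[t1] - h2[t2])  # weight of T2
    return w2 / (w1 + w2)  # barycentre of positions 0 and 1, cf. Eq. 3
```

By construction the sketch returns 0 when T3 is the first reference and 1 when it is the second, matching the extrema stated above.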
In this study we shall specifically focus on a scale of similarity bounded by a sublanguage of spoken conversation on the one hand and a sublanguage of written-style media on the other.</Paragraph>

<Paragraph position="10"> We build upon this previous work in order to represent intra-corpus homogeneity.</Paragraph>

<Section position="1" start_page="226" end_page="227" type="sub_section"> <SectionTitle> 2.2 Representing corpus homogeneity </SectionTitle>

<Paragraph position="0"> Corpora are collected sets of documents usually originating from various sources. Whether a corpus is homogeneous in content or not is seldom known beyond the nature of those sources. As homogeneity is multidimensional (see (Biber 1988) and (Biber 1995) for considerations on the dimensions of register variation, for instance), one cannot trivially say that a corpus is homogeneous or heterogeneous: different sublanguages show variations that are lexical, semantic, syntactic, and structural (Kittredge and Lehrberger 1982).</Paragraph>

<Paragraph position="1"> In this study we wish to capture such variations implicitly by applying the similarity framework described above to the representation of homogeneity. Coefficients of similarity may be computed for all smaller sets in a corpus, and their distribution depicts the homogeneity of the corpus relative to the scale defined implicitly by the choice of reference data.</Paragraph>

<Paragraph position="2"> Homogeneity as depicted here is relative to the choice of reference training data, which implicitly embraces the lexical and syntactic variations of a sublanguage (which are by no means unidimensional, as argued previously). As in (Denoual 2004), we focus on a scale of similarity bounded by a sublanguage of spoken conversation on the one hand and a sublanguage of written-style media on the other.</Paragraph> </Section>

<Section position="2" start_page="227" end_page="227" type="sub_section"> <SectionTitle> 3.1 Data </SectionTitle>

<Paragraph position="0"> Reference data is needed to set up a scale of similarity and implicitly bound it.</Paragraph>

<Paragraph position="1"> For the sublanguage of spoken conversation we used, for both English and Japanese, the SLDB (Spontaneous Speech Database) corpus, a multilingual corpus of raw transcripts of dialogues described in (Nakamura et al., 1996).</Paragraph>

<Paragraph position="2"> For the sublanguage of written-style media, we used for English a part of the Calgary corpus containing several pieces of contemporary English literature, and for Japanese a corpus of collected articles from the Nikkei Shinbun newspaper. The large multilingual corpus used in our study is the C-STAR Japanese/English part of an aligned multilingual corpus, the Basic Traveller's Expressions Corpus (BTEC).</Paragraph>

<Paragraph position="3"> A prerequisite of the method is that the levels of data transcription be strictly normalized, so that the comparison is made not on the transcription method but on the underlying signal itself.</Paragraph> </Section>

<Section position="3" start_page="227" end_page="227" type="sub_section"> <SectionTitle> 3.2 Homogeneity in the BTEC </SectionTitle>

<Paragraph position="0"> The BTEC is a collection of sentences originating from 197 sets (one set per phrasebook) of basic travel expressions.
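To make the homogeneity representation of Section 2.2 concrete, the following sketch computes the coefficient distribution over the subsets of a corpus. It reuses the hypothetical similarity_coefficient function from the earlier sketch; the subset granularity (phrasebook or sentence) is left to the caller.

```python
import statistics

def homogeneity_profile(subsets, ref_spoken, ref_written):
    """Distribution of similarity coefficients over the subsets of a corpus
    (e.g. one phrasebook, or one sentence, per subset). Its mean and spread
    depict homogeneity relative to the two chosen references."""
    coeffs = [similarity_coefficient(ref_spoken, ref_written, s) for s in subsets]
    return statistics.mean(coeffs), statistics.pstdev(coeffs), coeffs
```

A narrow distribution would indicate a corpus whose subsets are homogeneous with respect to the spoken/written scale; a wide one, a heterogeneous mixture.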
Here we examine the distribution of the similarity coefficients assigned to its subsets.</Paragraph>

<Paragraph position="1"> The corpus may be segmented in a variety of manners; we proceed in two intuitive ways: firstly, by keeping the original subdivision, i.e., one phrasebook per subset; secondly, at the level of the sentence, i.e., one sentence per subset. Figure 1 shows the similarity coefficient distributions for Japanese and English at the sentence and subset levels, and Table 1 shows their means and standard deviations.</Paragraph>

<Paragraph position="2"> The difference in mean and standard deviation values can be explained by the fact that not all phrasebooks have the same size in lines. The distribution of similarity coefficients at the line level, although similar to the distribution at the phrasebook level, suggests by its irregularities that it is indeed safer to use a larger unit to estimate cross-entropies. Moreover, we do not wish to tamper with the integrity of the original subsets, that is, we keep the contents of each phrasebook together as much as possible.</Paragraph>

<Paragraph position="3"> At the phrasebook level, the similarity coefficient correlates only weakly with both the average phrasebook length (0.178) and the average line length (0.278), so it is not merely a &quot;shallow&quot; profiling method. On the other hand, the correlation between the Japanese and English coefficients is high (0.781), which is only to be expected intuitively.</Paragraph> </Section> </Section>

<Section position="4" start_page="227" end_page="229" type="metho"> <SectionTitle> 4 Experiments </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="227" end_page="229" type="sub_section"> <SectionTitle> 4.1 Method </SectionTitle>

<Paragraph position="0"> This work reassesses the assumption that, for the same amount of training data, a corpus-based NLP system performs better when its training data tends to be homogeneous. Here we use the representation of homogeneity defined by the similarity coefficient scale to select data that tends to be homogeneous with respect to an expected task. Experiments are performed both on randomly selected data and on data selected according to its similarity coefficient: the closer the coefficient of the training data to the coefficient of the expected task, the better.</Paragraph>

<Paragraph position="1"> We assume that the task is sufficiently represented by a set of data from the same domain as the large bicorpus used, the BTEC. Experiments are performed on a test set of 510 Japanese sentences which are not included in the resource.</Paragraph>

<Paragraph position="3"> These sentences are first used for language model perplexity estimation, then as input sentences for the EBMT system. The task is found to have a coefficient of I0 = 0.331; the average coefficient of a BTEC phrasebook being 0.330, the task lies squarely in the domain of the resource (a sketch of such a coefficient-based selection follows).
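The selection just described could be realised along the following lines. This is a sketch under the assumption that whole phrasebooks are ranked by the distance of their coefficient to the task coefficient I0 and kept until the requested share of the corpus is reached; the paper does not spell out its exact thresholding, and all names here are illustrative.

```python
def select_homogeneous(phrasebooks, coeffs, i0, fraction):
    """Keep the phrasebooks whose coefficient lies closest to the task
    coefficient i0, until `fraction` of the corpus (in characters) is kept."""
    budget = fraction * sum(len(b) for b in phrasebooks)
    ranked = sorted(zip(phrasebooks, coeffs), key=lambda bc: abs(bc[1] - i0))
    selected, size = [], 0
    for book, _ in ranked:
        if size >= budget:
            break
        selected.append(book)
        size += len(book)
    return selected
```

Random selection of the same character budget then gives the baseline against which the homogeneous selection is compared.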
We examine the influence of training data size, first on language model perplexity, then on the quality of translation from Japanese to English by an example-based MT system.</Paragraph>

<Paragraph position="4"> 4.1.1 Language model perplexity Even if perplexity does not always correlate strongly with the performance of NLP systems, it remains an indicator of language model complexity, as it estimates the average branching factor of a language model. The measure is popular in the NLP community because, admittedly, when perplexity decreases, the performance of systems based on stochastic models tends to increase.</Paragraph>

<Paragraph position="5"> We compute the perplexities of character language models built on variable amounts of training data, first taken randomly from the Japanese part of the BTEC, then selected around the expected task coefficient I0 (thresholds are determined by the amount of training data to be kept). Cross-entropies are estimated on the test set, and all estimations are performed five times for the random data selections and averaged. Figure 3 shows the character perplexity values for increasing amounts of data, from 0.5% to 100% of the BTEC, interpolated. As expected, perplexity decreases as training data increases and tends towards an asymptote as more data is used for training.</Paragraph>

<Paragraph position="6"> [Figure 3: character perplexity of language models built on increasing amounts of randomly chosen BTEC and homogeneous Japanese data.]</Paragraph>

<Paragraph position="7"> While homogeneous data yields lower perplexity scores for small amounts of training data (up to 15% of the resource, roughly 1.5 megabytes), beyond this point perplexity is slightly higher than for a model trained on randomly selected data. Except for the smaller amounts of data, there seems to be no benefit in using homogeneous rather than random heterogeneous training data as far as model perplexity is concerned; on the contrary, excessively restricting the domain seems to yield higher model perplexities.</Paragraph>

<Paragraph position="8"> 4.1.2 Example-based machine translation quality In this section we experiment with a Japanese-to-English grammar-based EBMT system, HPATR (described in (Imamura 2001)), which parses a bicorpus with grammars for both the source and target languages, and translates by automatically generating transfer patterns from bilingual trees constructed on the parsed data. Not being an MT system based on stochastic methods, it is used here as a task evaluation criterion complementary to language model perplexity. Systems are likewise constructed on variable amounts of training data and evaluated on the previous task of 510 Japanese sentences, to be translated from Japanese to English.</Paragraph>

<Paragraph position="9"> Because it is not feasible here to have humans judge the quality of many sets of translated data, we rely on an array of well-known automatic evaluation measures to estimate translation quality: * BLEU (Papineni et al. 2002) is the geometric mean of the n-gram precisions in the output with respect to a set of reference translations.
It is bounded between 0 and 1, higher scores indicate better translations, and it tends to be highly correlated with the fluency of outputs; * NIST (Doddington 2002) is a variant of BLEU based on the arithmetic mean of weighted n-gram precisions in the output with respect to a set of reference translations.</Paragraph>

<Paragraph position="10"> It has a lower bound of 0 and no upper bound; higher scores indicate better translations, and it tends to be highly correlated with the adequacy of outputs; * mWER (Och 2003), or Multiple Word Error Rate, is the edit distance in words between the system output and the closest reference translation in a set. It is bounded between 0 and 1, and lower scores indicate better translations.</Paragraph>

<Paragraph position="11"> Figure 2 shows BLEU, NIST, and mWER scores for increasing amounts of data, from 0.5% to 100% of the BTEC, interpolated. As expected, MT quality increases as training data increases and tends towards an asymptote as more data is used in training. Here again, except for the smaller amounts of data (up to 3% of the BTEC for BLEU, up to 18% for NIST, and up to 2% for mWER), translation quality under all three evaluation methods is equal or higher when using random heterogeneous data. A mean comparison of the 510 paired per-sentence score values, for instance at 50% of training data, shows this difference to be statistically significant for the BLEU, NIST, and mWER scores at confidence levels of 88.49%, 99.9%, and 73.24% respectively (one way to run such a paired comparison is sketched below).</Paragraph> </Section> </Section> </Paper>
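For completeness, here is one way such a paired mean comparison could be run. This is a sketch only: the paper does not name its exact significance test, and a two-sided paired t-test via SciPy is assumed here.

```python
from scipy import stats

def paired_confidence(scores_a, scores_b):
    """Two-sided paired t-test over per-sentence score pairs (e.g. the 510
    sentence-level scores of the random and homogeneous systems); returns
    the confidence level, in percent, at which the means differ."""
    _, p_value = stats.ttest_rel(scores_a, scores_b)
    return (1.0 - p_value) * 100.0
```

Under this reading, a returned value of 99.9 would correspond to the NIST result above, while 88.49 and 73.24 would fall short of conventional significance thresholds.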