<?xml version="1.0" standalone="yes"?>
<Paper uid="C04-1139">
<Title>Linguistic profiling of texts for the purpose of language verification</Title>
<Section position="3" start_page="0" end_page="0" type="metho">
<SectionTitle> 2 Linguistic profiling </SectionTitle>
<Paragraph position="0"> In linguistic profiling, the occurrences of a large number of linguistic features, either individual items or combinations of items, are counted in a text. These counts are then normalized for text length, and it is determined to what extent (expressed as a number of standard deviations) they differ from the mean observed in a profile reference corpus. For each text, the deviation scores are combined into a profile vector, on which a variety of distance measures can be used to position the text relative to any group of other texts.</Paragraph>
<Section position="1" start_page="0" end_page="0" type="sub_section">
<SectionTitle> 2.1 Language verification </SectionTitle>
<Paragraph position="0"> Linguistic profiling makes it possible to identify (groups of) texts which are similar, at least in terms of the profiled features (cf. van Halteren, 2004). We have found that the recognition process can be vastly improved by providing not only positive examples (in the present case native texts) but also negative examples (here the non-native texts). So we expect that, given a seed corpus containing both native and non-native texts, linguistic profiling should be able to distinguish between these two types of texts.</Paragraph>
</Section>
<Section position="2" start_page="0" end_page="0" type="sub_section">
<SectionTitle> 2.2 Features </SectionTitle>
<Paragraph position="0"> As previous research has shown (see e.g. Biber, 1995), there are a great many linguistic features that contribute to marked structural differences between texts. These features mark 'basic grammatical, discourse, and communicative functions' (Biber, 1995: 104). They comprise features referring to vocabulary, lexical patterning, syntax, semantics, pragmatics, information content or item distribution through a text. Here we restrict ourselves to lexical features.</Paragraph>
<Paragraph position="1"> Sufficiently frequent tokens, i.e. those that were observed to occur with a certain frequency in some language reference corpus, are used as features by themselves. In the present case these are items that occur at least five times in the written texts from the BNC Sampler (BNC, 2002). For less frequent tokens, we determine a token pattern consisting of the sequence of character types. For example, the token Uefa-cup is represented by the pattern &quot;#L#6+/CL-L&quot;, where the first &quot;L&quot; indicates low frequency, &quot;6+&quot; the size bracket, and the sequence &quot;CL-L&quot; indicates a capital letter followed by one or more lower case letters, a hyphen, and again one or more lower case letters. For lower case words, the final three letters of the word are also included in the pattern. For example, the token altercation is represented by the pattern &quot;#L#6+/L/ion&quot;. These patterns were originally designed for English and Dutch and will probably have to be extended for use with other languages.</Paragraph>
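[Editorial illustration] A minimal Python sketch of how such low-frequency token patterns could be derived. The two worked examples (Uefa-cup, altercation) come from the text above; the exact size-bracket boundaries and the treatment of capital runs and non-letter characters are assumptions made for illustration, not the authors' specification.

def char_class_pattern(token):
    # Map the token to a sequence of character-type codes: "C" for a capital
    # letter, "L" for a run of lower-case letters; other characters (hyphens,
    # digits, ...) are kept as they are. How capital runs are handled is an
    # assumption; the example in the text only shows a single capital.
    out = []
    for ch in token:
        if ch.isupper():
            out.append("C")
        elif ch.islower():
            if not out or out[-1] != "L":
                out.append("L")       # collapse lower-case runs into one "L"
        else:
            out.append(ch)
    return "".join(out)

def token_feature(token, freq, min_freq=5):
    # Frequent tokens (at least min_freq occurrences in the reference corpus)
    # are used as features by themselves; infrequent tokens are replaced by a
    # low-frequency pattern.
    if freq.get(token, 0) >= min_freq:
        return token
    bracket = "6+" if len(token) >= 6 else str(len(token))   # assumed brackets
    pattern = "#L#" + bracket + "/" + char_class_pattern(token)
    if token.islower():
        pattern += "/" + token[-3:]    # lower-case words keep the final three letters
    return pattern

# token_feature("Uefa-cup", {})     -> "#L#6+/CL-L"
# token_feature("altercation", {})  -> "#L#6+/L/ion"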
<Paragraph position="2"> Furthermore, for this specific task, we wanted to avoid recognizing text topics rather than nativeness, and decided to mask content words.</Paragraph>
<Paragraph position="3"> Any high frequency word classified primarily as noun, verb or adjective (see below) which had a high document bias (cf. van Halteren, 2003) was replaced by the marker #HC# followed by the same type of pattern we use for low frequency words, but always without the final three letters.</Paragraph>
<Paragraph position="4"> This occludes topical words like brain or injury, while leaving more functional words like case or times intact.</Paragraph>
<Paragraph position="5"> In addition to the form of the token, we also use the syntactic potential of the token as a feature. We apply the first few modules of a morphosyntactic tagger (in this case the tagger described by van Halteren, 2000) to the text, which determine which word class tags could apply to each token. For known words, the tags are taken from a lexicon; for unknown words, they are estimated on the basis of the word patterns described above. The most likely tags (with a maximum of three) are combined into a single feature. Thus still is associated with the feature &quot;RR-JJ-NN1&quot; and forms with the feature &quot;NN2-VVZ&quot;. Note that the most likely tags are determined exclusively on the basis of the current token; the context in which the token occurs is not taken into account. The modules of the tagger which are normally used to obtain a context-dependent disambiguation are not applied.</Paragraph>
<Paragraph position="6"> On top of the individual token and tag features, we use all possible bi- and trigrams. For example, the token combination an attractive option is associated with the complex feature &quot;wcw= #HF#an#HC#JJ#HC#6+/L&quot;. Since the number of features quickly grows too big to allow for efficient processing, we filter the set of features. This is done by requiring that a feature occur in a set minimum number of texts in the profile reference corpus (in the present case a feature must occur in at least two texts). A feature which is filtered out contributes to a rest category feature. Thus, the complex feature above would contribute to &quot;wcw=<OTHER>&quot;.</Paragraph>
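[Editorial illustration] The n-gram construction and the rest-category filtering can be sketched as follows. This is an illustrative sketch, not the authors' implementation: the concatenation of per-token features and the per-type naming of the rest category (everything before the "=" sign) are assumptions based on the wcw= and wcw=<OTHER> examples above.

from collections import Counter

def ngram_features(token_feats, n, prefix="wcw="):
    # Each per-token feature is assumed to carry its own marker, e.g. "#HF#an",
    # "#HC#JJ" or "#HC#6+/L"; consecutive features are concatenated into one
    # complex n-gram feature.
    return [prefix + "".join(token_feats[i:i + n])
            for i in range(len(token_feats) - n + 1)]

def filter_features(texts, min_texts=2):
    # Keep a feature only if it occurs in at least `min_texts` texts of the
    # profile reference corpus; all other features are folded into a rest
    # category per feature type, e.g. "wcw=<OTHER>".
    doc_freq = Counter()
    for feats in texts:
        doc_freq.update(set(feats))          # count texts, not occurrences
    keep = {f for f, df in doc_freq.items() if df >= min_texts}

    def fold(f):
        return f if f in keep else f.split("=", 1)[0] + "=<OTHER>"

    return [[fold(f) for f in feats] for feats in texts]

# ngram_features(["#HF#an", "#HC#JJ", "#HC#6+/L"], 3)
#   -> ["wcw=#HF#an#HC#JJ#HC#6+/L"]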
<Paragraph position="7"> The lexical features currently also include features that relate to utterance length. For each utterance two such features are determined, viz. the exact length (e.g. &quot;len=15&quot;) and the length bracket (e.g. &quot;len=10-19&quot;).</Paragraph>
</Section>
<Section position="3" start_page="0" end_page="0" type="sub_section">
<SectionTitle> 2.3 Classification </SectionTitle>
<Paragraph position="0"> When offered a list of positive and negative texts for training, and a list of test texts, the system first constructs a featurewise average of the profile vectors of all positive texts. It then determines a raw score for all text samples in the list. Rather than using the normal distance measure, we opted for a non-symmetric measure which is a weighted combination of two factors: a) the difference between text score and average profile score for each feature and b) the text score by itself. This makes it possible to assign more importance to features whose count deviates significantly from the norm. The following distance formula is used:</Paragraph>
<Paragraph position="2"> Δ_T = Σ_i |T_i − A_i|^D · |T_i|^S, where T_i and A_i are the values for the i-th feature for the text sample profile and the positive average profile respectively, and D and S are the weighting factors that can be used to assign more or less importance to the two factors described. The distance measure is then transformed into a score by the formula Score_T = Σ_i |T_i|^(D+S) − Δ_T.</Paragraph>
<Paragraph position="4"> The score will grow with the similarity between text sample profile and positive average profile.</Paragraph>
<Paragraph position="5"> The first component serves as a correction factor for the length of the text sample profile vector.</Paragraph>
<Paragraph position="6"> The order of magnitude of the score values varies with the setting of D and S, and with the text collection. In order to bring the values into a range which is suitable for subsequent calculations, we express them as the number of standard deviations they differ from the mean of the scores of the negative example texts.</Paragraph>
</Section>
</Section>
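[Editorial illustration] A compact sketch of this scoring procedure, under the assumption that the distance and score formulas reconstructed above are the intended ones; the final normalization against the negative example texts follows the description in the preceding paragraph.

import statistics

def raw_score(text_profile, avg_profile, D=1.0, S=0.0):
    # Non-symmetric similarity of a text profile T to an average profile A:
    # Score_T = sum_i |T_i|^(D+S) - sum_i |T_i - A_i|^D * |T_i|^S
    delta = sum(abs(t - a) ** D * abs(t) ** S
                for t, a in zip(text_profile, avg_profile))
    length_term = sum(abs(t) ** (D + S) for t in text_profile)
    return length_term - delta

def normalized_scores(test_profiles, avg_profile, negative_profiles, D=1.0, S=0.0):
    # Express raw scores as the number of standard deviations they differ
    # from the mean score of the negative example texts.
    neg = [raw_score(p, avg_profile, D, S) for p in negative_profiles]
    mu, sd = statistics.mean(neg), statistics.stdev(neg)
    return [(raw_score(p, avg_profile, D, S) - mu) / sd for p in test_profiles]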
<Section position="4" start_page="0" end_page="0" type="metho">
<SectionTitle> 3 Language verification </SectionTitle>
<Paragraph position="0"> In order to test the feasibility of language verification by way of linguistic profiling, we need data which is guaranteed to be written by native and non-native speakers respectively. Moreover, the texts (native and non-native) should be as similar as possible with respect to the genre they represent. For the present study, therefore, we opted for the student essays in the Louvain Corpus of Native English Essays (LOCNESS) and the International Corpus of Learner English (ICLE; Granger et al., 2002).</Paragraph>
<Section position="1" start_page="0" end_page="0" type="sub_section">
<SectionTitle> 3.1 LOCNESS and ICLE </SectionTitle>
<Paragraph position="0"> ICLE is a collection of mostly argumentative essays written by advanced EFL students from various mother-tongue backgrounds. The essays are each some 500-1000 words long (unabridged) and although they 'cover a variety of topics, the content is similar in so far as the topics are all non-technical and argumentative (rather than narrative, for instance)' (cf. Granger, 1998:10). The size of the national sub-corpora is approx. 200,000 words per corpus. Metadata are available with the data, as they have been collected via a learner profile questionnaire.</Paragraph>
<Paragraph position="1"> LOCNESS is in various respects comparable to ICLE. It is a 300,000-word corpus mainly of essays written by English and American university students. A small part of the corpus (some 60,000 words) consists of British English A-level essays. Topics include transport, the parliamentary system, fox hunting, boxing, the National Lottery, and genetic engineering.</Paragraph>
</Section>
<Section position="2" start_page="0" end_page="0" type="sub_section">
<SectionTitle> 3.2 Training and test texts </SectionTitle>
<Paragraph position="0"> In order to be able to control for language variation between British and American English, we opted for only the British part of LOCNESS.</Paragraph>
<Paragraph position="1"> Because this totalled only some 155,000 words, we decided to hold out about one third as test material and use the other two thirds for training. In order to have as little overlap as possible in essay type and topic between training and test material, we used sub-corpora 2, 3 and 8 of the A-level essays and sub-corpus 3 of the university student essays for testing.</Paragraph>
<Paragraph position="2"> For the ICLE texts, we chose to use each tenth text for training purposes. The remaining texts were used for testing.</Paragraph>
</Section>
<Section position="3" start_page="0" end_page="0" type="sub_section">
<SectionTitle> 3.3 General results </SectionTitle>
<Paragraph position="0"> In the first step of training, we selected the features to be profiled. We used all features which occurred in more than one training text, i.e. about 470K features. In the second step, we selected the system parameters D and S for two classification models: similarity to the native texts (D=1.0, S=0.0) and similarity to the non-native texts (D=1.2, S=0.2). The selection was based on the quality of classifying half of the training texts with the system having been trained on the other half.</Paragraph>
<Paragraph position="1"> The verification results for the test set of A-level texts are shown in Figure 1. The further the texts are plotted to the right, the more similar their profile is to the mean profile for the A-level training texts. The further the texts are plotted towards the top, the more similar their profile is to the mean profile for the ICLE training texts.</Paragraph>
<Paragraph position="2"> Most of the texts form a central cluster in the bottom right quadrant. A small gap separates them from a group of five near outliers, while there are two far outliers. We decided to use the limits of the central cluster as our classification separator, accepting that 10% of the LOCNESS texts would be rejected. We added the separation line to the plot. In order to create a reference frame linking this figure to the following ones, we add a second line, along the core of the cluster of the LOCNESS texts. Even though the core of the clusters in the successive figures may shift, this line remains constant, as does the plotting area.</Paragraph>
<Paragraph position="3"> Figure 1. Text classification of the LOCNESS test texts in terms of similarity to native texts (horizontal axis) and similarity to non-native texts (vertical axis). The separation line (top right to bottom left) divides the plot area into a native part (bottom right) and a non-native part (top left). The second line (top left to bottom right) is a reference line which allows comparison between this Figure and the following Figures. Verification results differ per nationality; a more detailed examination of such variation, however, is beyond the scope of the present paper.</Paragraph>
<Paragraph position="4"> The two dimensions, the degree of similarity to native texts and the degree of similarity to non-native texts, are strongly (negatively) correlated. Still, there are also clear differences, so that both dimensions contribute substantially to the quality of the separation.</Paragraph>
</Section>
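[Editorial illustration] The resulting decision procedure can be sketched as follows, building on the normalized_scores() sketch given after section 2.3. The linear form of the separator and its slope/intercept are placeholders: in the paper the separation line is chosen from the limits of the central cluster in the plot, and its exact equation is not reported.

def classify_text(profile, native_model, nonnative_model, separator):
    # native_model / nonnative_model are (avg_profile, negative_profiles, D, S)
    # tuples as expected by normalized_scores(); separator is a (slope,
    # intercept) pair describing the separation line in the plot.
    x = normalized_scores([profile], *native_model)[0]      # similarity to native
    y = normalized_scores([profile], *nonnative_model)[0]   # similarity to non-native
    slope, intercept = separator
    # Texts below the separation line (the bottom-right, native part of the
    # plot) are accepted as native.
    return "native" if y < slope * x + intercept else "non-native"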
<Section position="4" start_page="0" end_page="0" type="sub_section">
<SectionTitle> 3.4 Distinguishing features </SectionTitle>
<Paragraph position="0"> When examining some of the features that emerge from studies reported in the literature as salient in describing different language varieties, we find that none of these dominates the classification. Table 1 shows the influence of each feature in terms of its contribution (expressed as millionths of the total influence, so e.g. 3173 corresponds to 0.3% of the total influence) to the decision to classify a text as native or non-native.</Paragraph>
<Paragraph position="1"> The second and third columns show the influence of the words (or word combinations) by themselves, which is extremely low. However, when examining all patterns containing these words (the fourth and fifth columns), their usefulness becomes visible.</Paragraph>
<Paragraph position="2"> Previous studies into the use of intensifying adverbs have shown an overuse of the token very.</Paragraph>
<Paragraph position="3"> Thus it is a likely candidate to be considered as a marker of non-native language use. The second column in the Table confirms this, but the contribution is a mere 0.001%. The picture changes when we consider all patterns in which very occurs: it appears that there is indeed a difference in the use of the token by natives and non-natives. However, there are as many patterns that point to nativeness as there are that point to non-nativeness. Furthermore, the patterns provide a sizeable contribution to the classification either way.</Paragraph>
<Paragraph position="4"> Table 1. Contribution to the classification of allegedly salient features. Although the expected features (or rather features related to expected words or word combinations) have a visible contribution, their influence is still only a small part of the total influence. In fact, all features have only very little influence. The most influential single feature is ccc=#HF#AT--#HF#NN1--#HF#CC--RRx13, one of the representations of the, followed by a single common noun, followed by and, a pattern unlikely to be spotted by humans. It contributes 0.06% of the influence classifying texts as non-native. Only 137 features in total contribute more than 0.01% either way. Classification by linguistic profiling is a matter of myriads of small hints rather than a few pieces of strong evidence. This is probably also what makes it robust against high text variability and sometimes small text sizes.</Paragraph>
</Section>
</Section>
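[Editorial illustration] How such per-feature contributions might be computed is sketched below. The paper does not spell out its influence measure, so the decomposition of the score into one additive term per feature, and the scaling to millionths of the total absolute influence, are assumptions made purely for illustration.

def feature_contributions(text_profile, avg_profile, D=1.0, S=0.0):
    # Decompose the raw score Score_T = sum_i (|T_i|^(D+S) - |T_i - A_i|^D * |T_i|^S)
    # into its per-feature terms and express each term's share in millionths
    # of the total absolute influence.
    terms = [abs(t) ** (D + S) - abs(t - a) ** D * abs(t) ** S
             for t, a in zip(text_profile, avg_profile)]
    total = sum(abs(x) for x in terms) or 1.0   # guard against an all-zero profile
    return [round(1_000_000 * x / total) for x in terms]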
<Section position="5" start_page="0" end_page="0" type="metho">
<SectionTitle> 4 Domain Shifts </SectionTitle>
<Paragraph position="0"> Now that we have seen that language verification is viable within the restricted domain of student essays, we may examine whether it survives the shift to a new domain. We tested this on two corpora: the FLOB corpus and a (small) internet corpus that was especially collected for this purpose.</Paragraph>
<Section position="1" start_page="0" end_page="0" type="sub_section">
<SectionTitle> 4.1 FLOB </SectionTitle>
<Paragraph position="0"> The Freiburg LOB Corpus, informally known as FLOB (Hundt et al., 1998), is a modern counterpart to the much-used Lancaster-Oslo/Bergen Corpus (LOB; Johansson, 1978). It is a one-million-word corpus of written (educated) Modern British English. The composition of FLOB is essentially the same as that of LOB: it comprises 500 samples of 2,000 words each. In all, 15 text categories (A-R) are distinguished. These fall into four main classes: newspaper texts (A-C), miscellaneous informative prose (D-H), learned and scientific writing (J), and fiction (K-R); categories A-J together make up the non-fiction part of the corpus. Of these texts, the learned and scientific class (J) is closest to the ICLE and LOCNESS texts, and we should expect that the FLOB texts of this category are all accepted. This is indeed the case, as can be seen in Figure 3, which shows the classification of these texts. Only 1 text is rejected (1.25%). This seems to confirm that we are indeed recognizing something like '(near-)native English'.</Paragraph>
<Paragraph position="1"> As soon as we shift the domain of the texts, however, the native texts are no longer distinguished as clearly. The larger the domain shift, the more texts are rejected. Within the non-fiction portion of FLOB, the system rejects 2.3% of the newspaper texts (categories A-C) and 8.7% of the miscellaneous and informative prose texts (D-H). This leads to an overall reject rate of 5.6% for the non-fiction texts (Figure 4), which is still reasonably acceptable. When shifting to fiction texts (K-R), the reject rate jumps to 39.2% (Figure 5), indicating that a new classifier would have to be trained for a proper handling of fiction texts.</Paragraph>
<Paragraph position="2"> Figure 5. Text classification of the FLOB fiction texts (categories K-R).</Paragraph>
</Section>
<Section position="2" start_page="0" end_page="0" type="sub_section">
<SectionTitle> 4.2 Capital-Born </SectionTitle>
<Paragraph position="0"> Since our original goal was the filtering of internet texts, we compiled a small corpus of such texts. We chose texts which were present as HTML. These, we expected, were likely to be rather abundant, while they would have been subjected to a relatively low degree of editing.</Paragraph>
<Paragraph position="1"> Thus they would constitute likely candidates for filtering. In order to be able to decide whether the texts were native-written or not, we searched for autobiographical material, as indicated by the phrase I was born in CITY, with CITY replaced by the name of a capital city. The initial set of documents appeared to be of a reasonable size.</Paragraph>
<Paragraph position="2"> However, after filtering out webpages by multiple authors (e.g. guest books), fictional autobiographies (e.g. a joke page about Al Gore), texts judged likely to be edited possibly with the help of a native speaker (e.g. a page advertising Russian brides), misclassified city names (e.g. authors from Paris, Texas should not be assumed to be French) and texts outside the desired length of 500-1500 words, we ended up with a mere 20 native British English texts and 18 non-native texts. We nicknamed the corpus the &quot;Capital-Born corpus&quot;.</Paragraph>
<Paragraph position="3"> When classifying these texts with the A-level versus ICLE classifier, we see that they cluster tightly, outside the area plotted so far, showing no useful separation of native and non-native texts. This implies that if we want a filter for such texts, we have to train a new classifier.</Paragraph>
<Paragraph position="4"> Figure 6. Text classification of internet texts (for a description see section 4.2). We did train such a new classifier, using only the odd-numbered Capital-Born texts, and classified the even-numbered ones, using the same parameters D and S as above. We repeated the process with the two sets switching roles. Figure 6 shows a superposition of the classifications in the two experiments. The native texts appear as plus signs (+), the non-native texts as minus signs (-). Note that we adjusted the separation and support lines in order to bring them in line with the data. Only a rough separation is visible, with 2 out of 20 native texts and 6 out of 18 non-native texts misclassified. Still, given the extremely small size of the training sets and the variety of non-native nationalities, these results are rather promising. It appears that even internet texts can be filtered for nativeness, as long as a restricted, and more sizeable, seed corpus can be constructed.</Paragraph>
</Section>
</Section>
</Paper>