<?xml version="1.0" standalone="yes"?>
<Paper uid="P04-3012">
  <Title>Corpus representativeness for syntactic information acquisition</Title>
  <Section position="3" start_page="0" end_page="2" type="metho">
    <SectionTitle>
2 Experimental corpus description
</SectionTitle>
    <Paragraph position="0"> We have used a corpus of specialized technical texts, the CT. The CT is made up of subcorpora belonging to 5 different areas or domains (Medicine, Computing, Law, Economy and Environmental sciences) plus what is called a General subcorpus, made up basically of news. The size of the subcorpora ranges between 1 and 3 million words per domain. The CT corpus covers 3 different languages, although for the time being we have only worked on Spanish. For Spanish, the size of the subcorpora is stated in Table 1. All texts have been processed and annotated with morphosyntactic information.</Paragraph>
    <Paragraph position="1"> The CT corpus has been compiled as a test-bed for studying linguistic differences between general language and specialized texts. Nevertheless, for our purposes, we only considered it as a set of documents that represent the language used in particular knowledge domains. In fact, we use them to simulate the scenario where a user supplies a collection of documents with no specific sampling methodology behind it.</Paragraph>
    <Paragraph position="2"> 3 Measuring syntactic behavior: the case of adjectives
    We shall first motivate the statement that parsing lexicons require tuning for full coverage of a particular domain. We use the term "full coverage" to describe the ideal case where we would have correct information for all the words used in the (unknown a priori) set of texts we want an NLP application to handle. Note that full coverage implies two aspects. First, type coverage: all words that are used in a particular domain are in the lexicon. Second, the information contained in the lexicon must be the information needed by the grammar to parse every word occurrence as intended.</Paragraph>
    <Paragraph position="3"> Full coverage is not guaranteed by working with 'general language' dictionaries. Grammar developers know that the lexicon must be tuned to the application's domain, because general language dictionaries either contain too much information, causing overgeneration, or do not cover every possible syntactic context, in some cases because a context is specific to a particular domain. The key point for us was to see whether texts belonging to a domain justify this practice.</Paragraph>
    <Paragraph position="4"> In order to obtain objective data about the differences among domains that motivate lexicon tuning, we have carried out an experiment to study the syntactic behavior (syntactic contexts) of a list of about 300 adjectives in technical texts of four different domains. We have chosen adjectives because their syntactic behavior is easily captured by bigrams, as we will see below.</Paragraph>
    <Paragraph position="5"> Nevertheless, the same methodology could have been applied to other open categories.</Paragraph>
    <Paragraph position="6"> The first part of the experiment consisted of computing different contexts for adjectives occurring in texts belonging to 4 different domains. We wanted to find out how significant different uses could be; that is, different syntactic contexts for the same word depending on the domain. We took different parameters to characterize what we call 'syntactic behavior'.</Paragraph>
    <Paragraph position="7"> For adjectives, we defined 5 different parameters that were considered to be directly related to syntactic patterns. These were the following contexts:
    1) pre-nominal position, e.g. 'importante decisión' (important decision)
    2) post-nominal position, e.g. 'decisión importante'
    3) 'ser' copula predicative position, e.g. 'la decisión es importante' (the decision is important)
    4) 'estar' copula predicative position, e.g. 'la decisión está interesante/*importante' (the decision is interesting/important)
    5) modified by a quantity adverb, e.g. 'muy interesante' (very interesting).
    Table 1 shows the data gathered for the adjective "paralelo" (parallel) in the 4 different domain subcorpora. Note the difference in context 3 ('ser' copula) when observed in texts on computing, versus the other domains.</Paragraph>
    <Paragraph position="8"> The observed occurrences (as in Table 1) were used as parameters for building a vector for every lemma for each subcorpus. We used cosine distance (CD), that is, the cosine of the angle between the two vectors, to measure differences among the occurrences in different subcorpora. The closer CD is to 0, the more significantly different the syntactic behavior of a lemma in a particular subcorpus is from its behavior in the general subcorpus; the closer to 1, the more similar. Thus, the CD values for the case of 'paralelo' seen in Table 1 are the following:
    (Spanish copulative sentences are built with 2 different basic copulative verbs, 'ser' and 'estar'. Most authors tend to treat as 'lexical idiosyncrasy' the preference shown by particular adjectives for one of them, or for both although with different meanings.)</Paragraph>
    <Paragraph position="9"> Cosine distance shows divergences that have to do with large differences in quantity between parameters in the same position, whereas small quantities spread across the different parameters do not weigh significantly. Cosine distance was also considered to be interesting because it computes the relative weight of parameters within the vector. Thus we are not obliged to take into account relative frequency, which actually differs across the domains.</Paragraph>
    <Paragraph position="10"> What we were interested in was identifying significant divergences, like, in this case, the complete absence of predicative use of the adjective 'paralelo' in the computing corpus. The CD measure has been sensitive to the fact that no predicative use has been observed in texts on computing, the CD going down to 0.7. Cosine distance takes into account significant differences in the proportions of the quantities in the different features of the vector. Hence we decided to use CD to measure the divergence in syntactic behavior of the observed adjectives. Figure 1 plots CD for the 4 subcorpora (Medicine, Computing, Economy), each one compared with the general subcorpus. It corresponds to the observations for about 300 adjectives, which were present in all the corpora. More than half of the adjectives in each corpus are in fact below 0.9 similarity. Note also that this mark holds for the different corpora, independently of the number of tokens (Economy is made up of 1 million words and Medicine of 3 million).</Paragraph>
    <Paragraph position="11"> The data in Figure 1 suggest that, for lexicon tuning, the sample has to be rich in domain-dependent texts.</Paragraph>
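    <Paragraph> The vector comparison described above can be sketched as follows. This is only an illustrative sketch: the function name, the context labels and the counts are assumptions made for the example and are not the Table 1 data. It computes the cosine between two context-count vectors and shows that scaling one vector leaves the value unchanged, which is the property that lets raw counts be used instead of relative frequencies.

```python
import math

# The five syntactic contexts described in the text, in a fixed order
# (the label names are illustrative, not taken from the paper).
CONTEXTS = ("pre_nominal", "post_nominal", "ser_copula",
            "estar_copula", "quantity_adverb")


def cosine(u, v):
    """Cosine of the angle between two count vectors (1.0 = same proportions)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine is undefined for an all-zero vector")
    return dot / (norm_u * norm_v)


# Hypothetical counts for one adjective (NOT the paper's Table 1 data).
general = [2, 40, 10, 0, 1]
domain = [1, 35, 0, 0, 0]  # no 'ser' copula use observed in this subcorpus

cd = cosine(domain, general)

# Multiplying every count by a constant leaves the value unchanged, so
# subcorpora of different sizes can be compared on raw counts.
cd_scaled = cosine([10 * x for x in domain], general)
```

    With these made-up counts, the missing 'ser' copula use pulls the value below 1, while cd and cd_scaled agree up to floating-point rounding.</Paragraph>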
  </Section>
  <Section position="4" start_page="2" end_page="2" type="metho">
    <SectionTitle>
4 Frequency and CD measure
</SectionTitle>
    <Paragraph position="0"> To be sure that CD was a good measure, we checked to what extent the differences in what we called syntactic behavior, as measured by a low CD, could be due to a different number of occurrences in each of the observed subcorpora. It would have been reasonable to think that when something is seen more often, more different contexts can be observed, while when something is seen only a few times, variations are not that significant.</Paragraph>
    <Paragraph position="1"> [Figure caption fragment: '... frequency for every adjective.'] To relate CD to frequency, we took the difference of occurrences in two subcorpora as the frequency measure, that is, the number resulting from subtracting the occurrences in the computing subcorpus from the number of occurrences in the general subcorpus. This comparison clearly shows that there is no regular relation between a different number of occurrences in the two corpora and the observed divergence in syntactic behavior. Those elements that have a higher CD (0.9) range over all ranking positions: those that are 100 times more frequent in one than in the other, etc. Thus we can conclude that CD does capture syntactic behavior differences that are not motivated by frequency-related issues.</Paragraph>
    <Paragraph position="2"> 5 Corpus size and syntactic behavior
    We also wanted to determine the minimum corpus size at which syntactic behavior differences can be observed clearly. The idea was to measure when CD becomes stable, that is, independent of the number of occurrences observed. This measure would help us decide the minimum corpus size we need to have a reasonable representation for our induced lexicon. In fact, our starting point was to check whether syntactic behavior could be compared with the figures relating the number of types (lemmas) to the number of tokens in a corpus. Biber (1993) and Sanchez and Cantos (1998) demonstrate that the number of new types does not increase proportionally to the number of words once a certain quantity of texts has been observed.</Paragraph>
    <Paragraph position="3"> [Figure caption fragment: '... different subcorpus'] In our experiment, we split the computing corpus into 3 sets of 150K, 350K and 600K words in order to compare the CDs obtained. In Figure 3, the value 1 represents the whole computing corpus of 1,200K words for the set of 300 adjectives we had worked with before.</Paragraph>
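    <Paragraph> The frequency check described above can be sketched as follows. All records are hypothetical, and the Pearson coefficient is an assumption added here as one possible way to summarize the relation; the text itself only describes ranking adjectives by the frequency difference and inspecting how CD varies along that ranking.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical records (invented for illustration):
# (lemma, occurrences in general, occurrences in computing, CD).
records = [
    ("paralelo", 120, 300, 0.70),
    ("importante", 5000, 900, 0.95),
    ("nuevo", 8000, 2500, 0.85),
    ("digital", 200, 4000, 0.92),
    ("claro", 1500, 1400, 0.60),
]

# Frequency measure from the text: general occurrences minus computing occurrences.
freq_diff = [gen - comp for _, gen, comp, _ in records]
cds = [cd for _, _, _, cd in records]
lemmas = [lemma for lemma, _, _, _ in records]

# Rank adjectives by frequency difference to see how CD varies along the ranking.
ranking = sorted(zip(freq_diff, cds, lemmas))

# Assumed summary statistic: a value near 0 would be consistent with
# "no regular relation" between frequency difference and CD.
corr = pearson(freq_diff, cds)
```

    In a real run, the records would hold one entry per adjective observed in both subcorpora.</Paragraph>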
  </Section>
  <Section position="5" start_page="2" end_page="2" type="metho">
    <SectionTitle>
3M GEN
</SectionTitle>
    <Paragraph position="0"> As shown in Figure 3, the results of this comparison were conclusive: for the computing corpus, with half of the corpus, that is, around 600K words, we already have a good representation of the whole corpus. The CD is above 0.9 for all adjectives (mean 0.97, standard deviation 0.009). Surprisingly, the CD of the general corpus, the one that is made up of 3 million words of news, is lower than the CD achieved for the smallest computing subcorpus. Table 3 shows the mean and standard deviation for all the subcorpora. What Table 3 suggests is that, according to CD measured as shown here, the corpus to be used for inducing information about syntactic behavior does not need to be very large, but it does need to be made up of texts representative of a particular domain. It is part of our future work to confirm that Machine Learning techniques can really induce syntactic information from such a corpus.</Paragraph>
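    <Paragraph> The summary statistics discussed above (mean and standard deviation of CD per subset, and whether every adjective stays above 0.9) could be computed along these lines. The CD values below are invented for illustration, and the use of the population standard deviation is an assumption, since the text does not say which variant was used.

```python
import statistics

# Hypothetical per-adjective CD values of each subset against the whole
# computing corpus (invented for illustration, not the paper's results).
cd_by_subset = {
    "150K": [0.91, 0.88, 0.97, 0.93],
    "350K": [0.95, 0.93, 0.98, 0.96],
    "600K": [0.97, 0.96, 0.99, 0.98],
}

THRESHOLD = 0.9  # similarity mark used in the text

summary = {}
for name, values in cd_by_subset.items():
    summary[name] = {
        "mean": statistics.mean(values),
        # Population standard deviation (an assumption, see the lead-in).
        "stdev": statistics.pstdev(values),
        # True only if every adjective is above the threshold.
        "all_above": min(values) > THRESHOLD,
    }

for name, stats in summary.items():
    print(name, round(stats["mean"], 3), round(stats["stdev"], 3), stats["all_above"])
```

    A subset can then be considered a stable representation of the whole corpus once all_above holds and the standard deviation is small.</Paragraph>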
  </Section>
</Paper>