<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-0204">
  <Title>On the use of automatic tools for large scale semantic analyses of causal connectives</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
</SectionTitle>
    <Paragraph position="0"> 4 The connectives under investigation are all so-called backward causal connectives, i.e. they express an underlying causal relation of the type CONSEQUENCE-CAUSE, in which the connective introduces the CAUSE segment.</Paragraph>
    <Paragraph position="1"> a) the position of the connective in the sentence (number and type of words preceding the connective), b) the number, position and order of finite verbs in the segment, c) the presence or absence of punctuation markers, especially commas.</Paragraph>
    <Paragraph position="2"> For example, a sentence beginning with the connective omdat can either be preposed (P-Q) (example 3), or medial (Q-P), if Q and P are given in different sentences (example 4).</Paragraph>
    <Paragraph position="3">  (3) Omdat de verdachte niet eerder was veroordeeld, bleef de gevangenisstraf geheel voorwaardelijk.</Paragraph>
    <Paragraph position="4"> 'Because the suspect had not been convicted before, the sentence was entirely probational.' (4) Maar er zijn meer programma's die de moeite waard zijn en die toch niet worden bekeken. Omdat ze onvindbaar zijn tussen de ramsj.</Paragraph>
    <Paragraph position="5"> 'But there are more [TV] programmes that are worth watching and still are not being watched. Because they are hard to trace among the rubbish.' To extract these segments correctly, a number of rules come into play. For example: a) If CONN = omdat, doordat or aangezien; and b) If CONN is in initial position, look for the first finite verb [vf]; if vf appears in a segment &lt;... vf, vf ...&gt; or &lt;... vf vf ...&gt;, then cut before the second vf; the segment containing CONN is P and the other one is Q.</Paragraph>
    <Paragraph position="6"> c) If CONN is in initial position and there is only one vf, then the segment containing CONN is P and the previous sentence is Q.</Paragraph>
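    <Paragraph> As a rough illustration only, rules (a) to (c) can be sketched in Python as follows. The token representation (a list of (word, tag) pairs in which the tag "vf" marks a finite verb), the function name split_initial_conn and the invented test sentence are assumptions made for this sketch; the source does not describe how the rules were actually implemented, and the check for initial position is left to the caller.

```python
BACKWARD_CONNS = {"omdat", "doordat", "aangezien"}


def split_initial_conn(conn, tokens, previous_sentence=None):
    """Split a sentence with an initial connective into P and Q.

    tokens: list of (word, tag) pairs; the tag "vf" marks a finite verb
    (hypothetical tagging scheme). Returns a dict with keys "P" and "Q",
    or None when none of the sketched rules applies.
    """
    # Rule (a): only omdat, doordat and aangezien are handled here.
    if conn.lower() not in BACKWARD_CONNS:
        return None
    words = [word for word, _ in tokens]
    vf_positions = [i for i, (_, tag) in enumerate(tokens) if tag == "vf"]
    if not vf_positions:
        return None
    if len(vf_positions) == 1:
        # Rule (c): a single finite verb, so the previous sentence is Q.
        return {"P": words, "Q": previous_sentence}
    first, second = vf_positions[0], vf_positions[1]
    between = words[first + 1:second]
    if between == [] or between == [","]:
        # Rule (b): two finite verbs in a row (optionally separated by
        # a comma), so cut before the second finite verb.
        return {"P": words[:second], "Q": words[second:]}
    return None


# Invented example: "Omdat hij ziek is, blijft hij thuis."
sentence = [("Omdat", "x"), ("hij", "x"), ("ziek", "x"), ("is", "vf"),
            (",", "x"), ("blijft", "vf"), ("hij", "x"), ("thuis", "x")]
result = split_initial_conn("omdat", sentence)
print(result["P"], result["Q"])
```

    A sentence such as example (3), in which a participle stands between the two finite verbs, falls outside this sketch and would need one of the other rules.</Paragraph>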
    <Paragraph position="7"> Other rules are used to determine whether the CONN is in initial position or not. In addition to examples (2-3), example (5) also illustrates a case of initial connective, even though a word precedes the connective.</Paragraph>
    <Paragraph position="8"> (5) En omdat in Nederland de voertaal nog steeds het Nederlands is, worden de meeste schoolvakken ook in die taal gedoceerd.</Paragraph>
    <Paragraph position="9"> 'And because Dutch is still the main language in the Netherlands, most subjects are taught in that language.' This resulted in 21 heuristic rules, the adequacy of which was hand-checked on large samples of the data. In the end, 1.4% of the data were lost because one of the segments was missing or because none of the procedures could identify P and Q. Ultimately we were able to identify the causal segments for 14181 sentences. Four syntactic environments can be distinguished, involving a preposed construction &lt;Conn P Q.&gt; as in examples (2, 3, 5) above, and three types of medial constructions: a) &lt;Q conn P.&gt; corresponds to a construction in which Q and P are linked by a connective within the same sentence (example 1); b) &lt;Q. Conn P.&gt; corresponds to constructions in which the previous sentence functions as Q (example 4); and c) &lt;Prev. Q conn P.&gt; corresponds to constructions for which the Q-segment is anaphoric with the preceding sentence, thus requiring this previous sentence for the semantic interpretation, as in example (6), in which the Q "dat komt" (lit. 'that comes') picks up the semantic information from the previous sentence and links it to the P-segment introduced by the connective.</Paragraph>
    <Paragraph position="10"> (6) De Europese economie raakt hopeloos achterop bij de Amerikaanse en Japanse. Dat komt doordat Europa niet meedoet op nieuwe groeimarkten.</Paragraph>
    <Paragraph position="11"> 'The European economy is falling hopelessly behind the American and Japanese economy. This is because Europe is not participating in new growth markets.' Actually, only 7.1% of the sentences investigated belong to the preposed construction type. However, important divergences exist between the connectives: want is never used in preposed position, omdat in 10.41% of the cases, and doordat in 14.32% of the cases, a figure which rises to 43.5% of the cases for aangezien. It is interesting to point out that this is in total agreement with previous small-scale corpus research on this matter.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.2 Lemmatisation and the construction of the LSA semantic space
</SectionTitle>
      <Paragraph position="0"> The first automatic technique that will be presented is Latent Semantic Analysis (LSA), a mathematical technique for extracting a very large "semantic space" from large text corpora on the basis of the statistical analysis of the set of co-occurrences in a text corpus. Landauer et al. (1998) stress that this technique can be viewed from two sides. At a theoretical level, it is meant to be used to develop simulations of the cognitive processes running during language comprehension, including, for instance, a computational model of metaphor treatment (Kintsch, 2000; Lemaire et al., 2001), but also to analyse the coherence of texts (Foltz et al., 1998; Pierard et al., 2004). At a more applied level, it is a technique that makes it possible to infer and represent the meaning of words on the basis of their actual use in text, so that the similarity of meaning between words, sentences or paragraphs can be estimated (Bestgen, 2002; Choi et al., 2001). It is this latter aspect that concerns us here.</Paragraph>
      <Paragraph position="1"> The point of departure of the analysis is a lexical table (Lebart and Salem, 1992) containing the frequencies of every word in each of the documents included in the text material, a document being a text, a paragraph, or a sentence. To derive semantic relations between words from the lexical table, the analysis of mere co-occurrences will not do, the major problem being that even in a large corpus most words are relatively rare. Consequently, the co-occurrences of words are even rarer, which makes them very sensitive to arbitrary variations (Burgess et al., 1998; Kintsch, 2001). LSA resolves this problem by replacing the original frequency table with an approximation that has a kind of smoothing effect on the associations. To this end, the frequency table undergoes a singular value decomposition and is then recomposed on the basis of only a fraction of the information it contains. Thus, the thousands of words from the documents are replaced by linear combinations or 'semantic dimensions' with respect to which the original words can be situated again. Contrary to a classical factor analysis, the extracted dimensions are very numerous and non-interpretable.</Paragraph>
      <Paragraph position="2"> All original words and segments can then be placed into this semantic space. The meaning of each word is represented by a vector, which indicates the exact location of the word in this multidimensional semantic space. To calculate the semantic proximity between two words, the cosine between the two vectors that represent them is calculated. The more two words are semantically similar, the more their vectors point in the same direction, and consequently, the closer their cosine will be to 1 (coinciding vectors). A cosine of 0 shows an absence of similarity, since the corresponding vectors point in orthogonal directions. It is also possible to calculate the similarity between 'higher order' elements, i.e. between sentences, paragraphs, and entire documents, or combinations of those, even if this higher order element is not by itself an analysed element. The vector in question corresponds to the centroid of the words composing the segment under investigation. The centroid results from the weighted sum of the vectors of these words (Deerwester et al., 1990). This makes it possible to calculate the semantic proximity between any two sentences, whether or not they are present in the original corpus, and whether or not the original corpus was segmented into sentence-length documents.</Paragraph>
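      <Paragraph> The cosine and centroid computations described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the three-dimensional toy vectors and the words attached to them are invented (a real LSA space has hundreds of dimensions), the function names are ours, and uniform weights are used when no weighting scheme is given, since the source does not specify one.

```python
import math


def cosine(u, v):
    """Cosine between two vectors of equal length (0.0 if either is null)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def centroid(word_vectors, weights=None):
    """Weighted sum of the word vectors of a segment.

    With weights=None every word gets weight 1.0 (an assumption of
    this sketch, not a choice documented in the source).
    """
    if weights is None:
        weights = [1.0] * len(word_vectors)
    dims = len(word_vectors[0])
    return [sum(w * vec[d] for w, vec in zip(weights, word_vectors))
            for d in range(dims)]


# Invented toy space: one vector per lemma.
space = {
    "economie": [0.9, 0.1, 0.0],
    "markt": [0.8, 0.2, 0.1],
    "straf": [0.0, 0.1, 0.9],
}
segment_q = centroid([space["economie"], space["markt"]])
segment_p = centroid([space["markt"]])
print(cosine(segment_q, segment_p))
print(cosine(space["economie"], space["straf"]))
```

      Because the cosine ignores vector length, using a weighted sum rather than a weighted mean for the centroid does not change the resulting proximity.</Paragraph>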
      <Paragraph position="3"> To perform the LSA analyses, we used the Dutch newspaper corpus to build the semantic space. To this end, the data set, which had been lemmatised with MBLEM (Memory Based Lemmatiser) (Van den Bosch &amp; Daelemans, 1999), was cut into article-length segments. Elimination of all digits, special characters, punctuation marks, and of a list of 222 stopwords (words occurring in "any" context, like determiners, auxiliaries, conjunctions, ...), reduced the total number of words to approximately 6.5 million. For the input lexical table, the documents were articles of minimally 24 words and maximally 523 words, i.e. all articles minus the 10% shortest and minus the 10% longest ones. As to the words, we kept all those that occurred at least ten times in the data set.</Paragraph>
      <Paragraph position="4"> Overall this resulted in a matrix of 36630 terms in 28640 documents. To build the semantic space proper, the singular value decomposition was performed with the program SVDPACKC (Berry, 1992; Berry et al., 1993), and the first 300 singular vectors were retained. In the present research we will use this technique to evaluate the semantic proximity between P and Q, and between the causal segments and the prior or subsequent sentences.</Paragraph>
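      <Paragraph> The filtering steps that lead to the input lexical table can be illustrated with the following sketch. It is not the pipeline used in the study: the function name, the order of the steps (token clean-up first, then document-length filtering, then the frequency threshold) and the use of str.isalpha() as a stand-in for the removal of digits, special characters and punctuation are all assumptions. The default thresholds are the ones reported above; the singular value decomposition itself is not shown.

```python
from collections import Counter


def build_lexical_table(documents, stopwords, min_len=24, max_len=523,
                        min_freq=10):
    """Build a term-by-document frequency table.

    documents: list of documents, each a list of lemmas.
    Returns (terms, table), where table[i][j] is the frequency of
    terms[i] in the j-th retained document.
    """
    # Drop stopwords and every token that is not purely alphabetic.
    cleaned = [[w for w in doc if w.isalpha() and w not in stopwords]
               for doc in documents]
    # Keep only documents whose length lies between min_len and max_len.
    kept = [doc for doc in cleaned if len(doc) in range(min_len, max_len + 1)]
    # Keep only words that reach the minimum overall frequency.
    totals = Counter(w for doc in kept for w in doc)
    terms = sorted(w for w, count in totals.items() if count >= min_freq)
    counts = [Counter(doc) for doc in kept]
    table = [[c[t] for c in counts] for t in terms]
    return terms, table


# Invented miniature corpus with lowered thresholds, for illustration.
docs = [["kat", "de", "hond", "kat", "3"],
        ["hond", "de"],
        ["kat", "hond", "vis", "kat", "hond", "kat"]]
terms, table = build_lexical_table(docs, {"de"}, min_len=2, max_len=4,
                                   min_freq=2)
print(terms, table)
```

      With the thresholds reported above, a table of this kind would have one row per retained term and one column per retained article.</Paragraph>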
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.3 Dictionaries and lexical categorisation
</SectionTitle>
      <Paragraph position="0"> The second technique used to test the linguistic hypotheses is alternatively called 'word count strategy' (Pennebaker et al., 2003), automatic identification of linguistic features (Biber, 1988) or thematic text analysis (Popping, 2000; Stone, 1997), the aim of which is to determine whether some categories of words (e.g., words of opinion, fact, attitude, etc.) or some grammatical categories (e.g. personal pronouns) occur more often in a given type of text segment. The first step in this kind of analysis is to build a dictionary that contains the categories to be investigated and the corresponding (lemmatised) lexical entries that signal their occurrence. The categories may correspond to grammatical classes, but also to thematic word grouping. The following step consists in searching all the text segments containing these lexical entries in order to account for the frequency of each category in each text segment. These data are put into a matrix that has one row for each text segment and one column for each category, each cell containing the frequency of the respective category in the respective text segment. Finally, this matrix is analysed to determine whether some categories occur more often in a given type of text segment.</Paragraph>
      <Paragraph position="1"> To illustrate this technique, let us assume that we want to test the hypothesis that (nominative) personal pronouns occur more frequently in text segments connected by want than in segments connected by the other backward causal connectives. In the first step the "Personal-Pronoun" dictionary is built, containing the corresponding lexical entries: ik, jij, je, hij, zij, ze, wij, we, jullie, u. All the text segments containing these lexical entries are then searched in order to account for the frequency of the concept "Personal-Pronoun" in each text segment. These data are put into a matrix which is analysed to determine whether the concept "Personal-Pronoun" occurs more often with want-segments than with the other causal segments.</Paragraph>
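      <Paragraph> The counting step of this example can be sketched as follows. The pronoun list is the one given above; the function name, the data layout (each segment as a list of lemmas) and the two sample segments are assumptions made for illustration.

```python
PERSONAL_PRONOUNS = {"ik", "jij", "je", "hij", "zij", "ze", "wij", "we",
                     "jullie", "u"}


def category_matrix(segments, dictionary):
    """Count dictionary categories per text segment.

    segments: list of segments, each a list of lemmas.
    dictionary: maps a category name to its set of lexical entries.
    Returns (categories, matrix) with one row per segment and one
    column per category, in alphabetical order of category name.
    """
    categories = sorted(dictionary)
    matrix = [[sum(1 for word in segment if word.lower() in dictionary[cat])
               for cat in categories]
              for segment in segments]
    return categories, matrix


# Invented sample segments.
segments = [["ik", "denk", "dat", "hij", "komt"],
            ["de", "economie", "groeit"]]
categories, matrix = category_matrix(
    segments, {"Personal-Pronoun": PERSONAL_PRONOUNS})
print(categories, matrix)
```

      Adding further categories only requires adding entries to the dictionary argument; the matrix then gains one column per category.</Paragraph>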
      <Paragraph position="2"> The two main difficulties we are confronted with when using this technique in the present studies are (i) the reduced size of the analysed text segments (one sentence or even less), and (ii) the difficulty, or even impossibility, of building an exhaustive list of words belonging to a category like fact, opinion, attitude, etc. With respect to the first difficulty, we believe that the reduced size of the segments will be compensated for by the large number of segments of each type being analysed. The second difficulty is addressed below, where we propose a number of ways to extend the category lists automatically.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Perspective shift
</SectionTitle>
      <Paragraph position="0"> There are a number of claims in the literature that some connectives co-occur with perspective shifts between the causal segments, while others do not.</Paragraph>
      <Paragraph position="1"> Perspectivisation accounts for the fact that there are more sources of information than the speaker alone. In relation to our connectives, perspectivisation has been claimed to play a role in the meaning differences between want (introducing a perspective shift) and omdat (no perspective shift). However, the various corpus studies on this matter have not unequivocally confirmed this hypothesis (Degand, 2001; Oversteegen, 1997). We would like to explore this matter further by comparing the semantic tightness of the segments related by our connectives. This will be done by calculating the semantic proximity between Q and P for each of the connectives. Our hypotheses are as follows: Hypothesis 1: The cosine between Q and P related by monophonic connectives (omdat) should be higher than the cosine between Q and P related by polyphonic connectives (want).</Paragraph>
      <Paragraph position="2"> Hypothesis 2: The cosine between the prior sentence and the subsequent sentence should be higher for monophonic connectives than for polyphonic connectives.</Paragraph>
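      <Paragraph position="3"> [Table: cosine between Q and P (Cos. Q &amp; P) and cosine between the prior and subsequent sentences (Cos. Prior Subsequent) per connective, from the LSA-analysis; cell values not recoverable.] Two ANOVAs were performed.</Paragraph>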
      <Paragraph position="4"> The first one had the connectives as independent variable and the semantic proximity between the causal segments as dependent variable. It shows that hypothesis 1 is borne out (F(3, 10505) = 11.36, p &lt; 0.0001): the causal segments related by the (monophonic) connective omdat are semantically closer than the segments related by the (polyphonic) connective want. The results furthermore show that doordat and aangezien should be described in terms of monophonic connectives. The second ANOVA, with the connectives as independent variable and the semantic proximity between the prior and subsequent sentences as dependent variable, confirms hypothesis 2 (F(3, 10505) = 25.75, p &lt; 0.0001): the monophonic connectives aangezien, doordat and omdat go along with topic continuity (or at least semantic proximity) between the prior and subsequent sentence to the causal construction, while this is less the case for the connective want.</Paragraph>
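      <Paragraph> For readers unfamiliar with the test, the F statistic of a one-way ANOVA of this kind can be computed as in the sketch below. The data are invented toy values and the function name is ours; the cosine values underlying the reported statistics are not available in the source. The reported degrees of freedom, F(3, 10505), are those of a one-way design with four groups (the four connectives) and 10509 observations.

```python
def one_way_anova_f(groups):
    """One-way ANOVA on a list of groups of observations.

    Returns (F, df_between, df_within).
    """
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2
                     for g in groups)
    # Variation of the observations around their own group mean.
    ss_within = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    f_value = (ss_between / df_between) / (ss_within / df_within)
    return f_value, df_between, df_within


# Invented toy data: one list of cosine-like scores per connective.
toy_groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
print(one_way_anova_f(toy_groups))
```

      The sketch stops at the F value; obtaining the associated p-value requires the F distribution, which is outside the Python standard library.</Paragraph>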
      <Paragraph position="5"> To confirm that these results are indeed related to the issue of perspectivisation, this LSA-analysis was completed with a thematic text analysis to test for the presence vs. absence of perspective indicators. To this end we built a "Perspective" dictionary of perspective-indicating elements (Spooren, 1989) such as intensifiers, emphasisers, attitudinal nouns and adjuncts, etc. (Caenepeel, 1989). The dictionary was composed of two subcategories: a) communication markers, like (non-ambiguous) verbs and adverbs of saying and thinking, e.g. report, tell, confirm, require, according to, ...</Paragraph>
      <Paragraph position="6"> b) markers of the speaker's attitude, like linguistic elements expressing an expectation or a denial of expectation, intensifiers and attitudinals, and evaluative words, e.g.</Paragraph>
      <Paragraph position="7"> probably, must, horrible, fantastic, ...</Paragraph>
      <Paragraph position="8"> To build the dictionary, we used a Dutch thesaurus (Brouwers, 1997) and extracted all (unambiguous) lemmas corresponding to one of the above-mentioned categories. Multi-word expressions and separable verbs were not included in the lists. The lists were compiled on the basis of the judgements of two native speakers with a good knowledge of the literature on perspectivisation.</Paragraph>
      <Paragraph position="9"> The idea of the thematic text analysis was to confirm that the break in semantic tightness occurring with want-segments, as revealed by the LSA-analysis, could indeed be interpreted in terms of a perspective shift. We would therefore expect the causal segments related by the connective want to show diverging perspectivisation patterns, and expect this not to be the case for the segments related by omdat, doordat and aangezien. This is reformulated in hypothesis 3.</Paragraph>
      <Paragraph position="10"> Hypothesis 3: If the causal segments are related by the connective want, the Q-segment contains perspective signals, the P-segment does not. The causal segments related by the connectives omdat, doordat, aangezien do not present such a shift.</Paragraph>
      <Paragraph position="11"> The results displayed in Table 3 show that the hypothesis is borne out for the subcategory of attitudinal markers: want-segments display a higher number of attitudinal markers in Q than in P (F(1, 5588) = 26.84, p &lt; 0.0001). For the other connectives this is not the case. For the communication markers, the hypothesis is not borne out. Actually, only omdat displays a higher number of communication markers in Q (F(1, 6746) = 6.53, p &lt; 0.01). While this latter result might seem counter to expectation, it actually goes in the direction of prior observations that omdat-relations frequently display the explicit introduction of speech acts (Degand, 2001; Pit, 2003).</Paragraph>
      <Paragraph position="12"> Taken together, these results offer interesting new insights into the discourse environment of (Dutch) causal connectives. On the one hand, we have shown with the LSA analysis that the proximity between Q and P is lower for want-relations than for the other connectives and that this is also the case for the semantic proximity between the sentences prior and subsequent to the causal relations. We therefore concluded that the connective want is a marker of thematic shift. On the other hand, the thematic text analysis revealed that the Q-segments in want-relations display a higher number of attitudinal markers. In our view, the presence of these markers leads to the conclusion that the connective want is indeed a marker of perspective shift, i.e.</Paragraph>
      <Paragraph position="13"> the break in semantic tightness should be interpreted as a perspective break, as has often been suggested in the literature. Furthermore, the additional results for want (absence of communication markers in Q) also suggest that markers expressing the speaker's attitude should be clearly distinguished from those that make the speaker's speech act explicit (verbs of saying) or designate him/her explicitly as the source of the speech act (adverbs like aldus, volgens, ... 'according to').</Paragraph>
      <Paragraph position="14"> The polyphony/monophony distinction overlaps with the coordination/subordination distinction between want and the other connectives. The question arises which of these two factors is responsible for the results obtained. One route to follow is to compare our results with a language like English, in which the same connective (because) has both monophonic and polyphonic uses, or with a language like French, where a polyphonic connective like puisque is subordinating. The latter topic is the object of ongoing research.</Paragraph>
    </Section>
  </Section>
</Paper>