<?xml version="1.0" standalone="yes"?> <Paper uid="W04-0204"> <Title>On the use of automatic tools for large scale semantic analyses of causal connectives</Title> <Section position="3" start_page="0" end_page="0" type="intro"> <SectionTitle> 2 Techniques and Tools </SectionTitle> <Paragraph position="0"> The techniques used have to fulfil two tasks: they must extract the relevant linguistic material from the corpus, that is, the four connectives with their contexts of use, and they must support the analysis of the retrieved elements in order to test a number of linguistic hypotheses concerning the meaning and use of these connectives. Our main objective is to show that, with these techniques, fairly straightforward annotation tools suffice to perform in-depth semantic analyses on large amounts of quantitative data.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 POS-Tagging and the identification of the causal segments </SectionTitle> <Paragraph position="0"> The extraction of the relevant linguistic material was carried out by means of automatic syntactic analysis techniques. As a basis for our analyses we worked with the first six months of a Dutch newspaper corpus of more than 30 million words (footnote 2). This material was POS-tagged using MBT (Memory Based Tagger) (Daelemans et al., 1996). We then discarded the items with few content words: sports results, television programs, crosswords and puzzles, stock exchange reports, service information from the newspaper editor, etc. We also cleaned the corpus of irregularities caused by incompatibilities between the source files and the tagging program (mostly nonsense words generated by the program). This eventually led to a data set of approximately 16,500,000 words.</Paragraph> <Paragraph position="1"> First, POS-tagging made it possible to segment the corpus into sentences and to label the words grammatically. 
Second, POS-tagging allowed us to locate and extract the connectives from the sentences in which they occurred. Concretely, we extracted all sentence-length segments on the basis of the tag &lt;UT&gt; ('utterance'). We then searched for the four connectives tagged as &lt;conj&gt; by the parser.</Paragraph> <Paragraph position="2"> Table 1 displays the frequencies of the retrieved connectives. These figures do not include a number of sentences that were eliminated because they were potentially problematic for the analysis. This was, for instance, the case for sentences containing more than one connective from our list of four (footnote 3). Footnote 2: We used the year 1997 of &quot;De Volkskrant&quot;, a Dutch national daily newspaper. The corpus is distributed on CD-ROM.</Paragraph> <Paragraph position="3"> Footnote 3: These cases were eliminated so that the exact influence of the connective and the exact contribution of the context could be determined. The extracted sentences were then analysed by means of a series of heuristics to identify the CAUSE (P) and CONSEQUENCE (Q) segments (footnote 4). From a syntactic point of view, the connectives doordat, omdat and aangezien can occur in two basic types of causal constructions: medial ones (Q CONNECTIVE P), see example (1), and preposed ones (CONNECTIVE P, Q), see example (2). The connective want only appears in medial constructions.</Paragraph> <Paragraph position="4"> (1) Een gezamenlijk beleid is nodig omdat in het najaar in het Japanse Kyoto wereldwijd wordt onderhandeld over het klimaat.</Paragraph> <Paragraph position="5"> 'A common policy is necessary because worldwide negotiations on the climate will take place in the autumn in the Japanese city of Kyoto.' (2) 'Iedere strenge winter heeft gevolgen voor de kerkorgels', zegt dr. A.J.</Paragraph> <Paragraph position="6"> Gierveld van de Gereformeerde Organistenvereniging. 
Doordat het hout krimpt, kunnen er kieren ontstaan waardoor lucht ontsnapt.</Paragraph> <Paragraph position="7"> '&quot;Every hard winter has consequences for the church organs&quot;, says Dr. A.J. Gierveld of the Reformed Organists Union. Because the wood shrinks, cracks may appear, through which air escapes.' The heuristics to identify the CAUSE (P) and CONSEQUENCE (Q) segments were primarily based on</Paragraph> </Section> </Section> </Paper>
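<!--
Illustrative note on Section 2.1. The extraction step described there (segment the tagged corpus into sentences, search for the four connectives tagged as conjunctions, and drop sentences with more than one connective) can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: it assumes sentences are already available as lists of (word, tag) pairs, that conjunctions carry a tag named conj (the actual MBT tag set may differ), and that any sentence with more than one hit from the list of four is discarded. The medial/preposed label is a deliberate simplification that only checks whether the connective opens the sentence. All function and variable names are hypothetical.

```python
# The four Dutch causal connectives studied in the paper.
CONNECTIVES = {"doordat", "omdat", "aangezien", "want"}

# Assumed tag name for conjunctions; the real tagger output may use another label.
CONJ_TAG = "conj"


def find_connectives(sentence):
    """Return the positions of tokens that are one of the four connectives
    and are tagged as conjunctions. A sentence is a list of (word, tag) pairs."""
    return [
        i
        for i, (word, tag) in enumerate(sentence)
        if word.lower() in CONNECTIVES and tag == CONJ_TAG
    ]


def extract_connective_sentences(sentences):
    """Keep only sentences with exactly one connective hit.

    Returns a list of (connective, construction, sentence) triples, where
    construction is 'preposed' if the connective is the first token and
    'medial' otherwise (a simplification made for this sketch).
    """
    kept = []
    for sentence in sentences:
        hits = find_connectives(sentence)
        if len(hits) != 1:
            # No connective, or more than one: excluded from the data set.
            continue
        idx = hits[0]
        construction = "preposed" if idx == 0 else "medial"
        kept.append((sentence[idx][0].lower(), construction, sentence))
    return kept


if __name__ == "__main__":
    sample = [
        [("Een", "det"), ("beleid", "n"), ("is", "v"), ("nodig", "adj"),
         ("omdat", "conj"), ("wordt", "v"), ("onderhandeld", "v")],
        [("Doordat", "conj"), ("het", "det"), ("hout", "n"), ("krimpt", "v")],
    ]
    for connective, construction, _ in extract_connective_sentences(sample):
        print(connective, construction)
```

Checking the tag as well as the word form matters for want, which can also be a noun in Dutch; only occurrences tagged as conjunctions are counted here.
-->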