File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/abstr/93/j93-3004_abstr.xml
Size: 4,176 bytes
Last Modified: 2025-10-06 13:47:54
<?xml version="1.0" standalone="yes"?> <Paper uid="J93-3004"> <Title>Squibs and Discussions Co-occurrence Patterns among Collocations: A Tool for Corpus-Based Lexical Knowledge Acquisition</Title> <Section position="1" start_page="0" end_page="0" type="abstr"> <SectionTitle> 1. Introduction </SectionTitle> <Paragraph position="0"> One of the main problems for applied natural language processing is gaps in the lexicon, including missing words and word senses, and inadequate descriptions of word use in context. Traditional lexicography has similar concerns. The availability of large, on-line text corpora provides a straightforward tool for enlarging the stock of words included in a lexicon. The identification of additional word senses and uses is more problematic, however.</Paragraph> <Paragraph position="1"> Much recent lexicographic work employs concordances generated from text corpora for these purposes. While this approach provides a more solid empirical basis than traditional lexicographic approaches (which depend on the manual collection and sorting of citation index cards), concordances can actually provide too much data. For example, a concordance for the word certain produced on an 11.6 million-word subsample of the Longman/Lancaster Corpus generated 3,424 entries; a concordance for the word right from the same subcorpus generated 7,619 entries. Simply determining the number of different senses in a database of this size is a daunting task; to accurately group different uses or rank them in order of importance is not really feasible without the use of additional tools.</Paragraph> <Paragraph position="2"> One such tool is to simply sort concordance lines according to their different collocational patterns. Entries can be sorted according to their collocates on both the left and the right. 
Many of these collocational pairs show a strong relation to a particular word sense (e.g., contrast right ear and right away), and thus analysis of collocational relations has become an important tool for lexical knowledge acquisition (see Sinclair 1991; Smadja 1991; Zernik 1991).</Paragraph> <Paragraph position="3"> In addition, there are statistical tools that can help determine the relative strength of collocational relations. For example, Church and Hanks (1990) describe the use of the mutual information index for this purpose (cf. Calzolari and Bindi 1990). Church et al. (1991) further describe the use of t-scores to assess the extent of the differences between the collocational patterns of nearly synonymous words. These tools are important in that the strongest collocational associations often represent different word senses, and thus 'they provide a powerful set of suggestions to the lexicographer for what needs to be accounted for in choosing a set of semantic tags' (Church and Hanks 1990, p. 28).</Paragraph> <Paragraph position="4"> However, such tools do not directly characterize word senses or even provide any direct indication of the number of different senses that a word has. [1] [Author affiliation: Dept. of English, Northern Arizona University, P.O. Box 6032, Flagstaff, AZ 86011-6032; biber@nauvax.ucc.nau.edu.] [Note 1: Church and Hanks (1990; Church et al. 1991) thus emphasize the importance of human judgment used in conjunction with these tools.] [Running head: Computational Linguistics Volume 19, Number 3]</Paragraph> <Paragraph position="5"> Further, these tools are not designed to assess the relations among various collocations, that is, to address the question of which clusters of collocations reflect similar underlying senses. [2] The present paper discusses the use of factor analysis (a multivariate statistical technique) as a tool for such research questions. 
In particular, this technique contributes three types of information not provided by other complementary techniques: 1) an indication of the number of major senses and/or uses associated with a word; 2) an indication of the way that various collocational patterns relate to one another in marking word senses and uses; and 3) a fuller analysis of the senses themselves, based on interpretation of the shared bases underlying the groupings of collocations.</Paragraph> </Section> </Paper>