<?xml version="1.0" standalone="yes"?> <Paper uid="W98-1116"> <Title>What grammars tell us about corpora: the case of reduced relative clauses</Title> <Section position="4" start_page="0" end_page="134" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> Most linguistic work until the 1950s studied language use, which required attention to detail and exceptions, and led to the development of data-driven theories and to the use of corpora to model naturally occurring language. Later on, linguists mostly studied grammars, which focussed on generalities and regularities, and led to the formulation of strong theories and to the study of similarity across languages. Some of the current &quot;empirical&quot; approaches integrate the corpus-based lessons with the depth of insight that the study of grammar has brought to the study of language.</Paragraph> <Paragraph position="2"> Empirically-induced models that learn a linguistically meaningful grammar (Collins, 1997) seem to give the best practical results in statistical natural language processing. One of the reasons why these models perform so well compared to probabilistic context-free grammars is that they incorporate detailed lexical knowledge at all points in the derivation (Charniak, 1997). At the same time, they perform better than string-based approaches because they retain structural knowledge, such as phrase structure, subcategorization and long-distance dependencies. They are thus equally capable of modelling both fine lexical idiosyncrasies and more general syntactic regularities.</Paragraph> <Paragraph position="3"> Given an annotated training corpus, such methods learn its distributions (the lexical cooccurrences); this requires that the space of events in the model--that is, the grammar--be specified accurately enough that they can parse new instances of the same corpus. 
The success of such models suggests that a statistical model must have access to the appropriate linguistic features to make accurate predictions. We might then ask: what happens if what one wants to do with annotated text is not to annotate more text, but to perform some other task? Are the same insights valid, so that annotated text can be used to help in other tasks, for instance generation or translation? Can we use annotated text to investigate properties of language(s) systematically? In other words, can we use annotated text as a repository of information? The answer is a qualified yes.</Paragraph> <Paragraph position="4"> In this paper we look at one type of information that is plentifully present in a corpus--syntactic preferences--and we argue that corpora can be very useful even for tasks that do not involve parsing directly, but that making corpora useful for other tasks might require more a priori information than expected. More precisely, we ask the following question: are the percentages of occurrence of linguistically defined units in a large corpus in accord with what is known about preferences for these units collected in other ways, such as unedited sentence production, experimental findings, or intuitive native speakers' judgments? This question is relevant because there is evidence in the literature of human parsing preferences that is in apparent disagreement with predictions of preferences derived from frequencies in a corpus (Brysbaert et al., 1998). Besides the interest in modelling human performance (which is, however, not the focus of the current paper), it is important to investigate the sources of this disagreement between production preference data (frequencies in a text) and perception data (parsing preferences by humans), if the plentiful information stored in text is to be used successfully. 
Distributional properties of texts, if understood, can be used to approximate the resolution of ambiguity in several tasks that involve deeper natural language understanding: a generation system can use distributional properties to reproduce users' preference data; automatic translation can use monolingual distributions to model cross-linguistic variation accurately; and automatic lexical acquisition can use distributional properties of text to bootstrap a process of organisation of lexical information.</Paragraph> <Paragraph position="5"> The method we use to address the question is as follows. We present a large in-depth corpus-based case study (65 million words of WSJ) to investigate (1) how the structure of a grammar is reflected in a corpus, and (2) how probability functions defined according to a grammar fit native speakers' linguistic behaviour in syntactic disambiguation. We look at the well-known case of the ambiguity between a main clause and a reduced relative construction (as in the classic garden-path sentence &quot;The horse raced past the barn fell&quot;), which arises because regular verbs in English present an ambiguity between the simple past and the past participle (the -ed form). We measure the probability distributions of several linguistic features (transitivity, tense, voice) over a sample of optionally intransitive verbs. We do this by hypothesizing and testing several probability functions over the sample. In agreement with recent results on parsing with lexicalised probabilistic grammars (Collins, 1997; Srinivas, 1997; Charniak, 1997), our main result is that statistics over lexical features correspond best to independently established human intuitive preferences and experimental findings.</Paragraph> <Paragraph position="6"> We discuss several consequences. Methodologically, this result casts light on the relationship between different ways of collecting preference information. It shows that some apparently contradictory results that have been discussed in the literature can be reconciled. 
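As an illustration of the kind of contrast being tested here, the following minimal sketch (not from the paper; the verbs, feature values, and counts are invented for illustration) compares a lexically conditioned maximum-likelihood estimate of a feature such as voice with an estimate pooled over all verbs:

```python
# A minimal sketch of the contrast between lexical and non-lexical
# probability functions: maximum-likelihood estimates of a feature
# (here, voice) computed per verb versus pooled over all verbs.
# All counts below are hypothetical, not drawn from the WSJ study.

from collections import Counter

# Hypothetical (verb, voice) counts extracted from a tagged corpus.
counts = Counter({
    ("select", "passive"): 80, ("select", "active"): 20,
    ("race",   "passive"):  2, ("race",   "active"): 98,
})

def p_feature_given_verb(verb, feature):
    """MLE of P(feature | verb), conditioning on the lexical item."""
    total = sum(n for (v, f), n in counts.items() if v == verb)
    return counts[(verb, feature)] / total if total else 0.0

def p_feature(feature):
    """Pooled MLE of P(feature), ignoring the identity of the verb."""
    total = sum(counts.values())
    return sum(n for (v, f), n in counts.items() if f == feature) / total

# The lexical estimate separates the two verbs sharply, while the
# pooled estimate assigns both the same preference.
print(p_feature_given_verb("select", "passive"))  # 0.8
print(p_feature_given_verb("race", "passive"))    # 0.02
print(p_feature("passive"))                       # 0.41
```

On such invented counts, a disambiguation preference stated per verb diverges sharply from one stated over the whole corpus, which is the level-of-specificity issue the discussion turns on.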
The crucial factor is the level of specificity one looks at. Theoretically, not all lexical features are equally good predictors of linguistic behaviour, and they vary in their ability to correctly classify linguistic phenomena. Finally, from the point of view of language engineering, this result provides a strong indication of which units might port better across tasks, and which features would be most useful in a syntactically annotated corpus.</Paragraph> </Section></Paper>