File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/abstr/93/j93-2002_abstr.xml
Size: 7,442 bytes
Last Modified: 2025-10-06 13:47:46
<?xml version="1.0" standalone="yes"?> <Paper uid="J93-2002"> <Title>From Grammar to Lexicon: Unsupervised Learning of Lexical Syntax</Title>
<Section position="2" start_page="0" end_page="244" type="abstr"> <SectionTitle> 1. Introduction </SectionTitle>
<Paragraph position="0"> This paper presents a study in the automatic acquisition of lexical syntax from naturally occurring English text. It focuses on discovering the kinds of syntactic phrases that can be used to represent the semantic arguments of particular verbs. For example, want can take an infinitive argument and hope a tensed clause argument, but not vice versa:
(1) a. John wants Mary to be happy.
    b. John hopes that Mary is happy.
    c. *John wants that Mary is happy.
    d. *John hopes Mary to be happy.</Paragraph>
<Paragraph position="8"> This study focuses on the ability of verbs to take arguments represented by infinitives, tensed clauses, and noun phrases serving as both direct and indirect objects. These lexical properties are similar to those that Chomsky (1965) termed subcategorization frames, but to avoid confusion the properties under study here will be referred to as syntactic frames or simply frames.</Paragraph>
<Paragraph position="9"> The general framework for the problems addressed in this paper can be thought of as follows. Imagine a language that is completely unfamiliar; the only means of studying it are an ordinary grammar book and a very large corpus of text (or transcribed speech). No dictionary is available. How can easily recognized, surface grammatical facts be used to extract from a corpus as much syntactic information as possible about individual words? The scenario outlined above is adopted in this paper as a framework for basic research in computational language acquisition. However, it is also an abstraction of the situation faced by engineers building natural language processing (NLP) systems for more familiar languages. The lexicon is a central component of NLP systems, and it is widely agreed that current lexical resources are inadequate. Language engineers have access to some but not all of the grammar, and some but not all of the lexicon. The most easily formalized and most reliable grammatical facts tend to be those involving auxiliaries, modals, and determiners, the agreement and case properties of pronouns, and so on. These vary little from speaker to speaker, topic to topic, register to register. Unfortunately, this information is not sufficient to parse sentences completely, a fact that is underscored by the current state of the parsing art. If sentences cannot be parsed completely and reliably, then the syntactic frames used in them cannot be determined reliably. How, then, can reliable, easily formalized grammatical information be used to extract syntactic facts about words from a corpus? This paper suggests the following approach: Do not try to parse sentences completely. Instead, rely on local morpho-syntactic cues such as the following facts about English: (1) the word following a determiner is unlikely to be functioning as a verb; (2) the sequence "that the" typically indicates the beginning of a clause. Do not try to draw categorical conclusions about a word on the basis of one or a fixed number of examples. Instead, attempt to determine the distribution of exceptions to the expected correspondence between cues and syntactic frames. Use a statistical model to determine whether the cooccurrence of a verb with cues for a frame is too regular to be explained by randomly distributed exceptions.</Paragraph>
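The cue-based strategy can be pictured concretely. The Python fragment below is not Lerner's code; it is a minimal sketch, under an invented cue inventory and a simple tokenized-sentence representation, of how closed-class words immediately after a verb can suggest a syntactic frame without any parsing.

```python
# Minimal sketch of frame-cue detection using only closed-class words
# and word order; the cue inventory is illustrative, not Lerner's.

DETERMINERS = {"the", "a", "an", "this", "that", "these", "those"}
OBJECT_PRONOUNS = {"me", "him", "her", "us", "them", "it"}

def frame_cues(tokens, verb_index):
    """Return the frames (if any) cued by the words right after a putative verb.

    tokens     -- a sentence as a list of lower-cased words
    verb_index -- position of the verb whose frames are being probed
    """
    cues = set()
    rest = tokens[verb_index + 1:]

    # "... verb to ..." suggests an infinitive complement ("wants to leave").
    if rest[:1] == ["to"]:
        cues.add("infinitive")

    # "... verb that the ..." suggests a tensed-clause complement
    # ("hopes that the answer is yes").
    if rest[:2] == ["that", "the"]:
        cues.add("tensed-clause")

    # "... verb <object pronoun> to ..." suggests an NP object followed by
    # an infinitive ("wants him to leave").
    if len(rest) >= 2 and rest[0] in OBJECT_PRONOUNS and rest[1] == "to":
        cues.add("np-infinitive")

    # A determiner right after the verb suggests a direct-object NP, and the
    # word after that determiner is unlikely to be functioning as a verb.
    if rest and rest[0] in DETERMINERS:
        cues.add("direct-object")

    return cues

# Example: the verb is at position 1 in "john wants him to leave".
print(frame_cues("john wants him to leave".split(), 1))   # {'np-infinitive'}
```

Such cues are deliberately shallow; they misfire on some sentences, which is exactly why the approach pairs them with a statistical model rather than drawing conclusions from individual hits.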
<Paragraph position="10"> The effectiveness of this approach for inferring the syntactic frames of verbs is supported by experiments using an implementation called Lerner. In the spirit of the problem stated above, Lerner starts out with no knowledge of content words--it bootstraps from determiners, auxiliaries, modals, prepositions, pronouns, complementizers, co-ordinating conjunctions, and punctuation. Lerner has two independent components corresponding to the two strategies listed above. The first component identifies sentences where a particular verb is likely to be exhibiting a particular syntactic frame. It does this using local cues, such as the "that the" cue. This component keeps track of the number of times each verb appears with cues for each syntactic frame as well as the total number of times each verb occurs. This process can be described as collecting observations and its output as an observations table. A segment of an actual observations table is shown in Table 4. The observations table serves as input to the statistical modeler, which ultimately decides whether the accumulated evidence that a particular verb manifests a particular syntactic frame in the input is reliable enough to warrant a conclusion.</Paragraph>
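To make the division of labor concrete, here is a small Python sketch of an observation collector that fills a table of (verb, frame) cue counts, together with a decision rule over that table. The binomial tail test, the 5% cue error rate, and the threshold used below are illustrative stand-ins, not the paper's model, which is developed in Section 3.

```python
from collections import defaultdict
from math import comb

# Observations table: for each verb, its total frequency and how often it
# co-occurred with a cue for each frame.
observations = defaultdict(lambda: {"total": 0, "cues": defaultdict(int)})

def record(verb, cued_frames):
    """Update the observations table for one occurrence of a verb."""
    observations[verb]["total"] += 1
    for frame in cued_frames:
        observations[verb]["cues"][frame] += 1

def licenses_frame(verb, frame, error_rate=0.05, threshold=0.02):
    """Decide whether the evidence that `verb` takes `frame` is reliable.

    error_rate -- assumed chance that a cue for `frame` fires spuriously on a
                  verb that does not take the frame (a made-up figure here)
    threshold  -- significance level below which the frame is accepted

    If the verb never took the frame, cue hits would be rare accidents; the
    frame is accepted only when the observed number of hits would be too
    surprising under that null hypothesis.
    """
    n = observations[verb]["total"]
    m = observations[verb]["cues"][frame]
    # P(X >= m) for X ~ Binomial(n, error_rate)
    p_tail = sum(comb(n, k) * error_rate**k * (1 - error_rate)**(n - k)
                 for k in range(m, n + 1))
    return p_tail < threshold

# Toy usage: 12 cue hits in 100 occurrences of "want" are hard to explain
# as noise if cues misfire only about 5% of the time.
for _ in range(12):
    record("want", {"infinitive"})
for _ in range(88):
    record("want", set())
print(licenses_frame("want", "infinitive"))   # True under these assumptions
```

The point of the sketch is only the division of labor just described: counting is cheap and local, while the conclusion that a verb licenses a frame is drawn once, from the whole table.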
<Paragraph position="11"> To the best of my knowledge, this is the first attempt to design a system that autonomously learns syntactic frames from naturally occurring text. The goal of learning syntactic frames and the learning framework described above lead to three major differences between the approach reported here and most recent work in learning grammar from text. First, this approach leverages a little a priori grammatical knowledge using statistical inference. Most work on corpora of naturally occurring language either uses no a priori grammatical knowledge (Brill and Marcus 1992; Ellison 1991; Finch and Chater 1992; Pereira and Schabes 1992), or else it relies on a large and complex grammar (Hindle 1990, 1991). One exception is Magerman and Marcus (1991), in which a small grammar is used to aid learning. A second difference is that the work reported here uses inferential rather than descriptive statistics. In other words, it uses statistical methods to infer facts about the language as it exists in the minds of those who produced the corpus. Many other projects have used statistics in a way that summarizes facts about the text but does not draw any explicit conclusions from them (Finch and Chater 1992; Hindle 1990). On the other hand, Hindle (1991) does use inferential statistics, and Brill (1992) recognizes the value of inference, although he does not use inferential statistics per se. Finally, many other projects in machine learning of natural language use input that is annotated in some way, either with part-of-speech tags (Brill 1992; Brill and Marcus 1992; Magerman and Marcus 1990) or with syntactic brackets (Pereira and Schabes 1992).</Paragraph>
<Paragraph position="12"> The remainder of the paper is organized as follows. Section 2 describes the morpho-syntactic cues Lerner uses to collect observations. Section 3 presents the main contribution of this paper--the statistical model and experiments supporting its effectiveness. Finally, Section 4 draws conclusions and lays out a research program in machine learning of natural language.</Paragraph>
</Section> </Paper>