<?xml version="1.0" standalone="yes"?> <Paper uid="E89-1035"> <Title>THE SYNTACTIC REGULARITY OF ENGLISH NOUN PHRASES</Title> <Section position="3" start_page="0" end_page="0" type="intro"> <SectionTitle> INTRODUCTION </SectionTitle> <Paragraph position="0"> In this paper, we present the results of an analysis of just over 10,000 English noun phrases (NPs) extracted from the Lancaster Oslo/Bergen (LOB) corpus treebank (Sampson, 1987b), a syntactically analysed 50,000 word subset of the 1 million word LOB corpus. The motivation for this research is twofold. Firstly, we wish to use this substantial data-base of naturally occurring constructions to test the accuracy and adequacy of a (purportedly) wide-coverage sentence grammar (Grover et al., 1987, 1989) which has been developed over the past three years as part of a general-purpose morphological and syntactic analyser for English (hereafter the Alvey Natural Language Tools (ANLT) grammar). The research reported here forms part of an ongoing project to evaluate the complete grammar using data extracted from the LOB corpus (see Briscoe et al., 1987a). Secondly, Sampson (1987a) has analysed a large subset of the same NPs and argued that they provide evidence against any clear-cut distinction between grammatical and 'deviant' sentences in natural language.</Paragraph> <Paragraph position="1"> Sampson suggests that the lack of such a distinction precludes the possibility of successful automated natural language processing (NLP) using a generative grammar.</Paragraph> <Paragraph position="2"> If correct, this conclusion would have profound implications for our own work and the majority of other work in NLP (since the ANLT grammar is a type of generative grammar). Therefore, we wished to assess the evidence which Sampson uses to support his conclusion.</Paragraph> <Paragraph position="3"> The LOB treebank is a manually analysed set of sentences drawn from the lexically analysed and tagged LOB corpus. 
An analysis consists of a labelled bracketing containing lexical syntactic tags and phrasal or clausal 'hypertags'. Sampson (1987a:221) reports that there are 47 tags and hypertags relevant to the analysis of NPs - 28 lexical tags, 14 hypertags and 5 punctuation tags. Analyses are assigned to sentences according to the intuitions of the linguist guided by a 'casebook' of precedents (Sampson, 1987b). One important feature of these analyses is that the resulting tree structures are quite 'shallow' in the sense that there are rarely intervening nodes between the topmost node marked NP and the lexical tags themselves. Whilst most NP postmodifiers are treated as independent constituents, NP premodifiers are largely analysed as immediate daughters of the topmost NP node. In addition, punctuation tags are usually attached as immediate daughters of this node.</Paragraph> <Paragraph position="4"> A second significant feature of the LOB treebank analysis scheme is that tags and hypertags are atomic symbols (albeit with mnemonic names designed to indicate aspects of their featural composition).</Paragraph> <Paragraph position="5"> Sampson (1987a:221) treats these 47 tags and hypertags as defining the types of distinct NP: &quot;two or more noun phrases are regarded as tokens of the same type if their respective immediate constituents (ICs) represent the same sequence of possibilities drawn from this 47-member set of constituent-types&quot;. The example he gives of an NP type is DT* *S , F which would be the analysis assigned to an NP consisting of a determiner, plural noun, comma and finite clause. 
In this example, Sampson has generalised across sets of atomic tags through the use of 'wildcard' symbols, so DT* generalises across DTI, DT$, DTS, DTX, and so forth.</Paragraph> <Paragraph position="6"> He does not explain the extent to which he has generalised types in this fashion; however, since (hyper)tags contain at most four letters representing distinct features there are strict limits on featural decomposition within this framework of analysis.</Paragraph> <Paragraph position="7"> Sampson found that the 8328 NP tokens in his sample fell into 747 distinct NP types (relative to the notion of type just described). However, the crucial point of his argument is that the distribution of tokens amongst types is very wide. Sampson finds that there are a few very common types (such as 1135 tokens of DT* N* i.e.</Paragraph> <Paragraph position="8"> determiner followed by noun) and a large number of distinct types with very few tokens (such as 468 types represented by a single token). Sampson examines the shape of the constituent type/token curve which results from analysing each type frequency relative to the most frequent type in the corpus. 
Sampson (1987a:225) concludes that this analysis provides &quot;no evidence at all of a two-way partition of noun phrase types into a group of high-frequency, well-formed constructions and a group of unique or rare 'deviant' constructions; instead noun phrase types in the sample appear to be scattered continuously across the frequency spectrum.&quot; Furthermore, he suggests that the evidence from NPs supports his claim that &quot;the range of constructions occurring in authentic texts seems so endlessly diverse that the enterprise of formulating watertight generative grammars appears doomed to failure&quot; (1987b:219).</Paragraph> <Paragraph position="9"> The last step in Sampson's argument from the distribution of tokens amongst NP types to the failure of the generative paradigm is not made completely explicit.</Paragraph> <Paragraph position="10"> However, we believe that a legitimate way of reconstructing it is as follows. Suppose that we convert each NP type as defined above into a phrase-structure rule of a generative grammar (so DT* *S , F becomes NP -> DT* *S , F and so forth). Now consider the form that such a grammar will take: there will be a small number of quite general rules which will be used frequently and a very large number of particular rules used very infrequently. Crucially, for any corpus considered, many of the particular rules will be motivated by just one token in the data. Thus, these rules are not rules in any genuine sense since they express no generalisations over the data. 
Furthermore, this suggests that the task of the generative linguist (in search of watertight grammars) will never be complete because each new set of data will bring with it the need for further highly idiosyncratic 'rules' of this kind.</Paragraph> <Paragraph position="11"> Whilst it seems likely that &quot;all grammars leak&quot; slightly, one clear problem with Sampson's argument is that his evidence only bears on one particular and implausible generative grammar, rather than on the paradigm as a whole. It may well be that the generalisations which can be expressed in terms of a phrase-structure grammar employing a finite set of (nearly) atomic categories are not those appropriate to elegant description of natural language syntax (Chomsky, 1957; Gazdar et al., 1985). In addition, the strategy of adopting 'shallow' analyses in which each phrase-structure rule will have many daughter categories will tend to reduce the applicability of each rule. In these respects, the ANLT grammar is a more conventional generative grammar, based on recent monostratal approaches to syntactic description. Syntactic categories are feature complexes and unification is employed as the method of grammatical combination. Syntactic generalisations are expressed in terms of partially specified immediate dominance rules, linear precedence rules and a variety of metagrammatical statements concerning feature defaults, propagation, optional pre/postmodification, and so forth. In addition, the particular analysis of NPs adopted recognises a number of intermediate nominal categories (such as N-bar), as well as recursion within these categories, and this ensures that most individual rules mention fewer daughters than would be typical in the analysis used in the description of the LOB treebank. For these reasons, we felt that a fairer test of Sampson's claims would be to evaluate the same corpus of NPs with respect to the ANLT grammar. 
In addition, this exercise would provide valuable information concerning the real adequacy of the account of English NPs incorporated into this grammar.</Paragraph> </Section> </Paper>