<?xml version="1.0" standalone="yes"?>
<Paper uid="P83-1001">
  <Title>CONTEXT-FREENESS AND THE COMPUTER PROCESSING OF HUMAN LANGUAGES</Title>
  <Section position="3" start_page="0" end_page="0" type="intro">
    <SectionTitle>
0. INTRODUCTION
</SectionTitle>
    <Paragraph position="0"> Many computationally inclined linguists appear to think that in order to achieve adequate gr~----rs for human languages we need a hit more power than is offered by context-free phrase structure grammars (CF-PSG's), though not a whole lot more. In this paper, I am concerned with the defense of a more conservative view: that even CF-PSG's should be regarded as too powerful, in three computationally relevant respects: weak generative capacity, strong generative capacity, and time complexity of recognition. All three of these matters should be of concern to theoretical linguists; the study of what mathematically definable classes human languages fall into does not exhaust scientific linguistics, hut it can hardly he claimed to he irrelevant to it. And it should be obvious that all three issues also have some payoff in terms of certain computationally interesting, if rather indirect, implications.</Paragraph>
    <Paragraph position="1"> I. WEAK GENERATIVE CAPACITY Weak generative capacity (WGC) results are held by some linguists (e.g. Chomsky (1981)) to be unimportant. Nonetheless, they cannot be ignored by linguists who are interested in setting their work in a context of (even potential) computational implementation (which, of course, some linguists are not). To paraphrase Montague, we might say that linguistically (as opposed to psycholinguistically) there is no important theoretical difference between natural languages and high-level programming languages. Mediating programs (e.g. a compiler or interpreter), of considerable complexity, will be needed for the interpretation of computer input in either Prolog or Japanese. In the latter case the level of complexity will be much higher, but the assumption is that we are talking quantitatively, not qualitatively. And if we are seriously interested in the computational properties of either kind of language, we will be interested in their language-theoretic properties, as well as properties of the grammars that define them and the parsers that accept them.</Paragraph>
    <Paragraph position="2"> The most important language-theoretic class considered by designers of programming languages, compilers, etc. is the context-free languages (CFL's). Ginsburg (1980, 7) goes so far as to say on behalf of formal language theorists, &amp;quot;We live or die on the context-free languages.&amp;quot;) The class of CFL's is very rich. Although there are simply definable languages well known to be non-CF, linguists often take CFL's to be non-CF in error.</Paragraph>
    <Paragraph position="3"> Several examples are cited in Pullum and Gazdar (1982). For another example, see Dowry, Wall and Peters (1980; p.81), where exercise 3 invites the reader to prove a certain artificial language non-CF. The exercise is impossible, for the language i__% a CFL, as noted by William H. Baxter (personal communication to Gerald Gazdar).</Paragraph>
    <Paragraph position="4"> From this point on, it will he useful to be able to refer to certain types of formal language by names. I shall use the terms defined in \[i) thru (3), among others.</Paragraph>
    <Paragraph position="5"> (i) Triple Counting Languages: languages that can be mapped by a homomorphism onto some language of the form  ~ b n ~1 nZl~ (2) String Matching Languages: languages that can be mapped by a homomorphism onto some language of the form {xxlx is in some infinite language A} (3) String Contrasti~ Languages:  languages that can be mapped by a homomorphism onto some language of the form {xcy\[x and y are in some infinite language A and x ~ y} Programming languages are virtually always designed to be CF, except that there is a moot point concerning the implications of obligatory initial declaration of variables as in ALGOL or Pascal, since if variables (identifiers) can be alphanumeric strings of arbitrary length, a syntactic guarantee that each variable has been declared is tantamount to a syntax for a string matching language. The following view seems a sensible one to take about such cases: languages like ALGOL or Pascal are CF, but not all ALGOL or Pascal programs compile or run. Programs using undeclared variables make no sense either to the compiler or to the CPU. But they are still programs, provided they conform in all other ways to the syntax of the language in question, just as a program which always goes into an infinite loop and thus never gives any output is a program. Aho and Ullmann (1977, 140) take such a view: the syntax of ALGOL...does not get down to the level of characters in a name. Instead, all names are represented by a token such as i d, and it is left to the bookkeeping phase of the compiler to keep track of declarations and uses of particular names.</Paragraph>
    <Paragraph position="6"> The bookkeeping has Co be done, of course, even in the case of languages like LISP whose syntax does not demand a list of declarations at the start of each program.</Paragraph>
    <Paragraph position="7"> Various efforts have been made in the linguistic literature to show that some human language has an infinite, appropriately extractable subset that is a triple counting language or a string matching language. (By appropriately extractable I mean isolable via either homomorphism or intersection with a regular set.) But all the published claims of this sort are fallacious (Pullum and Gazdar 1982). This lends plausibility to the hypothesis that human languages are all CF. Stronger claims than this (e.g. that human languages are regular, or finite cardinality) have seldom seriously defended. I now want to propose one, however.</Paragraph>
    <Paragraph position="8"> I propose that human languages are never profligate CYL's in the sense given by the following definition.</Paragraph>
    <Paragraph position="9">  (i) A CFL is profligate if all CF-PSG's generating it have nonterminal vocabularies strictly larger than their terminal vocabularies.</Paragraph>
    <Paragraph position="10"> (ii) A CFL is profligate if it is the image of a profligate language under some homomorphism.</Paragraph>
    <Paragraph position="11"> \[OPEN PROBLEM: Is profligacy decidable for an arbitrary CFL? I conjecture that it is not, but I  have not been able to prove this.\] Clearly, only an infinite CPL can be profligate, and clearly the most commonly cited infinite CFL's are not profligate. For instance, {!nbn~n ~ 0} is not profligate, because it has two terminal symbols but there is a grammar for it that has only one nonterminal symbol, namely S. (The rules are: (S --&gt; aSb, S --&gt; e}.) However, profligate CFL's do exist. There are even regular languages that are profligate: a simple example (due to Christopher Culy) is (A* + ~*).</Paragraph>
    <Paragraph position="12"> More interesting is the fact that some string contrasting languages as defined above are profligate. Consider the string contrasting language over the vocabulary {~, k, K} where A = (A + ~)*. A  string xcv in (~ + b)*~(~ + A)* will be in this language if any one of the following is met: (a) ~ is longer than Z; (b) K is shorter than ~; (c) ~ is the same length as ~ but there is an  such that the ith symbol of K is distinct from the ith symbol of ~.</Paragraph>
    <Paragraph position="13"> The interesting Condition here is (c). The grammar has to generate, for all ~ and for all pairs &lt;u, v&gt; of symbols in the terminal vocabulary, all those strings in (a + b)*c(a + b)* such that the ~th symbol is ~ and the ~th symbol after ~ is Z. There is no bound on l, so recursion has tO be involved.</Paragraph>
    <Paragraph position="14"> But it must be recursion through a category that preserves a record of which symbol is crucially going to be deposited at the ~th position in the terminal string and mismatched with a distinct symbol in the second half. A CF-PSG that does this can be constructed (see Pullum and Gazdar 1982, 478, for a grammar for a very similar language). But such a grammar has to use recursive nonterminals, one for each terminal, to carry down information about the symbol to be deposited at a certain point in the string. In the language just given there are only two relevant terminal symbols, but if there were a thousand symbols that could appear in the ~ and ~ strings, then the vocabulary of recursive nonterminals would have to be increased in proportion. (The second clause in the definition of profligacy makes it irrelevant whether there are other terminals in the language, like g in the language cited, that do not have to participate in the recursive mechanisms just referred to.) For a profligate CFL, the argument that a CF-PSG is a cumbersome and inelegant form of grammar might well have to be accepted. A CF-PSG offers, in some cases at least, an appallingly inelegant hypothesis as to the proper description of such a language, and would be rejected by any linguist or programmer. The discovery that some human language is profligate would therefore provide (for the first time, I claim) real grounds for a rejection of CF-PSG's on the basis of strong generative capacity (considerations of what structural descriptions are assigned to strings) as opposed to weak (what language is generated).</Paragraph>
    <Paragraph position="15"> However, no human language has been shown to be a profligate CFL. There is one relevant argument in the literature, found in Chomsky (1963). The argument is based on the nonidentity of constituents allegedly required in comparative clause constructions like (4).</Paragraph>
    <Paragraph position="16"> (4) She is more competent as \[a designer of programming languages\] than he is as \[a designer of microchips\].</Paragraph>
    <Paragraph position="17"> Chomsky took sentences like (5) to be ungrammatical, and thus assumed that the nonidentity between the bracketed phrases in the previous example had to be guaranteed by the grammar.</Paragraph>
    <Paragraph position="18"> (5) She is more competent as \[a designer of programming languages\] than he is as \[a designer of programming languages|.</Paragraph>
    <Paragraph position="19"> Chomsky took this as an argument for non-CF-ness in English, since he thought all string contrasting languages were non-CF (see Chomsky 1963, 378-379), but it can be reinterpreted as an attempt to show that English is (at least) profligate. (It could even be reconstituted as a formally valid argument that English was non-CF if supplemented by a demonstration that the class of phrases from which the bracketed sequences are drawn is not only&amp;quot; infinite but non-regular; of. Zwicky and Sadock.) However, the argument clearly collapses on empirical grounds. As pointed out by Pullum and Gazdar (1982, 476-477), even Chomsky now agrees that strings like (5) are grammatical (though they need a contrastive context and the appropriate intonation to make them readily acceptable to informants). Hence these examples do not show that there is a homomorphism mapping English onto some profligate string contrasting language.</Paragraph>
    <Paragraph position="20"> The interesting thing about this, if it is correct, is that it suggests that human languages not only never demand the syntactic string comparison required by string matching languages, they never call for syntactic string comparision over infinite sets of strings at all, whether for symbol-by-symbol checking of identity (which typically makes the language non-CF) or for specifying a mismatch between symbols (which may not make the language non-CF, but typically makes it profligate). null There is an important point about profligacy that&amp;quot; I should make at this point. My claim that human languages are non-profligate entails that each human language has at least one CF-PSG in which the nonterminal vocabulary has cardinality strictly less than the terminal vocabulary, but not that the best granzaar to implement for it will necessarily meet this condition. The point is important, because the phrase structure grammars employed in natural language processing generally have complex nouterminals consisting of sizeable feature bundles. It is not uncommon for a large natural language processing system to employ thirty . or forty binary features (or a rough equivalent in terms of multi-valued features), i.e. about as many features as are employed for phonological description by Chomsky and Halle (19681. The GPSG system described in Gawron et al. (1982) has employed features on this sort of scale at all points in its development, for example. Thirty or forty binary features yields between a billion and a trillion logically distinguishable nonterminals (if all values for each feature are compatible with all combinations of values for all other features).</Paragraph>
    <Paragraph position="21"> Because economical techniques for rapid checking of relevant feature values are built into the parsers normally used for such grammars, the size of the potentially available nonterminal vocabulary is not a practical concern. In principle, if the goal of capturing generalizations and reducing the size of the grammar formulation were put aside, the nonterminal vocabulary could be vastly reduced by replacing rule schemata by long lists of distinct rules expanding the same nonterminal.</Paragraph>
    <Paragraph position="22"> Naturally, no claim has been made here that profligate CFL's are computationally intractable. No CFL's are intractable in the theoretical sense, and intractability in practice is so closely tied to details of particular machines and programming environments as to be pointless to talk about in terms divorced from actual measurements of size for grammars, vocabularies, and address spaces. I have been concerned only to point out that there is an interesting proper subset of the infinite CFL's within which the human languages seem to fall.</Paragraph>
    <Paragraph position="23"> One further thing may be worth pointing out.</Paragraph>
    <Paragraph position="24"> The kind of string contrasting languages I have been concerned with above are strictly nondeterministic. The deterministic CFL's (DCFL's) are closed under complementation. But the cor~ I _nt  of (6) {xcvJx and ~ are in (&amp; + ~)* and ~ # ~}</Paragraph>
    <Paragraph position="26"> string matching language.</Paragraph>
    <Paragraph position="27"> (7)a. {xcvl~ and ~ are in (~ + b)* and x = ~} b. {xcx\[x is in (a + b)*} If (7a) \[=(Yb)\] is non-CF and is the complement of (6), then (6) is not a DCFL.</Paragraph>
    <Paragraph position="28"> \[OPEN PROBLEM: Are there any nonregular profligate DCFL's?\]</Paragraph>
  </Section>
class="xml-element"></Paper>