<?xml version="1.0" standalone="yes"?> <Paper uid="W94-0115"> <Title>Learning a Radically Lexical Grammar</Title> <Section position="3" start_page="122" end_page="123" type="metho"> <SectionTitle> 2 Motivation </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="122" end_page="122" type="sub_section"> <SectionTitle> 2.1 Theoretical </SectionTitle> <Paragraph position="0"> It is hard to imagine an automatic (or even semi-automatic) grammar or lexicon induction procedure that could manipulate traditional grammatical entities such that human-meaningful results might obtain. Brute-force methods (i.e. those that exploit the massive raw computing power currently available cheaply) may well produce some useful results (e.g. Brown et al. 1993). However, if any linguistic insight is to appear as a result, we believe an underpinning in linguistic theory is essential - overall success will result from combining and integrating that at which computers and human linguists each excel. We believe this is only possible if both partners work in the same basic framework - and we also believe that training linguists to read disk sectors will probably be unproductive.</Paragraph> <Paragraph position="1"> Our premise is that: 1) giving grammatical entities structure is useful - for providing a framework to describe them to computers, and, e.g., for generating (automatically) a taxonomy of such entities; 2) such a structure can embody the grammar of a language - all the necessary linguistic knowledge can be incorporated into the structure of the grammatical entity (or category) of each word in a language; 3) given 1 & 2, we can demonstrate that a semi-automatic means for inferring a structural lexicon (equivalent to a grammar) from a corpus of natural language is possible.</Paragraph> <Paragraph position="2"> To illustrate 1 & 2, we suggest that the lexicon can be unified with grammar, using the theoretical framework of Categorial Grammar (CG). CG is recognized within linguistic theory as the logical ultimate in 'lexical syntax', a model in which all syntactic information is held in the lexical categories of individual words, and there is no separate component of 'grammar rules' - thus Karttunen's (1989) 'radical lexicalism' (from which our title derives). A CG induction system has been built which provides evidence for 3.</Paragraph> </Section> <Section position="2" start_page="122" end_page="123" type="sub_section"> <SectionTitle> 2.2 Practical </SectionTitle> <Paragraph position="0"> Two major challenges for NLP systems are to support wide-ranging and developing vocabularies, and to work with more than strictly 'syntactically correct' input. We believe that both can be addressed by (at least semi-) automatic lexicon growth, or induction.</Paragraph> <Paragraph position="1"> Imagine an application that operates on 'real-world' textual input - for example, in the medical domain, a system that produces a semantic analysis of free-text hospital discharge summaries. A fixed lexicon is clearly impracticable. Practitioners should not (and indeed will not) accept any artificially imposed limited vocabulary - it is simply not appropriate for their task.
There are (at least) two imaginable strategies for handling the situation when a previously unknown word is encountered (either when encountering new domains, or extra complexity in a 'known' domain): 1) Bail out, and ask a linguist/lexicographer to manually augment the lexicon/grammar rules; 2) Have the NLP system itself make sensible and useful suggestions as to the new word's syntactic category (and, potentially, its conceptual structure).</Paragraph> <Paragraph position="2"> Similarly, there are (at least) three potential strategies for dealing with the situation in which a system encounters 'syntactically incorrect' input for which an error message is neither useful nor appropriate - the input comes from something that has actually been said. These three strategies are: 1) Throw it out; 2) Invent a new rule to cope with the particular scenario; 3) Augment the lexicon with additional categories.</Paragraph> <Paragraph position="3"> We believe that augmenting the lexicon is the only realistic approach to this problem.</Paragraph> <Paragraph position="4"> An experimental prototype of a Categorial Grammar induction system (known as L) has been produced which illustrates and encourages our belief that a lexicon in which richly structured entities are the means of encoding syntactic and semantic knowledge can be 'grown' to meet these demands.</Paragraph> </Section> </Section> <Section position="4" start_page="123" end_page="127" type="metho"> <SectionTitle> 3 Categorial Grammar </SectionTitle> <Paragraph position="0"> The CG induced by L is a simple one on the scale of categorial calculi. It uses three atomic categories, S, N, and NP (we will return at the end of this paper to discuss the limitations of this atomic notation); two connectives: / (forward combination) and \ (backward combination); and three combination rules: forward application, backward application, and forward composition. The notation used is consistently result-first, regardless of the direction of the connective. Complex categories are implicitly left-bracketed, i.e. S\NP/NP is equivalent to (S\NP)/NP.</Paragraph> <Paragraph position="1"> Complex categories are functions named by their arguments and outputs; there is no concept of 'verb', but rather of a function from one nominal to a sentence (S\NP, 'intransitive verb'), from two nominals to a sentence (S\NP/NP, 'transitive verb'), and so on. This 'combinatory transparency', their visible information structure, makes CGs well suited to corpus-based induction of a lexicon/grammar.</Paragraph> <Paragraph position="2"> We rely on the property of 'parametric neutrality' (Pareschi 1986) - not only can we determine the result of combining two known categories, but from one category and the result we can determine the second category. Thus: Given NP S\NP -> X then X = S; Given X S\NP -> S then X = NP; Given NP X -> S then X = S\NP.</Paragraph> <Paragraph position="3"> Each category assignment we induce gives us more information to help with the next (as will be seen below); unlike traditional categories, these have a rich information structure which we can query for help in making further decisions. We can therefore approach a text knowing only the identity of nouns and sentences and the principles of function application and composition, and from these we can induce the complex categories of the other words in the sentence; in other words, we can learn the lexicon, and in it the grammar.</Paragraph>
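To make the category algebra concrete, here is a minimal illustrative sketch in Python (the system itself is written in CProlog; everything below, names included, is our own reconstruction, not the authors' code). It implements the result-first, left-bracketed notation, forward and backward application, forward composition, and the three parametric-neutrality deductions just given.

```python
# Illustrative sketch only: atomic categories S, N, NP; connectives / and \;
# result-first notation with implicit left bracketing, so "S\NP/NP" parses
# as (S\NP)/NP.
import re
from dataclasses import dataclass
from typing import Optional, Union

@dataclass(frozen=True)
class Complex:
    result: "Cat"   # what the category yields once its argument is found
    slash: str      # "/" seeks its argument to the right, "\\" to the left
    arg: "Cat"

Cat = Union[str, Complex]   # atoms are plain strings: "S", "N", "NP"

def parse(s: str) -> Cat:
    """Parse result-first, left-bracketed notation, e.g. 'S\\NP/NP'."""
    tokens = re.findall(r"NP|S|N|[/\\]", s)
    cat: Cat = tokens[0]
    for i in range(1, len(tokens), 2):
        # left bracketing: each connective wraps everything so far as result
        cat = Complex(result=cat, slash=tokens[i], arg=tokens[i + 1])
    return cat

def apply_pair(left: Cat, right: Cat) -> Optional[Cat]:
    """Forward application X/Y Y -> X; backward application Y X\\Y -> X."""
    if isinstance(left, Complex) and left.slash == "/" and left.arg == right:
        return left.result
    if isinstance(right, Complex) and right.slash == "\\" and right.arg == left:
        return right.result
    return None

def compose(left: Cat, right: Cat) -> Optional[Cat]:
    """Forward composition X/Y Y/Z -> X/Z."""
    if (isinstance(left, Complex) and left.slash == "/" and
            isinstance(right, Complex) and right.slash == "/" and
            left.arg == right.result):
        return Complex(left.result, "/", right.arg)
    return None

def solve_left(right: Cat, result: Cat) -> Optional[Cat]:
    """Parametric neutrality: X S\\NP -> S gives X = NP."""
    if isinstance(right, Complex) and right.slash == "\\" and right.result == result:
        return right.arg
    return None

def solve_right(left: Cat, result: Cat) -> Cat:
    """Parametric neutrality: NP X -> S gives X = S\\NP."""
    return Complex(result=result, slash="\\", arg=left)

# the three deductions from the text:
assert apply_pair("NP", parse("S\\NP")) == "S"     # NP S\NP -> X : X = S
assert solve_left(parse("S\\NP"), "S") == "NP"     # X S\NP -> S  : X = NP
assert solve_right("NP", "S") == parse("S\\NP")    # NP X -> S    : X = S\NP
```

Because categories are structured values that compare by structure, the three deductions can be checked mechanically; this is the 'combinatory transparency' the text appeals to.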
<Section position="1" start_page="123" end_page="125" type="sub_section"> <SectionTitle> 4.1 Overview </SectionTitle> <Paragraph position="0"> L is a simple Categorial Grammar induction system, based on holding theoretically motivated and empirically demonstrated linguistic information by representing words and their behaviour as complex structured objects in a format readable by both human and machine. The input is a simple text corpus in which the boundaries of sentences and, initially, the identity of a few nouns are known. The output - the result of the induction process - is a set of lexical categories for all the words in the corpus, which, as we have explained, constitutes a grammar for the corpus. This is made possible by the characteristic of 'parametric neutrality' described above. Many words have multiple categories (e.g. 'toy', as N 'noun' or N/N 'adjective'); this is how ambiguity is handled. L successfully infers multiple categories for many words in the corpus, and, critically, the categories it proposes do make cognitive sense to human linguists. This gives us hope that the system could be usefully guided and helped by humans in what is - in the limit - a difficult task: category assignments proposed by the system can be readily evaluated by its user. The lexicon induced by L has been used successfully, as a test, for simple text generation.</Paragraph> <Paragraph position="1"> The system is implemented in CProlog on a SUN 3/50 workstation.</Paragraph> </Section> <Section position="2" start_page="125" end_page="125" type="sub_section"> <SectionTitle> 4.2 The Corpus </SectionTitle> <Paragraph position="0"> The corpus used is a selection of books from the Ladybird Key Words Reading Scheme, a series of books designed to help children learn to read, ordered in a graded sequence which was followed by the system. The system is 'bootstrapped' by a few examples of primitive categories - this is an example of where we feel that best results are obtained by not hog-tying the system for its own sake. Sentence boundaries are given by punctuation.</Paragraph> <Paragraph position="1"> A few nouns are defined in the corpus itself: the first books in the series begin with a sort of 'picture dictionary' of their central characters and objects, and we gave this starting point to the system also. Notice that this fits exactly the use of S and N(P) as atomic categories in CG.</Paragraph> <Paragraph position="2"> We are encouraged here that our approach also has some psychological plausibility. Children learning a language do so by learning the names of things first, then how those names can be fitted together into propositions. That a sequence designed to help human learners should prove suitable for teaching a computer suggests that they may be working along similar lines.</Paragraph> <Paragraph position="3"> An interesting side-effect of this choice is that the corpus is often syntactically odd - it was designed to help children's reading, rather than to teach grammar. However, the success of L on such an 'unsyntactic' (though understandable) corpus gives promise for its application to other 'real-world' corpora - real text and perhaps the spoken word. (It must be admitted that L was completely baffled, in reading 'The Three Billy Goats Gruff', by the 'sentence' 'Trip trap, trip trap, trip trap!' - but this is surely allowable at this stage as an extreme case.)</Paragraph> </Section>
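As a small illustration of the starting state just described - and assuming a representation the paper does not spell out - the following hypothetical sketch shows sentence boundaries taken from punctuation, a seed lexicon drawn from the 'picture dictionary', and a word-to-set-of-categories mapping that leaves room for ambiguity. The example words come from the paper's own traces; the data structures are ours.

```python
import re

def sentences(text: str) -> list[list[str]]:
    """Sentence boundaries are given by punctuation; tokens by whitespace."""
    return [s.split() for s in re.split(r"[.!?]+", text.lower()) if s.strip()]

# Seed ('bootstrap') lexicon: the only categories L starts with.  Each word
# maps to a SET of categories, so ambiguity is representable from the start.
lexicon: dict[str, set[str]] = {
    "peter": {"np"}, "jane": {"np"},    # central characters
    "ball":  {"n"},  "shop": {"n"},     # picture-dictionary objects
}

corpus = sentences("Peter likes the ball. Here is jane's shop.")
# 'likes', 'the', 'here', 'is' and "jane's" all start uncategorised; the
# passes described in section 4.3 fill them in one word at a time.
```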
<Section position="3" start_page="125" end_page="127" type="sub_section"> <SectionTitle> 4.3 Principles of Operation </SectionTitle> <Paragraph position="0"> L works through a combination of computer processing power and statistical evidence applied to an underlying linguistic theory - in our view, the 'best of both worlds'.</Paragraph> <Paragraph position="1"> It is a simple system; this simplicity itself we regard as a significant achievement.</Paragraph> <Paragraph position="2"> In the description which follows, some example output from a simple text-based interface to L is included to illustrate the various processes involved.</Paragraph> <Paragraph position="3"> The system has a few 'bootstrapped' primitive categories, as explained above. A multi-pass iterative approach is used to analyse and further annotate the corpus - the first pass uses just the few identified categories. On each pass, L assigns categories to more words of the corpus. (The strategy bears some resemblance to island-based parsing, which similarly begins with the point(s) of greatest certainty in an input string and works outwards from them.) Each pass has three parts: 1) The system selects which word to try to categorise in this pass. It uses statistical evidence to choose for analysis the word which occurs in the corpus with the most consistent already-categorised (immediate) nearest neighbours. Clearly a precondition for this approach is to have at least some categorised words in the corpus - having some bootstrapped categories enables L to embark in a sensible direction.</Paragraph> <Paragraph position="4"> 2) Assign a category. If the word to be assigned is the last remaining uncategorised word in a sentence, then the principle of parametric neutrality is applied. Due to the compositional, recursive nature of Categorial Grammar categories, L can always find a category to fit. For example:</Paragraph> <Paragraph position="5"> Pass 4 ...
missing link completion
assigning 's\np/np' to 'likes' with a confidence of 14/16
EXAMPLE: original sentence : Peter likes the ball.
before this pass : np likes np
now reduced to : s</Paragraph> <Paragraph position="6"> Note that the assignment found is only applied to those instances in the corpus which are both sanctioned by CG and 'deemed appropriate' by L itself - in this case, 14 out of 16 occurrences. This mechanism allows L the opportunity of assigning multiple different categories to a word, to cope with ambiguity.</Paragraph> <Paragraph position="7"> Note also that the information given by the 'confidence' measure is more than the probability which would be expressed by reducing it to 'one in ...'. 16/16 indicates a common (in this small corpus) and unambiguous word, while a word given 1/1 has been found a category on its one appearance; a word with a rating of 10/15 is established as regularly ambiguous, but 2/3 could prove on further exposure to be mainly regular, with only one occurrence of an alternative category. These degrees of probability are taken into account by the algorithm which assigns categories in new text.</Paragraph>
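The following hypothetical sketch shows one way the bookkeeping behind parts 1 and 2 could look, reusing the corpus and lexicon shapes from the previous sketch (the CG sanctioning check is left abstract, and all names are ours). The neighbour-pattern counting in step 1 also anticipates the 'behaviour' matching of the more complex case described next.

```python
from collections import Counter

def select_word(corpus: list[list[str]],
                lexicon: dict[str, set[str]]) -> str | None:
    """Step 1 (sketch): pick the uncategorised word whose already-categorised
    immediate neighbours are most consistent across the corpus."""
    best, best_score = None, 0.0
    unknown = {w for sent in corpus for w in sent if w not in lexicon}
    for word in unknown:
        patterns: Counter = Counter()   # (offset, neighbour-category) pairs
        for sent in corpus:
            for i, w in enumerate(sent):
                if w != word:
                    continue
                for j in (i - 1, i + 1):
                    if 0 <= j < len(sent) and sent[j] in lexicon:
                        for cat in lexicon[sent[j]]:
                            patterns[(j - i, cat)] += 1
        if patterns:
            # consistency: share of the single commonest neighbour pattern
            score = max(patterns.values()) / sum(patterns.values())
            if score > best_score:
                best, best_score = word, score
    return best

def confidence(word: str, category: str, corpus, sanctioned) -> tuple[int, int]:
    """Return (fits, occurrences): in how many of the word's occurrences the
    candidate category is sanctioned by CG.  `sanctioned` stands in for the
    CG check and is not implemented here.  The pair is kept as a fraction of
    instances rather than collapsed to a probability, since 14/16, 1/1 and
    2/3 carry different evidential weight."""
    occurrences = [(sent, i) for sent in corpus
                   for i, w in enumerate(sent) if w == word]
    fits = [occ for occ in occurrences if sanctioned(occ, category)]
    # the category is then recorded only on the sanctioned occurrences,
    # which is how one word ends up with several categories (ambiguity)
    return len(fits), len(occurrences)
```

On the paper's own figures, this bookkeeping would record 'likes' as 14/16 and 'jane's' as 5/6.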
<Paragraph position="8"> If the word is not the last in the sentence to be categorised, a more complex approach is required. The argument and direction of the resulting category are obtained as a by-product of the previous stage - the result is determined by an examination of the 'behaviour' of the resulting category in the corpus. In this context, 'behaviour' is defined as the pattern of neighbours' categories in the corpus so far. For example:</Paragraph> <Paragraph position="9"> Pass 5 ...
assigning 'np/n' to 'jane's' with a confidence of 5/6
EXAMPLE: original sentence : Here is jane's shop.
before this pass : Here is jane's n
now reduced to : Here is np</Paragraph> <Paragraph position="12"> In this case, the word jane's is chosen because, informally, it frequently occurs before an N. This tells us that the category to assign must have an argument of N, and that the direction must be forward - in other words, an X/N. This category is completed with an NP because this is a reasonable behavioural match with the rest of the corpus - NP often appears after 'is'. Note that statistical evidence is again essential.</Paragraph> <Paragraph position="13"> 3) Having obtained a putative category, L then applies it to the occurrences of the word in the corpus, wherever it is sanctioned by the semantics of CG. In this way, ambiguity is captured - only those occurrences which 'fit' are categorised at each pass.</Paragraph> </Section> </Section> </Paper>