<?xml version="1.0" standalone="yes"?> <Paper uid="C88-2111"> <Title>Using a Logic Grammar to Learn a Lexicon</Title> <Section position="2" start_page="0" end_page="524" type="abstr"> <SectionTitle> 1. Introduction </SectionTitle> <Paragraph position="0"> The basic idea is as follows: a logic grammar [1] can be viewed as the definition of a relation between a string and a parse-tree. You can run it two ways: finding the parse-trees that correspond to a given string (parsing), or finding the strings that correspond to a given parse-tree (generating). However, if we view the lexicon as part of this relation, we get new possibilities. More specifically, we can compute the lexicons that correspond to a given string; this can in a natural way be viewed as a formalization of &quot;lexicon learning from example sentences&quot;. In terms of the &quot;explanation-based learning&quot; paradigm, this makes the associated parse-tree the &quot;explanation&quot; (see diagram 1).</Paragraph> <Paragraph position="1"> In what comes below, we are going to consider the following questions: 1) We are learning from positive-only examples. What can't be learned like this? 2) The basic structural constraint, the thing that makes it all work, is the assumption that a word can usually only be interpreted as one part of speech. If we assume that this is always going to be true, then things really go pretty well (Section 2). However, this rule is broken sufficiently often that a realistic system has to be able to deal with it. How? 3) How important is the order in which examples are presented? Can the system select a good order itself, if it is important? 4) What are the complexity properties? How well does the method scale in terms of the number of sentences, the number of grammar rules, and the number of words to learn? 2. Learning with the &quot;one entry per word&quot; assumption. This is the simplest variant of the idea: assume that there is one entry per word, and represent the lexicon as an association-list (alist) with one entry for each word. Each sentence now constrains the possible values of these entries to be ones which allow it to be parsed; the hope is that a conjunction of a suitably large number of such constraints will be enough to determine the lexicon uniquely.</Paragraph> <Paragraph position="2"> In concrete Prolog programming terms, what this means is the following. In the initial lexicon, the entries are all uninstantiated. We use this to parse the first sentence, which fills in some entries; the resulting partially instantiated lexicon is passed on to the second sentence, which either refutes it or instantiates it some more, and the process is repeated until we get to the end. If at any stage we are unable to parse a sentence, we just backtrack. If we want to, we can continue even after we've got to the end, to generate all possible lexicons that are consistent with the input sentences and the grammar (and in fact we ought to do this, so as to know which words are still ambiguous). This procedure can be embodied as a one-page Prolog program (see diagram 2), and despite its simplicity it is surprisingly fast on small examples (a grammar with 15-30 rules, 10-15 sentences with a total of 30-40 words to learn).
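For concreteness, the loop just described could be sketched as follows; the predicate names learn_lexicon, initial_lexicon and parse_all are illustrative assumptions of ours rather than the actual one-page program, but they use the same lexicon representation and the grammar of diagram 2.

% Sketch only (assumed names; not the one-page program referred to above).
% The lexicon is an alist of [Word, Class] pairs in which Class is initially
% an unbound variable; parsing a sentence instantiates some of the entries,
% and a sentence that cannot be parsed forces backtracking into the choices
% made for earlier sentences.
learn_lexicon(Sentences, Lex) :-
    initial_lexicon(Sentences, Lex),
    parse_all(Sentences, Lex).

% One entry per distinct word occurring anywhere in the example sentences.
initial_lexicon(Sentences, Lex) :-
    setof(W, S^(member(S, Sentences), member(W, S)), Words),
    findall([W, _Class], member(W, Words), Lex).

% Thread the shared, partially instantiated lexicon through every sentence.
parse_all([], _Lex).
parse_all([S|Ss], Lex) :-
    s(Lex, S, []),   % parse S with the DCG of diagram 2 (start symbol s(L))
    parse_all(Ss, Lex).

Asking for further solutions of learn_lexicon on backtracking enumerates the alternative lexicons that are also consistent with the input, which is the behaviour described above.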
We performed some experiments with this kind of setup, and drew the following conclusions: 1) Certain things can't be learned from positive-only examples. For example (at least with the grammars we have tried), it is impossible to determine whether belongs is a verb which takes a PP complement with preposition to, or is an intransitive verb which just happens to have a PP modifier in all the sentences where it turns up. However, things of this kind seem fairly rare.</Paragraph> <Paragraph position="3"> 2) Order is fairly critical. When examples are presented at random, a run time of about 100 seconds for a 10-12 sentence group is typical; ordering them so that not too many new words are introduced at once drops this to about 5 seconds, a factor of 20. This gets worse with more sentences, since a lot of work can be done before the system realizes it's got a wrong hypothesis and has to backtrack.</Paragraph> <Paragraph position="5">
member(W,S)),L).
lex_lookup(Word, Lex, Class) :- member([Word, Class], Lex).
% Example grammar:
s(L) --> np(L), vp(L).
np(L) --> det(L), noun(L).
vp(L) --> iv(L).
vp(L) --> tv(L), np(L).
det(L) --> [D], {lex_lookup(D, L, det)}.
noun(L) --> [N], {lex_lookup(N, L, noun)}.
iv(L) --> [V], {lex_lookup(V, L, iv)}.
tv(L) --> [V], {lex_lookup(V, L, tv)}.
Diagram 2</Paragraph> <Paragraph position="10"> 3) A more important complexity point: structural ambiguities needn't be lexical ambiguities; in other words, it is quite possible to parse a sentence in two distinct ways which still both demand the same lexical entries (in practice, the most common case by far is NP/VP attachment ambiguity). Every such ambiguity introduces a spurious duplication of the lexicon, and since these multiply we get an exponential dependency on the number of sentences. We could conceivably have tried to construct a grammar which doesn't produce this kind of ambiguity (cf. [2], pp. 64-71), but instead we reorganized the algorithm so as to collect after each step the set of all possible lexicons compatible with the input so far. Duplicates are then eliminated from this, and the result is passed to the next step. Although the resulting program is actually considerably more expensive for small examples, it wins in the long run. Moreover, it seems the right method to build on when we relax the &quot;one entry per word&quot; assumption. 3. Removing the &quot;one entry per word&quot; assumption. We don't actually remove the assumption totally, but just weaken it; for each new sentence, we now assume that, of the words already possessed of one or more entries, at most one may have an unknown alternate. Multiple entries are sufficiently rare to make this reasonable. So we extend the methods from the end of section 2; first we try to parse the current sentence by looking up known entries and filling in entries for words we so far know nothing about. If we don't get any result this way, we try again, this time with the added possibility of once assuming that a word which already has known entries in fact has one more.
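To make this two-stage strategy concrete, here is a minimal sketch in the same style; the predicate name relaxed_parse and the way the extra entry is represented are our own assumptions rather than the program described in the text.

% Sketch only (assumed names; not the program described in the text).
% Lex is a list of [Word, Class] pairs as in diagram 2; words not seen
% before have a single entry whose Class is still an unbound variable.
% Stage 1: parse with the current entries, filling in unknown ones.
relaxed_parse(Sentence, Lex, Lex) :-
    s(Lex, Sentence, []).
% Stage 2: if that fails, let exactly one word which already has a known
% entry acquire one additional entry, and try again.
relaxed_parse(Sentence, Lex, [[Word, _Extra] | Lex]) :-
    \+ s(Lex, Sentence, []),
    member(Word, Sentence),
    member([Word, Class], Lex),
    nonvar(Class),
    s([[Word, _Extra] | Lex], Sentence, []).

A fuller version would also insist that the class of the new entry differs from the classes the word already has.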
</Paragraph> <Paragraph position="11"> This is usually OK, but sometimes produces strange results, as witness the following example. Suppose the first three sentences are John drives a car, John drives well, and John drives. After the first sentence, the system guesses that drives is a transitive verb, and it is able to maintain this belief after the second sentence if it also assumes that well is a pronoun. However, the third sentence forces it to realize that drives can also be an intransitive verb. Later on, it will presumably meet a sentence which forces well to be an adverb; we now have an anomalous lexicon where well has an extra entry (as pronoun), which is not actually used to explain anything any longer. To correct situations like this one, a two-pass method is necessary; we parse through all the sentences a second time with the final lexicon, keeping count of which entries are actually used. If we find some way of going through the whole lot without using some entry, it can be discarded.</Paragraph> <Paragraph position="12"> 4. Ordering the sentences. As remarked above, order is a critical factor; if words are introduced too quickly, so that the system has no chance to disambiguate them before moving on to new ones, then the number of alternate lexicons grows exponentially. Some way of ordering the sentences automatically is essential.</Paragraph> <Paragraph position="13"> Our initial effort in this direction is very simple, but still seems reasonably efficient; sentences are pre-ordered so as to minimize the number of new words introduced at each stage. So the first sentence is the one that contains the smallest number of distinct words, the second is the one which contains the smallest number of words not present in the first one, and so on. We have experimented with this approach, using groups of between 20 and 40 sentences and a grammar containing about 40 rules. If the sentences are randomly ordered, the number of alternate lexicons typically grows to over 400 within the first 6 to 10 sentences; this slows things down to the point where further progress is in practice impossible. Using the above strategy, we get a fairly dramatic improvement; the number of alternates remains small, reaching peak values of about 30. This is sufficient to be able to process the groups within sensible times (less than 15 seconds per sentence on average). In the next two sections, we discuss the limitations of this method and suggest some more sophisticated alternatives.</Paragraph> </Section></Paper>