<?xml version="1.0" standalone="yes"?>
<Paper uid="W94-0115">
  <Title>Learning a Radically Lexical Grammar</Title>
  <Section position="6" start_page="128" end_page="129" type="concl">
    <SectionTitle>
5 Discussion and Prospects
</SectionTitle>
    <Paragraph position="0"> L, as an experimental prototype, has demonstrated the practicability of a radically lexical approach for managing some of the major challenges for NLP. It learns the grammar of (the words of) new text and represents what it has learned in structured linguistic entities which are readable equally by computer and computational linguist. Its only starting point is the general rule of function application and the identity of a few nouns and sentences. We firmly believe that this minimalist methodology - assume as little as possible, and use principles which are as general as possible - is sound.</Paragraph>
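The single starting rule mentioned above, function application over categorial-grammar categories, can be illustrated with a minimal sketch. This is our own illustration, not the authors' implementation; the category notation (S, NP, slashes for functor categories) follows standard categorial grammar.

```python
# Minimal sketch of categorial function application (illustrative only,
# not the paper's code).  A functor category like "S\NP" consumes an
# adjacent argument category to yield its result.

def apply_forward(functor, argument):
    """Forward application: X/Y combined with a following Y yields X."""
    if "/" in functor:
        result, arg = functor.rsplit("/", 1)
        if arg == argument:
            return result
    return None

def apply_backward(argument, functor):
    """Backward application: Y combined with a following X\\Y yields X."""
    if "\\" in functor:
        result, arg = functor.rsplit("\\", 1)
        if arg == argument:
            return result
    return None

# 'dogs sleep': an NP followed by the intransitive-verb category S\NP
print(apply_backward("NP", "S\\NP"))        # S
# 'likes dogs': a transitive verb (S\NP)/NP consumes its object NP
print(apply_forward("(S\\NP)/NP", "NP"))    # (S\NP)
```

Given only this combination rule and seed identities for a few nouns and sentences, category assignments for the remaining words can be inferred from the contexts in which they occur.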
    <Paragraph position="1"> L copes with that bugbear of NLP, ambiguity: it successfully infers multiple categories for many words in the corpus, and, critically, the categories it has proposed so far have been hand-checked and do make cognitive sense to human linguists.</Paragraph>
    <Paragraph position="2"> This makes us optimistic as to the ease with which such systems and their users will be able to co-operate. The lexicon induced by L has been used successfully, as a test, for simple text generation.</Paragraph>
    <Paragraph position="3"> Obviously there is a great deal of work still to be done. The simple atomic S, N, NP category notation used by L cannot represent finer-grained morphological information such as number, case, gender, and tense. This is clearly shown by the first sentence L generated: 'I likes'. Our first priority is therefore to introduce a more complex notation, probably using bundles of attribute/value pairs in the style of Unification Categorial Grammar (Zeevat 1988). (Clearly our comments at the beginning of this paper about the value of structured representations apply a fortiori here also.) We expect that a minimal explicit seeding of the corpus with these values will allow the system to 'learn' them also.</Paragraph>
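How attribute/value bundles in the UCG style would block agreement errors such as 'I likes' can be sketched as follows. The feature names (`per`, `num`) and the unification routine are our hypothetical illustration, not a proposal from the paper.

```python
# Hypothetical sketch of feature-bundle unification in the style of
# Unification Categorial Grammar: categories carry attribute/value
# pairs, and combination succeeds only when the bundles unify.

def unify(a, b):
    """Unify two flat feature bundles (dicts); return None on a clash."""
    merged = dict(a)
    for key, value in b.items():
        if key in merged and merged[key] != value:
            return None          # feature clash, e.g. per=1 vs per=3
        merged[key] = value
    return merged

subject_i = {"cat": "NP", "per": "1", "num": "sg"}      # the pronoun 'I'
likes_subject = {"cat": "NP", "per": "3", "num": "sg"}  # what 'likes' demands

print(unify(subject_i, likes_subject))   # None: person clash blocks 'I likes'
```

With such bundles, the atomic category NP becomes a structured object, so the same inference machinery can in principle learn number, case, gender, and tense distinctions from a minimally seeded corpus.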
    <Paragraph position="4"> Secondly, lexical semantics has not yet been addressed. The move to a feature/value notation will also provide a framework in which this is possible. The seeding and learning will be a more difficult task, which awaits investigation. A log of which particular words regularly co-occur should at least help in automatically establishing broad semantic fields; again, we expect a judicious balance of statistical and theoretical information to be appropriate here.</Paragraph>
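The co-occurrence log suggested above could take the following minimal form; the windowing scheme and corpus here are our assumptions, offered only to make the idea concrete.

```python
# Sketch (our assumption, not the paper's method) of logging word
# co-occurrence within a small window, the kind of raw statistic that
# might suggest broad semantic fields for later interpretation.

from collections import Counter

def cooccurrence_counts(sentences, window=3):
    """Count unordered word pairs occurring within `window` tokens."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.lower().split()
        for i, w in enumerate(words):
            for v in words[i + 1:i + window]:
                counts[tuple(sorted((w, v)))] += 1
    return counts

corpus = ["the dog chases the cat", "the cat sees the dog"]
counts = cooccurrence_counts(corpus)
print(counts[("cat", "the")])   # 3
```

As the paragraph notes, such numbers only become linguistic data once interpreted against a theory; the counts alone would serve merely to propose candidate groupings for inspection.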
    <Paragraph position="5"> Thirdly, once we have established a notation adequate to these demands, we will grow the lexicon/grammar by processing further texts of increasing complexity: first within the Ladybird series, but going on to real-world text, possibly in a medical domain. We do not anticipate serious difficulty in progressing through the Ladybird series, but we are well aware that there is a significant step from there to 'adult' text, at which point scaling up may well not be trivial.</Paragraph>
    <Paragraph position="6"> We have mentioned in passing that it is encouraging from a psychological perspective to find that a text corpus designed to aid human learning should prove well suited to machine learning. Of course learning to read a known language is a different task from learning a language. However, we hope eventually to explore the potential of our approach for modelling children's learning, and perhaps the use of its text generation ability in producing teaching material. The success of our approach - at least at prototype level - should be contrasted with other attempts at grammar induction. Some, typically, use traditional atomic 'grammatical categories' with no inherent information content, mapped in complex ways (which must also be learned) onto a large set of 'grammar rules'. Others 'learn' columns of numbers which could equally well describe the co-occurrence of bird tracks in snow with various garden shrubs. To quote Pustejovsky et al. (1993:354), 'statistical results themselves reveal nothing, and require careful and systematic interpretation by the investigator to become linguistic data.' L is inspired and informed by an independently motivated and respected theory of natural language, and depends for its realization on a corpus of real-world text. Our theoretical understanding of how words combine gives us a principled way into a text corpus; statistical evidence suggests and confirms the behaviour of words and therefore their lexical/grammatical categories. NLP appears currently to be split by civil war between theorists with sound principles but no real data and statisticians with volumes of data but no linguistic principles. There will only be significant progress in NLP when the theory-driven and empiricist approaches respect each other and work together.</Paragraph>
    <Paragraph position="7"> We hope we have shown one way in which this can be done.</Paragraph>
  </Section>
</Paper>