<?xml version="1.0" standalone="yes"?>
<Paper uid="E87-1011">
  <Title>A Multi-Purpose Interface to an On-line Dictionary</Title>
  <Section position="2" start_page="0" end_page="63" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> A growing mass of work at present, both within the narrower field of computational linguistics and in the wider context of building knowledge-based systems, is focussed on making use of the lexical resources to be found in a number of (monolingual) dictionaries of the style exemplified by e.g. The Longman Dictionary of Contemporary English, The Collins English Dictionary, or Webster's Seventh New Collegiate Dictionary. These contain a wealth of information relevant to a wide range of natural language processing functions -- a fact which is hardly surprising, given that such sources typically (and almost by definition) contain the results of substantial efforts to collate and analyse data about real language, to elicit collocational and distributional properties of words and to apply common principles of defining their meaning.</Paragraph>
    <Paragraph position="1"> The availability of dictionary sources on-line, in the form of machine-readable dictionaries (henceforth MRDs) and encyclopaedias, makes it possible to view these as repositories of large amounts of information, linguistic and extra-linguistic, which can be brought to bear at various points of the natural language understanding process. Developments in hardware, as well as research in computational linguistics, offer the technology both to process lexical resources and to extract from them what is relevant to computer programs concerned with various natural language processing activities. A number of recent projects have extracted data from publishers' tapes and subsequently used it to support activities such as syntactic parsing, speech synthesis, lexical disambiguation, semantic interpretation in context, spelling correction and machine translation. The common denominator in these projects is that the end product incorporates in some form appropriate fragments derived from the machine-readable source.</Paragraph>
    <Paragraph position="2"> There are essentially two different modes in which MRDs can be used (see Boguraev, 1987, for more details). The predominant technique to date involves an arbitrary amount of pre-processing, typically in batch, of the on-line source. Those parts of the dictionary entries which contain useful data for the task at hand are extracted and suitably represented in a form directly usable by a client program. Such a model of dictionary use does not in any way rely on the original source being available at the time the language processing application is active, and thus a batch derivation of the appropriate information is a suitable way of transforming the raw data into a usable repository of lexical knowledge.</Paragraph>
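The batch-derivation model can be sketched as follows. The entry format (pipe-separated fields) and field names are invented for illustration; they stand in for whatever fragments of a real dictionary entry a client program might need.

```python
# Sketch of the batch pre-processing model: the raw source is scanned
# once, the fields useful to a client program are extracted, and the
# result is a derived lexicon loadable without the original tape.
# The input format (headword|pos|definition) is a simplifying assumption.

def extract_entry(raw_line):
    """Pull out only the fields the client application needs."""
    headword, pos, definition = raw_line.split("|", 2)
    return {"headword": headword, "pos": pos, "definition": definition}

def batch_derive(raw_entries):
    """One-off batch pass over the source; once this derived lexicon
    exists, the on-line source is no longer consulted at run time."""
    return {e["headword"]: e for e in (extract_entry(r) for r in raw_entries)}

lexicon = batch_derive([
    "abacus|n|a frame with beads for counting",
    "abate|v|to become less intense",
])
```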
    <Paragraph position="3"> But the reality of trying to adapt information, originally packaged according to lexicographic and typographic conventions for visual presentation and not at all intended for automated natural language processing, suggests a different model of dictionary use.</Paragraph>
    <Paragraph position="4"> The non-trivial task of developing suitable procedures for pre-processing of the machine-readable source typically requires careful analysis of the properties of the particular MRD, and is best aided by having fast interactive access to appropriate fragments of it.</Paragraph>
    <Paragraph position="5"> In addition, many research projects of a more experimental nature focus on investigating the ways in which the availability of an MRD can aid the development of particular natural language processing systems. The assumption is that an analysis of the accumulated data in the dictionary will reveal regularities which can then be exploited for the task at hand. Just one example illustrating this point, from a number of projects currently under way in Cambridge, is the work of Alshawi (1987), who has analysed definition texts across an entire dictionary to produce a 'definition grammar' together with an associated technique for parsing the natural language descriptions of words into semantic structures.</Paragraph>
    <Paragraph position="6"> Such projects depend critically not only on the availability of a machine-readable equivalent of a published dictionary, but also on a software system capable of providing fast interactive access into the on-line source through various access routes. Operational natural language processing systems clearly will have well-defined requirements as far as their lexicons are concerned, and once the format of lexical resources has been settled, retrieval of individual entries can be implemented fairly efficiently using standard computational and linguistic techniques (see e.g. Russell et al., 1986). The placing of a dictionary on-line, however, with the intention of making it available to a number of different research projects which need to locate and collate dictionary samples satisfying a wide range of constraints, requires an efficient and flexible system for management and retrieval of linguistic data.</Paragraph>
    <Paragraph position="7"> This is not the computationally straightforward issue it appears to be, as conventional database management systems (DBMS) are not well suited for on-line dictionary support, particularly when the entire dictionary is viewed as a lexical knowledge base, more complex in structure and facing more taxing demands in a natural language research environment. This paper addresses the problem in greater detail, by placing it into the wider context of research into computational linguistics and highlighting those issues which pose a challenge for the current DBMS wisdom. We propose a solution adequate to handle most of the lexical requirements of current systems, which is generalisable to a range of MRDs, and describe a particular implementation for single-user workstations used in a number of on-going research projects at the universities of Cambridge and Lancaster.</Paragraph>
    <Paragraph position="8"> 2 The nature of the problem
Several factors put the task of mounting a machine-readable dictionary as a proper development tool beyond the scope of current DBMS practice and make its conversion into a database of e.g. a standard relational kind quite difficult.</Paragraph>
    <Paragraph position="9"> Firstly, there is the nature of the data in a dictionary: typically, it contains far too much free text (definitions, examples, cross-reference pointers, glosses on usage and so forth) to fit easily into the concept of structured data. On the other hand, the highly structured and formalised encoding of other types of information (found in e.g. the part of speech, syllabification or pronunciation fields) makes a dictionary equally unsuitable for on-line access by information retrieval methods.</Paragraph>
    <Paragraph position="10"> The second factor is due to the nature of the only source of machine-readable dictionaries so far available, namely the publishers' typesetting tapes, originally constructed for the production of a printed version.</Paragraph>
    <Paragraph position="11"> The organisation of data there, aimed at visual presentation, carries virtually no explicit structure; a tape is simply a character stream containing an arbitrary mixture of typesetting commands and real data. This not only introduces the difficult problem of 'parsing' a dictionary entry (addressed in detail by e.g. Kazman, 1986), but also raises the issue of devising a suitable representation for the potentially huge amount of linguistic data; one which does not limit in any way the language processing functions that could be supported or constrain the complexity of the computational counterpart of a dictionary entry.</Paragraph>
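The character-stream nature of a typesetting tape can be illustrated with a small sketch. The font-change command syntax used here (backslash codes such as \bf, \rm, \it) is purely an illustrative assumption; real publishers' tapes used proprietary codes, but the recovery problem is the same: structure is implicit in the typography.

```python
import re

# A typesetting tape is a flat character stream mixing font-change
# commands with real data. Segmenting it into (font, text) pieces is a
# first step towards recovering the structure that the typography
# encodes (e.g. bold headword, roman definition, italic example).

def segment_entry(stream):
    """Split a raw entry into (font, text) pieces."""
    pieces = []
    font = "rm"  # assume roman until a command says otherwise
    for token in re.split(r"(\\[a-z]+ )", stream):
        if token.startswith("\\"):
            font = token[1:].strip()   # a typesetting command: switch font
        elif token:
            pieces.append((font, token))  # real data, tagged with its font
    return pieces

pieces = segment_entry(r"\bf abate \rm to become less intense \it the storm abated")
```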
    <Paragraph position="12"> Finally, there is the nature of the data structures themselves. A text processing application, typically written in Lisp or Prolog, requires that its lexical data is represented in a compatible form, say Lisp s-expressions of arbitrary complexity. Therefore, even if we choose to remain neutral with respect to representation details, we still face the problem of interfacing to a vast number of symbolic s-expressions, held in secondary storage. This problem arises from the unsuitability of conventional data models for handling the complex data structures underlying any sophisticated symbolic processing. Partly, this is due to the inherent restrictions such models impose on the class of data structure they can represent easily -- namely records of fixed format. But more importantly, conventional database systems make strong assumptions about the status and use of data they have to hold: databases are taken to consist of a large number of data records taken from a small number of rigidly defined classes. It is not clear that a lexical 'knowledge base', derived from a dictionary and intended to support a wide range of language processing applications, fits this model well.</Paragraph>
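The mismatch with fixed-format records can be made concrete with an s-expression-style lexical record, rendered here as nested Python lists rather than Lisp; the field names and layout are invented for illustration.

```python
# An s-expression-style lexical record of arbitrary nesting depth.
# Field names and layout are invented; the point is that senses,
# glosses and cross-references nest to arbitrary depth, which a
# fixed-format database record does not accommodate.
entry = ["abate",
         ["pos", "v"],
         ["senses",
          ["sense", 1, ["def", "to become less intense"],
                       ["example", "the storm abated"]],
          ["sense", 2, ["def", "to reduce in amount"],
                       ["xref", "abatement"]]]]

def count_senses(e):
    """Walk the nested structure; there is no fixed column to index."""
    senses = next(f for f in e[1:] if f[0] == "senses")
    return sum(1 for s in senses[1:] if s[0] == "sense")
```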
    <Paragraph position="13"> Some solutions to these problems will no doubt be offered by dedicated efforts to develop special-purpose data models, capable of computationally representing a dictionary and amenable to flexible and efficient DBMS support. The work, at the University of Waterloo, on computerising the Oxford English Dictionary (Tompa, 1986) is a good example here; similarly, the desire to be able to mount computerised dictionaries on-line for in-house research motivates Byrd's work on a general-purpose dictionary access method (Byrd et al., 1986). In the short run, alternative approaches reduce the complexity of the problem by limiting themselves to applying the machine-readable source of a dictionary to a small class of similar tasks, and building customised interfaces offering relatively narrow access channels into the on-line data. Thus IBM's WordSmith system (Byrd and Chodorow, 1985) is concerned primarily with providing a browsing functionality which supports retrieval of words 'close' to a given word along the dimensions of spelling, meaning and sound, while a group at Bell Labs has several large dictionaries on-line used only for research on stress assignment (Church, 1985).</Paragraph>
    <Paragraph position="14"> Alshawi et al. (1985) have used a machine-readable source directly for syntactic analysis of texts; however, the approach taken there -- namely that of simple pre-indexing by orthography -- does not generalise easily for applications which require the rapid locating and retrieval of entries satisfying more than one selection criterion.</Paragraph>
  </Section>
</Paper>