<?xml version="1.0" standalone="yes"?>
<Paper uid="E87-1011">
  <Title>A Multi-Purpose Interface to an On-line Dictionary</Title>
  <Section position="3" start_page="63" end_page="64" type="metho">
    <SectionTitle>
3 System functionality
</SectionTitle>
    <Paragraph position="0"> The motivation for the design described here is divided equally between the diverse nature of MRD-based projects in Cambridge and Lancaster and the unique properties of the particular dictionary that they use. The suitability of the Longman Dictionary of Contemporary English (LDOCE) for research into computational linguistics has been discussed at length elsewhere (see, in particular, Michiels, 1982); below we will outline several projects undertaken in Cambridge as a context for highlighting its particularly useful characteristics insofar as they are relevant to this paper. LDOCE carries special lexical and linguistic information which is useful for a number of natural language processing tasks.</Paragraph>
    <Paragraph position="1"> 1. The dictionary is unique in tagging word senses with grammar codes which provide very elaborate syntactic subcategorisation information; a procedure has been developed for mapping the grammar codes into feature clusters (in the style of e.g. Generalised Phrase Structure Grammar), subsequently to be used by a syntactic parser (Boguraev and Briscoe, 1987, describe this in detail). The transformation program is about to be integrated in a software system for grammar support and development, both as a lexicon generator and as a tool for grammar debugging (Boguraev and Ritchie, 1987).</Paragraph>
    <Paragraph position="2"> 2. The pronunciation information in LDOCE has provided the basis for a study, in the larger context of speech recognition, of the implications of the phonetic structure of the English lexicon for different methods for lexical access (Carter, 1986). Again in the context of speech recognition, we intend to tackle the problem of word identification from a lattice of phonemes by constructing a parser that uses information about both phoneme collocations and syntactic predictions derived from independent analyses of the phonetic and grammar coding fields in the dictionary.</Paragraph>
    <Paragraph position="3"> 3. Furthermore, LDOCE carries special tags, known as subject and box codes, which encode semantic notions like the overall context in which a word sense is likely to appear (e.g. politics, religion, language) and selectional restrictions on verbs, nouns and compound phrases; we intend to use this information for further guidance during the word recognition process. Independently, an algorithm has been developed for analysing the semantic content proper of the dictionary entries by converting the definition texts in LDOCE into fragments of semantic networks (Alshawi, 1987); this opens opportunities for building a comprehensive and robust semantic component which could then be incorporated into any of the projects mentioned above.</Paragraph>
    <Paragraph position="4"> It is clear that in order to make full use of the computerised LDOCE, we need a dictionary access system with proper DBMS functionality, capable of efficient retrieval of entries satisfying selection criteria applying at various levels of linguistic description. The design of the system described here allows precisely such heterogeneous requests. What we offer is a software environment buffering the user from the typically baroque and idiosyncratic format of the raw dictionary source and allowing, via a carefully crafted interface, multiple entry points and arbitrarily complex access paths into the on-line lexical knowledge base.</Paragraph>
  </Section>
  <Section position="4" start_page="64" end_page="64" type="metho">
    <SectionTitle>
4 Requirements for the dictionary database
</SectionTitle>
    <Paragraph position="0"> Three main requirements can be identified if the database is to perform the functions intended for it. Firstly, the source tape of the dictionary must be converted into a format to which fast access can be coupled. This involves, at the very least, overall segmentation of the original character stream into records corresponding to gross lexical categories such as head word, pronunciation and part of speech. This may be a highly complex task, as in Kazman's (1986) project to restructure the text of the OED, or it may be conceptually fairly straightforward, as in the case of LDOCE where considerable segmentation is already present.</Paragraph>
    <Paragraph position="1"> But in either case, given that the on-line dictionary is intended to support more than one application, a more elaborate structuring of the entries' individual records might turn out to be unsuitable for further unforeseen use. Fortunately, it is clear from work with computerised dictionaries in general that once an application has located the relevant fragment of a dictionary entry, local "parsing" into whatever format is needed can be fast and reliable, and can therefore be done "on the fly" by functions which manipulate individual entries on demand and have no permanent effect on the underlying source. Thus we should aim at incorporating the segmented version of the source intact into the database, to serve directly as its "bottom layer" in the sense that all access paths ultimately point to complete dictionary entries, which are then returned as the results of queries.</Paragraph>
    <Paragraph position="2"> Secondly, it should be possible to execute queries involving information of as many different types as possible. Even if the machine-readable source used is a comparatively structured one such as LDOCE, the creation of access paths will involve, for at least some types of information, the non-trivial (but fast) construction of an intermediate and temporary representation by means of the local parsing already mentioned. For example, subcategorisation information is often specified in a rather elliptical form in LDOCE, for the sake of human readability; this must be made explicit by a parsing process, as described in Boguraev and Briscoe (1987). Also, it is desirable to impose a phonologically motivated structure on pronunciations, which are typically given as a string of phonemes and stress markers. This will allow the user to specify a constraint on, say, "the onset of the second syllable of the word", whose position in the phoneme string will not be the same for all words. The straight indexing approach used by e.g. Boguraev and Briscoe (1987) for headword-based access cannot in general provide sufficiently flexible access routes.</Paragraph>
    <Paragraph position="3"> Thirdly, the user or client program should be free to specify different types of constraint in any combination. We cannot assume in advance that information of a given type will always be present in great enough quantities to allow efficient retrieval. For example, if the system is being used by an automatic speech recogniser, then at one point in the signal significant information on pronunciation may be available, but few syntactic or semantic constraints may be present; at another point, the situation may be reversed, with the speech signal itself yielding little phonological information but with an expectation-driven parser providing quite specific higher-level constraints. In each case, the stronger, more specific constraints must be used for access, and the weaker ones only for checking the entries retrieved. To achieve this, the system must clearly be able to estimate in advance what the most efficient search strategy will be. This ability to perform maximally efficient searches given many different kinds of constraint will also be important if the database is being used interactively to investigate properties of the language. If the system's claim to be interactive is to be justified, it must be able to tell the user in advance roughly how long a prospective query would take to evaluate, and roughly how many entries would be returned as a result.</Paragraph>
  </Section>
  <Section position="5" start_page="64" end_page="65" type="metho">
    <SectionTitle>
5 Design and implementation
</SectionTitle>
    <Paragraph position="0"> The design and implementation of the database system described here reflect the three requirements just identified.</Paragraph>
    <Paragraph position="1"> The machine-readable source of LDOCE serves as the bottom layer of the database after undergoing a "lispification" process described in detail in Boguraev and Briscoe (1987). This process preserves all the information, lexical and typographic, on the tape, and involves little restructuring, serving primarily to reformat the source in a bracketed form in which it can be much more easily read by Lisp programs. The link between the user or client program and the lispified dictionary is provided by a pointer file and a constraint file whose nature and motivation will be described below.</Paragraph>
    <Section position="1" start_page="65" end_page="65" type="sub_section">
      <SectionTitle>
5.1 Analysing dictionary entries
</SectionTitle>
      <Paragraph position="0"> Information of six different types is analysed for the construction of access paths: semantic features classifying the meanings of words and their dependents; semantic subject area; grammatical part of speech; grammatical subcategorisation; British English pronunciations; and definition texts. All these types can be mixed together in constructing search queries. Entries can also be accessed by spelling patterns.</Paragraph>
      <Paragraph position="1"> The codes used for the first three of these types of information have a fairly simple structure, and are hence trivial to extract. The fourth, subcategorisation, is indicated by a complex and highly discriminatory set of codes; the extraction of these codes from the elliptical form in which they occur in LDOCE is described in Boguraev and Briscoe (1987). We will therefore discuss here only the structuring of pronunciations and the treatment of definition texts.</Paragraph>
      <Paragraph position="2"> Pronunciations are represented in the dictionary as strings of phonemes and primary and secondary stress markers. Syllable boundaries are not reliably indicated. Therefore, in order to allow the syllable-based access that a speech recogniser would probably require, pronunciation fields are parsed into syllables and, within a syllable, into onset, peak and coda, using the phonotactic constraints given in Gimson (1980) and employing a maximal onset principle (Selkirk, 1978) where these yield ambiguous syllable boundaries. Thus for example the internal syllable boundary in the pronunciation of "constraint" is placed before the "s". The parser used for analysing pronunciations is a special-purpose one whose (very simple) grammar is incorporated into its code. This allows pronunciations to be parsed many times faster than by a general-purpose parser with a declarative grammar.</Paragraph>
      <Paragraph position="3"> It also allows constraints on relationships between syllable constituents to be relaxed when necessary. For example, the LDOCE pronunciation of "bedouin" is /beduin/, which violates the constraint that a syllable whose peak is u (as in "put") cannot have a null coda; this constraint is therefore relaxed to obtain a parse.</Paragraph>
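The syllable parsing described above can be sketched roughly as follows. This is a minimal Python illustration, not the special-purpose Lisp parser of the paper: the vowel set and the table of legal onsets are tiny hypothetical stand-ins for the phonotactic constraints of Gimson (1980), phonemes are written as single ASCII characters rather than LDOCE's phonetic symbols, and stress markers and constraint relaxation are ignored.

```python
# Hypothetical, heavily reduced phoneme inventory (ASCII stand-ins).
VOWELS = set("aeiou@")          # "@" stands in for schwa
LEGAL_ONSETS = {"", "b", "d", "k", "n", "s", "t", "r", "st", "tr", "str"}

def syllabify(phonemes):
    """Split a phoneme string into (onset, peak, coda) triples,
    giving each non-initial syllable the longest legal onset."""
    # Locate the peaks (in this sketch: single vowels).
    peaks = [i for i, p in enumerate(phonemes) if p in VOWELS]
    if not peaks:
        raise ValueError("no syllable peak found")
    syllables = []
    onset_start = 0
    for n, peak in enumerate(peaks):
        onset = phonemes[onset_start:peak]
        if n + 1 == len(peaks):
            coda_end = len(phonemes)
        else:
            # Consonant cluster between this peak and the next one.
            cluster = phonemes[peak + 1:peaks[n + 1]]
            # Maximal onset: try the longest suffix of the cluster first.
            split = next(k for k in range(len(cluster) + 1)
                         if cluster[k:] in LEGAL_ONSETS)
            coda_end = peak + 1 + split
        syllables.append((onset, phonemes[peak], phonemes[peak + 1:coda_end]))
        onset_start = coda_end
    return syllables
```

With the simplified transcription "k@nstrent" for "constraint", the sketch places the internal boundary before the "s", as in the example above; with "beduin" it yields a second syllable whose peak is u and whose coda is empty, the case for which the paper relaxes its constraint.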
      <Paragraph position="4"> The strategy used for indexing entries according to the words in their definition texts was designed to reflect the fact that it is the semantic content of these words that is likely to be of interest to the user. This has two main consequences: (1) It is more appropriate to take root forms of words as keys than to treat inflectional variants differently, because it is the root that holds most of the semantic content. Indeed, the inflection used with a particular word often depends on the largely arbitrary choice of syntactic constructions used in the definition.</Paragraph>
      <Paragraph position="5"> Thus for example, entries whose definitions contain any of the words "film", "films" and "filmed" should all be indexed under "film".</Paragraph>
      <Paragraph position="6"> (2) Closed class words are unlikely to be useful as keys because their semantic content is limited and often highly context-dependent. In addition, many of them occur too often to be sufficiently discriminating for efficient lookup. Therefore only open class words are made available as keys.</Paragraph>
      <Paragraph position="7"> The task of deriving root forms of words is made much easier by the fact that LDOCE's definition texts are constructed largely from a set of two thousand basic words. When other words are used, they (or, in the case of inflectional variants, their root forms) are shown in a special font. Accurate root extraction for words not so marked can therefore be accomplished simply by stripping off affixes (which are themselves in the basic word list) and applying a few simple rules for spelling changes until a basic word is found. All irregular forms of basic words are stored explicitly.</Paragraph>
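The root extraction just described might look roughly like the sketch below. It is only an illustration in Python under stated assumptions: the word list, the irregular-form table, the suffix list and the spelling rules are small hypothetical samples (the text gives none of the actual rules), and only a single suffix is stripped rather than repeating until a basic word is found.

```python
# Hypothetical fragments: the real LDOCE basic word list has about
# two thousand entries, and the actual affix and spelling rules are
# not given in the text.
BASIC_WORDS = {"film", "make", "stop", "carry", "camera"}
IRREGULAR = {"made": "make"}            # irregular forms stored explicitly
SUFFIXES = ["ing", "ed", "es", "s"]

def candidates(stem):
    """Yield the stem plus a few simple spelling-change variants."""
    yield stem
    yield stem + "e"                              # mak(ing) -> make
    if len(stem) > 2 and stem[-1] == stem[-2]:
        yield stem[:-1]                           # stopp(ed) -> stop
    if stem.endswith("i"):
        yield stem[:-1] + "y"                     # carri(es) -> carry

def root_form(word):
    """Return the basic word underlying `word`, or None if not found."""
    word = word.lower()
    if word in BASIC_WORDS:
        return word
    if word in IRREGULAR:
        return IRREGULAR[word]
    for suffix in SUFFIXES:
        if word.endswith(suffix):
            for cand in candidates(word[:-len(suffix)]):
                if cand in BASIC_WORDS:
                    return cand
    return None
```

Under these assumptions "films" and "filmed" both map to the key "film", so an entry containing either form would be indexed under the same root.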
      <Paragraph position="8"> Distinguishing open and closed class words is also straightforward; a list of closed class words was derived by performing a database lookup using those grammar codes and categories that represent closed classes.</Paragraph>
    </Section>
    <Section position="2" start_page="65" end_page="65" type="sub_section">
      <SectionTitle>
5.2 Constructing access paths
</SectionTitle>
      <Paragraph position="0"> Once the relevant information has been extracted from an entry, constructing access paths is straightforward in the grammatical, semantic and definition text cases: a list of entry pointers is constructed for every code and every suitable definition word found in the dictionary. Pronunciations, however, are treated differently.</Paragraph>
      <Paragraph position="1"> To achieve flexibility and efficiency, a pointer list is formed for every distinct syllable in every position in which it occurs (e.g. second syllable in a three-syllable word).</Paragraph>
      <Paragraph position="2"> When the whole dictionary has been analysed, a pointer file is created containing all the entry pointer lists and, just before each list, its length. As described below, this allows the system to estimate the work involved in evaluating a query without actually having to read the (sometimes very long) list itself.</Paragraph>
      <Paragraph position="3"> The next stage is to construct the constraint file.</Paragraph>
      <Paragraph position="4"> This file takes the form of a discrimination net which links every possible constraint on an entry (e.g. a subject area, a grammar code or a constituent of a syllable) to one or, in the pronunciation case, several lists in the pointer file.</Paragraph>
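One way to picture the pointer file and the constraint file is the in-memory sketch below. It is an assumption-laden Python illustration, not the system's actual file layout: entry pointers are plain integers, the entry fields ("codes", "def_roots", "syllables") are hypothetical names, and a flat dictionary from constraint key to offset stands in for the discrimination net.

```python
from collections import defaultdict

def build_pointer_lists(entries):
    """Map each constraint key to a sorted list of entry pointers.

    `entries` maps an entry pointer (here just an integer) to a dict with
    hypothetical fields: "codes" (grammar/semantic codes), "def_roots"
    (open class root forms from the definition) and "syllables".
    """
    lists = defaultdict(set)
    for pointer, entry in entries.items():
        for code in entry["codes"]:
            lists[("code", code)].add(pointer)
        for root in entry["def_roots"]:
            lists[("def", root)].add(pointer)
        count = len(entry["syllables"])
        # One list per distinct syllable in each position it occurs in,
        # e.g. second syllable of a three-syllable word.
        for position, syllable in enumerate(entry["syllables"], start=1):
            lists[("syl", syllable, position, count)].add(pointer)
    return {key: sorted(ptrs) for key, ptrs in lists.items()}

def write_pointer_file(lists):
    """Lay the lists out as one flat sequence, each preceded by its length.

    Returns the flat "file" and an index from constraint key to offset,
    a stand-in for the discrimination net of the constraint file.
    """
    flat, index = [], {}
    for key in sorted(lists, key=repr):
        index[key] = len(flat)
        flat.append(len(lists[key]))     # length first ...
        flat.extend(lists[key])          # ... then the pointers themselves
    return flat, index

def list_length(flat, index, key):
    """Read only the stored length, not the list itself."""
    return flat[index[key]] if key in index else 0
```

Because the length sits immediately before each list, `list_length` can report the size of a candidate set by reading a single value, which is what makes the cheap estimates of a partial search possible.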
    </Section>
    <Section position="3" start_page="65" end_page="65" type="sub_section">
      <SectionTitle>
5.3 Constructing search queries
</SectionTitle>
      <Paragraph position="0"> A menu-driven graphical interface is provided by means of which the user can construct a search query in the form of a tree whose terminal nodes are constraint values, disjunctions of them, or wild cards. The menus are derived automatically from the constraint file, so that only queries with some chance of being satisfied can be constructed. For example, if the user is constructing a specification of a syllable, the tree at one point may be as in Figure 1.</Paragraph>
      <Paragraph position="1">  If the user selects the CODA node, the resulting menu, shown in Figure 2, allows him to specify the coda "pst", but not, for example, "psm". (In this menu, and in terminal nodes of the PRONUNCIATION subtree of Figure 1, "*" matches any sequence of symbols; "?" matches any single symbol; and all other symbols have the phonetic values defined for them in LDOCE.)</Paragraph>
      <Paragraph position="2">  A tree can be constructed either from a WORD node alone, or by instructing the system to build a tree from the entry for a specified word, and then editing it. Once the tree is built, either a partial search (to gather statistics) or a full search (to retrieve entries) can be requested.</Paragraph>
      <Paragraph position="3"> In a partial search, the system follows each constraint to the pointer list(s) it leads to, and sums the lengths of these lists (as recorded explicitly in the pointer file) to display the approximate number of dictionary entries that satisfy it. It also indicates which constraints it would use to look up candidate entries in a full search, which ones it would merely apply as tests to those candidates, and, to allow the user to decide whether or not to order a full search, about how long the process would take. It makes the lookup/test choice using figures for the expected time taken to read (a) a pointer from the constraint file and (b) a complete entry from the dictionary. The most efficient search strategy involves using the most specific few constraints as lookup keys (more specific keys ultimately yielding fewer entries). The optimal number of constraints to use is found by balancing the number of pointers that will have to be read, which increases with the number of lookup keys, against the expected number of entries that will have to be read, which decreases. (An entry will only be read if there is a pointer to it in every pointer list. Therefore, if $n$ lookup keys are used, returning pointer lists of lengths $L_1, L_2, \ldots, L_n$, then the expected number of entries to be read, assuming statistical independence between lists, is $L_1 L_2 \cdots L_n / D^{n-1}$, where $D$ is the number of entries in the dictionary. This decreases with $n$ because $L_i$ cannot exceed $D$, and is in fact normally very much smaller.)</Paragraph>
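The balancing of pointer reads against entry reads can be illustrated with a short Python sketch. It assumes a simple additive cost model (time per pointer times pointers read, plus time per entry times expected entries read); the function names, the cost figures and the dictionary size used in the example are hypothetical, and the paper does not state that its estimator has exactly this form.

```python
import math

def expected_entries(lengths, dictionary_size):
    """Expected number of entries present in every list, assuming
    independence: L1 * L2 * ... * Ln / D ** (n - 1)."""
    return math.prod(lengths) / dictionary_size ** (len(lengths) - 1)

def plan_search(lengths, dictionary_size, pointer_cost, entry_cost):
    """Choose how many of the most specific constraints to use as keys.

    `lengths` holds the pointer list length for each constraint;
    `pointer_cost` and `entry_cost` are the expected times to read one
    pointer and one complete entry (hypothetical units).
    Returns (number_of_lookup_keys, estimated_time).
    """
    ordered = sorted(lengths)            # shortest list = most specific
    best = None
    for n in range(1, len(ordered) + 1):
        keys = ordered[:n]
        time = (pointer_cost * sum(keys)
                + entry_cost * expected_entries(keys, dictionary_size))
        if best is None or best[1] > time:
            best = (n, time)
    return best
```

For instance, with an assumed dictionary of 40000 entries, list lengths 200, 5000 and 20000, a pointer cost of 1 and an entry cost of 50, two lookup keys give the lowest estimate: the third list would cost more in pointer reads than it saves in entry reads, so that constraint is better applied as a test.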
      <Paragraph position="4"> In a full search, these statistics and choices are not only displayed but are also acted on. The pointer lists for the lookup constraints are intersected, the number of pointers resulting is displayed and, at the user's option, the corresponding entries are read from the dictionary, the test constraints are applied to them, and the surviving entries are displayed. Applying tests to a dictionary entry involves reanalysing the relevant parts of it in the same way as when the database is constructed.</Paragraph>
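The full search can be sketched in the same spirit. The Python fragment below is a hedged illustration only: `read_entry` and the test predicates are hypothetical callables standing in for dictionary access and for the reanalysis of entries, and at least one lookup list is assumed to be supplied.

```python
def full_search(lookup_lists, read_entry, tests):
    """Intersect the lookup pointer lists, then filter by the test constraints.

    `lookup_lists`: pointer lists for the constraints chosen as lookup keys.
    `read_entry`:   function fetching a complete entry for a pointer.
    `tests`:        predicates applied to each retrieved entry.
    """
    # Start from the shortest list so the working set stays small.
    ordered = sorted(lookup_lists, key=len)
    pointers = set(ordered[0])
    for plist in ordered[1:]:
        pointers.intersection_update(plist)
    survivors = []
    for pointer in sorted(pointers):
        entry = read_entry(pointer)      # the expensive step
        if all(test(entry) for test in tests):
            survivors.append(entry)
    return len(pointers), survivors
```

The first value returned corresponds to the number of pointers displayed after intersection; the second holds the entries that survive the tests.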
    </Section>
  </Section>
  <Section position="6" start_page="65" end_page="67" type="metho">
    <SectionTitle>
6 An example
</SectionTitle>
    <Paragraph position="0"> As an example, suppose the user wishes to see all entries for three-syllable nouns which describe movable solid objects, whose second syllable has a schwa as peak, and whose third syllable has a coda that is a voiced stop. He constructs the tree in Figure 3 overleaf, and selects the "partial search" option. This returns the information shown in Figure 4.</Paragraph>
    <Paragraph position="1">  Because of the expected large number of entries in the result and the time that would be taken to read them, the user decides to look only at the entries for such words whose definitions contain the word "camera". He adds the relevant constraint to the tree (the system checking, as he does so, that "camera" is a valid key) and orders another partial search. This time, the statistics are more manageable. A full search is therefore ordered, in which the definition word "camera" is used as the only lookup key, and the other constraints are this time all used as tests.</Paragraph>
    <Paragraph position="2"> This returns the entries for the words "clapperboard" and "Polaroid", shown in Figure 5.</Paragraph>
    <Paragraph position="3"> [Figure 5: LDOCE entries returned by the full search. clap.per.board n: (when starting to film a scene for the cinema) a board on which the details of the scene to be filmed are written, held up in front of the camera. Po.lar.oid: 1 a material with which glass is treated in order to make light shine less brightly through it, used in making sunglasses, car windows, etc. 2 [C] also Polaroid camera: a type of camera that produces a finished photograph only seconds after the picture has been taken.]</Paragraph>
  </Section>
</Paper>