<?xml version="1.0" standalone="yes"?> <Paper uid="C86-1104"> <Title>Nagao, M. et al. &quot;An Attempt to Computerize Dictionary</Title> <Section position="2" start_page="0" end_page="442" type="metho"> <SectionTitle> 2. Preliminary Design Considerations </SectionTitle> <Paragraph position="0"> Linguists and lexicographers are latecomers to the field of database applications. Database software has been available since the early 1960s. The early 1970s brought a wide variety of commercial products and a consolidation on the conceptual side, which ultimately led to standardization, design philosophies, and specifications of &quot;normal forms&quot;. At that time lexicographers still used the concept of an archive when talking about new technologies, as did Barnart (1973), Chapman (1973), and Lehmann (1973) at the 1972 International Conference on Lexicography in English.</Paragraph> <Paragraph position="1"> Similarly, in the late 1970s, we witnessed preparations for a Stanford Computer Archive of Language Materials. There is nothing wrong with the idea of an archive. But a database is something different. By now, the expression &quot;database&quot; should only be used as a technical term. Perhaps &quot;data bank&quot; may be used instead of &quot;database&quot; when talking about files of data, or archives in a conventional sense. The Association for Literary and Linguistic Computing may have had this clarification in mind when naming its specialist group &quot;Structured Data Bases&quot;.</Paragraph> <Paragraph position="2"> Although hierarchical data models and network models had been available since the early 1960s, and relational architectures since the early 1970s (Codd 1970), software implementations were not generally accessible in university computing centres due to high cost and lack of special support. Although the Münster computing centre had the hierarchical IMS software, a product of IBM, it was not made available for our project. 
Looking back from today, that may not have been a handicap, for at least two reasons: lexical relationships are only rarely hierarchical in a natural sense, and, more importantly, hierarchical systems do not have a common standard. There is no migration path from one software product to another. Since a Shakespeare database will have a rather long life cycle, and was meant to be a model for similar projects, the requirement of a standard model seemed to be imperative. The process of standardization has been proceeding more rapidly for the CODASYL network model than for any other architecture.</Paragraph> <Paragraph position="3"> In the early 1980s this was the only model that fulfilled our requirements, and this is basically true even today.</Paragraph> <Paragraph position="4"> Beginning in the early 1980s, lexical symposia and conferences had an ample share of papers reporting on ongoing research which used the database concept in a variety of ways. In 1981 Nagao et al. reported on &quot;An Attempt to Computerize Dictionary Data Bases&quot; (1982). At the same conference a University of Bonn group (Brustkern and Hess 1982) presented &quot;The Bonnlex Lexicon System&quot;, which two years later evolved into a &quot;Cumulated Word Data Base for the German Language&quot; (Brustkern and Schulze 1983). The list of similar projects could easily be extended.</Paragraph> <Paragraph position="5"> One might have expected that the logical design of lexical databases would have built on structural ~ where we typically find entities and relationships, and in general, set theoretic notions, which can be translated directly into conceptual data-structures.</Paragraph> <Paragraph position="6"> Surprisingly, in many designs, linguistic considerations do not seem to have played a major role. Instead, the authors simulate conventional lay-out and typesetting arrangements of printed dictionaries. 
An example is the widespread dictionary practice of printing one &quot;Headword&quot; in bold type and then using special symbols, such as the tilde, to refer to the headword, or parts of it, thus saving space for the treatment of further lexical items with the same spelling. Nagao et al. (1982) very faithfully transferred this and other lay-out details into their design. But should a conventional &quot;Headword&quot; and its dependencies be a serious candidate for a database entity? Are the reasons that led dictionary publishers to accept certain lay-out techniques at all relevant for an electronic database? These questions seem not to have been raised. The design seems to have become a paradigm case of an imitation design, where a new technology replicates design features of an older technology.</Paragraph> <Paragraph position="7"> The basic misunderstanding is the false identification of a mere presentation in a printed dictionary with an underlying lexical information structure.</Paragraph> <Paragraph position="8"> If the &quot;Headword&quot; is not a relevant database entity, which entity should be taken instead? There is only one serious candidate: the lemma. The lemma is a well defined linguistic notion. It is also well known in computational work due to various automatic or semi-automatic lemmatization algorithms. It is an abstract notion in the sense that printed dictionaries and database systems need a lemma-name to refer to it. Language specific conventions usually govern the choice of a lemma-name. Latin verbs, for example, are customarily lemmatized using the first person singular present form as lemma-name. A lemma is the set of all its inflected word-forms. It thus comprises a complete inflectional paradigm. Some lemmata have defective paradigms or suppletive paradigms. Conventional dictionaries quite often include paradigmatic information in their front matter. The user has to relate specific cases to these examples. 
A database can relate these explicitly. A natural way to do this is by a one-to-many relationship between lemma and word-form. In an author dictionary word-forms will be further related to the text, and its internal structure. A machine-readable dictionary is just a starting point for a structured lexical database. In the Bonn &quot;Word Data Base for the German Language&quot; (Brustkern and Schulze 1983b) there is but one database entity, &quot;Lexical Entry&quot;, which seems to correspond to the lemma rather than to a &quot;Headword&quot;. The authors speak about the &quot;microstructure&quot; and the &quot;macrostructure&quot; with respect to &quot;Lexical Entries&quot;, but only the former is discussed in detail. The latter is only mentioned once: &quot;Special characteristics of the macrostructure (other than alphabetical order) are to be made explicit in the logical structure of the data base&quot; (Brustkern and Schulze 1983b). &quot;Macrostructure&quot; is rarely visible in a conventional alphabetic dictionary, although we are used to &quot;synonyms&quot; and &quot;antonyms&quot;, dictionary &quot;senses&quot;, and labels that identify technical jargon, or special terminologies in individual dictionary entries. In the design of a lexical database it is useful to make these various relations between lemmata explicit. In this manner a user gets more information than by consulting a printed dictionary. The information he gets is related and structured in unexpected ways.</Paragraph> </Section> <Section position="3" start_page="442" end_page="442" type="metho"> <SectionTitle> 3. A Sample Schema </SectionTitle> <Paragraph position="0"> There are various ways to approach the problem of schema design. For the Shakespeare Dictionary Morphology Database, now an integrated part of the overall architecture, both object-class methods and query-assertion methods led to the current schema (cf. Figure 1). 
There are four base object-classes (entities): lemmata, segments, allomorphs, and morphemes, having cardinality values between 2,500 and 40,000 records. Queries were to allow for a direct retrieval on three levels: the conventional level of the lemma, the level of allomorphs, and the morphemic level. This is achieved by a virtual record, defined as a subschema (cf. Figure 2). In this way the database design mirrors a structural morphological analysis directly. The concept of a morphological family, defined as a set of lemmata which has at least one morpheme in common, is thus immediately accessible for database queries.</Paragraph> <Paragraph position="1"> The ultimately Latin prefix { IN- } has, for example, database links to allomorphs such as { im- } in the lemma impure, { il- } in the lemma illegitimate, or { ir- } in the lemma irregular. In Shakespeare's vocabulary there are almost 200 lemmata which belong to this { IN- } family. A statistical survey of morphological families in Shakespeare reveals characteristic &quot;family types&quot;. Since morphological descriptions are directly accessible for a study of patterns such as nominal compounds, conversions, or derivations, listings of morphologically similar lemmata supplement family listings in a study of the morphological articulation of Shakespeare's vocabulary.</Paragraph> </Section> <Section position="4" start_page="442" end_page="444" type="metho"> <SectionTitle> VIRTUAL RECORD SECTION. VIRTUAL RECORD MORPHEME-TO-LEMMA; BASE RECORD IS SEGMENT; MORPHEME OWNS ALLOMORPH VIA MORPHEME-TO-ALLOMORPH; ALLOMORPH OWNS SEGMENT VIA ALLOMORPH-TO-SEGMENT; LEMMA OWNS SEGMENT VIA LEMMA-TO-SEGMENT. </SectionTitle> <Paragraph position="0"> The database has access to various additional and specialized kinds of morphological information such as sound symbolism, popular etymology, or contamination. Furthermore, morphological information is by design linked with etymological information. 
Morphological families which are etymologically related can be grouped together under one etymon. One example of such an etymological grouping is given in Figure 4. The phenomenon of etymologically homogeneous or disparate word-formation, which has traditionally been of some interest in Shakespearean studies, can be analysed directly. These materials are currently being prepared for the forthcoming first volume of SHAD.</Paragraph> <Paragraph position="1"> Any lexical database design should account for external links with other lexical databases (Neuhaus 1985). Here again, a common standard is essential. The lemma record is a natural interface in these external relations.</Paragraph> <Paragraph position="2"> Standardization of the lemma concept may therefore be a first step towards systematic database connections.</Paragraph> <Paragraph position="3">
n.    untruth          4   Oldeng.
n.    true-love       10   800
n.    true            36   1300
pp.   true-hearted     3   1471
pp.   truer-hearted    1   1471
n.    truepenny        1   1519
pp.   true-born        2   1589
pp.   true-anointed    1   1590
pp.   true-derived     1   1592
pp.   true-disposing   1   1592
pp.   true-divining    1   1593
pp.   true-telling     1   1593
pp.   true-devoted     1   1594
adj.  honest-true      1   1596
pp.   true-begotten    1   1596
pp.   true-bred        3   1596
pp.   true-fixed       1   1599
pp.   true-meant       1   1604
</Paragraph> </Section> </Paper>