<?xml version="1.0" standalone="yes"?>
<Paper uid="C86-1103">
  <Title>A Word Database for Natural Language Processing</Title>
  <Section position="3" start_page="0" end_page="436" type="metho">
    <SectionTitle>
2 Description of the Dictionary
</SectionTitle>
    <Paragraph position="0"> Within such an environment, a fairly large and detailed dictionary is needed. Aspects of its design, its structure in the database, and the editing and querying facilities will be discussed (cf. also Barnett (1985)). The dictionary is expected to reach some 20,000 entries within the scope of the project; its current size is some 12,000 entries.</Paragraph>
    <Section position="1" start_page="435" end_page="435" type="sub_section">
      <SectionTitle>
2.1 Word Database
</SectionTitle>
      <Paragraph position="0"> Because we must be able to handle a large number of words in this project, we felt that it would be necessary to administer them in a more appropriate form than the usual file organization, and that a relational database would be the best tool for dealing with lexical information because of the following advantages:  * concurrency capabilities, which prevent users working on the same table from getting in each other's way; * and, within the realm of this project, the possibility to link to the Natural Language Analyzer.</Paragraph>
    </Section>
    <Section position="2" start_page="435" end_page="435" type="sub_section">
      <SectionTitle>
2.2 Scope
</SectionTitle>
      <Paragraph position="0"> The scope of the information contained in our dictionary is geared towards the processing of natural language by computer. Lexical information must therefore be more detailed and more explicit than in standard dictionaries intended for humans. Also, a computer dictionary is of no value unless it matches the grammar and the needs of the semantic processing. We started with the coding of morphological and syntactic information, since we felt we were on rather stable ground there. We will report on some of the difficulties we encountered - many of them not unknown to theoretical linguistics - in the next section.</Paragraph>
      <Paragraph position="1"> Semantic information is coded primarily in the form of meaning rules, but we have not included these in our lexical database yet, as we are still experimenting with different kinds of information and representations before we go to large-scale coding. We also hope that, at least to some extent, the acquisition of such information can be automated (cf. the approach taken by Wirth (1984)).</Paragraph>
    </Section>
    <Section position="3" start_page="435" end_page="436" type="sub_section">
      <SectionTitle>
2.3 Sources
</SectionTitle>
      <Paragraph position="0"> For the purpose of our particular application, we need to cover the vocabulary occurring in German traffic law.</Paragraph>
      <Paragraph position="1"> However, to meet the goal of general applicability, it is also necessary to include the core of the general German vocabulary. We will therefore try to code the relevant legal words based on texts from this very domain. In addition - and this is the greater problem - we must try to define which words pertain to the common vocabulary.</Paragraph>
      <Paragraph position="2"> As a first step, we have compiled a preliminary list of the 4000 most frequent German words from an existing frequency dictionary (Meier (1967)) and news texts.</Paragraph>
      <Paragraph position="3"> Later, we will make frequency counts of representative samples of texts to arrive at a more reliable list of words.</Paragraph>
      <Paragraph position="4"> * IBM Germany has a dictionary of 70,000 entries containing morphological and hyphenation information.</Paragraph>
      <Paragraph position="5">  * The vocabulary of the application area, i.e. from the legal domain, stems from the following sources: * A collection of relevant court decisions (from our study partner), * A number of accident descriptions collected from newspapers, * A few word lists used for document retrieval from both the Legal and Public Relations departments of IBM Germany.</Paragraph>
      <Paragraph position="6"> * We plan to investigate to what extent machine-readable dictionaries or legal texts can be used for an automatic or semi-automatic acquisition of lexical and grammatical information and of common-sense knowledge.</Paragraph>
      <Paragraph position="7"> 2.4 Layout of the Dictionary Relation  In our word database, every word constitutes an entry, and most columns in the entry contain information concerning a particular word. Even though semantic aspects are not coded in this particular version, one may regard the codes as a representation of a word's morphological and syntactic meaning. Some words have more than one entry: coding multiple entries becomes necessary when different grammatical feature sets have to be assigned to one lemma. All words are contained in a single table or relation. One could also envisage a separate table for every part of speech; however, this would be rather inconvenient, as it would be impossible to compare grammatical phenomena across different categories. Also, it may be desirable to look at words of the same root but belonging to different parts of speech. With this necessity in mind, we designed an overall, general relation which would contain all words. In order to treat the words individually and according to their specific needs, a so-called &quot;view&quot; was defined for each part of speech. The present structure of the relation is described in Figure 1.</Paragraph>
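      <Paragraph position="8"> The single-relation design with one view per part of speech can be illustrated by a minimal sketch. It uses Python's sqlite3 module as a stand-in for SQL/DS; the column names, the letter codes, and the sample rows are assumptions made for illustration and are not the fields of Figure 1.

```python
import sqlite3

# Hypothetical column names and codes; the actual fields are those of Figure 1.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE word ("
    " lemma TEXT NOT NULL,"        # the lemma may not be empty
    " word_class TEXT NOT NULL,"   # part of speech
    " stem TEXT,"
    " government TEXT,"            # maximal complement code
    " obligatory TEXT)"            # minimal complement code
)

# One lemma, three entries: a separate row per grammatical feature set.
conn.executemany(
    "INSERT INTO word VALUES (?, ?, ?, ?, ?)",
    [
        ("überlegen", "VERB", None, "NDA", "NA"),           # 'to cover', stem left open
        ("überlegen", "VERB", "überleg", "NDA", "NA"),      # 'to reflect'
        ("überlegen", "ADJECTIVE", "überlegen", "DP", ""),  # 'superior'
    ],
)

# One view per part of speech over the single relation.
for view, word_class in (("verbs", "VERB"), ("adjectives", "ADJECTIVE"), ("nouns", "NOUN")):
    conn.execute(
        f"CREATE VIEW {view} AS SELECT * FROM word WHERE word_class = '{word_class}'"
    )

def entries_for(lemma):
    """All entries of a lemma, across parts of speech."""
    cursor = conn.execute(
        "SELECT word_class, government, obligatory FROM word WHERE lemma = ?",
        (lemma,),
    )
    return cursor.fetchall()

def count(view):
    """Number of rows visible through a part-of-speech view."""
    return conn.execute(f"SELECT COUNT(*) FROM {view}").fetchone()[0]
```

Because all rows live in one table, a query such as entries_for("überlegen") can compare entries across categories, while each view restricts attention to a single part of speech.
      </Paragraph>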
    </Section>
    <Section position="4" start_page="436" end_page="436" type="sub_section">
      <SectionTitle>
2.5 Tools and Aids
</SectionTitle>
      <Paragraph position="0"> To facilitate coding and to ensure its accuracy, we use the following tools: Editing: A Dictionary Editor (a menu-driven program running under ISPF (IBM 1982) interacting with the SQL/DS database) was developed to facilitate adding, updating, deleting, and checking of entries at the terminal.</Paragraph>
      <Paragraph position="1"> Under this editor, a specific set of menus and help panels was implemented for nouns, verbs, and adjectives. Whereas the main menus contain only short hints to the grammatical information as a sort of reminder to the lexicographer, help menus give more detailed examples for the individual codes. Subpanels, as extensions to the main panel for input, and error messages also assist the lexicographer. Codes are verified by the Dictionary Editor to keep down the error rate. Queries, Reports, and Files: Independently of the Dictionary Editor, the lexicographers work with the standard database interfaces ISQL (IBM 1983) and QMF (IBM 1983, 1984) to query, extract, recover, and view the contents of the word database. QMF is used to select information and to format, display, and write reports. The style of data display is easy to understand, so that persons who are not experts in data processing but are competent linguists can examine and alter the linguistic description according to their particular needs.</Paragraph>
      <Paragraph position="3"> Fig. 1: Structure of the Word Relation (column headings: General Description, Field in Database, Noun, Verb, Adjective; first row: Lemma, may not be empty).</Paragraph>
      <Paragraph position="4"> It is part of the work of a lexicographer to account for all grammatical constructions within which a word may appear. To achieve this, the lexicographers consult standard dictionaries such as Duden (1976-1981) and Brockhaus Wahrig (1980-1984), but most importantly, they consult the texts mentioned above in the form of a concordance which we generate dynamically.</Paragraph>
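      <Paragraph position="5"> As an illustration of what generating a concordance on demand involves, the following is a minimal keyword-in-context sketch. The function name, the context width, and the matching by exact surface form are assumptions; the concordance program actually used in the project is not described here.

```python
import re

def concordance(texts, keyword, width=30):
    """Return one keyword-in-context line per occurrence of keyword."""
    # Match the keyword as a whole word, ignoring case.
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    lines = []
    for text in texts:
        for match in pattern.finditer(text):
            left = text[max(0, match.start() - width):match.start()]
            right = text[match.end():match.end() + width]
            # Right-align the left context so the keywords line up.
            lines.append(f"{left:>{width}} [{match.group(0)}] {right}")
    return lines

SAMPLE = [
    "Paul ist Maria im Weitsprung überlegen",
    "der uns allen überlegene Paul",
    "Paul will eine Jacke überlegen",
]
HITS = concordance(SAMPLE, "überlegen")
```

This sketch finds only the exact surface form, so the inflected form in the second sample sentence is not matched; a concordance for lexicographic work would presumably look words up by stem or lemma instead.
      </Paragraph>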
    </Section>
  </Section>
  <Section position="4" start_page="436" end_page="438" type="metho">
    <SectionTitle>
3 Syntactic Information
</SectionTitle>
    <Paragraph position="0"> Restrictions on co-occurrence with other words (or phrases) are what we consider to be syntactic information.</Paragraph>
    <Paragraph position="1"> Here we include information on government (or valency) and on adverbials (or attributes) which serve to subclassify the various parts of speech. Our work is based on the work of Fillmore (1968), Gross (1984), Heidolph et al. (1980), Steinitz (1969), Bierwisch (1963), Helbig and Schenkel (1975), Sommerfeldt and Schreiber (1977, 1980), and on Zoeppritz (1984).</Paragraph>
    <Paragraph position="2"> For the practical work on a dictionary, it is of utmost importance to make fully explicit the criteria for the different classifications used. Such criteria are notoriously difficult to extract from theoretical as well as practice-oriented works.</Paragraph>
    <Section position="1" start_page="436" end_page="437" type="sub_section">
      <SectionTitle>
3.1 Government
</SectionTitle>
      <Paragraph position="0"> German verbs can be classified according to the objects they govern (accusative (A), dative (D), genitive (G), prepositional object (P), predicate noun or adjective (N)). We decided to include also the subject (N) among the complements governed by the verb.</Paragraph>
      <Paragraph position="1"> A similar classification can be carried out for adjectives (they govern cases as well as prepositions) and for nouns (which govern prepositional attributes; genitive attributes are not coded but admitted for every noun).</Paragraph>
      <Paragraph position="2"> While some of the complements may be missing in a sentence, others must be regarded as obligatory, and this must be coded in the dictionary as well. So we code two features indicating the maximum and the minimum of complements of a given word. This is illustrated by the adjective überlegen (superior) which gets the maximal code DP and the minimal  There are a number of problems in determining what complements can be governed by a given verb. We will discuss here the problem with datives, and in the section on adverbials we will discuss the problem of how to distinguish between adverbials and prepositional objects.</Paragraph>
      <Paragraph position="3">  It was noted by case grammarians (but also to some extent in traditional grammar) that datives perform different functions (&quot;semantic roles&quot;), and that only some of them should be regarded as subcategorizing the class of verbs; the others are sometimes called &quot;free&quot; datives, which are exemplified by the following. Ethical: Wirf mir die Vase nicht weg.</Paragraph>
      <Paragraph position="4"> (Be sure not to throw the vase away) possession: Paul brach mir den Arm.</Paragraph>
      <Paragraph position="5"> (Paul broke my arm) benefactive: Paul übersetzte mir den Brief (Paul translated the letter for me) responsibility: Die Vase ist mir zerbrochen.</Paragraph>
      <Paragraph position="6"> (The vase broke during the time I had it) (cf. also Heidolph et al. (1980) for a discussion and Wegener (1985)).</Paragraph>
      <Paragraph position="7"> Free datives are never obligatory, but all other criteria so far are only semantically motivated and are - particularly in the case of the benefactive - not very well defined. But still, free datives can be taken into account by the grammar and can thus be attached to any suitable verb.</Paragraph>
    </Section>
    <Section position="2" start_page="437" end_page="438" type="sub_section">
      <SectionTitle>
3.2 Complement Clauses
</SectionTitle>
      <Paragraph position="0"> In accordance with Bierwisch (1963) and Heidolph et al.</Paragraph>
      <Paragraph position="1"> (1980), we consider complement clauses as filling the positions of nominal complements. The complement clauses we consider are daß (that) clauses, ob (whether) clauses, and infinitive clauses (pure infinitives and infinitive clauses introduced by zu (to)). Bierwisch was the first to subcategorize German verbs according to the implied subjects of the infinitive clauses they govern. Consider the following examples in English: John permitted Paul to leave; John persuaded Paul to leave. This is problematic, however, with some verbs in German when the infinitive clause contains modal verbs. Consider Er flehte sie an zu gehen (he begged her to leave) with the implied subject sie, and Er flehte sie an, gehen zu dürfen (He begged her to be permitted to leave) with the implied subject er. This shift of implied subject must be coded in the lexicon (for verbs like dürfen), so that the code can be used by the syntax rules for complex verbs. A second problem concerns cases where an implied subject cannot be found in the matrix clause, as in Paul ordnete an, den Saal zu räumen (Paul ordered the room to be cleared) and Es ist verboten, den Rasen zu betreten (it is forbidden to walk on the lawn). There are two different phenomena involved: The dative governed by verbieten would be the implied subject, but happens to be omitted, whereas anordnen does not govern a candidate for implied subject. If there is no suitable candidate in the context, we get a generic interpretation, and we code our complement features accordingly.</Paragraph>
    </Section>
    <Section position="3" start_page="438" end_page="438" type="sub_section">
      <SectionTitle>
3.3 Adverbials
</SectionTitle>
      <Paragraph position="0"> Our approach to adverbials is closely related to the one taken by Steinitz (1969) and Heidolph et al. (1980), which - to us - seems far better motivated than e.g. the classification in Brockhaus Wahrig (1980-1984). Certain types of adverbials we consider to be governed by certain verbs, nouns and adjectives, and these are hence used for subcategorization.</Paragraph>
      <Paragraph position="1"> They include adverbials of place: Paul wohnt in Heidelberg (Paul lives in Heidelberg); direction: Paul geht nach Heidelberg (Paul goes to Heidelberg); modality: Paul benimmt sich schlecht (Paul behaves badly); measure: der Vortrag dauert eine Stunde (the lecture lasts one hour). It has been our tendency to code adverbials only when they are obligatory, but this certainly does not cover all the information necessary.</Paragraph>
      <Paragraph position="2"> A further problem concerns the decision between adverbial and prepositional object. Criteria for this distinction have been described by Steinitz (1969) and Heidolph et al. (1980): they mainly involve observations on the role prepositions play, their variability, and whether they have retained their meaning. Consider: Paul stood on the table; Paul insisted on the table. In the first case, we could have near, under, by, etc. instead of on, whereas in the second case, we do not have a choice.</Paragraph>
    </Section>
    <Section position="4" start_page="438" end_page="438" type="sub_section">
      <SectionTitle>
3.4 Coding Example
</SectionTitle>
      <Paragraph position="0"> The following example shows the test questions and corresponding coding decisions for verbs and adjectives. The sample form is überlegen, which appears as a verb with separable prefix in the meaning 'to cover', as a verb with inseparable prefix in the meaning 'to reflect', and as an adjective meaning 'superior'.</Paragraph>
      <Paragraph position="1"> Tests for überlegen 'to cover': Prefix: separable or not? legt über - hat übergelegt - überzulegen. Full government: Paul legt Maria eine Jacke über. Dative can be left out: Paul will eine Jacke überlegen. Accusative cannot be left out: *Paul legt der Maria über; *Paul legt über. Coding for überlegen 'to cover'  The accusative can be replaced by zu-infinitive, daß- and ob-clauses: Paul überlegt (sich), Maria zu besuchen; Paul überlegt (sich), daß er Maria besuchen will; Paul überlegt (sich), ob er Maria besuchen will. The implied subject of infinitive clauses is the main clause subject: Paul überlegt sich, Maria zu besuchen - Paul besucht Maria. Coding for überlegen 'to reflect': Word class: VERB; Stem: überleg; Government: nominative dative accusative; Clauses: infinitive as accusative, implied subject is nominative, daß-clause as accusative, ob-clause as accusative; Reflexive: dative; Obligatory: nominative accusative. Testing for überlegen 'superior': Full government: Paul ist Maria im Weitsprung überlegen - der uns allen im Weitsprung überlegene Paul. Prepositional can be left out: Paul ist Maria überlegen - der uns allen überlegene Paul. Dative can be omitted as well: Paul ist überlegen - der überlegene Paul. Zu-infinitives (marginally) and daß-clauses are possible in lieu of the prepositional. Clauses introduced by ob are not allowed, not even with negation.</Paragraph>
      <Paragraph position="2"> Paul ist Maria darin überlegen, daß er weiter springen kann. *Paul ist Maria (nicht) darin überlegen, ob er weiter springen kann. Diese Schrift ist anderen darin überlegen, leichter lesbar zu sein. The subject of infinitive clauses is the head of the adjective: der darin, springen zu können, überlegene Paul - Paul ist überlegen; Die Schrift ist der anderen darin überlegen, besser lesbar zu sein - Die Schrift ist besser lesbar. The preposition may not be omitted: *Paul ist Maria überlegen, weiter springen zu können. The adjective can be used in predicative and attributive position: Paul ist überlegen - der überlegene Paul. Coding for überlegen 'superior':</Paragraph>
      <Paragraph position="3"> Word class: ADJECTIVE; Stem: überlegen; Government: dative prepositional; Prepositions: in, bei; Clauses: infinitive as prepositional, daß-clause as prepositional; Restrictions: no restrictions to predicative or attributive use.</Paragraph>
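      <Paragraph position="4"> The relation between the maximal code (government) and the minimal code (obligatory complements) can be made concrete with a small sketch. The record layout and the function name are assumptions. The complement sets for 'to reflect' follow the coding given above, whereas those for 'to cover' and 'superior' are read off the tests and are therefore only an interpretation.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Entry:
    lemma: str
    reading: str
    word_class: str
    government: frozenset   # maximal set of complements
    obligatory: frozenset   # minimal set of complements

def admits(entry, complements):
    """True if the complements lie between the minimal and the maximal code."""
    found = frozenset(complements)
    return entry.obligatory.issubset(found) and found.issubset(entry.government)

COVER = Entry("überlegen", "to cover", "VERB",
              frozenset({"nom", "dat", "acc"}), frozenset({"nom", "acc"}))
REFLECT = Entry("überlegen", "to reflect", "VERB",
                frozenset({"nom", "dat", "acc"}), frozenset({"nom", "acc"}))
SUPERIOR = Entry("überlegen", "superior", "ADJECTIVE",
                 frozenset({"dat", "prep"}), frozenset())

# Paul will eine Jacke überlegen: the dative may be left out.
ok_without_dative = admits(COVER, ["nom", "acc"])
# *Paul legt der Maria über: the accusative may not be left out.
bad_without_accusative = admits(COVER, ["nom", "dat"])
```

A frame is accepted when it contains every obligatory complement and nothing outside the governed set, which is the reading of the two codes suggested by the tests above.
      </Paragraph>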
    </Section>
  </Section>
  <Section position="5" start_page="438" end_page="439" type="metho">
    <SectionTitle>
4 Semantic Information
</SectionTitle>
    <Paragraph position="0"> When dealing with semantic information we should distinguish between the information needed for obtaining a (if possible) disambiguated logical form and the information needed to draw inferences from this logical form, even though - at least in part - this information may be identical.</Paragraph>
    <Paragraph position="1"> Among the former we include a concept lattice (or hierarchy) and selection restrictions which both are special cases of the meaning rules we use to represent common sense and domain knowledge. (For a detailed discussion of our approach to knowledge representation, cf. Guenthner / Lehmann / Schönfeld, 1986). All of this information we encode as Prolog terms, and we also store these in SQL/DS, but separate from the word relation described above.</Paragraph>
    <Paragraph position="2"> We give an example here of the concept hierarchy currently used in our Natural Language Analyzer. (&quot;bt&quot; stands for broader term, the period in the third rule indicates a compound term): bt(angeklagt, mensch).</Paragraph>
    <Paragraph position="3"> bt(fahrzeug, fortbewegungsmittel).</Paragraph>
    <Paragraph position="4"> bt(fortbewegungsmittel, hergestellt.objekt).</Paragraph>
    <Paragraph position="5">  This hierarchy is used in conjunction with the selection restrictions listed below to disambiguate sentences, to recover ellipses, and to resolve anaphoric references. (The format of the selection restrictions is: Lemma, reading, state vs. event, list of restrictions for the respective complements, including an indication whether the verb is distributive (dist) or collective on a given complement): verb(abbremsen,1,event, nom(dist,fortbewegungsmittel).nil).</Paragraph>
    <Paragraph position="6"> verb(abbremsen,2,event, nom(dist,mensch).</Paragraph>
    <Paragraph position="7"> acc(dist,fortbewegungsmittel).nil).</Paragraph>
    <Paragraph position="8"> Wirth (1984) has described a procedure to extend a concept hierarchy and selection restrictions from text on the basis of given sentences. In his procedure, human intervention is still required, and it seems doubtful at this point whether a fully automatic procedure is feasible. Further, one observes a certain discrepancy between linguistic usage and logical behavior of certain words. We are investigating ways to overcome these problems, but a discussion of them has to be left to forthcoming publications.</Paragraph>
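    <Paragraph position="9"> How the broader-term facts and the selection restrictions could work together in disambiguation is sketched below in Python rather than Prolog. The sketch assumes a single broader term per concept (the paper speaks of a lattice), treats the compound term as a plain string, ignores the distributive/collective marker, and requires the complements of a sentence to match a reading's frame exactly; all names are hypothetical.

```python
# Broader-term facts, following the bt(...) terms of this section.
BT = {
    "angeklagt": "mensch",
    "fahrzeug": "fortbewegungsmittel",
    "fortbewegungsmittel": "hergestellt.objekt",
}

def is_a(concept, ancestor):
    """Walk the broader-term chain upward, starting at concept itself."""
    while concept is not None:
        if concept == ancestor:
            return True
        concept = BT.get(concept)
    return False

# Readings of abbremsen: complement case -> required concept.
ABBREMSEN = {
    1: {"nom": "fortbewegungsmittel"},
    2: {"nom": "mensch", "acc": "fortbewegungsmittel"},
}

def readings(restrictions, complements):
    """Return the reading numbers whose restrictions the complements satisfy."""
    result = []
    for number, frame in sorted(restrictions.items()):
        same_cases = set(frame) == set(complements)
        if same_cases and all(
            is_a(complements[case], concept) for case, concept in frame.items()
        ):
            result.append(number)
    return result

# A vehicle as subject selects reading 1; a person braking a vehicle selects reading 2.
intransitive = readings(ABBREMSEN, {"nom": "fahrzeug"})
transitive = readings(ABBREMSEN, {"nom": "angeklagt", "acc": "fahrzeug"})
```

With these facts, a subject classified as fahrzeug satisfies only the first reading, while a subject classified as angeklagt together with a fahrzeug object satisfies only the second.
    </Paragraph>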
  </Section>
</Paper>