<?xml version="1.0" standalone="yes"?>
<Paper uid="C82-1021">
  <Title>A MULTILAYERED APPROACH TO THE HANDLING OF WORD FORMATION</Title>
  <Section position="2" start_page="0" end_page="0" type="metho">
    <SectionTitle>
INTRODUCTION
</SectionTitle>
    <Paragraph position="0"> Any linguistic theory has to account for word formation as a way of expressing complex relations, facts or situations, and nearly every theoretical approach contains at least suggestions as to how to handle word formation. Not until recently has word formation become a topic in natural language processing within the framework of artificial intelligence (cf. FININ 1980, McDONALD 1981).</Paragraph>
    <Paragraph position="1"> It is argued here that similar attention should be paid to word formation as has already been paid, for example, to sentence structure. This is justified not only because such phenomena as derivatives and compounds obviously do occur in natural language, but rather because natural language AI systems to a large extent already contain the sort of knowledge needed to understand word formation, and therefore seem to be well suited for investigations in this field (cf. SAMLOWSKI 1975).</Paragraph>
    <Paragraph position="2"> Having been discarded in the early days of AI research as too tedious and expensive a task (CERCONE 1974), the analysis of word formation, especially compounding, seems to be a major way to increase linguistic coverage and to reduce vocabulary errors, one of the most frequent sorts of errors in natural language systems (see</Paragraph>
  </Section>
  <Section position="3" start_page="0" end_page="133" type="metho">
    <SectionTitle>
* THOMPSON 1980)
</SectionTitle>
    <Paragraph position="0"> Generally speaking, the trouble with word formation is that, in contrast to sentence structure, the relations between constituents are not overtly marked. In addition, there are seldom explicit clues indicating whether a given word is lexicalized or analyzable, or how to interpret the latter.</Paragraph>
    <Paragraph position="1"> Furthermore, derived and compound words incorporate ambiguities on several different linguistic and cognitive levels. Therefore, it is a challenging task for natural language AI research to study how a system can identify, understand, and make use of word formation.</Paragraph>
  </Section>
  <Section position="4" start_page="133" end_page="133" type="metho">
    <SectionTitle>
134 W. HOEPPNER
ANALYSIS OF WORD FORMATION IN HAM-ANS
</SectionTitle>
    <Paragraph position="0"> The approach to the handling of word formation described in this paper is part of the development of the natural language system HAM-ANS (Hamburg Application-Oriented Natural Language System). HAM-ANS, which is based on the earlier system HAM-RPM (see v. HAHN et al. 1980), provides natural language access to other software systems. While a natural language interface to a large relational database system dealing with fishery data is currently being designed, the other two major application areas of HAM-ANS, a hotel-reservation system and a motion analysis system dealing with a street crossing (see JAMESON et al. 1980, MARBURGER et al. 1981), are studied and further developed in an implemented version which covers the complete natural language dialogue. Analysis and generation of word formation have been integrated into the system in the context of these two domains of discourse, and the examples given below are taken from dialogues about these domains.</Paragraph>
    <Paragraph position="1"> The main idea in our approach is that derivatives and compounds cannot be treated appropriately in a separable component whose output is a semantic interpretation (FININ 1980) or a paraphrase (BORGIDA 1975, McDONALD/HAYES-ROTH 1978). Instead, the question of how, or even whether, a word formation is to be analyzed should be decidable on different levels of processing covering the meaning of word constituents, their interpretation in the context of utterances, and their interpretation in situational context.</Paragraph>
    <Paragraph position="2"> It is quite a tradition in theoretical linguistics to discriminate between compounding and derivation as the two basic means of word formation. It is not our concern here to add new arguments in favour of or against such a simple distinction; rather, we use it to delimit the broad field of word formation and to indicate the linguistic data our approach is designed to capture. Leaving aside the differentiation between lexicalized and analyzable words for the moment, the overall research objective is to handle those words which can first be segmented into semantically meaningful units and then interpreted by making use of the knowledge sources and the inferential capacity of a natural language AI system.</Paragraph>
    <Paragraph position="3"> A common characterization of derived words is that they are formed by combining a free morphemic part with a bound one. The order of these morphemic parts yields the discrimination between prefixation and suffixation on purely structural grounds. There is, however, also a semantic difference between two types of derivation: It appears to be rather difficult to determine the meanings of prefixes and their semantic relation to the free morphemic part of the word in a general way. We will therefore exclude prefixes from our treatment of word formation within HAM-ANS for the time being and concentrate on derivation by means of suffixes and on composition.</Paragraph>
  </Section>
  <Section position="5" start_page="133" end_page="133" type="metho">
    <SectionTitle>
IDENTIFICATION OF DERIVATIVES AND COMPOUNDS
</SectionTitle>
    <Paragraph position="0"> In both English and German, derivatives are written as one word delimited by blanks or punctuation marks. Compound words are represented differently in the two languages: a German compound is written as one string of letters, whereas the segmental units of an English one are clearly indicated by a blank or a hyphen. This orthographic difference entails a difference in the problems of identifying compounds in the two languages. The ambiguity between English compounds and syntactic constructions, such as attribution in 'woman doctor', has to be handled within the syntactic analysis (cf. MARCUS 1980) and does not occur in German because of its graphemic representation of compounds. On the other hand, a system analyzing German compounds has to identify the meaningful segments as a first subtask; several approaches in the area of computational linguistics have dealt with the problem of identifying segments in isolated compounds (cf. v.HAHN, FISCHER 1975, SCHOTT 1978). These systems rely heavily on graphemic and morphemic rules, the latter using additional lexical information. Characteristically, the analysis of isolated compounds will at best produce more than one segmentation, as e.g. for the word STAUBECKEN which</Paragraph>
  </Section>
  <Section position="6" start_page="133" end_page="133" type="metho">
    <SectionTitle>
MULTILAYERED APPROACH TO WORD FORMATION 135
</SectionTitle>
    <Paragraph position="0"> should be segmented into either STAU-BECKEN (reservoir) or STAUB-ECKEN (dusty corners), but the determination of the intended meaning lies beyond the scope of these approaches.</Paragraph>
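    <Paragraph>The STAUBECKEN ambiguity can be reproduced with a minimal sketch that tries every split point against a lexicon. The lexicon contents and the function name below are assumptions made for illustration only; they are not taken from the systems cited above.

```python
# Toy lexicon; the entries are illustrative assumptions.
TOY_LEXICON = {"STAU", "BECKEN", "STAUB", "ECKEN"}


def two_part_segmentations(word, lexicon):
    """Return every split of `word` into two parts that are both lexicon entries."""
    splits = []
    for cut in range(1, len(word)):
        first, second = word[:cut], word[cut:]
        if first in lexicon and second in lexicon:
            splits.append((first, second))
    return splits


# Both readings survive; nothing in the isolated word selects one of them.
print(two_part_segmentations("STAUBECKEN", TOY_LEXICON))
```

Purely lexical segmentation of this kind returns both candidates, which is why the choice between them has to be left to later processing.</Paragraph>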
    <Paragraph position="1"> In the system HAM-ANS the starting point of word-formation analysis is contained in the lexical analysis component, its main task being the reduction of inflected word forms and providing lexical information for the subsequent syntactic analysis. Whenever a word is not contained in the lexicon, the system removes possible inflectional suffixes before trying to recognize it as a derivative or a compound. Only if this attempt fails will the user be asked for information about the word.</Paragraph>
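    <Paragraph>The order of attempts described here (lexicon lookup, removal of inflectional suffixes, word-formation analysis, and only then a question to the user) might be sketched as follows. The ending list, the function names and the callback interface are hypothetical and serve only to make the control flow concrete.

```python
INFLECTIONAL_ENDINGS = ("EN", "E", "N", "S")  # assumed toy list of endings


def reduced_forms(word):
    """Return the word plus candidate stems with one assumed ending removed."""
    forms = [word]
    for ending in INFLECTIONAL_ENDINGS:
        if word.endswith(ending) and len(word) > len(ending):
            forms.append(word[: -len(ending)])
    return forms


def analyze_word(word, lexicon, analyze_formation, ask_user):
    """Try the lexicon first, then word-formation analysis, then the user."""
    candidates = reduced_forms(word)
    for form in candidates:
        if form in lexicon:
            return ("lexicon", form)
    for form in candidates:
        analysis = analyze_formation(form)  # returns None when analysis fails
        if analysis is not None:
            return ("word formation", analysis)
    return ("user", ask_user(word))
```

Passing the formation analysis and the user dialogue in as callbacks keeps the sketch independent of how those steps are actually realized.</Paragraph>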
    <Paragraph position="2"> Employing the contents of the system's lexicon is certainly a simple way to define lexicalized formations. This sharp distinction between lexicalized and analyzable words, as used in the current implementation, does not do full justice to observable degrees of lexicalization; therefore it will yield to an improved conception. The segmentation of words not contained in the lexicon makes use of a table of derivative suffixes, a set of graphemic restrictions and the definitions of basic lexical items stored in the lexicon. Graphemic restrictions incorporate rules for the reduction of vowel mutation often cooccurring with suffixation and for the detection of juncture morphemes.</Paragraph>
    <Paragraph position="3"> In a first step, derivative suffixes are recognized by comparing final segments of the word under inspection with the entries of the suffix table. The analysis of derivatives in HAM-ANS is to a large extent based on work done for different purposes in the area of computational linguistics (HOEPPNER 1980), major deviations being the extensive use of a lexicon and a smaller selection of productive suffixes. Apart from the literal form of the suffixes the entries of the table contain information about gender (for nominal suffixes), part of speech of the derivative and the basic form being derived and expressions of the system's semantic representation language SURF, which later on is integrated into the semantic representation of the whole word. The lexicon serves as a device for ascertaining that the remaining part is a lexical unit known or accessible to the system.</Paragraph>
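    <Paragraph>A suffix table with the kinds of fields listed above, together with the check that the remaining part is a known lexical unit, could look roughly like this. The field names, the two entries, the toy stems and the string stand-ins for SURF expressions are all assumptions; the actual table format is not given in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SuffixEntry:
    form: str         # literal form of the suffix
    gender: str       # gender for nominal suffixes
    derived_pos: str  # part of speech of the derivative
    base_pos: str     # part of speech of the base being derived
    semantics: str    # stand-in for a SURF expression (notation invented here)


SUFFIX_TABLE = [
    SuffixEntry("ER", "masculine", "noun", "verb", "agent-of(BASE)"),
    SuffixEntry("UNG", "feminine", "noun", "verb", "event-of(BASE)"),
]

TOY_STEMS = {"FEG": "verb", "WOHN": "verb"}  # assumed lexicon fragment


def recognize_derivative(word):
    """Compare final segments with the suffix table and check the remainder."""
    results = []
    for entry in SUFFIX_TABLE:
        if word.endswith(entry.form):
            base = word[: -len(entry.form)]
            if TOY_STEMS.get(base) == entry.base_pos:
                results.append((base, entry))
    return results


print(recognize_derivative("FEGER"))   # one analysis: FEG + ER
print(recognize_derivative("MUTTER"))  # ends in ER, but MUTT is no known stem
```

The lexicon check is what keeps a word that merely ends in a suffix-like string from being treated as a derivative.</Paragraph>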
    <Paragraph position="4"> Once a derivative suffix has been identified and the word has thus been determined to be a derivative, the remaining part can in turn prove to be an analyzable formation, say a compound. So a second step (the first step in the processing of a non-derived word) is the attempt to split the word into two components, both of which have to be ultimately transformable into canonical forms, for example by removing vowel mutation or by analyzing a derived part in the way described above.</Paragraph>
    <Paragraph position="5"> Search in the lexicon is performed by constructing a hypothetical first constituent and looking for the most similar lexicon entry. This yields the second constituent as the remainder, which by consulting the lexicon leads to a revision of the initial hypothetical assumption or confirms it.</Paragraph>
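    <Paragraph>The hypothesize-and-revise search can be illustrated with a simplified sketch. It approximates "most similar lexicon entry" by stripping an assumed juncture morpheme from the hypothetical first constituent, and it revises a failed hypothesis by shortening it; both choices, like the lexicon and the morpheme list, are assumptions and not a description of the HAM-ANS procedure.

```python
JUNCTURE_MORPHEMES = ("", "ER", "EN", "ES", "N", "S")  # assumed list
TOY_LEXICON = {"STUHL", "BEIN", "BILD", "RAHMEN"}      # assumed entries


def canonical_first(hypothesis):
    """Map a hypothetical first constituent to a lexicon entry, or None."""
    for juncture in JUNCTURE_MORPHEMES:
        if not hypothesis.endswith(juncture):
            continue
        stem = hypothesis[: len(hypothesis) - len(juncture)]
        if stem in TOY_LEXICON:
            return stem
    return None


def split_compound(word):
    """Try first-constituent hypotheses from the longest to the shortest."""
    for cut in range(len(word) - 1, 0, -1):
        first = canonical_first(word[:cut])
        second = word[cut:]  # the remainder is the second constituent
        if first is not None and second in TOY_LEXICON:
            return (first, second)  # hypothesis confirmed by the lexicon
        # otherwise the hypothesis is revised by shortening it
    return None


print(split_compound("STUHLBEIN"))
print(split_compound("BILDERRAHMEN"))
```

In the second call the segment ER is treated as a juncture morpheme, so the first constituent is reduced to the lexicon entry BILD.</Paragraph>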
    <Paragraph position="6"> In principle these two steps in identifying the structure of compounds and derivatives should interact recursively to allow for the handling of multiple compounding and derivation (for restrictions on multiple derivation see HOEPPNER 1980). In HAM-ANS the analytical capacity at the moment is restricted to compounds with two parts and to singular derivation. This limitation is not so much determined by the identification process but rather by the state of elaboration of those processes which relate and integrate the semantic interpretation of a word formation into the knowledge already available to the system.</Paragraph>
    <Paragraph position="7"> After the system has successfully segmented an initially unknown word, the result of the identification is a structure containing the identified parts together with those grammatical features which in the course of further processing will guide the construction of a semantic interpretation and which provide grammatical information for the whole word. To illustrate this resulting lexical structure, an example for the word 'STRASSENFEGER' (street cleaner) is given in Fig. 1, indicating also the origin of the associated grammatical features (the features and their values are given here in English).</Paragraph>
  </Section>
  <Section position="7" start_page="133" end_page="133" type="metho">
    <SectionTitle>
SEMANTIC INTERPRETATION OF DERIVATIVES AND COMPOUNDS
</SectionTitle>
    <Paragraph position="0"> So that the system needn't analyze an unknown word each time it occurs in an utterance, the information gathered so far could be stored in lexical memory, as is done with explicit information given by the user about unanalyzable words. The goal of word-formation analysis, however, is not completed with the segmentation of words and the assignment of features to their parts. A more important step is to relate structural knowledge about derivatives and compounds to conceptual knowledge and to transform lexical structures into semantic structures. The logic-oriented representation language SURF (see JAMESON et al. 1980) is the device in HAM-ANS which expresses semantic relations between parts of utterances and likewise between lexically analyzed words. An interpretation process has accordingly been implemented which maps lexical representations of analyzed words onto expressions of SURF having the same type as that constructed by the parser for simple words of the same class. For example, a compound noun is represented by a 'description' in the same way as a simple noun in a noun phrase would leave the parser. The only difference is that the representation of a compound contains explicit relations between its constituents. An example interpretation of the German compound STUHLBEIN (chair leg) is given in Fig. 2, the letter T in the last line standing for the whole-part relation in the system's conceptual semantic network.</Paragraph>
    <Paragraph position="1">  The representation of the simple noun BEIN (leg) would correspond to the first argument of the outermost conjunction, which is likewise a 'description'.</Paragraph>
    <Paragraph position="2"> Let's now take a closer look at the way the transformation of a lexical representation into a SURF representation is achieved. As mentioned above, the table of derivative suffixes includes one or more SURF expressions for each suffix. The expression provided for the suffix -ER, in STRASSENFEGER, together with a verb stem leads to a case-frame instantiation with the agent being a male person and an objective case to be filled either by a genitive attribute or a compound constituent as in this example.</Paragraph>
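    <Paragraph>As a rough illustration of the case-frame instantiation triggered by -ER, the following sketch uses plain dictionaries in place of SURF expressions. The frame layout, the slot names and the helper function are hypothetical; only the roles (a male person as agent, an objective slot filled by the compound constituent) come from the description above.

```python
CASE_FRAMES = {"FEGEN": ("agent", "objective")}  # toy case frame, assumed


def interpret_er_noun(verb, objective=None):
    """Instantiate the verb's case frame for a noun derived with -ER."""
    frame = {"verb": verb}
    for slot in CASE_FRAMES[verb]:
        frame[slot] = None  # every slot starts unfilled
    # The suffix contributes the agent: a male person.
    frame["agent"] = {"class": "person", "sex": "male"}
    # The objective may come from a genitive attribute or a compound constituent.
    if objective is not None:
        frame["objective"] = objective
    return frame


# STRASSENFEGER: FEGEN (to sweep) with STRASSE (street) as objective.
print(interpret_er_noun("FEGEN", "STRASSE"))
```

Leaving the objective slot empty models a bare derivative whose object is supplied later, for instance by a genitive attribute.</Paragraph>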
    <Paragraph position="3"> Compounds require a more interesting transformation process to discover relations between their parts. Analyzing the lexical representation, different inference strategies are selected depending on the parts of speech of the constituents. For instance, a compound consisting of two adjectives activates processes trying to establish a coordination of the two concepts (e.g. DUNKELBRAUN (dark brown)). The transformation of nominal compounds applies the system's inferential capacity to detect possible links between the two concepts in the conceptual semantic network.</Paragraph>
    <Paragraph position="4"> In addition to the part-of relation the following relations are inspected and used</Paragraph>
  </Section>
  <Section position="8" start_page="133" end_page="133" type="metho">
    <SectionTitle>
MULTILAYERED APPROACH TO WORD FORMATION 137
</SectionTitle>
    <Paragraph position="0"> for the semantic representation in SURF:
 - physical object and its material, e.g. HOLZTISCH (wooden table)
 - property of an object, e.g. HAARFARBE (colour of hair)
 - physical object in its preferred location, e.g. COUCHTISCH (couch table)
 - combination of physical objects, e.g. RADIOWECKER (clock radio).
 Finally, compounds with a verbal element are transformed by trying to fit the remaining constituents into the slots of the verb's case frame. The example STRASSENFEGER is represented as the instantiated case frame of FEGEN (to sweep) with the noun STRASSE (street) filling the objective slot.</Paragraph>
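    <Paragraph>The search for a link between the two concepts of a nominal compound can be sketched with a toy network of relation triples. The relation labels, the network contents and the fallback rule for combinations of physical objects are assumptions chosen to mirror the relation types listed above; the real conceptual semantic network is not specified in the text.

```python
# (relation, from, to) triples of a toy conceptual network; all invented.
NETWORK = {
    ("part-of", "BEIN", "STUHL"),
    ("material-of", "HOLZ", "TISCH"),
    ("property-of", "FARBE", "HAAR"),
    ("located-at", "TISCH", "COUCH"),
}
RELATIONS = ("part-of", "material-of", "property-of", "located-at")
PHYSICAL_OBJECTS = {"RADIO", "WECKER", "TISCH", "STUHL", "COUCH"}


def link_nominal_compound(first, second):
    """Collect every relation that connects the two constituents."""
    readings = []
    for relation in RELATIONS:
        for a, b in ((first, second), (second, first)):
            if (relation, a, b) in NETWORK:
                readings.append((relation, a, b))
    # Assumed fallback: two unrelated physical objects form a combination.
    if not readings and first in PHYSICAL_OBJECTS and second in PHYSICAL_OBJECTS:
        readings.append(("combination", first, second))
    return readings


print(link_nominal_compound("STUHL", "BEIN"))
print(link_nominal_compound("RADIO", "WECKER"))
```

Returning a list rather than a single relation leaves room for compounds that remain ambiguous at this stage.</Paragraph>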
    <Paragraph position="1"> At this stage the lexical representation and the semantic interpretation of a compound or a derivative are stored in the system's lexical memory for several reasons:
 - to eliminate the need for repetition of the whole analysis each time the word occurs,
 - to form the basis for analogy-driven resolution of word formations,
 - to enable the system to use understandable words while generating utterances from semantic representations (see below).</Paragraph>
    <Paragraph position="2"> An example of a semantic interpretation which is still ambiguous at this processing stage is the one for BILDERRAHMEN (picture frame), which besides a semantic representation expressing a whole-part relation would, by reference to the case frame of the German verb RAHMEN (to frame), be interpreted as an object-verb nominalization (the framing of pictures). Once the appropriate semantic interpretations have been inferred, processing continues with the parsing of the entire input utterance. The ATN grammar of HAM-ANS treats compounds and derivatives in the same way as other words of the same class except that they are more frequently ambiguous, so that the parser more often has to use knowledge of case frame restrictions or attribution congruency to select an appropriate reading.</Paragraph>
  </Section>
</Paper>