File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/86/c86-1091_metho.xml
Size: 11,286 bytes
Last Modified: 2025-10-06 14:11:50
<?xml version="1.0" standalone="yes"?> <Paper uid="C86-1091"> <Title>Towards the Automatic Acquisition of Lexical Data</Title> <Section position="4" start_page="387" end_page="387" type="metho"> <SectionTitle> 3. Knowledge Base </SectionTitle> <Paragraph position="0"> The acquisition system is rule-based. Its knowledge base comprises three types of rules: - Rules representing inflectional paradigms. These rules describe the basic types of conjugation and declension in German.</Paragraph> <Paragraph position="1"> - Morphonological rules. The basic inflectional endings are split up into a much larger set by various morphonological rules, which alter the endings and stems to make pronunciation easier.</Paragraph> <Paragraph position="2"> - Heuristic rules. While the former two rule types are derived from German grammar proper, these rules are more like plausible guesses. They guide the system in making choices, such as which category a word belongs to, based on knowledge about forms (e.g. all verbs end in -en), the actual frequency of classes, etc.</Paragraph> <Paragraph position="3"> These rules are organized in distinct packages. Only rules in active packages are considered. Rules may activate and deactivate rule packages.</Paragraph> </Section> <Section position="5" start_page="387" end_page="388" type="metho"> <SectionTitle> 4. Overall Architecture </SectionTitle> <Paragraph position="0"> Because they differ in nature, the three types of rules mentioned above are processed differently.</Paragraph> <Paragraph position="1"> Knowledge about inflectional types serves to partition the words into disjoint classes. Once the inflectional type has been determined, there are relatively clear guidelines as to the inflection of the word. The inflectional type is actually a subclassification of the word type.</Paragraph> <Paragraph position="2"> One of the crucial points is determining the word type. The system first tries to make use of its basic vocabulary. 
It checks whether a new word is composed of words already in the lexicon or of an existing word stem together with a derivational ending. There is a rule in German morphology stating that in compound words the morphological class is determined by the last word. Along similar lines, reasoning about derivational endings is performed, as these may determine word type as well as inflection. As a next heuristic, morphological clues are taken into consideration.</Paragraph> <Paragraph position="3"> A number of such clues exist, but ambiguities may arise. If this is the case, a third strategy is applied: the system asks the user to type in a short utterance containing the new word. The utterance is analysed by the parser of VIE-LANG, which yields information about the word type by means of the phrase type it appears in. In applying this method, the system relies on a simple but important presupposition: the user usually enters an utterance containing the word in a proper linguistic context, facilitating determination of its type. We do not argue that the user will always utter the minimal projection, but that he will not violate phrase borders with his utterance.</Paragraph> <Paragraph position="4"> The knowledge about phrase types, together with the basic vocabulary, permits unambiguous determination of the word type in most cases, especially as the most irregular forms, which are very limited in number (words of the closed word classes: pronouns, articles, auxiliary and modal verbs, etc.), have already been included in the basic lexicon.</Paragraph> <Paragraph position="5"> Once the word type has been determined, the rule package associated with it is activated. Suppose the new word is a verb. Then the verb package is triggered. Here, in turn, we find packages for strong and weak inflection. 
The large number of subclasses arises for morphonological reasons, which multiply the small number of general paradigms.</Paragraph> <Paragraph position="6"> Morphonological rules have exact matching conditions; classification in this part is therefore automated to a large extent. The only problem is deciding between weak and strong inflection first. As exact rules do not exist, heuristics are applied which are based mainly on word frequency.</Paragraph> <Paragraph position="7"> An important feature is the dynamic interaction register: the hypotheses evoked by the heuristic rules need to be confirmed by the user. The system knows which word forms constitute sufficient evidence for a certain hypothesis. It generates these forms and asks the user for confirmation. The forms, however, depend on the hypotheses. Thus, the user is asked only a minimum of questions. The forms to be asked for are kept in a dynamic interaction register, which is updated with every hypothesis and every answer from the user.</Paragraph> </Section> <Section position="6" start_page="388" end_page="388" type="metho"> <SectionTitle> 5. An Example Session </SectionTitle> <Paragraph position="0"> In this section we show how a new entry is actually created. The user starts the interaction by entering a new word, e.g. 'abgeben' (to leave). The first thing the system has to do is to decide on the word category. To find out whether it is a compound word, it tries to split off words, first from the beginning and then from the end.</Paragraph> <Paragraph position="1"> This results in recognizing 'ab' as a separable verbadjunct. Of course, the 'ab' could be part of a totally different stem, as in 'Abend' (evening) or 'aber' (but). So the system looks for facts supporting the verb hypothesis. Verbs are usually typed in in the infinitive form, which implies the ending '-en' (in a few cases also '-n'). 
Of course, this '-en' could also be part of a stem, as in 'Magen' (stomach) or 'wegen' (because of), but the combination of both the verb adjunct 'ab' and the ending '-en' on a word belonging to a different category is highly implausible. So 'abgeben' is split into ab/geb/en.</Paragraph> <Paragraph position="2"> As a next step, 'geb' is looked up in the lexicon. If it is found, the rest is easy. All the information from 'geb' is simply duplicated; the only additional information to be stored concerns the separable 'ab'.</Paragraph> <Paragraph position="3"> This way the new entry can be created without any further help from the user.</Paragraph> <Paragraph position="4"> To continue with our example, we will assume that 'geb' is not already contained in the lexicon. That means the system has to form a hypothesis concerning the conjugation type of 'abgeben' (either weak or strong). Since weak verbs make up the vast majority of German verbs, this hypothesis is tried first.</Paragraph> </Section> <Section position="7" start_page="388" end_page="388" type="metho"> <SectionTitle> FORM CLASS FM UM PF SY </SectionTitle> <Paragraph position="0"> present tense abgeb 44 0 0 1 502 Weak conjugation is regular: all forms are built from one stem. To confirm weak conjugation, it suffices to show the user the 1st person sg past tense. Before doing so, all morphonological rules connected to weak conjugation are tried. None applies, so user interaction can start. The 1st person sg past tense in the weak paradigm is 'gebte ab'. To make sure the user knows which form is intended, some context has to be provided. This leads to the phrase 'gestern gebte ich ab' (I leaved yesterday), specifying tense and person.</Paragraph> <Paragraph position="1"> The user recognizes 'gebte' as incorrect and rejects that phrase. 
This makes the system discard the weak hypothesis and try the strong one instead.</Paragraph> <Paragraph position="2"> Strong conjugation is more complicated than weak.</Paragraph> <Paragraph position="3"> There may be up to four different stems: for present tense, for present tense 2nd and 3rd person sg, for past tense and for the PPP. All these possibilities either have to be resolved automatically or asked of the user explicitly. First the system goes on to determine the past tense forms. There are three different types of vowel changes in the case of 'e'-stems (e-a-e, e-o-o, e-a-o). They are sorted by frequency, because no other criterion is available. Again, all morphonological rules applicable to strong verbs are tried. In our case none applies, so the user is asked again for verification with 'gestern gab ich ab' (I left yesterday).</Paragraph> </Section> <Section position="8" start_page="388" end_page="388" type="metho"> <SectionTitle> FORM CLASS </SectionTitle> <Paragraph position="0"> present tense abgeb 30 pres.t.2nd p.sg past tense abgab 23 past participle</Paragraph> </Section> <Section position="9" start_page="388" end_page="388" type="metho"> <SectionTitle> FM UM PF SY </SectionTitle> <Paragraph position="0"> This time the user confirms, so the system can go on.</Paragraph> <Paragraph position="1"> There are two possibilities for the PPP, and again the more frequent one is tried and accepted by the user.</Paragraph> <Paragraph position="2"> There is still another irregularity concerning the 2nd and 3rd person sg present tense: in most cases the stem vowel 'e' becomes 'i'. After verification of this fact, the morphological class is finally determined.</Paragraph> <Paragraph position="3"> The system creates three lexical entries, 'abgeb', 'abgib' and 'abgab', for present tense and PPP, 2nd and 3rd person sg present tense, and past tense, respectively.</Paragraph> <Paragraph position="4"> Now all of the features have to be filled in. 
PF of 'abgeb' is set to 1, since the verbadjunct 'ab' implies the use of the prefix 'ge-' for the PPP. UM is set to 8 for 'abgab', indicating 'umlautung' for the subjunctive mood in the past tense. FM of the primary entry 'abgeb' is set to 8 as a result of the combination of classes. Then SY is set to 502 (5 = verb, 0 = present perfect with 'haben', 2 = separable verbadjunct of length 2).</Paragraph> </Section> <Section position="10" start_page="388" end_page="388" type="metho"> <SectionTitle> FORM CLASS FM UM PF SY </SectionTitle> <Paragraph position="0"> present tense abgeb 22 8 0 1 502 pres.t.2nd p.sg abgib 26 0 0 past tense abgab 23 8 0 Next, all indicative forms of present and past tense and the PPP are printed, and the user is asked for confirmation. This step could actually be skipped, but it is another safety measure against faulty entries. In our specific example there is a final step to be done: since 'geb' was not found in the lexicon, it has to be included too, for two reasons. First, the analysis algorithm could otherwise not handle those cases where the particle is actually split off in the text; second, there may be more compound verbs with 'geb', and their incorporation into the lexicon can then be handled fully automatically. Since the verb stem of a compound verb with a separable verbadjunct can always appear as a verb in its own right, this poses no problem. The situation is slightly more difficult with other particles, where this is not guaranteed. In those cases the new entry must be marked as internal, so that it does not affect analysis or synthesis.</Paragraph> <Paragraph position="1"> Creation of the new entries is simple anyway. All forms are duplicated, 'abgeb', 'abgib' and 'abgab' are changed to 'geb', 'gib' and 'gab' respectively, and SY is set to 500 instead of 502.</Paragraph> </Section> </Paper>
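<!--
Editorial illustration (not part of the original paper). The example session in Section 5 can be sketched roughly as follows, assuming a much simplified model: the word is segmented into verbadjunct, stem and ending; the weak hypothesis is tested first; then the 'e'-stem vowel patterns are tried one by one. All names here (SEPARABLE_ADJUNCTS, segment, classify, confirm) and the sample list of adjuncts are hypothetical. The patterns are kept in the order the text lists them, which is only assumed to match the frequency ordering the paper mentions. The user is modelled as a yes/no callback.

```python
# Hypothetical sample of separable verbadjuncts; the paper gives no such list.
SEPARABLE_ADJUNCTS = ("ab", "an", "auf", "aus", "ein", "mit", "vor", "zu")

# Vowel patterns for 'e'-stems named in the text (present, past, participle),
# in the order the text lists them (assumed, not confirmed, to be by frequency).
E_STEM_PATTERNS = [("e", "a", "e"), ("e", "o", "o"), ("e", "a", "o")]


def segment(word):
    """Split an infinitive such as 'abgeben' into (adjunct, stem, ending), else None."""
    for adjunct in SEPARABLE_ADJUNCTS:
        if word.startswith(adjunct):
            rest = word[len(adjunct):]
            for ending in ("en", "n"):
                stem = rest[: -len(ending)]
                if rest.endswith(ending) and stem:
                    return adjunct, stem, ending
    return None


def classify(word, confirm):
    """Return (conjugation type, confirmed past stem) from yes/no answers.

    confirm(phrase) stands in for the user accepting or rejecting a phrase.
    """
    parts = segment(word)
    if parts is None:
        raise ValueError(f"cannot segment {word!r} as a separable verb")
    adjunct, stem, _ = parts

    # Weak conjugation is tried first; one past tense form is enough evidence.
    if confirm(f"gestern {stem}te ich {adjunct}"):
        return "weak", None

    # Strong conjugation: ask about each distinct past tense vowel only once.
    asked = set()
    for present, past, _ in E_STEM_PATTERNS:
        if present not in stem or past in asked:
            continue
        asked.add(past)
        past_stem = stem.replace(present, past, 1)
        if confirm(f"gestern {past_stem} ich {adjunct}"):
            return "strong", past_stem
    return "unknown", None


if __name__ == "__main__":
    questions = []

    def simulated_user(phrase):
        # Simulated user who accepts only the correct past tense form.
        questions.append(phrase)
        return phrase == "gestern gab ich ab"

    print(classify("abgeben", simulated_user))
    print(questions)
```

Tracking the vowels already asked about plays the role of a very small interaction register: the patterns e-a-e and e-a-o share the past vowel 'a', so only one question is needed for both, and the remaining choice concerns the participle alone. The sketch leaves out the morphonological rules, the participle and 2nd/3rd person checks, and the feature values (FM, UM, PF, SY).
-->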