<?xml version="1.0" standalone="yes"?> <Paper uid="W99-0802"> <Title>A Modern Computational Linguistics Course Using Dutch</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Finite State Methods </SectionTitle> <Paragraph position="0"> A typical course in computational linguistics starts with finite state methods. Finite state techniques can provide computationally efficient solutions to a wide range of tasks in natural language processing. Therefore, students should be familiar with the basic concepts of automata (states and transitions, recognizers and transducers, properties of automata) and should know how to solve toy natural language processing problems using automata. (Footnote 1: See www.let.rug.nl/~gosse/tt for a preliminary version of the text and links to the exercises described here.)</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> Figure 1 </SectionTitle> <Paragraph position="2"> Figure 1: FSA screen-shot. The regular expression and transducer are an approximation of the rule for realizing a final -v in abstract stems as -f if followed by the suffix -t (i.e. leev+t becomes leeft). \[A,B\] denotes A followed by B, {A,B} denotes A or B, '?' denotes any single character, and A - B denotes the strings defined by A minus those defined by B. A:B is the transduction of A into B. '+' is a morpheme boundary, and the hash-sign is the end-of-word symbol.</Paragraph> <Paragraph position="4"> However, when solving 'real' problems most researchers use software supporting high-level descriptions of automata, automatic compilation and optimisation, and debugging facilities. Packages for two-level morphology, such as PC-KIMMO (Antworth, 1990), are well-known examples. As demonstrated in Karttunen et al. 
(1997), an even more flexible use of finite state technology can be obtained by using a calculus of regular expressions. A high-level description language suited for language engineering purposes can be obtained by providing, next to the standard regular expression operators, a range of operators intended to facilitate the translation of linguistic analyses into regular expressions. Complex problems can be solved by composing automata defined by simple regular expressions.</Paragraph> <Paragraph position="5"> We have developed a number of exercises in which regular expression calculus is used to solve more or less 'realistic' problems in language technology. Students use the FSA-utilities package 2 (van Noord, 1997), which provides a powerful language for regular expressions and possibilities for adding user-defined operators and macros, compilation into (optimised) automata, and a graphical user-interface. Automata can be displayed graphically, which makes it easy to learn the meaning of various regular expression operators (see figure 1).</Paragraph> <Paragraph position="6"> Exercise I: Dutch Syllable Structure Hyphenation for Dutch (Vosse, 1994) requires that complex words are split into morphemes, and morphemes are split into syllables.</Paragraph> <Paragraph position="7"> macro(syll, \[onset^, nucleus, coda^\]).</Paragraph> <Paragraph position="8"> macro(onset, {\[b, {l,r}^\], \[c, h^, {l,r}^\]}).</Paragraph> <Paragraph position="9"> macro(nucleus, {\[a, {\[a, {i,u}^\], u}^\], \[e, {\[e,u^\], i, o, u}^\]}).</Paragraph> <Paragraph position="10"> macro(coda, {\[b, {s,t}^\], \[d, s^, t^\]}).</Paragraph> <Paragraph position="11"> Each morpheme or syllable boundary is a potential insertion spot for a hyphen. Whereas one would normally define the notion 'syllable' in terms of phonemes, it should be defined in terms of character strings for this particular task. The syllable can easily be defined in terms of a regular expression. 
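As an illustration outside the FSA system, the toy macros of figure 2 can be mimicked with an ordinary regular expression library. The following Python sketch is our own translation of the macros (and only the tiny onset/nucleus/coda inventory of the figure is covered — not a serious Dutch syllable definition):

```python
import re

# Toy approximation of the figure-2 FSA macros as an ordinary regex.
ONSET = r"(?:b[lr]?|ch?[lr]?)"                    # b, bl, br, c, ch, cl, cr, chl, chr
NUCLEUS = r"(?:a(?:a[iu]?|u)?|e(?:eu?|i|o|u)?)"   # a, aa, aai, aau, au, e, ee, eeu, ei, eo, eu
CODA = r"(?:b[st]?|ds?t?)"                        # b, bs, bt, d, ds, dt, dst
# syll = optional onset, obligatory nucleus, optional coda.
SYLLABLE = re.compile("^%s?%s%s?$" % (ONSET, NUCLEUS, CODA))
```

A definition like this can then be run over a word list of the form \[C*,V+,C*\], exactly as in the exercise, to see which candidate syllables it accepts (bad, blad, breedst) and rejects.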
For instance, using the regular expression syntax of FSA, a first approximation is given in figure 2. The definition in figure 2 allows such syllables as \[b,a,d\], \[b,l,a,d\], \[b,r,e,e,d,s,t\], etc.</Paragraph> <Paragraph position="12"> Students can provide a definition of the Dutch syllable covering all perceived cases in about twenty lines of code. The quality of the solutions could be tested in two ways. First, students could test which words of a list of over 5000 words of the form \[C*,V+,C*\] (where C and V are macros for consonants and vowels, respectively) are accepted and rejected by the syllable regular expression. A restrictive definition will reject words which are bisyllabic (geijkt) and foreign words such as crash, sfinx, and jazz. Second, students could test how accurate the definition is in predicting possible hyphenation positions in a list of morphophonemic words. To this end, a list of 12000 morphophonemic words and their hyphenation properties was extracted from the CELEX lexical database (Baayen et al., 1993). 3 The best solutions for this task resulted in a 5% error rate (i.e. the percentage of words in which a wrongly placed hyphenation point occurs). (Footnote 3: The hyphenation task itself was defined as a finite state transducer: macro(hyph, replace(\[\]: -, syll, syll)). The operator replace(Target, LeftContext, RightContext) implements 'leftmost' (and 'longest match') replacement (Karttunen, 1995). This ensures that in the cases where a consonant could be either final in a coda or initial in the next onset, it is in fact added to the onset.)</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> Exercise II: Verbal Inflection </SectionTitle> <Paragraph position="0"> A second exercise concentrated on finite state transducers. Regular expressions had to be constructed for computing the surface form of abstract verbal stem forms and combinations of a stem and a verbal inflection suffix (see figure 3).</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> Underlying Surface Gloss </SectionTitle> <Paragraph position="0"> a. werk+en werken work\[INF\]
b. bak+en bakken bake\[INF\]
c. raak+en raken hit\[INF\]
d. verwen+en verwennen pamper\[INF\]
e. teken+en tekenen draw\[INF\]
f. aanpik+en aanpikken catch up\[INF\]
g. zanik+en zaniken whine\[INF\]
h. leev+en leven live\[INF\]
i. leev leef live\[PRES, 1, SG\]
j. leev+t leeft live(s)\[PRES, 2/3, SG\]
k. doe+en doen do\[INF\]
l. ga+t gaat go(es)\[PRES, 2/3, SG\]
m. zit+t zit sit(s)\[PRES, 2/3, SG\]
n. werk+Te werkte worked\[PAST, SG\]
o. hoor+Te hoorde heard\[PAST, SG\]
p. blaf+Te blafte barked\[PAST, SG\]
q. leev+Te leefde lived\[PAST, SG\]
Several spelling rules need to be captured. Examples (b) and (c) show that single consonants following a short vowel are doubled when followed by the '+en' suffix, while long vowels (normally represented by two identical characters) are written as a single character when followed by a single consonant and '+en'. Examples (d-g) illustrate that the rule which requires doubling of a consonant after a short vowel is not applied if the preceding vowel is a schwa. Note that a single 'e' (sometimes 'i') can be either a stressed vowel or a schwa. This makes the rule hard to apply on the basis of the written form of a word. Examples (h-j) illustrate the effect of devoicing on spelling. Examples (i-l) illustrate several other irregularities in present tense and infinitive forms that need to be captured. Examples (n-q), finally, illustrate past tense formation of weak verbal stems. 
Past tenses are formed with either a '+te' or '+de' ending ('+ten'/'+den' for plural past tenses). The form of the past tense is predictable on the basis of the preceding stem, and thus a single underlying suffix '+Te' is used. Stems ending with one of the consonants 'c,f,h,k,p,s,t' and 'x' form a past tense with '+te', while all other stems receive a '+de' ending. Note that the spelling rule for devoicing applies to past tenses as well (p-q). In the exercise, only past tenses of weak stems were considered.</Paragraph> <Paragraph position="1"> The implementation of spelling rules as transducers is based on the replace-operator (Karttunen, 1995). A rule transducing an underlying form into a surface form in a given context can be implemented in FSA as: replace(Underlying:Surface, Left, Right). An example illustrating the rule format for transducers is given in figure 4.</Paragraph> <Paragraph position="2"> macro(verbal_inflection, shorten o double o past_tense).</Paragraph> <Paragraph position="3"> macro(shorten, replace(\[a,a\]:a, \[\], \[cons,+,e,n\])).</Paragraph> <Paragraph position="6"> Most solutions to the exercise consisted of a collection of approximately 30 replace-rules which were composed to form a single finite state transducer. The size of this transducer varied between 4.000 and over 16.000 states, indicating that the complexity of the task is well beyond the reach of text-book approaches.</Paragraph> <Paragraph position="7"> For testing and evaluation, a list of almost 50.000 pairs of underlying and surface forms was extracted from Celex. 4 10% of the data was given to the students as training material. Almost all solutions achieved a high level of accuracy, even for the 'verwennen/tekenen' cases, which can only be dealt with using heuristics. 
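The flavour of such a rule cascade can be conveyed outside FSA as well. The sketch below imitates a handful of the spelling rules as ordered Python string rewrites; the rule set, names, and ordering are our own simplification. Degemination (zit+t), the doe/ga irregularities, and the schwa cases of (d-g) are deliberately not handled, so e.g. teken+en comes out wrong — which is precisely why heuristics are needed:

```python
import re

CONS = "bcdfghjklmnpqrstvwxz"

def double(w):
    # bak+en -> bakk+en: double a single consonant after a short vowel.
    # The negative lookbehind keeps long (double) vowels from triggering it.
    return re.sub(r"(?<![aeiou])([aeiou])([%s])(?=\+en)" % CONS, r"\1\2\2", w)

def shorten(w):
    # raak+en -> rak+en: a long vowel is written with one letter
    # before a single consonant and '+en'.
    return re.sub(r"([aeou])\1(?=[%s]\+en)" % CONS, r"\1", w)

def past_tense(w):
    # '+Te' -> '+te' after c,f,h,k,p,s,t,x (checked on the underlying
    # stem-final consonant), otherwise '+de'.
    w = re.sub(r"([cfhkpstx])\+Te", r"\1+te", w)
    return w.replace("+Te", "+de")

def devoice(w):
    # leev -> leef, leev+t -> leeft: v/z lose voicing unless a vowel follows.
    w = re.sub(r"v(?=\+[^aeiou]|$)", "f", w)
    return re.sub(r"z(?=\+[^aeiou]|$)", "s", w)

def surface(w):
    for rule in (double, shorten, past_tense, devoice):
        w = rule(w)
    return w.replace("+", "")  # finally drop the morpheme boundary
```

Note that past_tense must apply before devoice, so that leev+Te selects '+de' on the underlying 'v' and only then surfaces as leefde.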
The best solutions had less than 0.5% error-rate when tested on the unseen data.</Paragraph> </Section> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> Footnote 4 </SectionTitle> <Paragraph position="0"> Reliable extraction of this information from Celex turned out to be non-trivial. Inflected forms are given in the database, and linked to their (abstract) stem by means of an index. However, the distinction between weak and strong past tenses is not marked explicitly in the database, and thus we had to use the heuristic that weak past tense singular forms always end in 'te' or 'de', while strong past tense forms do not. Another problem is the fact that different spellings of a word are linked to the same index. Thus, 'scalperen' (to scalp) is linked to the stem 'skalpeer'. For the purposes of this exercise, such variation was largely eliminated by several ad-hoc filters.</Paragraph> </Section> <Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Grammar Development </SectionTitle> <Paragraph position="0"> Natural language applications which perform syntactic analysis can be based on crude methods, such as key-word spotting and pattern matching; more advanced but computationally efficient methods, such as finite-state syntactic analysis; or linguistically motivated methods, such as unification-based grammars. At the low end of the scale are systems which perform partial syntactic analysis of unrestricted text (chunking), for instance for recognition of names or temporal expressions, or NP-constituents in general. At the high end of the scale are wide-coverage (unification-based) grammars which perform full syntactic analysis, sometimes for unrestricted text. In the exercises below, students develop a simple grammar on the basis of real data and learn to work with tools for developing sophisticated, linguistically motivated grammars. 
</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Exercise III: Recognizing temporal expressions </SectionTitle> <Paragraph position="0"> A relatively straightforward exercise in grammar development is to encode the grammar of Dutch temporal expressions in the form of a context-free grammar.</Paragraph> <Paragraph position="1"> In this particular case, the grammar is actually implemented as a Prolog definite clause grammar (DCG). While the top-down, backtracking search strategy of Prolog has certain drawbacks (most notably the fact that it will fail to terminate on left-recursive rules), using DCG has the advantage that its relationship to context-free grammar is relatively transparent, it is easy to use, and it provides some of the concepts also used in more advanced unification-based frameworks. The fact that the non-terminal symbols of the grammar are Prolog terms also provides a natural means for adding annotation in the form of parse-trees or semantics.</Paragraph> <Paragraph position="2"> The task of this exercise was to develop a grammar for Dutch temporal expressions which covers all instances of such expressions found in spoken language. The more trivial part of the lexicon was given and a precise format was defined for semantics. The format of the grammar to be developed is illustrated in figure 5. The top rule rewrites a temp_expr as a weekday, followed by a date, followed by an hour. An hour rewrites as the ad-hoc category approximately (containing several words which are not crucial for semantics but which frequently occur in spontaneous utterances) and an hour1 category, which in turn can rewrite as a category hour_lex followed by the word uur, followed by a min_lex. Assuming suitable definitions for the lexical (pre-terminal) categories, this will generate such strings as zondag vijf januari omstreeks tien uur vijftien (Sunday, January the fifth, at ten fifteen). 
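The behaviour of such a DCG can be imitated with a small recursive-descent parser. In the following Python sketch, the lexicon, the rule inventory, and the semantic terms are invented stand-ins for the real exercise material, covering only a temp_expr consisting of an optional weekday and an hour:

```python
# Invented toy lexicon (the real exercise supplies a much larger one).
HOUR_LEX = {"tien": 10, "elf": 11}
MIN_LEX = {"vijftien": 15, "dertig": 30}
WEEKDAYS = {"zondag": 1, "maandag": 2}

def parse_hour(words):
    # hour -> hour_lex 'uur' min_lex
    if len(words) == 3 and words[1] == "uur":
        h, m = HOUR_LEX.get(words[0]), MIN_LEX.get(words[2])
        if h is not None and m is not None:
            return ("hour", h, m)
    return None  # parse failure

def parse_temp_expr(words):
    # temp_expr -> weekday hour | hour
    if words and words[0] in WEEKDAYS:
        rest = parse_hour(words[1:])
        return ("temp", WEEKDAYS[words[0]], rest) if rest else None
    return parse_hour(words)
```

In the actual exercise the same effect is obtained for free from Prolog's DCG mechanism, with the semantics built up in the non-terminals' arguments.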
A more or less complete grammar of temporal expressions of this sort typically contains between 20 and 40 rules.</Paragraph> <Paragraph position="5"> A test-corpus was constructed by extracting 2.500 utterances containing at least one lexical item signalling a temporal expression (such as a weekday, a month, or words such as uur, minuut, week, morgen, kwart, omstreeks, etc.) from a corpus of dialogues collected from a railway timetable information service. A subset of 200 utterances was annotated.</Paragraph> <Paragraph position="6"> The annotation indicates which part of the utterance is the temporal expression, and its semantics. An example is given below.</Paragraph> <Paragraph position="7"> sentence(42, \[ja,ik,wil,reizen,op,zesentwintig,januari,s_morgens,om,tien,uur,vertrekken\], \[op,zesentwintig,januari,s_morgens,om,tien,uur\], temp(date(_,1,26), day( .... 2), hour(10,_))).</Paragraph> <Paragraph position="8"> The raw utterances and 100 annotated utterances were made available to students. A grammar can now be tested by evaluating how well it manages to spot temporal phrases within an utterance and assign the correct semantics to them. To this end, a parsing scheme was used which returned the (left-most) substring of the utterance that could be parsed as a temporal expression. This result was compared with the annotation, thus providing a measure for 'word accuracy' and 'semantic accuracy' of the grammar. The best solutions achieved over 95% word and semantic accuracy.</Paragraph> <Paragraph position="11"> Exercise IV: Unification grammar Linguistically motivated grammars are almost without exception based on some variant of unification grammar (Shieber, 1986). Head-driven phrase structure grammar (HPSG) (Pollard and Sag, 1994) is often taken as the theoretical basis for such grammars. 
Although a complete introduction to the linguistic reasoning underlying such a framework is beyond the scope of this course, as part of a computational linguistics class students should at least gain familiarity with the core concepts of unification grammar and some of the techniques frequently used to implement specific linguistic analyses (underspecification, inheritance, gap-threading, unary rules, empty elements, etc.).</Paragraph> <Paragraph position="12"> To this end, we developed a core grammar of Dutch, demonstrating how subject-verb agreement, number and gender agreement within NP's, and subcategorization can be accounted for. Furthermore, it illustrates how a simplified form of gap-threading can be used to deal with unbounded dependencies, how the movement account for the position of the finite verb in main and subordinate clauses can be mimicked using an 'empty verb' and some feature passing, and how auxiliary-participle combinations can be described using a 'verbal complex'. The design of the grammar is similar to the OVIS grammar (van Noord et al., 1999), in that it uses rules with a relatively specific context-free backbone. Inheritance of rules from more general 'schemata' and 'principles' is used to add feature constraints to these rules without redundancy. The schemata and principles, as well as many details of the analysis, are based on HPSG.</Paragraph> <Paragraph position="13"> Figure 6 illustrates the general format of phrase structure schemata and feature constraints.</Paragraph> <Paragraph position="14"> The grammar fragment is implemented using the HDRUG development system 5 (van Noord and Bouma, 1997). HDRUG provides a description language for feature constraints, allows rules, lexical entries, and 'schemata' or 'principles' to be visualised in the form of feature matrices, and provides an environment for processing example sentences which supports the display of derivation trees and partial parse results (chart items). 
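The core operation behind all of this is unification of feature structures. A textbook sketch in Python — nested dicts as feature structures, None as failure; this is emphatically not HDRUG's implementation, which also handles reentrancy and typed feature descriptions:

```python
# Destructive-free unification of feature structures encoded as nested dicts.
# Atomic values unify iff equal; dicts unify feature-by-feature.
def unify(a, b):
    if a is None or b is None:
        return None  # propagate failure
    if not isinstance(a, dict) or not isinstance(b, dict):
        return a if a == b else None
    out = dict(a)
    for k, v in b.items():
        out[k] = unify(out[k], v) if k in out else v
        if out[k] is None:
            return None  # feature clash
    return out
```

Subject-verb agreement, for instance, then amounts to unifying the subject's agreement structure with the verb's.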
A screen-shot of HDRUG is given in figure 7.</Paragraph> <Paragraph position="15"> As an exercise, students had to extend the core fragment with rules and lexical entries for additional phrasal categories (PP's), verbal subcategorization types (verbs selecting for a PP-complement), NP constructions (determiner-less NP's), verb clusters (modal+infinitive combinations), and WH-words (wie, wat, welke, wiens, hoeveel, ... (who, what, which, whose, how many, ...)). To test the resulting fragment, students were also given a suite of example sentences which had to be accepted, as well as a suite of ungrammatical sentences. Both test suites were small (consisting of less than 20 sentences each) and constructed by hand. This reflects the fact that this exercise is primarily concerned with the implementation of a sophisticated linguistic analysis.</Paragraph> </Section> </Section> <Section position="7" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Natural Language Interfaces </SectionTitle> <Paragraph position="0"> Practical courses in natural language interfaces or computational semantics (Pereira and Shieber, 1987; Blackburn and Bos, 1998) have used a toy database, such as a geographical database or an excerpt of a movie script, as application domain. The growing amount of information available on the internet provides opportunities for accessing much larger databases (such as public transport time-tables or library catalogues), and therefore for developing more realistic applications. In addition, many web-sites provide information which is essentially dynamic (weather forecasts, stock-market information, etc.), which means that applications can be developed which go beyond querying or summarising pre-defined sets of data. In this section, we describe two exercises in which a natural language interface for web-accessible information is developed. 
In both cases we used the PILLOW package 6 (Cabeza et al., 1996) to access data on the web and translate the resulting HTML-code into Prolog facts.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 Exercise V: Natural Language Generation </SectionTitle> <Paragraph position="0"> Reiter and Dale (1997) argue that the generation of natural language reports from a database with numerical data can often be based on low-tech language engineering techniques such as pattern matching and template filling. Sites which provide access to numerical data which is subject to change over time, such as weather forecasts or stock quotes, provide an excellent application domain for a simple exercise in language generation.</Paragraph> <Paragraph position="1"> For instance, in one exercise, students were asked to develop a weather forecast generator, which takes the long-term (5 day) forecast of the Dutch meteorological institute, KNMI, and produces a short text describing the weather of the coming days. Students were given a set of pre-collected numerical data as well as the text of the corresponding weather forecasts as produced by the KNMI. These texts served as a 'target corpus', i.e. as an informal definition of what the automatic generation component should be able to produce.</Paragraph> <Paragraph position="2"> To produce a report generator involved the implementation of 'domain knowledge' (a 70% chance of rain means that it is 'rainy'; if maximum and minimum temperatures do not vary more than 2 degrees, the temperature remains the same, else there is a change in temperature that needs to be reported; etc.) and rules which apply the domain knowledge to produce a coherent report. The latter rules could be any combination of format or write instructions and more advanced techniques based on, say, definite clause grammar. 
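A minimal version of such a generator fits in a few lines. In the Python sketch below, the thresholds, field names, and English phrasing are invented for illustration — they are not the KNMI's:

```python
# Toy template-filling report generator in the spirit of the exercise.
# 'day' is a dict with hypothetical fields: rain_chance (%), tmin, tmax (degrees C).
def describe(day):
    # Domain knowledge: classify rain chance and temperature change.
    sky = "rainy" if day["rain_chance"] >= 70 else "mostly dry"
    if abs(day["tmax"] - day["tmin"]) <= 2:
        temp = "temperatures remain around %d degrees" % day["tmax"]
    else:
        temp = "temperatures between %d and %d degrees" % (day["tmin"], day["tmax"])
    # Template filling: realize the classification as text.
    return "Tomorrow will be %s; %s." % (sky, temp)
```

The students' generators apply the same two-step design — domain rules first, text realization second — but over the full multi-day KNMI data and in Dutch.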
The completed system can not only be tested on pre-collected material, but also on the information taken from the current KNMI web-page by using the Prolog-HTTP interface.</Paragraph> <Paragraph position="3"> A similar exercise was developed for the AEX (stock market) application described below. In this case, students were asked to write a report generator which reports the current state of affairs at the Dutch stock market AEX, using numerical data provided by the web-interface to the Dutch news service 'NOS teletext' and using similar reports on teletext itself as 'target-corpus'.</Paragraph> <Paragraph position="4"> 6 http://www.clip.dia.fi.upm.es/miscdocs/pillow/pillow.html</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.2 Exercise VI: Question answering </SectionTitle> <Paragraph position="0"> Most natural language dialogue systems are interfaces to a database. In such situations, the main task of the dialogue system is to answer questions formulated by the user.</Paragraph> <Paragraph position="1"> The construction of a question-answering system using linguistically-motivated techniques requires (minimally) a domain-specific grammar which performs semantic analysis and a component which evaluates the semantic representations output by the grammar with respect to the database. Once these basic components are working, one can try to extend and refine the system by adding (domain-specific or general) disambiguation, contextual interpretation (of pronouns, elliptic expressions, etc.), linguistically-motivated methods for formulating answers in natural language, and scripts for longer dialogues.</Paragraph> <Paragraph position="2"> In the past, we have used information about railway time-tables as application domain. 
Recently, a rich application domain was created by constructing a stock-market game, in which participants (the students taking the class and some others) were given an initial sum of money, which could be invested in shares. Participants could buy and sell shares at will. Stock quotes were obtained on-line from the news service 'NOS teletext'. Stock-quotes and transactions were collected in a database, which, after a few weeks, contained over 3000 facts.</Paragraph> <Paragraph position="3"> The unification-based grammar introduced previously (in exercise IV) was adapted for the current domain. This involved adding semantics and adding appropriate lexical entries. Furthermore, a simple question-answering module was provided, which takes the semantic representation for a question assigned by the grammar (a formula in predicate logic), transforms this into a clause which can be evaluated as a Prolog query, calls this query, and returns the answer.</Paragraph> <Paragraph position="4"> The exercise for the students was to extend the grammar with rules (syntax and semantics) to deal with adjectives, with measure phrases (vijf euro/procent (five euro/percent)), with date expressions (op vijf januari (on January 5)), and with constructions such as aandelen Philips (Philips shares) and koers van VNU (price of VNU), which were assigned a non-standard semantics. Next, the question system had to be extended so as to handle a wider range of questions. This involved mainly the addition of domain-specific translation rules. Upon completion of the exercise, question-answer pairs of the sort illustrated in figure 8 were possible. 
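The final database lookup behind such answers can be sketched as follows; the facts, numbers, and semantic shapes are invented for illustration and do not come from the actual game:

```python
# Hypothetical mini database and evaluator; in the real system the grammar's
# predicate-logic output is translated into a Prolog query over the game's facts.
PRICES = {"ABN AMRO": 17.75, "KPN": 11.20}           # invented quotes
HOLDINGS = {("Rob", "Baan"): 0, ("gb", "Baan"): 25}  # (player, stock) -> shares

def answer(sem):
    """Evaluate a toy semantic form against the database."""
    kind, *args = sem
    if kind == "price":        # wat is de koers van X?
        return PRICES.get(args[0])
    if kind == "owns":         # heeft Y aandelen X? -> ja/nee
        return "ja" if HOLDINGS.get((args[0], args[1]), 0) > 0 else "nee"
    if kind == "holders":      # welke spelers bezitten aandelen X?
        return sorted(p for (p, s), n in HOLDINGS.items() if s == args[0] and n > 0)
    return None
```

The domain-specific translation rules mentioned above do the work of mapping the grammar's logical forms onto lookups of this kind.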
Q: wat is de koers van ABN AMRO (what is the price of ABN AMRO)
A: 17,75
Q: is het aandeel KPN gisteren gestegen (have the KPN shares gone up yesterday)
A: ja (yes)
Q: heeft Rob enige aandelen Baan verkocht (has Rob sold some Baan shares)
A: nee (no)
Q: welke spelers bezitten aandelen Baan (which players possess Baan shares)
A: gb, woutr, pieter, smb
Q: hoeveel procent zijn de aandelen kpn (how many percent have the KPN shares)</Paragraph> </Section> </Section> <Section position="8" start_page="0" end_page="0" type="metho"> <SectionTitle> 5 Concluding remarks </SectionTitle> <Paragraph position="0"> Developing realistic and challenging exercises in computational linguistics requires support in the form of development tools and resources. Powerful tools are available for experimenting with finite state technology and unification-based grammars, resources can be made available easily using the internet, and current hardware allows students to work comfortably using these tools and resources.</Paragraph> <Paragraph position="1"> The introduction of such tools in introductory courses has the advantage that it provides a realistic overview of language technology research and development. Interesting application areas for natural language dialogue systems can be obtained by exploiting the fact that the internet provides access to many on-line databases. The resulting applications give access to large amounts of current and dynamic information. For educational purposes, this has the advantage that it gives a feel for the complexity and amount of work required to develop 'real' applications.</Paragraph> <Paragraph position="2"> The most important problem encountered in developing the course is the relative lack of suitable electronic resources. For Dutch, the CELEX database provides a rich source of lexical information, which can be used to develop interesting exercises in computational morphology. 
Development of similar, data-oriented, exercises in the area of computational syntax and semantics is hindered, however, by the fact that resources such as electronic dictionaries providing valence and concept information, and corpora annotated with part of speech, syntactic structure, and semantic information, are largely missing. The development of such resources would be most welcome, not only for the development of language technology for Dutch, but also for educational purposes.</Paragraph> </Section> </Paper>