<?xml version="1.0" standalone="yes"?> <Paper uid="C88-1028"> <Title>GRAFON: A Grapheme-to-Phoneme Conversion System for Dutch</Title> <Section position="4" start_page="133" end_page="133" type="metho"> <SectionTitle> 2. SYLLABIFICATION </SectionTitle> <Paragraph position="0"> Information about the position of syllable boundaries in spelling input strings is needed for several reasons. The most important of these is that most phonological rules in Dutch have the syllable as their domain. E.g. Dutch has a schwa-insertion rule inserting a schwa-like sound between a liquid and a nasal or non-coronal consonant if both consonants belong to the same syllable. Compare melk (milk): /melək/ to mel$ken (to milk): /mel$kə/ ($ indicates a syllable boundary). Without syllable structure this problem can only be resolved in an ad hoc way. Furthermore, stress assignment rules should be described in terms of syllable structure /Berendsen and Don, 1987/.</Paragraph> <Paragraph position="1"> Other rules which are often described as having the morpheme as their domain (such as devoicing of voiced obstruents at morpheme-final position and progressive and regressive assimilation) should really be described as operating on the syllable level. E.g. het$ze (/hetsə/: smear campaign; devoicing of the voiced fricative at syllable-final position) and as$best (/azbest/: asbestos; regressive assimilation). These mono-morphematic words show the effects of the phonological rules at their syllable boundaries. Furthermore, the proper target of these rules is not one phoneme, but the complete coda or onset of the syllable, which may consist of more than one phoneme.</Paragraph> <Paragraph position="2"> Although these examples show convincingly that syllable structure is necessary, they do not prove that it is central.</Paragraph> <Paragraph position="3"> However, the following observations seem to suggest the centrality of the syllable in Dutch phonemisation: - The combination of syllable structure and information about word stress seems enough to transform all spelling vowels correctly into phonemes, including Dutch grapheme <e>, which is a traditional stumbling block in Dutch phonemisation. Usually, many rules or patterns are needed to transcribe this grapheme adequately.</Paragraph> <Paragraph position="4"> - All phonological rules traditionally discussed in the literature in terms of morpheme structure can be defined straightforwardly in terms of syllable structure without generating errors.</Paragraph> <Paragraph position="5"> These facts led us to incorporate a level of syllable decomposition into the algorithm. This module takes spelling strings as input. Automatic syllabification (or hyphenation) is a notoriously thorny problem for Dutch language technology.</Paragraph> <Paragraph position="6"> Dutch syllabification is generally guided by a phonological maximal onset principle, a principle which states that between two vowels, as many consonants belong to the second syllable as can be pronounced together. This results in syllabifications like groe-nig (greenish), I-na (a name) and bad-stof (terry cloth). However, this principle is sometimes overruled by a morphological principle. Internal word boundaries (to be found after prefixes, between parts of a compound and before some suffixes) always coincide with syllable boundaries. This contradicts the syllable boundary position predicted by the maximal onset principle. E.g. groen-achtig (greenish, groe-nachtig expected), in-enten (inoculate, i-nenten expected) and stads-tuin (city garden, stad-stuin expected). In Dutch (and German and Scandinavian languages), unlike in English and French, compounding happens through concatenation of word forms (e.g. compare Dutch spelfout or German Rechtschreibungsfehler to French faute d'orthographe or English spelling error). Because of this, the default phonological principle fails in many cases (we calculated this number to be on average 6% of word forms for Dutch). A sketch of both principles is given below.</Paragraph>
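To make the interplay of the two principles concrete, the following toy Common Lisp sketch applies the maximal onset principle inside stretches delimited by internal word boundaries (marked here with '+', as they would be delivered by morphological analysis). It is an illustration only, not GRAFON's syllabification module: it treats every maximal run of vowel letters as a nucleus and uses a deliberately small, assumed whitelist of Dutch onsets.

;; Toy maximal-onset syllabifier with a morphological override.
;; Assumptions (not from the paper): '+' marks internal word boundaries,
;; nuclei are maximal runs of vowel letters, *legal-onsets* is only a
;; rough approximation of the onsets of Dutch.

(defparameter *vowels* "aeiouy")

(defparameter *legal-onsets*
  '("b" "d" "f" "g" "h" "j" "k" "l" "m" "n" "p" "r" "s" "t" "v" "w" "z"
    "bl" "br" "dr" "fl" "fr" "gl" "gr" "kl" "kn" "kr" "pl" "pr" "sl" "sn"
    "sp" "st" "str" "tr" "vl" "vr" "zw" "sch" "schr"))

(defun vowel-p (ch) (find ch *vowels*))

(defun legal-onset-p (str) (member str *legal-onsets* :test #'string=))

(defun nucleus-end (part start)
  "Index just past the next maximal run of vowel letters at or after START,
or NIL if PART contains no further vowel."
  (let ((s (position-if #'vowel-p part :start start)))
    (when s
      (or (position-if-not #'vowel-p part :start s) (length part)))))

(defun maximal-onset-split (part)
  "Split PART into syllables: give each following nucleus the longest
onset that is still in *LEGAL-ONSETS*."
  (let ((syllables '()) (start 0))
    (loop
      (let* ((n1e (nucleus-end part start))
             (n2s (and n1e (position-if #'vowel-p part :start n1e))))
        (when (null n2s)                       ; no further nucleus: done
          (push (subseq part start) syllables)
          (return (nreverse syllables)))
        (let ((cut (or (loop for i from n1e below n2s
                             when (legal-onset-p (subseq part i n2s))
                               return i)
                       n2s)))
          (push (subseq part start cut) syllables)
          (setf start cut))))))

(defun syllabify (word)
  "Internal word boundaries (#\+) always force a syllable boundary and
overrule the maximal onset principle."
  (loop with start = 0
        for plus = (position #\+ word :start start)
        append (maximal-onset-split (subseq word start plus))
        while plus
        do (setf start (1+ plus))))

;; (syllabify "badstof")      => ("bad" "stof")
;; (syllabify "groenachtig")  => ("groe" "nach" "tig")   ; phonological prediction
;; (syllabify "groen+achtig") => ("groen" "ach" "tig")   ; morphological override

The second and third calls show why the internal word boundary has to be supplied before syllabification can be correct.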
<Paragraph position="7"> We therefore need a morphological analysis program to detect internal word boundaries. By incorporating a morphological parser, the syllabification module of GRAFON is able (in principle) to find the correct syllable boundaries in the complete vocabulary of Dutch (i.e. all existing and all possible words). Difficulties remain, however, with foreign words and a pathological class of word forms with more than one possible syllabification, e.g. balletje may be hyphenated bal-let-je (small ballet) and bal-le-tje (small ball). Syllabification in languages with concatenative compounding is discussed in more detail in Daelemans (1988, forthcoming).</Paragraph> </Section> <Section position="5" start_page="133" end_page="133" type="metho"> <SectionTitle> 3. LEXICAL DATABASE </SectionTitle> <Paragraph position="0"> We use a word form dictionary instead of a morpheme dictionary. At present, some 10,000 citation forms with their associated inflected forms (computed algorithmically) are listed in the lexical database. The entries were collected by the University of Nijmegen from different sources. The choice of a word form lexical database was motivated by the following considerations: First, morphological analysis is reduced to dictionary lookup, sometimes combined with compound and affix analysis. Complex word forms (i.e. frequent compounds and word forms with affixes) are stored with their internal word boundaries. These boundaries can therefore be retrieved instead of computed. Only the structure of complex words not yet listed in the dictionary must be computed.</Paragraph> <Paragraph position="1"> This makes morphological decomposition computationally less expensive.</Paragraph> <Paragraph position="2"> Second, the number of errors in morphological parsing owing to overacceptance and nonsense analyses is considerably reduced. Traditional erroneous analyses of systems using a morpheme-based lexicon like comput+er and under+stand, or for Dutch kwart+el (quarter yard instead of quail) and li+epen (plural past tense of lopen, to run; analysed as 'epics about the Chinese measure li') are avoided this way. Finally, current and forthcoming storage and search technology reduce the overhead involved in using large lexical databases considerably.</Paragraph> <Paragraph position="3"> Notice that the presence of a lexical database suggests a simpler solution to the phonemisation problem: we could simply store the transcription with each entry (this lexicon-based approach is pursued for Dutch by Lammens, 1987). However, we need the algorithm to compute these transcriptions automatically, and to compute transcriptions of (new) words not listed in the lexical database. Furthermore, the absence of a detailed rule set makes a lexicon-based approach less attractive from a linguistic point of view. Also, from a technological point of view it is a shortcoming that the phonetic detail of the transcription cannot be varied for different applications.</Paragraph>
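As an illustration of what a stored word-form entry might look like, here is a minimal Common Lisp sketch of the record level of such a lexicon. The field names and example values are invented for this sketch (the paper does not specify GRAFON's record layout); the point is only that internal word boundaries and word stress are retrieved from storage rather than recomputed.

;; Hypothetical record layout for a word-form lexicon; not GRAFON's.

(defstruct word-form
  spelling        ; surface form, e.g. "spelfout"
  segmentation    ; spelling with internal word boundaries, e.g. "spel+fout"
  stress          ; index of the syllable carrying primary stress (1-based)
  citation-form)  ; citation form the inflected form belongs to

(defparameter *lexicon* (make-hash-table :test #'equal))

(defun add-entry (entry)
  (setf (gethash (word-form-spelling entry) *lexicon*) entry))

(defun lookup (spelling)
  "Dictionary lookup: stored entries give their internal word boundaries
and word stress for free; only unlisted forms need to be parsed."
  (gethash spelling *lexicon*))

;; Example entries (illustrative values only):
(add-entry (make-word-form :spelling "spelfout" :segmentation "spel+fout"
                           :stress 1 :citation-form "spelfout"))
(add-entry (make-word-form :spelling "liepen" :segmentation "liepen"
                           :stress 1 :citation-form "lopen"))

;; (word-form-segmentation (lookup "spelfout")) => "spel+fout"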
<Paragraph position="4"> Our lexical database system can be functionally interpreted as consisting of two layers: a static storage level in which word forms are represented as records with fields pointing to other records and fields containing various kinds of information, and a dynamic knowledge level in which word forms are instances of linguistic objects grouped in inheritance hierarchies, and have available to them (through inheritance) various kinds of linguistic knowledge and processes. This way new entries and new information associated with existing entries can be dynamically created, and (after checking by the user) stored in the lexical database.</Paragraph> <Paragraph position="5"> This lexical database architecture is described in more detail in Daelemans (1987a).</Paragraph> </Section> <Section position="6" start_page="133" end_page="133" type="metho"> <SectionTitle> 4. MORPHOLOGICAL ANALYSIS </SectionTitle> <Paragraph position="0"> Morphological analysis consists of two stages: segmentation and parsing. The segmentation routine finds possible ways in which the input string can be partitioned into dictionary entries (working from right to left). In the present application, segmentation stops with the 'longest' solution. Continuing to look for analyses with smaller dictionary entries leads to a considerable loss in processing efficiency and an increased risk of nonsense analyses. The loss in accuracy is minimal (recall that the internal structure of word forms listed in the lexical database can be retrieved).</Paragraph> <Paragraph position="1"> Some features were incorporated to constrain the number of dictionary lookups necessary: the most efficient of these are a phonotactic check (strings which do not conform to the morpheme structure conditions of Dutch are not looked up), and a special memory buffer (substrings already looked up are cached with the result of their lookup; during segmentation, the same substrings are often looked up more than once).</Paragraph> <Paragraph position="2"> The parsing part of morphological analysis uses a compound grammar and a chart parser formalism to accept or reject combinations of dictionary entries. It works from left to right. It also takes into account spelling changes which may occur at the boundary of two parts of a compound (these are called linking graphemes, e.g. hemelSblauw: sky-blue, eiERdooier: egg-yolk).</Paragraph> <Paragraph position="3"> During dictionary lookup, word stress is retrieved for the dictionary entries (this part of the process could be replaced by additional rules, but as word stress was available in the lexical database, we only had to define the rules for stress assignment in new compounds). A sketch of the segmentation stage is given below.</Paragraph> </Section>
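The segmentation stage can be pictured with the following Common Lisp sketch: a right-to-left search that always tries the longest dictionary suffix first and caches every substring it has already processed in a memo buffer. It is only a schematic rendering of the description above; the toy dictionary is hypothetical, and the phonotactic check and the subsequent chart parsing of candidate segmentations are omitted.

;; Sketch of right-to-left segmentation with memoisation; not GRAFON's code.

(defparameter *toy-dictionary*
  (let ((table (make-hash-table :test #'equal)))
    (dolist (w '("stad" "stads" "tuin" "in" "enten" "groen" "achtig") table)
      (setf (gethash w table) t)))
  "A toy word-form dictionary (hypothetical entries).")

(defparameter *segment-memo* (make-hash-table :test #'equal)
  "Substrings already looked up / segmented, cached with their result.")

(defun segment (string)
  "Partition STRING into dictionary entries, working from right to left and
preferring the longest suffix entry; return NIL if no analysis is found."
  (when (string= string "")
    (return-from segment '()))
  (multiple-value-bind (cached found) (gethash string *segment-memo*)
    (when found (return-from segment cached)))
  (let ((result
          (loop for cut from 0 below (length string)
                for suffix = (subseq string cut)
                when (gethash suffix *toy-dictionary*)
                  do (let ((rest (segment (subseq string 0 cut))))
                       (when (or rest (zerop cut))
                         (return (append rest (list suffix))))))))
    (setf (gethash string *segment-memo*) result)
    result))

;; (segment "stadstuin") => ("stads" "tuin")   ; cf. stads-tuin above
;; (segment "inenten")   => ("in" "enten")     ; cf. in-enten above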
<Section position="7" start_page="133" end_page="133" type="metho"> <SectionTitle> 5. PHONOLOGICAL KNOWLEDGE </SectionTitle> <Paragraph position="0"> Knowledge about Dutch phonemes is implemented by means of a type hierarchy, by inheritance and by associating features with objects, in a standard object-oriented way. Information about a particular phonological object can be available through feature inheritance, by computing a method or by returning the stored value of a feature. However, the exact way information from the phonological knowledge base is retrieved is hidden from the user. An independent interface to the knowledge base is defined, consisting of simple LISP-like predicates and (transformation) functions in a uniform format, e.g. (obstruent? x), (syllabic? x), (make-voiced x) etc. The answer can be true, false, a numerical value when a gradation is used, a special message (undefined), or, in the case of transformation functions, a phoneme or string of phonemes. These functions and predicates, combined with the Boolean operators AND, OR and NOT, are used to write the conditions and actions of the phonological rules. The interface allows us to model different theoretical formalisms using the same knowledge base. E.g. the generative phonology formalism can be modelled at the level of the interface functions. The morphological analysis and syllabification stages in the algorithm output a list of syllables in which internal and external word boundaries and word stress are marked. Each syllable becomes an instance of the object type syllable, which has a set of features associated with it: Figure 2 lists these features and their values for one particular syllable.</Paragraph> <Paragraph position="1"> Figure 2: feature values after transcription.</Paragraph> <Paragraph position="2"> The value of some of these features is set by means of information in the input: spelling, closed? (true if the syllable ends in a consonant), stressed? (1 if the syllable carries primary stress, 2 if it carries secondary stress), previous-syllable and next-syllable (pointers to the neighbouring syllables), external-word-boundary? (true if an external word boundary follows), internal-word-boundary? (true if an internal word boundary follows). The values of these features are used by the transliteration and phonological rules. Of other features, the value must be computed: structure is computed on the basis of the spelling feature. The value of this feature reflects the internal structure of the spelling syllable in terms of onset, nucleus and coda. The features onset, nucleus and coda (this time referring to the phonological syllable) are computed by means of the transliteration and phonological rules. Their initial values are the spelling, their final values are the transcription. The rules have access to the value of these features and may change it. The feature transcription stands for the concatenation of the final or intermediate values of onset, nucleus and coda.</Paragraph> <Paragraph position="3"> Transliteration rules are mappings from elements of syllable structure to their phonological counterpart. E.g. the syllable onset <sch> is mapped to /sX/, nucleus <ie> to /i/, and coda <x> to /ks/. Conditions can be added to make the mapping context-sensitive: onset <c> is mapped to /s/ if a front vowel follows, and to /k/ if a back vowel follows.</Paragraph> <Paragraph position="4"> There are about forty transliteration mappings.</Paragraph> <Paragraph position="5"> The phonological rules apply to the output of the transliteration mappings (which may be regarded as some kind of broad transcription). They are sequentially ordered. Each rule is an instance of the object type phonological-rule, which has six features: active-p, domain, application, conditions, actions and examples. A rule can be made active or inactive depending on the value of active-p. If it is true, sending an application message to the rule results in checking the conditions on a part of the input string constrained by domain (which at present can be syllable, morpheme, word or sentence). If the conditions return true, the actions expression is executed. Actions may also involve the triggering of other rules. E.g. schwa-insertion triggers re-syllabification. Conditions and actions are written in a language consisting of the phonological functions and predicates mentioned earlier (they access the phonological knowledge base and features of syllables), Boolean connectors, and simple string-manipulation functions (first, last etc.). After successful application of a rule, the input string to which it was applied is stored in the examples feature. This way, interesting data about the operation of the rule is available from the rule itself. In Figure 3 some examples of rules are shown (LET syntax is used for local variable binding, but is not strictly needed; SYL is bound to the current syllable). Different notations for these rules are possible, e.g. the similarity between both rules could be exploited to merge them into one rule. A simplified rule object along these lines is sketched below.</Paragraph> </Section>
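The following Common Lisp sketch mimics the shape of such a rule object, using DEFSTRUCT rather than the Flavors objects of the actual system. The feature names active-p, domain, conditions, actions and examples follow the text (the application message is rendered here as the function APPLY-RULE, and a name slot is added for convenience); the little phoneme table and the helpers VOICED-OBSTRUENT? and MAKE-VOICELESS are invented stand-ins for the knowledge-base interface functions, and the rule shown is a bare-bones final devoicing rule operating on the coda of a syllable.

;; Sketch only: DEFSTRUCT instead of Flavors, invented interface helpers.

(defparameter *voiceless-counterpart*
  '((#\b . #\p) (#\d . #\t) (#\v . #\f) (#\z . #\s) (#\g . #\x))
  "Toy phoneme knowledge: voiced obstruents and their voiceless partners.")

(defun voiced-obstruent? (segment)
  (assoc segment *voiceless-counterpart*))

(defun make-voiceless (segment)
  (or (cdr (assoc segment *voiceless-counterpart*)) segment))

(defstruct syllable onset nucleus coda)   ; only the features the rule needs

(defstruct (phonological-rule (:conc-name rule-))
  name (active-p t) domain conditions actions (examples '()))

(defun apply-rule (rule syl)
  "Send an application message: check the conditions on SYL and, if they
hold, execute the actions and remember SYL in the rule's examples feature."
  (when (and (rule-active-p rule)
             (funcall (rule-conditions rule) syl))
    (funcall (rule-actions rule) syl)
    (push (copy-syllable syl) (rule-examples rule)))
  syl)

(defparameter *final-devoicing*
  (make-phonological-rule
   :name 'final-devoicing
   :domain 'syllable
   ;; condition: the last segment of the coda is a voiced obstruent
   :conditions (lambda (syl)
                 (let ((last (car (last (syllable-coda syl)))))
                   (and last (voiced-obstruent? last))))
   ;; action: replace it by its voiceless counterpart
   :actions (lambda (syl)
              (let ((coda (syllable-coda syl)))
                (setf (car (last coda)) (make-voiceless (car (last coda))))))))

;; (let ((syl (make-syllable :onset '(#\b) :nucleus '(#\e) :coda '(#\d))))
;;   (apply-rule *final-devoicing* syl)
;;   (syllable-coda syl))            ; => (#\t), cf. bed -> /bet/

Blocking a rule, as described for the applications below, amounts to setting its active-p feature to false.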
<Section position="8" start_page="133" end_page="133" type="metho"> <SectionTitle> 6. RELATED RESEARCH </SectionTitle> <Paragraph position="0"> In the pattern recognition approach advocated by Martin Boot (1984), it is argued that affix-stripping rules (without using a dictionary) and a set of context-sensitive pattern matching rules suffice to phonemise spelling input. Boot states that 'there is no linguistic motivation for a phonemisation model in which syllabification plays a significant role'. We do not share this view.</Paragraph> <Paragraph position="1"> In a rule compiler approach (e.g. Kerkhoff, Wester and Boves, 1984; Berendsen, Langeweg and van Leeuwen, 1986), rules in a particular format (most often generative phonology) are compiled into a program, thereby making a strict distinction between the linguistic and computational parts of the system. None of the existing systems incorporates a full morphological analysis. The importance of morphological boundaries is acknowledged, but actual analysis is restricted to a number of (overgenerating) pattern matching rules. Another serious disadvantage is that the user (the linguist) is restricted in a compiler approach to the particular formalism the compiler knows. It would be impossible, for instance, to incorporate theoretical insights from autosegmental and metrical phonology in a straightforward way into existing prototypes.</Paragraph> <Paragraph position="2"> In GRAFON, on the other hand, the phonological knowledge base can be easily extended with new objects and relations between objects, and even at the level of the function and predicate interface, some theoretical modelling can be done.</Paragraph> <Paragraph position="3"> This flexibility is paid for, however, with higher demands on the linguist working with the system, as he should be able to write rules in a LISP-like applicative language. However, we hope to have shown from the examples of rules in Figure 3 that the complexity is not insurmountable.</Paragraph> </Section>
<Section position="9" start_page="133" end_page="136" type="metho"> <SectionTitle> 7. APPLICATIONS </SectionTitle> <Paragraph position="0"> Apart from its evident role as the linguistic part of a text-to-speech system, GRAFON has also been used in other applications.</Paragraph> <Section position="1" start_page="133" end_page="136" type="sub_section"> <SectionTitle> 7.1. Linguistic Tool </SectionTitle> <Paragraph position="0"> One advantage of computer models of linguistic phenomena is the framework they present for developing, testing and evaluating linguistic theories. To be used as a linguistic tool, a natural language processing system should at least meet the following requirements: easy modification of rules should be possible, and traces of rule application should be made visible.</Paragraph> <Paragraph position="1"> In GRAFON, rules can be easily modified both at the macro level (reordering, removing and adding rules) and the micro level (reordering, removing and adding conditions and actions). The scope (domain) of a rule can be varied as well. Possible domains at present are the syllable, the morpheme, the word and the sentence. Furthermore, the application of various rules to an input string is automatically traced and this derivation can be made visible. For each phonological rule, GRAFON keeps a list of all input strings to which the rule applies. This is advantageous when complex rule interactions must be studied. Figure 4 shows the user interface with some output by the program. Apart from the changing of rules, the derivation, and the example list for each different rule, the system also offers menu-based facilities for manipulating various parameters used in the hyphenation, parsing and conversion algorithms, and for compiling and showing statistical information on the distribution of allophones and diphones in a corpus.</Paragraph> </Section> <Section position="2" start_page="136" end_page="136" type="sub_section"> <SectionTitle> 7.2. Dictionary Construction </SectionTitle> <Paragraph position="0"> Output of GRAFON was used (after manual checking) by a Dutch lexicographic firm for the construction of the pronunciation representation of Dutch entries in a Dutch-French translation dictionary. The program turned out to be easily adaptable to the requirements by blocking rules which would lead to too much phonetic detail, and by changing the domain of others (e.g. the scope of assimilation rules was restricted to internal word boundaries). The accuracy of the program on the 100,000 word corpus was more than 99%, disregarding loan words. The phonemisation system also plays a central role in the dynamic part of the lexical database architecture we have described elsewhere /Daelemans, 1987a/.</Paragraph> </Section> <Section position="3" start_page="136" end_page="136" type="sub_section"> <SectionTitle> 7.3. Spelling Error Correction </SectionTitle> <Paragraph position="0"> A spelling error correction algorithm based on the idea that people write what they hear if they do not know the spelling of a word has been developed by Van Berkel /Van Berkel and De Smedt, 1988/. A dictionary is used in which the word forms have been transformed into phoneme representations with a simplified and adapted version of GRAFON. A possible error is transformed with the same algorithm and matched to the dictionary entries. Combined with a trigram (or rather triphone) method, this system can correct both spelling and typing errors at a reasonable speed. A sketch of the underlying idea is given below.</Paragraph>
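The idea can be pictured with the following Common Lisp sketch: index the dictionary by a phoneme key and let a misspelled form find its targets through the same transformation. The function PHONEMISE below is a deliberately crude, invented stand-in for the simplified GRAFON transcription (it only merges a few Dutch spellings that sound alike), and the triphone matching used for forms whose key is not found is left out.

;; Sketch of correction via phonemisation; PHONEMISE is a crude stand-in
;; for the simplified GRAFON transcription, not the real algorithm.

(defun phonemise (word)
  "Map WORD to a toy 'phoneme' key, merging spellings that sound alike
in Dutch (ei/ij, ou/au, c/k)."
  (let ((w (string-downcase word))
        (out '())
        (i 0))
    (loop while (< i (length w))
          do (let ((two (when (<= (+ i 2) (length w))
                          (subseq w i (+ i 2)))))
               (cond ((and two (member two '("ei" "ij") :test #'string=))
                      (push #\E out) (incf i 2))
                     ((and two (member two '("ou" "au") :test #'string=))
                      (push #\A out) (incf i 2))
                     ((char= (char w i) #\c)
                      (push #\k out) (incf i))
                     (t (push (char w i) out) (incf i)))))
    (coerce (nreverse out) 'string)))

(defparameter *by-sound* (make-hash-table :test #'equal)
  "Dictionary word forms indexed by their phonemisation.")

(defun index-word (word)
  (push word (gethash (phonemise word) *by-sound* '())))

(defun correct (form)
  "Return the dictionary words whose phonemisation matches that of FORM."
  (gethash (phonemise form) *by-sound*))

;; (mapc #'index-word '("eik" "ouder" "klank"))
;; (correct "ijk")   => ("eik")
;; (correct "auder") => ("ouder")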
<Paragraph position="1"> 8. IMPLEMENTATION AND ACCURACY
GRAFON was written in ZetaLisp and Flavors and runs on a Symbolics Lisp Machine. The lexical database is stored on a SUN Workstation and organised indexed-sequentially.</Paragraph> <Paragraph position="2"> Accuracy measures (on randomly chosen Dutch text) are encouraging: in a recent test on a 1000 word text, 99.26% of phonemes and 97.62% of transcribed word tokens generated by GRAFON were judged correct by an independent linguist.</Paragraph> <Paragraph position="3"> The main source of errors by the program was the presence of foreign words in the text (mostly of English and French origin). Only a marginal number of errors was caused by morphological analysis, syllabification or phonological rule application.</Paragraph> <Paragraph position="4"> There is at present one serious restriction on the system: no syntactic analysis is available and therefore no sophisticated intonation patterns and sentence accent can be computed. Moreover, it is impossible to experiment with the Phi (the phonological phrase, which may restrict sandhi processes in Dutch) as a domain for phonological rules. However, recently a theory has been put forward by Kager and Quené (1987) in which it is claimed that sentence accent, Phi boundaries and I (intonational phrase) boundaries can be computed without exhaustive syntactic analysis. The information needed is restricted to the difference between function and content words, the category of function words, and the difference between verbs and other content words. All this information is accessible in the current implementation of GRAFON through dictionary lookup.</Paragraph> <Paragraph position="5"> Figure 4: GRAFON has computed an internal representation and transcription of a sentence fragment. A derivation is also printed. In the centre of the display, a menu listing all phonological rules is shown. By clicking on a rule, the user gets a list of input phrases to which the rule has been applied (middle left). The same list of rules is also given in the top right menu. This time, the application of individual rules can be blocked, and the result of this can be studied. The chart bottom right shows the frequency distribution of phonemes for the current session.</Paragraph> </Section> </Section> <Section position="10" start_page="136" end_page="137" type="metho"> <SectionTitle> 9. CONCLUSIONS </SectionTitle> <Paragraph position="0"> It seems that high quality phonemisation for Dutch can be achieved only by incorporating enough linguistic knowledge (about syllable boundaries, internal word boundaries etc.). GRAFON is a first step in this direction.</Paragraph> <Paragraph position="1"> Although it lacks some sources of knowledge (notably about sentence accent and syntactic structure), a transcription of high quality and accuracy can already be obtained, and the system was successfully applied in practical tasks like rule testing, dictionary construction and spelling error correction.</Paragraph> <Paragraph position="2"> At present, we are working on the integration of a syntactic parser into GRAFON. This would make available the phonological phrase as a domain, and would make the computation of natural intonation patterns possible (using e.g. the algorithm developed in Van Wijk and Kempen, 1987). The alternative approach to the computation of phonological phrase boundaries /Kager and Quené, 1987/ is also being explored; a sketch of the kind of information it relies on is given below.</Paragraph>
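As a small illustration of why this is attractive, the Common Lisp sketch below shows the kind of input such an approach needs (word classes obtainable by dictionary lookup, no parse tree) and the kind of output (sentence accents and Phi boundaries) GRAFON could consume. The heuristic itself is invented for the illustration and is not the Kager and Quené algorithm.

;; Illustrative only: accent every non-verbal content word and close a
;; phonological phrase (Phi) after every content word or verb.  The word
;; classes are the only information assumed, as in the approach cited above.

(defun accent-and-phrase (tagged-words)
  "TAGGED-WORDS is a list of (word . class) pairs, class being one of
:FUNCTION, :VERB or :CONTENT.  Return, for each word, whether it gets
sentence accent and whether a Phi boundary follows it."
  (loop for (word . class) in tagged-words
        collect (list word
                      :accent (eq class :content)
                      :phi-after (if (member class '(:content :verb)) t nil))))

;; (accent-and-phrase '(("de" . :function) ("spelfout" . :content)
;;                      ("werd" . :verb) ("verbeterd" . :content)))
;; => (("de" :ACCENT NIL :PHI-AFTER NIL) ("spelfout" :ACCENT T :PHI-AFTER T)
;;     ("werd" :ACCENT NIL :PHI-AFTER T) ("verbeterd" :ACCENT T :PHI-AFTER T))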
<Paragraph position="3"> Another (more trivial) extension is the addition of preprocessors for the verbal expansion of abbreviations and numbers. The specifications of a lexical analyser providing this functionality were provided in Daelemans (1987b). An overview of the system, including the modules we are presently working on, is given in Figure 5.</Paragraph> </Section> <Section position="11" start_page="137" end_page="137" type="metho"> <SectionTitle> Figure 5: Overview of the GRAFON system (modules include SENTENCE ACCENT ASSIGNMENT and INTONATION CONTOUR COMPUTATION) </SectionTitle> <Paragraph position="0"/> </Section> </Paper>