<?xml version="1.0" standalone="yes"?>
<Paper uid="P88-1001">
  <Title>ADAPTING AN ENGLISH MORPHOLOGICAL ANALYZER FOR FRENCH</Title>
  <Section position="1" start_page="0" end_page="0" type="metho">
    <SectionTitle>
ADAPTING AN ENGLISH MORPHOLOGICAL ANALYZER FOR
FRENCH
</SectionTitle>
    <Paragraph position="0"> Roy J. Byrd and Evelyne Tzoukermann</Paragraph>
  </Section>
  <Section position="2" start_page="0" end_page="0" type="metho">
    <SectionTitle>
IBM Research
</SectionTitle>
    <Paragraph position="0"> IBM Thomas J. Watson Research Center, Yorktown Heights, New York 10598</Paragraph>
  </Section>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
ABSTRACT
</SectionTitle>
    <Paragraph position="0"> A word-based morphological analyzer and a dictionary for recognizing inflected forms of French words have been built by adapting the UDICT system. We describe the adaptations, emphasizing mechanisms developed to handle French verbs. This work lays the groundwork for doing French derivational morphology and morphology for other languages.</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
1. Introduction.
</SectionTitle>
    <Paragraph position="0"> UDICT is a dictionary system intended to support the lexical needs of computer programs that do natural language processing (NLP). Its first version was built for English and has been used in several systems needing a variety of information about English words (Heidorn, et al.(1982), Sowa(1984), McCord(1986), and Neff and Byrd(1987)). As described in Byrd(1986), UDICT provides a framework for supplying syntactic, semantic, phonological, and morphological information about the words it contains.</Paragraph>
    <Paragraph position="1"> Part of UDICT's apparatus is a morphological analysis subsystem capable of recognizing morphological variants of the words whose lemma forms are stored in UDICT's dictionary.</Paragraph>
    <Paragraph position="2"> The English version of this analyzer has been described in Byrd(1983) and Byrd, et al. (1986) and allows UDICT to recognize inflectionally and derivationally affixed words, compounds, and collocations. The present paper describes an effort to build a French version of UDICT. It briefly discusses the creation of the dictionary data itself and then focuses on issues raised in handling French inflectional morphology.</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2. The Dictionary.
</SectionTitle>
    <Paragraph position="0"> The primary role of the dictionary in an NLP system is to store and retrieve information about words. In order for NLP systems to be effective, their dictionaries must contain a lot of information about a lot of words. Chodorow, et al.(1985) and Byrd, et al.(1987) discuss techniques for building dictionaries with the required scope by extracting lexical information from machine-readable versions of published dictionaries. Besides serving the NLP application, some of the lexical information supports that part of the dictionary's access mechanism which permits recognition of morphological variants of the stored words. We have built a UDICT dictionary containing such morphological information for French by starting with an existing spelling correction and synonym aid dictionary and by adding words and information from the French-English dictionary in Collins(1978).</Paragraph>
    <Paragraph position="1"> French UDICT contains a data base of over 40,000 lemmata which are stored in a direct access file managed by the Dictionary Access Method (Byrd, et al. (1986)). Each entry in this file has one of the lemmata as its key and contains lexical information about that lemma. Other than the word's part-of-speech, this information is represented as binary features and multi-valued attributes. The feature information relevant for inflectional analysis includes the following: [Footnote: We are grateful to the Advanced Language Development group of Maryland, for access to their French lexical materials. Those materials include parts-of-speech and paradigm classes.]</Paragraph>
    <Paragraph position="2">  Some of these features are explicitly stored in UDICT's data base. Other features -- including many of the stored ones -- control morphological processing by being tested and set by rules in ways that will be described in the next section.</Paragraph>
    <Paragraph position="3"> Stored features and attributes which are not affected by (and do not affect) morphological processing are called &quot;morphologically neutral.&quot; Morphologically neutral information appears in UDICT's output with its stored values unaltered.</Paragraph>
    <Paragraph position="4"> Such information could include translations from a transfer dictionary in a machine translation system or selectional restrictions used by an NLP system. For French, no such information is stored now, but in other work (Byrd, et al. (1987)) we have demonstrated the feasibility of transferring some additional lexical information (for example, semantic features such as [+human]) from English UDICT via bilingual dictionaries.</Paragraph>
    <Paragraph position="6"> It may be useful to point out that, given the ability to store such information about words, one way of building a lexical subsystem would be to exhaustively list and store all inflected words in the language with their associated lexical information. There are at least three good reasons for not doing so. First, even with the availability of efficient storage and retrieval mechanisms, the number of inflected forms is prohibitively large.</Paragraph>
    <Paragraph position="7"> We estimate that the ratio of the number of French inflected forms to lemmata is around 5 (a little more for verbs, a little less for adjectives and nouns). This ratio would require our 40,000 lemmata to be stored as 200,000 entries, more than we would like. The second reason is that inflected forms sharing the same lemma also share a great deal of other lexical information: namely the morphologically neutral information mentioned earlier. Redundant storage of that information in many related inflected forms does not make sense linguistically or computationally.</Paragraph>
    <Paragraph position="8"> Furthermore, as new words are added to the dictionary, it would be an unnecessary complication to generate the inflected forms and duplicate the morphologically neutral information. Storing the information only once with the lemma and allowing it to be inherited by derived forms is a more reasonable approach. Third, it is clear that there are many regular processes at work in the formation of inflected forms from their lemmata.</Paragraph>
    <Paragraph position="9"> Discovering generalizations to capture those regularities and building computational mechanisms to handle them is an interesting task in its own right. We now turn to some of the details of that task.</Paragraph>
    <Paragraph position="10">  3. Morphological Processing.</Paragraph>
    <Paragraph position="11"> 3.1. The mechanism. The UDICT morphological analyzer assumes that words are derived from other words by affixation, following Aronoff(1976) and others. Consequently, UDICT's word grammar contains affix rules which express conditions on the base word and make assertions about the affixed word. These conditions and assertions are stated in terms of the kinds of lexical information listed in (1).</Paragraph>
    <Paragraph position="12"> An example of an affix rule is the rule for forming French plural nouns shown in Figure 1. This rule -- which, for example, derives chevaux from cheval -- consists of five parts. First, a boundary marker indicates whether the affix is a prefix or a suffix and whether it is inflectional or derivational. (Byrd(1983) describes further possible distinctions which have so far not been exploited in the French system.) Second, the affix name is an identifier which will be used to describe the morphological structure of the input word.</Paragraph>
    <Paragraph position="13"> Third, the pattern expresses string tests and modifications to be performed on the input word.</Paragraph>
    <Paragraph position="14"> In this case, the string is tested for aux at its right end (since this is a suffix rule), two characters are removed, and the letter l is appended, yielding a potential base word. This base word is looked up via a recursive invocation of the rule application mechanism which includes an attempt to retrieve the form from the dictionary of stored lemmata. The fourth part of the rule, the condition, expresses constraints which must be met by the base word. In this case, it must be a masculine singular (and not plural) noun. The fifth part of the rule, the assertion, expresses modifications to be made to the features of the base in order to describe the derived word. [Figure 1: -pn: aux2l* (noun +masc +sing -plur) (noun +plur -sing)]</Paragraph>
    <Paragraph position="16"> For this rule, the singular feature is turned off and the plural feature is turned on. Features not mentioned in the assertion retain their original values; in effect, the derived word contains inherited morphologically neutral lexical information from the base combined with information asserted by the rule.</Paragraph>
    <Paragraph position="17"> For the input chevaux (&quot;horses&quot;), the rule shown in Figure 1 will produce the following analysis: (2) chevaux: cheval(noun plur masc (structure &lt;&lt;*&gt;N -pn&gt;N)) In other words, chevaux is derived from cheval.</Paragraph>
    <Paragraph position="18"> It is a plural noun by assertion. It is masculine by inheritance. Its structure consists of the base noun cheval (represented by &quot;&lt;*&gt;N&quot;) together with the inflectional suffix &quot;-pn&quot;.</Paragraph>
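    <Paragraph> The five-part rule and the analysis in (2) can be illustrated with a minimal sketch. This is not the UDICT implementation: the names LEXICON, AffixRule, PLURAL_NOUN and analyze are hypothetical, the one-entry lexicon is an assumption made for the example, and the structure is returned as a plain string rather than in the bracketed notation of (2).

```python
from dataclasses import dataclass

# Hypothetical stand-in for the stored dictionary: lemma mapped to
# (part of speech, features). Only one entry, for illustration.
LEXICON = {"cheval": ("noun", {"masc": True, "sing": True, "plur": False})}


@dataclass
class AffixRule:
    name: str        # affix name, e.g. "-pn"
    test: str        # string required at the right end of the input
    remove: int      # number of characters to strip
    append: str      # string appended to form the candidate base
    pos: str         # part of speech required of the base
    condition: dict  # feature values the base must have
    assertion: dict  # feature values set on the derived word


# Figure 1 transcribed: -pn: aux2l* (noun +masc +sing -plur) (noun +plur -sing)
PLURAL_NOUN = AffixRule("-pn", "aux", 2, "l", "noun",
                        {"masc": True, "sing": True, "plur": False},
                        {"plur": True, "sing": False})


def analyze(word, rules):
    """Return (lemma, pos, features, structure), or None if no analysis exists."""
    if word in LEXICON:
        pos, feats = LEXICON[word]
        return word, pos, dict(feats), "*"
    for rule in rules:
        if not word.endswith(rule.test):
            continue
        base = word[:len(word) - rule.remove] + rule.append
        found = analyze(base, rules)  # the "*" in the pattern: recursive lookup
        if found is None:
            continue
        lemma, pos, feats, structure = found
        if pos != rule.pos:
            continue
        if any(feats.get(k, False) != v for k, v in rule.condition.items()):
            continue
        feats.update(rule.assertion)  # features not mentioned are inherited
        return lemma, pos, feats, f"{structure} {rule.name}"
    return None


print(analyze("chevaux", [PLURAL_NOUN]))
```

    In this sketch the masculine feature of chevaux is never asserted by the rule; it survives only because the copied feature set of cheval is updated in place, which mirrors the inheritance described above.</Paragraph>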
    <Paragraph position="19"> In order for rules such as these to operate, there is a critical dependence on having reliable and extensive lexical information about words hypothesized as bases. This information comes from three sources: the stored dictionary, redundancy rules, and other recursively applied affix rules.</Paragraph>
    <Paragraph position="20"> While the assumption that affixes derive words from other words seems entirely appropriate for English, it at first seemed less so for French. An initial temptation was to write affix rules which derived inflected words by adding affixes to non-word stems. This was especially true for verbs, where the inflected forms are often shorter than the infinitives used as lemmata, and where some of the verbs -- particularly in the third group -- have very complex paradigms. However, our rules' requirement for testable lexical information on base forms cannot be met by a system in which bases are not words. The machine-readable sources from which we build UDICT dictionaries do not contain information about non-word stems. It is furthermore difficult to design procedures for eliciting such information from native speakers, since people don't have intuitions about forms that are not words. Consequently, we have maintained the English model in which only words are stored in UDICT's dictionary. UDICT's word grammar includes redundancy rules which allow the expression of further generalizations about the properties of words. In a sense, they represent an extension of the analysis techniques used to populate the dictionary and their output could well be stored in the dictionary. The following example shows two redundancy rules in the French word grammar: (3) : 0 (adj -masc -fem) (adj +masc); : e0 (adj +masc) (adj +fem). The first rule has no boundary or affix name and its pattern does nothing to the input word. It expresses the notion that if an adjective is not explicitly marked as either masculine or feminine (the condition), then it should at least be considered masculine (the assertion). The second rule says that any masculine adjective which ends in e is also feminine. Examples are the adjectives absurde, reliable, and vaste which are both masculine and feminine.
Such rules reduce the burden on dictionary analysis techniques whose job is to determine the genders of adjectives from machine-readable resources.</Paragraph>
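    <Paragraph> The two redundancy rules in (3) can be read as default-filling steps over a feature set. The sketch below is only an illustration under assumptions: the function name apply_redundancy_rules is hypothetical, features are modelled as a dictionary of booleans, and the rules are assumed to apply in the order listed so that the second sees the output of the first.

```python
def apply_redundancy_rules(word, pos, features):
    """Return a copy of the features extended by the two rules in (3)."""
    feats = dict(features)
    if pos != "adj":
        return feats
    # Rule 1, ": 0 (adj -masc -fem) (adj +masc)":
    # an adjective marked for neither gender is taken to be masculine.
    if not feats.get("masc", False) and not feats.get("fem", False):
        feats["masc"] = True
    # Rule 2, ": e0 (adj +masc) (adj +fem)":
    # a masculine adjective ending in e is also feminine.
    if word.endswith("e") and feats.get("masc", False):
        feats["fem"] = True
    return feats


print(apply_redundancy_rules("vaste", "adj", {}))
print(apply_redundancy_rules("heureux", "adj", {}))
```

    Under these assumptions an unmarked adjective such as vaste ends up both masculine and feminine, while heureux receives only the masculine default.</Paragraph>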
    <Paragraph position="21"> For inflectional affixation, we normally derive the inflected form directly from the lemma. However, recursive rule application plays a role in the derivation of feminine and plural forms of nouns, adjectives, and participles -- which will be discussed under &quot;noun and adjective morphology&quot; -- and in our method for handling stem morphology of the French verbs belonging to the third group, which will be discussed under &quot;verb morphology&quot;.</Paragraph>
    <Paragraph position="22"> 3.2. Noun and adjective morphology. For nouns and adjectives, where inflectional changes to a word's spelling occur only at its rightmost end, the word-based model was simple to maintain.</Paragraph>
    <Paragraph position="23"> As shown in Figure 1, the pattern mechanism supports the needed tests and modifications. For recognition of feminine plurals, we treat the feminine-forming affixes as derivational ones (using an appropriate boundary), so that recursive rule application assures that they always occur &quot;inside of&quot; the plural inflectional affix. For example heureuses is analyzed as the plural of heureuse which itself is the feminine of heureux (&quot;happy&quot;). Similarly, élues (&quot;chosen or elected&quot;) is the plural of élue which, in turn, is the feminine of élu, itself analyzed as the past participle of the verb élire (&quot;to vote&quot;). The final section of the paper mentions another justification for treating feminine-forming affixes as derivational. [Figure 2: a. -vpres: ent$ (v +inf) (v -inf +ind +pres +plur +pers3) b. -vsubj: es$ (v +inf) (v -inf +subj +pres +sing +pers2) c. -vimpf: ions$ (v +inf) (v -inf +ind +impf +plur +pers1) d. -vpres: e$ (v +inf) (v -inf +ind +imp +pres +plur +pers1 +pers3) e. -vpres: ons$ (v +inf) (v -inf +ind +imp +pres +plur +pers1)]</Paragraph>
    <Paragraph position="24"> 3.3. Verb morphology. Many French verbs belonging to the first group (i.e., those whose infinitives end in -er, except for aller) show internal spelling changes when certain inflections are applied. Examples are given in (4) where the inflected forms on the right contain spelling alterations of the infinitive forms on the left.</Paragraph>
    <Paragraph position="26"> (4) a. peser - (ils) pèsent b. céder - (que tu) cèdes c. essuyer - (tu) essuies d. jeter - (je, il) jette e. placer - (nous) plaçons  These spelling changes are predictable and are not directly dependent on the particular affix that is being applied. Rather, they depend on phonological properties of the affix such as whether it is silent, which vowel it begins with, etc. There are seven such spelling rules whose job is to relate the spelling of the word part &quot;inside of&quot; the inflectional affix to its infinitive form. These rules are given informally in (5). (The sample patterns should be interpreted as in Figure 1 and are intended to suggest the strategy used to construct infinitive forms from the inflected form. &quot;C&quot; represents an arbitrary consonant, &quot;D&quot; represents t or l, and &quot;=&quot; represents a repeated letter.)  (5) spelling rules: i1yer* - change i to y and add er, as in essuies/essuyer; ç1cer* - change ç to c and add er, as in plaçons/placer; ge0r* - add r, as in mangeons/manger; èC2eCer* - remove grave accent from stem vowel and add er, as in pèsent/peser; èC2éCer* - change grave accent to acute on stem vowel and add er, as in</Paragraph>
    <Paragraph position="28"> with a consonant cluster, as in sèchent/sécher; D=1er* - remove the repeated consonant and add er, as in jette/jeter. It would be inappropriate and uneconomical to treat these spelling rules within the affix rules themselves. If we did so, the same &quot;fact&quot; would be repeated as many times as there were rules to which it applied. Rather, we handle these seven spelling rules with special logic which not only encodes the rules but also captures sequential constraints on their application: if one of them applies for a given affix, then none of the others will apply. The spelling rules are invoked from the affix rules by placing a &quot;$&quot; rather than a &quot;*&quot; in the pattern to denote a recursive lookup. In effect, the base form is looked up modulo the set of possible spelling changes. Example affix rules largely responsible for (and corresponding to) the forms shown in (4) are given in Figure 2.</Paragraph>
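    <Paragraph> Looking a base form up modulo the spelling changes can be sketched as an ordered list of rewrite attempts on the stem that remains once an affix rule has removed its ending. Everything here is an illustrative assumption rather than the special logic described above: the names INFINITIVES, SPELLING_RULES and lookup_infinitive are hypothetical, the small lemma set is invented for the example, the consonant-cluster case is folded into the fifth pattern, and a rule is taken to apply when its candidate is found in the dictionary.

```python
import re

# Hypothetical lemma list standing in for the stored dictionary.
INFINITIVES = {"peser", "céder", "essuyer", "jeter", "placer",
               "manger", "sécher", "parler"}

CONSONANT = "[bcdfghjklmnpqrstvwxz]"

# (pattern on the stem left after affix removal, replacement) pairs, in order.
SPELLING_RULES = [
    (r"i$", "yer"),                     # essui -> essuyer
    (r"ç$", "cer"),                     # plaç -> placer
    (r"ge$", "ger"),                    # mange -> manger
    (rf"è({CONSONANT})$", r"e\1er"),    # pès -> peser
    (rf"è({CONSONANT}+)$", r"é\1er"),   # cèd -> céder, sèch -> sécher
    (r"([tl])\1$", r"\1er"),            # jett -> jeter
]


def lookup_infinitive(stem):
    """Look a stem up modulo the spelling changes; return an infinitive or None."""
    plain = stem + "er"
    if plain in INFINITIVES:  # no spelling change needed
        return plain
    for pattern, replacement in SPELLING_RULES:
        candidate, count = re.subn(pattern, replacement, stem)
        if count and candidate in INFINITIVES:
            return candidate  # first successful rule wins; the rest are skipped
    return None


for stem in ["pès", "cèd", "essui", "jett", "plaç", "mange", "sèch"]:
    print(stem, lookup_infinitive(stem))
```

    Keeping the rewrites in one table, outside the affix rules, reflects the point made above: each spelling fact is stated once and shared by every affix rule that marks its lookup with &quot;$&quot;.</Paragraph>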
    <Paragraph position="29"> Verbs of the third group are highly irregular.</Paragraph>
    <Paragraph position="30"> Traditional French grammar books usually assign each verb anywhere from one to six stem forms.</Paragraph>
    <Paragraph position="31"> Some examples are given in (6).</Paragraph>
    <Paragraph position="32">  (6) stems for third group verbs: a. partir has stems par-, part-. b. savoir has stems sai-, sav-, sau-, sach-, s-. c. apercevoir, concevoir, décevoir, percevoir, recevoir have stems in -çoi-, -cev-, -çoiv-. d. contredire, dédire, dire, interdire, médire, maudire, prédire, redire have stems in -dis-, -di-, -d-. [Figure 3: a. -vcond: rions5* (v +stem -inf) (v +cond +pres +plur +pers1) b. +vstem: sau1voir* (v +inf -stem) (v +stem -inf) c. saurions: savoir(verb cond pres plur pers1 (structure &lt;&lt;*&gt;V -vcond&gt;V))]</Paragraph>
    <Paragraph position="33"> Since our derivations are to be based on lemmata, we need a way to associate infinitives with appropriate stem forms. The mechanism we have chosen is to let a special set of verb stem rules perform that association. Recognition of the inflected form of a third group verb thus becomes a two-step process. In the first step, the outermost affix is recognized, and its inner part is tested for being a valid stem. In the second step, a verb stem rule attempts to relate the stem proposed by the inflectional affix rule to an infinitive in the dictionary. If it succeeds, it marks the proposed stem as a valid one and the entire derivation succeeds.</Paragraph>
    <Paragraph position="34"> Consider, as an example, the rules and system output shown in Figure 3. During the analysis of the input saurions (&quot;(we) would know&quot;), the rule in Figure 3(a) will first recognize and remove the ending -rions, and then ask whether the resulting sau meets the condition &quot;(v +stem -inf)&quot;. Application of the verb stem rule in Figure 3(b) will successfully relate sau to savoir and assert its description to include &quot;(v +stem -inf)&quot;, thus meeting the condition of rule (a). The result will be the successful recognition of saurions with the analysis given in Figure 3(c). Note that the structure given does not mention the occurrence of the &quot;+vstem&quot; affix; this is intentional and reflects our belief that the two-level structural analysis -- inflectional affix plus infinitive lemma -- is the appropriate output for all verbs. The intermediate stem level, while important for our processing, is not shown in the output for verbs of the third group.</Paragraph>
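    <Paragraph> The two-step recognition of saurions can be sketched as follows. The data layout and the names LEMMATA, STEM_RULES, INFLECTION_RULES, stem_to_infinitive and analyze_verb are hypothetical, chosen only to mirror Figure 3; the feature tests on the stem are reduced to a successful dictionary lookup, which is a simplification of the condition and assertion mechanism.

```python
# Hypothetical dictionary of stored lemmata (infinitives only).
LEMMATA = {"savoir"}

# Verb stem rule from Figure 3(b), +vstem: sau1voir*
# (stem ending to test, characters to remove, string to append)
STEM_RULES = [("sau", 1, "voir")]

# Inflectional rule from Figure 3(a), -vcond: rions5*
# (affix name, ending to remove, features asserted on the inflected form)
INFLECTION_RULES = [("-vcond", "rions", ["cond", "pres", "plur", "pers1"])]


def stem_to_infinitive(stem):
    """Step 2: relate a proposed stem to a stored infinitive, or return None."""
    for test, remove, append in STEM_RULES:
        if stem.endswith(test):
            candidate = stem[:len(stem) - remove] + append
            if candidate in LEMMATA:
                return candidate
    return None


def analyze_verb(word):
    """Step 1: strip the outermost affix, then validate what is left as a stem."""
    for name, ending, features in INFLECTION_RULES:
        if word.endswith(ending):
            stem = word[:len(word) - len(ending)]
            infinitive = stem_to_infinitive(stem)
            if infinitive is not None:
                # The intermediate stem level is deliberately left out of the result.
                return {"lemma": infinitive,
                        "features": features,
                        "structure": ["*", name]}
    return None


print(analyze_verb("saurions"))
```

    As in Figure 3(c), the result names only the infinitive and the inflectional affix; the stem sau is used during processing but does not appear in the output.</Paragraph>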
    <Paragraph position="35"> The French word grammar contains 165 verb stem rules and another 110 affix rules for third group verbs. Given the extent of the idiosyncrasy of these verbs and their finite number (there are only about 350 of them), it is natural to wonder whether we might not do just as well by storing the inflected forms. In addition to the arguments given above (about redundant storage of morphologically neutral lexical information, etc.), we can observe that there are generalizations to be made for which treatment by rule is appropriate. The lists of verbs shown in (6c,d) have common stem patternings. Lexicalization of the derived forms of these words would not allow us to capture these generalizations or to handle the admittedly rare coinage of new words which fit these patterns.</Paragraph>
  </Section>
</Paper>