<?xml version="1.0" standalone="yes"?> <Paper uid="W05-0702"> <Title>A finite-state morphological grammar of Hebrew</Title> <Section position="3" start_page="9" end_page="11" type="metho"> <SectionTitle> 2 Finite-state technology </SectionTitle> <Paragraph position="0"> Finite-state technology (Beesley and Karttunen, 2003) solves the three problems elegantly. It provides a language of extended regular expressions which can be used to define very natural linguistically motivated grammar rules. Such expressions can then be compiled into finite-state networks (automata and transducers), on which efficient algorithms can be applied to implement both analysis and generation. Using this methodology, a computational linguist can design rules which closely follow standard linguistic notation, and automatically obtain a highly efficient morphological processor.</Paragraph> <Paragraph position="1"> While the original Two-Level formulation (Koskenniemi, 1983) of finite-state technology for morphology was not particularly well suited to Semitic languages (Lavie et al., 1988), modifications of the Two-Level paradigm and more advanced finite-state implementations have been applied successfully to a variety of Semitic languages, including Ancient Akkadian (Kataja and Koskenniemi, 1988), Syriac (Kiraz, 2000) and Arabic. In a number of works, Beesley (1996; 1998; 2001) describes a finite-state morphological analyzer of Modern Standard Arabic which handles both inflectional and derivational morphology, including interdigitation. In the following section we focus on a particular finite-state toolbox which was successfully used for Arabic.</Paragraph> <Paragraph position="2"> In this work we use XFST (Beesley and Karttunen, 2003), an extended regular expression language augmented by a sophisticated implementation of several finite-state algorithms, which can be used to compactly store and process very large-scale networks. 
XFST grammars define a binary relation (a transduction) on sets of strings: a grammar maps each member of a (possibly infinite) set of strings, known as the surface, or lower language, to a set of strings (the lexical, or upper language). The idea is that the surface language defines all and only the grammatical words in the language; and each grammatical word is associated with a set of lexical strings which constitutes its analyses. As an example, the surface string ebth may be associated by the grammar with the set of lexical strings, or analyses, depicted in figure 1.</Paragraph> <Paragraph position="3"> XFST enables the definition of variables, whose values, or denotations, are sets of strings, or languages. Grammars can set and use those variables by applying a variety of operators. For example, the concatenation operator (unfortunately indicated by a space) can be used to concatenate two languages: the expression 'A B' denotes the set of strings obtained by concatenating the strings in A with the strings in B. Similarly, the operator '|' denotes set union, '&amp;' denotes intersection, '~' set complement, '-' set difference and '*' Kleene closure; '$A' denotes the set of strings containing at least one instance of a string from A as a substring. The empty string is denoted by '0' and '?' stands for any alphabet symbol. Square brackets are used for bracketing. In addition to sets of strings, XFST enables the definition of binary relations over such sets. By default, every set is interpreted as the identity relation, whereby each string is mapped to itself. But relations can be explicitly defined using a variety of operators. The '.x.' 
operator denotes cross product: the expression 'A.x.B' denotes the relation in which each string in A is mapped to each string in B.</Paragraph> <Paragraph position="4"> An extremely useful operation is composition: denoted by '.o.', it takes two relations, A and B, and produces a new relation of pairs (a,c) such that there exists some b for which (a,b) is a member of A and (b,c) is a member of B.</Paragraph> <Paragraph position="5"> Finally, XFST also provides several replace rules.</Paragraph> <Paragraph position="6"> Expressions of the form 'A -> B || L _ R' denote the relation obtained by replacing strings from A by strings from B, whenever the former occur in the context of strings from L on the left and R on the right. Each of the context markers can be replaced by the special symbol '.#.', indicating a word boundary. For example, the expression '[h] -> [t] || ? _ .#.' replaces occurrences of 'h' by 't' whenever the former occurs at the end of a word, preceded by any symbol. Composing this rule with an (identity) relation whose strings are various words replaces final h with final t in every word that ends in h, leaving the other strings in the relation unaffected. XFST supports diverse alphabets. In particular, it supports UTF-8 encoding, which we use for Hebrew (although subsequent examples use a transliteration to facilitate readability). Also, the alphabet can include multi-character symbols; in other words, one can define alphabet symbols which consist of several (print) characters, e.g., 'number' or 'tense'. This comes in handy when tags are defined (see below).</Paragraph> <Paragraph position="7"> Characters with special meaning (such as '+' or '[') can be escaped using the symbol '%'. For example, the symbol '%+' is a literal plus sign.</Paragraph> <Paragraph position="8"> Programming in XFST is different from programming in high-level languages. 
While XFST rules are very expressive, and enable a true implementation of some linguistic phenomena, it is frequently necessary to specify, within the rules, information that is used mainly for &quot;book-keeping&quot;. Due to the limited memory of finite-state networks, such information is encoded in tags, which are multi-character symbols attached to strings. These tags can be manipulated by the rules and thus propagate information among rules. For example, nouns are specified for number, and the number feature is expressed as a concatenation of the tag number with the multi-character symbol +singular or +plural. Rules which apply to plural nouns only can use this information: if nouns is an XFST variable denoting the set of all nouns, then the expression $[number %+plural] .o. nouns denotes only the plural nouns. Once all linguistic processing is complete, &quot;book-keeping&quot; tags are erased.</Paragraph> </Section> <Section position="4" start_page="11" end_page="14" type="metho"> <SectionTitle> 3 A morphological grammar of Hebrew </SectionTitle> <Paragraph position="0"> The importance of morphological analysis as a preliminary phase in a variety of natural language processing applications cannot be overestimated. The lack of good morphological analysis and disambiguation systems for Hebrew is reported as one of the main bottlenecks of a Hebrew-to-English machine translation system (Lavie et al., 2004). The contribution of our system is manifold: * HAMSAH is the broadest-coverage and most accurate publicly available morphological analyzer of Modern Hebrew. It is based on a lexicon of over 20,000 entries, which is constantly being updated and expanded, and its set of rules covers all the morphological, morpho-phonological and orthographic phenomena observed in contemporary Hebrew texts. Compared to Segal (1997), our rules are probably similar in coverage but our lexicon is significantly larger. 
HAMSAH also supports non-standard spellings which are excluded from the work of Segal (1997).</Paragraph> <Paragraph position="1"> * The system is fully reversible: it can be used both for analysis and for generation.</Paragraph> <Paragraph position="2"> * Due to the use of finite-state technology, the system is highly efficient. While the network has close to 2 million states and over 2 million arcs, its compiled size is approximately 4Mb and analysis is extremely fast (between 50 and 100 words per second).</Paragraph> <Paragraph position="3"> * Morphological knowledge is expressed through linguistically motivated rules. To the best of our knowledge, this is the first formal grammar for the morphology of Modern Hebrew.</Paragraph> <Paragraph position="4"> The system consists of two main components: a lexicon represented in Extensible Markup Language (XML), and a set of finite-state rules, implemented in XFST. The use of XML supports standardization, allows a format that is both human and machine readable, and supports interoperability with other applications. For compatibility with the rules, the lexicon is automatically converted to XFST by dedicated programs. We briefly describe the lexicon in section 3.1 and the rules in section 3.2.</Paragraph> <Section position="1" start_page="11" end_page="13" type="sub_section"> <SectionTitle> 3.1 The lexicon </SectionTitle> <Paragraph position="0"> The lexicon is a list of lexical entries, each with a base (citation) form and a unique id. The base form of nouns and adjectives is the absolute singular masculine, and for verbs it is the third person singular masculine, past tense. It is listed in dotted and undotted script as well as using a one-to-one Latin transliteration. Figure 2 depicts the lexical entry of the word bli &quot;without&quot;. 
In subsequent examples we retain only the transliteration forms and suppress the Hebrew ones.</Paragraph> <Paragraph position="1"> The lexicon specifies morpho-syntactic features (such as gender or number), which can later be used by parsers and other applications. It also lists several lexical properties which are specifically targeted at morphological analysis. A typical example is the feminine suffix of adjectives, which can be one of h, it or t, and cannot be predicted from the base form. The lexicon lists information pertaining to non-default behavior with idiosyncratic entries.</Paragraph> <Paragraph position="2"> Adjectives inflect regularly, with few exceptions.</Paragraph> <Paragraph position="3"> Their citation form is the absolute singular masculine, which is used to generate the feminine form, the masculine plural and the feminine plural. An additional dimension is status, which can be absolute or construct. Figure 3 lists the lexicon entry of the adjective yilai &quot;supreme&quot;: its feminine form is obtained by adding the t suffix (hence feminine=&quot;t&quot;). Other features are determined by default. This lexicon entry yields yilai, yilait, yilaiim, yilaiwt etc.</Paragraph> <Paragraph position="4"> Similarly, the citation form of nouns is the absolute singular masculine form. Hebrew has grammatical gender, and the gender of nouns that denote animate entities coincides with their natural gender. The lexicon specifies the feminine suffix via the feminine attribute. Nouns regularly inflect for number, but some nouns have only a plural or only a singular form. The plural suffix (im for masculine, wt for feminine by default) is specified through the plural attribute. Figure 4 demonstrates a masculine noun with an irregular plural suffix, wt.</Paragraph> <Paragraph position="5"> Closed-class words are listed in the lexicon in a similar manner, where the specific category determines which attributes are associated with the citation form. 
For example, some adverbs inflect for person, number and gender (e.g., lav &quot;slowly&quot;), so this is indicated in the lexicon. The lexicon also specifies the person, number and gender of pronouns, the type of proper names (location, person, organization), etc. The lexical representation of verbs is more involved and is suppressed for lack of space.</Paragraph> <Paragraph position="6"> Irregularities are expressed directly in the lexicon, in the form of additional or alternative lexical entries. This is facilitated through the use of three optional elements in lexicon items: add, replace and remove. For example, the noun chriim &quot;noon&quot; is also commonly spelled chrim, so the additional spelling is specified in the lexicon, along with the standard spelling, using add. As another example, consider Segolate nouns such as bwqr &quot;morning&quot;. Its plural form is bqrim rather than the default bwqrim; such stem changing behavior is specified in the lexicon using replace. Finally, the verb ykwl &quot;can&quot; does not have imperative inflections, which are generated by default for all verbs. To prevent the default behavior, the superfluous forms are removed.</Paragraph> <Paragraph position="7"> The processing of irregular lexicon entries requires some explanation. Lexicon items containing add, remove and replace elements are included in the general lexicon without the add, remove and replace elements, which are listed in special lexicons. The general lexicon is used to build a basic morphological finite-state network. Additional networks are built using the same set of rules for the add, remove and replace lexicons. The final network is obtained by subtracting the remove network from the general one (using the set difference operator), adding the add network (using the set union operator), and finally applying priority union with the replace network. 
This final finite-state network contains only and all the valid inflected forms.</Paragraph> <Paragraph position="8"> The lexicon is represented in XML, while the morphological analyzer is implemented in XFST, so the former has to be converted to the latter. In XFST, a lexical entry is a relation which holds between the surface form of the lemma and a set of lexical strings. As a surface lemma is processed by the rules, its associated lexical strings are manipulated to reflect the impact of inflectional morphology. The surface string of XFST lexical entries is the citation form specified in the XML lexicon. Figure 5 lists the XFST representation of the lexical entry of the word bli, whose XML representation was listed in figure 2.</Paragraph> <Paragraph position="10"/> </Section> <Section position="2" start_page="13" end_page="14" type="sub_section"> <SectionTitle> 3.2 Morphological and orthographic rules </SectionTitle> <Paragraph position="0"> In this section we discuss the set of rules which constitute the morphological grammar, i.e., the implementation of linguistic structures in XFST. The grammar includes hundreds of rules; we present a small sample, exemplifying the principles that govern the overall organization of the grammar. The linguistic information was collected from several sources (Barkali, 1962; Zdaqa, 1974; Alon, 1995; Cohen, 1996; Schwarzwald, 2001; Schwarzwald, 2002; Ornan, 2003).</Paragraph> <Paragraph position="1"> The grammar consists of specific rules for every part of speech category, which are applied to the appropriate lexicons. For each category, a variable is defined whose denotation is the set of all lexical entries of that category. Combined with the category-specific rules, we obtain morphological grammars for every category (not including idiosyncrasies).</Paragraph> <Paragraph position="2"> These grammars are too verbose on the lexical side, as they contain all the information that was listed in the lexicon. 
Filters are therefore applied to the lexical side to remove the unneeded information.</Paragraph> <Paragraph position="3"> Our rules support surface forms that are made of zero or more prefix particles, followed by a (possibly inflected) lexicon item. Figure 6 depicts the high-level organization of the grammar (recall from section 2 that '.o.' denotes composition). The variable inflectedWord denotes a union of all the possible inflections of the entire lexicon. Similarly, prefixes is the set of all the possible sequences of prefixes. When the two are concatenated, they yield a language of all possible surface forms, vastly over-generating. On the upper side of this language a prefix particle filter is composed, which enforces linguistically motivated constraints on the possible combinations of prefixes with words. On top of this another filter is composed, which handles &quot;cosmetic&quot; changes, such as removing &quot;book-keeping&quot; tags. A similar filter is applied to the lower side of the network.</Paragraph> <Paragraph position="4"> As an example, consider the feminine singular form of adjectives, which is generated from the masculine singular by adding a suffix, either h, it or t. Some idiosyncratic forms have no masculine singular form, but do have a feminine singular form, for example hrh &quot;pregnant&quot;. Therefore, as figure 7 shows, singular feminine adjectives are either extracted verbatim from the lexicon or generated from the singular masculine form by suffixation. The rule [ %+feminine &lt;- ? || %+gender _ ] changes the gender attribute to feminine for the inflected feminine forms. This is a special form of a replace rule which replaces any symbol ('?') by the multi-character symbol '+feminine' when it occurs after '+gender'. The right context is empty, meaning anything.</Paragraph> <Paragraph position="5"> variable HE) is used in the inflection. 
The default is not to add an additional h if the masculine adjective already terminates with it, as in mwrh &quot;male teacher&quot;-mwrh &quot;female teacher&quot;. This means that exceptions to this default, such as gbwh &quot;tall, m&quot;-gbwhh &quot;tall, f&quot;, are treated improperly at this stage. Such forms are explicitly listed in the lexicon as idiosyncrasies (using the add/replace/remove mechanism), and will be corrected at a later stage. The suffixes t and it are handled in a similar way.</Paragraph> <Paragraph position="7"> fix are processed. On the lower side some conditional alternations are performed before the suffix is added. The first alternation rule replaces iih with ih at the end of a word, ensuring that nouns written with a spurious i, such as eniih &quot;second&quot;, are properly inflected as eniwt &quot;seconds&quot; rather than eniiwt. The second alternation rule removes final t to ensure that a singular noun such as meait &quot;truck&quot; is properly inflected to its plural form meaiwt. The third ensures that nouns ending in wt such as smkwt &quot;authority&quot; are properly inflected as smkwiwt. Of course, irregular nouns such as xnit &quot;spear&quot;, whose plural is xnitwt rather than xniwt, are lexically specified and handled separately. Finally, a final h is removed by the fourth rule, and subsequently the plural suffix is concatenated.</Paragraph> <Paragraph position="8"> The above rules only superficially demonstrate the capabilities of our grammar. The bulk of the grammar consists of rules for inflecting verbs, including complete coverage of the weak paradigms.</Paragraph> <Paragraph position="9"> The grammar also contains rules which govern the possible combinations of prefix particles and the words they combine with.</Paragraph> </Section> </Section> </Paper>