<?xml version="1.0" standalone="yes"?> <Paper uid="W05-0702"> <Title>A finite-state morphological grammar of Hebrew</Title> <Section position="2" start_page="0" end_page="9" type="abstr"> <SectionTitle> Abstract </SectionTitle> <Paragraph position="0"> Morphological analysis is a crucial component of several natural language processing tasks, especially for languages with a highly productive morphology, where stipulating a full lexicon of surface forms is not feasible. We describe HAMSAH (HAifa Morphological System for Analyzing Hebrew), a morphological processor for Modern Hebrew, based on finite-state linguistically motivated rules and a broad-coverage lexicon. The set of rules comprehensively covers the morphological, morpho-phonological and orthographic phenomena that are observable in contemporary Hebrew texts. Reliance on finite-state technology facilitates the construction of a highly efficient, completely bidirectional system for analysis and generation. HAMSAH is currently the broadest-coverage and most accurate freely available system for Hebrew.</Paragraph> <Paragraph position="1"> 1 Hebrew morphology: the challenge Hebrew, like other Semitic languages, has a rich and complex morphology. The major word formation machinery is root-and-pattern, where roots are sequences of three (typically) or more consonants, called radicals, and patterns are sequences of vowels and, sometimes, also consonants, with &quot;slots&quot; into which the root's consonants are inserted (interdigitation). Inflectional morphology is highly productive and consists mostly of suffixes, but sometimes of prefixes or circumfixes.</Paragraph> <Paragraph position="2"> As an example of root-and-pattern morphology, consider the Hebrew roots g.d.l and r.e.m and the patterns hCCCh and CiCwC, where the 'C's indicate the slots. When the roots combine with these patterns the resulting lexemes are hgdlh, gidwl, hremh, riewm, respectively. 
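The interdigitation step itself can be pictured with a minimal Python sketch. This is an illustration only, not part of HAMSAH, which is built on finite-state technology; the function name interdigitate and the dotted root notation as input are assumptions made for the example. Each 'C' slot of the pattern is filled with the next radical of the root, and no morpho-phonological alternations are applied.

```python
def interdigitate(root, pattern):
    """Fill each 'C' slot of the pattern with the next radical of the root.

    Hypothetical helper for illustration; roots are written as dotted
    radicals (for example 'g.d.l'), as in the running text.
    """
    radicals = root.split('.')
    if pattern.count('C') != len(radicals):
        raise ValueError('pattern slots and radicals differ in number')
    fill = iter(radicals)
    # Every 'C' consumes one radical; all other characters are copied as-is.
    return ''.join(next(fill) if ch == 'C' else ch for ch in pattern)


if __name__ == '__main__':
    for root in ('g.d.l', 'r.e.m'):
        for pattern in ('hCCCh', 'CiCwC'):
            print(root, '+', pattern, '=', interdigitate(root, pattern))
```

Applied to the two roots and two patterns above, the sketch reproduces the four lexemes just listed; it deliberately stops short of the alternations discussed next.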
After the root combines with the pattern, some morpho-phonological alternations take place, which may be non-trivial: for example, the htCCCwt pattern triggers assimilation when the first consonant of the root is t or d: thus, d.r.e+htCCCwt yields hdrewt. The same pattern triggers metathesis when the first radical is s or e: s.d.r+htCCCwt yields hstdrwt rather than the expected htsdrwt. Frequently, root consonants such as w or i are altogether missing from the resulting form. Other weak paradigms include roots whose first radical is n and roots whose second and third radicals are identical. Thus, the roots q.w.m, g.n.n, n.p.l and i.c.g, when combining with the hCCCh pattern, yield the seemingly similar lexemes hqmh, hgnh, hplh and hcgh, respectively.</Paragraph> <Paragraph position="3"> The combination of a root with a pattern produces a base (or a lexeme), which can then be inflected in various forms. Nouns, adjectives and numerals inflect for number (singular, plural and, in rare cases, also dual) and gender (masculine or feminine). In addition, all three types of nominals have two phonologically distinct forms, known as the absolute and construct states. Unfortunately, in the standard orthography approximately half of the nominals appear to have identical forms in both states, a fact which substantially increases the ambiguity. In addition, nominals take pronominal suffixes which are interpreted as possessives. These inflect for number, gender and person: spr+h→sprh &quot;her book&quot;, spr+km→sprkm &quot;your book&quot;, etc. As expected, these processes involve certain morphological alternations, as in mlkh+h→mlkth &quot;her queen&quot;, mlkh+km→mlktkm &quot;your queen&quot;. Verbs inflect for number, gender and person (first, second and third) and also for a combination of tense and aspect, which is traditionally analyzed as having the values past, present, future, imperative and infinitive. 
Verbs can also take pronominal suffixes, which in this case are interpreted as direct objects, but such constructions are rare in contemporary Hebrew of the registers we are interested in.</Paragraph> <Paragraph position="4"> These matters are further complicated by two factors. First, the standard Hebrew orthography leaves most of the vowels unspecified. It does not explicate [a] and [e], does not distinguish between [o] and [u] and leaves many of the [i] vowels unspecified. Furthermore, the single letter w is used both for the vowels [o] and [u] and for the consonant [v], whereas i is similarly used both for the vowel [i] and for the consonant [y]. Second, the script dictates that many particles, including four of the most frequent prepositions (b &quot;in&quot;, k &quot;as&quot;, l &quot;to&quot; and m &quot;from&quot;), the definite article h &quot;the&quot;, the coordinating conjunction w &quot;and&quot; and some subordinating conjunctions (such as e &quot;that&quot; and ke &quot;when&quot;), all attach to the words which immediately follow them. Thus, a form such as ebth can be read as a lexeme (the verb &quot;capture&quot;, third person singular feminine past), as e+bth &quot;that+field&quot;, e+b+th &quot;that+in+tea&quot;, ebt+h &quot;her sitting&quot; or even as e+bt+h &quot;that her daughter&quot;. When a definite nominal is prefixed by one of the prepositions b, k or l, the definite article h is assimilated with the preposition and the resulting form becomes ambiguous as to whether or not it is definite: bth can be read either as b+th &quot;in tea&quot; or as b+h+th &quot;in the tea&quot;. An added complexity stems from the fact that there exist two main standards for the Hebrew script: one in which vocalization diacritics, known as niqqud &quot;dots&quot;, decorate the words, and another in which the dots are missing, and other characters represent some, but not all of the vowels. 
Most of the texts in Hebrew are of the latter kind; unfortunately, different authors use different conventions for the undotted script. Thus, the same word can be written in more than one way, sometimes even within the same document, again adding to the ambiguity.</Paragraph> <Paragraph position="5"> In light of the above, morphological analysis of Hebrew forms is a non-trivial task. Observe that simply stipulating a list of surface forms is not a viable option, both because of the huge number of potential forms and because of the complete inability of such an approach to handle out-of-lexicon items; the number of such items in Hebrew is significantly larger than in European languages due to the combination of prefix particles with open-class words such as proper names. The solution must be a dedicated morphological analyzer, implementing the morphological and orthographic rules of the language.</Paragraph> <Paragraph position="6"> Several morphological processors of Hebrew have been proposed, including works by Choueka (1980; 1990), Ornan and Kazatski (1986), Bentur et al. (1992) and Segal (1997); see a survey in Wintner (2004). Most of them are proprietary and hence cannot be fully evaluated. However, the main limitation of existing approaches is that they are ad hoc: the rules that govern word formation and inflection are only implicit in such systems, usually intertwined with control structures and general code. This makes the maintenance of such systems difficult: corrections, modifications and extensions of the lexicon are nearly impossible. An additional drawback is that all existing systems can be used for analysis but not for generation. Finally, the efficiency of such systems depends on the quality of the code, and is sometimes sub-optimal.</Paragraph> </Section> </Paper>