<?xml version="1.0" standalone="yes"?> <Paper uid="W05-0702"> <Title>A finite-state morphological grammar of Hebrew</Title> <Section position="5" start_page="14" end_page="15" type="concl"> <SectionTitle> 4 Conclusion </SectionTitle> <Paragraph position="0"> We have described a broad-coverage finite-state grammar of Modern Hebrew, consisting of two main components: a lexicon and a set of rules. The current underlying lexicon includes over 20,000 items. The average number of inflected forms for a lexicon item is 33 (not including prefix sequences). Because it is built with finite-state technology, the grammar can be used both for generation and for analysis. It induces a very efficient morphological analyzer: in practice, over eighty words per second can be analyzed on a contemporary workstation.</Paragraph> <Paragraph position="1"> For lack of space we cannot fully demonstrate the output of the analyzer; refer back to Figure 1 for an example. HAMSAH is now used in a number of projects, including as a front end for a Hebrew-to-English machine translation system (Lavie et al., 2004). It is routinely tested on a variety of texts, and tokens that receive no analysis are inspected manually. A systematic evaluation of the quality of the analyzer is difficult due to the lack of available alternative resources.
Nevertheless, we conducted a small-scale evaluation experiment by asking two annotators to review the output produced by the analyzer for a randomly chosen set of newspaper articles comprising approximately 1,000 word tokens.</Paragraph> <Paragraph position="2"> The following table summarizes the results of this experiment.</Paragraph> <Paragraph position="3">
                            number        %
tokens                         959  100.00%
no analysis                     37    3.86%
no correct analysis             41    4.28%
correct analysis produced      881   91.86%
The majority of the missing analyses are due to out-of-lexicon items, particularly proper names.</Paragraph> <Paragraph position="4"> In addition to maintenance and expansion of the lexicon, we intend to extend this work in two main directions. First, we are interested in automatic methods for expanding the lexicon, especially for named entities. Second, we are currently working on a disambiguation module which will rank the analyses produced by the grammar according to context-dependent criteria. Existing work on part-of-speech tagging and morphological disambiguation in Hebrew (Segal, 1999; Adler, 2004; Bar-Haim, 2005) leaves much room for further research. Applying state-of-the-art machine learning techniques for morphological disambiguation to the output produced by the analyzer should yield a system that is broad-coverage, effective and accurate.</Paragraph> </Section> </Paper>