<?xml version="1.0" standalone="yes"?>
<Paper uid="W02-0509">
  <Title>A Comprehensive NLP System for Modern Standard Arabic and Modern Hebrew Morphological analysis, lemmatization, vocalization, disambiguation and text-to-speech</Title>
  <Section position="4" start_page="4" end_page="6" type="metho">
    <SectionTitle>
2 A Description of the Morfix Architecture and its Application
</SectionTitle>
    <Section position="1" start_page="4" end_page="6" type="sub_section">
      <SectionTitle>
2.1 Architecture
</SectionTitle>
      <Paragraph position="0"> On the one hand, as can be expected in light of the similarities described above, a single NLP system is applicable to both MSA and MH, including code infrastructure, database structures, and methodology. On the other hand, adapting a previously existing MH system to MSA nonetheless requires some minor changes.</Paragraph>
      <Paragraph position="1"> Morfix comprises two lexical databases (a lemma database and an idiom/collocation database) and two rule databases (a morphological rule database and a syntactical rule database).</Paragraph>
      <Paragraph position="2"> The lemma database contains all crucial information about each lemma, including lexical features such as part of speech, gender, number, meaning, root, verb pattern (Wazn / Binyan) etc.</Paragraph>
      <Paragraph position="3"> Most of these features are common to MH and MSA, and have the same morphological implications. All inflectional forms of a lemma are generated by algorithms that process these features and draw on the morphological rule database. The rules generate forms by superimposing verb patterns and morphophonemic principles. Exceptions are allowed, i.e. the lexicographer may edit a specific form. The exception mechanism is used much less in MSA than in MH, owing to the higher consistency of MSA inflections (but see 2.2 below for the treatment of the MSA Broken Plural in Morfix). Once this inflection procedure has run, the entire inventory of 70 million forms is accessible.</Paragraph>
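      <Paragraph> As a rough illustration only, rule-driven form generation with lexicographer overrides could be sketched as follows. The rule format, the field names and the toy suffix rules are assumptions made for this sketch and are not Morfix's actual data structures; real rules superimpose verb patterns and morphophonemic principles rather than appending suffixes.

```python
from dataclasses import dataclass, field

# Hypothetical toy rules: feature bundle -> suffix appended to the stem.
# Real Morfix rules superimpose verb patterns and morphophonemic principles.
TOY_RULES = {
    ("masc", "sg"): "",
    ("masc", "pl"): "u:na",
    ("fem", "sg"): "a",
    ("fem", "pl"): "a:t",
}


@dataclass
class Lemma:
    stem: str
    pos: str
    # Lexicographer-edited forms that override the rule output (assumed layout).
    exceptions: dict = field(default_factory=dict)


def generate_forms(lemma):
    """Apply every rule to the lemma, letting stored exceptions win."""
    forms = {}
    for features, suffix in TOY_RULES.items():
        forms[features] = lemma.exceptions.get(features, lemma.stem + suffix)
    return forms


regular = Lemma(stem="ka:tib", pos="participle")
irregular = Lemma(stem="ka:tib", pos="noun",
                  exceptions={("masc", "pl"): "kutta:b"})
```

Calling generate_forms on the two toy lemmas yields the regular plural for the first and the stored exceptional plural for the second, which mirrors the idea that an edited form takes precedence over the rule output.</Paragraph>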
      <Paragraph position="4"> The information for the lemma and collocation databases is gathered in two phases. In the first phase, words are extracted from several dictionaries; in the second, text corpora, mainly from Internet sources, are analyzed using the dictionary-based lexicon. Any unanalyzed words (usually new loan words, neologisms and new conventions of usage), as well as collocations found in the corpora, are the basis for enriching the lexicon. The information for the morphological and syntactical rule databases is drawn both from conventional grammar textbooks (for MSA: Wright (1896), Holes (1995); for MH: Glinert (1989)) and from additional linguistic analysis of the corpora.</Paragraph>
      <Paragraph position="5"> By contrast, derivational morphology is by and large not algorithmic or rule-derived. That is, nouns, adjectives and verbs of different patterns that share the same root are each entered as separate lemmas. As mentioned above (1.2), there is a fine line between inflectional morphology and derivational morphology. For example, the decision whether to create a new lemma for a nominal inflection of a verb is left to the lexicographer.</Paragraph>
      <Paragraph position="6"> Criteria are usually morphological, since semantic criteria are often too vague. For example, the fact that the form ka:tib has two possible plural forms, ka:tibu:na 'writing' (masc. pl.) and kutta:b 'writers', indicates that the form should have a lemma of its own, in addition to being associated with the verb lemma.</Paragraph>
      <Paragraph position="7"> While the lemma in Morfix is defined as an inflectional lemma, derivational morphology is also accounted for in the database through a mechanism called word families, namely the root-based lemma grouping described above (1.2), whose members also share a semantic field. For example, infija:r 'explosion' and mufajjira:t 'explosives' would be members of the same family, whereas fajr 'dawn' would not belong to this family.</Paragraph>
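      <Paragraph> A minimal sketch of such a grouping is given below, assuming that each lemma record carries a root and a semantic field label. The record layout, the root spelling "fjr" and the field labels are invented for illustration; the paper does not specify how Morfix stores them.

```python
from collections import defaultdict

# Toy lemma records: (lemma, root, semantic_field). The labels are
# hypothetical; the source does not list Morfix's actual values.
LEMMAS = [
    ("infija:r", "fjr", "explosion"),
    ("mufajjira:t", "fjr", "explosion"),
    ("fajr", "fjr", "time-of-day"),
]


def build_word_families(lemmas):
    """Group lemmas that share both a root and a semantic field."""
    families = defaultdict(list)
    for lemma, root, semantic_field in lemmas:
        families[(root, semantic_field)].append(lemma)
    return dict(families)


families = build_word_families(LEMMAS)
```

Keying the family on the pair (root, semantic field) rather than on the root alone is what keeps fajr apart from the two "explosion" lemmas in this toy data.</Paragraph>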
      <Paragraph position="8"> The idiom/collocation database stores information about the co-occurrence of words. Idioms are lexicalized word combinations (e.g. in MSA bunya tahtia 'infrastructure', or in MH bet sefer 'a school'), while collocations are combinations of words that do not have specific meanings when combined, yet often appear together in texts (e.g. in MSA waqqaa {wq} ala l-ittifa:q 'to sign the agreement' as opposed to waqaa {wq} fi tta:ri:x 'occurred on the date', or in MH hamtana {hmtnh} ba-tor 'to wait on line' as opposed to kabalat hamatana {hmtnh} 'accepting the gift').</Paragraph>
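      <Paragraph> One way to picture how stored collocations can favour one reading of an ambiguous token is the sketch below. The table layout and the function name are hypothetical, and the entries simply reuse the MSA example above in the same simplified transliteration; the source does not describe Morfix's actual lookup.

```python
# Toy collocation table: (unvocalized token, following word) -> preferred reading.
# Entries follow the MSA example in the text; the layout is an assumption.
COLLOCATIONS = {
    ("wq", "ala"): "waqqaa",
    ("wq", "fi"): "waqaa",
}


def prefer_reading(token, next_word, candidates):
    """Return the candidate favoured by a stored collocation, else the first."""
    preferred = COLLOCATIONS.get((token, next_word))
    if preferred in candidates:
        return preferred
    return candidates[0]


choice = prefer_reading("wq", "ala", ["waqaa", "waqqaa"])
```

When no collocation is stored for the pair, the function falls back to the first candidate, standing in for whatever default ordering the analyzer would otherwise produce.</Paragraph>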
      <Paragraph position="9"> Finally, the syntactical rule database is comprised of rules such as agreement rules and construct formation rules (Ida:fa / Smixut). Some rules are not absolute, but rather reflect statistical information about distribution of syntactical structures in the language. These rules play a major role in the context analysis module.</Paragraph>
      <Paragraph position="10"> Each morphological analysis has a vocalization pattern (Tashki:l / Nikud). When analyzing word tokens in context, Morfix produces a best bet for the vocalized text.</Paragraph>
      <Paragraph position="11"> Finally, for text-to-speech purposes, a string of phonemes is created, based on the vocalization patterns. Stress markings are added per word, and a prosody pattern is applied, based on syntactical analysis at the clause level. Prosody patterns are expressed as duration and pitch values per phoneme.</Paragraph>
    </Section>
    <Section position="2" start_page="6" end_page="6" type="sub_section">
      <SectionTitle>
2.2 Adaptation of the technology to Arabic
</SectionTitle>
      <Paragraph position="0"> Most of the elements of Morfix are common to MSA and MH. However, some features had to be added specifically for the MSA database. For example, MH plural markers are few and are usually suffixes. MSA, on the other hand, often uses the Broken Plural (a plural formed by changing the vocalic pattern of the singular, as opposed to marking by affixation, e.g. ka:tib (sing.) → ka:tibu:na (pl.) 'writing' versus ka:tib (sing.) → kutta:b (pl.) 'writer'), which is only partially predictable and must therefore be included in the lemma records.</Paragraph>
      <Paragraph position="1"> Coding this feature did not require major change in the database, since the MH database had optional coding for exceptional plural forms.</Paragraph>
      <Paragraph position="2"> By contrast, one field in MH lemma records that is redundant in MSA is stress location, which, as opposed to MH, is always predictable in MSA given the phonemic structure of the form.</Paragraph>
      <Paragraph position="3"> Case inflection in MSA (?ira:b) is entirely predictable and is therefore handled by rules in the morphological rule database. However, a field for case had to be created in the database especially for MSA, as case does not occur in MH.</Paragraph>
      <Paragraph position="4"> Dual inflection exists in MH, though it is usually unproductive. The number category throughout the Morfix database could therefore already take one of three values: singular, dual, or plural, so that MSA handling, again, demanded no general change, but only a more widespread application of an existing option in the Hebrew Morfix.</Paragraph>
      <Paragraph position="5"> The number of inflectional forms of a verb entry is larger in MSA than it is in MH, most notably due to the additional mood paradigms (Al-Muda:re Al-Majzu:m and Al-Muda:re Al-Mansu:b). This, however, is of no major consequence to Morfix, apart from the fact that another field had to be added to the morphological analysis structure, namely mood.</Paragraph>
      <Paragraph position="6"> The higher number of inflections per verb, along with the generality of the dual inflection, would have resulted in a larger overall number of tokens in MSA, had it not been for the Ktiv Male orthographical system in MH that results in a 25% increment to the overall number of MH tokens (see also above 1.4).</Paragraph>
      <Paragraph position="7"> The phenomenon of incomplete agreement (see also above 1.6) does not require an actual change in the code of Morfix, since the term AGREEMENT (e.g. between noun and adjective) has an external definition, independent for each language. Syntactical rules in the system refer to the term AGREEMENT, hence, rules that make use of the term AGREEMENT will apply, in many cases, to both languages. In general, while some of the syntactical rules in the system are similar in both languages, other rules are defined specifically for each of the two languages. All rules for both languages are specified using the same mechanism.</Paragraph>
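      <Paragraph> The idea of a shared rule that refers only to an externally defined term can be sketched as below. Everything here is hypothetical: the function names, the feature dictionaries and the two toy definitions. In particular, the MSA definition uses the familiar pattern in which non-human plural nouns take feminine singular adjectives as a stand-in for incomplete agreement; whether this matches the definition used in Morfix is an assumption.

```python
# Hypothetical per-language definitions of the shared term AGREEMENT.
def agreement_mh(noun, adj):
    # Toy MH definition: gender and number must both match.
    return noun["gender"] == adj["gender"] and noun["number"] == adj["number"]


def agreement_msa(noun, adj):
    # Toy MSA definition: non-human plurals take feminine singular
    # adjectives; everything else matches in gender and number.
    if noun["number"] == "pl" and not noun["human"]:
        return adj["gender"] == "fem" and adj["number"] == "sg"
    return noun["gender"] == adj["gender"] and noun["number"] == adj["number"]


AGREEMENT = {"MH": agreement_mh, "MSA": agreement_msa}


def noun_adjective_rule(language, noun, adj):
    """A shared syntactic rule that only refers to the term AGREEMENT."""
    return AGREEMENT[language](noun, adj)


books = {"gender": "masc", "number": "pl", "human": False}
fem_sg_adjective = {"gender": "fem", "number": "sg"}
```

The rule body is identical for both languages; only the entry looked up in the AGREEMENT table differs, which is the property the paragraph above relies on.</Paragraph>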
      <Paragraph position="8"> In the MH database there are supplementary placeholders for the semi-vocalized spelling alternatives, which are often redundant for MSA, though they do become useful especially in recent loan words.</Paragraph>
      <Paragraph position="9"> In MSA the verb predicate usually precedes its subject (VSO), while in MH the subject tends to appear first (SVO), though in both languages word order is not fixed. This difference is handled in the contextual analysis for disambiguation purposes.</Paragraph>
      <Paragraph position="10"> MSA is used in various countries, each having its own linguistic idiosyncrasies. This entails lexical differences and a few phonetic variations, as well as some minor writing convention differences. This is handled by the MSA lemma database by assigning an additional field, where the relevant areas are specified.</Paragraph>
    </Section>
    <Section position="3" start_page="6" end_page="6" type="sub_section">
      <SectionTitle>
2.3 Software modules
</SectionTitle>
      <Paragraph position="0"> * Morphological analyzer: This is the basic building block of our system. It analyzes an input string and returns an array of records containing detailed information about each analysis: the lemma, part of speech, clitic details, gender, number, person, tense, mood, case, and the like.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="6" end_page="6" type="metho">
    <SectionTitle>
* Lemmatizer
</SectionTitle>
    <Paragraph position="0"> This is a version of the morphological analyzer, the difference being that its output is lemmas, not full morphological descriptions. This means that when several morphological analyses share a single lemma, they are merged into a single answer record that includes just the lemma and its part of speech.</Paragraph>
    <Paragraph position="1"> For example, the string {waldy} has several morphological analyses (dual construct form: 'the two parents of'; dual form with genitive pronominal enclitic: 'my two parents'; or singular form with genitive pronominal enclitic: 'my father'); however, the lemmatizer produces just one lemma for all the above analyses: wa:lid 'a parent'.</Paragraph>
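    <Paragraph> The merging step can be illustrated with the sketch below, which collapses analyses that share a lemma and part of speech. The record type and its field names are assumptions for illustration, and the three toy analyses restate the {waldy} example from the text.

```python
from typing import NamedTuple


class Analysis(NamedTuple):
    lemma: str
    pos: str
    description: str


# Toy analyses of the string {waldy}, following the example in the text.
ANALYSES = [
    Analysis("wa:lid", "noun", "dual construct form"),
    Analysis("wa:lid", "noun", "dual + genitive pronominal enclitic"),
    Analysis("wa:lid", "noun", "singular + genitive pronominal enclitic"),
]


def lemmatize(analyses):
    """Collapse analyses sharing a lemma and part of speech, keeping order."""
    seen = []
    for analysis in analyses:
        key = (analysis.lemma, analysis.pos)
        if key not in seen:
            seen.append(key)
    return seen


result = lemmatize(ANALYSES)
```

Here the three analyses reduce to a single (lemma, part of speech) pair, while analyses with different lemmas would each keep their own answer record.</Paragraph>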
    <Paragraph position="2"> * Context analyzer: The input for the context analyzer is a text buffer. It returns a set of morphological analysis record arrays, one array for each token found in the buffer. Compared with the basic morphological analyzer, the records have one extra field: the score field, which reflects the effect of the context analysis. The answer arrays are sorted in descending order of score.</Paragraph>
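    <Paragraph> The output shape described for the context analyzer (one array per token, each record carrying a score, best reading first) might look like the sketch below. The scoring function is a made-up placeholder that simply prefers a verb reading in buffer-initial position; the source does not describe how Morfix actually computes scores.

```python
def rank_by_context(token_analyses, scorer):
    """Attach a score to each analysis and sort each token's array, best first."""
    ranked = []
    for position, analyses in enumerate(token_analyses):
        scored = [
            dict(analysis, score=scorer(token_analyses, position, analysis))
            for analysis in analyses
        ]
        scored.sort(key=lambda record: record["score"], reverse=True)
        ranked.append(scored)
    return ranked


# Hypothetical scorer: prefer a verb reading for the first token of the buffer.
def toy_scorer(token_analyses, position, analysis):
    return 0.9 if (position == 0 and analysis["pos"] == "verb") else 0.5


buffer_analyses = [
    [{"lemma": "A", "pos": "noun"}, {"lemma": "B", "pos": "verb"}],
    [{"lemma": "C", "pos": "noun"}],
]
ranked = rank_by_context(buffer_analyses, toy_scorer)
```

A caller interested only in the best bet for each token would take the first record of every array, while applications that need alternatives can walk further down the sorted list.</Paragraph>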
  </Section>
  <Section position="6" start_page="6" end_page="6" type="metho">
    <SectionTitle>
* Vocalizer
</SectionTitle>
    <Paragraph position="0"> Given a word and a morphological analysis record as input, this module outputs the input word with its vocalization.</Paragraph>
    <Paragraph position="1"> * Text to phoneme: Given a vocalized word and a morphological analysis record as input, this module produces its phonemic representation, including stress marking.</Paragraph>
    <Paragraph position="2"> * Text to speech: A module on top of the text-to-phoneme module, whose inputs are a text buffer and a morphological analysis per word. The text-to-phoneme module is called to produce the phonemic representation of the buffer. A prosody function then assigns duration values and pitch contours to each phoneme, and its output is sent to a diphone-based synthesis engine.</Paragraph>
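    <Paragraph> The chaining of the text-to-phoneme and prosody stages can be pictured with the toy pipeline below. All of it is assumed for illustration: the one-character-per-phoneme mapping, the rule that stresses the first vowel of each word, and the duration and pitch values are placeholders rather than Morfix's phonology or prosody model.

```python
def text_to_phonemes(vocalized_word):
    """Toy phonemizer: one phoneme per character, stress on the first vowel."""
    phonemes = []
    stressed = False
    for ch in vocalized_word:
        is_vowel = ch in "aeiou"
        phonemes.append({"symbol": ch, "stress": is_vowel and not stressed})
        stressed = stressed or is_vowel
    return phonemes


def apply_prosody(phonemes, base_ms=80, base_hz=120.0):
    """Toy prosody: stressed phonemes are longer and higher-pitched."""
    return [
        dict(p,
             duration_ms=base_ms * (2 if p["stress"] else 1),
             pitch_hz=base_hz * (1.2 if p["stress"] else 1.0))
        for p in phonemes
    ]


def prepare_for_synthesis(vocalized_words):
    """Chain the two stages; the result would feed a synthesis engine."""
    frames = []
    for word in vocalized_words:
        frames.extend(apply_prosody(text_to_phonemes(word)))
    return frames


frames = prepare_for_synthesis(["bet", "sefer"])
```

Each output frame carries a symbol, a stress flag, a duration and a pitch value, which corresponds to the per-phoneme duration and pitch representation mentioned in the text; the synthesis engine itself is outside the scope of this sketch.</Paragraph>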
    <Section position="1" start_page="6" end_page="6" type="sub_section">
      <SectionTitle>
2.4 Results and performance
</SectionTitle>
      <Paragraph position="0"> The Hebrew version of Morfix has achieved the following results. Morfix generates exceptionally accurate lemmatization: when indexing for full-text search, the rate at which the highest-scoring lemma matches the correct lemma is above 98%. In typical Internet texts, between 1% and 2% of words remain unanalyzed (by and large, these are proper names not included in the lexicon; in search engine applications, these undergo a morphological soundex algorithm designed to enable the retrieval of proper names with prepositional proclitics).</Paragraph>
      <Paragraph position="1"> Performance depends on hardware and system environments. On a typical (as of date of publication) Intel III 800 MHz CPU, with 256 MB RAM running Windows 2000, Morfix analyzes c. 10,000 words per second.</Paragraph>
      <Paragraph position="2"> In text-to-speech (TTS) applications, the proportion of words read correctly (fully correct phonetic transcription and stress location) is also 98%. This figure is the same as that for lemmatization but is derived differently: on the one hand, an error in lemmatization sometimes does not yield an error in phonetization (in the case of homonymic tokens); on the other hand, TTS has to deal with the phonetization of proper names not in the lexicon, which it carries out algorithmically. The Hebrew TTS system is successfully deployed in systems for reading e-mail and Internet texts.</Paragraph>
      <Paragraph position="3"> Performance of the TTS system is around 20% slower than lemmatization, due to extra processing that computes the phonetic transcription given the morphological analysis.</Paragraph>
      <Paragraph position="4"> The final equivalent numbers for Arabic are still not available as of date of publication. Nonetheless, because the system is similar, and MSA is quite close to MH in terms of total number of inflections and in degree of ambiguity, it is expected to reach similar results.</Paragraph>
    </Section>
    <Section position="2" start_page="6" end_page="6" type="sub_section">
      <SectionTitle>
2.5 Applications
</SectionTitle>
      <Paragraph position="0"> Various modules of the system are used by various applications. Main application beneficiaries include full text search, categorization and textual data mining (where context sensitive morphological analysis and lemmatization are crucial for Semitic languages), screen readers and email-to-voice converters in telephony usage (especially the text-to-speech module), automatic vocalizers for schools and book publishers (especially the vocalization module), and online dictionaries (especially context sensitive lemmatization, to enable the retrieval of the correct entry when clicking on a word in context).</Paragraph>
      <Paragraph position="1"> Special consideration was given to assisting non-fluent speakers of MSA and MH. Besides the fact that all applications trace words back to their basic forms, sparing users a step they would otherwise perform themselves, additional assistance is provided, such as transliteration into Latin script.</Paragraph>
    </Section>
  </Section>
</Paper>