<?xml version="1.0" standalone="yes"?>
<Paper uid="P06-1084">
  <Title>Sydney, July 2006. (c) 2006 Association for Computational Linguistics. An Unsupervised Morpheme-Based HMM for Hebrew Morphological Disambiguation</Title>
  <Section position="6" start_page="668" end_page="670" type="evalu">
    <SectionTitle>
4 Experimental Results
</SectionTitle>
    <Paragraph position="0"> We ran a series of experiments on a Hebrew corpus to compare various approaches to the full morphological disambiguation and PoS tagging tasks. The training corpus was obtained from various newspaper sources and is characterized by the following statistics: 6M word occurrences, 178,580 distinct words, and 64,541 distinct lemmas. Overall, the ambiguity level is 2.4 (the average number of analyses per word).</Paragraph>
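The ambiguity level reported above is simply the mean number of analyses the morphological analyzer proposes per word occurrence. A minimal sketch (the per-token counts below are hypothetical, chosen only to illustrate the computation):

```python
# Ambiguity level: average number of morphological analyses per word
# occurrence. The counts here are hypothetical; the paper reports 2.4
# over a 6M-word newspaper corpus.
def ambiguity_level(analyses_per_token):
    """Mean number of analyses over all word occurrences."""
    return sum(analyses_per_token) / len(analyses_per_token)

tokens = [1, 3, 2, 4, 2]  # analyses proposed by the analyzer for each token
print(round(ambiguity_level(tokens), 2))  # 2.4
```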
    <Paragraph position="1"> We tested the results on a test corpus, manually annotated by two taggers according to the guidelines we published and checked for agreement. The test corpus contains about 30K words. We compared two unsupervised models over this data set: the Word model [W] and the Morpheme model [M]. We also tested two different sets of initial conditions. Uniform distribution [Uniform]: for each word, each analysis provided by the analyzer is estimated with equal likelihood. Context-free approximation [CF]: we applied the CF algorithm of Levinger et al. (1995) to estimate the likelihood of each analysis. Table 3 reports the results of full morphological disambiguation. For both the morpheme and word models, three model orders were tested: [1] first-order HMM; [2-] partial second-order HMM, in which only state transitions were modeled (excluding the B2 matrix); [2] second-order HMM (including the B2 matrix).</Paragraph>
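The [Uniform] initial condition can be sketched as follows: every analysis the analyzer proposes for a word receives the same initial probability, with [CF] replacing this flat estimate by context-free likelihoods. The word form and its candidate analyses below are hypothetical placeholders, not actual analyzer output:

```python
# Uniform initial conditions: each analysis the analyzer proposes for a
# word is given equal likelihood. The word and analysis labels are
# hypothetical stand-ins for the analyzer's (segmentation, tag) pairs.
def uniform_init(analyzer_output):
    """Map each word to a uniform distribution over its analyses."""
    probs = {}
    for word, analyses in analyzer_output.items():
        p = 1.0 / len(analyses)
        probs[word] = {a: p for a in analyses}
    return probs

out = uniform_init({"bclm": ["B+CL+M", "B+CLM", "BCLM"]})
print(out["bclm"]["BCLM"])  # each of the three analyses gets 1/3
```

The [CF] variant would replace `p` with a normalized context-free score per analysis; everything downstream (EM training of the HMM) stays the same.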
    <Paragraph position="2"> Analysis: If we consider as the baseline the tagger that selects, for each word in the text, the most probable morphological analysis according to the approximations of Levinger et al. (1995), with an accuracy of 78.2%, four steps of error reduction can be identified. (1) Contextual information: The simplest first-order word-based HMM with uniform initial conditions achieves an error reduction of 17.5% (78.2 to 82.01). (2) Initial conditions: Error reductions in the range of 11.5% to 37.8% (82.01 to 84.08 for word model 1, and 81.53 to 88.5 for morpheme model 2-) were achieved by initializing the various models with context-free approximations. While this observation confirms Elworthy (1994), the impact of error reduction is much smaller than the roughly 70% (79 to 94) reported there for English. The key difference (besides the unclear characteristics of Elworthy's initial conditions, since he made use of an annotated corpus) is the much higher quality of the uniform distribution for Hebrew. (3) Model order: The partial second-order HMM [2-] produced the best results for both the word (85.75%) and morpheme (88.5%) models over the context-free initial conditions. The full second-order HMM [2] did not improve on the accuracy of the partial second-order model, but achieved the best results under the uniform distribution for the morpheme model. This is because the context-free approximation does not take into account the tag of the previous word, which is part of model 2. We believe that initializing the morpheme model with a small annotated corpus would provide much stronger initial conditions for this model. (4) Model type: The main result of this paper is the error reduction of the morpheme model with respect to the word model: about 19.3% (85.75 to 88.5).</Paragraph>
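The error-reduction figures quoted in steps (1)-(4) are relative: the fraction of the baseline's residual error that the better model removes. The arithmetic can be checked directly from the accuracies given above:

```python
# Relative error reduction: the share of the baseline's error (100 - acc)
# eliminated by the improved model.
def error_reduction(baseline_acc, new_acc):
    """Both accuracies in percent; returns a fraction in [0, 1]."""
    return (new_acc - baseline_acc) / (100.0 - baseline_acc)

# Step (1): uniform word-based HMM vs. the 78.2% baseline.
print(round(100 * error_reduction(78.2, 82.01), 1))   # 17.5
# Step (4): best morpheme model vs. best word model.
print(round(100 * error_reduction(85.75, 88.5), 1))   # 19.3
```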
    <Paragraph position="7"> In addition, we applied the above models to the simpler task of segmentation and PoS tagging, as reported in Table 4. This task requires picking the correct morphemes of each word together with their correct PoS, excluding all other morphological features.</Paragraph>
    <Paragraph position="8"> The best result for this task is obtained with morpheme model 2: 92.32%. For this simpler task, the improvement brought by the morpheme model over the word model is less significant, but still amounts to a 5% error reduction.</Paragraph>
    <Paragraph position="9"> Unknown words account for a significant chunk of the errors. Table 5 shows the distribution of errors contributed by unknown words (words that cannot be analyzed by the morphological analyzer). 7.5% of the words in the test corpus are unknown: 4% are not recognized at all by the morphological analyzer (marked as [None] in the table), and for 3.5%, the set of analyses proposed by the analyzer does not contain the correct analysis [Missing]. We extended the lexicon to include the missing and unrecognized lexemes of the closed sets. In addition, we modified the analyzer to extract all possible segmentations of unknown words, with all the possible tags for the segmented affixes, where the remaining unknown baseforms are tagged as UK. The model was trained over this set. In the next phase, the corpus was automatically tagged according to the trained model, in order to form a tag distribution for each unknown word according to its context and its form. Finally, the tag for each unknown word was selected according to its tag distribution. This strategy resolves about half of the 7.5% unknown words.</Paragraph>
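The final step of the unknown-word strategy, selecting each unknown word's tag from the distribution accumulated over the automatically tagged corpus, can be sketched as below. The word forms and tags are hypothetical; the real system operates on the analyzer's segmented affixes and UK-tagged baseforms:

```python
from collections import Counter, defaultdict

# Sketch of the unknown-word selection step: after the trained model has
# tagged the corpus, collect a tag distribution per unknown word form and
# pick each form's most frequent tag. Words and tags are hypothetical.
def tag_unknowns(tagged_corpus, unknown_words):
    """tagged_corpus: iterable of (word, tag) pairs from the tagged text."""
    dist = defaultdict(Counter)
    for word, tag in tagged_corpus:
        if word in unknown_words:
            dist[word][tag] += 1
    # Select the modal tag for each unknown word form.
    return {w: counts.most_common(1)[0][0] for w, counts in dist.items()}

corpus = [("intrnt", "N"), ("intrnt", "N"), ("intrnt", "PN")]
print(tag_unknowns(corpus, {"intrnt"}))  # {'intrnt': 'N'}
```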
    <Paragraph position="10"> Table 6 shows the confusion matrix for known words (entries of 5% and up). The key confusions can be attributed to linguistic properties of Modern Hebrew. Most Hebrew proper names are also nouns, and they are not marked by capitalization, which explains the PN/N confusion. The verb/noun and verb/adjective confusions are explained by the nature of the participle form in Hebrew (beinoni): participles behave syntactically in a manner almost identical to nouns.</Paragraph>
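A confusion matrix like Table 6 is accumulated by counting (gold tag, predicted tag) pairs over the test corpus; a minimal sketch with hypothetical tag sequences:

```python
from collections import Counter

# Accumulate a confusion matrix from parallel gold and predicted tag
# sequences; a report like Table 6 keeps only frequent off-diagonal
# entries. The tag sequences below are hypothetical.
def confusion_matrix(gold_tags, predicted_tags):
    """Counter keyed by (gold, predicted) tag pairs."""
    return Counter(zip(gold_tags, predicted_tags))

gold = ["PN", "N", "V", "V"]
pred = ["N", "N", "N", "ADJ"]
cm = confusion_matrix(gold, pred)
print(cm[("PN", "N")])  # 1 proper name mistagged as a noun
```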
  </Section>
</Paper>