<?xml version="1.0" standalone="yes"?> <Paper uid="W05-0706"> <Title>Choosing an Optimal Architecture for Segmentation and POS-Tagging of Modern Hebrew</Title> <Section position="8" start_page="42" end_page="44" type="evalu"> <SectionTitle> 4 Evaluation </SectionTitle> <Paragraph position="0"> In this section we report on an empirical comparison between the two levels of tokenization presented in the previous section. Analysis of the results leads to an improved morpheme-level model, which outperforms both of the initial models.</Paragraph> <Paragraph position="1"> Each architectural configuration was evaluated over five folds. In each train/test split of the corpus, the training set includes 1,598 sentences on average, amounting on average to 28,738 words and 39,282 morphemes, and the test set includes 250 sentences. We estimate segmentation accuracy (the percentage of words correctly segmented into morphemes) as well as tagging accuracy (the percentage of words that were correctly segmented and for which each morpheme was assigned the correct POS tag).</Paragraph> <Paragraph position="2"> For each parameter, the average over the five folds is reported, with the standard deviation in parentheses. We used a two-tailed paired t-test to test the significance of the difference between the average results of different systems. The significance level (p-value) is reported.</Paragraph> <Paragraph position="3"> The first two lines in Table 1 detail the results obtained for the word (W) and morpheme (M) levels of tokenization. The tagging accuracy of the morpheme tagger is considerably better than that of the word tagger (a difference of 0.79%, with significance level p = 0.01). This is despite the fact that the segmentation achieved by the word tagger is slightly better (and a segmentation error implies incorrect tagging). Our hypothesis is the following: Morpheme-level taggers outperform word-level taggers in tagging accuracy, since they suffer less from data sparseness. 
However, they lack some word-level knowledge that is required for segmentation.</Paragraph> <Paragraph position="4"> This hypothesis is supported by the number of once-occurring terminals at each level: 8,582 at the word level versus 5,129 at the morpheme level.</Paragraph> <Paragraph position="5"> Motivated by this hypothesis, we next consider what kind of word-level information the morpheme-level tagger requires in order to improve its segmentation. One natural enhancement of the morpheme-level model involves adding information about word boundaries to the tag set. In the enhanced tag set, nonterminal symbols include additional features that indicate whether the tagged morpheme starts or ends a word. Unfortunately, we found that adding word boundary information in this way did not improve segmentation accuracy.</Paragraph> <Paragraph position="6"> However, error analysis revealed a very common type of segmentation error, which was found to be considerably more frequent in morpheme tagging than in word tagging. This kind of error involves a missing or an extra covert definiteness marker 'h'. For example, the word bbit can be segmented either as b-bit (&quot;in a house&quot;) or as b-h-bit (&quot;in the house&quot;), pronounced bebayit and babayit, respectively. Unlike other cases of segmentation ambiguity, which often just manifest lexical facts about the spelling of Hebrew stems, this kind of ambiguity is productive: it occurs whenever the stem's POS allows definiteness and the stem is preceded by one of the prepositions b/k/l. In morpheme tagging, this type of error was found on average in 1.71% of the words (46% of the segmentation errors). In word tagging, it was found in only 1.36% of the words (38% of the segmentation errors). Since in Hebrew there should be agreement between the definiteness status of a noun and its related adjective, this kind of ambiguity can sometimes be resolved syntactically. 
For instance, &quot;bbit hgdwl&quot; implies b-h-bit (&quot;in the big house&quot;), whereas &quot;bbit gdwl&quot; implies b-bit (&quot;in a big house&quot;). By contrast, in many other cases both analyses are syntactically valid, and the choice between them requires consideration of a wider context or some world knowledge. For example, in the sentence hlknw lmsibh (&quot;we went to a/the party&quot;), lmsibh can be analyzed either as l-msibh (indefinite, &quot;to a party&quot;) or as l-h-msibh (definite, &quot;to the party&quot;). Whether we prefer &quot;the party&quot; or &quot;a party&quot; depends on contextual information that is not available to the POS tagger.</Paragraph> <Paragraph position="7"> Lexical statistics can provide valuable information in such situations, since some nouns are more common in their definite form, while others are more common as indefinite. For example, consider the word lmmflh (&quot;to a/the government&quot;), which can be segmented either as l-mmflh or as l-h-mmflh. The stem mmflh (&quot;government&quot;) was found 25 times in the corpus, of which only two occurrences were indefinite. This strong lexical evidence in favor of l-h-mmflh is completely missed by the morpheme-level tagger, in which morphemes are assumed to be independent. The lexical model of the word-level tagger captures this difference better, since it does take into account the frequencies of l-mmflh and l-h-mmflh when measuring P(lmmflh|IN-NN) and P(lmmflh|IN-H-NN). However, since the word tagger considers lmmflh, hmmflh (&quot;the government&quot;), and mmflh (&quot;a government&quot;) as independent words, it still exploits only part of the potential lexical evidence about definiteness.</Paragraph> <Paragraph position="8"> In order to better model such situations, we changed the morpheme-level model as follows. In definite words, the definite article h is treated as a manifestation of a morphological feature of the stem. 
Hence the definiteness marker's POS tag (H) is prefixed to the POS tag of the stem. We refer to the resulting model as M+h; the assumption it relies on is rather standard in theoretical linguistic studies of Hebrew. The M+h model can be viewed as an intermediate level of tokenization, between morpheme and word tokenization. The different analyses obtained by the three models of tokenization are demonstrated in Table 2.</Paragraph> <Paragraph position="9"> As shown in Table 1, the M+h model shows a remarkable improvement in segmentation (0.49%, p &lt; 0.001) compared with the initial morpheme-level model (M). As expected, the frequency of segmentation errors that involve covert definiteness (h) dropped from 1.71% to 1.25%. The adjusted morpheme tagger also outperforms the word-level tagger in segmentation (0.31%, p = 0.069). Tagging accuracy improved as well (0.3%, p = 0.068). According to these results, tokenization as in the M+h model is preferable to both plain-morpheme and plain-word tokenization.</Paragraph> </Section> </Paper>