<?xml version="1.0" standalone="yes"?> <Paper uid="P05-1071"> <Title>Arabic Tokenization, Part-of-Speech Tagging and Morphological Disambiguation in One Fell Swoop</Title> <Section position="4" start_page="573" end_page="574" type="relat"> <SectionTitle> 3 Related Work </SectionTitle> <Paragraph position="0"> Our work is inspired by Hajič (2000), who convincingly shows that for five Eastern European languages with complex inflection, plus English, using a morphological analyzer improves the performance of a tagger. (Hajič uses a lookup table, which he calls a &quot;dictionary&quot;; the distinction between table lookup and actual processing at run time is irrelevant for us.) He concludes that for highly inflectional languages &quot;the use of an independent morphological dictionary is the preferred choice [over] more annotated data&quot;.</Paragraph> <Paragraph position="1"> Hajič (2000) uses a general exponential model to predict each morphological feature separately (such as the ones we have listed in Figure 2), but he trains different models for each ambiguity left unresolved by the morphological analyzer, rather than training general models. For all languages, the use of a morphological analyzer results in tagging error reductions of at least 50%.</Paragraph> <Paragraph position="2"> We depart from Hajič's work in several respects.</Paragraph> <Paragraph position="3"> First, we work on Arabic. Second, we also use this approach to perform tokenization. Third, we use the SVM-based Yamcha (which uses Viterbi decoding) rather than an exponential model; however, we do not consider this difference crucial and do not contrast our learner with others in this paper. 
Fourth, and perhaps most importantly, we do not use the notion of ambiguity class in the feature classifiers; instead, we investigate different ways of using the results of the individual feature classifiers to choose directly among the options the morphological analyzer produces for the word.</Paragraph> <Paragraph position="4"> While there have been many publications on computational morphological analysis for Arabic (see (Al-Sughaiyer and Al-Kharashi, 2004) for an excellent overview), to our knowledge only Diab et al. (2004)</Paragraph> <Paragraph position="5"> perform a large-scale corpus-based evaluation of their approach. They use the same SVM-based learner we do, Yamcha, for three different tagging tasks: word tokenization (tagging on the letters of a word), which we contrast with our work in Section 7; POS tagging, which we discuss in relation to our work in Section 8; and base phrase chunking, which we do not discuss in this paper. We take the comparison between our POS tagging results and those of Diab et al. (2004) to indicate that the use of a morphological analyzer is beneficial for Arabic as well.</Paragraph> <Paragraph position="6"> Several other publications deal specifically with segmentation. Lee et al. (2003) use a corpus of manually segmented words, which appears to be a subset of the first release of the ATB (110,000 words) and thus comparable to our training corpus. They obtain a list of prefixes and suffixes from this corpus, which is apparently augmented by a manually derived list of other affixes. Unfortunately, the full segmentation criteria are not given. A trigram model is then learned from the segmented training corpus and used to choose among competing segmentations for words in running text. 
In addition, a huge unannotated corpus (155 million words) is used to iteratively learn additional stems. Lee et al. (2003) show that the unsupervised use of the large corpus for stem identification increases accuracy. Overall, their error rates are higher than ours (2.9% vs. 0.7%), presumably because they do not use a morphological analyzer.</Paragraph> <Paragraph position="7"> There has been a fair amount of work on entirely unsupervised segmentation. Within this literature, Rogati et al. (2003) investigate unsupervised learning of stemming (a variant of tokenization in which only the stem is retained), using Arabic as the example language. Unsurprisingly, the results are much worse than those of our resource-rich approach. Darwish (2003) discusses unsupervised identification of roots; as mentioned above, we leave root identification to future work.</Paragraph> </Section> </Paper>