<?xml version="1.0" standalone="yes"?> <Paper uid="N04-4038"> <Title>Automatic Tagging of Arabic Text: From Raw Text to Base Phrase Chunks</Title> <Section position="4" start_page="0" end_page="0" type="relat"> <SectionTitle> 3 Related Work </SectionTitle> <Paragraph position="0"> To our knowledge, there are no systems that automatically tokenize and POS tag Arabic text as such. The current standard approach to Arabic tokenization and POS tagging -- adopted in the Arabic TreeBank -- relies on manually choosing the appropriate analysis from among the multiple analyses rendered by AraMorph, a sophisticated rule-based morphological analyzer by Buckwalter. Morphological analysis may be characterized as the process of segmenting a surface word form into its component derivational and inflectional morphemes. In a language such as Arabic, which exhibits both inflectional and derivational morphology, the morphological tags tend to be fine-grained, amounting to a large number of tags -- AraMorph has 135 distinct morphological labels -- in contrast to POS tags, which are typically coarser grained. Using AraMorph, the choice of an appropriate morphological analysis entails clitic tokenization as well as assignment of a POS tag. Such morphological labels are potentially useful for NLP applications, yet the need to choose the analysis manually makes the process expensive.</Paragraph> <Paragraph position="1"> On the other hand, Khoja (2001) reports preliminary results on a hybrid, statistical and rule-based, POS tagger, APT. APT yields 90% accuracy on a tag set of 131 tags including both POS and inflectional morphology information. APT is a two-step hybrid system that combines rules with a Viterbi algorithm for statistically determining the appropriate POS tag. Given the tag set, APT is more of a morphological analyzer than a POS tagger.</Paragraph> </Section> </Paper>