<?xml version="1.0" standalone="yes"?>
<Paper uid="P05-1071">
  <Title>Arabic Tokenization, Part-of-Speech Tagging and Morphological Disambiguation in One Fell Swoop</Title>
  <Section position="3" start_page="0" end_page="573" type="intro">
    <SectionTitle>
2 General Approach
</SectionTitle>
    <Paragraph position="0"> Arabic words are often ambiguous in their morphological analysis. This is due to Arabic's rich system of affixation and clitics and the omission of disambiguating short vowels and other orthographic diacritics in standard orthography (&quot;undiacritized orthography&quot;). On average, a word form in the ATB has about 2 morphological analyses. An example of a word with some of its possible analyses is shown in Figure 1. Analyses 1 and 4 are both nouns. They differ in that the first noun has no affixes, while the second noun has a conjunction prefix (+w 'and') and a pronominal possessive suffix (+y 'my').</Paragraph>
    <Paragraph position="1"> In our approach, tokenizing and morphologically tagging (including part-of-speech tagging) are the same operation, which consists of three phases.</Paragraph>
    <Paragraph position="2"> First, we obtain from our morphological analyzer a list of all possible analyses for the words of a given sentence. We discuss the data and our lexicon in more detail in Section 4.</Paragraph>
    <Paragraph position="3"> Second, we apply classifiers for ten morphological features to the words of the text. The full list of features is shown in Figure 2, which also identifies possible values and which word classes (POS) can express these features. We discuss the training and decoding of these classifiers in Section 5.</Paragraph>
    <Paragraph position="4"> Third, we choose among the analyses returned by the morphological analyzer by using the output of the classifiers. This is a non-trivial task, as the classifiers may not fully disambiguate the options, or they may be contradictory, with none of them fully matching any one choice. We investigate different ways of making this choice in Section 6.</Paragraph>
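The third phase can be sketched as follows. This is a minimal illustration in Python, not one of the selection methods evaluated in Section 6: it assumes each analysis is a dictionary from feature names to values, and it simply picks the analysis that agrees with the largest number of classifier predictions. The feature names and values (`POS`, `Conj`, `Pron`, `Num`) and the functions `agreement` and `choose_analysis` are hypothetical.

```python
# Hypothetical data model: a candidate analysis is a dict mapping feature
# names to values, as a morphological analyzer might return it.

def agreement(analysis, predictions):
    """Count the features on which an analysis matches the classifier output."""
    return sum(1 for feat, value in predictions.items()
               if analysis.get(feat) == value)


def choose_analysis(candidates, predictions):
    """Return the candidate that agrees with the most classifier predictions.

    Ties go to the earliest candidate, because max() returns the first
    maximal item it encounters.
    """
    if not candidates:
        raise ValueError("the analyzer returned no candidate analyses")
    return max(candidates, key=lambda a: agreement(a, predictions))


# Invented candidates for one word (values are placeholders).
candidates = [
    {"POS": "N", "Conj": "NO", "Pron": "NO", "Num": "SG"},
    {"POS": "V", "Conj": "NO", "Pron": "NO", "Num": "SG"},
    {"POS": "N", "Conj": "YES", "Pron": "YES", "Num": "SG"},
]

# Invented classifier output, one predicted value per feature.
predictions = {"POS": "N", "Conj": "YES", "Pron": "YES", "Num": "PL"}

best = choose_analysis(candidates, predictions)
```

In this example no candidate matches all four predictions, yet the third candidate is still preferred because it agrees on three of them.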
    <Paragraph position="5"> As a result of this process, we have the original text, with each word augmented with values for all the features in Figure 2. These values represent a complete morphological disambiguation. Furthermore, these features contain enough information about the presence of clitics and affixes to perform tokenization, for any reasonable tokenization scheme. Finally, we can determine the POS tag, for any morphologically motivated POS tagset. Thus, we have performed tokenization, traditional POS tagging, and full morphological disambiguation in one fell swoop.</Paragraph>
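To illustrate how a tokenization scheme can be read off the disambiguated output, the following sketch assumes (hypothetically) that the chosen analysis exposes its segments in order, each labelled with the clitic feature it realizes or with `STEM`. Under that assumption, a scheme is just the set of clitic classes to split off. The segment forms, the labels, and the function `tokenize` are placeholders for illustration, not part of the system described here.

```python
def tokenize(segments, split_off):
    """Split a word into tokens according to a tokenization scheme.

    segments  -- ordered (form, role) pairs for one disambiguated word
    split_off -- set of roles that the scheme turns into separate tokens

    Adjacent segments whose role is not in split_off are concatenated.
    """
    tokens = []
    current = ""
    for form, role in segments:
        if role in split_off:
            if current:
                tokens.append(current)
                current = ""
            tokens.append(form)
        else:
            current += form
    if current:
        tokens.append(current)
    return tokens


# Placeholder word: conjunction proclitic + stem + pronominal enclitic.
word = [("w", "Conj"), ("stem", "STEM"), ("y", "Pron")]

no_split = tokenize(word, set())
conj_only = tokenize(word, {"Conj"})
all_clitics = tokenize(word, {"Conj", "Pron"})
```

The same disambiguated word yields one, two, or three tokens depending only on which clitic classes the scheme names, which is the sense in which the feature values suffice for any such scheme.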
  </Section>
</Paper>