<?xml version="1.0" standalone="yes"?> <Paper uid="P05-1071"> <Title>Arabic Tokenization, Part-of-Speech Tagging and Morphological Disambiguation in One Fell Swoop</Title> <Section position="5" start_page="574" end_page="576" type="metho"> <SectionTitle> 4 Preparing the Data </SectionTitle> <Paragraph position="0"> The data we use comes from the Penn Arabic Tree-bank (Maamouri et al., 2004). Like the English Penn Treebank, the corpus is a collection of news texts.</Paragraph> <Paragraph position="1"> Unlike the English Penn Treebank, the ATB is an on-going effort, which is being released incrementally.</Paragraph> <Paragraph position="2"> As can be expected in this situation, the annotation has changed in subtle ways between the incremental releases. Even within one release (especially the first) there can be inconsistencies in the annotation.</Paragraph> <Paragraph position="3"> As our approach builds on linguistic knowledge, we need to carefully study how linguistic facts are represented in the ATB. In this section, we briefly summarize how we obtained the data in the representation we use for our machine learning experiments.4 We use the first two releases of the ATB, ATB1 and ATB2, which are drawn from different news sources. We divided both ATB1 and ATB2 into de- null velopment, training, and test corpora with roughly 12,000 word tokens in each of the development and test corpora, and 120,000 words in each of the training corpora. We will refer to the training corpora as TR1 and TR2, and to the test corpora as, TE1 and TE2. We report results on both TE1 and TE2 because of the differences in the two parts of the ATB, both in terms of origin and in terms of data preparation. null We use the ALMORGEANA morphological analyzer (Habash, 2005), a lexeme-based morphological generator and analyzer for Arabic.5 A sample output of the morphological analyzer is shown in Figure 1. ALMORGEANA uses the databases (i.e., lexicon) from the Buckwalter Arabic Morphological Analyzer, but (in analysis mode) produces an output in the lexeme-and-feature format (which we need for our approach) rather than the stem-and-affix format of the Buckwalter analyzer. We use the data from first version of the Buckwalter analyzer (Buckwalter, 2002). The first version is fully consistent with neither ATB1 nor ATB2.</Paragraph> <Paragraph position="4"> Our training data consists of a set of all possible morphological analyses for each word, with the unique correct analysis marked. Since we want to learn to choose the correct output using the features generated by ALMORGEANA, the training data must also be in the ALMORGEANA output format. To obtain this data, we needed to match data in the ATB to the lexeme-and-feature representation output by ALMORGEANA. The matching included the use of some heuristics, since the representations and choices are not always consistent in the ATB. For example, a0a2a1a4a3 nHw 'towards' is tagged as AV, N, or V (in the same syntactic contexts). We verified whether we introduced new errors while creating our data representation by manually inspecting 400 words chosen at random from TR1 and TR2. 
<Paragraph position="5"> An important issue in using morphological analyzers for morphological disambiguation is what happens to unanalyzed words, i.e., words that receive no analysis from the morphological analyzer.</Paragraph>
<Paragraph position="6"> These are frequently proper nouns; a typical example is brlwskwny 'Berlusconi', for which no entry exists in the Buckwalter lexicon. A backoff analysis mode in ALMORGEANA uses the morphological databases of prefixes, suffixes, and allowable combinations from the Buckwalter analyzer to hypothesize all possible stems along with feature sets. Our Berlusconi example yields 41 possible analyses, including the correct one (as a singular masculine PN). Thus, with the backoff analysis, unanalyzed words are distinguished for us only by the larger number of possible analyses (making it harder to choose the correct analysis). There are not many unanalyzed words in our corpus. In TR1, there are only 22 such words, presumably because the Buckwalter lexicon our morphological analyzer uses was developed on TR1. In TR2, we have 737 words without analysis (0.61% of the entire corpus, giving us a coverage of about 99.4% on domain-similar text for the Buckwalter lexicon).</Paragraph>
<Paragraph position="7"> In ATB1, and to a lesser degree in ATB2, some words have been given no morphological analysis. (These cases are not necessarily the same words that our morphological analyzer cannot analyze.) The POS tag assigned to these words is then NO_FUNC.</Paragraph>
<Paragraph position="8"> In TR1 (138,756 words), we have 3,088 NO_FUNC POS labels (2.2%). In TR2 (168,296 words), the number of NO_FUNC labels has been reduced to 853 (0.5%). Since there is no meaningful gold solution for these cases, we have removed them from the evaluation (but not from training). In contrast, Diab et al. (2004) treat NO_FUNC like any other POS tag, but it is unclear whether this is meaningful. Thus, when comparing results from different approaches which make different choices about the data (for example, the NO_FUNC cases), one should bear in mind that small differences in performance are probably not meaningful.</Paragraph>
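The backoff analysis mode described above can be sketched roughly as follows (a minimal illustration assuming dictionary-like prefix and suffix tables and a set of allowable prefix-suffix combinations; the actual Buckwalter table format and ALMORGEANA's implementation differ):

```python
# Rough sketch of backoff analysis for out-of-lexicon words: try every
# split of the word into known prefix + open-class stem + known suffix,
# keeping only prefix/suffix pairs that the tables allow to combine.
# The empty string is assumed to be listed as a valid prefix and suffix.
def backoff_analyses(word, prefixes, suffixes, compatible):
    """prefixes/suffixes: map affix string -> list of feature sets;
    compatible: set of (prefix, suffix) pairs allowed to co-occur."""
    analyses = []
    for i in range(len(word) + 1):
        for j in range(i + 1, len(word) + 1):  # stem must be non-empty
            pre, stem, suf = word[:i], word[i:j], word[j:]
            if pre in prefixes and suf in suffixes and (pre, suf) in compatible:
                for pf in prefixes[pre]:
                    for sf in suffixes[suf]:
                        analyses.append({'prefix': pre, 'stem': stem,
                                         'suffix': suf, 'features': (pf, sf)})
    return analyses
```

Enumerating all splits is what yields the many hypotheses for a word like brlwskwny: the whole word is one candidate stem, but so is every substring left over after stripping a licensed affix pair.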
</Section>
<Section position="6" start_page="576" end_page="577" type="metho">
<SectionTitle> 5 Classifiers for Linguistic Features </SectionTitle>
<Paragraph position="0"> We now describe how we train classifiers for the morphological features in Figure 2. We train one classifier per feature. We use Yamcha (Kudo and Matsumoto, 2003), an implementation of support vector machines which includes Viterbi decoding. As training features, we use two sets. These sets are based on the ten morphological features in Figure 2, plus four other "hidden" morphological features, for which we do not train classifiers, but which are represented in the analyses returned by the morphological analyzer. The reason we do not train classifiers for the hidden features is that the analyzer returns them only when they are marked overtly in the orthography; they are not disambiguated when they are not overtly marked. The hidden features are indefiniteness (presence of nunation), idafa (possessed), case, and mood.</Paragraph>
<Paragraph position="1"> First, for each of the 14 morphological features and for each possible value (including 'NA' if applicable), we define a binary machine learning feature which states whether any morphological analysis for that word assigns the feature that value. This gives us 58 machine learning features per word. In addition, we define a second set of features which abstracts over the first set: for each feature, we state whether any morphological analysis for that word has a value other than 'NA'. This yields a further 11 machine learning features (as 3 morphological features never have the value 'NA'). In addition, we use the untokenized word form and a binary feature stating whether there is an analysis at all. This gives us a total of 71 machine learning features per word. We specify a window of two words preceding and following the current word, using all 71 features for each word in this 5-word window. In addition, two dynamic features are used, namely the classifications made for the preceding two words. For each of the ten classifiers, Yamcha then returns a confidence value for each possible value of the classifier, and in addition it marks the value that is chosen during subsequent Viterbi decoding (which need not be the value with the highest confidence value, because of the inclusion of dynamic features).</Paragraph>
<Paragraph position="2"> We train on TR1 and report the results for the ten Yamcha classifiers on TE1 and TE2, using all simple tokens,7 including punctuation, in Figure 3 (the figure shows accuracy for the ten morphological features trained on TR1 and evaluated on TE1 and TE2; BL is the unigram baseline trained on TR1). The baseline BL is the most common value associated in the training corpus TR1 with every feature for a given word form (unigram). We see that the baseline for TE1 is quite high, which we assume is due to the fact that when there is ambiguity, often one interpretation is much more prevalent than the others. The error rates of the baseline approximately double on TE2, reflecting the difference between TE2 and TR1, and the small size of TR1. The performance of our classifiers is good on TE1 (third column), and only slightly worse on TE2 (fifth column). We attribute the increase in error reduction over the baseline for TE2 to successfully learned generalizations.</Paragraph>
<Paragraph position="3"> We investigated the performance of the classifiers on unanalyzed words. The performance is generally below the baseline BL. We attribute this to the almost complete absence of unanalyzed words in the training data TR1. In future work we could attempt to improve performance in these cases; however, given their small number, this does not seem a priority.</Paragraph>
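A minimal sketch of the feature extraction just described (hypothetical names; the actual Yamcha input is a token-per-line table), assembling the 71 features for one word:

```python
# Sketch of the 71 machine learning features per word:
# 58 indicators "some analysis gives feature f the value v",
# 11 indicators "some analysis gives f a value other than 'NA'"
#    (skipping the 3 features that never take 'NA'),
# plus the untokenized word form and an analyzability flag.
def word_features(word, analyses, feature_values, always_valued):
    """analyses: list of dicts mapping feature -> value;
    feature_values: dict mapping each of the 14 features to its value set;
    always_valued: the 3 features that never have the value 'NA'."""
    feats = {}
    for f, values in feature_values.items():
        for v in values:
            feats[f'{f}={v}'] = any(a.get(f) == v for a in analyses)
    for f in feature_values:
        if f not in always_valued:
            feats[f'{f}!=NA'] = any(a.get(f, 'NA') != 'NA' for a in analyses)
    feats['word'] = word                    # untokenized word form
    feats['has_analysis'] = bool(analyses)  # analyzer coverage flag
    return feats
```

At classification time, Yamcha sees these features for the current word and the two words on either side, plus its own two preceding decisions as dynamic features.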
<Paragraph position="4"> 7 We use the term orthographic token to designate tokens determined only by white space, while simple tokens are orthographic tokens from which punctuation has been segmented (becoming its own token), and from which all tatweels (the elongation character) have been removed.</Paragraph>
</Section>
<Section position="7" start_page="577" end_page="577" type="metho">
<SectionTitle> 6 Choosing an Analysis </SectionTitle>
<Paragraph position="0"> Once we have the results from the classifiers for the ten morphological features, we combine them to choose an analysis from among those returned by the morphological analyzer. We investigate several options for this combination. In the following, we use two numbers for each analysis. First, the agreement is the number of classifiers agreeing with the analysis. Second, the weighted agreement is the sum, over all classifiers, of the classification confidence measure of the value that agrees with the analysis. The agreement, but not the weighted agreement, uses Yamcha's Viterbi decoding.</Paragraph>
<Paragraph position="1"> The majority combiner (Maj) chooses the analysis with the largest agreement.</Paragraph>
<Paragraph position="2"> The confidence-based combiner (Con) chooses the analysis with the largest weighted agreement.</Paragraph>
<Paragraph position="3"> The additive combiner (Add) chooses the analysis with the largest sum of agreement and weighted agreement.</Paragraph>
<Paragraph position="4"> The multiplicative combiner (Mul) chooses the analysis with the largest product of agreement and weighted agreement.</Paragraph>
<Paragraph position="5"> We use Ripper (Cohen, 1996) to learn a rule-based classifier (Rip) that determines whether an analysis from the morphological analyzer is a "good" or a "bad" analysis. We use the following features for training: for each morphological feature in Figure 2, we state whether or not the value chosen by its classifier agrees with the analysis, and with what confidence level. In addition, we use the word form. (The reason we use Ripper here is that it allows us to learn lower bounds for the confidence score features, which are real-valued.) In training, only the correct analysis is good. If exactly one analysis is classified as good, we choose it; otherwise we use Maj to choose.</Paragraph>
<Paragraph position="6"> The baseline (BL) chooses the analysis most commonly assigned in TR1 to the word in question.</Paragraph>
<Paragraph position="7"> For unseen words, the choice is made randomly.</Paragraph>
<Paragraph position="8"> In all cases, any remaining ties are resolved randomly.</Paragraph>
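Under the definitions above, the four score-based combiners reduce to an argmax over per-analysis scores. A sketch (hypothetical names; random tie-breaking as in the paper):

```python
# Sketch of the Maj/Con/Add/Mul combiners: each candidate analysis is
# summarized by its agreement (number of agreeing classifiers) and its
# weighted agreement (summed confidences of agreeing values).
import random

SCORES = {
    'Maj': lambda agr, wagr: agr,
    'Con': lambda agr, wagr: wagr,
    'Add': lambda agr, wagr: agr + wagr,
    'Mul': lambda agr, wagr: agr * wagr,
}

def choose_analysis(analyses, scheme='Maj'):
    """analyses: list of (agreement, weighted_agreement) pairs, one per
    candidate analysis. Returns the index of the chosen analysis;
    remaining ties are resolved randomly."""
    score = SCORES[scheme]
    best = max(score(a, w) for a, w in analyses)
    ties = [i for i, (a, w) in enumerate(analyses) if score(a, w) == best]
    return random.choice(ties)
```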
<Paragraph position="9"> We present the performance in Figure 4. We see that the best performing combination algorithm on TE1 is Maj, and on TE2 it is Rip. Recall that the Yamcha classifiers are trained on TR1; in addition, Rip is trained on the output of these Yamcha classifiers on TR2. The difference in performance between TE1 and TE2 shows the difference between ATB1 and ATB2 (a different news source, and also small differences in annotation). However, the results for Rip show that retraining the Rip classifier on a new corpus can improve the results, without the need for retraining all ten Yamcha classifiers (which takes considerable time).</Paragraph>
<Paragraph position="10"> Figure 4 presents the accuracy of tagging using the whole complex morphological tagset. We can project this complex tagset to a simpler tagset, for example, POS. The tagging accuracy for the simpler tagset must then be greater than or equal to the accuracy for the complex morphological tagset: even if a combining algorithm chooses the wrong analysis (which is counted as a failure for the evaluation in this section), the chosen analysis may agree with some of the correct morphological features. We discuss our performance on the POS feature in Section 8.</Paragraph>
</Section>
<Section position="8" start_page="577" end_page="578" type="metho">
<SectionTitle> 7 Evaluating Tokenization </SectionTitle>
<Paragraph position="0"> The term "tokenization" refers to the segmenting of a naturally occurring input sequence of orthographic symbols into elementary symbols ("tokens") used in subsequent processing steps (such as parsing) as basic units. In our approach, we determine all morphological properties of a word at once, so we can use this information to determine the tokenization.</Paragraph>
<Paragraph position="1"> There is no single possible or obvious tokenization scheme: a tokenization scheme is an analytical tool devised by the researcher. We evaluate in this section how well our morphological disambiguation determines the ATB tokenization. (In Figure 5, the word-based accuracy measures for each input word whether it gets tokenized correctly, independently of the number of resulting tokens; the token-based measures refer to the four token fields into which the ATB splits each word.) The ATB starts with a simple tokenization, and then splits the word into four fields: conjunctions; particles (prepositions in the case of nouns); the word stem; and pronouns (object clitics in the case of verbs, possessive clitics in the case of nouns). The ATB does not tokenize the definite article Al+.</Paragraph>
<Paragraph position="2"> We compare our output to the morphologically analyzed form of the ATB, and determine whether our morphological choices lead to the correct identification of those clitics that need to be stripped off.8 For our evaluation, we use only the Maj chooser, as it performed best on TE1. We evaluate in two ways.</Paragraph>
<Paragraph position="3"> In the first evaluation, we determine for each simple input word whether the tokenization is correct (no matter how many ATB tokens result). We report the percentage of words which are correctly tokenized in the second column of Figure 5. In the second evaluation, we report on the number of output tokens. Each word is divided into exactly four token fields, which can be either filled or empty (in the case of the three clitic token fields) or correct or incorrect (in the case of the stem token field). We report in Figure 5 accuracy over all token fields for all words in the test corpus, as well as recall, precision, and f-measure for the non-null token fields. The baseline BL is the tokenization associated with the morphological analysis most frequently chosen for the input word in training.</Paragraph>
<Paragraph position="4"> 8 The ATB generates normalized forms of certain clitics and of the word stem, so that the resulting tokens are not simply the result of splitting the original words. We do not actually generate the surface token form from our deep representation, but this can be done in a deterministic, rule-based manner, given our rich morphological analysis, e.g., by using ALMORGEANA in generation mode after splitting off all separable tokens.</Paragraph>
<Paragraph position="5"> While the token-based evaluation is identical to that performed by Diab et al. (2004), the results are not directly comparable, as they did not use actual input words, but rather recreated input words from the regenerated tokens in the ATB. Sometimes this can simplify the analysis: for example, p (ta marbuta) must be word-final in Arabic orthography, and thus a word-medial p in a recreated input word reliably signals a token boundary. The rather high baseline shows that tokenization is not a hard problem.</Paragraph>
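The two evaluations just described could be computed as in the following sketch (hypothetical names; each word is represented by its four token fields, with None marking an empty clitic field):

```python
# Sketch of the two tokenization evaluations: word-based accuracy
# (all four fields correct) and token-field-based accuracy, plus
# precision/recall/f-measure over the non-null token fields.
def evaluate_tokenization(gold_words, pred_words):
    word_ok = field_ok = n_fields = tp = fp = fn = 0
    for gold, pred in zip(gold_words, pred_words):  # 4-tuples of fields
        word_ok += (gold == pred)
        for g, p in zip(gold, pred):
            n_fields += 1
            field_ok += (g == p)
            if p is not None and g == p: tp += 1   # correct non-null field
            if p is not None and g != p: fp += 1   # spurious or wrong field
            if g is not None and g != p: fn += 1   # missed or wrong field
    prec, rec = tp / (tp + fp), tp / (tp + fn)
    return {'word_acc': word_ok / len(gold_words),
            'field_acc': field_ok / n_fields,
            'precision': prec, 'recall': rec,
            'f_measure': 2 * prec * rec / (prec + rec)}
```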
</Section>
<Section position="9" start_page="578" end_page="578" type="metho">
<SectionTitle> 8 Evaluating POS Tagging </SectionTitle>
<Paragraph position="0"> The POS tagset Diab et al. (2004) use is a subset of the tagset for English that was introduced with the English Penn Treebank. The large set of Arabic tags has been mapped (by the Linguistic Data Consortium) to this smaller English set, and the meaning of the English tags has changed in the process. We consider this tagset unmotivated, as it makes morphological distinctions because they are marked in English, not because they are relevant to Arabic. The morphological distinctions that the English tagset captures represent the complete morphological variation that can be found in English. In Arabic, however, much morphological variation goes untagged. For example, verbal inflections for subject person, number, and gender are not marked; dual and plural are not distinguished on nouns; and gender is not marked on nouns at all. For Arabic nouns, the gender feature is arguably the more interesting distinction (rather than the number feature), as verbs in Arabic always agree with their nominal subjects in gender, while agreement in number occurs only when the nominal subject precedes the verb. We use this tagset here only to compare with previous work. Instead, we advocate using a reduced part-of-speech tagset, along with the other orthogonal linguistic features in Figure 2.</Paragraph>
<Paragraph position="1"> We map our best solutions as chosen by the Maj model in Section 6 to the English tagset, and we furthermore assume (as do Diab et al. (2004)) the gold standard tokenization. We then evaluate against the gold standard POS tagging, which we have mapped to the English tagset in the same way.</Paragraph>
</Section>
</Paper>