<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-1602">
  <Title>Developing an Arabic Treebank: Methods, Guidelines, Procedures, and Tools</Title>
  <Section position="4" start_page="1" end_page="1" type="metho">
    <SectionTitle>
3 Reconciling Treebank annotation with
</SectionTitle>
    <Paragraph position="0"> traditional grammar concepts in Arabic The question we faced in the early stages of the ATB was how to develop a Treebank methodology - an analysis of all the targeted syntactic structures - for MSA as represented by unvocalized written text data. Since all readers of Arabic - Arabs and foreigners alike - mentally supply the grammatical information needed to reach an interpretation of the text and a consequent understanding, and since all our recruited annotators are highly educated native Arabic speakers, we accepted that premise for our first corpus annotation. Our conclusion was that the two-level annotation was possible, but we noticed that, because of the extra time spent hesitating over case markings at the TB level, TB annotation was more difficult and more time-consuming. This led us to include all possible/potential case endings among the POS alternatives provided by the morphological analyzer. Our choice was to make the two annotation passes equal in difficulty by transferring the vocalization difficulty to the POS level. We also thought it better to localize that difficulty at the initial level of annotation and to seek the best solution to it there. So far, we are happy with that choice. We are aware of the need for full and correct vocalization in our ATB, and we are also aware that there will never be an extensive vocalized corpus - except for the Koranic text - that we could fully trust. The challenge was, and still is, to find annotators with a very high level of grammatical knowledge of MSA, and that is a tall order both here and in the Arab region.</Paragraph>
    <Paragraph position="1"> So, having made the change from unvocalized text in the 'AFP Corpus' to fully vocalized text now for the 'ANNAHAR Corpus,' we still need to ask ourselves the question of what is better: (a) an annotated corpus in which the ATB end users are left with the task of providing case endings to read/understand or (b) an annotated ATB corpus displaying case endings with a higher percentage of errors due to a significantly more complex annotation task?</Paragraph>
    <Section position="1" start_page="1" end_page="1" type="sub_section">
      <SectionTitle>
3.1 Training annotators, ATB annotation
</SectionTitle>
      <Paragraph position="0"> characteristics and speed The two main factors which affect annotation speed in our ATB experience are both related to the specific 'stumbling blocks' of the Arabic language.</Paragraph>
      <Paragraph position="1"> 1. The first factor which affects annotation accuracy and consistency pertains to the annotators' educational background (their linguistic 'mindset') and, more specifically, to their often confused and unclear knowledge of traditional MSA grammar. Some of the important obstacles to POS training come from the confusing overlap between the morphological categories defined for Western language description and the traditional MSA grammatical framework. The traditional Arabic framework recognizes only three major morphological categories, namely NOUN, VERB, and PARTICLE.</Paragraph>
      <Paragraph position="2"> This creates an important overlap which leads to mistakes/errors and consequent mismatches between the POS and syntactic categories. We have noticed the following problems in our POS training: (a) the difficulty that annotators have in identifying ADJECTIVES as against NOUNS in a consistent way; (b) problems in defining the boundaries of the NOUN category, compounded by the fact that the NOUN category includes adjectives, adverbials, and prepositions, which can formally be nouns in particular functions (e.g., from fawq NOUN to fawqa PREP &amp;quot;above&amp;quot; and fawqu ADV, etc.); in this case the NOUN category overlaps with the adverbs and prepositions of Western languages, which is a problem for our annotators, who are linguistically savvy and have an advanced knowledge of English and, most times, a third Western language; (c) particles are very often indeterminate, and their category also overlaps with prepositions, conjunctions, negatives, etc.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="1" end_page="1" type="metho">
    <SectionTitle>
2. The second factor which affects annotation
</SectionTitle>
    <Paragraph position="0"> accuracy and speed is the behemoth of grammatical tests. Because of the frequency of obvious weaknesses among very literate and educated native speakers in their knowledge of the rules of '&lt;iErAb' (i.e., case ending marking), it became necessary to test the grammatical knowledge of each new potential annotator, and to continue occasional annotation testing at intervals in order to maintain consistency.</Paragraph>
    <Paragraph position="1"> While we have been able to take care of the first factor so far, the second one remains a very persistent problem because of the difficulty that Arab and foreign annotators alike encounter in reaching a consistent and agreed-upon use of case-ending annotation.</Paragraph>
  </Section>
  <Section position="6" start_page="1" end_page="1" type="metho">
    <SectionTitle>
4 Tools and procedures
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="1" end_page="1" type="sub_section">
      <SectionTitle>
4.1 Lexicon and morphological analyzer
</SectionTitle>
      <Paragraph position="0"> The Penn Arabic Treebank uses a level of annotation more accurately described as morphological analysis than as part-of-speech tagging. The automatic Arabic morphological analysis and part-of-speech tagging were performed with the Buckwalter Arabic Morphological Analyzer, an open-source software package distributed by the Linguistic Data Consortium (LDC catalog number LDC2002L49).</Paragraph>
      <Paragraph position="1"> The analyzer consists primarily of three Arabic-English lexicon files: prefixes (299 entries), suffixes (618 entries), and stems (82,158 entries representing 38,600 lemmas). The lexicons are supplemented by three morphological compatibility tables used to control prefix-stem combinations (1,648 entries), stem-suffix combinations (1,285 entries), and prefix-suffix combinations (598 entries).</Paragraph>
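The lexicon-plus-compatibility-table design described above can be sketched as follows. This is a toy illustration of the prefix-stem-suffix lookup idea only: all lexicon entries, morphological category names, and table contents below are invented for the example, not the analyzer's actual data.

```python
# Buckwalter-style analysis sketch: enumerate every prefix+stem+suffix
# split of a word, keep only splits whose three category pairs are
# licensed by the compatibility tables. Toy data, transliterated forms.
PREFIXES = {"": "Pref-0", "wa": "Pref-Wa", "bi": "Pref-Bi"}
SUFFIXES = {"": "Suff-0", "hA": "Suff-hA"}
STEMS = {"qirA'at": "Noun", "katab": "Verb"}

# Compatibility tables: category pairs allowed to combine.
AB = {("Pref-0", "Noun"), ("Pref-Bi", "Noun"),
      ("Pref-0", "Verb"), ("Pref-Wa", "Verb")}          # prefix-stem
AC = {("Pref-0", "Suff-0"), ("Pref-Bi", "Suff-0"),
      ("Pref-0", "Suff-hA"), ("Pref-Wa", "Suff-hA")}    # prefix-suffix
BC = {("Noun", "Suff-0"), ("Verb", "Suff-0"),
      ("Verb", "Suff-hA")}                              # stem-suffix

def analyze(word):
    """Return all (prefix, stem, suffix) splits passing all three tables."""
    out = []
    for i in range(len(word) + 1):
        for j in range(i, len(word) + 1):
            pre, stem, suf = word[:i], word[i:j], word[j:]
            if pre in PREFIXES and stem in STEMS and suf in SUFFIXES:
                a, b, c = PREFIXES[pre], STEMS[stem], SUFFIXES[suf]
                if (a, b) in AB and (a, c) in AC and (b, c) in BC:
                    out.append((pre, stem, suf))
    return out
```

With this toy data, `analyze("biqirA'at")` yields the single licensed split `("bi", "qirA'at", "")`; a word with no licensed split gets an empty list, mirroring the "no correct analysis" cases reported below.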
      <Paragraph position="2"> The Arabic Treebank: Part 2 corpus contains 125,698 Arabic-only word tokens (prior to the separation of clitics), of which 124,740 (99.24%) were provided with an acceptable morphological analysis and POS tag by the morphological parser, and 958 (0.76%) were items that the morphological parser failed to analyze correctly.</Paragraph>
      <Paragraph position="3">  Of the 293,035 Arabic word tokens, 289,722 (98.87%) were provided with an accurate morphological analysis and POS tag by the Buckwalter Arabic Morphological Analyzer.</Paragraph>
      <Paragraph position="4"> 3,313 (1.13%) Arabic word tokens were judged to be incorrectly analyzed, and were flagged with a comment describing the nature of the inaccuracy. (Note that 204 of the 3,313 tokens for which no correct analysis was found were typos in the original text).</Paragraph>
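The percentages above follow directly from the token counts in the text; a quick arithmetic check:

```python
# Sanity check of the coverage figures reported for the two corpora.
# Counts are taken from the text above; pct() rounds to two decimals,
# matching the reported percentages.
def pct(part, total):
    return round(100 * part / total, 2)

part2_total, part2_ok = 125_698, 124_740      # Arabic Treebank: Part 2
annahar_total, annahar_ok = 293_035, 289_722  # Arabic-only tokens

part2_coverage = pct(part2_ok, part2_total)                       # 99.24
annahar_coverage = pct(annahar_ok, annahar_total)                 # 98.87
annahar_errors = pct(annahar_total - annahar_ok, annahar_total)   # 1.13
```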
    </Section>
  </Section>
  <Section position="7" start_page="1" end_page="1" type="metho">
    <SectionTitle>
ANNAHAR
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="1" end_page="1" type="sub_section">
      <SectionTitle>
4.2 Parsing engine
</SectionTitle>
      <Paragraph position="0"> In order to improve the speed and accuracy of the hand annotation, we automatically pre-parse the data after POS annotation and before TB annotation using Dan Bikel's parsing engine (Bikel, 2002). Automatically pre-parsing the data allows the TB annotators to concentrate on the task of correcting a given parse and providing information about syntactic function (subject, direct object, adverbial, etc.).</Paragraph>
      <Paragraph position="1"> The parsing engine is capable of implementing a variety of generative, PCFG-style (probabilistic context-free grammar) models, including that of Mike Collins. As such, on English, it gets results that are as good as, if not slightly better than, those of the Collins parser. Currently, this means that, for Section 00 of the WSJ portion of the English Penn Treebank (the development test set), the parsing engine gets a recall of 89.90 and a precision of 90.15 on sentences of length &lt;= 40 words. The Arabic version of this parsing engine currently brackets AFP data with a recall of 75.6 and a precision of 77.4 on sentences of 40 words or less, and we are in the process of analyzing and improving the parser results.</Paragraph>
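The recall and precision figures above are labeled-bracket scores in the PARSEVAL style. A minimal sketch of the metric on invented spans (not the scoring implementation actually used):

```python
# PARSEVAL-style labeled-bracket scoring: a parse is treated as a set
# of labeled spans (label, start, end); recall is the fraction of gold
# brackets recovered, precision the fraction of predicted brackets
# that are correct. Example trees are invented for illustration.
def bracket_scores(gold, pred):
    """gold, pred: sets of (label, start, end). Returns (recall, precision)."""
    matched = len(gold & pred)
    recall = matched / len(gold) if gold else 0.0
    precision = matched / len(pred) if pred else 0.0
    return recall, precision

gold = {("S", 0, 5), ("NP", 0, 2), ("VP", 2, 5), ("NP", 3, 5)}
pred = {("S", 0, 5), ("NP", 0, 2), ("VP", 2, 5), ("PP", 3, 5)}
recall, precision = bracket_scores(gold, pred)  # 3 of 4 matched each way
```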
    </Section>
    <Section position="2" start_page="1" end_page="1" type="sub_section">
      <SectionTitle>
4.3 Annotation procedure
</SectionTitle>
      <Paragraph position="0"> Our annotation procedure is to use the automatic tools we have available to provide an initial pass through the data. Annotators then correct the automatic output.</Paragraph>
      <Paragraph position="1"> First, Tim Buckwalter's lexicon and morphological analyzer is used to generate a candidate list of &amp;quot;POS tags&amp;quot; for each word (in the case of Arabic, these are compound tags assigned to each morphological segment of the word). The POS annotation task is to select the correct POS tag from the list of alternatives provided. Once POS annotation is done, clitics are automatically separated based on the POS selection in order to create the segmentation necessary for treebanking. Then, the data is automatically parsed using Dan Bikel's parsing engine for Arabic. Treebank annotators correct the automatic parse, add semantic role information, empty categories and their coreference, and complete the parse. After that is done, we check for inconsistencies between the treebank and POS annotation. Many of the inconsistencies are corrected manually by annotators, or automatically by script where it is reliably safe to do so.</Paragraph>
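The procedure just described can be sketched as a sequence of stages. Every stage function below is a hypothetical stand-in, since the real pipeline uses the Buckwalter analyzer, human POS and treebank annotators, and Bikel's parser:

```python
# Sketch of the ATB annotation pipeline: automatic proposal, human POS
# choice, POS-driven clitic separation, automatic parse, manual
# correction. Stage functions are passed in; all are stand-ins.
def annotate(raw_words,
             propose_pos,     # analyzer: word -> candidate POS tags
             choose_pos,      # POS annotator: (word, candidates) -> tag
             split_clitics,   # (word, tag) -> list of (segment, tag) tokens
             parse,           # parser: tokens -> automatic parse
             correct_parse):  # TB annotator: parse -> corrected tree
    # 1. POS pass: pick the correct analysis for each word.
    tagged = [(w, choose_pos(w, propose_pos(w))) for w in raw_words]
    # 2. Clitic separation driven by the chosen POS tags.
    tokens = [seg for w, t in tagged for seg in split_clitics(w, t)]
    # 3. Automatic pre-parse, then manual treebank correction.
    return correct_parse(parse(tokens))

# Trivial stand-ins showing the data flow for one cliticized word.
tree = annotate(
    ["wakatabhA"],
    lambda w: ["CONJ+V+PRON"],
    lambda w, cands: cands[0],
    lambda w, t: [("wa", "CONJ"), ("katab", "V"), ("hA", "PRON")],
    lambda toks: list(toks),
    lambda p: p)
```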
    </Section>
    <Section position="3" start_page="1" end_page="1" type="sub_section">
      <SectionTitle>
4.4 POS annotation quality control
</SectionTitle>
      <Paragraph position="0"> Five files with a total of 853 words (and a varying number of POS choices per word) were each tagged independently by five annotators for a quality control comparison of POS annotators. Out of the total of 853 words, 128 show some disagreement. All five annotators agreed on 85% of the words; the pairwise agreement is at least 92.2%.</Paragraph>
      <Paragraph position="1"> For 82 out of the 128 words with some disagreement, four annotators agreed and only one disagreed. Of those, 55 are items with &amp;quot;no match&amp;quot; having been chosen from among the POS choices, due to one annotator's definition of good-enough match differing from all of the others'. The annotators have since reached agreement on which cases are truly &amp;quot;no match,&amp;quot; and thus the rate of this disagreement should fall markedly in future POS files, raising the rate of overall agreement.</Paragraph>
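The two agreement figures above can be computed as follows; this sketch assumes one label per annotator per word and illustrates both statistics on invented data:

```python
# Agreement statistics for a multi-annotator comparison: the fraction
# of words on which all annotators chose the same tag, and the minimum
# pairwise agreement over all annotator pairs. Labels are invented.
from itertools import combinations

def agreement(labels_per_word):
    """labels_per_word: list of tuples, one label per annotator per word.
    Returns (all-agree rate, minimum pairwise agreement)."""
    n_words = len(labels_per_word)
    n_ann = len(labels_per_word[0])
    all_agree = sum(len(set(labs)) == 1 for labs in labels_per_word) / n_words
    pairwise = [
        sum(labs[a] == labs[b] for labs in labels_per_word) / n_words
        for a, b in combinations(range(n_ann), 2)
    ]
    return all_agree, min(pairwise)
```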
    </Section>
  </Section>
  <Section position="8" start_page="1" end_page="1" type="metho">
    <SectionTitle>
5 Specifications for the Penn Arabic
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="1" end_page="1" type="sub_section">
      <SectionTitle>
Treebank annotation guidelines
5.1 Morphological analysis/Part-of-Speech
</SectionTitle>
      <Paragraph position="0"> The guidelines for the POS annotators are relatively straightforward, since the task essentially involves choosing the correct analysis from the list of alternatives provided by the morphological analyzer and adding the correct case ending. The difficulties encountered by annotators in assigning POS and case endings are discussed above and will be reviewed by Tim Buckwalter in a separate presentation at COLING 2004.</Paragraph>
    </Section>
    <Section position="2" start_page="1" end_page="1" type="sub_section">
      <SectionTitle>
5.2 Syntactic analysis
</SectionTitle>
      <Paragraph position="0"> For the most part, our syntactic/predicate-argument annotation of newswire Arabic follows the bracketing guidelines for the Penn English Treebank where possible (Bies, et al. 1995). Our main differences are the following: * The direct object of verbs is directly shown as NP-OBJ.</Paragraph>
      <Paragraph position="1"> We are also informed by on-going efforts to share data and reconcile annotations with other Arabic treebanking projects. * Functional tags (TMP, etc.) are shown on the adverbial (PP, ADVP, clausal) modification of predicates.</Paragraph>
      <Paragraph position="2"> * The argument/adjunct distinction within NP is shown for noun phrases and clauses.</Paragraph>
      <Paragraph position="3"> * Empty categories (pro-drop subjects and traces of syntactic movement) are inserted.</Paragraph>
      <Paragraph position="4"> * Apposition is distinguished from other modification of nouns only for proper names.</Paragraph>
      <Paragraph position="5"> In spite of the considerable differences in word order between Modern Standard Arabic and English, we found that, for the most part, it was relatively straightforward to adapt the guidelines for the Penn English Treebank to our Arabic Treebank. In the interest of speed in starting annotation and of using existing tools to the greatest extent possible, we chose to adapt as much as possible from the English Treebank guidelines. There exists a long-standing, extensive, and highly valued paradigm of traditional grammar in Classical Arabic. We nevertheless chose to adapt the constituency approach of the Penn English Treebank rather than keeping to a strict and difficult adherence to a traditional Arabic grammar approach, for several reasons: * compatibility with existing treebanks, processing software, and tools; * we thought it would be easier and more efficient to teach annotators, who come trained in Arabic grammar, to use our constituency approach than to teach computational linguists an old and complex Arabic-specific syntactic terminology.</Paragraph>
      <Paragraph position="6"> Nonetheless, it was important to adhere to an approach that did not strongly conflict with the traditional one, in order to ease the cognitive load on our annotators and in order to be taken seriously by modern Arabic grammarians. Since, in spite of the theoretical syntactic work being done (Mohamed, 2000), there has been little work on large Arabic data corpora under any of the current syntactic theories, we have been working out solutions to Arabic syntax by combining the Penn Treebank constituency approach with pertinent insights from traditional grammar as well as modern theoretical syntax. For example, we analyze the underlying basic sentence structure as verb-initial, following the traditional grammar approach. However, since the verb is actually not the first element in many sentences in the data, we adopt a topicalization structure for arguments that are fronted before the verb (as in Example 2, where the subject is fronted) and allow adverbials and conjunctions to appear freely before the verb (as in Example 3, where a prepositional phrase is pre-verbal). Example 3: &amp;quot;from another side, well-informed Egyptian sources revealed the truth of the matter.&amp;quot; For many structures, the traditional approach and the treebank approach come together very easily. The traditional &amp;quot;equational sentence,&amp;quot; for example, is a sentence that consists of a subject and a predicate without an overt verb (kAna &amp;quot;to be&amp;quot; does not appear overtly in the present tense). This is quite satisfactorily represented in the same way that small clauses are shown in the Penn English Treebank, as in Example 4, since traditional grammar does not posit a verb here and we do not want to commit to the location of any potential verb phrase in these sentences.</Paragraph>
      <Paragraph position="7"> Example 4: (S (NP-SBJ Al-mas&gt;alatu) (ADJP-PRD basiyTatuN)) &amp;quot;the question is simple&amp;quot;</Paragraph>
    </Section>
    <Section position="3" start_page="1" end_page="1" type="sub_section">
      <SectionTitle>
5.3 Current issues and nagging problems
</SectionTitle>
      <Paragraph position="0"> In a number of structures, however, the traditional grammar view does not line up immediately with the structural view that is necessary for annotation. Often these are structures that are known to be problematic in a more general sense for either traditional grammar or theoretical syntax, or both. We take both views into account and reconcile them in the best way that we can.</Paragraph>
      <Paragraph position="1"> The prevalence in Arabic sentences of cliticized determiners, prepositions, conjunctions, and pronouns led to a necessary difference in tokenization between the POS files and the TB files. Such cliticized constituents are written together with their host constituents in the text (e.g., Al+&lt;inosAn+i &amp;quot;the person&amp;quot; and bi+qirA'ati &amp;quot;with reading&amp;quot;). Clitics that play a role in the syntactic structure are split off into separate tokens (e.g., object pronouns cliticized to verbs, subject pronouns cliticized to complementizers, cliticized prepositions, etc.), so that their syntactic roles can be annotated in the tree. Clitics that do not affect the structure are not separated (e.g., determiners). Since the word boundaries necessary to separate the clitics are taken from the POS tags, and since it is not possible to show the syntactic structure unless the clitics are separated, correct POS tagging is extremely important for properly separating clitics prior to the syntactic annotation. In the example below, both the conjunction wa &amp;quot;and&amp;quot; and the direct object hA &amp;quot;it/them/her&amp;quot; are cliticized to the verb and also serve syntactic functions independent of the verb (sentential coordination and direct object).</Paragraph>
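The POS-driven clitic separation described here can be sketched as follows; the segment forms and tag names below are illustrative stand-ins, not the ATB's actual tag set:

```python
# Clitic separation driven by the compound POS tag: segments whose
# tags mark independent syntactic roles (conjunctions, prepositions,
# pronouns) become separate treebank tokens; other segments (e.g.
# determiners, verbal inflection) stay fused with their host.
SPLIT_TAGS = {"CONJ", "PREP", "PRON"}  # syntactically independent clitics

def split_clitics(segments):
    """segments: list of (form, tag) for one orthographic word.
    Returns treebank tokens; non-split segments merge with the host."""
    tokens = []
    host_form, host_tags = "", []
    for form, tag in segments:
        if tag in SPLIT_TAGS:
            if host_form:  # flush the fused host built so far
                tokens.append((host_form, "+".join(host_tags)))
                host_form, host_tags = "", []
            tokens.append((form, tag))
        else:
            host_form += form
            host_tags.append(tag)
    if host_form:
        tokens.append((host_form, "+".join(host_tags)))
    return tokens

# wa+sa+tu$Ahiduwna+hA "and you will watch it": the conjunction and
# object pronoun split off; the future marker stays on the verb.
word = [("wa", "CONJ"), ("sa", "FUT"), ("tu$Ahiduwna", "V"), ("hA", "PRON")]
```

A determiner-only word, by contrast, comes out as a single fused token, matching the rule that determiners are not separated.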
      <Paragraph position="2"> Example 5: and + will + you [masc.pl.] + watch/observe/witness + it/them/her. The rest of the verbal inflections are also regarded as clitics in traditional grammar terms. However, for our purposes they do not require independent segmentation, as they do not serve independent syntactic functions. The subject inflection, for example, appears readily with a full noun phrase subject in the sentence as well (although in this example, the subject is pro-dropped). The direct object pronoun clitic, in contrast, is in complementary distribution with full noun phrase direct objects. Topicalized direct objects can appear with resumptive pronouns in the post-verbal direct object position. However, resumptive pronouns in this structure should not be seen as problematic full noun phrases, as they are parasitic on the trace of movement - and in fact they are taken to be evidence of the topicalization movement, since resumptive pronouns are common in relative clauses and with other topicalizations.</Paragraph>
      <Paragraph position="3"> Thus, we regard the cliticized object pronoun as carrying the full syntactic function of direct object. As such, we segment it as a separate token and represent it as a noun phrase constituent that is a sister to the verb (as shown in Example 6 below). Example 6: &amp;quot;and you will observe her.&amp;quot; The question of the dual noun/verb nature of gerunds and participles in Arabic is certainly no less complex than for English or other languages. We have chosen to follow the Penn English Treebank practice of representing the more purely nominal masdar (verbal noun) as a noun phrase (NP) and the masdar that functions more verbally as a clause (as S-NOM when in a nominal position). In Example 7, the masdar behaves like a noun in assigning genitive case.</Paragraph>
      <Paragraph position="4"> Example 7: &amp;quot;with the reading of the book of syntax&amp;quot; [book genitive]. In Example 8, in contrast, the masdar functions more verbally, assigning accusative case.</Paragraph>
      <Paragraph position="5"> Example 8: &amp;quot;with Fatma's reading the book&amp;quot; [book accusative]. This annotation scheme, which allows for both the nominal and verbal functions of the masdar, is easily accepted and applied by annotators for the most part. However, there are situations where the nominal and verbal behaviors of the masdar conflict. For example, a masdar can take the determiner 'Al-' (the behavior of a noun) and at the same time assign accusative case (the behavior of a verb).</Paragraph>
      <Paragraph position="6"> Example 9: &amp;quot;with the (person in) charge of completion (of) the promised report&amp;quot; [completion accusative]. In this type of construction, the annotators must choose which behavior takes precedence (accusative case assignment trumps determiners, for example). However, this also raises the issues and problems of assigning case endings and of the annotators' knowledge of Arabic grammar and the rules of '&lt;iErAb.' These examples are grammatically complex, and finding the right answer (even in strictly traditional grammar terms) is often difficult.</Paragraph>
      <Paragraph position="7"> This kind of ambiguity and decision-making necessarily slows annotation speed and reduces accuracy. We are continuing our discussions and investigations into the best solutions for such issues.</Paragraph>
    </Section>
  </Section>
</Paper>