<?xml version="1.0" standalone="yes"?>
<Paper uid="N04-4038">
  <Title>Automatic Tagging of Arabic Text: From Raw Text to Base Phrase Chunks</Title>
  <Section position="3" start_page="0" end_page="0" type="intro">
    <SectionTitle>
2 Arabic Language and Data
</SectionTitle>
    <Paragraph position="0"> Arabic is a Semitic language with rich templatic morphology. An Arabic word may be composed of a stem (consisting of a consonantal root and a template), plus affixes and clitics. The affixes include inflectional markers for tense, gender, and/or number. The clitics include some (but not all) prepositions, conjunctions, determiners, possessive pronouns and pronouns. Some are proclitics (attaching to the beginning of a stem) and some are enclitics (attaching to the end of a stem). The following is an example of the different morphological segments in the word وبحسناتهم (wbHsnAthm), which means "and by their virtues". Arabic is read from right to left, hence the directional switch in the English gloss.</Paragraph>
    <Paragraph position="1">
                enclitic   affix   stem     proclitic   proclitic
      Arabic:   هم         ات      حسن      ب           و
      Translit: hm         At      Hsn      b           w
      Gloss:    their      s       virtue   by          and

      The set of possible proclitics comprises the prepositions {b, l, k}, meaning by/with, to, as, respectively; the conjunctions {w, f}, meaning and, then, respectively; and the definite article or determiner {Al}, meaning the. An Arabic word may have a conjunction, a preposition and a determiner all cliticizing to its beginning. The set of possible enclitics comprises the pronouns (and possessive pronouns) {y, nA, k, kmA, km, knA, kn, h, hA, hmA, hnA, hm, hn}, meaning, respectively, my (mine), our (ours), your (yours), your (yours) [masc. dual], your (yours) [masc. pl.], your (yours) [fem. dual], your (yours) [fem. pl.], him (his), her (hers), their (theirs) [masc. dual], their (theirs) [fem. dual], their (theirs) [masc. pl.], their (theirs) [fem. pl.].</Paragraph>
    <Paragraph position="2"> An Arabic word may only have a single enclitic at the end. In this paper, stems+affixes, proclitics, enclitics and punctuation are referred to as tokens. We define a token as a space-delimited unit in clitic-tokenized text.</Paragraph>
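    <Paragraph> The clitic structure described above can be illustrated with a minimal sketch. It is not the method used in this paper, which relies on supervised learning; it is a naive, dictionary-free enumeration of candidate splits over transliterated input. The function name candidate_segmentations is hypothetical, and the fixed proclitic order (conjunction, then preposition, then determiner) is an assumption consistent with the example word.

```python
from typing import List, Tuple

# Clitic inventories copied from the text (transliterated forms).
CONJUNCTIONS = ("w", "f")
PREPOSITIONS = ("b", "l", "k")
DETERMINERS = ("Al",)
ENCLITICS = ("y", "nA", "k", "kmA", "km", "knA", "kn",
             "h", "hA", "hmA", "hnA", "hm", "hn")


def candidate_segmentations(word: str) -> List[Tuple[str, ...]]:
    """Enumerate splits of `word` into optional proclitics, a non-empty
    stem+affixes unit, and at most one enclitic.

    Hypothetical helper for illustration only: it over-generates, because
    it has no lexicon or context to reject implausible stems.
    """
    results = []
    # Each proclitic slot is optional; the empty string means "absent".
    # Assumed order: conjunction, preposition, determiner.
    for conj in ("",) + CONJUNCTIONS:
        for prep in ("",) + PREPOSITIONS:
            for det in ("",) + DETERMINERS:
                prefix = conj + prep + det
                if not word.startswith(prefix):
                    continue
                rest = word[len(prefix):]
                # At most a single enclitic may close the word.
                for enc in ("",) + ENCLITICS:
                    if enc and not rest.endswith(enc):
                        continue
                    stem = rest[:len(rest) - len(enc)]
                    if not stem:
                        continue  # a word needs a stem+affixes unit
                    tokens = (conj, prep, det, stem, enc)
                    results.append(tuple(t for t in tokens if t))
    return results


if __name__ == "__main__":
    # One candidate per line, tokens separated by spaces.
    for segmentation in candidate_segmentations("wbHsnAthm"):
        print(" ".join(segmentation))
```

For the example word wbHsnAthm, the candidates include both the unsegmented word and the split w b HsnAt hm shown in the table, among others. Choosing the correct candidate is the part that needs annotated training data.</Paragraph>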
    <Paragraph position="3"> We adopt a supervised learning approach, hence the need for annotated training data. Such data are available from the Arabic TreeBank, a Modern Standard Arabic corpus containing Agence France Presse (AFP) newswire articles spanning a period of five months, from July through November of 2000. The corpus comprises 734 news articles (140k words, corresponding to 168k tokens after semi-automatic segmentation) covering various topics such as sports, politics, news, etc.</Paragraph>
  </Section>
</Paper>