<?xml version="1.0" standalone="yes"?>
<Paper uid="W98-1001">
  <Title>Discovering Lexical Information by Tagging Arabic Newspaper Text</Title>
  <Section position="2" start_page="0" end_page="0" type="intro">
    <SectionTitle>
INTRODUCTION
</SectionTitle>
    <Paragraph position="0"> A lexicon is considered the backbone of any natural language application. It is an essential basis for parsing, text generation, and information retrieval systems, and none of these applications can be implemented without a good lexicon. All natural language processing systems need a lexicon full of explicit information [Ahlswede and Evens, 1988; Byrd et al., 1987; McCawley, 1986]. The best way to find the necessary lexical information, we believe, is to extract it automatically from text.</Paragraph>
    <Paragraph position="1"> We are developing a part-of-speech tagger for Arabic newspaper text. We are testing it on a corpus developed by Ahmad Hasnah [1996], based on text given to the Illinois Institute of Technology by the newspaper Al-Raya, published in Qatar. The questions we address here are how to build efficient techniques for automating the tagger system, and which techniques and algorithms can be used to find the part of speech and extract the features of a word.</Paragraph>
    <Paragraph position="2"> The Arabic language poses problems and challenges that are not present in English or other European languages.</Paragraph>
    <Paragraph position="3"> Newspaper articles are full of proper nouns, and tagging them requires special rules because Arabic does not distinguish between lower-case and upper-case letters. This leaves us with a big problem in recognizing proper nouns in Arabic text.</Paragraph>
    <Paragraph position="4"> The lack of vowels in the text we are using creates serious problems of ambiguity. Different vowels change a word from noun to verb and from one type of noun to another; they also change the meaning of the word. For example, the following two words have the same letters in the same sequence but different vowels, and therefore different meanings.</Paragraph>
    <Paragraph position="5"> k(a)t(a)b "wrote"; k(u)t(u)b "books". Most published Arabic text is not vowelized, with the exception of the Holy Quran and books for children.</Paragraph>
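The ambiguity described above can be illustrated with a minimal sketch. It assumes a Latin transliteration in which short vowels are written as a, i, and u; the two lexicon entries are the pair from the text, while the names VOWELIZED, strip_vowels, and readings_by_skeleton are hypothetical and not part of the tagger described in this paper.

```python
from collections import defaultdict

# Hypothetical mini-lexicon in Latin transliteration.
# Only these two entries come from the text (katab "wrote", kutub "books").
VOWELIZED = {
    "katab": ("verb", "wrote"),
    "kutub": ("noun", "books"),
}

# Assumption: short vowels are transliterated as a, i, u.
SHORT_VOWELS = set("aiu")


def strip_vowels(word):
    """Drop short vowels, mimicking unvowelized newspaper text."""
    return "".join(ch for ch in word if ch not in SHORT_VOWELS)


def readings_by_skeleton(lexicon):
    """Group vowelized entries by their shared consonant skeleton."""
    groups = defaultdict(list)
    for form, (pos, gloss) in lexicon.items():
        groups[strip_vowels(form)].append((form, pos, gloss))
    return dict(groups)


ambiguous = readings_by_skeleton(VOWELIZED)
# Both readings collapse onto the same written form "ktb".
print(ambiguous["ktb"])
```

A tagger that sees only the skeleton "ktb" has to choose between the verb and the noun reading from context, which is why vowel-free text is harder to tag.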
    <Paragraph position="6"> Some words in Arabic text begin with one, two, three, or four extra letters that constitute articles or prepositions. For example, the following word consists of two parts: a particle (a preposition letter) that is attached to the beginning of the noun without being part of it, and the noun itself.</Paragraph>
    <Paragraph position="7"> (on occasion): particle + noun. We need to identify these cases in the text and deal with them in a perceptive way.</Paragraph>
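One way to separate attached particles from a noun is sketched below. This is an illustration under stated assumptions, not the method of this paper: the proclitic inventory, the Buckwalter-style transliteration, the stem list, and the names PROCLITICS, KNOWN_STEMS, and segment are all hypothetical. The only detail taken from the text is the limit of four extra letters.

```python
# Hypothetical proclitic inventory in a Buckwalter-style transliteration;
# the paper does not list the prefixes its tagger handles.
PROCLITICS = ("w", "f", "b", "l", "k", "Al")

# Hypothetical stem list standing in for a real lexicon.
KNOWN_STEMS = {"ktAb", "bAb"}

# The text mentions one to four extra letters at the start of a word.
MAX_PREFIX_LETTERS = 4


def segment(word, stems=KNOWN_STEMS):
    """Return every (prefixes, stem) split whose stem is a known word."""
    results = []

    def walk(rest, prefixes):
        # Accept the remainder if it is a known stem.
        if rest in stems:
            results.append((tuple(prefixes), rest))
        # Try peeling one more proclitic, within the letter budget.
        for clitic in PROCLITICS:
            used = sum(len(p) for p in prefixes) + len(clitic)
            if rest.startswith(clitic) and MAX_PREFIX_LETTERS >= used:
                walk(rest[len(clitic):], prefixes + [clitic])

    walk(word, [])
    return results


# Four prefix letters (w + b + Al) in front of the stem ktAb.
print(segment("wbAlktAb"))
# The initial b of bAb belongs to the stem, so nothing is stripped.
print(segment("bAb"))
```

Checking the remainder against a stem list is what keeps a stem-initial letter from being mistaken for a particle. The sketch does not enforce any ordering among proclitics, which a real analyzer would need to do.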
    <Paragraph position="8"> In this paper we try to meet these challenges by building a tagger system whose main function is to parse an Arabic text, tag the parts of speech, and find their features in order to build a lexicon for the language. The three main techniques this system uses to tag words are: finding phrases (verb phrases, noun phrases, and proper noun phrases), analyzing the affixes of the word, and analyzing its pattern.</Paragraph>
  </Section>
</Paper>