<?xml version="1.0" standalone="yes"?> <Paper uid="C90-3090"> <Title>A MATRIX REPRESENTATION OF THE INFLECTIONAL FORMS OF ARABIC WORDS: A STUDY OF CO-OCCURRENCE PATTERNS</Title> <Section position="1" start_page="0" end_page="0" type="metho"> <SectionTitle> A MATRIX REPRESENTATION OF THE INFLECTIONAL FORMS OF ARABIC WORDS: A STUDY OF CO-OCCURRENCE PATTERNS </SectionTitle> <Paragraph position="0"> H.E. Mahgoub, M.A. Hashish ABSTRACT A &quot;Matrix&quot; method for representing the inflectional paradigms of Arabic words is proposed. This representation results in a classification of Arabic words into a tree structure (Fig(1)) whose leaves represent unique conjugational or derivational paradigms, each represented in the proposed &quot;Matrix&quot; form. A study of about 2,500 stems from a high-frequency Arabic wordlist due to Landau <1> has revealed a systematic set of co-occurrence patterns for the enclitic pronouns of Arabic verbs and for the possessive pronouns attached to Arabic nouns. Each co-occurrence pattern represents a subcategorization frame that reflects the underlying semantic relationship.</Paragraph> <Paragraph position="1"> The key feature that distinguishes these semantic patterns has been observed to be whether the attached suffixes refer to the animate or the inanimate. In some cases for verbs, the number of the subject is also a significant feature. These semantic features also extend to non-attached subjects and objects (for verbs) and to possessive noun complements (for nouns). Therefore the semantic classes presented in this paper also assist in syntactic/semantic analysis.</Paragraph> <Paragraph position="2"> The first application developed from the proposed representation is a stem-based Arabic morphological analyser, from which a spell checker (on a PS/2 microcomputer) emerged as a by-product. 
Currently, the system is being used to interact with an Arabic syntactic parser and there are plans to use it in a machine-assisted translation system.</Paragraph> <Paragraph position="3"> 1. INTRODUCTION Over the past few years there has been a marked increase in the use of computers in the Arabic-speaking countries. Many application programs in Arabic have been developed, but the field of computational linguistics is relatively new in Arabic and presents a unique challenge, due to the highly inflected nature of the Arabic language.</Paragraph> <Paragraph position="4"> In the present work, we have attempted to represent the morphological rules governing the inflections of Arabic words in a compact form which can simplify the processing of Arabic words by computers and which is independent of any particular application. There have been other attempts to show the conjugations of Arabic verbs <2>, but the treatment does not go into sufficient depth and not all enclitics, which are an essential part of Arabic verbs, are considered. Moreover, the treatment in <2> does not extend to nouns.</Paragraph> <Paragraph position="5"> A study of some 2,500 stems from a high-frequency Arabic wordlist due to Landau <1> has revealed certain systematic co-occurrence patterns governing verb enclitics and noun possessive pronouns. These patterns are what we call &quot;Matrices&quot; in this paper; each unique &quot;Matrix&quot; reflects a different semantic behaviour.</Paragraph> <Paragraph position="6"> To summarize Arabic morphology in a nutshell, about 80% of Arabic words can be derived from a sequence of three letters, called a triliteral root. For example, if we consider the root (K T B), we can form words such as [Arabic examples not recoverable] by</Paragraph> <Paragraph position="8"> subjecting the root to various &quot;forms&quot; or &quot;moulds&quot; and by undergoing certain morpho-phonemic (and possibly also morpho-graphemic) changes. For a full discussion of traditional Arabic morphology see <9> and <10>. 
In this paper, we shall define such an inflected form to be a &quot;STEM&quot;.</Paragraph> <Paragraph position="9"> Thus a stem may contain infixes and certain prefixes which are part of the &quot;mould&quot; but may not contain any suffixes. Suffixes for verbs are subject and object pronouns, while for nouns they are possessive pronouns.</Paragraph> <Paragraph position="10"> One further definition which is used in the proposed representation is the &quot;Core&quot;; this is simply the inflected form with all prefixes and suffixes stripped off. The core may or may not be a valid word.</Paragraph> <Paragraph position="11"> In comparison with other work in the area of traditional Arabic morphology (<3>,<4>), where the concern is with the rules which cause the inflected form to be derived from the ROOT, we have studied the rules governing the derivation of all possible inflected forms from the STEM, as defined above.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="metho"> <SectionTitle> 2. THE MATRIX REPRESENTATION </SectionTitle> <Paragraph position="0"> Sample &quot;MATRIX PARADIGMS&quot; are shown in Fig(2) for verbs and Fig(3) for nouns. Table(1) gives the keys in English to the columns of the Matrix Paradigms. The inflected form for a given Person/Number/Gender/Mode combination for verbs (obtained from the relevant &quot;row&quot; of the Matrix Paradigm) is constructed by concatenating the prefix, core and both subject and object pronoun column entries. The inflected forms for nouns are similarly constructed for a particular Number/Gender/Case combination.</Paragraph> <Paragraph position="1"> The various &quot;cells&quot; of the object pronoun columns indicate whether a particular entry is valid (indicated by &quot;١&quot;, an Arabic numeral one). Invalid entries are indicated by &quot;٠&quot;, an Arabic zero. 
It is due to this matrix of ones and zeros that the representation was named the &quot;Matrix Paradigm&quot;.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 3. TAXONOMY OF ARABIC WORDS </SectionTitle> <Paragraph position="0"> Fig(1) shows a tree diagram representing the taxonomical classification of Arabic verbs and nouns. The different &quot;levels&quot; in the tree correspond to different types of variation of the inflected form from one class to another. The first type of variation coincides more or less with the traditional classification and is represented at levels 2 and 3 for verbs and at level 2 for nouns.</Paragraph> <Paragraph position="1"> Each Matrix Paradigm also reflects two further types of variation, which can be considered separately from one another. The first is the variation in the core across the different rows; this dimension corresponds, for example, to the traditional study of verb conjugations (see <2>).</Paragraph> </Section> <Section position="4" start_page="0" end_page="416" type="metho"> <SectionTitle> - I - </SectionTitle> <Paragraph position="0"> The other type of variation is that in the distribution of the Matrix of ones and zeros, which is essentially a variation in the co-occurrence of object pronouns (for transitive verbs) and possessive pronouns (for nouns). This variation is reflected at level 4 of the taxonomy. Sections 3.1 and 3.2 below discuss these co-occurrence patterns in more detail for verbs and nouns separately.</Paragraph> <Section position="1" start_page="416" end_page="416" type="sub_section"> <SectionTitle> 3.1 CO-OCCURRENCE PATTERNS FOR VERBS </SectionTitle> <Paragraph position="0"> On examination of Landau's <1> high-frequency wordlist, the following features seemed to distinguish classes of verbs: 1- Whether the subject is human or non-human (for both transitive and intransitive verbs). 
2- Whether the object is human or non-human (for transitive verbs only).</Paragraph> <Paragraph position="1"> 3- The number of the subject (for intransitive verbs only).</Paragraph> <Paragraph position="2"> In Arabic, there is a set of object pronouns which refers to a non-human object, [Arabic pronouns not recoverable], and this will be denoted by -H. This is a subset of the complete set of pronouns +H, which denotes human and non-human. Below, we discuss the features for transitive and intransitive verbs separately: (a) Transitive Verbs: As shown in the table below, there can only be 4 combinations of the features +H and -H. Each of the feature sets in the table has been designated a class code. Only verbs with features corresponding to the feature sets B, C and D have been found in [text and table not recoverable] (b) Intransitive Verbs: It was found that the subject number is an additional distinguishing feature for intransitive verbs. Moreover, the subject number is significant only in the case of human subjects. For non-human subjects, this feature is not significant.</Paragraph> <Paragraph position="3"> Based upon the above observations, we will define the distinguishing features for intransitive verbs to be +H(s), +H(dp) and -H, where s denotes singular and dp denotes dual/plural. +H(s) and +H(dp) denote the sets of singular and dual/plural subjects, respectively. By definition +H(s) U +H(dp) -H, where U denotes the union of the two feature sets. The table below shows the possible combinations of these features; only features designated by A, E and F were found for</Paragraph> </Section> <Section position="2" start_page="416" end_page="416" type="sub_section"> <SectionTitle> 3.2 CO-OCCURRENCE PATTERNS FOR NOUNS </SectionTitle> <Paragraph position="0"> The same set of object pronouns for verbs denotes the possessive pronouns for nouns, with the exception of a slight difference in the form of the first person singular. 
The -H set is exactly the same.</Paragraph> <Paragraph position="1"> Three distinct classes of Matrix patterns (see level 3 of Fig(1)) have been observed for nouns: [...] inanimate (set -H) can be attached.</Paragraph> <Paragraph position="2"> An additional study was made to determine which Number/Gender (NG) combinations are valid for a particular noun stem. These have been found to be an important feature of Arabic nouns, as not all NG combinations are valid for a stem. Each stem needs to be examined separately and this information is stored in the stem lexicon. The NG combinations are represented at level 3 of the taxonomy for nouns (see Fig(1)).</Paragraph> <Paragraph position="3"> Although there is no systematic, theoretical method for determining all the NG combinations needed for comprehensive coverage of nouns, examining more and more nouns from Landau's <1> wordlist led to some form of convergence. For the 2,500-stem shortlist, there were only 17 NG combinations.</Paragraph> <Paragraph position="4"> This curious feature of Arabic nouns can be mainly attributed to the presence of words of foreign origin and to the pragmatics of the noun in question.</Paragraph> </Section> </Section> <Section position="5" start_page="416" end_page="416" type="metho"> <SectionTitle> 4. APPLICATIONS DEVELOPED </SectionTitle> <Paragraph position="0"> As a first application, an Arabic stem-based morphological analyser has been developed on an IBM PS/2 microcomputer. Morphological features of the word analysed are computed.</Paragraph> <Paragraph position="1"> As a by-product of the analyser, an Arabic spelling verifier has been developed, by including unification of the morphological and co-occurrence features of the morphemes.</Paragraph> <Paragraph position="2"> The system is currently being developed for use in interaction with an Arabic syntactic parser.</Paragraph> </Section> </Paper>