<?xml version="1.0" standalone="yes"?>
<Paper uid="P00-1026">
  <Title>A Morphologically Sensitive Clustering Algorithm for Identifying Arabic Roots</Title>
  <Section position="1" start_page="0" end_page="0" type="metho">
    <SectionTitle>
Waleed AL-FARES
</SectionTitle>
    <Paragraph position="0"/>
  </Section>
  <Section position="2" start_page="0" end_page="0" type="metho">
    <SectionTitle>
Abstract
</SectionTitle>
    <Paragraph position="0"> We present a clustering algorithm for Arabic words sharing the same root. Root based clusters can substitute dictionaries in indexing for IR. Modifying Adamson and Boreham (1974), our Two-stage algorithm applies light stemming before calculating word pair similarity coefficients using techniques sensitive to Arabic morphology.</Paragraph>
    <Paragraph position="1"> Tests show a successful treatment of infixes and accurate clustering to up to 94.06% for unedited Arabic text samples, without the use of dictionaries.</Paragraph>
    <Paragraph position="2"> Introduction Canonisation of wor ds for indexing is an important and difficult problem for Arabic IR. Arabic is a highly inflectional language with 85% of words derived from tri-lateral roots (Al-Fedaghi and Al-Anzi 1989). Stems are derived from roots through the application of a set of fixed patterns. Addition of affixes to stems yields words. Words sharing a root are semantically related and root indexing is reported to outperform stem and word indexing on both recall and precision (Hmeidi et al 1997). However, Arabic morphology is excr uciatingly complex (the Appendix attempts a brief introduction), and root identification on a scale useful for IR remains problematic.</Paragraph>
    <Paragraph position="3"> Research on Arabic IR tends to treat automatic indexing and stemming separately. Al-Shalabi and Evans (1998) and El-Sadany and Hashish (1989) developed stemming algorithms. Hmeidi et al (1997) developed an information retrieval system with an index, but does not explain the underlying stemming algorithm. In Al-Kharashi and Evans (1994), stemming is done manually and the IR index is built by manual insertion of roots, stems and words.</Paragraph>
    <Paragraph position="4"> Typically, Arabic stemming algorithms operate by &amp;quot;trial and error&amp;quot;. Affixes are stripped away, and stems &amp;quot;undone&amp;quot;, according to patterns and rules, and with reference to dictionaries. Root candidates are checked against a root lexicon. If no match is found, affixes and patterns are readjusted and the new candidate is checked. The process is repeated until a root is found.</Paragraph>
    <Paragraph position="5"> Morpho-syntactic parsers offer a possible alternative to stemming algorithms. Al-Shalabi and Evans (1994), and Ubu-Salem et al (1999) develop independent analysers. Some work builds on established formalisms such a DATR (Al-Najem 1998), or KIMMO. This latter strand produced extensive deep analyses. Kiraz (1994) extended the architecture with multi-level tape, to deal with the typical interruption of root letter sequences caused by broken plural and weak root letter change. Beesley (1996) describes the re-implementation of earlier work as a single finite state transducer between surface and lexical (root and tag) strings. This was refined (Beesley 1998) to the current on-line system capable of analysing over 70 million words.</Paragraph>
    <Paragraph position="6"> So far, these approaches have limited scope for deployment in IR. Even if substantial, their morpho-syntactic coverage remains limited and processing efficiency implications are often unclear. In addition, modern written Arabic presents a unique range of orthographic problems. Short vowels are not normally written (but may be). Different regional spelling conventions may appear together in a single text and show interference with spelling errors.</Paragraph>
    <Paragraph position="7"> These systems, however, assume text to be in perfect (some even vowelised) form, forcing the need for editing prior to processing. Finally, the success of these algorithms depends critically on root, stem, pattern or affix dictionary quality, and no sizeable and reliable electronic dictionaries exist. Beesley (1998) is the exception with a reported 4930 roots encoded with associated patterns, and an additional affix and non-root stem lexicon 1. Absence of large and reliable electronic lexical resources means dictionaries would have to be updated as new words appear in the text, creating a maintenance overhead. Overall, it remains uncertain whether these approaches can be deployed and scaled up cost-effectively to provide the coverage required for full scale IR on unsanitised text.</Paragraph>
    <Paragraph position="8"> Our objective is to circumvent morpho-syntactic analysis of Arabic words, by using clustering as a technique for grouping words sharing a root. In practise, since Arabic words derived from the same root are semantically related, root based clusters can substitute root dictionaries for indexing in IR and furnish alternative search terms. Clustering works without dictionaries, and the approach removes dictionary overheads completely. Clusters can be implemented as a dimension of the index, growing dynamically with text, and without specific maintenance. They will accommodate effortlessly a mixture of regional spelling conventions and even some spelling errors.</Paragraph>
  </Section>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
1 Clustering and Arabic
</SectionTitle>
    <Paragraph position="0"> To our knowledge, there is no application of automatic root-based clustering to Arabic, using morphological similarity without dictionary.</Paragraph>
    <Paragraph position="1"> Clustering and stemming algorithms have mainly been developed for Western European languages, and typically rely on simple heuristic rules to strip affixes and conflate strings. For instance, Porter (1980) and Lovins (1968) confine stemming to suffix removal, yet yield acceptable results for English, where roots are relatively inert. Such approaches exploit the morphological frugality of some languages, but do not transfer to heavily inflected languages such as Arabic.</Paragraph>
    <Paragraph position="2"> In contrast, Adamson and Boreham (1974) developed a technique to calculate a similarity co-efficient between words as a factor of the number of shared sub-strings. The approach (which we will call Adamson's algorithm for short) is a promising starting point for Arabic</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
1 Al-Fedaghi and Al-Anzi (1989) estimate there are
</SectionTitle>
    <Paragraph position="0"> around 10,000 independent roots.</Paragraph>
    <Paragraph position="1"> clustering because affix removal is not critical to gauging morphological relatedness.</Paragraph>
    <Paragraph position="2"> In this paper, we explain the algori thm, apply it to raw modern Arabic text and evaluate the result. We explain our Two-stage algorithm, which extends the technique by (a) light stemming and (b) refinements sensitive to Arabic morphology. We show how the adaptation increased successful clustering of both the original and new evaluation data.</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Data Description
</SectionTitle>
    <Paragraph position="0"> We focus on IR, so experiments use modern, unedited Arabic text, with unmarked short vowels (Stalls and Knight 1998). In all we constructed five data sets. The first set is controlled, and was designed for testing on a broad spectrum of morphological variation. It contains selected roots with derived words chosen for their problematic structure, featuring infixes, root consonant changes and weak letters. It also includes superficially similar words belonging to different roots, and examples of hamza as a root consonant, an affix and a silent sign. Table 1 gives details.</Paragraph>
    <Paragraph position="1">  Data sets two to four contain articles extracted from Al-Raya (1997), and the fifth from Al-Watan (2000), both newspapers from Qatar.</Paragraph>
    <Paragraph position="2"> Following Adamson, function words have been removed. The sets have domain bias with the second (575 words) and the fourth (232 words) drawn randomly from the economics and the third (750 words) from the sports section. The fifth (314 words) is a commentary on political history. Sets one to three were used to varying extents in refining our Two-stage algorithm. Sets four and five were used for evaluation only.</Paragraph>
    <Paragraph position="3"> Electronically readable Arabic text has only recently become available on a useful scale, hence our experiments were run on short texts.</Paragraph>
    <Paragraph position="4"> On the other hand, the coverage of the data sets allows us to verify our experiments on demanding samples, and their size lets us verify correct clustering manually.</Paragraph>
    <Paragraph position="5">  3. Testing Adamson's Algorithm</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 The Algorithm
</SectionTitle>
      <Paragraph position="0"> Adamson and Boreham (1974) developed a technique expressing relatedness of strings as a factor of shared sub-strings. The algorithm drags an n -sized window across two strings, with a 1 character overlap, and removes duplicates. The strings' similarity co-efficient (SC) is calculated by Dice's equation: SC (Dice) = 2*(number of shared unique n-grams)/(sum of unique n-grams</Paragraph>
      <Paragraph position="2"> String 2-grams Unique 2-grams phosphorus ph ho os sp ph ho or ru us ph ho os sp or ru</Paragraph>
      <Paragraph position="4"> After the SC for all word pairs is known, the single link clustering algorithm is applied. A similarity (or dissimilarity) threshold is set. The SC of pairs is collected in a matrix. The threshold is applied to each pair's SC to yield clusters. A cluster absorbs a word as long as its SC to another cluster item exceeds the threshold (van Rijsbergen 1979). Similarity to a single item is sufficient. Cluster size is not pre-set.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Background Assumptions
</SectionTitle>
      <Paragraph position="0"> This experiment tests Adamson's algorithm on Arabic data to assess its ability to cluster words sharing a root. Each of the data sets was clustered manually to provide an ideal benchmark. This task was executed by a native Arabic speaker with reference to dictionaries.</Paragraph>
      <Paragraph position="1"> Since we are working with very small texts, we sought to remove the effects of sampling in the tests. To assess Adamson's algorithm's potential for clustering Arabic words, we preferred to compare instances of optimal performance. We varied the SC to yield, for each data set, the highest number of correct multi-word clusters.</Paragraph>
      <Paragraph position="2"> Note that the higher the SC cut-off, the less likely that words will cluster together, and the more single word clusters will appear. This has the effect of growing the number of correct clusters because the proportion of correct single word clusters will increase. As a consequence, for our purposes, the number of correct multi-word clusters (and not just correct clusters) are an important indicator of success.</Paragraph>
      <Paragraph position="3"> A correct multi-word cluster covers at least two words and is found in the manual benchmark. It contains all and only those words in the data set which share a root. Comparison with a manual benchmark inevitably introduces a subjective element. Also, our evaluation measure is the percentage of correct benchmark clusters retrieved. This is a &amp;quot;recall&amp;quot; type indicator. Together with the strict definition of correct cluster, it cannot measure cluster quality. Finer grained evaluation of cluster quality would be needed in an IR context.</Paragraph>
      <Paragraph position="4"> However, our main concern is comparing algorithms. The current metrics aim for a conservative gauge of how Adamson's algorithm can yield more exact clusters from a full range of problematic data.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
Footnote 2
</SectionTitle>
    <Paragraph position="0"> Ranges rather than specific values are given where cut-offs between the lower and higher value do not alter cluster distribution.</Paragraph>
    <Paragraph position="1"> Our interpretation of correct clustering is stringent and therefore conservative, adding to the significance of our results. Cluster quality will be reviewed informally.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.3 Adamson's Arabic Test Results
</SectionTitle>
      <Paragraph position="0"> Table 3 shows results for Adamson's algorithm. The figures for the first data set have to be suitably interpreted. The set deliberately did not include single word clusters.</Paragraph>
      <Paragraph position="1"> The results suggest that the algorithm is very successful at identifying single word clusters but performs poorly on multi-word clusters. The high success rate for single word clusters is partly due to the high SC cut-off, set to yield as many correct multi-word clusters as possible.</Paragraph>
      <Paragraph position="2"> In terms of quality, however, only a small proportion of multi-word clusters were found to contain infix derivations (11.11%, 4.76%, 0.0% 4.35% and 9.09% for each data set respectively), as opposed to other variations. In other words, strings sharing character sequences in middle position cluster together more successfully. Infix recognition is a weak point in this approach.</Paragraph>
      <Paragraph position="3"> Whereas the algorithm is successful for English, it is no surprise that it should not perform equally well on Arabic. Arabic words tend to be short and the chance of words derived from different roots sharing a significant proportion of characters is high (eg K h br ( news ) vs K h bz ( bread )). Dice's equation assumes the ability to identify an uninterrupted sequence of root consonants. The heavy use of infixes runs against this. Similarly, affixes cause interference (see 4.1.1).</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 The Two-Stage Algorithm
</SectionTitle>
    <Paragraph position="0"> The challenge of root based clustering for Arabic lies in designing an algorithm which will give relevance to root consonants only. Using Adamson's algorithm as a starting point, we devised a solution by introducing and testing a number of successive refinements based on the morphological knowledge and the first three data sets. The rationale motivating these refinements is given below.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 Refinements
</SectionTitle>
      <Paragraph position="0"> The high incidence of affixes keeps accurate cluster formation low, because it increases the SC among words derived from different roots, and lowers the SC between derivations of the same root using different affixes, as illustrated in tables 4 and 5. Following Popovic and Willet (1992), we introduced stemming to minimise the effect of affixes. We found empirically that light stemming, removing a small number of obvious affixes, gave better results than heavy stemming aimed at full affix stripping. Heavy stemming brought the risk of root consonant loss (eg t'amyn ( insurance ) from root amn ( sheltered ): heavy stemming: t'am, light stemming: t'amn).</Paragraph>
      <Paragraph position="1"> Light stemming, on the other hand, does little more than reducing word size to 3 or 4 characters.</Paragraph>
      <Paragraph position="2">  Weak letters ( alif, waw, ya ) occur freely as root consonants as well as affixes. Under derivation, their form and location may change, or they may disappear. As infixes, they interfere with SC, causing failure to cluster (table 6).</Paragraph>
      <Paragraph position="3"> Their effects were reduced by a method we refer to as &amp;quot;cross&amp;quot;. It adds a bi-gram combining the letters occurring before and after the weak letter.  String Unique 2-grams with affixes Unique 2-grams without affixes</Paragraph>
      <Paragraph position="5"/>
      <Paragraph position="7"> weighting: Our objective is to define an algorithm which gives suitable precedence to root consonants. Light stemming, however does not remove all affixes. Whereas fool proof affix detection is problematic due to the overlap between affix and root consonants, affixes belong to a closed class and it is possible to identify &amp;quot;suspect&amp;quot; letters which might be part of an affix.</Paragraph>
      <Paragraph position="8"> Following Harman (1991) we explored the idea of assigning differential weights to substrings. Giving equal weight of 1 to all substrings equates the evidence contributed by all letters, whether they are root consonants or not. Suspected affixes, however, should not be allowed to affect the SC between words on a par with characters contributing stronger evidence.</Paragraph>
      <Paragraph position="9"> We conducted a series of experiments with differential weightings, and determined empirically that 0.25 weight for strings containing weak letters, and 0.50 for strings containing suspected non-weak letter affixes gave the best SC for the first three data sets.</Paragraph>
      <Paragraph position="10">  N-gram size can curtail the significance of word boundary letters (Robertson and Willet 1992). To give them opportunity to contribute fully to the SC, we introduced word boundary blanks (Harman 1991).</Paragraph>
      <Paragraph position="11"> Also, the larger the n-gram, the greater its capacity to mask the shorter substring which can contain important evidence of similarity between word pairs (Adamson and Boreham 1974). Of equal importance is the size of the sliding overlap between successive n-grams (Adams 1991).</Paragraph>
      <Paragraph position="13"> The problem is to find the best setting for n-gram and overlap size to suit the language. We sought to determine settings experimentally. Bi-grams with single character overlap and blank insertion (* in the examples) at word boundaries raised the SC for words sharing a root in our three data sets, and lowered the SC for words belonging to different roots.</Paragraph>
      <Paragraph position="14">  Dice's equation boosts the importance of unique shared substrings between word pairs, by doubling their evidence. As we argued earlier, since Arabic words tend to be short, the relative impact of shared substrings will already be dramatic. We replaced the Dice metric with the Jaccard formula below to reduce this effect (see van Rijsbergen 1979). SC (Jac) = shared unique n-grams/(sum of unique n-grams in each string shared unique n-grams)</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.2 The Two-stage Algorithm
</SectionTitle>
      <Paragraph position="0"> The Two-stage algorithm is fully implemented.</Paragraph>
      <Paragraph position="1"> Words are first submitted to light stemming to remove obvious affixes. The second stage is based on Adamson's algorithm, modified as described above. From the original, we retained bi-grams with a one character overlap, but inserted word boundary blanks. Unique bi-grams are isolated and cross is implemented. Each bi-gram is assigned a weight (0.25 for bi-grams containing weak letters; 0.5 for bi-grams containing potential non-weak letter affixes; 1 for all other bi-grams). Jaccard's equation computes a SC for each pair of words. We retained the single-link clustering algorithm to ensure comparability.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.3 Testing the Two-stage Algorithm
</SectionTitle>
      <Paragraph position="0"> Table 8 shows the results of the Two-stage algorithm for our data sets. The maximally effective cut of point for all sets lies closer.</Paragraph>
      <Paragraph position="1"> Figures for the first set have to be treated with caution. The perfect clustering is explained by the text's perfect spelling and by the sample containing exactly those problematic phenomena on which we wanted to concentrate.</Paragraph>
      <Paragraph position="2">  The algor ithm deals with weak letter mutation, and infix appearance and disappearance in words sharing a root (eg the root qwm and its derived words, especially the role of Hamza as an infix in one of its variations). Even though the second and third data sets informed the modifications to a limited extent, their results show that the improvements stood up to free text. For the second data set, the Two-stage algorithm showed 31.5% improvement over Adamson's algorithm. Importantly, it discovered 84.13% of the multi-word clusters containing words with infixes, an improvement of 79.37%.</Paragraph>
      <Paragraph position="3"> The values for single word clustering are close and the modifications preserved the strength of Adamson's algorithm in keeping single word clusters from mixing, because we were able to maintain a high SC threshold.</Paragraph>
      <Paragraph position="4"> On the third data set, the Two-stage algorithm showed an 26.11% overall improvement, with 84% successful multi-word clustering of words with infixes (compare 0% for Adamson). The largest cluster contained 14 words. 10 clusters counted as unsuccessful because they contained one superficially similar variation belonging to a different root (eg TwL ( lengthened ) and bTL ( to be abolished )). If we allow this error margin, the success rate of multi-word clustering rises to 90%. Since our SC cut-off was significantly lower than in Adamson's base line experiment, we obtained weaker results for single word clustering.</Paragraph>
      <Paragraph position="5"> The fourth and fifth data sets played no role in the development of our algorithm and were used for evaluation purposes only. The Two-stage algorithm showed an 23.18% overall improvement in set four. It successfully built all clusters containing words with infixes (100% compare with 4.35% for Adamson's algorithm), an improvement of 95.65%. The two-stage algorithm again preserved the strength of Adamson at distinguishing single word clusters, in spite of a lower SC cut-off.</Paragraph>
      <Paragraph position="6"> The results for the fifth data set are particularly important because the text was drawn from a different source and domain. Again, significant improvements in multi and single word clustering are visible, with a slightly higher SC cut-off. The algorithm performed markedly better at identifying multi-word clusters with infixes (72.72% - compare with 9.09% for Adamson).</Paragraph>
      <Paragraph position="7"> The results suggest that the Two- stage algorithm preserves the strengths of Adamson and Boreham (1994), whilst adding a marked advantage in recognising infixes. The outcome of the evaluation on fourth and fifth data sets are very encouraging and though the samples are small, they give a strong indication that this kind of approach may transfer well to text from different domains on a larger scale.</Paragraph>
    </Section>
  </Section>
  <Section position="8" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 Two-stage Algorithm Limitations
</SectionTitle>
    <Paragraph position="0"> Weak letters can be root consonants, but our differential weighting technique prevents them from contributing strong evidence, whereas non-weak letters featuring in affixes, are allowed to contribute full weight. Modifying this arrangement would interfere with successful clustering (eg after light stemming: t is a root consonant in ntj ( produced ) and an infix in Ltqy (from root Lqy - encountered ). These limitations are a result of light stemming.</Paragraph>
    <Paragraph position="1"> Although the current results are promising, evaluation was hampered by the lack of a sizeable data set to verify whether our solution would scale up.</Paragraph>
    <Paragraph position="2"> Conclusion We ha ve developed, successfully, an automatic classification algorithm for Arabic words which share the same root, based only on their morphological similarities. Our approach works on unsanitised text. Our experiments show that algorithms designed for relatively uninflected languages can be adapted for highly inflected languages, by using morphological knowledge.</Paragraph>
    <Paragraph position="3"> We found that the Two-stage algorithm gave a significant improvement over Adamson's algorithm for our data sets. It dealt successfully with infixes in multi-word clustering, an area where Adamson's algorithm failed. It matched the strength of Adamson in identifying single word clusters, and sometimes did better. Weak letters and the overlap between root and affix consonants continue to cause interference.</Paragraph>
    <Paragraph position="4"> Nonetheless, the results are promising and suggest that the approach may scale up Future work will concentrate on two issues.</Paragraph>
    <Paragraph position="5"> The light stemming algorithm and the differential weighting may be modified to improve the identification of affixes. The extent to which the algorithm can be scaled up must be tested on a large corpus.</Paragraph>
  </Section>
class="xml-element"></Paper>
Download Original XML