File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/relat/03/p03-1051_relat.xml
Size: 4,179 bytes
Last Modified: 2025-10-06 14:15:38
<?xml version="1.0" standalone="yes"?>
<Paper uid="P03-1051">
<Title>Language Model Based Arabic Word Segmentation</Title>
<Section position="3" start_page="1" end_page="2" type="relat">
<SectionTitle> 2 Related Work </SectionTitle>
<Paragraph position="0"> Our work adopts major components of the algorithm from (Luo &amp; Roukos 1996): language model (LM) parameter estimation from a segmented corpus and input segmentation on the basis of LM probabilities. However, our work diverges from theirs in two crucial respects: (i) a new technique for computing all possible segmentations of a word into prefix*-stem-suffix* for decoding, and (ii) an unsupervised algorithm for new stem acquisition based on a stem candidate's similarity to stems occurring in the training corpus.</Paragraph>
<Paragraph position="1"> (Darwish 2002) presents a supervised technique which identifies the root of an Arabic word by stripping away the prefix and the suffix of the word on the basis of a manually acquired dictionary of word-root pairs and the likelihood that a prefix and a suffix would occur with the template from which the root is derived. He reports 92.7% segmentation accuracy on a 9,606-word evaluation corpus.</Paragraph>
<Paragraph position="2"> His technique presupposes at most one prefix and one suffix per stem regardless of the actual number and meanings of prefixes/suffixes associated with the stem. (Beesley 1996) presents a finite-state morphological analyzer for Arabic, which displays the root, pattern, and prefixes/suffixes. The analyses are based on manually acquired lexicons and rules.</Paragraph>
<Paragraph position="3"> Although his analyzer is comprehensive in the types of knowledge it presents, it has been criticized for its extensive development time and lack of robustness, cf.
(Darwish 2002).</Paragraph>
<Paragraph position="4"> The convention of marking a prefix with '#' and a suffix with '+' will be adopted throughout the paper.</Paragraph>
<Paragraph position="5"> (Yarowsky and Wicentowski 2000) present a minimally supervised morphological analysis with a performance of over 99.2% accuracy for the 3,888 past-tense test cases in English. The core of the algorithm lies in the estimation of a probabilistic alignment between inflected forms and root forms. The probability estimation is based on lemma alignment by frequency ratio similarity among different inflectional forms derived from the same lemma, given a table of inflectional parts of speech, a list of the canonical suffixes for each part of speech, and a list of the candidate noun, verb and adjective roots of the language. Their algorithm does not handle multiple affixes per word.</Paragraph>
<Paragraph position="6"> (Goldsmith 2000) presents an unsupervised technique based on the expectation-maximization algorithm and minimum description length to segment exactly one suffix per word, resulting in an F-score of 81.8 for suffix identification in English according to (Schone and Jurafsky 2001). (Schone and Jurafsky 2001) propose an unsupervised algorithm capable of automatically inducing the morphology of inflectional languages using only text corpora. Their algorithm combines cues from orthography, semantics, and contextual information to induce morphological relationships in German, Dutch, and English, among others. They report F-scores between 85 and 93 for suffix analyses and between 78 and 85 for circumfix analyses in these languages. Although their algorithm captures prefix-suffix combinations or circumfixes, it does not handle the multiple affixes per word we observe in Arabic.</Paragraph>
<Section position="1" start_page="2" end_page="2" type="sub_section">
<SectionTitle> Words Prefixes Stems Suffixes </SectionTitle>
<Paragraph position="0"> Arabic Translit. Arabic Translit.
Arabic Translit. Arabic Translit.</Paragraph>
<Paragraph position="1"> [Arabic-script cells lost in extraction; transliterations only. Columns: Word | Prefixes | Stem | Suffixes]
AlwlAyAt | Al# | wlAy | +At
HyAth | - | HyA | +t +h
llHSwl | l# Al# | HSwl | -
AlY | - | AlY | -
Table 1: Segmentation of Arabic Words into Prefix*-Stem-Suffix*</Paragraph>
</Section>
</Section>
</Paper>
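The prefix*-stem-suffix* pattern named in the section and in Table 1 can be illustrated with a small enumeration sketch. This is not the authors' decoding technique, only a brute-force illustration under stated assumptions: the affix inventories `PREFIXES` and `SUFFIXES` are hypothetical toy sets in the table's transliteration, and the function names are invented for this example. In a language-model-based segmenter, each candidate produced this way would then be scored by the LM; that scoring step is not shown.

```python
from typing import Iterator, List, Set, Tuple

# Hypothetical toy affix inventories (assumption, not taken from the paper).
PREFIXES: Set[str] = {"Al", "w", "l", "b"}
SUFFIXES: Set[str] = {"At", "t", "h"}

# A segmentation is (prefixes, stem, suffixes).
Segmentation = Tuple[Tuple[str, ...], str, Tuple[str, ...]]


def affix_sequences(text: str, inventory: Set[str]) -> Iterator[Tuple[str, ...]]:
    """Yield every way to split `text` entirely into items of `inventory`.

    The empty string yields exactly one result: the empty sequence.
    """
    if not text:
        yield ()
        return
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in inventory:
            for rest in affix_sequences(text[end:], inventory):
                yield (head,) + rest


def segmentations(word: str,
                  prefixes: Set[str] = PREFIXES,
                  suffixes: Set[str] = SUFFIXES) -> List[Segmentation]:
    """Enumerate all prefix*-stem-suffix* analyses with a non-empty stem."""
    results: List[Segmentation] = []
    n = len(word)
    for i in range(n):                 # stem starts at i
        for j in range(i + 1, n + 1):  # stem ends before j
            for pre in affix_sequences(word[:i], prefixes):
                for suf in affix_sequences(word[j:], suffixes):
                    results.append((pre, word[i:j], suf))
    return results


def render(seg: Segmentation) -> str:
    """Format with the '#' prefix marker and '+' suffix marker."""
    pre, stem, suf = seg
    return " ".join([p + "#" for p in pre] + [stem] + ["+" + s for s in suf])


if __name__ == "__main__":
    for w in ("AlwlAyAt", "HyAth", "AlY"):
        print(w, "->", [render(s) for s in segmentations(w)])
```

With these toy inventories, the candidates for `AlwlAyAt` include `Al# wlAy +At` and those for `HyAth` include `HyA +t +h`, matching the corresponding rows of Table 1. A word such as `llHSwl`, whose analysis `l# Al# HSwl` involves a spelling change, is beyond this purely string-based sketch.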