<?xml version="1.0" standalone="yes"?>
<Paper uid="P03-1051">
<Title>Language Model Based Arabic Word Segmentation</Title>
<Section position="2" start_page="0" end_page="1" type="intro">
<SectionTitle> 1 Introduction </SectionTitle>
<Paragraph position="0"> Morphologically rich languages like Arabic pose significant challenges for many natural language processing applications because a single word often conveys a complex meaning that decomposes into several morphemes (i.e., prefix, stem, suffix). Segmenting words into morphemes can improve the performance of natural language systems, including machine translation (Brown et al. 1993) and information retrieval (Franz and McCarley 2002). In this paper, we present a general word segmentation algorithm for inflectional morphology that segments a word into a prefix*-stem-suffix* sequence, using a small manually segmented corpus and a table of the prefixes and suffixes of the language. We do not address Arabic infix morphology, in which many stems correspond to the same root with different infix variations; we treat all the stems of a common root as separate atomic units. For the applications we consider, information retrieval and machine translation, a stem is better suited than a root as the morpheme (unit of meaning); for example, different stems of the same root translate into different English words. Examples of Arabic words and their segmentation into prefix*-stem-suffix* are given in Table 1, where '#' marks a morpheme as a prefix and '+' marks it as a suffix.</Paragraph>
<Paragraph position="1"> Arabic is presented in both native script and Buckwalter transliteration whenever possible. Native Arabic is to be read from right to left, and transliterated Arabic from left to right. As shown in Table 1, a word may include multiple prefixes (l: for, Al: the) or multiple suffixes (t: feminine singular, h: his). A word may also consist only of a stem, as in AlY (to/towards).</Paragraph>
<Paragraph position="2"> The algorithm implementation involves (i) language model training on a morpheme-segmented corpus, (ii) segmentation of input text into a sequence of morphemes using the language model parameters, and (iii) unsupervised acquisition of new stems from a large unsegmented corpus. The only linguistic resources required are a small manually segmented corpus of 20,000 to 100,000 words, a table of the prefixes and suffixes of the language, and a large unsegmented corpus.</Paragraph>
<Paragraph position="3"> In Section 2, we discuss related work. In Section 3, we describe the segmentation algorithm. In Section 4, we discuss the unsupervised algorithm for new stem acquisition. In Section 5, we present experimental results. In Section 6, we summarize the paper.</Paragraph>
</Section>
</Paper>