<?xml version="1.0" standalone="yes"?> <Paper uid="P03-1051"> <Title>Language Model Based Arabic Word Segmentation</Title> <Section position="7" start_page="7" end_page="7" type="evalu"> <SectionTitle> 5 Summary and Future Work </SectionTitle> <Paragraph position="0"> We have presented a robust word segmentation algorithm that segments a word into a prefix*-stem-suffix* sequence, along with experimental results. Our Arabic word segmentation system implementing the algorithm achieves around 97% segmentation accuracy on a development test corpus containing 28,449 word tokens. Since the algorithm can identify any number of prefixes and suffixes of a given token, it is generally applicable to various language families, including agglutinative languages (Korean, Turkish, Finnish), highly inflected languages (Russian, Czech), and Semitic languages (Arabic, Hebrew).</Paragraph> <Paragraph position="1"> Our future work includes (i) application of the current technique to other highly inflected languages, (ii) application of the unsupervised stem acquisition technique to an unsegmented Arabic corpus of about 1 billion words, and (iii) adoption of a novel morphological analysis technique to handle irregular morphology, as realized in Arabic broken plurals such as ktAb 'book' vs. ktb 'books'.</Paragraph> </Section> </Paper>