File Information

File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/evalu/03/p03-1051_evalu.xml

Size: 1,377 bytes

Last Modified: 2025-10-06 13:59:00

<?xml version="1.0" standalone="yes"?>
<Paper uid="P03-1051">
  <Title>Language Model Based Arabic Word Segmentation</Title>
  <Section position="7" start_page="7" end_page="7" type="evalu">
    <SectionTitle>
5 Summary and Future Work
</SectionTitle>
    <Paragraph position="0"> We have presented a robust word segmentation algorithm that segments a word into a prefix*-stem-suffix* sequence, along with experimental results. Our Arabic word segmentation system, which implements the algorithm, achieves around 97% segmentation accuracy on a development test corpus of 28,449 word tokens. Because the algorithm can identify any number of prefixes and suffixes in a given token, it is generally applicable to a range of language families, including agglutinative languages (Korean, Turkish, Finnish), highly inflected languages (Russian, Czech), and Semitic languages (Arabic, Hebrew).</Paragraph>
    <Paragraph position="1"> Our future work includes (i) applying the current technique to other highly inflected languages, (ii) applying the unsupervised stem acquisition technique to an unsegmented Arabic corpus of about 1 billion words, and (iii) adopting a novel morphological analysis technique to handle irregular morphology, as realized in Arabic broken plurals such as كتاب (ktAb) 'book' vs. كتب (ktb) 'books'.</Paragraph>
  </Section>
</Paper>