<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-0136">
  <Title>Sydney, July 2006. ©2006 Association for Computational Linguistics. N-gram Based Two-Step Algorithm for Word Segmentation</Title>
  <Section position="2" start_page="0" end_page="0" type="abstr">
    <SectionTitle>
Abstract
</SectionTitle>
    <Paragraph position="0"> This paper describes an n-gram based reinforcement approach to the closed track of word segmentation in the third Chinese word segmentation bakeoff.</Paragraph>
    <Paragraph position="1"> Character n-gram features (unigrams, bigrams, and trigrams) are extracted from the training corpus and their frequencies are counted. We investigate a step-by-step methodology based on these n-gram statistics. In the first step, relatively definite segmentations are fixed using a tight threshold value. In the second step, the remaining space tags are decided by considering the left or right space tags that were already fixed in the first step. Both the definite and the loose segmentation steps rely simply on bigram and trigram statistics. To overcome the data sparseness problem of the bigram data, unigram statistics are used for smoothing.</Paragraph>
  </Section>
</Paper>
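
The two-step procedure outlined in the abstract can be illustrated with a minimal Python sketch. The abstract gives no formulas, thresholds, or feature definitions, so everything below is an assumption made for illustration and not the authors' implementation: the function names (train, ratio, boundary_prob, segment), the threshold values 0.9, 0.1, and 0.5, the interpolation weight 0.8, and the use of a bigram count conditioned on the left space tag as a simple stand-in for the trigram statistics.

```python
from collections import Counter


def train(sentences):
    """Count boundary statistics from segmented sentences (lists of non-empty words)."""
    names = ("pair", "left", "right", "ctx")
    stats = {n: Counter() for n in names}
    stats.update({n + "_split": Counter() for n in names})
    for words in sentences:
        chars = "".join(words)
        gold = []  # gold[i] is 1 when a word boundary follows chars[i]
        for w in words:
            gold.extend([0] * (len(w) - 1) + [1])
        for i in range(len(chars) - 1):
            a, b, tag = chars[i], chars[i + 1], gold[i]
            prev_tag = gold[i - 1] if i > 0 else 1  # sentence start counts as a boundary
            keys = (("pair", (a, b)), ("left", a), ("right", b), ("ctx", (prev_tag, a, b)))
            for name, key in keys:
                stats[name][key] += 1
                stats[name + "_split"][key] += tag
    return stats


def ratio(stats, name, key):
    """Relative frequency of a boundary for this key, or None if the key is unseen."""
    total = stats[name][key]
    return stats[name + "_split"][key] / total if total else None


def boundary_prob(stats, a, b, lam=0.8):
    """Bigram boundary estimate interpolated with unigram estimates (assumed smoothing)."""
    uni = [p for p in (ratio(stats, "left", a), ratio(stats, "right", b)) if p is not None]
    backoff = sum(uni) / len(uni) if uni else 0.5
    pair = ratio(stats, "pair", (a, b))
    return backoff if pair is None else lam * pair + (1 - lam) * backoff


def segment(text, stats, hi=0.9, lo=0.1, loose=0.5):
    """Two-step segmentation: fix confident gaps first, then decide the rest."""
    if not text:
        return []
    n = len(text) - 1
    probs = [boundary_prob(stats, text[i], text[i + 1]) for i in range(n)]
    # Step 1: fix only the gaps that pass the tight thresholds; leave the rest open.
    tags = [1 if p >= hi else 0 if p <= lo else None for p in probs]
    # Step 2: decide open gaps left to right with a loose threshold, conditioning
    # on the space tag to the left, which is always decided by this point.
    for i in range(n):
        if tags[i] is None:
            prev_tag = tags[i - 1] if i > 0 else 1
            ctx = ratio(stats, "ctx", (prev_tag, text[i], text[i + 1]))
            p = probs[i] if ctx is None else ctx
            tags[i] = 1 if p >= loose else 0
    # Rebuild the word list from the space tags.
    words, start = [], 0
    for i, t in enumerate(tags):
        if t:
            words.append(text[start:i + 1])
            start = i + 1
    words.append(text[start:])
    return words
```

In this sketch, a character pair never seen in training falls back entirely on the unigram estimates, which is one simple way to read the abstract's remark about smoothing sparse bigram data. The actual scoring function and thresholds used in the bakeoff system may differ.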