<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-1122">
  <Title>An Integrated Method for Chinese Unknown Word Extraction</Title>
  <Section position="2" start_page="1" end_page="1" type="abstr">
    <SectionTitle>
Abstract
</SectionTitle>
    <Paragraph position="0"> Unknown word recognition is an important problem in Chinese word segmentation systems.</Paragraph>
    <Paragraph position="1"> In this paper, we propose an integrated method for Chinese unknown word extraction in off-line corpus processing, in which both context entropy (on each side of a candidate) and the frequency ratio against a background corpus are used to evaluate candidate words. Both measures are computed efficiently on a suffix array with much less space overhead. The method can also be reinforced by combining it with a basic segmenter through boundary verification, and it can extract words of arbitrary n-gram length. We test the method on the Chinese novel Xiao Ao Jiang Hu and obtain satisfactory results compared with traditional criteria such as the likelihood ratio.</Paragraph>
  </Section>
</Paper>