File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/abstr/98/p98-2206_abstr.xml
Size: 1,142 bytes
Last Modified: 2025-10-06 13:49:28
<?xml version="1.0" standalone="yes"?> <Paper uid="P98-2206"> <Title>Chinese Word Segmentation without Using Lexicon and Hand-crafted Training Data</Title> <Section position="2" start_page="0" end_page="0" type="abstr"> <SectionTitle> Abstract </SectionTitle> <Paragraph position="0"> Chinese word segmentation is the first step in any Chinese NLP system. This paper presents a new algorithm for segmenting Chinese texts without using any lexicon or hand-crafted linguistic resource. The statistical data required by the algorithm, namely the mutual information and the difference of t-score between characters, is derived automatically from raw Chinese corpora.</Paragraph> <Paragraph position="1"> A preliminary experiment shows that the segmentation accuracy of our algorithm is acceptable. We hope that the insights gained from this approach will help improve the performance of existing segmenters (especially their ability to cope with unknown words and to adapt to various domains), though the algorithm itself can also be used as a stand-alone segmenter in some NLP applications.</Paragraph> </Section> </Paper>