File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/abstr/04/w04-1119_abstr.xml
Size: 1,165 bytes
Last Modified: 2025-10-06 13:43:55
<?xml version="1.0" standalone="yes"?> <Paper uid="W04-1119"> <Title>A Semi-Supervised Approach to Build Annotated Corpus for Chinese Named Entity Recognition</Title> <Section position="1" start_page="0" end_page="1" type="abstr"> <SectionTitle> Abstract </SectionTitle> <Paragraph position="0"> This paper presents a semi-supervised approach to reduce human effort in building an annotated Chinese corpus. A disadvantage of many statistical Chinese named entity recognition systems is that training data may be in short supply, and manually building an annotated corpus is expensive. In the proposed approach, we construct an 80M hand-annotated corpus in three steps: (1) automatically annotate the training corpus; (2) manually refine small subsets of the automatically annotated corpus; (3) combine the small subsets and the whole corpus in a bootstrapping process. Our approach is tested on a state-of-the-art Chinese word segmentation system (Gao et al., 2003, 2004). Experiments show that only a small subset of the hand-annotated corpus is sufficient to achieve satisfactory performance of the named entity component in this system.</Paragraph> </Section> </Paper>