<?xml version="1.0" standalone="yes"?> <Paper uid="W06-0115"> <Title>The Third International Chinese Language Processing Bakeoff: Word Segmentation and Named Entity Recognition</Title> <Section position="3" start_page="0" end_page="0" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> Many important natural language processing tasks, ranging from part-of-speech tagging and parsing to reference resolution and machine translation, assume the ready availability of a tokenization into words. While such tokenization is relatively straightforward in languages that use whitespace to delimit words, Chinese presents a significant challenge since it is typically written without such separation. Word segmentation has thus long been the focus of significant research because of its role as a necessary pre-processing phase for the tasks above. However, word segmentation remains a significant challenge, both because the task itself is difficult and because segmentation standards vary and human segmenters often disagree.</Paragraph> <Paragraph position="1"> SIGHAN, the Special Interest Group for Chinese Language Processing of the Association for Computational Linguistics, conducted two prior word segmentation bakeoffs, in 2003 and 2005 (Emerson, 2005), which established benchmarks for word segmentation against which other systems are judged. The bakeoff presentations at SIGHAN workshops highlighted new approaches in the field as well as the crucial importance of handling out-of-vocabulary (OOV) words.</Paragraph> <Paragraph position="2"> A significant class of OOV words is Named Entities, such as person, location, and organization names. These terms are often poorly covered in lexical resources and change over time as new individuals, institutions, or products appear. They also play a particularly crucial role in information retrieval, reference resolution, and question answering. 
Given this importance, and the interest in expanding the scope of the bakeoff expressed at the Fourth SIGHAN Workshop, it was decided in the winter of 2005 to hold a new bakeoff to evaluate both continued progress in Word Segmentation (WS) and the state of the art in Chinese Named Entity Recognition (NER).</Paragraph> </Section> </Paper>