<?xml version="1.0" standalone="yes"?> <Paper uid="W04-1119"> <Title>A Semi-Supervised Approach to Build Annotated Corpus for Chinese Named Entity Recognition</Title> <Section position="3" start_page="1" end_page="1" type="intro"> <SectionTitle> 2 Related work </SectionTitle> <Paragraph position="0"> Traditional statistical approaches use a parametric model with maximum likelihood estimation (MLE), usually combined with smoothing methods to deal with data sparseness. These approaches have been applied to the task of Chinese word segmentation. Depending on whether the training data are word-segmented, Chinese word segmentation can be performed in a supervised or an unsupervised manner.</Paragraph> <Paragraph position="1"> As an example of unsupervised training, Ge et al.</Paragraph> <Paragraph position="2"> (1999) present a simple zeroth-order Markov model of the words in Chinese text. They developed an efficient algorithm to train their model on an unsegmented corpus. Their basic assumption is that Chinese words are usually 1 to 4 characters long. However, they did not take into account the large number of named entities (e.g., Chinese organization names, transliterated names, and some person names), most of which are longer than 4 characters (e.g., Wei Ruan Ya Zhou Yan Jiu Yuan Microsoft Research Asia, Jia Li Fu Ni Ya California, Chen Ou Yang Xiao Tong a woman's name which puts her husband's surname ahead).</Paragraph> <Paragraph position="3"> An and Wong used Hidden Markov Models (HMMs) for segmentation. Their system is trained solely on a corpus that has been manually annotated with word boundaries and part-of-speech tags. Wu (2003) also used the training data to tune the segmentation parameters of their MSR-NLP Chinese system. 
He used the annotated training data to handle morphologically derived words.</Paragraph> <Paragraph position="4"> In this paper we present a semi-supervised training method that uses both an auto-segmented training corpus and a small hand-annotated subset of it. Compared with unsupervised approaches, ours yields a better segmenter that can identify many more named entities that are not in the dictionary. Compared with supervised approaches, our method requires much less human effort for data annotation.</Paragraph> <Paragraph position="5"> The Chinese word segmenter used in this study is described in Gao et al. (2003). The segmenter provides a unified approach to word segmentation and named entity (NE) recognition. This unified approach is based on improved source-channel models of Chinese sentence generation, with two components: a source model and a set of channel models. For each word class (e.g., a person name), there is a channel model (referred to hereafter as the class model) that estimates the generative probability of a character string given the word class. The source model estimates the generative probability of a word sequence, in which each word belongs to one word class (e.g., a word in the lexicon or a named entity). In other words, it indicates how likely a word is to occur in a given context, so the source model is also referred to hereafter as the context model. This paper focuses on how to create an annotated corpus for context model estimation.</Paragraph> </Section> </Paper>
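The source-channel decomposition described above can be sketched as a toy dynamic program. This is only an illustration, not the authors' implementation: the lexicon, its probabilities, and the unigram approximation of the source (context) model are invented for this example, and the channel (class) model is taken to be deterministic (probability 1) for in-lexicon words, so the score reduces to the source model alone.

```python
import math

# Hypothetical toy lexicon with unigram source-model probabilities.
# Gao et al. (2003) use a richer word-class n-gram context model;
# a unigram over "words" stands in for it here.
SOURCE_MODEL = {"ab": 0.4, "a": 0.3, "b": 0.2, "c": 0.1}

def segment(sentence, max_len=4):
    """Best segmentation under P(W) * P(S|W) via dynamic programming.

    With a deterministic channel model for lexicon words, P(S|W) = 1,
    so we maximize the (log) source-model probability of the word sequence.
    """
    n = len(sentence)
    # best[i] = (best log-probability of sentence[:i], backpointer j)
    best = [(-math.inf, None)] * (n + 1)
    best[0] = (0.0, None)
    for i in range(1, n + 1):
        for j in range(max(0, i - max_len), i):
            piece = sentence[j:i]
            if piece in SOURCE_MODEL and best[j][0] > -math.inf:
                score = best[j][0] + math.log(SOURCE_MODEL[piece])
                if score > best[i][0]:
                    best[i] = (score, j)
    if best[n][0] == -math.inf:
        return None  # no segmentation covers the whole input
    # Backtrace the best word sequence.
    words, i = [], n
    while i > 0:
        j = best[i][1]
        words.append(sentence[j:i])
        i = j
    return list(reversed(words))
```

For instance, `segment("abc")` prefers the two-word analysis `["ab", "c"]` (score 0.4 * 0.1) over the three-word one (0.3 * 0.2 * 0.1), mirroring how the context model rewards likely word sequences.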