<?xml version="1.0" standalone="yes"?> <Paper uid="W06-0125"> <Title>on a context-dependent Mutual Information Independence Model</Title> <Section position="3" start_page="0" end_page="154" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> Word segmentation and named entity recognition aim to recognize, respectively, the implicit word boundaries and the proper nouns (such as names of persons, locations and organizations) in plain Chinese text. Both are critical in Chinese information processing. However, two problems arise when developing a practical word segmentation or named entity recognition system for large open applications: the resolution of ambiguous segmentations and the identification of out-of-vocabulary (OOV) words or OOV entity names.</Paragraph> <Paragraph position="1"> To resolve these problems, we developed a purely statistical Chinese word segmentation system and a named entity recognition system using a three-stage strategy under a unified framework.</Paragraph> <Paragraph position="2"> The first stage is called known word segmentation, which aims to segment an input sequence of Chinese characters into a sequence of known words (called word atoms in this paper). In this paper, all Chinese characters are regarded as known words, and a word unigram model is applied to perform this task for efficiency. Also, for convenience, all English characters are converted to their Chinese counterparts during preprocessing and are restored just before the results are output.</Paragraph> <Paragraph position="3"> The second stage is word and/or named entity identification and classification on the sequence of word atoms produced in the first stage. Here, a word chunking strategy is applied to detect words and/or entity names by chunking one or more word atoms together according to the word formation patterns of the word atoms and, for named entity recognition, optional entity name formation patterns. 
The problems of word segmentation and/or entity name recognition are recast as chunking one or more word atoms together to form a new word and/or entity name, and a discriminative Markov model, named the Mutual Information Independence Model (MIIM), is adopted for chunking. In addition, an SVM plus sigmoid model is applied to integrate various types of contexts and implement the discriminative modeling in MIIM.</Paragraph> <Paragraph position="4"> The third stage is post-processing, which tries to further resolve ambiguous segmentations and unknown word segmentation. Due to time constraints, this is done only for Chinese word segmentation.</Paragraph> <Paragraph position="5"> No post-processing is done for Chinese named entity recognition.</Paragraph> <Paragraph position="6"> The rest of this paper is organized as follows: Section 2 describes the context-dependent Mutual Information Independence Model in detail, while purely statistical post-processing for Chinese word segmentation is presented in Section 3. Finally, we report the results of our system in Chinese word segmentation and named entity recognition in Section 4 and conclude our work in Section 5.</Paragraph> </Section> </Paper>