File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/06/w06-0116_metho.xml
Size: 8,749 bytes
Last Modified: 2025-10-06 14:10:36
<?xml version="1.0" standalone="yes"?> <Paper uid="W06-0116"> <Title>Chinese Named Entity Recognition with Conditional Random Fields</Title> <Section position="4" start_page="0" end_page="118" type="metho"> <SectionTitle> 2 Conditional Random Fields </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="118" type="sub_section"> <SectionTitle> 2.1 The model </SectionTitle> <Paragraph position="0"> Conditional Random Fields(CRFs), a statistical sequence modeling framework, was first introduced by Lafferty et al(Lafferty et al., 2001).</Paragraph> <Paragraph position="1"> The model has been used for chunking(Sha and Pereira, 2003). We only describe the model briefly since full details are presented in the paper(Lafferty et al., 2001).</Paragraph> <Paragraph position="2"> In this paper, we regard Chinese NER as a sequence labeling problem. For our sequence labeling problem, we create a linear-chain CRFs based on an undirected graph G = (V,E), where V is the set of random variables Y = {Yi|1 [?] i [?] n}, for each of n tokens in an input sentence and</Paragraph> <Paragraph position="4"> edges forming a linear chain. For each sentence x, we define two non-negative factors: exp(summationtextKk=1 lkfk(yi[?]1,yi,x)) for each edge exp(summationtextKprimek=1 lprimekfprimek(yi,x)) for each node where fk is a binary feature function, and K and Kprime are the number of features defined for edges and nodes respectively. Following Lafferty et al(Lafferty et al., 2001), the conditional probability of a sequence of tags y given a sequence of tokens x is:</Paragraph> <Paragraph position="6"> where Z(x) is the normalization constant. Given the training data D, a set of sentences (characters with their corresponding tags), the parameters of the model are trained to maximize the conditional log-likelihood. When testing, given a sentence x in the test data, the tagging sequence y is given by ArgmaxyprimeP(yprime|x).</Paragraph> <Paragraph position="7"> CRFs allow us to utilize a large number of observation features as well as different state sequence based features and other features we want to add.</Paragraph> </Section> <Section position="2" start_page="118" end_page="118" type="sub_section"> <SectionTitle> 2.2 CRFs for Chinese NER </SectionTitle> <Paragraph position="0"> Our CRFs-based system has a first-order Markov dependency between NER tags.</Paragraph> <Paragraph position="1"> In our experiments, we do not use feature selection and all features are used in training and testing. We use the following feature functions:</Paragraph> <Paragraph position="3"> where p(x,i) is a predicate on the input sequence x and current position i and q(yi[?]1,yi) is a predicate on pairs of labels. For instance, p(x,i) might be &quot;the char at position i is(and)&quot;.</Paragraph> <Paragraph position="4"> In our system, we used CRF++ (V0.42)1 to implement the CRFs model.</Paragraph> </Section> </Section> <Section position="5" start_page="118" end_page="119" type="metho"> <SectionTitle> 3 Chinese Named Entity Recognition </SectionTitle> <Paragraph position="0"> The training data format is similar to that of the CoNLL NER task 2002, adapted for Chinese. The data is presented in two-column format, where the first column consists of the character and the second is a tag.</Paragraph> <Paragraph position="1"> Table 1 shows the types of Named Entities in the data. 
<Section position="5" start_page="118" end_page="119" type="metho"> <SectionTitle> 3 Chinese Named Entity Recognition </SectionTitle>
<Paragraph position="0"> The training data format is similar to that of the CoNLL 2002 NER shared task, adapted for Chinese. The data is presented in a two-column format, where the first column contains the character and the second column contains its tag.</Paragraph>
<Paragraph position="1"> Table 1 shows the types of Named Entities in the data. Every character is tagged with an NE type label extended with B (beginning character of an NE) or I (non-beginning character of an NE), or with O (not part of an NE).</Paragraph>
<Paragraph position="2"> To obtain a good-quality estimate of the conditional probability of a tag, the observations should be based on features that capture the difference between the competing events. In our system, we define the following types of features.</Paragraph>
<Section position="1" start_page="118" end_page="118" type="sub_section"> <SectionTitle> 3.1 Basic Features </SectionTitle>
<Paragraph position="0"> The basic features of our system are character unigrams C_n and character bigrams C_n C_{n+1} within a window around the current character, where C refers to a Chinese character, C_0 denotes the current character, and C_n (C_{-n}) denotes the character n positions to the right (left) of the current character.</Paragraph>
<Paragraph position="1"> For example, once the current character C_0 is fixed in a character sequence, C_{-1} denotes the character immediately to its left, C_{-1}C_0 denotes the bigram they form, and so on.</Paragraph>
</Section>
<Section position="2" start_page="118" end_page="119" type="sub_section"> <SectionTitle> 3.2 Word Boundary Features </SectionTitle>
<Paragraph position="0"> The sentences in the training data are given as sequences of characters. However, many useful features are related to words; for instance, certain whole words are important cues for Person Names. We therefore perform word segmentation with the left-to-right maximum matching algorithm, using a word dictionary generated from n-gram statistics over the training corpus, and use the resulting word boundary tags as features for the model.</Paragraph>
<Paragraph position="1"> Firstly, we construct a word dictionary by extracting N-grams from the training corpus as follows: 1. Extract arbitrary N-grams (2 ≤ n ≤ 10, frequency ≥ 10) from the training corpus. We get a list W1.</Paragraph>
<Paragraph position="2"> 2. Perform statistical substring reduction on W1, as described in (Lv et al., 2004). We get a list W2.</Paragraph>
<Paragraph position="3"> 3. Construct a character list CH containing the 20 most frequent characters in the training corpus.</Paragraph>
<Paragraph position="4"> 4. Remove from W2 the strings that contain characters in CH. We get the final N-gram list W3.</Paragraph>
<Paragraph position="5"> Secondly, we use W3 as the dictionary for left-to-right maximum matching word segmentation and assign word boundary tags to the sentences. Each character receives one of 4 possible boundary tags: "B" for a character that begins a word and is followed by another character, "M" for a character that occurs in the middle of a word, "E" for a character that ends a word, and "S" for a character that is a single-character word. The word boundary features of our system are: * WT_n (n = -1, 0, 1), where WT refers to the word boundary tag, WT_0 denotes the tag of the current character, and WT_n (WT_{-n}) denotes the tag n positions to the right (left) of the current character.</Paragraph>
</Section>
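A minimal sketch of this word-boundary pipeline, under simplified assumptions: frequent n-grams are collected into a dictionary, sentences are segmented by left-to-right maximum matching, and each character is mapped to a B/M/E/S tag. The statistical substring reduction and frequent-character filter of steps 2-4 are only noted in comments; all names and the example are ours, not the original implementation.

```python
# Illustrative pipeline for the word-boundary features (simplified).
from collections import Counter

def build_ngram_dict(sentences, min_n=2, max_n=10, min_freq=10):
    """Step 1: collect frequent n-grams (2 <= n <= 10, freq >= 10) as list W1."""
    counts = Counter()
    for s in sentences:
        for n in range(min_n, max_n + 1):
            for i in range(len(s) - n + 1):
                counts[s[i:i + n]] += 1
    # The paper additionally applies statistical substring reduction (step 2)
    # and removes n-grams containing the 20 most frequent characters (steps 3-4).
    return {w for w, c in counts.items() if c >= min_freq}

def max_match(sentence, dictionary, max_len=10):
    """Left-to-right maximum matching segmentation with the n-gram dictionary."""
    words, i = [], 0
    while i < len(sentence):
        for n in range(min(max_len, len(sentence) - i), 0, -1):
            if n == 1 or sentence[i:i + n] in dictionary:
                words.append(sentence[i:i + n])
                i += n
                break
    return words

def boundary_tags(words):
    """Map each character to one of the B/M/E/S word-boundary tags."""
    tags = []
    for w in words:
        if len(w) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(w) - 2) + ["E"])
    return tags

if __name__ == "__main__":
    segs = max_match("北京大学在北京", {"北京大学", "北京"})
    print(segs, boundary_tags(segs))
```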
<Section position="3" start_page="119" end_page="119" type="sub_section"> <SectionTitle> 3.3 Char Features </SectionTitle>
<Paragraph position="0"> If external resources can be used, lists of surnames and of named-entity prefixes and suffixes are often employed for Chinese NER. In our system, we generate such lists automatically from the training corpus. They include PSur (surnames of Person Names) and OBSuf (bi-gram characters, the last two characters of an Organization Name, i.e., the suffix of an Organization Name).</Paragraph>
<Paragraph position="1"> We remove items from the uni-gram lists if their frequency is less than 5, and from the bi-gram lists if their frequency is less than 2. Based on these lists, we assign tags to every character. For instance, if a character is included in the PSur list, we assign the tag "PSur 1"; otherwise we assign the tag "PSur 0". We then define the char features over these tags.</Paragraph>
</Section> </Section>
<Section position="6" start_page="119" end_page="119" type="metho"> <SectionTitle> 4 Post-Processing </SectionTitle>
<Paragraph position="0"> There are inconsistent results in the output tagged by the CRFs model. Thus we perform a post-processing step to correct these errors.</Paragraph>
<Paragraph position="1"> The post-processing step tries to assign the correct tags according to the n-best results for every sentence. Our system outputs the top 20 labeled sequences for each sentence together with their confidence scores. The post-processing algorithm is shown in Table 2, with the following notation: S is the list of sentences, S = {s_1, s_2, ..., s_n}; T is the set of m-best results for S, T = {t_1, t_2, ..., t_n}, where t_i is the set of m-best results for s_i; and p_ij is the score of t_ij, the j-th result in t_i.</Paragraph>
<Paragraph position="2"> Firstly, we collect an NE list from the high-confidence results. Secondly, we re-assign the tags of the low-confidence results using the NE list.</Paragraph>
</Section> </Paper>
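Table 2 itself is not reproduced in this file, so the following Python sketch only illustrates the two-stage idea just described: entities from confidently tagged sentences are collected into a lexicon, which is then used to re-tag low-confidence sentences. The confidence thresholds and the rule for choosing among the n-best candidates are assumptions made for this example, not the authors' exact algorithm.

```python
# Rough sketch of the post-processing idea (thresholds and selection rule assumed).
def extract_entities(chars, tags):
    """Collect (string, type) named entities from a B-/I-/O-tagged sequence."""
    entities, cur, cur_type = [], "", None
    for c, t in zip(chars, tags):
        if t.startswith("B"):
            if cur:
                entities.append((cur, cur_type))
            cur, cur_type = c, t.split("-", 1)[-1]
        elif t.startswith("I") and cur:
            cur += c
        else:
            if cur:
                entities.append((cur, cur_type))
            cur, cur_type = "", None
    if cur:
        entities.append((cur, cur_type))
    return entities

def post_process(sentences, nbest, high=0.9, low=0.5):
    """sentences: list of char lists; nbest: per sentence, list of (tags, score),
    best first. Returns one tag sequence per sentence."""
    # Step 1: build an NE lexicon from the high-confidence best results.
    lexicon = {}
    for chars, results in zip(sentences, nbest):
        tags, score = results[0]
        if score >= high:
            for text, ne_type in extract_entities(chars, tags):
                lexicon[text] = ne_type
    # Step 2: for low-confidence sentences, prefer an n-best candidate whose
    # entities are all attested in the lexicon (assumed selection criterion).
    output = []
    for chars, results in zip(sentences, nbest):
        tags, score = results[0]
        if score < low:
            for cand_tags, _ in results:
                ents = extract_entities(chars, cand_tags)
                if ents and all(text in lexicon for text, _ in ents):
                    tags = cand_tags
                    break
        output.append(tags)
    return output
```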