<?xml version="1.0" standalone="yes"?>
<Paper uid="C04-1081">
  <Title>Chinese Segmentation and New Word Detection using Conditional Random Fields</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Conditional Random Fields
</SectionTitle>
    <Paragraph position="0"> Conditional random fields (CRFs) are undirected graphical models trained to maximize a conditional probability (Lafferty et al., 2001). A common special-case graph structure is a linear chain, which corresponds to a finite state machine and is suitable for sequence labeling. A linear-chain CRF with parameters Λ = {λ_1, λ_2, ...} defines a conditional probability for a state (label) sequence y = y_1 ... y_T (for example, labels indicating where words start or have their interior) given an input sequence x = x_1 ... x_T (for example, the characters of a Chinese sentence) to be</Paragraph>
    <Paragraph position="1"> $$P_\Lambda(y \mid x) = \frac{1}{Z_x} \exp\!\left( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y_{t-1}, y_t, x, t) \right)$$</Paragraph>
    <Paragraph position="2"> where Z_x is the per-input normalization that makes the probabilities of all state sequences sum to one; f_k(y_{t-1}, y_t, x, t) is a feature function, which is often binary-valued but can be real-valued, and λ_k is a learned weight associated with feature f_k. The feature functions can measure any aspect of a state transition, y_{t-1} → y_t, and of the entire observation sequence, x, centered at the current time step, t. For example, one feature function might have value 1 when y_{t-1} is the state START, y_t is the state NOT-START, and x_t is a word appearing in a lexicon of people's first names. Large positive values for λ_k indicate a preference for such an event; large negative values make the event unlikely.</Paragraph>
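    <Paragraph> To make the notation concrete, here is a minimal Python sketch (not from the paper) of a binary feature function and the resulting unnormalized score; the lexicon entries, label names, and toy sequences are hypothetical.
import math

START, NOT_START = "START", "NOT-START"
first_name_lexicon = {"wang", "li", "zhang"}   # hypothetical lexicon entries

def f_first_name(y_prev, y_t, x, t):
    # Binary feature: fires when the previous label is START, the current
    # label is NOT-START, and the current observation is in the lexicon.
    return 1.0 if (y_prev == START and y_t == NOT_START
                   and x[t] in first_name_lexicon) else 0.0

feature_functions = [f_first_name]   # the f_k
weights = [1.7]                      # the learned lambda_k

def unnormalized_score(y, x):
    # exp(sum_t sum_k lambda_k * f_k(y_{t-1}, y_t, x, t)); dividing by Z_x
    # (the sum of this quantity over all label sequences) gives P(y | x).
    total = 0.0
    for t in range(1, len(x)):
        for lam, f in zip(weights, feature_functions):
            total += lam * f(y[t - 1], y[t], x, t)
    return math.exp(total)

x = ["zhao", "li", "hua"]             # toy observation sequence
y = [START, NOT_START, NOT_START]
print(unnormalized_score(y, x))       # exp(1.7); the feature fires once, at t=1
    </Paragraph>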
    <Paragraph position="3"> The most probable label sequence for an input x,</Paragraph>
    <Paragraph position="4"> $$y^{*} = \operatorname*{argmax}_{y} P_\Lambda(y \mid x),$$</Paragraph>
    <Paragraph position="5"> can be efficiently determined using the Viterbi algorithm (Rabiner, 1990). An N-best list of labeling sequences can also be obtained using a modified Viterbi algorithm and A* search (Schwartz and Chow, 1990).</Paragraph>
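    <Paragraph> For illustration, a compact numpy sketch of Viterbi decoding over the two segmentation labels; the emission and transition scores below are toy values standing in for the weighted feature sums of a trained CRF.
import numpy as np

def viterbi(emit, trans):
    # emit[t, s] and trans[s_prev, s] are log-scores for labels s at step t.
    T, S = emit.shape
    delta = np.full((T, S), -np.inf)
    backptr = np.zeros((T, S), dtype=int)
    delta[0] = emit[0]
    for t in range(1, T):
        for s in range(S):
            scores = delta[t - 1] + trans[:, s] + emit[t, s]
            backptr[t, s] = int(np.argmax(scores))
            delta[t, s] = scores[backptr[t, s]]
    # Follow back-pointers from the best final state.
    path = [int(np.argmax(delta[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(backptr[t, path[-1]])
    return path[::-1]

# Toy usage: 4 characters, labels 0=START, 1=NOT-START.
emit = np.log(np.array([[0.7, 0.3], [0.2, 0.8], [0.6, 0.4], [0.3, 0.7]]))
trans = np.log(np.array([[0.4, 0.6], [0.5, 0.5]]))
print(viterbi(emit, trans))   # [0, 1, 0, 1]
    </Paragraph>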
    <Paragraph position="6"> The parameters can be estimated by maximum likelihood, i.e., by maximizing the conditional probability of a set of label sequences, each given its corresponding input sequence. The log-likelihood of the training set {(x_i, y_i) : i = 1, ..., M} is written</Paragraph>
    <Paragraph position="7"> $$L_\Lambda = \sum_{i=1}^{M} \log P_\Lambda(y_i \mid x_i) = \sum_{i=1}^{M} \left( \sum_{t} \sum_{k} \lambda_k \, f_k(y_{i,t-1}, y_{i,t}, x_i, t) - \log Z_{x_i} \right)$$</Paragraph>
    <Paragraph position="8"> Traditional maximum entropy learning algorithms, such as GIS and IIS (della Pietra et al., 1995), can be used to train CRFs. However, our implementation uses a quasi-Newton gradient-climber BFGS for optimization, which has been shown to converge much faster (Malouf, 2002; Sha and Pereira, 2003).</Paragraph>
    <Paragraph position="9"> The gradient of the log-likelihood with respect to λ_k is</Paragraph>
    <Paragraph position="10"> $$\frac{\partial L_\Lambda}{\partial \lambda_k} = \sum_{i,t} f_k(y_{i,t-1}, y_{i,t}, x_i, t) \;-\; \sum_{i,t} \sum_{y', y''} P_\Lambda(y_{t-1}{=}y',\, y_t{=}y'' \mid x_i)\, f_k(y', y'', x_i, t),$$ that is, the empirical count of feature k in the training data minus its expected count under the current model.</Paragraph>
    <Paragraph position="11"> CRFs share many of the advantageous properties of standard maximum entropy classifiers, including their convex likelihood function, which guarantees that the learning procedure converges to the global maximum.</Paragraph>
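    <Paragraph> As a worked illustration of the gradient just described, a small sketch that computes observed minus expected feature counts for a single training pair by brute-force enumeration of label sequences; a real trainer would use forward-backward and hand the gradient to a quasi-Newton optimizer such as (L-)BFGS. All names and toy data are illustrative.
import itertools
import math

LABELS = ["START", "NOT-START"]

def f_boundary(y_prev, y_t, x, t):
    # Toy binary feature: fires when a new word starts after a non-start label.
    return 1.0 if (y_prev == "NOT-START" and y_t == "START") else 0.0

features = [f_boundary]
weights = [0.5]

def score(y, x):
    return sum(w * f(y[t - 1], y[t], x, t)
               for t in range(1, len(x))
               for w, f in zip(weights, features))

def gradient(x, y_gold):
    # Enumerate all label sequences to get Z_x and the model expectation.
    seqs = [(list(y), math.exp(score(list(y), x)))
            for y in itertools.product(LABELS, repeat=len(x))]
    Z = sum(s for _, s in seqs)
    grad = []
    for f in features:
        observed = sum(f(y_gold[t - 1], y_gold[t], x, t) for t in range(1, len(x)))
        expected = sum((s / Z) * sum(f(y[t - 1], y[t], x, t) for t in range(1, len(x)))
                       for y, s in seqs)
        grad.append(observed - expected)
    return grad   # would be fed to a quasi-Newton optimizer

x = list("ABCD")   # stand-in for four Chinese characters
y = ["START", "NOT-START", "START", "NOT-START"]
print(gradient(x, y))
    </Paragraph>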
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1 Regularization in CRFs
</SectionTitle>
      <Paragraph position="0"> To avoid over-fitting, the log-likelihood is usually penalized by some prior distribution over the parameters. A commonly used prior is a zero-mean Gaussian. With a Gaussian prior, the log-likelihood is penalized as follows:</Paragraph>
      <Paragraph position="1"> $$L_\Lambda = \sum_{i=1}^{M} \log P_\Lambda(y_i \mid x_i) - \sum_{k} \frac{\lambda_k^2}{2\sigma_k^2},$$</Paragraph>
      <Paragraph position="2"> where σ_k² is the variance for feature dimension k.</Paragraph>
      <Paragraph position="3"> The variance can be feature dependent. However, for simplicity, a constant variance is often used for all features. We experiment with an alternate version of the Gaussian prior in which the variance is feature dependent. We bin features by frequency in the training set, and let the features in the same bin share the same variance. The discounted value is set to be λ_k / ⌈c_k/M⌉², where c_k is the count of feature k, M is the bin size set by held-out validation, and ⌈a⌉ is the ceiling function. See Peng and McCallum (2004) for more details and further experiments.</Paragraph>
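      <Paragraph> A hedged sketch of the penalized objective and the ceiling-based discount described above; the exact role of the discounted quantity is an assumption here and should be checked against Peng and McCallum (2004).
import math
import numpy as np

def penalized_log_likelihood(log_likelihood, weights, variances):
    # L_Lambda minus sum_k lambda_k^2 / (2 sigma_k^2).
    return log_likelihood - np.sum(weights ** 2 / (2.0 * variances))

def discounted_value(lam_k, c_k, bin_size):
    # lambda_k / ceil(c_k / M)^2, mirroring the binned scheme in the text
    # (how this quantity is used is reconstructed, so treat it as a sketch).
    return lam_k / math.ceil(c_k / bin_size) ** 2

weights = np.array([0.5, -1.2, 2.0, 0.1])
variances = np.array([10.0, 10.0, 1.0, 1.0])   # two frequency bins sharing variances
print(penalized_log_likelihood(-120.0, weights, variances))
print(discounted_value(2.0, c_k=37, bin_size=10))   # 2.0 / ceil(3.7)^2 = 0.125
      </Paragraph>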
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.2 State transition features
</SectionTitle>
      <Paragraph position="0"> Varying state-transition structures with different Markov orders can be specified by different CRF feature functions, as determined by the number of output labels y examined together in a feature function. We define four different state transition feature functions corresponding to different Markov orders.</Paragraph>
      <Paragraph position="1"> Higher-order features capture more long-range dependencies, but also cause more data sparseness problems and require more memory for training.</Paragraph>
      <Paragraph position="2"> The best Markov order for a particular application can be selected by held-out cross-validation.</Paragraph>
      <Paragraph position="3"> 1. First-order: Here the inputs are examined in the context of the current state only. The feature functions are represented as f(y_t, x). There are no separate parameters for state transitions.</Paragraph>
      <Paragraph position="4"> 2. First-order+transitions: Here we add parameters corresponding to state transitions. The feature functions used are f(y_t, x) and f(y_{t-1}, y_t).
 3. Second-order: Here inputs are examined in the context of the current and previous states. Feature functions are represented as f(y_{t-1}, y_t, x).
 4. Third-order: Here inputs are examined in the context of the current and two previous states. Feature functions are represented as f(y_{t-2}, y_{t-1}, y_t, x). (A sketch of these templates follows below.)</Paragraph>
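      <Paragraph> The following hypothetical sketch shows how each Markov order translates into feature extraction at a position t; the encodings are illustrative only, not the paper's actual feature set.
def obs(x, t):
    # Stand-in observation feature: the current character.
    return "char=" + x[t]

def first_order(y, x, t):
    return ["%s|%s" % (y[t], obs(x, t))]                          # f(y_t, x)

def first_order_plus_transitions(y, x, t):
    return first_order(y, x, t) + ["%s->%s" % (y[t - 1], y[t])]   # adds f(y_{t-1}, y_t)

def second_order(y, x, t):
    return ["%s->%s|%s" % (y[t - 1], y[t], obs(x, t))]            # f(y_{t-1}, y_t, x)

def third_order(y, x, t):
    return ["%s->%s->%s|%s" % (y[t - 2], y[t - 1], y[t], obs(x, t))]   # f(y_{t-2}, y_{t-1}, y_t, x)

x = list("ABCD")
y = ["START", "NOT-START", "START", "NOT-START"]
for template in (first_order, first_order_plus_transitions, second_order, third_order):
    print(template.__name__, template(y, x, 2))
      </Paragraph>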
    </Section>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 CRFs for Word Segmentation
</SectionTitle>
    <Paragraph position="0"> We cast the segmentation problem as one of sequence tagging: Chinese characters that begin a new word are given the START tag, and characters in the middle and at the end of words are given the NONSTART tag. The task of segmenting new, unsegmented test data becomes a matter of assigning a sequence of tags (labels) to the input sequence of Chinese characters.</Paragraph>
    <Paragraph position="1"> Conditional random fields are configured as a linear-chain (finite state machine) for this purpose, and tagging is performed using the Viterbi algorithm to efficiently find the most likely label sequence for a given character sequence.</Paragraph>
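    <Paragraph> A small sketch (not the paper's code) of the correspondence between a word segmentation and its START/NONSTART tag sequence:
def words_to_tags(words):
    # Each word contributes one START tag followed by NONSTART tags.
    tags = []
    for w in words:
        tags.append("START")
        tags.extend(["NONSTART"] * (len(w) - 1))
    return tags

def tags_to_words(chars, tags):
    # Cut the character sequence at every START tag.
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        if tag == "START" and current:
            words.append(current)
            current = ""
        current += ch
    if current:
        words.append(current)
    return words

words = ["AB", "C", "DEF"]     # stand-ins for Chinese words
tags = words_to_tags(words)    # ['START','NONSTART','START','START','NONSTART','NONSTART']
print(tags_to_words("".join(words), tags) == words)   # True
    </Paragraph>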
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Lexicon features as domain knowledge
</SectionTitle>
      <Paragraph position="0"> One advantage of CRFs (as well as of traditional maximum entropy models) is their flexibility in using arbitrary features of the input. To explore this advantage, as well as the importance of domain knowledge, we use many open features from external resources. To specifically evaluate the importance of domain knowledge beyond the training data, we divide our features into two categories: closed features and open features (i.e., features allowed in the competition's "closed test" and "open test" respectively). The open features include a large word list (containing single- and multiple-character words), a character list, and additional topic or part-of-speech character lexicons obtained from various sources.</Paragraph>
      <Paragraph position="1"> The closed features are obtained from the training data alone, by intersecting the character list obtained from the training data with the corresponding open lexicons. Many lexicons of Chinese words and characters are available from the Internet and other sources.</Paragraph>
      <Paragraph position="2"> Besides the word list and character list, our lexicons include 24 lists of Chinese words and characters obtained from several Internet sites, cleaned and augmented by a local native Chinese speaker independently of the competition data. The list of lexicons used in our experiments is shown in Figure 1.</Paragraph>
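      <Paragraph> As an illustration, a hypothetical sketch of turning lexicon membership into per-character features; the lexicon names and entries are placeholders, not the 24 lists used in the experiments.
lexicons = {
    "word_list": {"AB", "DEF"},   # single- and multiple-character words
    "surname": {"A"},             # e.g. a family-name character list
    "place_suffix": {"F"},
}

def lexicon_features(x, t, max_len=4):
    feats = []
    for name, entries in lexicons.items():
        # Character-level membership.
        if x[t] in entries:
            feats.append("in_%s" % name)
        # Does any lexicon word start at position t?
        for n in range(2, max_len + 1):
            if "".join(x[t:t + n]) in entries:
                feats.append("starts_%s_len%d" % (name, n))
    return feats

x = list("ABCDEF")
print(lexicon_features(x, 0))   # ['starts_word_list_len2', 'in_surname']
      </Paragraph>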
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Feature conjunctions
</SectionTitle>
      <Paragraph position="0"> Since CRFs are log-linear models, feature conjunctions are required to form complex, non-linear decision boundaries in the original feature space. We use feature conjunctions in both the open and closed tests, as listed in Figure 2.</Paragraph>
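      <Paragraph> A minimal sketch of forming feature conjunctions by pairing atomic features; the atomic features themselves are hypothetical.
from itertools import combinations

def atomic_features(x, t):
    # Hypothetical atomic features for position t.
    feats = ["char=" + x[t]]
    if t != 0:
        feats.append("prev_char=" + x[t - 1])
    if t != len(x) - 1:
        feats.append("next_char=" + x[t + 1])
    return feats

def with_conjunctions(feats):
    # Pairwise conjunctions let a log-linear model capture interactions
    # that the atomic features alone cannot express.
    return feats + ["%s+%s" % (a, b) for a, b in combinations(feats, 2)]

x = list("ABC")
print(with_conjunctions(atomic_features(x, 1)))
# ['char=B', 'prev_char=A', 'next_char=C', 'char=B+prev_char=A',
#  'char=B+next_char=C', 'prev_char=A+next_char=C']
      </Paragraph>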
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Probabilistic New Word Identification
</SectionTitle>
    <Paragraph position="0"> Since no vocabulary list could ever be complete, new word (unknown word) identification is an important issue in Chinese segmentation. Unknown words cause segmentation errors: out-of-vocabulary words in the input text are often incorrectly segmented into single-character or other overly short words (Chen and Bai, 1998). Traditionally, new word detection has been considered a standalone process. We consider new word detection an integral part of segmentation, aiming to improve both: detected new words are added to the word list lexicon in order to improve segmentation, and improved segmentation can potentially further improve new word detection. We measure the performance of new word detection by its improvement of segmentation.
 Given a word segmentation proposed by the CRF, we can compute a confidence in each segment. We detect as new words those segments that are not in the existing word list, yet are either high-confidence segments or low-confidence segments surrounded by high-confidence words. A confidence threshold of 0.9 is determined by cross-validation.</Paragraph>
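    <Paragraph> A sketch of this selection rule, assuming per-segment confidences are already available; the helper names and toy values are illustrative, with the 0.9 threshold taken from the text.
THRESHOLD = 0.9   # confidence threshold determined by cross-validation

def detect_new_words(segments, confidences, word_list):
    # A segment not in the word list is proposed as a new word if it is itself
    # high-confidence, or low-confidence but flanked by high-confidence segments.
    new_words = set()
    last = len(segments) - 1
    for i, (seg, conf) in enumerate(zip(segments, confidences)):
        if seg in word_list:
            continue
        confident = conf >= THRESHOLD
        flanked = (i not in (0, last)
                   and confidences[i - 1] >= THRESHOLD
                   and confidences[i + 1] >= THRESHOLD)
        if confident or flanked:
            new_words.add(seg)
    return new_words

segments = ["AB", "XY", "C"]        # stand-ins for segmented Chinese words
confidences = [0.95, 0.40, 0.92]
print(detect_new_words(segments, confidences, word_list={"AB", "C"}))   # {'XY'}
    </Paragraph>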
    <Paragraph position="1"> Segment confidence is estimated using constrained forward-backward (Culotta and McCallum, 2004). The standard forward-backward algorithm (Rabiner, 1990) calculates Z_x, the total likelihood of all label sequences y given a sequence x. The constrained forward-backward algorithm calculates Z'_x, the total likelihood of all paths passing through a constrained segment (in our case, a sequence of characters starting with a START tag followed by a few NONSTART tags before the next START tag).</Paragraph>
    <Paragraph position="2"> The confidence in this segment is then Z'_x / Z_x, a real number between 0 and 1.</Paragraph>
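    <Paragraph> For illustration, a small numpy/scipy sketch of computing Z_x and the constrained Z'_x with the forward algorithm by clamping the labels of the proposed segment; the potentials are toy values, not a trained model.
import numpy as np
from scipy.special import logsumexp

def log_forward(emit, trans, clamp=None):
    # emit[t, s] and trans[s_prev, s] are log potentials. `clamp` maps a
    # position t to a required label; all other labels are disallowed there.
    T, S = emit.shape
    clamp = clamp or {}
    alpha = emit[0].copy()
    if 0 in clamp:
        alpha[[s for s in range(S) if s != clamp[0]]] = -np.inf
    for t in range(1, T):
        alpha = logsumexp(alpha[:, None] + trans, axis=0) + emit[t]
        if t in clamp:
            alpha[[s for s in range(S) if s != clamp[t]]] = -np.inf
    return logsumexp(alpha)

# Toy potentials over labels 0=START, 1=NONSTART; the proposed segment is the
# word covering positions 1-2 (START at 1, NONSTART at 2).
emit = np.log(np.array([[0.6, 0.4], [0.3, 0.7], [0.5, 0.5], [0.8, 0.2]]))
trans = np.log(np.array([[0.4, 0.6], [0.5, 0.5]]))
log_Zx = log_forward(emit, trans)                        # all label sequences
log_Zcx = log_forward(emit, trans, clamp={1: 0, 2: 1})   # paths through the segment
print(np.exp(log_Zcx - log_Zx))                          # segment confidence in (0, 1]
    </Paragraph>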
    <Paragraph position="5"> In order to increase the recall of new words, we consider not only the most likely (Viterbi) segmentation, but also the top N most likely segmentations (an N-best list), and detect new words according to the above criteria in all N segmentations.</Paragraph>
    <Paragraph position="6"> Many errors can be corrected by new word detection. For example, the person name "xc7x87x95" occurs four times. In the first pass of segmentation, two occurrences are segmented correctly and the other two are mistakenly segmented as "xc7 x87 x95" (they are segmented differently because the Viterbi algorithm decodes based on context). However, "xc7x87x95" is identified as a new word and added to the word list lexicon. In the second pass of segmentation, the remaining two mistakes are corrected.</Paragraph>
  </Section>
</Paper>