<?xml version="1.0" standalone="yes"?> <Paper uid="C04-1066"> <Title>Japanese Unknown Word Identification by Character-based Chunking</Title> <Section position="3" start_page="0" end_page="2" type="intro"> <SectionTitle> 2 Method </SectionTitle> <Paragraph position="0"> We describe our method for unknown word identification. The method consists of the following three steps: 1. A statistical morphological analyzer is applied to the input sentence and produces the n-best segmentation candidates with their corresponding part-of-speech (POS) tags.</Paragraph> <Paragraph position="1"> 2. Each character in the sentence is annotated with features, namely its character type and multiple POS tag information, according to the n-best word candidates.</Paragraph> <Paragraph position="2"> 3. Unknown words are identified by a support vector machine (SVM)-based chunker using the annotated features.</Paragraph> <Paragraph position="3"> We now illustrate each of these three steps in more detail.</Paragraph> <Section position="1" start_page="0" end_page="1" type="sub_section"> <SectionTitle> 2.1 Japanese Morphological Analysis </SectionTitle> <Paragraph position="0"> Japanese morphological analysis is based on a Markov model. The goal is to find the word sequence W and POS tag sequence T that maximize the following probability:</Paragraph> <Paragraph position="1"> (W, T) = argmax_{W,T} P(T|W).</Paragraph> <Paragraph position="2"> Bayes' rule allows P(T|W) to be decomposed as the product of tag and word probabilities:</Paragraph> <Paragraph position="3"> argmax_{W,T} P(T|W) = argmax_{W,T} P(W|T)P(T), since P(W) is constant for a given input.</Paragraph> <Paragraph position="4"> We introduce the approximations that the word probability is conditioned only on the tag of the word, and that the tag probability is determined only by the immediately preceding tag, i.e., P(W|T)P(T) is approximated by the product over i of p(w_i|t_i) p(t_i|t_{i-1}). The probabilities are estimated from frequencies in tagged corpora using maximum likelihood estimation. Using these parameters, the most probable tag and word sequences are determined by the Viterbi algorithm.</Paragraph> <Paragraph position="5"> In practice, we use the negative log likelihood as a cost, so maximizing probability means minimizing cost. 
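The Viterbi search over costs can be sketched as follows. This is a minimal illustration, not the paper's actual implementation: the function names, toy tag set, and probabilities are all hypothetical, and a real analyzer searches a word lattice rather than a fixed word sequence.

```python
import math

# A minimal sketch of Viterbi decoding with cost = -log(probability),
# so that maximizing probability is equivalent to minimizing accumulated cost.
# The toy tags, words, and probabilities below are hypothetical.

def cost(p):
    return -math.log(p)

def viterbi(words, tags, start_p, trans_p, emit_p):
    """Return the tag sequence with minimal accumulated cost."""
    best = [{} for _ in words]   # best[i][t]: min cost of tagging words[:i+1], ending in tag t
    back = [{} for _ in words]   # back-pointers for path recovery
    for t in tags:
        best[0][t] = cost(start_p[t]) + cost(emit_p[(words[0], t)])
    for i in range(1, len(words)):
        for t in tags:
            c, prev = min((best[i - 1][p] + cost(trans_p[(p, t)]), p) for p in tags)
            best[i][t] = c + cost(emit_p[(words[i], t)])
            back[i][t] = prev
    # Trace back from the lowest-cost final tag.
    tag = min(best[-1], key=best[-1].get)
    path = [tag]
    for i in range(len(words) - 1, 0, -1):
        tag = back[i][tag]
        path.append(tag)
    return path[::-1]

tags = ["N", "V"]
start_p = {"N": 0.9, "V": 0.1}
trans_p = {("N", "N"): 0.2, ("N", "V"): 0.8, ("V", "N"): 0.5, ("V", "V"): 0.5}
emit_p = {("dog", "N"): 0.9, ("dog", "V"): 0.1, ("runs", "N"): 0.2, ("runs", "V"): 0.8}
print(viterbi(["dog", "runs"], tags, start_p, trans_p, emit_p))  # ['N', 'V']
```

Because costs are additive, partial path costs accumulate monotonically, which is what makes the n-best extraction within a cost width described next straightforward.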
In our method, the redundant analysis outputs are the top n-best word candidates within a certain cost width. The n-best word candidates are collected for each character in ascending order of the accumulated cost from the beginning of the sentence.</Paragraph> <Paragraph position="6"> Note that, if the difference between the costs of the best candidate and the n-th best candidate exceeds a predefined cost width, we discard the n-th best candidate. The cost width is defined by the lowest probability among all events occurring in the training data. We use ChaSen as the morphological analyzer. ChaSen produces the n-best segmentations within a user-defined cost width.</Paragraph> </Section> <Section position="2" start_page="1" end_page="1" type="sub_section"> <SectionTitle> 2.2 Feature for Chunking </SectionTitle> <Paragraph position="0"> There are two general indicators of unknown words in Japanese texts. First, unknown words have highly ambiguous boundaries. Thus, a morphological analyzer, which is trained only on known words, often produces confused segmentations and POS assignments around an unknown word. If we inspect the lattice built during the analysis, subgraphs around unknown words are often dense, with many equally plausible paths. We intend to reflect this observation as a feature, and do so by using the n-best candidates from the morphological analyzer. As shown in Figure 1, each character (Char.) in an input sentence is annotated with a feature encoded as a pair of a segmentation tag and a POS tag.</Paragraph> <Paragraph position="2"> For example, the best POS of the character &quot;&quot; is &quot;GeneralNoun-B&quot;. This indicates that the POS is a common noun (General Noun) and that the character is the first one in a multi-character token. The POS tagset is based on IPADIC (Asahara and Matsumoto, 2002), and the segmentation tags are summarized in Table 1. The 3-best candidates from the morphological analyzer are used. 
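The annotation of POS/segmentation-tag pairs from the n-best candidates can be sketched as below. The function name and the simplified B/I segmentation tags are assumptions for illustration (the paper's full segmentation tag set is the one summarized in its Table 1), and the toy ASCII segmentations stand in for Japanese ones.

```python
# A sketch of per-character feature annotation from n-best segmentation
# candidates. Each candidate is a list of (word, POS) pairs covering the
# sentence; every character receives one "POS-tag" feature per candidate.
# The B/I position tags and all names here are illustrative assumptions.

def annotate_chars(sentence, nbest):
    features = [[] for _ in sentence]
    for candidate in nbest:
        offset = 0
        for word, pos in candidate:
            for j in range(len(word)):
                seg = "B" if j == 0 else "I"   # first char of a token vs. inside it
                features[offset + j].append(f"{pos}-{seg}")
            offset += len(word)
    return features

nbest = [
    [("ab", "Noun"), ("cd", "Verb")],      # best candidate
    [("a", "Prefix"), ("bcd", "Noun")],    # 2nd-best candidate
]
print(annotate_chars("abcd", nbest))
# [['Noun-B', 'Prefix-B'], ['Noun-I', 'Noun-B'], ['Verb-B', 'Noun-I'], ['Verb-I', 'Noun-I']]
```

When the candidates disagree on a character's features, as around unknown words, the per-character feature lists become heterogeneous, which is exactly the signal the chunker exploits.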
The second indicator of Japanese unknown words is the character type. Unknown words often occur around long Katakana sequences and alphabetic characters. We therefore use the character type (Char. Type) as a feature, as shown in Figure 1. Seven character types are defined: Space, Digit, Lowercase alphabet, Uppercase alphabet, Hiragana, Katakana, and Other (Kanji). The character type is used directly or indirectly in most previous work and appears to be an important feature for characterizing unknown words in Japanese texts.</Paragraph> </Section> <Section position="3" start_page="1" end_page="2" type="sub_section"> <SectionTitle> 2.3 Support Vector Machine-based Chunking </SectionTitle> <Paragraph position="0"> We use the chunker YamCha (Kudo and Matsumoto, 2001), which is based on SVMs (Vapnik, 1998).</Paragraph> <Paragraph position="1"> Suppose we have a set of training data for a binary classification problem: (x_1, y_1), ..., (x_l, y_l),</Paragraph> <Paragraph position="3"> where x_i is the feature vector of the i-th sample in the training data and y_i ∈ {+1, -1} is the label of the sample. The goal is to find a decision function which accurately predicts y for an unseen x. A support vector machine classifier gives a decision function f(x) = sign(g(x)) for an input vector x, where</Paragraph> <Paragraph position="5"> g(x) = Σ_{i∈SV} α_i y_i K(x_i, x) + b.</Paragraph> <Paragraph position="6"> K(x,z) is a kernel function which maps vectors into a higher-dimensional space. We use a polynomial kernel of degree 2, given by K(x,z) = (x·z + 1)^2.</Paragraph> <Paragraph position="8"> SVMs are binary classifiers. We extend them to an n-class classifier in order to compose chunking rules.</Paragraph> <Paragraph position="10"> Two methods are often used for the extension: the &quot;One vs. Rest method&quot; and the &quot;Pairwise method&quot;. In the &quot;One vs. Rest method&quot;, we prepare n binary classifiers, each separating one class from the remaining classes. 
In the &quot;Pairwise method&quot;, by contrast, we prepare</Paragraph> <Paragraph position="12"> n(n-1)/2 binary classifiers, one for each pair of classes. We use the &quot;Pairwise method&quot; since it is more efficient to train than the &quot;One vs. Rest method&quot;.</Paragraph> <Paragraph position="13"> Chunking is performed by deterministically annotating each character with a tag. Table 2 shows the unknown word tags for chunking, which follow the IOB2 model (Ramshaw and Marcus, 1995).</Paragraph> </Section> <Section position="4" start_page="2" end_page="2" type="sub_section"> <SectionTitle> Tag Description </SectionTitle> <Paragraph position="0"> B: the first character of an unknown word
I: a character inside an unknown word (other than the first)
O: a character in a known word
Figure 1 illustrates a snapshot of the chunking procedure. The chunker refers to a context of two characters on each side of the current position. The two preceding unknown word tags are also used as features, since the chunker has already determined them and they are available. In the example, the chunker uses the features appearing within the solid box to infer the unknown word tag (&quot;I&quot;) at position i. We perform chunking either from the beginning of a sentence (forward direction) or from the end of a sentence (backward direction).</Paragraph> </Section> </Section> </Paper>