<?xml version="1.0" standalone="yes"?>
<Paper uid="P03-1004">
  <Title>Fast Methods for Kernel-based Text Analysis</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Kernel Method and Support Vector
Machines
</SectionTitle>
    <Paragraph position="0"> Suppose we have a set of training data for a binary classification problem: (x1;y1);:::;(xL;yL) xj 2&lt;N; yj 2f+1;!1g; where xj is a feature vector of the j-th training sample, and yj is the class label associated with this training sample. The decision function of SVMs is defined by</Paragraph>
    <Paragraph position="2"> where: (A) ` is a non-liner mapping function from &lt;N to &lt;H (N ? H). (B) fij;b 2&lt;; fij , 0.</Paragraph>
    <Paragraph position="3"> The mapping function ` should be designed such that all training examples are linearly separable in &lt;H space. Since H is much larger than N, it requires heavy computation to evaluate the dot products `(xi)C/`(x) in an explicit form. This problem can be overcome by noticing that both construction of optimal parameter fii (we will omit the details of this construction here) and the calculation of the decision function only require the evaluation of dot products `(xi)C/`(x). This is critical, since, in some cases, the dot products can be evaluated by a simple Kernel Function: K(x1;x2) = `(x1)C/`(x2). Substituting kernel function into (1), we have the following decision function.</Paragraph>
    <Paragraph position="5"> (2) One of the advantages of kernels is that they are not limited to vectorial object x, but that they are applicable to any kind of object representation, just given the dot products.</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Polynomial Kernel of degree d
</SectionTitle>
    <Paragraph position="0"> For many tasks in NLP, the training and test examples are represented in binary vectors; or sets, since examples in NLP are usually represented in so-called Feature Structures. Here, we focus on such cases 1.</Paragraph>
    <Paragraph position="1"> Suppose a feature set F = f1;2;:::;Ng and training examples Xj(j = 1;2;:::;L), all of which are subsets of F (i.e., Xj F). In this case, Xj can be regarded as a binary vector xj = (xj1;xj2;:::;xjN) where xji = 1 if i 2 Xj, xji = 0 otherwise. The dot product of x1 and x2 is given by x1 C/x2 = jX1 \X2j.</Paragraph>
    <Paragraph position="2"> Definition 1 Polynomial Kernel of degree d Given sets X and Y , corresponding to binary feature vectors x and y, Polynomial Kernel of degree d Kd(X;Y) is given by</Paragraph>
    <Paragraph position="4"> where d = 1;2;3;:::.</Paragraph>
    <Paragraph position="5"> In this paper, (3) will be referred to as an implicit form of the Polynomial Kernel.</Paragraph>
    <Paragraph position="6"> 1In the Maximum Entropy model widely applied in NLP, we usually suppose binary feature functions fi(Xj) 2f0;1g. This formalization is exactly same as representing an example Xj in a set fkjfk(Xj) = 1g.</Paragraph>
    <Paragraph position="7"> It is known in NLP that a combination of features, a subset of feature set F in general, contributes to overall accuracy. In previous research, feature combination has been selected manually. The use of a polynomial kernel allows such feature expansion without loss of generality or an increase in computational costs, since the Polynomial Kernel of degree d implicitly maps the original feature space F into Fd space. (i.e., ` : F ! Fd). This property is critical and some reports say that, in NLP, the polynomial kernel outperforms the simple linear kernel (Kudo and Matsumoto, 2000; Isozaki and Kazawa, 2002).</Paragraph>
    <Paragraph position="8"> Here, we will give an explicit form of the Polynomial Kernel to show the mapping function `(C/). Lemma 1 Explicit form of Polynomial Kernel.</Paragraph>
    <Paragraph position="9"> The Polynomial Kernel of degree d can be rewritten as</Paragraph>
    <Paragraph position="11"> cd(r) will be referred as a subset weight of the Polynomial Kernel of degree d. This function gives a prior weight to the subset s, where jsj = r.</Paragraph>
    <Paragraph position="12"> Example 1 Quadratic and Cubic Kernel Given sets X = fa;b;c;dg and Y = fa;b;d;eg, the Quadratic Kernel K2(X;Y) and the Cubic Kernel K3(X;Y) can be calculated in an implicit form as:</Paragraph>
    <Paragraph position="14"> Using Lemma 1, the subset weights of the Quadratic Kernel and the Cubic Kernel can be calculated as c2(0) = 1; c2(1) = 3; c2(2) = 2 and c3(0)=1; c3(1)=7; c3(2)=12; c3(3)=6.</Paragraph>
    <Paragraph position="15"> In addition, subsets Pr(X\Y) (r = 0;1;2;3) are given as follows: P0(X \ Y) =</Paragraph>
    <Paragraph position="17"/>
    <Paragraph position="19"/>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Fast Classifiers for Polynomial Kernel
</SectionTitle>
    <Paragraph position="0"> In this section, we introduce two fast classification algorithms for the Polynomial Kernel of degree d.</Paragraph>
    <Paragraph position="1"> Before describing them, we give the baseline classifier (PKB):</Paragraph>
    <Paragraph position="3"> The complexity of PKB is O(jXj C/ jSVj), since it takes O(jXj) to calculate (1+jXj \Xj)d and there are a total of jSVj support examples.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 PKI (Inverted Representation)
</SectionTitle>
      <Paragraph position="0"> Given an item i 2 F, if we know in advance the set of support examples which contain item i 2 F, we do not need to calculate jXj \Xj for all support examples. This is a naive extension of Inverted Indexing in Information Retrieval. Figure 1 shows the pseudo code of the algorithm PKI. The function h(i) is a pre-compiled table and returns a set of support examples which contain item i.</Paragraph>
      <Paragraph position="1"> The complexity of the PKI is O(jXjC/B +jSVj), where B is an average of jh(i)j over all item i 2 F.</Paragraph>
      <Paragraph position="2"> The PKI can make the classification speed drastically faster when B is small, in other words, when feature space is relatively sparse (i.e., B ? jSVj).</Paragraph>
      <Paragraph position="3"> The feature space is often sparse in many tasks in NLP, since lexical entries are used as features.</Paragraph>
      <Paragraph position="4"> The algorithm PKI does not change the final accuracy of the classification.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.2 PKE (Expanded Representation)
4.2.1 Basic Idea of PKE
</SectionTitle>
      <Paragraph position="0"> Using Lemma 1, we can represent the decision function (5) in an explicit form:</Paragraph>
      <Paragraph position="2"> The classification algorithm given by (7) will be referred to as PKE. The complexity of PKE is O(jGd(X)j) = O(jXjd), independent on the number of support examples jSVj.</Paragraph>
      <Paragraph position="3">  To apply the PKE, we first calculate jGd(F)j degree of vectors</Paragraph>
      <Paragraph position="5"> This calculation is trivial only when we use a Quadratic Kernel, since we just project the original feature space F into F PS F space, which is small enough to be calculated by a naive exhaustive method. However, if we, for instance, use a polynomial kernel of degree 3 or higher, this calculation becomes not trivial, since the size of feature space  exponentially increases. Here we take the following strategy: 1. Instead of using the original vector w, we use w0, an approximation of w.</Paragraph>
      <Paragraph position="6"> 2. We apply the Subset Mining algorithm to calculate w0 efficiently.</Paragraph>
      <Paragraph position="7"> 2I(t) returns 1 if t is true,returns 0 otherwise.  Definition 2 w0: An approximation of w An approximation of w is given by w0 = (w0(s1);w0(s2);:::;w0(sjGd(F)j)), where w0(s) is set to 0 if w(s) is trivially close to 0. (i.e., neg &lt; w(s) &lt; pos ( neg &lt; 0; pos &gt; 0), where pos and neg are predefined thresholds).</Paragraph>
      <Paragraph position="8"> The algorithm PKE is an approximation of the PKB, and changes the final accuracy according to the selection of thresholds pos and neg. The calculation of w0 is formulated as the following mining problem: Definition 3 Feature Combination Mining Given a set of support examples and subset weight cd(r), extract all subsets s and their weights w(s) if w(s) holds w(s) , pos or w(s) * neg .</Paragraph>
      <Paragraph position="9"> In this paper, we apply a Sub-Structure Mining algorithm to the feature combination mining problem. Generally speaking, sub-structures mining algorithms efficiently extract frequent sub-structures (e.g., subsets, sub-sequences, sub-trees, or subgraphs) from a large database (set of transactions). In this context, frequent means that there are no less than &gt;&gt; transactions which contain a sub-structure. The parameter &gt;&gt; is usually referred to as the Minimum Support. Since we must enumerate all subsets of F, we can apply subset mining algorithm, in some times called as Basket Mining algorithm, to our task. There are many subset mining algorithms proposed, however, we will focus on the PrefixSpan algorithm, which is an efficient algorithm for sequential pattern mining, originally proposed by (Pei et al., 2001). The PrefixSpan was originally designed to extract frequent sub-sequence (not subset) patterns, however, it is a trivial difference since a set can be seen as a special case of sequences (i.e., by sorting items in a set by lexicographic order, the set becomes a sequence). The basic idea of the PrefixSpan is to divide the database by frequent sub-patterns (prefix) and to grow the prefix-spanning pattern in a depth-first search fashion.</Paragraph>
      <Paragraph position="10"> We now modify the PrefixSpan to suit to our feature combination mining.</Paragraph>
      <Paragraph position="11"> + size constraint We only enumerate up to subsets of size d.</Paragraph>
      <Paragraph position="12"> when we plan to apply the Polynomial Kernel of degree d.</Paragraph>
      <Paragraph position="13"> + Subset weight cd(r) In the original PrefixSpan, the frequency of each subset does not change by its size. However, in our mining task, it changes (i.e., the frequency of subset s is weighted by cd(jsj)).</Paragraph>
      <Paragraph position="14"> Here, we process the mining algorithm by assuming that each transaction (support example Xj) has its frequency Cdyjfij, where Cd = max(cd(1);cd(2);:::;cd(d)). The weight w(s) is calculated by w(s) = !(s) PS cd(jsj)=Cd, where !(s) is a frequency of s, given by the original PrefixSpan.</Paragraph>
      <Paragraph position="15"> + Positive/Negative support examples We first divide the support examples into positive (yi &gt; 0) and negative (yi &lt; 0) examples, and process mining independently. The result can be obtained by merging these two results.</Paragraph>
      <Paragraph position="16"> + Minimum Supports pos; neg In the original PrefixSpan, minimum support is an integer. In our mining task, we can give a real number to minimum support, since each transaction (support example Xj) has possibly non-integer frequency Cdyjfij. Minimum supports pos and neg control the rate of approximation. For the sake of convenience, we just give one parameter , and calculate pos and neg as follows  After the process of mining, a set of tuples Ohm = fhs;w(s)ig is obtained, where s is a frequent subset and w(s) is its weight. We use a TRIE to efficiently store the set Ohm. The example of such TRIE compression is shown in Figure 2. Although there are many implementations for TRIE, we use a Double-Array (Aoe, 1989) in our task. The actual classification of PKE can be examined by traversing the TRIE for all subsets s 2 Gd(X).</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 Experiments
</SectionTitle>
    <Paragraph position="0"> To demonstrate performances of PKI and PKE, we examined three NLP tasks: English BaseNP Chunking (EBC), Japanese Word Segmentation (JWS) and  tailed description of each task, training and test data, the system parameters, and feature sets are presented in the following subsections. Table 1 summarizes the detail information of support examples (e.g., size of SVs, size of feature set etc.).</Paragraph>
    <Paragraph position="1"> Our preliminary experiments show that a Quadratic Kernel performs the best in EBC, and a Cubic Kernel performs the best in JWS and JDP.</Paragraph>
    <Paragraph position="2"> The experiments using a Cubic Kernel are suitable to evaluate the effectiveness of the basket mining approach applied in the PKE, since a Cubic Kernel projects the original feature space F into F3 space, which is too large to be handled only using a naive exhaustive method.</Paragraph>
    <Paragraph position="3"> All experiments were conducted under Linux using XEON 2.4 Ghz dual processors and 3.5 Gbyte of main memory. All systems are implemented in C++.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.1 English BaseNP Chunking (EBC)
</SectionTitle>
      <Paragraph position="0"> Text Chunking is a fundamental task in NLP - dividing sentences into non-overlapping phrases. BaseNP chunking deals with a part of this task and recognizes the chunks that form noun phrases. Here is an example sentence: [He] reckons [the current account deficit] will narrow to [only $ 1.8 billion] .</Paragraph>
      <Paragraph position="1"> A BaseNP chunk is represented as sequence of words between square brackets. BaseNP chunking task is usually formulated as a simple tagging task, where we represent chunks with three types of tags: B: beginning of a chunk. I: non-initial word. O: outside of the chunk. In our experiments, we used the same settings as (Kudo and Matsumoto, 2002).</Paragraph>
      <Paragraph position="2"> We use a standard data set (Ramshaw and Marcus, 1995) consisting of sections 15-19 of the WSJ corpus as training and section 20 as testing.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.2 Japanese Word Segmentation (JWS)
</SectionTitle>
      <Paragraph position="0"> Since there are no explicit spaces between words in Japanese sentences, we must first identify the word boundaries before analyzing deep structure of a sentence. Japanese word segmentation is formalized as a simple classification task.</Paragraph>
      <Paragraph position="1"> Let s = c1c2C/C/C/cm be a sequence of Japanese characters, t = t1t2C/C/C/tm be a sequence of Japanese character types 3 associated with each character, and yi 2 f+1;!1g; (i = (1;2;:::;m!1)) be a boundary marker. If there is a boundary between ci and ci+1, yi = 1, otherwise yi = !1. The feature set of example xi is given by all characters as well as character types in some constant window (e.g., 5): fci!2;ci!1;C/C/C/;ci+2;ci+3;ti!2;ti!1;C/C/C/;ti+2;ti+3g. Note that we distinguish the relative position of each character and character type. We use the Kyoto University Corpus (Kurohashi and Nagao, 1997), 7,958 sentences in the articles on January 1st to January 7th are used as training data, and 1,246 sentences in the articles on January 9th are used as the test data.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.3 Japanese Dependency Parsing (JDP)
</SectionTitle>
      <Paragraph position="0"> The task of Japanese dependency parsing is to identify a correct dependency of each Bunsetsu (base phrase in Japanese). In previous research, we presented a state-of-the-art SVMs-based Japanese dependency parser (Kudo and Matsumoto, 2002). We combined SVMs into an efficient parsing algorithm, Cascaded Chunking Model, which parses a sentence deterministically only by deciding whether the current chunk modifies the chunk on its immediate right hand side. The input for this algorithm consists of a set of the linguistic features related to the head and modifier (e.g., word, part-of-speech, and inflections), and the output from the algorithm is either of the value +1 (dependent) or -1 (independent). We use a standard data set, which is the same corpus described in the Japanese Word Segmentation.</Paragraph>
      <Paragraph position="1"> 3Usually, in Japanese, word boundaries are highly constrained by character types, such as hiragana and katakana (both are phonetic characters in Japanese), Chinese characters, English alphabets and numbers.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>