<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-2603">
  <Title>Decomposition Kernels for Natural Language Processing</Title>
  <Section position="3" start_page="17" end_page="17" type="metho">
    <SectionTitle>
2 Decomposition Kernels
</SectionTitle>
    <Paragraph position="0"> An R-decomposition structure (Haussler, 1999; Shawe-Taylor and Cristianini, 2004) on a set X is a triple R = &lt; vectorX,R,vectork&gt; where vectorX = (X1,...,XD) is a D-tuple of non-empty subsets of X, R is a finite relation on X1 x *** x XD x X, and vectork = (k1,...,kD) is a D-tuple of positive definite kernel functions kd : Xd xXd mapsto- IR. R(vectorx,x) is true iff vectorx is a tuple of &amp;quot;parts&amp;quot; for x -- i.e. vectorx is a decomposition of x. Note that this definition of &amp;quot;parts&amp;quot; is very general and does not require the parthood relation to obey any specific mereological axioms, such as those that will be introduced in Section 6. For any x [?] X, let</Paragraph>
    <Paragraph position="2"> note the multiset of all possible decompositions1 of x. A decomposition kernel is then defined as the multiset kernel between the decompositions:</Paragraph>
    <Paragraph position="4"> taking all the contiguous fixed-length substrings or all the possible ways of dividing a string into two contiguous substrings. null where, as an alternative way of combining the kernels, we can use the product instead of a summation: intuitively this increases the feature space dimension and makes the similarity measure more selective. Since decomposition kernels form a rather vast class, the relation R needs to be carefully tuned to different applications in order to characterize a suitable kernel. As discussed in the Introduction, however, taking all possible subpartsintoaccountmayleadtopoorpredictivitybe- null cause of the combinatorial explosion of the feature space.</Paragraph>
  </Section>
  <Section position="4" start_page="17" end_page="18" type="metho">
    <SectionTitle>
3 Weighted Decomposition Kernels
</SectionTitle>
    <Paragraph position="0"> A weighted decomposition kernel (WDK) is characterized by the following decomposition structure: null</Paragraph>
    <Paragraph position="2"> subparts of x called the contexts of s in x. Precise definitions of s and vectorz are domain-dependent. For examplein(Menchettietal., 2005)wepresenttwo formulations, one for comparing whole sequences (where both the selector and the context are subsequences), and one for comparing attributed graphs (where the selector is a single vertex and the context is the subgraph reachable from the selector within a short path). The definition is completed by introducing a kernel on selectors and a kernel on contexts. The former can be chosen to be the exact matching kernel, d, on S x S, defined as d(s,sprime) = 1 if s = sprime and d(s,sprime) = 0 otherwise. The latter, kd, is a kernel on Zd x Zd and provides a soft similarity measure based on attribute frequencies. Several options are available for contextkernels,includingthediscreteversionofprob- null ability product kernels (PPK) (Jebara et al., 2004) and histogram intersection kernels (HIK) (Odone et al., 2005). Assuming there are n categorical attributes, each taking on mi distinct values, the context kernel can be defined as:</Paragraph>
    <Paragraph position="4"> pi(j) the observed frequency of value j in z. Then  ki can be defined as a HIK or a PPK respectively:</Paragraph>
    <Paragraph position="6"> This setting results in the following general form of the kernel:</Paragraph>
    <Paragraph position="8"> where we can replace the summation of kernels</Paragraph>
    <Paragraph position="10"> ber of substructures, the above function weights different matches between selectors according to contextual information. The kernel can be afterwards normalized in [[?]1,1] to prevent similarity to be boosted by the mere size of the structures being compared.</Paragraph>
  </Section>
  <Section position="5" start_page="18" end_page="19" type="metho">
    <SectionTitle>
4 WDK for sequence labeling and applications to NER
</SectionTitle>
    <Paragraph position="0"> applications to NER In a sequence labeling task we want to map input sequences to output sequences, or, more precisely, we want to map each element of an input sequence that takes label from a source alphabet to an element with label in a destination alphabet.</Paragraph>
    <Paragraph position="1"> Here we cast the sequence labeling task into position specific classification, where different sequence positions give independent examples. This is different from previous approaches in the literature where the sequence labeling problem is solved by searching in the output space (Tsochantaridis et al., 2004; Daum'e III and Marcu, 2005). Although the method lacks the potential for collectively labeling all positions simultaneously, it results in a much more efficient algorithm.</Paragraph>
    <Paragraph position="2"> In the remainder of the section we introduce a specialized version of the weighted decomposition kernel suitable for a sequence transduction task originating in the natural language processing domain: the named entity recognition (NER) problem, where we map sentences to sequences of a reduced number of named entities (see Sec.4.1).</Paragraph>
    <Paragraph position="3"> More formally, given a finite dictionary S of words and an input sentence x [?] S[?], our input objects are pairs of sentences and indices r = (x,t)  where r [?] S[?] x IN. Given a sentence x, two integers b [?] 1 and b [?] e [?] |x|, let x[b] denote the word at position b and x[b..e] the sub-sequence of x spanning positions from b to e. Finally we will denote by x(x[b]) a word attribute such as a morphological trait (is a number or has capital initial, see 4.1) for the word in sentence x at position b.</Paragraph>
    <Paragraph position="4"> We introduce two versions of WDK: one with four context types (D = 4) and one with increased contextual information (D = 6) (see Eq. 5). The relation R depends on two integers t and i [?] {1,...,|x|}, where t indicates the position of the word we want to classify and i the position of a generic word in the sentence. The relation for the first kernel version is defined as:  R = {(s,zLL,zLR,zRL,zRR,r)} such that the selector s = x[i] is the word at position i, the zLL (LeftLeft) part is a sequence defined as x[1..i] if i &lt; t or the null sequence e otherwise and the zLR (LeftRight) part is the sequence x[i + 1..t] if i &lt; t or e otherwise. Informally, zLL is the initial  portion of the sentence up to word of position i, and zLR is the portion of the sentence from word at position i + 1 up to t (see Fig. 1). Note that when we are dealing with a word that lies to the left of the target word t, its zRL and zRR parts are empty. Symmetrical definitions hold for zRL and zRR when i &gt; t. We define the weighted decomposition kernel for sequences as</Paragraph>
    <Paragraph position="6"> where dx(s,sprime) = 1 if x(s) = x(sprime) and 0 otherwise (that is dx checks whether the two selector words have the same morphological trait) and k is Eq. 2 with only one attribute which then boils downtoEq.3orEq.4, thatisakerneloverthehistogram for word occurrences over a specific part. Intuitively, when applied to word sequences, this kernel considers separately words to the left  of the entry we want to transduce and those to its right. The kernel computes the similarity for each sub-sequence by matching the corresponding bag of enriched words: each word is matched only with words that have the same trait as extracted by x and the match is then weighted proportionally to the frequency count of identical words preceding and following it.</Paragraph>
    <Paragraph position="7"> The kernel version with D=6 adds two parts called zLO (LeftOther) and zRO (RightOther) defined as x[t+1..|r|] and x[1..t] respectively; these</Paragraph>
    <Paragraph position="9"> zLL *zLR *zLO and x = zRL *zRR *zRO.</Paragraph>
    <Paragraph position="10"> Note that the WDK transforms the sentence in a bag of enriched words computed in a pre-processing phase thus achieving a significant reductionincomputationalcomplexity(comparedto null the recursive procedure in (Lodhi et al., 2002)).</Paragraph>
    <Section position="1" start_page="19" end_page="19" type="sub_section">
      <SectionTitle>
4.1 Named Entity Recognition Experimental
Results
</SectionTitle>
      <Paragraph position="0"> Named entities are phrases that contain the names of persons, organizations, locations, times and quantities. For example in the following sentence: [PER Wolff ] , currently a journalist in [LOC Argentina ] , played with [PER Del Bosque ] in the final years of the seventies in [ORG Real Madrid].</Paragraph>
      <Paragraph position="1"> we are interested in predicting that Wolff and Del Bosque are people's names, that Argentina is a name of a location and that Real Madrid is a name of an organization.</Paragraph>
      <Paragraph position="2"> The chosen dataset is provided by the shared task of CoNLL-2002 (Saunders et al., 2002) which concerns language-independent named entity recognition. There are four types of phrases: person names (PER), organizations (ORG), locations (LOC) and miscellaneous names (MISC), combined with two tags, B to denote the first item ofaphraseandIforanynon-initialword; allother phrases are classified as (OTHER). Of the two available languages (Spanish and Dutch), we run experimentsonlyontheSpanishdatasetwhichisa collection of news wire articles made available by the Spanish EFE News Agency. We select a sub-set of 300 sentences for training and we evaluate the performance on test set. For each category, we evaluate the Fb=1 measure of 4 versions of WDK: word histograms are matched using HIK (Eq. 3) and the kernels on various parts (zLL,zLR,etc) are combined with a summation SUMHIK or product PROHIK; alternatively the histograms are com- null bined with a PPK (Eq. 4) obtaining SUMPPK, PROPPK.</Paragraph>
      <Paragraph position="3"> The word attribute considered for the selector is a word morphologic trait that classifies a word in one of five possible categories: normal word, number, all capital letters, only capital initial and contains non alphabetic characters, while the context histograms are computed counting the exact word frequencies.</Paragraph>
      <Paragraph position="4"> Results reported in Tab. 1 and Tab. 2 show that performance is mildly affected by the different  choicesonhowtocombineinformationonthevariouscontexts, thoughitseemsclearthatincreasing contextual information has a positive influence.</Paragraph>
      <Paragraph position="5"> Note that interesting preliminary results can be obtained even without the use of any refined language knowledge, such as part of speech tagging or shallow/deep parsing.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>