<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-0422">
  <Title>Learning a Perceptron-Based Named Entity Chunker via Online Recognition Feedback</Title>
  <Section position="2" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Named-Entity Phrase Chunking
</SectionTitle>
    <Paragraph position="0"> In this section we describe our NERC approach as a phrase chunking problem. First we formalize the problem of NERC, then we propose a NE-Chunker.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1 Problem Formalization
</SectionTitle>
      <Paragraph position="0"> Let x be a sentence belonging to the sentence space X, formed by n words xi with i ranging from 0 to n[?]1. Let K be the set of NE categories, which in the CoNLL-2003 setting is K = {LOC, PER, ORG, MISC}.</Paragraph>
      <Paragraph position="1"> A NE phrase, denoted as (s,e)k, is a phrase spanning from word xs to word xe, having s [?] e, with category k [?] K. Let NE be the set of all potential NE phrases, expressed as NE = {(s,e)k  |0 [?] s [?] e,k [?] K} .</Paragraph>
      <Paragraph position="2"> We say that two different NE phrases ne1 = (s1,e1)k1 and ne2 = (s2,e2)k2 overlap, denoted as ne1[?]ne2 iff e1 [?] s2 [?] e2 [?] s1. A solution for the NERC problem is a set y formed by NE phrases that do not overlap, also known as a chunking. We define the set Y as the set of all possible chunkings. Formally, it can be expressed as:</Paragraph>
      <Paragraph position="4"> The goal of the NE extraction problem is to identify the correct solution y [?] Y for a given sentence x.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.2 NE-Chunker
</SectionTitle>
      <Paragraph position="0"> The NE-Chunker is a function which given a sentence</Paragraph>
      <Paragraph position="2"> The NE-Chunker recognizes NE phrases in two layers of processing. In the first layer, a set of NE candidates for a sentence is identified, out of all the potential phrases in NE. To do so, we apply learning at word level in order to perform a Begin-Inside classification. That is, we assume a function hB(w) which decides whether a word w begins a NE phrase or not, and a function hI(w) which decides whether a word is inside a NE phrase or not. Furthermore, we define the predicate BI[?], which tests whether a certain phrase is formed by a starting begin word and subsequent inside words. Formally, BI[?]((s,e)k) = (hB(s) [?] [?]i : s &lt; i [?] e : hI(i)). The recognition will only consider solutions formed by phases in NE which satisfy the BI[?] predicate. Thus, this layer is used to filter out candidates from NE and consequently reduce the size of the solution space Y. Formally, the solution space that is explored can be expressed as</Paragraph>
      <Paragraph position="4"> The second layer selects the best coherent set of NE phrases by applying learning at phrase level. We assume a number of scoring functions, which given a NE phrase produce a real-valued score indicating the plausibility of the phrase. In particular, for each category k [?] K we assume a function scorek which produces a positive score if the phrase is likely to belong to category k, and a negative score otherwise.</Paragraph>
      <Paragraph position="5"> Given this, the NE-Chunker is a function which searches a NE chunking for a sentence x according to the following optimality criterion:</Paragraph>
      <Paragraph position="7"> That is, among the considered chunkings of the sentence, the optimal one is defined to be the one whose NE phrases maximize the summation of phrase scores.</Paragraph>
      <Paragraph position="8"> Practically, there is no need to explicitly enumerate each possible chunking in YBI[?]. Instead, by using dynamic programming the optimal chunking can be found in quadratic time over the sentence length, performing a Viterby-style exploration from left to right (Punyakanok and Roth, 2001).</Paragraph>
      <Paragraph position="9"> Summarizing, the NE-Chunker recognizes the set of NE phrases of a sentence as follows: First, NE candidates are identified in linear time, applying a linear number of decisions. Then, the optimal coherent set of NE phrases is selected in quadratic time, applying a quadratic number of decisions.</Paragraph>
    </Section>
  </Section>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Learning via Recognition Feedback
</SectionTitle>
    <Paragraph position="0"> We now present an online learning strategy for training the learning components of the NE-Chunker, namely the functions hB and hI and the functions scorek, for k [?] K.</Paragraph>
    <Paragraph position="1"> Each function is implemented using a perceptron1 and a representation function.</Paragraph>
    <Paragraph position="2"> A perceptron is a linear discriminant function h-w : Rn - R parametrized by a weight vector -w in Rn.</Paragraph>
    <Paragraph position="3"> Given an instance -x [?] Rn, a perceptron outputs as prediction the inner product between vectors -x and -w,</Paragraph>
    <Paragraph position="5"> 1Actually, we use a variant of the model called the voted perceptron, explained below.</Paragraph>
    <Paragraph position="6"> The representation function Ph : X - Rn codifies an instance x belonging to some space X into a vector inRn with which the perceptron can operate.</Paragraph>
    <Paragraph position="7"> The functions hB and hI predict whether a word begins or is inside a NE phrase, respectively. Each one consists of a perceptron weight vector, -wB and -wI, and a shared representation function Phw, explained in section 4. Each function is computed as hl = -wl *Phw(x), for l [?] {B,I}, and the sign is taken as the binary classification.</Paragraph>
    <Paragraph position="8"> The functions scorek, for k [?] K, compute a score for a phrase (s,e) being a NE phrase of category k. For each function there is a vector -wk, and a shared representation function Php, also explained in section 4. The score is given by the expression scorek(s,e) = -wk *Php(s,e).</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Learning Algorithm
</SectionTitle>
      <Paragraph position="0"> We propose a mistake-driven online learning algorithm for training the parameter vectors -w of each perceptron all in one go. The algorithm starts with all vectors initialized to -0, and then runs repeatedly in a number of epochs T through all the sentences in the training set. Given a sentence, it predicts its optimal chunking as specified above using the current vectors. If the predicted chunking is not perfect the vectors which are responsible of the incorrect predictions are updated additively.</Paragraph>
      <Paragraph position="1"> The sentence-based learning algorithm is as follows:  * Input: {(x1,y1),...,(xm,ym)}.</Paragraph>
      <Paragraph position="2"> * Define: W = {-wB, -wI}[?]{-wk|k [?] K}.</Paragraph>
      <Paragraph position="3"> * Initialize: [?]-w [?] W -w = -0; * for t = 1...T , for i = 1...m : 1. ^y = NEchW(xi) 2. learning feedback(W,xi,yi, ^y) * Output: the vectors in W.</Paragraph>
      <Paragraph position="4">  We now describe the learning feedback. Let y[?] be the gold set of NE phrases for a sentence x, and ^y the set predicted by the NE-Chunker. Let goldB(i) and goldI(i) be respectively the perfect indicator functions for the begin and inside classifications, that is, they return 1 if word xi begins or is inside some phrase in y[?] and 0 otherwise. We differentiate three kinds of phrases in order to give feedback to the functions being learned:  * Phrases correctly identified: [?](s,e)k [?] y[?] [?] ^y: - Do nothing, since they are correct.</Paragraph>
      <Paragraph position="5"> * Missed phrases: [?](s,e)k [?] y[?] \ ^y: 1. Update begin word, if misclassified:</Paragraph>
      <Paragraph position="7"> 2. Update misclassified inside words: [?]i : s &lt; i [?] e : such that ( -wI *Phw(xi) [?] 0) -wI = -wI + Phw(xi) 3. Update score function, if it has been applied: if ( -wB *Phw(xs) &gt; 0 [?] [?]i : s &lt; i [?] e : -wI *Phw(xi) &gt; 0) then -wk = -wk + Php(s,e) * Over-predicted phrases: [?](s,e)k [?] ^y \y[?]: 1. Update score function: -wk = -wk [?]Php(s,e) 2. Update begin word, if misclassified :</Paragraph>
      <Paragraph position="9"> This feedback models the interaction between the two layers of the recognition process. The Begin-Inside identification filters out phrase candidates for the scoring layer. Thus, misclassifying words of a correct phrase blocks the generation of the candidate and produces a missed phrase. Therefore, we move the begin or end prediction vectors toward the misclassified words of a missed phrase. When an incorrect phrase is predicted, we move away the prediction vectors of the begin and inside words, provided that they are not in the beginning or inside a phrase in the gold chunking. Note that we deliberately do not care about false positives begin or inside words which do not finally over-produce a phrase.</Paragraph>
      <Paragraph position="10"> Regarding the scoring layer, each category prediction vector is moved toward missed phrases and moved away from over-predicted phrases.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Voted Perceptron and Kernelization
</SectionTitle>
      <Paragraph position="0"> Although the analysis above concerns the perceptron algorithm, we use a modified version, the voted perceptron algorithm, introduced in (Freund and Schapire, 1999).</Paragraph>
      <Paragraph position="1"> The key point of the voted version is that, while training, it stores information in order to make better predictions on test data. Specifically, all the prediction vectors -wj generated after every mistake are stored, together with a weight cj, which corresponds to the number of decisions the vector -wj survives until the next mistake.</Paragraph>
      <Paragraph position="2"> Let J be the number of vector that a perceptron accumulates. The final hypothesis is an averaged vote over the predictions of each vector, computed with the expression h-w(-x) =summationtextJj=1 cj( -wj * -x) .</Paragraph>
      <Paragraph position="3"> Moreover, we work with the dual formulation of the vectors, which allows the use of kernel functions. It is shown in (Freund and Schapire, 1999) that a vector w can be expressed as the sum of instances xj that were added (sxj = +1) or subtracted (sxj = [?]1) in order to create it, as w = summationtextJj=1 sxjxj. Given a kernel function K(x,xprime), the final expression of a dual voted perceptron becomes:</Paragraph>
      <Paragraph position="5"/>
    </Section>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Feature-Vector Representation
</SectionTitle>
    <Paragraph position="0"> In this section we describe the representation functions Phw and Php, which respectively map a word or a phrase and their local context into a feature vector in Rn, particularly, {0,1}n. First, we define a set of predicates which are computed on words and return one or more values: * Form(w), PoS(w): The form and PoS of word w.</Paragraph>
    <Paragraph position="1"> * Orthographic(w): Binary flags of word w with regard to how is it capitalized (initial-caps, all-caps), the kind of characters that form the word (containsdigits, all-digits, alphanumeric, Roman-number), the presence of punctuation marks (containsdots, contains-hyphen, acronym), single character patterns (lonely-initial, punctuation-mark, singlechar), or the membership of the word to a predefined class (functional-word2), or pattern (URL).</Paragraph>
    <Paragraph position="2">  * Affixes(w): The prefixes and suffixes of the word w (up to 4 characters).</Paragraph>
    <Paragraph position="3"> * Word Type Patterns(ws ...we): Type pattern of  consecutive words ws ...we. The type of a word is either functional (f), capitalized (C), lowercased (l), punctuation mark (.), quote (') or other (x).</Paragraph>
    <Paragraph position="4"> For instance, the word type pattern for the phrase &amp;quot;John Smith payed 3 euros&amp;quot; would be CClxl. For the function Phw(xi) we compute the predicates in a window of words around xi, that is, words xi+l with l [?] [[?]Lw,+Lw]. Each predicate label, together with each relative position l and each returned value forms a final binary indicator feature. The word type patterns are evaluated in all sequences within the window which include the central word i.</Paragraph>
    <Paragraph position="5"> For the function Php(s,e) we represent the context of the phrase by evaluating a [[?]Lp,0] window of predicates at the s word and a separate [0,+Lp] window at the e word. At the s window, we also codify the named entities already recognized at the left context, capturing their category and relative position. Furthermore, we represent the (s,e) phrase by evaluating the predicates without capturing the relative position in the features. In particular, 2Functional words are determiners and prepositions which typically appear inside NEs.</Paragraph>
    <Paragraph position="6"> for the words within (s,e) we evaluate the form, affixes and type patterns of sizes 2, 3 and 4. We also evaluate the complete concatenated form of the phrase and the word type pattern spanning the whole phrase. Finally, we make use of a gazetteer to capture possible NE categories of the whole NE form and each single word within it.</Paragraph>
  </Section>
class="xml-element"></Paper>