<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-0851">
  <Title>Regularized Least-Squares Classification for Word Sense Disambiguation</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Knowledge Sources and Feature Space
</SectionTitle>
      <Paragraph position="0"> We follow the common practice (Yarowsky, 1993; Florian and Yarowsky, 2002; Lee and Ng, 2002) to represent the training instances as feature vectors. This features are derived from various knowledge sources. We used the following  knowledge sources: * Local information: - the word form of words that appear Association for Computational Linguistics for the Semantic Analysis of Text, Barcelona, Spain, July 2004 SENSEVAL-3: Third International Workshop on the Evaluation of Systems near the target word in a window of size 3 - the part-of-speech (POS) tags that appear near the target word in a window of size 3 - the lexical form of the target word - the POS tag of the target word * Broad context information: - the lemmas of all words that appear  in the provided context of target word (stop words are removed) In the case of broad context we use the bag-of-words representation with two weighting schema. Binary weighting for RLSC-LIN and term frequency weighting1 for RLSC-COMB. For stemming we used Porter stemmer (Porter, 1980) and for tagging we used Brill tagger (Brill, 1995).</Paragraph>
    </Section>
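As a concrete illustration of this feature space, the sketch below builds one feature vector from the local window of size 3 and the bag of context lemmas. It is a minimal sketch: the function name, the dictionary-based encoding, and the exact term-frequency normalization are our own assumptions rather than details given in the paper.

```python
from collections import Counter

def extract_features(tokens, pos_tags, lemmas, target_index, stop_words,
                     weighting="binary", window=3):
    """Build a feature dict for one training instance.

    tokens / pos_tags : words and POS tags of the sentence containing the target word
    lemmas            : lemmatized broad context (the whole provided context)
    weighting         : "binary" (RLSC-LIN) or "tf" (RLSC-COMB)
    """
    features = {}

    # Local information: word forms and POS tags in a +/- window around the target
    for offset in range(-window, window + 1):
        j = target_index + offset
        if offset != 0 and 0 <= j < len(tokens):
            features[f"word_{offset}={tokens[j].lower()}"] = 1.0
            features[f"pos_{offset}={pos_tags[j]}"] = 1.0

    # The target word itself and its POS tag
    features[f"target={tokens[target_index].lower()}"] = 1.0
    features[f"target_pos={pos_tags[target_index]}"] = 1.0

    # Broad context: bag of lemmas, stop words removed
    counts = Counter(lemma for lemma in lemmas if lemma not in stop_words)
    total = sum(counts.values()) or 1
    for lemma, count in counts.items():
        if weighting == "binary":
            features[f"bow={lemma}"] = 1.0
        else:
            # term frequency weighting; normalizing by context length is an assumption
            features[f"bow={lemma}"] = count / total
    return features
```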
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 RLSC
</SectionTitle>
    <Paragraph position="0"> RLSC (Rifkin, 2002; Poggio and Smale, 2003) is a learning method that obtains solutions for binary classification problems via Tikhonov regularization in a Reproducing Kernel Hilbert Space using the square loss.</Paragraph>
    <Paragraph position="1"> Let S = (x1,y1), ..., (xn,yn) be a training sample with xi [?] Rd and yi [?] {[?]1,1} for all i. The hypothesis space H of RLSC is the set of functions f : Rd - R of the form:</Paragraph>
    <Paragraph position="3"> with ci [?] R for all i and k : Rd x Rd - R a kernel function (a symmetric positive definite function) that measures the similarity between two instances.</Paragraph>
    <Paragraph position="4"> RLSC tries to find a function from this hypothesis space that simultaneously has small empirical error and small norm in Reproducing Kernel Hilbert Space generated by kernel k. The resulting minimization problem is:</Paragraph>
    <Paragraph position="6"> In spite of the complex mathematical tools used, the resulted learning algorithm is a very</Paragraph>
    <Paragraph position="8"> The sign(f(x)) will be interpreted as the predicted label ([?]1 or +1) to be assigned to instance x, and the magnitude |f(x) |as the confidence in this prediction.</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Applying RLSC to Word Sense Disambiguation
</SectionTitle>
    <Paragraph position="0"> Disambiguation To apply the RLSC learning method we must take care about some details.</Paragraph>
    <Paragraph position="1"> First, RLSC produces a binary classifier and word sense disambiguation is a multi-class classification problem. There are a lot of approaches for combining binary classifiers to solve multi-class problems. We used one-vs-all scheme. We trained a different binary classifier for each sense. For a word with m senses we train m different binary classifiers, each one being trained to distinguish the examples in a single class from the examples in all remaining classes. When a new example had to be classified, the m classifiers are run, and the classifier with the highest confidence, which outputs the largest (most positive) value, is chosen. If more than one such classifiers exists, than from the senses output by these classifiers we chose the one that appears most frequently in the training set. One advantage of one-vs-all combining scheme is the fact that it exploits the confidence (real value) of classifiers producedby RLSC.For more arguments infavor of one-vs-all see (Rifkin and Klautau, 2004).</Paragraph>
    <Paragraph position="2"> Second, RLSC needs a kernel function.</Paragraph>
    <Paragraph position="3"> Preliminary experiments with Senseval-1 and Senseval-2 data show us that the best performance is obtained by linear kernel. This observation agrees with the Lee and Ng results (Lee and Ng, 2002), that in the case of SVM also have obtained the best performance with linear kernel. One-vs-all combining scheme requires comparison of confidences output by different classifiers, and for an unbiased comparison the real values produced by classifiers corresponding to different senses of the target word must be on the same scale. To achieve this goal we need a normalized version of linear kernel.</Paragraph>
    <Paragraph position="4"> Our first system RLSC-LINused the follow-</Paragraph>
    <Paragraph position="6"> where x and y are two instances (feature vectors), &lt; *,* &gt; is the dot product on Rd and bardbl*bardbl is the L2 norm on Rd.</Paragraph>
    <Paragraph position="7"> In the case of RLSC-LIN we used a binary weighting scheme for coding broad context. In the RLSC-COMB we tried to obtain more information from broad context and we used a term frequency weighting scheme. Now, the feature vectors will have apart form 0 two kind of values: 1 for features that encode local information and much small values (of order of 10[?]2) for features encoding broad context. A simple linear kernel will not work in this case because its value will be dominated by the similarity of local contexts. To solve this problem we split the kernel in two parts:</Paragraph>
    <Paragraph position="9"> where kl is a linear normalized kernel that uses only the components of the feature vectors that encode local information (and have 0/1 values) and kb is a normalized kernel that uses only the components of the feature vectors that encode broad context.</Paragraph>
    <Paragraph position="10"> The last detail concerning application of RLSC is the value of regularization parameter l. Experimenting on Senseval-1 and Senseval-2 data sets we establish that small values of l achieve best performance. In all reported results we used l = 10[?]9.</Paragraph>
    <Paragraph position="11"> The results2 of RLSC-LIN and RLSC- null Because RLSC has many points in common with the well-known Support Vector Machine (SVM), we list in Table 2 for comparison the results obtained by SVM with the same kernels.  The results are competitive with the state of the art results reported until now. For example the best two results reported until now on Senseval-2 data are 0.654 (Lee and Ng, 2002) obtained with SVM and 0.665 (Florian and Yarowsky, 2002) obtained by classifiers combination. null The results are especially good if we take into account the fact that our systems do not use syntactic information3 while the others do. Lee and Ng (Lee and Ng, 2002) report a fine-grained score for SVM of only 0.648 if they do not use syntactic knowledge source.</Paragraph>
    <Paragraph position="12"> These results encourage us to participate with RLSC-LIN and RLSC-COMB to the Senseval-3 competition.</Paragraph>
  </Section>
class="xml-element"></Paper>