<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-0831">
  <Title>Finding optimal parameter settings for high performance word sense disambiguation</Title>
  <Section position="2" start_page="0" end_page="0" type="metho">
    <SectionTitle>
1 Introduction. RLSC algorithm, input and output
</SectionTitle>
    <Paragraph position="1"> This paper is not self-contained. The reader should read first the paper of Marius Popescu (Popescu, 2004), paper that contains the full description the base algorithm, Regularized Least Square Classification (RLSC) applied to WSD.</Paragraph>
    <Paragraph position="2"> Our systems used the feature extraction described in (Popescu, 2004), with some differences.</Paragraph>
    <Paragraph position="3"> Let us fix a word that is on the list of words we must be able to disambiguate. Let a0 be the number of possible senses of this word .</Paragraph>
    <Paragraph position="4"> Each instance of the WSD problem for this fixed word is represented as an array of binary values (features), divided by its Euclidian norm. The number of input features is different from one word to another. The desired output for that array is another binary array, having the length a0 .</Paragraph>
    <Paragraph position="5"> After the feature extraction, the WSD problem is regarded as a linear regression problem. The equation of the regression is a1a3a2a5a4a7a6 where each of the lines of the matrix a1 is an example and each line of a6 is an array of length a0 containing a0a9a8a11a10 zeros and a single a10 . The output a12a13a2 of the trained model a2 on some particular input a12 is an array of values that ideally are just 0 or 1. Actually those values are never exactly 0 and 1, so we are prepared to consider them as an &amp;quot;activation&amp;quot; of the sense recognizers and consider that the most &amp;quot;activated&amp;quot; (the sense with highest value) wins and gives the sense we decide on. In other words, we consider the a12a13a2 values an approximation of the true probabilities associated with each sense.</Paragraph>
    <Paragraph position="6"> The RLSC solution to this linear regression problem is a2a14a4a15a1a3a16a18a17a19a1a20a1a3a16a22a21a24a23a26a25a28a27a30a29a32a31a34a33a32a6 ; The first difference between our system and Marius Popescu's RLSC-LIN is that two of the systems (HTSA3 and HTSA4) use supplementary features, obtained by multiplying up to three of the existing features, because they improved the accuracy on Senseval-2.</Paragraph>
    <Paragraph position="7"> Another difference is that the targets a6 have values 0 and 1, while in the Marius Popescu's RLSC-LIN the targets have values -1 and 1. We see the output values of the trained model as approximations of the true probabilities of the senses.</Paragraph>
    <Paragraph position="8"> The main difference is the postprocessing we apply after obtaining a2 . It is explained below.</Paragraph>
  </Section>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Adding parameters
</SectionTitle>
    <Paragraph position="0"> The obviously single parameter of the RLSC is a23 .</Paragraph>
    <Paragraph position="1"> Some improvement can be obtained using larger a23 values. After dropping the parser information from features (when it became clear that we won't have those for Senseval-3) the improvements proved to be too small. Therefore we fixed a23a35a4 a10a28a36 a31a13a37 . During the tests we performed it has been observed that normalizing the models for each sense (the columns of a2 ) - that is dividing them by their Euclidian norm - gives better results, at least on Senseval-2 and don't give too bad results on Senseval-1 either. When you have a yes/no parameter like this one (that is normalizing or not the columns of a2 ), you don't have too much room for fine tuning. After some experimentation we decided that the most promising way to convert this new discrete parameter to a continuous one was to consider that in both cases it was a division by a38a2a20a39a32a38a40 , where</Paragraph>
    <Paragraph position="3"> a10 when we normalize the model columns.</Paragraph>
    <Paragraph position="4"> 3 Choosing the best value of the parameters This is the procedure that has been employed to tune the parameter a41 until the recognition rate achieved the best levels on SENSEVAL-1 and 2 data.</Paragraph>
    <Paragraph position="5">  1. preprocess the input data - obtain the features Association for Computational Linguistics for the Semantic Analysis of Text, Barcelona, Spain, July 2004 SENSEVAL-3: Third International Workshop on the Evaluation of Systems 2. compute a2 a4a15a1a20a16a18a17a19a1a20a1a3a16a22a21 a23a26a25a28a27 a29a32a31a34a33a32a6 3. for each a41 from 0 to 1 with step 0.1 4. test the model (using a41 in the post-processing phase and then the scoring python script)  At this point we were worried by the lack of any explanation (and therefore the lack of any guarantee about performance on SENSEVAL-3). After some thinking on the strengths and weaknesses of RLSC it became apparent that RLSC implicitly incorporates a Bayesian style reasoning. That is, the senses most frequent in the training data lead to higher norm models, having thus a higher aposteriori probability. Experimental evidence was obtained by plotting graphs with the sense frequencies near graphs with the norms of the model's columns. If you consider this, then the correction done was more or less dividing implicitly by the empiric frequency of the senses in the training data. So, we switched to dividing the columns a2a20a39 by the observed frequency a1 a39 of the a2 -th sense instead of the norm a38a2 a39a18a38 . This lead to an improvement on SENSEVAL2, so this is our base system HTSA1: Test procedure for HTSA1:  1. Postprocessing: correct for a2 =1..a0 the model</Paragraph>
    <Paragraph position="7"> 2. Compute the output a5 a4 a12a13a2 for the input a12 3. Find the maximum component of a5 . Its posi null tion is the label returned by the algorithm for the the input a12 .</Paragraph>
    <Paragraph position="8"> Please observe that, because of the linearity, the correction can be performed on a5 instead of a2 , just after the step 2 : a5 a39 a4a6a5a30a39a4a3  a39 . For this reason we call this correction &amp;quot;postprocessing&amp;quot;.</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Description of the systems. Performance
</SectionTitle>
    <Paragraph position="1"> Here is a very short description of our systems. It describes what they have in common and what is different, as well which is their performance level (recognition rate).</Paragraph>
    <Paragraph position="2"> There are four flavors of the same algorithm, based on RLSC. They differ by the preprocessing and by the postprocessing done (name and explanation is under each graphic).</Paragraph>
    <Paragraph position="3">  HTSA1: implicit correction of the frequencies, by dividing the output confidences of the senses by the a1a8a7a10a9a12a11a14a13a15a9a17a16a19a18 a5 a40 ; The graphic shows how the recognition rate depends on a41 on SENSEVAL-1.  HTSA2: explicit correction of the frequencies, by multiplying the output confidences by a certain decreasing function of frequency, that tries to approximate the effect of the postprocessing done by HTSA1; here the performance on SENSEVAL-2 as  HTSA3: like HTSA1, with a preprocessing that adds supplementary features by multiplying some of the existing ones; here the performance on SENSEVAL-2 as a function of a41 .</Paragraph>
    <Paragraph position="4"> The supplementary features added to HTSA3 and HTSA4 are all products of two and three local context features. This was meant to supply the linear regression with some nonlinear terms, giving thus the algorithm the possibility to use conjunctions.  Was our best result lucky? Here is the performance graph of HTSA3 on SENSEVAL-3 as a function of a41 . As we can see, any a41 between 0.2 and 0.3 would have given accuracies between a0a2a1a4a3 a5a7a6 and  was not such a good choice for SENSEVAL-3. Instead, a41 a4 a36 a3 a9a11a10 would have achieved a recognition rate of a0a2a1a4a3 a8a7a6 . In other words, the best value of a41 on SENSEVAL-2 is not necessary the best on SENSEVAL-3. The next section discusses alternative ways of &amp;quot;guessing&amp;quot; the best values of the parameters, as well as why they won't work in this case.</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 Cross-validation. Possible explanations of the results
</SectionTitle>
    <Paragraph position="0"> The common idea of HTSA1, 2, 3 and 4 is that a slight departure from the Bayesian a priori frequencies improves the accuracy. This is done here by postprocessing and works with any method that produces probabilities/credibilities for all word senses. The degree of departure from the a priori frequencies can be varied; it has been tuned on the Senseval-1 and Senseval-2 data until the optimum value of $\alpha$ was determined.</Paragraph>
    <Paragraph position="1"> Of course, there was still no guarantee on how good will be the performance on SENSEVAL-3.</Paragraph>
    <Paragraph position="2"> The natural idea is to apply cross-validation to determine the best a41 using the current training set. We tried that, but a very strange thing could be observed. On both SENSEVAL-1 and SENSEVAL-2 the cross-validation indicated that values of a41 around a36 should have been better than a36 a3 a1 . We see this as an indication that the distribution of frequencies on the test set does not fully match with the one of the train set. This could be an explanation about why it is better to depart from the Bayesian style and to go toward the maximum verosimility method. We think that this is exactly what we did.</Paragraph>
    <Paragraph position="3"> Initially we only had HTSA1 and HTSA3. By looking at the graph of the correction done by dividing by a1a8a7a10a9a12a11a14a13a15a9a17a16a19a18 a5a1a0a3a2a4 , reproduced below in red, we observed that it tends to give more chances to the weakly represented senses. To test this hypothesis we built an explicit correction, piecewise linear, also reproduced below on the same graphic. Thus we have obtained HTSA2 and HTSA4. In their case, a41 is the position of the joining point. Those performed close to HTSA1 and HTSA3, so we have experimental evidence that increasing the apriori probabilities of the lower frequency senses gives better recognition rates.</Paragraph>
  </Section>
class="xml-element"></Paper>